AI Inference: Good, Fast, and Cheap, with Lin Qiao & Dmytro Ivchenko of Fireworks AI


In this episode, we delve into the intricate world of AI inference with the cofounders of Fireworks AI. Discover the strategies behind optimizing AI performance, the importance of balancing latency and throughput, and the nuances of different AI architectures from GPT-3 to Stable Diffusion. Learn about their partnership with Stability AI, their unique focus on reducing total cost of ownership, and their vision for a seamless developer experience.

RECOMMENDED PODCAST:
How Do You Use ChatGPT with Dan Shipper via @EveryInc
Dan Shipper talks to programmers, writers, founders, academics, tech executives, and others to walk through all of their ChatGPT use cases (including Nathan!). They even use ChatGPT together, live on the show. Listen to How Do You Use ChatGPT? from Dan Shipper and the team at Every, wherever you get your podcasts: https://link.chtbl.com/hdyucha...

SPONSORS:
The Brave Search API can be used to assemble a data set to train your AI models and help with retrieval augmentation at the time of inference, all while remaining affordable with developer-first pricing. Integrating the Brave Search API into your workflow translates to more ethical data sourcing and more human-representative data sets. Try the Brave Search API for free for up to 2,000 queries per month at https://bit.ly/BraveTCR

Omneky is an omnichannel creative generation platform that lets you launch hundreds of thousands of ad iterations that actually work, customized across all platforms, with a click of a button. Omneky combines generative AI and real-time advertising data. Mention "Cog Rev" for 10% off https://www.omneky.com/

Plumb is a no-code AI app builder designed for product teams who care about quality and speed. What is taking you weeks to hand-code today can be done confidently in hours. Check out https://bit.ly/PlumbTCR for early access.

Head to Squad to access global engineering without the headache and at a fraction of the cost: head to https://choosesquad.com/ and mention “Turpentine” to skip the waitlist.


CHAPTERS:
(00:00:00) Introduction
(00:08:34) Compute Stack
(00:19:23) Fireworks Product Philosophy
(00:24:11) Sponsors: Brave / Omneky
(00:25:40) Fine-tuning Strategy
(00:38:40) Sponsors: Plumb / Squad
(00:41:37) NVIDIA Stack Overview
(00:47:14) TensorRT / Triton Service
(00:55:25) Reduced Precision Advantages
(01:03:57) Different Deployment Scenarios
(01:08:27) Seeking Intuition on Sharding
(01:28:28) Announcing Stability AI Partnership
(01:32:00) Closing Remarks


Full Transcript


Lin Qiao: (0:00) That is a complexity for application and product developers. Whether they're doing fun stuff themselves or in enterprises, they're all facing this challenge. So that's where we come in and say: don't worry about it. We handle it all for you, so you just focus on your product and application development.

Dmytro Ivchenko: (0:19) You can't just directly apply the techniques we learned from text models on the image model, because it has quality implications. You need to do some extra work to make sure that the quality is not regressing. So that is quite a bit different.

Lin Qiao: (0:34) Over time, all these database management systems become smarter and smarter, because they all have a layer called the optimizer. The optimizer observes the workload and starts to say: oh, you're doing a lot of filtering on this particular column, so I'm going to create an index, I'm going to partition those columns based on your filter criteria. So it's much better, much faster search.

Nathan Labenz: (0:53) Hello, and welcome to the Cognitive Revolution, where we interview visionary researchers, entrepreneurs, and builders working on the frontier of artificial intelligence. Each week, we'll explore their revolutionary ideas, and together, we'll build a picture of how AI technology will transform work, life, and society in the coming years. I'm Nathan Labenz, joined by my cohost, Erik Torenberg. Hello, and welcome back to the Cognitive Revolution. Today, my guests are Lin Qiao and Dmytro Ivchenko, cofounders of Fireworks AI, a company that specializes in inference compute, partnering with the world's leading generative AI researchers to serve the best models at the fastest speeds. Lin and Dmytro both previously worked on PyTorch at Meta, which is today the default, go-to AI framework powering applications used by billions. There, they gained firsthand experience with the immense challenges of running large language models at massive scale, and the many trade-offs between latency, cost, and scalability that are always involved. With Fireworks, they're building an end-to-end platform to make it radically easier and more cost-effective for any company to put generative AI into production. This spans the full technology stack: providing simple tools for executing parameter-efficient fine-tuning techniques like LoRA that help developers iterate quickly toward product-market fit; developing highly optimized deployments, leveraging multiple layers of abstraction, including custom CUDA kernels, to deliver consistently low latency; and managing and scaling hardware across major cloud compute providers in a way that's seamless to their customers. This is a wide-ranging discussion. Lin and Dmytro share their hard-earned expertise on the intricacies of AI inference, and we dive deep into the weeds on topics including: the different priorities that their customers have, such as minimizing time to first token, which is particularly important for voice applications; how some inference compute providers today are using Uber-style subsidized pricing to win business, and why Lin thinks developers should be cautious about building on these platforms; why she considers OpenAI and Anthropic to be Fireworks' real long-term competition; why Fireworks is betting that all small models, whether open or closed source, will ultimately converge in capabilities; the main parallelization techniques, including tensor and pipeline parallelism, that they're using to spread models across GPUs in different ways with different benefits; why software is struggling to keep up with the pace of advances in hardware; and how Fireworks is working toward an automated optimizer that will eventually allow even nontechnical customers to choose the best configurations for their use cases. Finally, at the end, we brought Dmytro back for a short bonus discussion to cover their recently announced partnership with Stability AI, which has them powering Stable Diffusion 3 generation on an exclusive basis. We talked a bit about some of the subtle differences between the image and text generation use cases. And overall, I came away with the sense that this partnership makes a ton of sense and might become a new pattern in the industry as research groups look to make their work widely and effectively available while also finding ways to earn a return on their investment.
Whether you are an AI engineer wrestling with model deployment or an executive evaluating AI platforms, Lin and Dmytro offer a rare peek behind the curtain of the infrastructure layer, which may not get as much hype as the latest state-of-the-art model, but is obviously critical to realizing the potential of this technology. If you, like me, default to high-speed podcast listening, I might suggest slowing this one down a little bit. Lin is originally from China, and Dmytro is originally from Ukraine, and this conversation does get quite technical at times. So when I listened back, I found that 1.5x speed was as fast as I could comfortably go. As always, if you find value in the show, we'd appreciate an online share or review, and we always welcome your feedback via our website, cognitiverevolution.ai, or by DMing me on your favorite social network. Finally, for today, if you're an AI engineer looking for a new opportunity or a business owner thinking about investing in AI tools or custom applications, I especially encourage you to get in touch. A number of companies that I'm connected with, including the executive assistant company Athena, which I've mentioned many times, and Licit, where I'm a small-time investor, plus a number of previous guests, are looking for AI engineering talent. And I'm also helping to start a boutique AI advisory and custom application studio, which will serve small and medium-sized businesses. That company is still in stealth mode for now, but the founder is offering free one-on-one consultations to business owners who have a practical problem that AI might be able to help solve. If any of this sounds interesting, send me a DM, and I can connect you with the right people behind the scenes. Now here's my conversation on AI inference, optimized for scale, speed, and efficiency, with Lin Qiao and Dmytro Ivchenko of Fireworks AI.

Nathan Labenz: (5:42) Lin Qiao and Dmytro Ivchenko, cofounders of Fireworks AI. Welcome to the Cognitive Revolution.

Lin Qiao: (5:47) Thanks for having us.

Nathan Labenz: (5:48) Thank you. Yeah. I'm excited for this conversation. You guys are in a super interesting business, which I will confess to not knowing a ton about. You provide primarily inference compute, and people broadly are well aware of the fact that compute is one of the hottest commodities in the world today; you don't need to look any farther than NVIDIA's stock price to get a sense for how high the demand is for compute. I also hear speculation that it's a tough business to be in, because commodity businesses can be tough long term. And I also know that there are a lot of low-level execution details that really matter in businesses like these, and I'm super interested to learn more about some of the close-to-the-metal work that you're doing. So maybe to start off, just give us a quick intro to Fireworks AI, how you guys got the idea to start the business, and what your big-picture vision is for it.

Lin Qiao: (6:41) Yeah, definitely. So I think this is a very interesting question about, hey, is an inference compute provider a low-margin commodity business? Let me answer it a different way. First of all, reselling hardware is a low-margin business. I still remember when we just started, we were brainstorming all kinds of different ideas, and we did notice a demand: because of the GPU shortage, GPU arbitrage could be an interesting problem to solve. And we decided to stay away from that, because indeed, as you said, it's low margin and it's not a sustainable business. I have also seen artificially manufactured pricing from a highly competitive landscape, and there's no way those providers can build a sustainable long-term business. At a minimum, people need to take caution building on top of such a stack, because when those companies or startups run out of funding, they will disappear. One of the important things we wanted to build Fireworks on top of is a focus on our specialty, based on our experience. And here, when we think about GenAI: it is going to empower a whole slew of application and product disruption, particularly consumer, prosumer, and developer facing. The fundamental reason is that this is a new revolutionary technology that didn't exist before, one that can generate content emulating what humans can generate. And by definition, the recipients of this content are by and large human. For those B2C applications, latency is a very important part of the product experience, because it has to be hyper-interactive; without that interactiveness, it's not a viable product. So many of our customers come to us for extremely low latency requirements. At the same time, the content generated has to be high quality, and we provide high quality through an automated loop across fine-tuning and inference. And last but not least, we provide low TCO in a highly sustainable way. Low TCO is very important here, because this is a different business from traditional application development built on top of commodity CPUs. There, the gross margin is very high because CPUs are so cheap. Now we shift toward GPUs, and GPUs are not just expensive hardware; they are very power hungry. They consume a lot of power, they generate a lot of heat, and traditional air cooling doesn't work. People have to use liquid cooling or immersion cooling, which means putting the GPU inside oil, fully immersed. Those are all costs of operating GenAI inference. That very high cost makes it very challenging to justify business impact, and we specialize in reducing total cost of ownership, so if you have a viable product, we can turn it into a viable business. That's the value-add we have been focused on delivering from the Fireworks side. In terms of product, a quick overview: we provide a GenAI platform for fast experiment iteration and for inference scaling in production. There are two development loops we're optimizing for. In the inner loop of product experimentation, we optimize for iteration speed. In the outer loop of production, we optimize for hard system metrics, including the latency, TCO, scalability, and reliability I just mentioned. The specific product features on these two loops are fine-tuning and inference with on-demand serving and automation, for fast iteration in the inner loop, and fast inference at scale for the outer loop. When you have product-market fit, you want to scale a business, and in that outer loop of production we help you get the best speed and the best TCO.
So that's a very high-level summary of the Fireworks product.

Nathan Labenz: (10:38) Cool. Okay. Several threads there that I want to follow up on. One, interestingly, it sounds like you're saying that a lot of products on the market today are in an Uber moment, where you think they're essentially being offered below cost in a fundamentally unsustainable way. And by contrast, I understand that you are operating your business without doing that. You're not radically subsidizing the customer. Do I have that correct?

Lin Qiao: (11:06) That's correct. But also, I just want to call out that our value-add is not extremely low cost. Our value-add is low latency, high quality, and low TCO. I'm almost saying low TCO is a byproduct of our high performance.

Nathan Labenz: (11:22) Yeah. I've experienced this in my own application development, where TCO is not the natural way that a lot of engineers think of things, and they specifically don't tend to factor in the cost of their own salaries as they start to build out infrastructure. I am a big believer in the specialization of dedicated infrastructure providers, just because I've seen how hard it is to create a reasonably reliable stack when you're doing it on your own. But where are we in this market development cycle, and where are things headed? It's an interesting observation off the bat that some of these bargain options, the cheapest ones out there, are subsidized in an unsustainable way. I know that's happening a lot at the application layer, because certainly people are offering free demos; that's obviously subsidized, right? All the tokens are costing money, and people are certainly giving away a lot of free accounts and a lot of free inference to their end users. But I hadn't really considered how much that might be happening at the inference compute layer, so it's really interesting to consider that that is also a reality in today's world.

Lin Qiao: (12:29) Yeah. A lot of our customers, whether they are developers or enterprises, come to us not because we are the lowest-priced provider at all. They come to us because, again, they're building consumer-, prosumer-, and developer-facing applications that require very low latency, and they cannot get it by themselves. They cannot get it from any other providers; even from OpenAI and Anthropic, they didn't get the right latency, or the latency is not stable. So they are seeking a solution from our side. Getting to low latency is actually not easy, and Dmytro can speak a lot to it today. But the high-level challenge is that GenAI models are among the largest models in size and complexity in the whole spectrum of machine learning. In the early days of machine learning, the algorithms were extremely simple, tens of megabytes. Now we're talking about tens of billions of parameters. But that complexity doesn't change the nature of all these B2C applications. Whether built on traditional machine learning or built on top of GenAI, it doesn't change the latency requirements: it has to be extremely interactive, and that puts a lot of back pressure on the inference serving tier to do even more aggressive optimization. That's the challenge we are really good at addressing, and that's our biggest value-add. Beyond that, I don't think pricing in the long run is a sustainable value-add.

Nathan Labenz: (13:58) Yeah. Makes sense. Okay, cool. So I wanna get into then how you're doing it to the greatest degree that you can share and that I can comprehend because I do think this stuff is gonna get very into the weeds as we get close to the metal. Do I understand correctly that you are managing your own servers on racks? You're talking about cooling and all this sort of stuff. So are you vertically integrated to that level?

Lin Qiao: (14:22) Yeah. So that's a good segue to talk about our compute stack. We currently build on top of CSPs, so we're running on top of AWS, GCP, and Oracle OCI. The reason is that when we grow bigger, there are various ways to be more efficient, but right now we are optimizing for velocity of our product development, and those CSPs have been battle-tested. That's why we build on top of them. We also, as a company, aspire to run on top of the best hardware across the whole entire industry. Of course, it's easiest to build on top of NVIDIA GPUs, and that's how we started. But at the same time, we see a lot of emerging hardware coming into this hardware landscape for GenAI, including AMD, including Intel, including custom ASICs. So there are various interesting trade-offs. I'll pass it to Dmytro to talk about the trade-offs across those different hardware providers.

Dmytro Ivchenko: (15:20) Sure. Yeah. So as Lin mentioned, historically, the overall AI landscape has been primarily dominated by NVIDIA, and the reason for NVIDIA's domination is twofold. First, they have pretty good hardware. If you look at the latest H100, it has about 2 petaflops. This is pretty high, and they are pushing with the new Blackwell to double that, and then you go to 10 petaflops for their superchips. They are also doubling and quadrupling memory bandwidth along the way. They're also improving their cross-host interconnect, and the most welcome development there is the lower latencies over the interconnect, which will in turn allow us to reduce the latencies of our generation speed. That's one. The second thing is that NVIDIA historically has had a very robust developer stack. Because of these two reasons, I think NVIDIA has been dominating. But lately, I would say, NVIDIA's rival AMD is coming back, now with the new MI300, which has better hardware specs than NVIDIA's. And they are also working, I'm sure, day and night on their software stack, and there's the new, welcome open source development on the ROCm stack. They are basically catching up to NVIDIA in all aspects as well. We also see other newcomers: there is development from Intel, with the Gaudi 2 and the upcoming Gaudi 3 release, which is actually even more powerful than the MI300, so it sits in between the H100 and the B100. So that's on the more programmable side of things. Now let's look at the other end of the spectrum, the less programmable, more specialized hardware, namely ASICs. Here I would probably just pick one, which is Groq. They have been making a little bit of news lately because of their nice demo, the high generation speeds. But the thing is that ASICs always compromise on something. If you look at Groq, supposedly, there is very little detail, and sometimes you need to guess what these guys do, because they're not as open as NVIDIA. But what they're fundamentally doing is removing the shared memory, this HBM in GPUs, and putting memory as close to the compute units as possible; they put it on the die. But then they are compromising on the processing units. That memory is also very expensive, because putting anything on die is way more expensive than having memory shared across the units. The result is that they have much lower achievable flops. What they can do faster is run the generation faster, because they don't need to copy memory from the shared memory to the memory which is on die. So this is the fundamental compromise they're making. They're also way more expensive to operate, because if you want to host a single model, you need to have a huge setup, which costs millions of dollars. So these are the compromises there. It's very interesting, and they create cool demos, but you can see limited application of this in the real world, and sort of limited disruption. It does solve some interesting cases, but those seem to be more edge cases right now where it wins over traditional programmable GPU hardware. There's also TPUs, of course. They've always been around, and in terms of programmability, they sit in between ASICs and GPUs.
Although PyTorch technically supports TPUs, to get the best performance, you have to go to JAX, Google's world, the Google software stack. So it becomes quite a different business to operate. So that's it in a nutshell.

Lin Qiao: (19:09) Yeah. So speaking about that, as you can see, it's a spectrum, from most programmable to hyper-specialized. And I want to talk a little bit about our Meta experience, because both of us, and actually a large portion of the founding team, came from Meta, and we all worked on PyTorch extensively over the past five years. It was a very interesting experience. Of course, as you can see, Meta, like other hyperscalers, runs on top of CPUs, GPUs, and ASICs (this is public information), a variety of hardware. Recently they mentioned they deployed more than 600,000 H100s in house. What does that mean? To put it in perspective, that's equal to the power needed for 600,000 average American households. That's roughly twice the size of San Francisco. Huge power consumption. But Meta didn't get there overnight. It took us more than five years to go from basic machine learning on CPUs to complex deep learning on GPUs and ASICs, and there are a lot of interesting lessons learned, because we worked on the software stack, right? The first lesson learned is that a good product cannot compromise and needs to be opinionated about its product philosophy and its target audience. At Meta, when we started to work on unifying AI infrastructure, we had at least three different AI frameworks; that's in 2018. We had Caffe2 for mobile, ONNX for server production, and PyTorch for research. So our charter was to unify them into one and cover server, mobile, research, and production. It's called PyTorch 1. This felt like mission impossible at the moment, because it had so many different goals. So we merged all these teams together, but there was no consensus on how to build this one framework. We came up with an idealistic zipper approach, and the project was literally called Zipper: take the PyTorch front end, which is very easy and simple to use, and zip it together with the Caffe2 backend, which is highly performant. It failed miserably, because these two frameworks were never designed to work together. The integration overhead was more than writing a new framework from scratch. So we ended up ditching that plan; we kept the PyTorch front end and rewrote the backend into TorchScript, the interpreter to execute PyTorch efficiently. Notice that the design decision, the product choice we made, was to hold the line on ease of use, hold the line on ease of AI innovation for developers building on PyTorch, and then take on all the complexity ourselves in building the backend. That was the design decision. However, TorchScript for PyTorch 1 required developers to annotate which parts of the PyTorch code are scriptable. That's not a very good UX, because people need to know how to annotate. So two years ago, we also started torch.compile to fully automate the process, so people don't need to annotate at all. That's called PyTorch 2. We left before the project fully shipped, but I'm thrilled to see the great progress the PyTorch team has made. Another interesting takeaway we had in that process: we thought it should be easy to bring PyTorch to production. We thought the process was just swapping the libraries, because PyTorch, Caffe2, and ONNX are all just libraries; they're not services. Just swap those libraries inside the service. However, that was a wrong assumption. A framework or library innovation requires completely new infrastructure services to be built to support it. We ended up having to build new data loaders and a new data transformation layer for training with PyTorch.
We needed to build a new training loop for PyTorch. We needed to build a new distributed inference stack for PyTorch. So after five years, we had written an entire PyTorch platform from the ground up. It was serving more than 50 trillion inferences per day across 50 data centers. As you can see, that journey took us five years. But with Fireworks, we want to shrink this five-year journey into five weeks, or even just five days, for the current, broader set of developers in the industry. We have built battle-tested, large-scale AI systems, so we're confident that we can out-innovate the incumbents and significantly shorten time to market for everyone out there who wants to build new disruptive applications and products on top of GenAI.

Nathan Labenz: (23:31) Hey, we'll continue our interview in a moment after a word from our sponsors. If I try to summarize that and make the translation from the Meta experience to the Fireworks product direction, it seems like the key principles are: never compromise on the product, user experience above all, the users hate to wait. I certainly know that from personal experience. And there's an echo of that at the developer layer too: developers hate complexity and don't wanna have to deal with all this mess. So, focus on the end user experience above all, focus on the developer experience second, simplify everything, bring all the complexity in house, abstract away from all the different kinds of hardware. And it sounds like the place that you wanna carve out for yourselves with Fireworks is a layer that sits between the developers and all the different hardware providers, so that folks can easily develop their applications and not really have to worry about what hardware they're running on, the relative strengths and weaknesses of all that, or what bottlenecks they're gonna hit, depending on all the different stacks. Tell me if that's right, first of all. And then I'm very curious to get into a little bit more detail on what some of those bottlenecks are. I'm pretty rudimentary in my GPU knowledge, but I'm learning, and the audience will be pretty varied in terms of how much command they have of this stuff. First of all, do I have that general story and kind of market position and vision right?

Lin Qiao: (25:10) Yeah, it's mostly right. A little bit more nuance is: what is getting in the way of the fast iteration loop? Here, we're talking about the current set of application and product developers. They haven't jumped onto AI yet, and they have a slew of AI tools they have to learn. They have to pick and choose different models for their use cases, and also try to optimize for latency and try to justify business impact through low TCO. All of that is on their plate right now. Our goal is to take away those concerns and let them focus on where they should be focusing, which is application and product development. We give them proper tools and a higher-level abstraction, so they get low latency, low TCO, and high quality extremely easily. Of course, it sounds really nice, and we can dive a lot deeper into how we get there. But at a high level, I will break it into two buckets. One is low latency and low TCO; that's the bucket we should dive deeper into, driven by our system optimization, our performance optimization. The other bucket is quality. And we have to talk about quality, because in the end we think our competition is not other inference providers. Our real competition is actually OpenAI and Anthropic. So solving and addressing people's quality issues is a very high priority among our company goals.

Nathan Labenz: (26:33) Yeah. Okay. Cool. Those are both really interesting. Well, let's maybe start with the quality one, because that's probably the more intuitive one. I was noticing in trying the product, and I always do go in and try products in my preparation for these conversations, that the experience is super easy to get started. Just create an account; next thing you know, you're in a playground. You can choose from all these models. You've got what you call serverless models, which is a very similar experience to what you get with OpenAI and Anthropic, where you're making an API call and you're purely paying for tokens. And that's all super simple. You can test stuff out, try all the different models in the playground, and then hit the give-me-the-code button, and it pops up the code, and you can go copy that code and drop it into your application. So all that is probably reasonably familiar to folks in our audience; they've done that kind of thing at least with an OpenAI or an Anthropic, if not another inference provider. Quality, though, is a tough challenge, right? Because, with some notable exceptions that are starting to crack into the top 10, certainly still the very best models are proprietary to the OpenAIs and Anthropics and Google DeepMinds of the world. And notably too, the price is also starting to get pretty low from those guys. Haiku from Anthropic is a really interesting point of comparison. The price that you guys have for the small tier of models, up to 16 billion parameters (I'm interested to hear how that cutoff was selected), is 20¢ per million tokens, which is crazy to think. Not that long ago, it was 6¢ per thousand tokens from OpenAI with the original GPT-3. So we've gone from 6¢ per thousand to 20¢ per million, for better models than the original GPT-3, in just two years. I always like to stand back and just marvel at how fast that cost has come down. But the leaders are not too far behind, right? With Haiku, Anthropic is at 25¢ per million, at least on input; they charge a little more for the output. But with the long context, presumably a lot of the time it is a pretty input-heavy workload. So, yeah, let's talk about quality. How do you win in a world of Haiku? I would assume that's the number one direct competitor for those smallest models. Curious to hear how you're thinking about the quality challenge.

Lin Qiao: (28:57) Yeah, definitely. So between closed source vendors and open source vendors, who is going to win? It's clear that we are currently in a very intense race between open source and closed source models. This week is an interesting week: the Mistral team just dropped a new MoE model with 8 experts of 22 billion parameters, and we just enabled it. Meta will open-source Llama 3 in a few weeks. Google has been continuing to open-source newer Gemma models. And the Qwen and Yi models from those Chinese institutions are getting better and better also. At the same time, OpenAI and Anthropic keep improving their model quality, including at the smaller model size scale. So I think all these models in the same size bucket will converge in quality eventually. The reason is, for smaller models, the size puts an upper bound on how much knowledge the model can absorb, and that determines capability. And we are all also converging on how much data we can train a model on. So it's just inevitable that over time, whether open source or closed source, the model quality will be similar. If that's true, then I will argue open source models have a much stronger ecosystem potential, because they have a lot more active people engagement. Developers are engaging in tuning open source models, and then they open-source those, and this keeps going. It has compounding effects, so more and more people can build on top of each other's work. This is how I'm thinking, and that's why Fireworks fundamentally builds on top of the open source ecosystem. We also just launched our fine-tuning service. It's relatively new, so in terms of revenue it's not dominating any of our other product features, but it's one of our fastest-growing product features. With that said, I will still say right now we are not there yet. Open source and closed source small models are converging, but right now the open source models still need fine-tuning. And the challenge of fine-tuning is that it's a much longer development process than prompt engineering. We fully acknowledge that, and that's why we, as a company, want to solve those problems to make fine-tuning much faster, easier, and actionable. Because once you see the result, hey, what is not working, it's actionable. And we will by and large automate a lot of this pain in fine-tuning away, so our developers will have a much simpler experience, as close to prompt engineering as possible. So that's one aspect. The other aspect is: eventually, OpenAI and Anthropic, and Meta too, their goal is to get to AGI, right? What is AGI? It's basically building an AI system that's smarter than humans. It means this system can solve very complex problems, and hundreds of them at the same time. But if we look at the problems developers and enterprises are actually solving, they don't solve hundreds of very complex logical reasoning problems at the same time. They probably have a handful of very specific problems. For example, they want to solve a classification problem, from intent routing, where based on intent they route to different agents; to categorizing a product catalog nicely; to re-ranking retrieved results to get the top K best answers; or to do structured data extraction from images, like extracting information from receipts, from medical bills, from insurance policies; to paraphrasing emails for better sales follow-up or for better marketing lead generation. The list can go very long, but as you can see, every single example I mentioned here is very specific.
So that creates an interesting opportunity for us, especially for smaller open source models. Think about model training as a process of aligning the model to optimize for a set of objectives. This is a very important framing, because what the model is good at is decided at the beginning of the training process. How you form the training data set, what proportions of what kinds of training data you mix together, is basically a product opinion you put into what the capability of this end model will be. No model today is going to be good at solving all kinds of problems; you have to pick and choose by setting the objective at the beginning. So that's the general framing, but you can also think about it this way: it's much easier to align a model to solve one specific problem very well than to ask the model to solve hundreds of problems very well. Narrowing the objective down to one problem is a much easier alignment process. That's why, in practice, all the practical problems we're seeing today are amenable to smaller models: the problems are very narrow and well defined. And we have seen a lot of success using fine-tuning to solve those problems. Second, almost every single enterprise we talk with has data, data to align a model better, sometimes not just on par with but even better than GPT-4. So based on those observations, we're pretty bullish on continuing in this direction: making the feedback loop of quality between fine-tuning, inference, data collection, and cleansing very efficient for our users. And again, we will try to automate as much as possible in this process.

Nathan Labenz: (34:34) Okay, cool. I have a number of follow-ups there as well. Just speaking to my own experience briefly: the general notion that fine-tuning is the answer seems like a really good response to how do we compete with Haiku. In fact, at Waymark, my company, we use a fine-tuned GPT-3.5 currently to power our core script-writing task, the number one most important AI function in the product. We could use GPT-4 for it; it's not really a budget thing for us, we have a pretty high-value use case. So our strategy is: use whatever gives us the best results, or really the best overall user experience, I'd say. And the quality of the output is probably the number one factor there. Latency also is important; as we've discussed, people don't wanna wait. So at the moment, we're on a 3.5 fine-tuned model. GPT-4 could probably do the task pretty well, and we wouldn't be scared off by the cost, but it is a little bit slow sometimes. It's also a little bit unwieldy: you're trying to prompt engineer your way through all these different caveats and rules and whatever, and it's hard to represent all that stuff in the prompt. You could also think, oh, maybe we could use Haiku, and maybe we could even put, like, 10 examples into Haiku, and that could, in theory, get it to do the in-context learning. But now we've gone up an order of magnitude in price, because we're doing 10 examples at every prompt. Fine-tuning, I do think, is a really good answer. And certainly, as you said, companies have data. But even if they don't, a lot of companies would probably be very surprised at how little data it really takes to do a reasonable fine-tuning. Our data set is in the three figures: hundreds, not even thousands, of data points. And that works quite well, actually. We don't have to have a huge data set for a narrow task. If you're trying to maintain generality, then you have a much bigger challenge on your hands. But for this one task, write a script for this business, for this video, hundreds of data points, we have found, pretty well suffices.

Lin Qiao: (36:36) That's excellent point. Yeah.

[Ad break]

Lin Qiao: (37:15) Yeah, this is a great way to summarize it, because one of our product design philosophies is to be OpenAI compatible, since OpenAI has a lot of draw for the initial set of developers trying out ideas. The benefit of being OpenAI compatible, from fine-tuning to inference, is that it's really easy to migrate. So even if our developers don't start with us, once they get to a stage where they need more interesting fine-tuning that OpenAI doesn't provide, and they need low latency and low TCO for production, they can move to us. So you're absolutely right, the API will feel very familiar, but on top of that, we're also adding new APIs that don't exist in OpenAI. That's also our unique advantage. But coming back to fine-tuning: we started by offering a special kind of fine-tuning called PEFT, parameter-efficient fine-tuning, actually invented a while ago. The most popular technique in that bucket is called LoRA. LoRA stands for low-rank adaptation. The idea is that you freeze the pretrained model weights and inject trainable rank decomposition matrices into each layer of the transformer architecture, and the end result is a greatly reduced number of trainable parameters for downstream tasks. Usually you can think of fine-tuning as short training, right? You need to go forward, backward, and do all this stuff. With LoRA, your base model is forward-only, and only your additional adapter goes through the full training steps. That's how it saves compute and is very efficient. And as you just mentioned, and actually many people give this consistent feedback, you don't need a lot of training or tuning samples: typically around 1,000, or in your example, just hundreds of them will be sufficient. Compounded with very efficient LoRA fine-tuning, that means your feedback loop is going to be really fast, and that makes the whole fine-tuning process an even more appealing alternative. So let me spend a little bit of time explaining what LoRA means. Under the hood, it leverages a concept called rank decomposition. Rank decomposition basically allows us to represent a high-dimensional matrix as a product of two lower-dimensional matrices. So if the pretrained weight matrix is, let's say, of dimension N by K, then it can be represented as the product of two matrices: call the first one N by R, and the second one R by K. If we multiply these two matrices, you still get N by K, right? That's the high-level idea. Let's make the saving more concrete: say N is 2,000, K is 5,000, and R is 1, an extreme case, just to make the point. The total number of parameters in the original matrix is 2,000 by 5,000; that's 10 million. If we decompose into lower rank, the first matrix is 2,000 by 1, that's 2,000 parameters, and the second is 1 by 5,000, that's 5,000. So the total number of parameters across these two lower-dimensional matrices is 2,000 plus 5,000: 7,000. And 7,000 is a significantly lower number of parameters than the 10 million in the original matrix; this is more than a 1,000-times reduction. That's the idea behind why using LoRA to do fine-tuning is so much faster.
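To make Lin's arithmetic concrete, here is a minimal PyTorch sketch of the rank decomposition she describes, using her exact dimensions (N = 2,000, K = 5,000, R = 1). This is an illustration of the general LoRA idea, not Fireworks' implementation; the variable names and the zero-initialization of one factor are our own assumptions, following common LoRA convention.

```python
import torch

N, K, R = 2000, 5000, 1

W = torch.randn(N, K)          # frozen pretrained weight: N*K = 10,000,000 params
A = torch.randn(N, R) * 0.01   # trainable factor: N*R = 2,000 params
B = torch.zeros(R, K)          # trainable factor: R*K = 5,000 params (zero-init
                               # so training starts exactly at the base weights)

x = torch.randn(32, N)         # a batch of 32 activation vectors

# Forward pass: the effective weight is W + A @ B, but W itself never changes.
y = x @ (W + A @ B)            # shape: (32, K)

print(W.numel())               # 10000000
print(A.numel() + B.numel())   # 7000 -> roughly 1,400x fewer trainable params
```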
And it's not just for faster fine-tuning; at inference, we can also do something very interesting. We actually have many customers come to us who need to deploy many LoRA adapters against the same base model. The naive way to deploy those LoRA adapters is to merge each one with the base model and then deploy the result. If they have 100 LoRA adapters, you have to deploy 100 models, and all 100 models sit in memory. We were one of the first to deploy many LoRAs sharing the same base model. You can imagine it looks like a tree: there's a trunk, that's the base model, and each LoRA adapter hangs on the trunk. By doing this sharing, we can save a lot of cost in production for inference, because for 100 LoRA adapters, without the saving, you'd have to deploy the base model 100 times; now you just deploy the one base model that dominates the cost. By that, you save close to 100 times the cost. We use LoRA pretty extensively across both fine-tuning and inference, and that significantly increases the velocity of iteration. If you haven't used it, I strongly encourage you to try our fine-tuning service.
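The trunk-and-branches setup Lin describes can be sketched as follows. This is a hedged illustration of multi-LoRA serving with a shared base model, not Fireworks' actual serving code; the adapter registry and function names are hypothetical.

```python
import torch

N, K, R = 2000, 5000, 8

base_W = torch.randn(N, K)     # the "trunk": loaded into GPU memory once

# Each "branch" is a pair of small LoRA factors, keyed by adapter id.
adapters = {
    "customer_a": (torch.randn(N, R) * 0.01, torch.randn(R, K) * 0.01),
    "customer_b": (torch.randn(N, R) * 0.01, torch.randn(R, K) * 0.01),
}

def lora_forward(x: torch.Tensor, adapter_id: str) -> torch.Tensor:
    A, B = adapters[adapter_id]
    # Add the low-rank path without ever materializing W + A @ B, so the
    # marginal memory per adapter is N*R + R*K params, not a full model copy.
    return x @ base_W + (x @ A) @ B

y = lora_forward(torch.randn(4, N), "customer_a")  # shape: (4, K)
```

The key design point is that the base weights are read once and shared, so adding a hundredth adapter costs kilobytes to megabytes of extra memory instead of another multi-gigabyte model replica.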

Nathan Labenz: (42:03) So, yeah, let's just flesh that out a little bit more for people; you tell me if I'm going wrong anywhere. The high-level situation is you wanna have a model deployed that you can get inference from quickly. If you have all your computers turned off, you have the cold start problem of, okay, I gotta boot up a Docker container and load new stuff in, and then these billions of parameters take time to move in. So typically, most services are going to try to have some mechanism of not making you wait for a full cold start. Instead, what you have when something is deployed is a GPU sitting there with the model loaded into the high-bandwidth memory, which is to say the second-tier memory. I've gone into this a bit on a previous episode where I dug into the Mamba architecture. But basically, on your GPUs, you've got your many computation cores. There is the SRAM, which is the lowest-latency RAM, closest to the computation cores, but it's small; you can't fit the whole model in there. So you have to have this second-tier RAM, the high-bandwidth memory, where the actual billions and billions of parameters of the model sit. And then when you're actually doing inference, you're paging parameters from high-bandwidth memory into the SRAM for the actual computation, paging them in and out to do stuff. And what you're saying with the many-LoRAs configuration is: the naive approach, as you described it, would be to say, okay, I'm gonna have a hundred different servers; each one will have its base model and its LoRA. And the LoRA is, whatever, 1 to 3 percent as many parameters, depending on the setup and exactly how you do it. So you could have a hundred of those, each with the base model and the LoRA. Or you could have your base model and a bunch of LoRAs all sitting in high-bandwidth memory on a single GPU, and then you just have to manage the paging, keeping track of which LoRA we're using for which use case. That kind of stuff gets a little bit more complicated, but it allows you to save on having to set up all those servers and have them waiting there. What's the next level of sophistication in terms of the analysis? What are the bottlenecks? What are the trade-offs that you're facing? Tell me, what do I need to know next to get smarter beyond that base-level description?

Lin Qiao: (44:37) Yeah. So we did a lot of very complicated inference stack optimization to bring down latency. As we have discussed in this episode, we hyper-focus on latency. But while we're hyper-focused on very low latency, we also hold the latency bar while pumping throughput up very high, and the result of that is low TCO. I will pass it to Dmytro to talk about all the trade-off nuances we have put into the inference optimization stack.

Dmytro Ivchenko: (45:06) So I wanna take a step back here and go over some very basic steps of how to build your own GenAI inference service, and what are the key points you should pay attention to. Because right now we have a huge information overload. There's so much information: oh, this attention implementation is the best, or this MoE architecture is the next big thing, or we need to shard the model differently. There are so many different techniques. But the question still remains: what are the actual techniques which really do matter? That's what I wanna focus on, these important points. Okay, now let's say we wanna build GenAI inference, and let's focus on maybe one or two cases: one is text generation, and the other is image generation. Luckily, the very welcome development is that the new model from Stability for image generation is based on the transformer architecture. So the old architecture based on the U-Net is gone, and with that, all the convolutions are not as important anymore. So all you have to worry about is the transformer now, which to some extent is great, but it has its own challenges. Now, what does matter for the transformer? With the transformer, I would say there are only two operations which matter. One is the most ubiquitous matrix multiplication; across all of AI, GenAI or not, matmul is the most important one. And the second most important one is, of course, attention. Attention is a bit of a special kind of matmul, I would say kind of a back-to-back batched matmul, and optimizing it is quite critical. Okay, so now let's say you optimize these two operations. But guess what, there are different flavors of these operations that you need. These flavors come from the text generation workload, because text generation is a bit special: there are input tokens and output tokens, and processing them is vastly different. The processing of input tokens is mostly compute bound, and generating tokens is mostly memory-bandwidth bound. So you have to have two flavors of these two operations: one is flops optimized, compute optimized, and the second is memory-bandwidth optimized, to speed up this paging, this loading of model weights from HBM to SRAM and then to registers. That's what you need to implement. Now, how to implement? Let's take the NVIDIA stack, for example.

Nathan Labenz: (47:35) Can I just interject for one second? I just wanna make sure I understand, and that it's clear to the audience also, that we're all within inference here, right? But the two forms of latency that matter are time to first token and then the speed of token generation from there. And these are essentially two different phases of the process, because, and here's where I'm learning from you and putting two and two together in real time, you said that the input tokens, aka the processing that happens for time to first token, are gonna be compute bound. And I understand that to be because in that process, we're doing all the attention, all the MLPs, for all the tokens, but we can do that in such a way that we don't have to page the parameters in and out for each token. We can process all the tokens with those parameters, page new parameters in, process all the tokens with those parameters. And that's why it's compute bound, right? Because we're not paging in and out as much on a per-token basis. Versus then, when that's done and we get into token-by-token generation, now we need all the parameters in and out for every single token. Is that right?

Dmytro Ivchenko: (48:50) It's actually much simpler than that. The thing is, if you look at typical use cases, they have, I would say, roughly a 10-to-1 ratio of input to output tokens. The input is also typically 1,000 or more tokens, going up to 10,000; practically, it's very common to see this kind of workload. Now, for the generation, when you run the generation, first of all, you generate one token at a time. Unless you do some speculative generation, which is a separate thing, it's mostly one token at a time. And of course, yes, you wanna batch multiple generations together, but there is also a limit there, because with a big prompt length, you need to allocate a lot of memory to host your intermediate activations. So in practice, what we see is that the batch size for generation goes from around 16 to a couple hundred. The batch size for prefill, in tokens, can go to tens of thousands. So here you see one order of magnitude, maybe even more, two orders of magnitude difference. And that's where the compute versus memory bound thing comes in. For prefill, you have so many tokens to process that your bottleneck is actually the matmuls; your bottleneck is the flops, the compute on the GPU. For the generation, you don't have that many. In both situations, you need to load the weights from HBM all the way to the registers; you do it in both cases. But the bottleneck for generation is this loading; it's not the flops. So it's quite simple, actually.
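A back-of-the-envelope calculation helps show why the two phases hit different walls. The sketch below uses rough, assumed numbers (a 70B-parameter model in 16-bit weights, and approximate H100-class peak flops and HBM bandwidth), and it ignores sharding, KV-cache traffic, and compute/memory overlap, so treat the outputs as orders of magnitude only.

```python
# Assumed, rough numbers: 70B params in fp16/bf16 on an H100-class GPU
# (~1e15 dense FLOP/s, ~3.35e12 bytes/s of HBM bandwidth).
params = 70e9
bytes_per_param = 2
flops_per_token = 2 * params     # ~2 FLOPs per weight per token (matmul rule of thumb)

peak_flops = 1e15
hbm_bandwidth = 3.35e12

def step_times(tokens_in_step: int) -> tuple[float, float]:
    compute_s = tokens_in_step * flops_per_token / peak_flops
    weight_load_s = params * bytes_per_param / hbm_bandwidth  # paid once per step
    return compute_s, weight_load_s

# Decode: batch of 16 sequences, 1 new token each -> loading weights dominates.
print(step_times(16))       # ~(0.002 s compute, 0.042 s load): bandwidth bound
# Prefill: 16k prompt tokens in one step -> raw flops dominate.
print(step_times(16_384))   # ~(2.3 s compute, 0.042 s load): compute bound
```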

Nathan Labenz: (50:30) Okay. I'm not fully sure I understood the distinction between what I was trying to say and what you said. The big idea, as I understood it, was that when you do the input tokens, like to create an attention matrix, for example, you have to do every token, right, and all the attention heads throughout the thing. So there's a lot of compute there, and just on a relative basis, there's more compute relative to the moving of parameters in and out. Whereas on the generation side, now you're generating one at a time. There's not really a way to get around the fact that you have to run the whole transformer to generate one token, right?

Dmytro Ivchenko: (51:06) You do have to run the full transformer, but the fundamental thing here is that you use a fundamental optimization, which is called KV caching. In a nutshell: as you mentioned, for the initial prefill stage, all tokens have to attend to all previous tokens, so it's basically quadratic, divided by 2. For the generation, you need to do the same, but for the previous tokens, you can cache this attention. You don't have to run it again and again and again.

Nathan Labenz: (51:32) This is the KV cache?

Dmytro Ivchenko: (51:34) Yeah, exactly. The result is that only the token you're trying to generate attends to the prior ones. So this is, as you see, way cheaper, because it goes from a quadratic nature to a linear nature.
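Here is a toy decode loop that makes the KV-caching idea concrete: each step computes a query only for the new token and appends that token's key and value to a cache, so per-step attention cost grows linearly with context instead of re-running the quadratic prefill. This is a single-head, unbatched sketch for illustration only.

```python
import torch
import torch.nn.functional as F

d = 64                                   # model (and head) dimension
Wq, Wk, Wv = (torch.randn(d, d) / d**0.5 for _ in range(3))

k_cache: list[torch.Tensor] = []
v_cache: list[torch.Tensor] = []

def decode_step(x_new: torch.Tensor) -> torch.Tensor:
    q = x_new @ Wq                       # query for the new token only
    k_cache.append(x_new @ Wk)           # K and V are computed once per token...
    v_cache.append(x_new @ Wv)           # ...then reused from the cache forever
    K = torch.stack(k_cache)             # (seq_len, d)
    V = torch.stack(v_cache)             # (seq_len, d)
    scores = q @ K.T / d**0.5            # one row of attention: (seq_len,)
    return F.softmax(scores, dim=-1) @ V # attended output for the new token: (d,)

for _ in range(10):                      # ten generation steps
    out = decode_step(torch.randn(d))
```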

Nathan Labenz: (51:46) Yeah. How big does the KV cache get? And does it have to come out of the SRAM and go onto the high-bandwidth memory, or does it just stay in the SRAM the whole time?

Dmytro Ivchenko: (51:58) Yeah. So it gets really big, right? And it varies because of multiple factors. One is the ratio of the Q heads to the KV heads. The MQA paper popularized this, and now all the big models have it. They reduce the number of KV heads, usually by a factor of around 4 to 8, sometimes even more; 8 has been very common there. So that's one thing. Next, you can shard the model, right? And then you're reducing it as well. So what we see is that at lower context lengths, the KV cache is typically not that big of a problem. But at the end of the day, you have to keep it in HBM for fast access. In some cases, you can actually put it in CPU memory, and it's going to be permissible, but that introduces a lot of complexities as well. In the most typical case, you keep it in HBM.
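To put rough numbers on this, the sketch below estimates KV-cache size from the factors Dmytro lists. The configuration values approximate a Llama-2-70B-style model with grouped KV heads; they are assumptions for illustration, not a statement about any particular deployment.

```python
# Assumed config, roughly Llama-2-70B with grouped KV heads:
layers = 80
kv_heads = 8            # 64 query heads sharing 8 KV heads, an 8:1 ratio
head_dim = 128
bytes_per_elem = 2      # fp16/bf16

# K and V (the factor of 2), per layer, per token:
kv_bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem
print(kv_bytes_per_token / 1024)                    # ~320 KiB per token

seq_len, batch = 4096, 32
print(batch * seq_len * kv_bytes_per_token / 1e9)   # ~43 GB for this batch
```

With 64 KV heads instead of 8 (no grouping), the same batch would need roughly 8x more, which is why the Q-to-KV head ratio and sharding matter so much here.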

Nathan Labenz: (52:48) Okay. Cool. So you were just about to say when I interjected, okay. So let's do that on NVIDIA.

Dmytro Ivchenko: (52:53) Yeah. So, going through how you implement these matmuls and attentions very efficiently: the NVIDIA stack has a lot of APIs, and they go from very low level to very high level. Let's start from the high level. NVIDIA now has this ready-made serving stack, TensorRT-LLM, based on the Triton Inference Server; it's ready to go. It's easy to get started, although not as easy as some open source offerings, but compared to the rest, it's much easier. But it's also the most rigid, as always: if you wanna change anything, it's C++ code, so basically, good luck. Then you go down the stack, where you have more flexibility. You can take a step down and try to write your own customized matmuls using, for example, cuDNN. These are precompiled kernels; there's a config you can tweak, but you can't fundamentally change them. Then you wanna take another step down, you want even more flexibility. Now we're looking at the CUTLASS library, with its new sub-library, which is called CuTe. It's pretty cute, I would say, and much better than the older API. This is way more programmable; it's basically a C++ template library, and you can perform a lot of customizations there. But even if that is not enough, you can write the CUDA C++ yourself. And sometimes, if even that is not enough, you can go all the way to the hardware instructions and program in PTX. So all these options are possible. It's interesting how GPU programming evolved. It all started with programming CUDA cores; if you look at a GPU, that's what it used to be. You had the CPU core, which is very programmable, but not as parallelizable. Then you go to the GPU, and the GPU has these CUDA cores. They're also programmable, not as much as a CPU, but they are fast. Then time passes, and that is also not enough, because, as I mentioned, we're really only optimizing matmuls and this attention, which is also a sort of matmul. So how can we do that? The next best thing is to embed an ASIC into your GPU, and that's where the Tensor Cores come in. That's basically an ASIC embedded in the GPU. But guess what, it complicates the programming big time. Especially on the newest hardware, on the H100, the Tensor Cores are asynchronously programmable from the CUDA cores. Now you have two levels of asynchronicity: you code the CPU to launch CUDA kernels, and then from CUDA, you also asynchronously launch Tensor Core instructions. This becomes really complicated. And what we see as a result is that, practically, there are very few good public kernels geared toward the H100. V100 and A100 had a lot of good CUDA kernels geared toward them; people wrote them and enjoyed GPU programming. Then comes Hopper, with its async programming nature, and the amount of that public code just drops, it's crazy. And there is a very good explanation for it: the programming becomes even more complicated. So now the question is, how do we solve this problem? One very welcome development is Triton from OpenAI. It's actually a different compiler. I would say that C++ was probably a good choice back in the day when NVIDIA chose it for CUDA, but now it's really getting in the way. In some cases, it's much easier to program in assembly for CUDA than in that CUDA C++, to be honest. Because of that, we see a lot of experimentation, and this Triton is a very, very neat idea.
And so there you literally program in Python, and there is a lot of magic happening behind the scenes. Of course, there is no free lunch, right? It basically introduces some structure into your GPU programming and splits it into different levels. First is a very simple layer where you program just in Python. Then it translates into a so-called MLIR intermediate representation, then into a lower-level MLIR that is closer to the hardware, and then it gets translated to PTX. So they go to the lowest level right away. The nice thing about this is the structure. Because of this very complex GPU architecture we talked about, the structure actually makes it much easier to understand existing and new kernels, and longer-term maintenance is much better. If you compare C++ CUDA implementations to Triton, the CUDA ones are really hard to understand; there are thousands and thousands of lines of code. Triton provides a very nice, clean separation. But again, no free lunch. If you hope, oh, I'm just going to write a kernel in Triton, just write Python, and it's going to be super fast, then most likely, unless you're doing something pointwise, probably not. You still have to look at what is getting generated, what instructions it uses, whether they are pipelined or not, what tweaks you need to make, and so forth. You still have to look at the output in the end; there is no way around it if you want to achieve cutting-edge performance. But the nice thing is that it has this compiler, because I don't really want to write the assembly. You'd need to keep track of all the registers, there are like 255 of them, and that code is just unreadable. Here, by introducing this structure, the hope is (and it does work in many cases, although it's not a silver bullet) that the code becomes much, much more readable and maintainable, and you can still achieve state-of-the-art performance. In fact, the Triton matmul and attention implementations, although they are not as flexible as what you can write yourself, achieve close to state-of-the-art performance, which is a very nice proof of the concept.
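As a concrete illustration of the "write Python, get GPU code" flow described here, below is a minimal pointwise Triton kernel. This is an editorial sketch, not code from the conversation; as noted above, pointwise ops are the easy case, and real matmul or attention kernels need far more tuning and inspection of the generated PTX.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one BLOCK_SIZE-wide slice of the tensors.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements                      # guard the ragged tail
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    # x and y must be CUDA tensors; Triton compiles the kernel through MLIR down to PTX.
    out = torch.empty_like(x)
    n = out.numel()
    grid = lambda meta: (triton.cdiv(n, meta["BLOCK_SIZE"]),)
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out
```

Even for a kernel this simple, hitting peak memory bandwidth can require looking at the generated code, which is the "look at the output in the end" step described above.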

Nathan Labenz: (58:21) Okay. So let me try a basic summary and then ask a couple of basic questions. There are a lot of layers between electrons moving around at the very lowest level and Python-level scripting, or now we can even say prompting GPT-4 to write me the Python to do the things I want it to do. I, and certainly most of the audience, I expect, am pretty familiar with the fact that the higher the level at which you're working, the more you are at the mercy of all the other layers to determine what your ultimate performance is going to be. I have lived a pretty privileged life where most of the things I've done have been fine at the Python layer, and I've only rarely had to think too much about intensive optimization. So that much, I think, is fairly intuitive: you go lower level to do more optimization. What is not super intuitive to me is what is happening that causes the need for lots of ongoing optimization today. One might naively think, hey, if it's all transformers, there's presumably only so much optimization that would need to be done across this handful of operations within the transformer. And yet it sounds like that's not really the case. So what are the things driving the need for continual optimization? Why isn't it just a finite set of problems that have already all been solved by the community?

Dmytro Ivchenko: (59:41) Good question. Yeah. I would say the number one reason is new hardware. New hardware keeps trying to get more optimal, and it's not just getting better at the existing precision. It used to be that we did everything in single precision, 32-bit floats, and then a few years ago it shifted to half precision: FP16 for inference, BF16 for training. I would say that became the new standard. So if there's a new precision, you go to the new precision. But guess what, that is also not good enough. NVIDIA and AMD need new step functions, and what they're doing is lowering the precision. So now the new precision for inference, and to some extent for training, is FP8. It's only 8 bits per parameter, and we also do the compute in 8 bits. Then look at the B100. Guess what they introduce? Of course, they go to 4 bits. And look at the recent papers: one of the best papers that I like is the so-called 1.58-bit one. They call it 1-bit, but it's not 1 bit. Fundamentally, the idea is that to train an LLM, you just need 0, 1, and minus 1. Basically go left, go right, or stay put. That seems to be a fundamental property, and if you have that, you can train the model, and it looks like it's enough. You can basically replace a larger-parameter model with these smaller parameters. Of course you have to double-check, because this is a crazy reduction in precision, but it looks like these models perform on par with the full-precision models. So this is a very welcome development, and you will see even more hardware optimizations, because if you do this 0, 1, minus 1, guess what, you don't need multiplication; it's all addition now. You'll see even more optimizations from NVIDIA and AMD in the coming years. Right now only models around 1 to 3 billion parameters have been trained this way; once bigger models are trained, this transition will become even more critical. Now, these new precisions require very different instructions. Each generation of GPU is really optimized for a specific ratio of compute and memory bandwidth, so when the ratio changes, the nature of the APIs changes, and you basically need to code from scratch, to be honest, for the new GPU generation. All the existing code works, but it's much slower than it could be. A good example would be the flash attention implementation. FlashAttention 2 was coded for the A100 and FP16 precision. Then you run it on an H100 and you get just half of what you could get. It's the same kernel, but it uses the instructions from Ampere, while the Hopper instructions are different. So that is fundamentally why it's happening, and we still have a lot of work ahead of us for the foreseeable few years, until we get down to those very few bits. Once we reach that, let's see what's next. But there is still a long way to go.
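To make the "0, 1, minus 1" idea concrete, here is a toy editorial sketch of absmean-style ternary weight quantization in the spirit of that 1.58-bit paper. This is not Fireworks code, and the actual recipe in the paper differs in detail; it is just meant to show why ternary weights eliminate multiplications.

```python
import torch

def ternary_quantize(w: torch.Tensor, eps: float = 1e-8):
    """Toy ternary quantization: every weight becomes -1, 0, or +1, plus one per-tensor scale."""
    scale = w.abs().mean().clamp(min=eps)        # absmean scale for the whole tensor
    w_q = (w / scale).round().clamp_(-1, 1)      # ternary codes in {-1, 0, +1}
    return w_q, scale

# With ternary weights, y = (x @ w_q.T) * scale needs no multiplications by the weights:
# each output element is just a signed sum of selected inputs, which is why future hardware
# could exploit this so aggressively.
w = torch.randn(4096, 4096)
w_q, scale = ternary_quantize(w)
x = torch.randn(1, 4096)
y_approx = (x @ w_q.T) * scale   # quality must be re-checked, as noted above
```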

Nathan Labenz: (1:02:47) Yeah, that 3-bit paper, or rather the 1.58-bit paper, is super interesting.

Lin Qiao: (1:02:52) I will add to that, right? Dmytro just mentioned that hardware keeps evolving: not only does NVIDIA hardware keep moving forward with its basic instruction set changing, but there are also other product lines, right? There's AMD, there's Intel, there are custom ASICs, and so on. It's a very broad landscape and we want to simplify that for our users. But also, looking up the stack at the application level: we mentioned latency, you mentioned time to first token, right? Sometimes latency also means time to the first 30 tokens, because the app is streaming voice out and has to gather 30 tokens before it can start streaming, or people care about end-to-end latency. So latency means different things, and people always want to see a spectrum of the latency-cost trade-off. They want to see a curve and pick the point on that curve that is best for their business. On top of that, the input-output ratio is different per application. We see a lot of RAG usage, which pushes the input-to-output ratio very high; it can be 10 to 1 or even much higher. Or sometimes people just generate one token for classification, or sometimes they generate a lot, like when they're generating code. So the ratio keeps changing, and the best deployment depends on what kind of latency you care about, which dot on this latency-cost curve you want, what your input-output ratio is, what you are generating, what your input looks like, the context window length; it all matters. Does your application have repetitive prompts, for example? It's very complicated to optimize all the way from what your application looks like down to the ever-moving hardware landscape. That is a complexity that application and product developers, whether they are building fun stuff themselves or working in enterprises, are all facing. So that's where we come in and say: don't worry about it, we handle it all for you, so you can just focus on your product and application development. I just wanted to double-click back on what our role is in this complex ecosystem.
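As a back-of-the-envelope illustration of why "latency" means different things per application, here is a toy editorial model of time to first token versus time to the first 30 tokens versus end-to-end latency. All numbers and the linear cost assumptions are made up purely for illustration.

```python
def latency_profile(prompt_tokens: int, output_tokens: int,
                    prefill_tok_per_s: float = 10_000.0,
                    decode_tok_per_s: float = 100.0):
    """Toy model: prefill cost scales with prompt length, decoding is sequential per token."""
    ttft = prompt_tokens / prefill_tok_per_s                   # time to first token
    ttf30 = ttft + min(30, output_tokens) / decode_tok_per_s   # e.g. a voice app buffering 30 tokens
    e2e = ttft + output_tokens / decode_tok_per_s              # end-to-end latency
    return ttft, ttf30, e2e

# A RAG-heavy request (long prompt, short answer) vs. code generation (short prompt, long answer):
print(latency_profile(prompt_tokens=8000, output_tokens=50))
print(latency_profile(prompt_tokens=500, output_tokens=2000))
```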

Nathan Labenz: (1:05:10) Yeah. It's funny that you mention the curve and picking the point on that curve. I'm putting together a little talk for business leaders, developing a top ten of things to know, and one of my tips for them is to learn to think in Pareto curves. This is something I see at literally every level of the stack, including just false positives versus false negatives; truly, every level seems to have this. So do you envision the future of your product experience being literally showing people these curves? Should I expect to see a number of Pareto curves where I can pick: okay, for this application I want the absolute minimum time to first token, so I'll take the highest cost; for this one, I'll take the happy medium; for this one, I want the lowest cost, it's a background process or whatever, so it's okay to wait. Is that the kind of experience, the kind of choice, that you ultimately want to expose to developers?

Lin Qiao: (1:06:07) I will draw an analogy here. Twenty or twenty-five years ago, databases were a new domain, and database management systems were starting to come into the picture. It's actually a very interesting analogy; I'm thinking about it on the fly. Database queries have a strict pattern, it's called SQL: you have your select clause, your where clause, your group-by doing aggregation. It's a very clear pattern. If you think about it, GenAI models have a very clear pattern too; you have your operator layers, and it's more or less stable right now. But although the pattern is simple, depending on how many columns you have, which column you filter on, which column you aggregate, which column you group by, there are all different ways to optimize your query. In the early days, none of these database management systems were smart. That's why an entire new career was created, the DBA: basically, all databases have knobs for humans to tune to try to optimize things, and those people made a lot of money, because once they optimized, it was a much better experience and saved a lot of money too. It's very similar to what we just talked about with latency and TCO. But over time, all these database management systems became smarter and smarter, because they all have a layer called the optimizer. The optimizer observes the workload and starts to figure out: oh, you're doing a lot of filtering on this particular column, so I'm going to create an index; I'm going to partition those columns based on your filter criteria, so search is much faster. You can skip a lot of the sequential scanning you would otherwise have to do, and so on. So the optimizer becomes smarter and you don't need to hire DBAs at all; it's fully automated. As long as it observes the workload, it self-tunes towards the best outcome. That's our vision. Today we have multiple configurations: based on what you tell us about your workload patterns and what you care about, we deploy for you, which is really good. But over time, we want to automate our way out of that process. Basically, we'll learn from what you are running and the system becomes smarter and smarter. It can get lower latency over time, it can get higher quality over time. That's what we aspire to build.

Nathan Labenz: (1:08:27) So these options in the product line today would be in the dedicated deployments product, right? Can you run through some of them? I scanned the docs, but I wouldn't say I conceived of it the way you're describing it now. What are some of the choices I get to make today, depending on what my application needs are, and how does that evolve in the short and medium term?

Lin Qiao: (1:08:51) Yeah. For example, if you look at our product offering, we have three tiers: the developer tier, the business tier, and the enterprise tier. The business tier is more like serverless, pay-as-you-go. In the enterprise tier, customers usually have a clear workload definition and know where on the latency-cost curve they want to be, because they have a budget. They also have product requirements and know their workload, in terms of input, how much RAG they're using, how much generation, and so on. Then we come in, pick the right spot for them, and deploy a specific configuration for their workload. Over time this can become more and more automated, because their workload can evolve and they shouldn't always have to get us in the loop; the system just self-adjusts to the ever-changing workload. That's the direction we're heading.

Nathan Labenz: (1:09:44) Could you give me a little more intuition? Let's say I'm an enterprise customer and I have three applications. One is user-facing and generates tokens for a voice app, and those first 30 tokens, as you said, are super critical to get fast because I want conversational fluency. The second is a background job where I just want it as cheap as I can get it. And then there's maybe a chatbot in the middle: I don't want to pay top dollar for it, but I don't want it to be slow. How does that trickle down into what you actually deploy for those three different scenarios?

Lin Qiao: (1:10:18) Usually an enterprise has many applications, and the distribution of traffic is always skewed, right? There are heavy hitters with very high volume, and there's a long tail where each application doesn't have high volume, but they add up. So we recommend they think about the long tail as one bucket of deployment, and we give them one configuration for it; then we help them optimize each of the heavy hitters. The heavy hitters, as I said, each care about different kinds of latency, and they usually come from a product team; we typically work with the ML infra team and the product team, and they have a certain product budget. Then we work together to figure out, hey, which dot do you want to pick on this curve? When it comes to quality, the prompt length comes into the picture, right? How much instruction tuning you put in the prompt, whether you fine-tune so you can take away the instruction prompt, or whether you put a lot of RAG information in the context to constrain the model and reduce hallucination. All of that product context starts to come into the picture. And then we have different kinds of deployments, optimized for very long context window processing versus heavy generation. Those are also different configurations.

Dmytro Ivchenko: (1:11:28) Yeah, I can give a short, concrete example of where you have to have different deployments for specific use cases. Let's say you really want a very low time to first token, the prompt is long, and your model is small. Smallish meaning, say, 7 billion parameters; you don't have to shard it just to fit it on different GPUs. But it's still not fast enough on a single GPU, because the prompt is so long. So for this specific use case, you want to shard it. But there is a cost to sharding. There are fundamentally three sharding techniques. One is to shard your model weights. Another is to shard the activations, the input. The third is to chop the model into pieces, which is called pipeline parallelism. The last one doesn't really help with latency; it just helps when the model is too big. Sharding the model weights, also called model parallelism, a.k.a. tensor parallelism, helps with model size and does help with latency, although there is a cost: your overall throughput declines. And input sharding usually doesn't help with model size and is pretty cheap, but there's a caveat with attention. You need to be careful with attention, because you attend to other tokens, so your setup becomes a bit more complex. The bottom line is that for the prefill, the most common sharding right now is tensor parallelism. If you choose it, you don't want to always shard; you really want to shard as little as possible, up to the point where you have to, because there is a cost: your throughput will tank the more you shard. You only shard as much as you must because otherwise you cannot meet the latency budget. This is a fundamental thing, and I don't think this problem goes away anytime soon.
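Here is a toy editorial calculation of that "shard only as much as you must" rule. The throughput numbers and the simple per-GPU efficiency penalty are made-up assumptions, purely to illustrate the latency-versus-throughput trade-off, not Fireworks' actual planner.

```python
def min_tp_degree(prompt_tokens: int, ttft_budget_s: float,
                  prefill_tok_per_s_per_gpu: float = 5_000.0,
                  marginal_gpu_gain: float = 0.7):
    """Toy rule: pick the smallest tensor-parallel degree that meets the TTFT budget.

    Each extra GPU is assumed to add only 70% of one GPU's prefill speed, a stand-in
    for sharding overhead, so sharding more than necessary just burns throughput.
    """
    for tp in (1, 2, 4, 8):
        effective_rate = prefill_tok_per_s_per_gpu * (1 + (tp - 1) * marginal_gpu_gain)
        ttft = prompt_tokens / effective_rate
        if ttft <= ttft_budget_s:
            return tp, round(ttft, 2)
    return None  # not reachable under these toy assumptions

# A 32k-token prompt with a 3-second time-to-first-token budget:
print(min_tp_degree(prompt_tokens=32_000, ttft_budget_s=3.0))   # -> (4, 2.06)
```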

Nathan Labenz: (1:13:19) Yeah. Okay. So I think, again, everybody will be conceptually familiar with the idea that there is a latency-throughput trade-off; that seems to be a very recurring pattern. But if I understand you correctly, you're saying that if you have a really long prompt, even if you have a small model that can fit onto one GPU, you may want to split it across GPUs. I'm not quite clear on the nature of the splitting. Are you splitting layer-wise, like you'd have early layers on one and later layers on another?

Dmytro Ivchenko: (1:13:51) No, you split within every layer. It's more complicated, but in a nutshell it goes back to Megatron sharding. You shard the matmuls, one column-wise and the next row-wise, to minimize the amount of communication, because in the end, once you do this, you have to do an all-reduce, and you want to minimize that; Megatron-style sharding helps with that. But fundamentally, yes, you're splitting the weights across GPUs, and then you need to all-reduce the activations and so on.

Nathan Labenz: (1:14:19) Can you give us a little better intuition for that? It's something that is not super clear to me, and I suspect a lot of people are not going to be very clear on it either. I've seen examples of this in my study of Mamba and Mamba-related things recently; I've been trying to get a better handle on it. One thing that's surprising or counterintuitive is that there are ways to take advantage of basically the associative property, where you can use these counterintuitive algorithms, even for just adding up numbers, that look pretty weird but end up being faster, especially because they allow for this kind of parallelism. Am I on the right track with that?

Dmytro Ivchenko: (1:14:58) Yes, although what applies here isn't necessarily that; these techniques are, I would say, model-architecture independent. Data parallelism on the inputs, tensor parallelism, and pipeline parallelism have been around since before GenAI kicked in; they're kind of fundamental to machine learning. But which one you want to apply totally depends on the architecture, because one architecture suits a given technique much better than another. For example, if the model doesn't have a sequential, layered nature, pipeline parallelism is a no-go; it isn't going to save you anything, because you cannot even chop the model into pieces, and there used to be models like that. All transformer models tend to have layers, so it works much better there. But something like the older image models based on ResNet have skip connections, which makes pipeline parallelism harder to do. Then tensor parallelism is conceptually the harder one to understand. Data parallelism on the input is easier, because you just split the data and the data is independent: you duplicate the model, split the inputs, process them, and then join the results. That is easy. Model parallelism and tensor parallelism, which we actually do use right now, are the hardest to understand, because once the sharded weights start interacting with each other, you need to make sure your math is correct. You cannot just shard arbitrarily: if, for example, two values need to be added together, you can shard, but you still have to do the addition somewhere; if they need to be multiplied, you still have to do the multiplication. So you probably need to gather them on one rank or another, or send them across ranks, and then split these operations across the ranks. Of course, sending data across GPUs is way more expensive than reading from a single GPU's registers, orders of magnitude worse, even more. So it becomes quite complicated, but these are the kinds of compromises you have to make. The other thing, basically, is to make sure your math is still correct once you do the sharding. Not every operation can be easily split; you need to look at specific operations, like matrix multiplication, to see how you can split them. If you want to read more about this, the Megatron paper is a good starting point; it has the fundamental descriptions of how you do the math.
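To make the "math must still be correct" point concrete, here is a tiny editorial sketch of the Megatron-style column-then-row split of two consecutive linear layers. It runs in a single process with plain PyTorch; the concatenation-free final sum stands in for the all-reduce a real tensor-parallel implementation would perform across GPUs.

```python
import torch

torch.manual_seed(0)
x = torch.randn(2, 8)           # batch of activations
W1 = torch.randn(8, 16)         # first linear layer
W2 = torch.randn(16, 8)         # second linear layer

# Reference: unsharded computation
ref = (x @ W1) @ W2

# "Rank 0" and "rank 1" shards:
# W1 is split column-wise, so each rank produces half of the hidden features locally.
W1_a, W1_b = W1[:, :8], W1[:, 8:]
# W2 is split row-wise to match, so each rank consumes only its own hidden features.
W2_a, W2_b = W2[:8, :], W2[8:, :]

h_a = x @ W1_a                  # lives on rank 0, no communication needed yet
h_b = x @ W1_b                  # lives on rank 1
partial_a = h_a @ W2_a
partial_b = h_b @ W2_b

# One all-reduce (here, a plain sum) at the end recovers exactly the unsharded result.
out = partial_a + partial_b
assert torch.allclose(out, ref, atol=1e-5)
```

The design point is the one made above: by pairing a column-wise split with a row-wise split, only one reduction is needed per pair of layers, which keeps the expensive GPU-to-GPU communication to a minimum.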

Nathan Labenz: (1:17:12) Yeah. Okay. Cool. I'll check that out; that's a good pointer. Let me just try to summarize the three kinds and make sure I at least have the conceptual framework right. The simplest one is data parallelism: you just make multiple instantiations of the model and split the data across them; that's essentially a load-balancing setup. The next level up, which is maybe the most complicated, is actually splitting the weights. That's the kind of thing you would do because you have a super long prompt, for example, and you want a fast time to first token, so you need even more parallelism in the computation, and it can be worth it. But it comes at the cost of GPU-to-GPU communication, plus the complexity of keeping track of what is happening across all these devices now that the weights are split. And the third one is the pipeline, where you actually split early layers from middle layers from late layers, and you feed multiple things through the pipeline sequentially.

Dmytro Ivchenko: (1:18:16) Yeah. That's good.

Nathan Labenz: (1:18:17) Cool. That's helpful. Lots to learn. I've been doing this around the clock, I feel like, for a few years now, and it's still a target-rich environment for new things to learn.

Dmytro Ivchenko: (1:18:27) Yeah. And you can combine them in many different ways and end up with really complicated setups. So it gets scary.

Nathan Labenz: (1:18:34) Yeah, all independent dimensions as well. Okay, so we've covered a lot of ground. One other question I wanted to ask: with all the complexity of everything we've talked about, the one thing that was surprising to me is that you are also training your own models. You have your own branded Fireworks function-calling models. And especially given your earlier comment that you think at least the small models are converging, how does that fit into the overall strategy? Why bother training your own models, as opposed to just focusing on the insane complexity of the inference stack and letting that convergence happen independently of your efforts?

Lin Qiao: (1:19:15) That's a great question. The north star of the product we're building is not just serving individual models; we want to build the inference stack for knowledge extraction. When we talk about knowledge, we can first look at the knowledge provided by large language models. They are very capable now; they can answer many questions and surprise us. But in the end, all large language models have limited knowledge. Why? Because the knowledge is limited by the training data. The training data is limited in time range, right? It doesn't have the very latest information, and it doesn't have the information people cannot crawl from the internet. So these models are limited to the corpus of public information people can crawl directly. A lot of knowledge also lives outside of large language models, because there are other foundation models: they generate images, videos, and audio, and they also do information extraction from images. So I want to create a framework for thinking about knowledge that goes beyond large language models: even a given large language model is limited, and other modalities carry a lot of information. Beyond that, there's a lot of knowledge hidden behind APIs that models cannot extract on their own, and we use APIs very extensively day to day. For example, we do search, right? There's Google search, there are weather APIs, doc-extraction APIs, map APIs. For personal productivity, there are APIs for docs, for spreadsheets, for calendars, and so on and so forth. APIs are everywhere. Within one enterprise, there can be hundreds of internal API surface areas too: a lot of key-value stores, document stores, production logs, pub/sub systems, internal search. So APIs themselves contain a large corpus of knowledge. Now think about function calling. What is function calling? FireFunction, the model we build ourselves, serves the function calling area, and function calling is the layer that ties together the knowledge behind all these different modalities of foundation models and APIs. So to us, strategically, this is a critical layer for building the best inference tier for knowledge. With that, you can build a lot of fun applications. For example, you can build a personal admin for Nathan, and for everyone on the planet, that sorts out tedious work. That application can learn what you like over time, preemptively finish work before you even ask, put out reminders, or even suggest things ahead of time. I wish I could have something like that. But those kinds of new applications have to depend on tools and ways to extract knowledge across the board, from foundation models and from APIs. That's the fundamental reason why we build our own function calling model. We have released two versions with open weights so everyone can use them; you can download them from Hugging Face. We are currently working on our third version, which is going to come out soon with a great quality bump. So stay tuned.
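For readers who haven't used function calling, here is a rough editorial sketch of what the flow looks like through an OpenAI-compatible chat completions API. The base URL, the model identifier, and the get_weather tool are illustrative assumptions, not details confirmed in the conversation.

```python
import json
from openai import OpenAI

# Fireworks exposes an OpenAI-compatible API; base URL and model name here are assumptions.
client = OpenAI(base_url="https://api.fireworks.ai/inference/v1", api_key="YOUR_API_KEY")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",                      # hypothetical tool, for illustration only
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="accounts/fireworks/models/firefunction-v1",   # illustrative model id
    messages=[{"role": "user", "content": "Should I bring an umbrella in Seattle today?"}],
    tools=tools,
)

# Instead of prose, the model is expected to answer with a structured tool call;
# the application executes it and feeds the result back for the final answer.
call = response.choices[0].message.tool_calls[0]
print(call.function.name, json.loads(call.function.arguments))
```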

Nathan Labenz: (1:22:32) Cool, I'll look forward to that. I know we're just about at the limit of our time together today, and you guys have been very generous with your time and knowledge. Is there anything we haven't managed to touch on yet that you want to make sure you mention?

Lin Qiao: (1:22:44) I think we touched on a lot. Yeah, I think you had one question about SSMs.

Nathan Labenz: (1:22:48) Yeah, that's a hobbyhorse of mine. Is that on your customers' radar? Is it on your radar? Is it on your roadmap?

Lin Qiao: (1:22:54) Actually, we haven't heard too much in the way of requirements from our customers. I think it's a very cool technology. Of course, when we talk about SSMs, it's in the context of Mamba. The main barrier to practical adoption, in my mind, is quality compared with transformer models. We haven't seen a pure Mamba, SSM-based model emerge as highly competitive in quality with Mistral, Claude, or GPT-3.5 or 4 yet. I think AI21 Labs just open-sourced Jamba, which is a hybrid of SSM and transformer models. But benchmark quality is one thing, and real quality in practice is another; real quality on practical use cases is yet to be verified. That's why I think we haven't seen huge demand there, and I guess people are more cautious. It's hard to say what the technical challenges will be; I think Dmytro is pretty confident we can support it in no time. But again, coming back to quality, it's a little hard to assess. Architectures like SSMs are really good for long context, because they remove the quadratic nature of the transformer, but for long context it's hard to assess quality. Needle-in-a-haystack is the most commonly used benchmark, but it's not very comprehensive. I don't think we as an industry have standardized benchmarks for measuring long-context quality yet. That's another area where the whole industry needs to move forward.

Dmytro Ivchenko: (1:24:22) And the public benchmarks still show that there is room to go for the Mamba architecture to match the current transformer-based, attention-based models.

Nathan Labenz: (1:24:34) Yeah, my money's on the hybrids, for what it's worth. There have been some really interesting, very fine-grained studies of the comparative strengths and weaknesses, and there will be, I'm sure, many more developments. Boom. So we are doing a quick bonus recording, 10 minutes or whatever, on some exciting company developments. This goes to show how fast everything is moving in AI; I think it's only been 7 days since we recorded. I noticed that you mentioned the new image generation model from Stability on our first call, but I didn't know at that time that Fireworks had been in partnership talks, and has now announced a partnership with Stability AI to be the exclusive inference provider of Stable Diffusion 3. That's pretty awesome. I'm sure that is a position a lot of inference providers would love to have, and I'm very curious to hear more about how that came to be, what the competition was like, what sort of work you had to do to make it happen, anything you can share about the nature of the partnership. But for starters, congratulations.

Dmytro Ivchenko: (1:25:38) Thank you. Yeah, we've been working on this for a while now. Overall, the partnership makes total sense, because if you look at the company level, Stability has a lot of very good researchers who train very good models, though historically they haven't been as successful at monetizing that as, say, OpenAI. On the other hand, at Fireworks we have very good production-minded engineers who are also very good at model performance, which we talked about at length during our previous conversation. We can bring the models trained by those amazing researchers to production in no time and run them very cheaply. So that's the high-level reason why the partnership makes sense. I can talk a little more about the optimization strategies we employ for the image models and how they are different from, and similar to, the text models. One exciting development is that with the new architecture of Stable Diffusion 3, the Stability folks moved to a transformer-based architecture, and that changes a lot of things. Before, if you looked at the performance profile, it was dominated by convolutions, so there was not much in common with the text models. But now it's all transformer-based, so we were able to apply a lot of the optimizations we learned from running and hosting text models to these image models, and that gave us quite a bit of a boost: for example, applying the most optimal kernels and different sharding strategies. At the same time, the image models are still different. They don't have the token-by-token generation phase; I would say they run in a kind of constant prefill phase, so to speak. And there's no need for big batch sizes; batch size 1 is typically big enough to keep a GPU busy. But there are still a lot of commonalities because of the transformer architecture, so we're very excited about that.

Nathan Labenz: (1:27:43) Yeah, that's cool. I think folks will probably be fairly familiar with multimodal language models that accept images as input. In that context, there's a translation rate from the size of your image to the number of tokens it counts as. And broadly speaking, again, people will know that you tend to break the image down into patches and feed it in that way, and then in most of these multimodal models it gets mapped into the language latent space and used to inform your generation. This is obviously quite a different situation. So can you give us a little more about how it works? Are we generating one image patch at a time? And how should we think about familiar concepts from the transformer, like the vocabulary size or the attention window? How do those concepts map onto this image generation use case?

Dmytro Ivchenko: (1:28:36) Yeah. For image generation, the best analogy is to map the prefill part of text generation onto image generation; those map pretty nicely. The biggest difference is that in text generation the input is variable, right? Whereas in image generation, yes, the prompt varies, but once you run it through the tokenizers and encoders, the input to the transformer layer, which is by far the heaviest part,

Nathan Labenz: (1:29:07) is fixed.

Dmytro Ivchenko: (1:29:08) So that side is fixed, which actually makes optimizing a bit more straightforward compared with text models. The new Stable Diffusion 3 architecture as a whole is quite different, but at the same time, the heaviest part by far is now the transformer layer, and that's where our focus has been. Our learnings from the text models about sharding still apply to this transformer layer; it's basically just a very long sequential model, and all these techniques do apply.

Nathan Labenz: (1:29:44) So what is a token output? I assume it's like a square consisting of a certain number of pixels, but I haven't studied this in depth.

Dmytro Ivchenko: (1:29:54) Yeah. I would say that concept is still the same across the diffusion models; it didn't change with the introduction of the transformer architecture. Basically, in image generation you have to map the text into conditioning inputs. There are two different modes of processing the input: for text-driven image generation, for example, you tokenize the text, run it through encoders, and the result augments the input to image generation. But the input proper is quite different, because what you actually feed in is generated noise; it's a diffusion process. So you literally generate randomized noise, and then you run the diffusion steps one by one; through the reverse diffusion process, you're enhancing this image with each step. That's what happens there. And in the end, you run a decoder that converts the output of the transformer into actual pixels.
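Here is a highly simplified editorial sketch of that loop, just to show where the fixed-size transformer work sits inside the sampling process. Every function is a placeholder stub, not a Fireworks or Stability API, and the shapes and schedule are invented for illustration.

```python
import torch

# Placeholder components (illustrative stubs, not the real SD3 modules).
def encode_text(prompt: str) -> torch.Tensor:
    return torch.randn(1, 77, 512)                  # pretend text embedding

def scheduler_timesteps(num_steps: int):
    return torch.linspace(1.0, 0.0, num_steps)      # pretend noise schedule

def denoise_step(latents, t, cond):
    return latents * 0.98                           # pretend one reverse-diffusion update

def decode_to_pixels(latents):
    return latents.clamp(-1, 1)                     # pretend decoder from latents to pixels

def generate_image(prompt: str, num_steps: int = 50, latent_shape=(1, 16, 64, 64)):
    cond = encode_text(prompt)                      # text -> conditioning, done once
    latents = torch.randn(latent_shape)             # the "input" proper is pure noise
    # Each step is one forward pass over ALL latent patches; the size never changes,
    # unlike autoregressive text decoding that grows token by token.
    for t in scheduler_timesteps(num_steps):
        latents = denoise_step(latents, t, cond)
    return decode_to_pixels(latents)                # final latents -> pixels

image = generate_image("a watercolor fox", num_steps=50)   # turbo-style models use far fewer steps
```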

Nathan Labenz: (1:30:59) Yeah, let me try to state it back to you. You start with noise, then you run multiple forward passes, where each forward pass constitutes one denoising step. And how many denoising steps are there on the way to a full image?

Dmytro Ivchenko: (1:31:16) It depends. For Stable Diffusion 3 you can choose, but by default we run around 50 steps. The turbo model is a somewhat different architecture, and for that you run around 4 steps; the quality is a bit worse, but it's smaller and faster. The turbo model is really oriented around speed, so speed is the number one factor, while for the main Stable Diffusion 3 model, quality is the number one factor.

Nathan Labenz: (1:31:43) Gotcha. And in each forward pass, unlike a language model where you're autoregressing, holding all the current tokens fixed and adding one to the end, here you start with a bunch of patches and all of those patches get denoised. So your input is n patches and your output is n patches, all denoised, and you do that 50 times in a row until you get to the refined final image.

Dmytro Ivchenko: (1:32:11) Yeah, and that size stays constant, right? That's very different from the text model. It's constant.

Nathan Labenz: (1:32:17) Yeah. How many patches? Or are patches and tokens basically equivalent?

Dmytro Ivchenko: (1:32:22) Yeah, you can say that. And there are different settings for the different model sizes, so again, it's not one constant; it changes.

Nathan Labenz: (1:32:30) Okay, cool. So how did this come about? It sounds like something that was maybe driven by the Stability leadership team saying, hey, we're good at one part of this, but we're not exactly set up to do the other part; let's go looking for a partner. Did they come to you directly and say, hey, we think Fireworks is the company for us? Whatever you can share is cool with me. If I'm reading between the lines correctly, it sounds like they identified you, and this was not a tournament you participated in, but more that they knew who they wanted to go with, which is obviously a

Dmytro Ivchenko: (1:33:01) Yeah, there were many factors in play. From the company-choice standpoint, it was our knowledge and leadership in model performance, plus running models in production reliably; those were two of the most important factors. And if you ask people who know Fireworks, most people say number one is the speed. So I think those two factors played the key role.

Nathan Labenz: (1:33:28) Cool. I tried it this morning, and it was very simple to get started. I had just seen the news of the partnership, so I went to Fireworks directly to try to find the model in the playground or whatever, and then I realized, oh, it's actually still presented as part of the Stability API. So I ended up in the Stability docs, got the code there, made a little notebook, and started making calls. Is there anything you can tell us about how it works behind the scenes? It seems like I'm still hitting their domain, right? So there's sort of a middle layer, and then they're routing over to you. Is there anything interesting in the way that part of the technical setup is constructed?

Dmytro Ivchenko: (1:34:06) The main reason for this kind of setup is that Stability is a company that is quite worried about people misusing their models, and they want to sanitize the input. That is, I would say, quite relevant to their work: they have trained models to do that, and routing through them means they keep control of that input sanitization process. So that's one of the reasons we route this way. Plus, for image generation you could say, oh, there is some added latency, and we are actually working on reducing it. But for image generation, quality is typically one of the main things, plus the cost of running the models. So this routing actually works out pretty well in this situation.

Nathan Labenz: (1:34:54) Cool. Yeah, it's interesting. Any more details or interesting aspects of the optimization strategy that you'd like to share?

Dmytro Ivchenko: (1:35:02) Yeah. Basically, to repeat and emphasize what I said: we're very excited about the proliferation of the transformer architecture. The next thing we're working on is low-precision quantization for the image models, because right now we are still running them in FP16 precision. One of the main methods to reduce the cost of running these models will be lower precision, similar to the text models, where it's the most common approach. We're not doing that for image models yet, so that's one area of study. And interestingly, you cannot just directly apply the techniques we learned from text models to the image models, because there are quality implications; you need to do some extra work to make sure the quality is not regressing. So that is quite a bit different. If you think about how images differ: for text, you're generating tokens, and it's funny, as long as you're choosing the right next token, you're good; it doesn't really matter much what the other candidate tokens looked like. For images, you're producing the whole image, and any kind of glitch in the image can be visible. Those are the main differences. At a lower level, if you look at quantization techniques: on the text models, you really want to accommodate the high spikes in activations, which are a quite common problem for quantization. You want to preserve them because they do affect the next-token choice, and again, the token choice is the most important thing. Whereas with the image models, those high activations are not as important; you can clip them, and the image output is not affected as much. It's more that the aggregate of all activations matters, rather than any individual one. So that's the high-level intuition on the quantization techniques.
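Here is a small editorial sketch of the clip-versus-preserve trade-off described above, using naive symmetric int8 activation quantization. The clipping percentile, the injected spikes, and the whole setup are illustrative assumptions, not the calibration scheme Fireworks actually uses.

```python
import torch

def quantize_int8(x: torch.Tensor, clip_quantile=None):
    """Naive symmetric int8 quantization; optionally clip outlier activations first."""
    if clip_quantile is not None:
        bound = x.abs().quantile(clip_quantile).item()
        x = x.clamp(-bound, bound)                 # sacrifice outliers to get a tighter scale
    scale = x.abs().max() / 127.0
    q = (x / scale).round().clamp(-127, 127)
    return q * scale                               # dequantized, for measuring error

acts = torch.randn(4096)
acts[::512] *= 40.0                                # inject a few outlier "spikes"

err_keep = (quantize_int8(acts) - acts).abs().mean()
err_clip = (quantize_int8(acts, clip_quantile=0.99) - acts).abs().mean()

# Clipping shrinks the average error but distorts the spikes themselves.
# For text models those spikes can drive the next-token choice, so you tend to preserve them;
# for image models the aggregate error matters more, so clipping is often acceptable.
print(err_keep.item(), err_clip.item())
```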

Nathan Labenz: (1:36:58) Yeah, that's quite interesting. Is another way to say it basically this? The mental model I have for language models is that they have, obviously, very messy, noisy, kind of overlapping, but nevertheless some sort of internal circuits that a given forward pass slots into, such that you have these, as you said, high activation points in the network that seem to be doing most of the work beyond a certain layer. Whereas here, it's all-to-all the whole way through, because ultimately you need every

Dmytro Ivchenko: (1:37:36) Yeah, every one of them counts in the end. Exactly. Yeah, we'll definitely be announcing more things going forward, so watch out for new announcements. And we'll also be adjusting based on the input we're receiving from our audience. So, yeah, more things to come.

Nathan Labenz: (1:37:51) Guys, I know you gotta go. This has been a great conversation. Lin Qiao and Dmytro Ivchenko, cofounders of Fireworks AI, thank you for being part of the Cognitive Revolution. It is both energizing and enlightening to hear why people listen and learn what they value about the show. So please don't hesitate to reach out via email at tcr@turpentine.co, or you can DM me on the social media platform of your choice.
