Luma Labs' Diffusion Revolution: from Dream Machine to Multimodal Worldsim - Amit Jain, Jiaming Song

In this episode of the Cognitive Revolution podcast, host Nathan Labenz welcomes Amit Jain, CEO, and Jiaming Song, Chief Scientist, of Luma Labs, alongside co-host Steven Parker. The conversation delves into the latest advancements and products from Luma Labs, makers of the Dream Machine, including cutting-edge models and features like camera motion and creative video generation tools. They explore technical aspects like pretraining for diffusion models and the development of concepts to improve AI capabilities. The discussion also covers the philosophical and practical implications of AI interpretability and multimodality, along with a deep dive into the intellectual history and recent innovations in diffusion models.

Upcoming Major AI Events Featuring Nathan Labenz as a Keynote Speaker
https://www.imagineai.live/
https://adapta.org/adapta-summ...
https://itrevolution.com/produ...

SPONSORS:
ElevenLabs: ElevenLabs gives your app a natural voice. Pick from 5,000+ voices in 31 languages, or clone your own, and launch lifelike agents for support, scheduling, learning, and games. Full server and client SDKs, dynamic tools, and monitoring keep you in control. Start free at https://elevenlabs.io/cognitiv...

Oracle Cloud Infrastructure (OCI): Oracle Cloud Infrastructure offers next-generation cloud solutions that cut costs and boost performance. With OCI, you can run AI projects and applications faster and more securely for less. New U.S. customers can save 50% on compute, 70% on storage, and 80% on networking by switching to OCI before May 31, 2024. See if you qualify at https://oracle.com/cognitive

Shopify: Shopify powers millions of businesses worldwide, handling 10% of U.S. e-commerce. With hundreds of templates, AI tools for product descriptions, and seamless marketing campaign creation, it's like having a design studio and marketing team in one. Start your $1/month trial today at https://shopify.com/cognitive

NetSuite: Over 41,000 businesses trust NetSuite by Oracle, the #1 cloud ERP, to future-proof their operations. With a unified platform for accounting, financial management, inventory, and HR, NetSuite provides real-time insights and forecasting to help you make quick, informed decisions. Whether you're earning millions or hundreds of millions, NetSuite empowers you to tackle challenges and seize opportunities. Download the free CFO's guide to AI and machine learning at https://netsuite.com/cognitive


PRODUCED BY:
https://aipodcast.ing

CHAPTERS:
(00:00) About the Episode
(05:21) Introduction and Guest Welcome
(06:01) Exploring Creative Models and Image to Video Workflows
(08:43) Challenges in AI Model Training and Out-of-Distribution Scenarios
(11:03) Advancements in Ray Models and System Improvements (Part 1)
(19:51) Sponsors: ElevenLabs | Oracle Cloud Infrastructure (OCI)
(22:18) Advancements in Ray Models and System Improvements (Part 2)
(24:00) Concepts and Teaching Models New Capabilities
(28:41) Multimodal Intelligence and Storytelling (Part 1)
(31:56) Sponsors: Shopify | NetSuite
(35:21) Multimodal Intelligence and Storytelling (Part 2)
(42:28) Philosophical Questions on AI Understanding and Interpretability
(45:15) Human 3D Perception and Machine Learning
(47:19) Philosophical Perspectives on AI Interpretability
(48:22) Debating AI Interpretability and Concept Representation
(50:11) Empirical Science and Machine Learning Models
(52:37) Training Processes and Model Interpretability
(56:28) Challenges in Dataset Construction
(58:28) History and Evolution of Diffusion Models
(01:06:54) Classifier Guidance and Consistency Models
(01:10:51) Inductive Moment Matching and Future Directions
(01:16:02) Multimodality in AI: Current State and Future Directions
(01:18:49) Outro

SOCIAL LINKS:
Website: https://www.cognitiverevolutio...
Twitter (Podcast): https://x.com/cogrev_podcast
Twitter (Nathan): https://x.com/labenz
LinkedIn: https://linkedin.com/in/nathan...
Youtube: https://youtube.com/@Cognitive...
Apple: https://podcasts.apple.com/de/...
Spotify: https://open.spotify.com/show/...


Full Transcript

Nathan Labenz: (0:00)

Hello, and welcome back to The Cognitive Revolution. Today, I'm speaking with Amit Jain and Jiaming Song, CEO and chief scientist at Luma Labs, makers of the Dream Machine and the new Ray 2 video generation model. I'm also joined for this episode by my friend Steven Parker, creative director at Waymark and one of the few creators who have logged a proper 10,000 hours over the last few years with video and image generation models dating back to the original DALL-E. Our conversation begins with a discussion of how the Luma team trains models to create fantastical and other fundamentally out-of-distribution visuals for which there is little to no relevant training data available. But considering the force of intellect that both Amit and Jiaming display, their belief that video models are on the critical path to AGI, their ambition to create multimodal AGI at Luma Labs, and the range of novel and occasionally hot takes they share, I think this episode should be of interest to anyone, regardless of whether you're particularly interested in video generation models.

Keys to Luma's model development success, as you'll hear Amit explain, include a relentless focus on dataset curation, frontier advances in efficient learning algorithms, and a strong drive to understand what their models are actually learning as they go through the training process. These fundamentals create base models that can learn new concepts, including the Bolt Cam and many other camera motion concepts they've recently introduced in a highly sample efficient way. Meanwhile, for things that existing models can't learn so quickly, we also discuss Luma's outer loop of product development, which consists of building scaffolding and other behind the scenes systems that unlock new model capabilities and also validate customer demand. With that done, they then seek ways to internalize those capabilities in the next generation of the model. And then they repeat this process for each generation as customers continue to apply new and better models to harder and more valuable challenges.

For me, the most interesting part of this conversation was the discussion of model interpretability. Emphasizing that we should not expect AIs to process, represent, or understand information like we humans do, or even to do so in a way that's generally human grokable, Amit likens current interpretability techniques to archaeology in the sense that they're fundamentally limited to piecing together what models have already learned in the past. More interesting from his perspective is the study of training dynamics and engineering of datasets that are needed to teach models what they most need to know.

In the last 15 minutes or so, Jiaming offers an intellectual history of diffusion models. This gets pretty technical; for most people, myself included, it will require some additional study to fully understand. But I would summarize it by saying that the generative AI era really began with the realization that, with the right problem formulation, unsupervised learning can work on web-scale datasets. For text, this was simple next token prediction. And for images, it was gradually adding noise to real images and then training models to remove that noise one step at a time. Since then, there have been a mix of practical tricks, theoretical insights, and model-enabled dataset improvements that have unlocked far more precise steering of outputs and breathtaking efficiency gains. From distillation techniques, which amount to training a model to perform multiple denoising steps in a single pass, to consistency models, which try to ensure that a model will generate the same output regardless of where it begins on its denoising path, to flow matching models, which use theoretical connections to differential equations to take a more direct path through the latent space, to Jiaming's latest inductive moment matching technique, which optimizes the model in distribution space and performs generations in a small number of optimized steps, all of this should at minimum give you a sense of how inference prices have fallen so precipitously even as quality has dramatically improved.
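
To make that noising-and-denoising formulation concrete, here is a minimal sketch of the standard diffusion training step. The tiny ConvNet, noise schedule, and hyperparameters are illustrative stand-ins, not anything Luma actually uses:

```python
# Minimal sketch of the denoising objective described above: gradually add
# noise to a real image, then train a network to predict that noise. The
# toy ConvNet, schedule, and hyperparameters are illustrative stand-ins; a
# real model (U-Net or diffusion transformer) would also condition on t.
import torch
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)            # noise schedule
alphas_bar = torch.cumprod(1.0 - betas, dim=0)   # cumulative signal fraction

model = torch.nn.Sequential(                     # toy stand-in denoiser
    torch.nn.Conv2d(3, 64, 3, padding=1),
    torch.nn.SiLU(),
    torch.nn.Conv2d(64, 3, 3, padding=1),
)
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

def training_step(x0):
    """x0: batch of real images in [-1, 1], shape (B, 3, H, W)."""
    t = torch.randint(0, T, (x0.shape[0],))
    ab = alphas_bar[t].view(-1, 1, 1, 1)
    eps = torch.randn_like(x0)
    x_t = ab.sqrt() * x0 + (1 - ab).sqrt() * eps  # forward process: add noise
    loss = F.mse_loss(model(x_t), eps)            # learn to remove it
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```

Sampling then reverses the process one step at a time; the distillation and consistency methods mentioned above can be read as modifying this objective so that a single forward pass covers many of those denoising steps.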

While we didn't have time to go as deep into the philosophical underpinnings of Luma's multimodal strategy as I might have wished, I left this conversation with a sense that Luma Labs is definitely a company to watch. It won't be easy for any model development startup to compete with the big tech hyperscalers in the scaling laws era, but Luma's mix of product market fit, vision and ambition, and research prowess gives them as good a chance as any I've seen. And I absolutely look forward to having them back again in the future.

As always, if you're finding value in the show, we'd appreciate it if you take a moment to share it with friends or write us a review on Apple Podcasts or Spotify. And of course, with the stakes of AI development continuing to rise, I welcome any feedback you might have for how I can do a better job of elevating the discourse and helping steer the future away from catastrophic risks and toward the dream of AI abundance. You can reach us via our website, cognitiverevolution.ai, or by DMing me on any social platform.

Finally, a quick reminder that I'll be speaking at Imagine AI Live, May in Las Vegas, the Adaptive Summit, August in Sao Paulo, Brazil, and the Enterprise Tech Leadership Summit, September, again in Las Vegas. Tickets are on sale for each of these events now. And if you'll be there, please do reach out and let me know so we can meet up in person.

With that, I hope you enjoy this insightful and thought-provoking conversation about the development and philosophy of frontier video generation models with Amit Jain and Jiaming Song of Luma Labs.

Nathan Labenz: (5:21)

Amit Jain and Jiaming Song, CEO and chief scientist at Luma Labs, makers of the Dream Machine and Ray 2. Welcome to The Cognitive Revolution.

Amit Jain: (5:32)

Thanks for having us. It's very exciting to be here.

Jiaming Song: (5:35)

Yeah. Thanks for having us.

Nathan Labenz: (5:36)

I'm excited for the conversation as well and also excited to have my good friend and longtime teammate, Steven Parker, here as well. Steven occasionally cohosts when we do an episode on creative models, especially in the image and video domain because he is the creative director at Waymark, and he has logged more hours than anyone I know with these kinds of products and really has an excellent handle on the exploding array of options in the market today. So I thought to broadly structure this conversation, we might start with discussing the latest and greatest stuff that you guys have launched in the products, the latest models, the camera motion, all that new cool stuff, and how it fits into the broader picture and what you're seeing in terms of usage. And then I wanna get into some of the more technical stuff because you've also put out a very mathematical paper recently on an advance in pretraining for diffusion models and a really interesting position paper on how you think we should be thinking about pretraining going forward in general. So I'm excited to get into all that as well. And then maybe at the end, if there's time, we can get a little bit more speculative and talk about world models and the future of multimodality and what superintelligence looks like and all that great big picture stuff as well. So we've got a lot of ground to cover, and I'm excited for it.

Amit Jain: (6:48)

Awesome.

Nathan Labenz: (6:50)

Steven, kick us off with maybe a few reflections on your use of Ray 2 recently and what that has you thinking about as we begin today.

Steven Parker: (6:58)

Yeah. Thank you. Thanks for having me. Amit and Jiaming, it's an honor and a pleasure to speak to both of you. So thank you for taking the time. I have been playing a lot with Luma Labs recently, really enjoying it. It's one of many video gen tools that I love to use in my arsenal of possibility when I'm working on various projects. And for my own money, I tend to find tremendous value in several different models being used at the same time, as they all have different strengths, different weaknesses, really just a whole blend of capabilities. And Luma is right up there at the tippy top of especially the cinematic end of models that I like to use. So I just really wanna give a shout to you guys there. And my first question is this: I think more and more of what people want from an image to video scenario starts with AI images. That's just a hypothesis on my part, but is that correct?

Amit Jain: (7:54)

I think a lot of workflows do start on the image side because images are just much easier to iterate through. Iteration cycles are fast and generation times are low. So you can generate a hundred images and find exactly the sort of thing you are thinking about. So, yeah, I think a lot of people actually lean in on image workflows as a significant part of their work today.

Steven Parker: (8:16)

Okay. So that makes sense with my own workflow, and especially what I see out there on social media. And so I think what I'm driving at is a lot of those images to me seem like they can be strange or new. I'm thinking of avocado chair type combinations, weird characters, all of these sorts of things. And what I really wanna know is, what has it been like to push these models toward a greater understanding of what I presume is more novel subject matter?

Amit Jain: (8:42)

Yeah. That's really interesting. And currently, we are designing a feature, and that is one of the more important problems because, see, there's no data, right? Because by necessity, these are out-of-distribution things, stuff that you would just not find in regular use cases. I think the singular answer there is you need to really work on a very strong base model that has strong capabilities of understanding what is happening and being able to then deal with these very unrealistic, out-of-distribution scenarios. So let's say, like, a pickle sitting on an avocado chair, and you start with the pickle standing up and you want it to come and sit down. There's no scenario where you've ever seen that, right? But you've seen chairs, you've seen people, and you've seen anthropomorphic things, and you've seen them doing things like sitting down. Right? So this again comes down to the idea that these models are not memorizing behaviors. They're not memorizing how this thing's done, how this is done, how this is done. They're generally distilling out base ideas and capabilities, the core things. What does it mean to be anthropomorphic? What does it mean to sit? What does it mean to be a chair? Those kinds of things, right? And then when you combine them together, the better your model is at this foundational understanding, the better it's going to do when it's presented with these entirely out-of-distribution, funny, uncharacteristic scenarios.

Steven Parker: (10:06)

If I'm understanding that correctly, improving the base model is just giving you better and better performance with things that are out of distribution. But another thing I seem to notice is that, especially with image to video, the kind of inherent characteristic qualities of that initial image also seem to be carried through better, more effectively with longer and longer gens. Do you think that's aligned with the improvement of the base model as well, or are you guys doing more stuff behind the scenes to check against that image more regularly as the gen is occurring? What's going on behind the scenes there?

Amit Jain: (10:40)

There's a lot that goes on behind the scenes of trying to understand what is in the image. These things are not just models, they are systems. And so we definitely take care of many things in the back to try to break some of these down. But generally, they are crutches for the model to be able to understand. In the next iteration of the model, we make it so that we don't need that system.

So the initial model, which was called Dream Machine, and now has retroactively been renamed Ray 1, had many of these crutches to make it feel and work much better, right? To understand motion, to understand different vocabularies, to translate things into language it understands, all this kind of stuff. Ray 2 doesn't need even 10% of that. But now, because people are pushing it to do more things, we have to build some of these systems again. Like, oh, well, okay, I see. That's what someone is trying to do. But how do we understand this part? For instance, characters. Right? All the character understanding in Ray 2 is very much external to the model. But as we design Ray 3, it's all gonna be internal to the model. So this is a trend we have seen in language models as well: when people push the current model to do things it was not designed to do, labs like us build systems around it to at least address that. We gather the data and then we just bring it into the model. And this also applies to what we call the application layer. Application layers come up, they build these specific use cases, but then we see that and we are able to just build it out in the next model. And the general rule of thumb in this industry, or at least on the technical side of it, is that anything that can be done in the model is gonna be better than what is done external to the model. And we can talk a lot about that, but that's generally the situation.

Steven Parker: (12:26)

That's great. I'm happy to double click there as much as you want. I know we're early in, but I do think this is a super fascinating topic that it kind of skews towards secret sauce, company proprietaries, that sort of thing. And so there isn't frankly a lot of conversation about these kind of systems built behind the models that then you try to reincorporate into the model moving forward. Is there anything else you could say maybe more specifically about beyond data as a loose term, like making sense of these systems and how you teach a model from there to appreciate what was there and about that system that was helping you? I think that would be fascinating to know.

Amit Jain: (13:04)

That's a big conversation. Jiaming, you should also chime in at any time. To go a little bit deeper into inside the model versus outside the model, right, like intra-model versus extra-model. That's a really important thing because whatever you can do in the latent space, all the thinking, whatever you can do about - so for instance, let's talk about video for a second. If you're able to reason about the actions that are being performed, the characters that are there, what would that character do, the lighting of the character, all these kinds of things inside the model, rather than a system that actually generates the right lighting, that tries to piece together the narrative arc, then tries to piece together all these things. Of course, that system can work. That's not the problem. And of course, people try to do that. Like, if you remember, what was the name of the company? It was a script-writing company, a copywriting company: Jasper. Right? And they tried really great ideas, to be honest with you. But the thing is, when you can do it inside the model, inside the latent space, you're just able to work with a lot more information. You're able to do the kind of edits that, outside the model, you just don't have the control mechanisms for the model to be able to do. It's like there's a blood-brain barrier between the generative model and the outside world, right? And when people say, hey, we want controllability, they want to penetrate the blood-brain barrier better, right, and be able to tell the latent space exactly what they want. And it will get there. There's no question about it. But inside, there's a level of intelligence that is just really, really great. Outside, there's a level of intelligence that's really, really great, which is you or I who is using it. And then there's a barrier.

So things where you can actually communicate from outside at a higher level and then let the brain inside do as much as possible internally: collate information, think about the sequencing of events in the video, think about what happens to a character, their arc, their timeline, their causality, all these kinds of things. The more you can do inside the model, the richer the outputs you're gonna get. This is what chain of thought also looks like in language models, right? Instead of forcing the model to structure its thinking into, oh, do this, do this. No, no. Chain of thought is - if you have read the recent Anthropic paper, where they claim chain of thought is actually not faithful and is just a way to seed the latent space of the model, right? To direct it, to let the model direct itself, and whatever it's outputting is not that representative of what it's actually thinking about. That is actually an indication of this.

We have seen that now: instead of trying to force the model to output JSON through external coercion, let's just make the model very good at structured output. And then suddenly you have agents, right? So you're gonna see that in multimodal especially, because in multimodality, if you're able to simultaneously think in audio, video, language, and image altogether, right, and combine reasoning from language, appearance from video, all the aesthetics you have seen in images, and then the audio that you're able to hear in all the videos, these kinds of things. If you're able to combine them, you're gonna have much better outputs than through external systems. But, yes, we do build these systems, then we obviate them in the next model training.

Steven Parker: (16:24)

Jiaming, any particular thoughts there?

Jiaming Song: (16:26)

Yeah. Of course. So I tend to agree with Amit here. Instead of trying to program this into complicated workflows, just imagine how you would communicate with another human. You don't actually need to program their brains for them to understand that you want to achieve this level of workflow. Of course, there might be some back and forth, but the process itself is pretty much natural. And I think this can only be done if you build these capabilities into the model, in the sense that the models need to be more intelligent. Whereas right now, even the current generation of models, especially video models, feel less intelligent in the sense that you have to tell them or program them to achieve some particular type of task, like image to video, for instance. Of course, there are other practical reasons this is done, but I think eventually they will be intelligent enough to basically take on other tasks in the multimodal sense. You tell them, for example, I want to do image to video, keyframes, camera motion, and all these different tasks, and the model should be able to do it without even thinking about it as a different type of task.

Nathan Labenz: (17:40)

So could you guys maybe give a couple of examples, presumably not too sensitive to share, of things that were essentially scaffolding in the Ray 1 generation that are now handled internally by the model? And then, if you're willing and open to it, things that are currently scaffolding on Ray 2 that you hope will be built into the model by the time we get to Ray 3?

Jiaming Song: (18:08)

I guess one example would be camera motion. In Ray 1.6, we had camera motion features, but those were obviously harder to actually control and lower quality than what we recently released. In this iteration, you can see that there are a lot more flexible camera controls, such as Bolt Cams, which, without the user having to prompt for the exact extrinsics of the camera, just work. Of course, you can also give it more explicit controls. That's what some users will want. But for other users, maybe what they really want is just, give me this effect in the scene without me having to specify the very detailed controls. So I think that's one of the features that we have in the model.

I think what would be really interesting to build into the model next is one level beyond being able to represent the scene, like the Bolt Cam. Again, you want slightly more control than just generating a Bolt Cam, but not so much control that you have to specify the exact camera coordinates. Maybe somewhere in between, you can have something like, oh, given this camera trajectory in the scene, I want this Bolt Cam to be moving faster, moving slower, having more angle changes, or focusing on some other subject. So this kind of more interactive editability of the scene is something that we are thinking about having in the model in the next generation.

Nathan Labenz: (19:46)

Hey. We'll continue our interview in a moment after a word from our sponsors.

Let's talk about ElevenLabs, the company behind the AI voices that don't sound like AI voices. For developers building conversational experiences, voice quality makes all the difference. Their massive library includes over 5,000 options across 31 languages, giving you unprecedented creative flexibility. I've been an ElevenLabs customer at Waymark for more than a year now, and we've even used an ElevenLabs-powered clone of my voice to read episode intros when I'm traveling. But to show you how realistic their latest AI voices are, I'll let Mark, an AI voice from ElevenLabs, share the rest.

ElevenLabs is powering human-like voice agents for customer support, scheduling, education, and gaming. With server- and client-side tools, knowledge bases, dynamic agent instantiation and overrides, plus built-in monitoring, it's the complete developer toolkit. Experience what incredibly natural AI voices can do for your applications. Get started for free at elevenlabs.io/cognitive-revolution.

In business, they say you can have better, cheaper, or faster, but you only get to pick two. But what if you could have all three at the same time? That's exactly what Cohere, Thomson Reuters, and Specialized Bikes have since they upgraded to the next generation of the cloud, Oracle Cloud Infrastructure. OCI is the blazing fast platform for your infrastructure, database, application development, and AI needs, where you can run any workload in a high availability, consistently high performance environment, and spend less than you would with other clouds. How is it faster? OCI's block storage gives you more operations per second. Cheaper? OCI costs up to 50% less for compute, 70% less for storage, and 80% less for networking. And better? In test after test, OCI customers report lower latency and higher bandwidth versus other clouds. This is the cloud built for AI and all of your biggest workloads. Right now, with zero commitment, try OCI for free. Head to oracle.com/cognitive. That's oracle.com/cognitive.

Steven Parker: (22:14)

Cool. That was actually my next question. So maybe just to restate a little bit here for our audience who might not be super familiar with Bolt Cams. That brings us to Apple TV Plus's Severance Season 2, right, which made a huge splash this year with its opening shot, a Bolt Cam robo-arm shot. I think it was super impressive. It famously took them months and months to create and was just a big wow moment for audiences. And now, just a little bit later this year, we have it as one of your key camera motion concepts, already available for people like me to use when generating. It's one of a range of motions that you have added to the model's capability. Super useful. And I imagine it's very highly in demand from your editor-type users approaching the model, as well as novices and everybody else, but it's gotta be high on the professional list. What has it been like to develop that feature specifically with an outlook on the professional user, or maybe customer requests coupled with modern trends like Severance Season 2 and the Bolt Cam?

Amit Jain: (23:19)

So basically, there are models. Right? The things you teach them during pretraining. And then there are many capabilities that people want to teach them afterward. And language models have this really special ability, which is called in-context learning. You show it some examples and it becomes that. And I think that's the truly emergent intelligent capability. Visual models aren't there yet. They'll get there, and soon enough, actually, but they're not there just yet. But that doesn't remove the need for teaching these models specific things you want in the moment.

So we have been working on this idea. We published a post about this, a white paper, whatever you wanna call it. We call them concepts. And we realized that in visual, especially creative, use cases, there are many things people wanna teach. Many things that they come up with, like a specific motion or a particular kind of color grading or a particular human pose, right? Something that is just really suited for the story you're trying to tell or the ad you're trying to make, whatever it is. But there's no way to get it out of the model because it's a new thing you just came up with. There's no data for it on the Internet. There is no way to generate large samples of it to even be able to fine-tune the model. So we designed this idea of concepts to make it so that models can learn from one to just a few examples. Their capabilities don't degrade like they do when you're fine-tuning or creating LoRAs, and they can be composed together. The goal is for a user-taught concept to come as close as possible to a capability the model would have had from pretraining.

So we call them concepts. The tool that we're using to build them is called Concept School. Right? You take the model to school, and you're the teacher, and you're gonna teach it concepts or lessons, whatever you wanna call it. So the camera motions you're talking about, the Bolt Cam and the dolly and the reverse dolly, all these kinds of things, we were actually able to teach to the model very quickly, relatively speaking, with very few examples. And you're gonna see the next batch of them come out this week and the next batch come out next week, so on and so forth. And our goal is to basically teach our models everything about filmmaking this way and eventually also give people this ability to teach them. Right now it's a little bit finicky, as new technologies tend to be. So we haven't made it open access yet, but we will in the future.

So these camera things, the way we are building them, to answer your question on feedback and those kinds of things: there are some of these which are basic storytelling tools. Being able to track a shot, being able to move left, being able to move right, basic things. And they're done by people thousands of times a day whenever you're shooting something. It's like, oh, the camera moves left. Duh. Right? Or you're tracking an actor who's doing these kinds of things. So you want that. And then you wanna balance it with some things which are absurd and funny, or which people just can't do in real life very easily. Like, for instance, Bolt Cam. If you wanna do Bolt Cam really well with that smooth tracking, you actually need a robot. Yeah. So MKBHD has one. I'm sure James Cameron has a few, but most of us don't. So can we just have it in the model?

So we try to balance this great utility with some things which are just absurd and fun. But ultimately, we're building this for professionals. We're building this for people who want to tell stories, who are already telling stories, or who want to become professionals in that world. And we can talk about the changes that are happening in the movie making industry very quickly. But yeah, we're designing for people who want to tell stories, and these are all storytelling tools.

Steven Parker: (27:15)

So two things there that this brings to mind for me. One of them, selfishly, is about those repeat actions that the pro user takes. One of the pro actions I take all the time is leveraging your audio generation feature, which is great. For people who don't know, after you generate a video, you can just press the audio button, and it will give you another prompt opportunity and the ability to generate audio for that clip. However, one thing I do all the time is reverse the playback direction of my video when I'm editing. And so just a tiny little plug here. I would love the ability to quick-flip the playback direction before generating that audio so that I'm not dealing with reversed audio when I take that clip out. But I'll just leave that as a footnote. I think what this is really getting at is that you're about storytelling. You have all kinds of users telling all kinds of stories, and that skews towards fantasy, Hollywood cinema, anime. That's everything. Right?

Amit Jain: (28:17)

Yeah.

Steven Parker: (28:18)

How do you imagine that translating to multimodal understanding? Is it just like it's an attempt to understand everything everywhere all at once, or is there particularly unique insight that you feel like you gain in the pursuit of art first that helps with the overall mission toward multimodality?

Amit Jain: (28:38)

See, what we are trying to do is - so the mission of Luma is to build a multimodal intelligence. Right? Multimodal general intelligence. If you were to materialize it in front of you and ask, what does that look like? The intelligence that LLMs embody looks very much like the abstract intelligence that humans have, the language part and things like that. When you think about multimodal intelligence, of course it has that abstract part. But what does that actually look like? It starts to look very much like a world in a globe. You have this world in front of you, and it has all the physical properties and all the physical phenomena that happen day to day in our physical world. It also has those intelligent beings inside it that do things that increase the entropy of the universe, that interact with each other, that do all these kinds of things. So a term a lot of people use is world simulator, right? But I think simulation is a weaker term than it should be here. It's basically just a world model. It's a physical manifestation of the universe that you have outside. Of course, it's a weak facsimile, it's an approximation, but yeah, it's a model of the world, of the processes that happen around us.

Now, coming to the creative side of it and why that is important for this mission: storytelling. Think about how models were treated at first. We were like, oh, this is only good for JSON, and then we're gonna produce these JSONs. Right? That's not very good. When you force intelligent systems to play games, to come up with abstract new things, to do instruction following, like, oh, no, I want this, then I want this. When you ask them to dream, that's when you get intelligent systems versus just procedural systems that are following a set of rules that you have created. When you force them to deviate from that, making movies and telling stories is a very critical part of it. It's a part of human existence too, right? How good someone is at storytelling is generally a very good indicator of IQ, right? Can they actually think beyond just the most physical thing that is in front of them? What is the consequence of A? What is the consequence of that? What is the consequence of that? What does it lead to? That's what stories are, right? Event A happens, then B happens, then C happens.

Video models are on the critical path to that general intelligence. Once we're able to combine video, audio, and language altogether, they will become really, really good at storytelling. Right? Being a partner for us, who can actually sit next to us and help us dream, help us imagine, help us think through those kinds of things. So, creative pursuit. It's not a surprise, by the way, that people thought the first things to be automated would be the mechanical things. It's not a surprise that the first things where AI is able to help are actually creative pursuits. Because that's where we require intelligence. That's where these things come into play. So, yeah, I think art has always had a very significant role to play in general thinking, general intelligence. Art is also going to have a very significant role to play in building artificial general intelligence.

Nathan Labenz: (31:47)

Hey. We'll continue our interview in a moment after a word from our sponsors.

Being an entrepreneur, I can say from personal experience, can be intimidating and at times lonely. There are so many jobs to be done and often nobody to turn to when things go wrong. That's just one of many reasons that founders absolutely must choose their technology platforms carefully. Pick the right one and the technology can play important roles for you. Pick the wrong one and you might find yourself fighting fires alone. In the ecommerce space, of course, there's never been a better platform than Shopify. Shopify is the commerce platform behind millions of businesses around the world and 10% of all ecommerce in the United States. From household names like Mattel and Gymshark to brands just getting started. With hundreds of ready-to-use templates, Shopify helps you build a beautiful online store to match your brand's style, just as if you had your own design studio. With helpful AI tools that write product descriptions, page headlines, and even enhance your product photography, it's like you have your own content team. And with the ability to easily create email and social media campaigns, you can reach your customers wherever they're scrolling or strolling, just as if you had a full marketing department behind you. Best yet, Shopify is your commerce expert with world-class expertise in everything from managing inventory to international shipping to processing returns and beyond. If you're ready to sell, you're ready for Shopify. Turn your big business idea into cha-ching with Shopify on your side. Sign up for your $1 per month trial and start selling today at shopify.com/cognitive. Visit shopify.com/cognitive. Once more, that's shopify.com/cognitive.

It is an interesting time for business. Tariff and trade policies are dynamic, supply chains squeezed, and cash flow tighter than ever. If your business can't adapt in real time, you are in a world of hurt. You need total visibility from global shipments to tariff impacts to real time cash flow, and that's NetSuite by Oracle, your AI powered business management suite trusted by over 42,000 businesses. NetSuite is the number one cloud ERP for many reasons. It brings accounting, financial management, inventory, and HR all together into one suite. That gives you one source of truth, giving you visibility and the control you need to make quick decisions. And with real time forecasting, you're peering into the future with actionable data. Plus with AI embedded throughout, you can automate a lot of those everyday tasks, letting your teams stay strategic. NetSuite helps you know what's stuck, what it's costing you, and how to pivot fast. Because in the AI era, there is nothing more important than speed of execution. It's one system, giving you full control and the ability to tame the chaos. That is NetSuite by Oracle. If your revenues are at least in the 7 figures, download the free ebook, Navigating Global Trade, 3 Insights for Leaders at netsuite.com/cognitive. That's netsuite.com/cognitive.

Nathan Labenz: (35:13)

Can I ask a couple of world model questions? Yeah. I think this is super interesting. One big thing that jumps out at me, though, is if I were trying to create a generally useful AGI to go out and do stuff in the world, right, to maybe control robots and walk into my house and make me a coffee, to take a famous example, and to vacuum up my living room after my kids have thrown toys all over it, whatever the case may be, I would want this sort of world model. And I'm certainly a big - well, I wouldn't say believer is the right term at this point, because I think it's pretty well demonstrated that the large foundation models are learning these higher-order concepts, and that has, I think, become pretty much indisputable at this point. So, I'm well convinced by the evidence in the literature that this is happening.

But I wonder about the sort of fictional side of it. The magical side, though, like, there are no dragons in the real world, but your models have learned to also represent dragons. So in a way, you have a real-world model plus something that sort of goes beyond what is real and into the imaginative, the fictional, etcetera. And I wonder, obviously, that's good for storytelling, because we wanna tell stories that are not bounded by reality. But does that have downsides for the sort of practical utility of making a multimodal intelligence that I just want to come into the world and do useful work for me?

Jiaming Song: (36:50)

Yeah. So I think there is a short-term answer and there's the answer for the long term. I guess the answer for the short term is that if you want to use this type of model for your daily activity tasks right now, then yes, it might be better to angle towards the more physically applicable vertical, so to speak, for this type of model. However, there are many cases where you ask the model to do some task. So for example, say, pick up this piece of clothing with a dragon on it and put it in the washing machine. And if the model doesn't really know about the concept of a dragon, it can't actually pick up the dragon-patterned clothes and put them into the washing machine. So even this kind of knowledge that is fictional can still be very useful for real-world physical tasks. Because, again, even the concept of a dragon by itself is related to other physical concepts that people see in the world. For example, it looks like a lizard. It has scales like those exhibited in reptiles. It has wings, which you see in bats and birds. So even these concepts, if you think about it, are humans' interpretations of what they see in the real world. So it's like the dragon is generated by humans looking at the real world as part of their training data, and they generate something that is out of distribution.

But back to the original question, yes. For the short term, yes, I do think maybe focusing on more physical, realistic things can be better for these kinds of tasks, so the model doesn't hallucinate as much. But in the long term, I believe that the model should be able to tell apart what kind of world it is acting in. Because there are also other types of worlds that it needs to be able to act in, not just the physical world. For example, a virtual world: the things people are currently doing with agents are a perfect example where the model needs world knowledge, but it is a world that is different from the physical world that we are acting in. So having a unified AI that can do both kinds of tasks could be useful. For example, eventually your robot could be like, okay, put this dragon t-shirt into my washing machine, and then go to the computer and type an email responding to my friend or something. This requires the AI to be able to reason across physical and nonphysical, imaginative, or virtual worlds, just like humans can do. So I think eventually this will be a capability that does not have to be specialized into any one sort of model.

Nathan Labenz: (39:35)

I don't know if you're doing interpretability work on your models internally or if you have any partnerships, academic or otherwise, that allow you to do that kind of stuff. Maybe you know the answer because you have done this work, or maybe you could speculate as to whether you would expect to find features in the model that would represent: this is a realistic physics simulation, versus this is a magical realism type of scenario, versus this is a Minecraft environment that we're in right now, or we're playing Pokemon, or whatever the case may be. Like, it seems like that probably would happen at sufficient scale, but I don't know if we are there yet or if anybody's really had the opportunity to look.

Jiaming Song: (40:18)

Up front, I'm not an expert on the topic of deep neural network interpretability, and I haven't delved too much into this space. My very shallow understanding is that it might be good to seek more alignment, to try to get the models aligned with what you want, and try to reason from there. So for example, one feature that we discussed heavily in the first version of Dream Machine is this implicit knowledge about 3D. For example, you can ask it to generate things that reason about depth. And even in fantastical, unrealistic art, it has a reasonable interpretation of depth. And also other types of physical features, such as clothing simulation, waves, air, this kind of stuff.

So the question you're asking is, is there a kind of neuron or substructure I can find within the model that shows that? And I guess the honest answer would be that it's very difficult to find that within the model, but it might be easier if we prompt the model and ask it to do the task for us. So one example: people have shown that these models are very good priors for 3D vision tasks like depth estimation. In fact, a lot of the state-of-the-art depth estimation methods are actually based on these pretrained models, which really do have a very good understanding of the world. So I do believe there is some degree of interpretability, or that the model has great internal knowledge about the world, but it's also up to humans how that interpretability is accessed. Currently, it seems like the API between the human neurons and the model neurons is not that well welded just yet. So probably we'll still have to use the language that both sides understand to communicate this interpretability thing.

Nathan Labenz: (42:16)

Yeah. There are definitely some fraught challenges there. Steven, go ahead.

Steven Parker: (42:19)

This is an interesting philosophical question that Nathan and I go back and forth on all the time. I mean, we tend to hear from people a lot that these models have physics knowledge inherent from the training, that they have a great, rich 3D understanding of the world, that we see it in this and that.

I personally am not entirely convinced that they aren't just seeing more and more kinds of movement across a 2D space and developing greater and greater pixel understanding. I could understand somebody arguing against that from maybe a systems-based training regime where we're training first on 3D scaffolding or something like that. But my own naive understanding is that sort of thing isn't happening in the training. I just think it's a really interesting question. Like, are we actually seeing a physical appreciation for a 3D space, or are we just seeing more of an expert interpretation of a 2D pixel space?

Jiaming Song: (43:11)

So I would compare this ability to predict or reason about physical scenes in a 2D pixel representation to a Turing test. In the Turing test, your goal is not to tell whether the model knows grammar or not. Whether it internally represents the scene with the grammar that humans define is not really part of the question that we are asking. What we are asking is whether talking to the model feels the same as talking to a human. In this case, it's similar. So on one hand, being able to render or do next-scene prediction, even perfectly, over the next few seconds or minutes doesn't actually mean that what's inside is the same physical knowledge that we humans currently define. But on the other hand, from a utility standpoint, it might constitute passing the Turing test for visual generation.

Amit Jain: (44:16)

Yeah. And I've had a lot of time to think about this problem. And the question is basically philosophical, to echo what Jiaming is saying: it borders on the definition of understanding, right? What does it mean to understand? And people make the argument that humans have something very special or deep where we understand at some level. But it's very hard to actually argue about that as well, because how would you know? Right? You can say, oh, yeah, but I know this is 3D. That's why I understand it. Like, it clicks in my brain that this is 3D and not 2D. But the brain is this really interesting organ that is simultaneously thinking and telling you, or itself, what it is thinking. And that's so weird, right? Think about it. It's self-aware. The self-awareness part is the most interesting part of it.

So coming back to, let's say, 3D understanding for just a second. What does it mean for humans to have 3D understanding? Actually, I'll take a slightly different stance than most computer vision researchers, who hold that humans actually perceive 3D and things like that. We don't really perceive 3D. Yes, we have two eyes. We have stereo vision. But stereo dies out at about 20 centimeters. Outside of that, the disparity is almost nothing, right? And then, case in point, if you hurt one eye and you have an eye patch, you can still drive. You have some initial trouble trying to grab a piece of glass or something close up, but you can really drive. Of course, you have hands and your proprioception, so the brain gets other signals, right? That teaches it depth and some of these things. But it's hard to really argue that the brain actually maintains some sort of 3D representation, right? It might not. We just don't know. We have no way of understanding it.

So for a generative model, why is it any more special for it to have an explicit 3D representation inside it, like a mesh or what have you, than to understand these concepts more implicitly, in terms of 2D space and time? And to Jiaming's point, as long as it's able to generate something that looks really consistent, why do you care how it actually did it? Like, planes fly, but they don't flap their wings. The phenomenon that birds use and the phenomenon that planes use is actually very similar, right? Birds generate lift by moving their wings. Planes do it by pushing themselves through the air. So while the phenomenon is the same, the mechanisms are very different, right? Humanoids right now, like we are trying to build them, they move, but their movements seem very different from how we do it, right? It's going to happen more and more. Like all machines that do things humans do, like dishwashing. We wash dishes very differently than a dishwasher does. But are the dishes any less washed because the dishwasher didn't have hands? I don't think so.

So philosophically, coming back to it, I don't think machines have any obligation to think exactly how we do, or to have representations inside them like ours. It doesn't make them any less intelligent. It doesn't make them any less capable. In fact, they should do it differently, because the substrate they're on is very different. The human brain is a 20-watt piece of organic tissue. On the other hand, here we are running things on gigawatt-scale clusters. Why should they think the same way? So that's my answer to that problem, actually. I think people who are focused on making machines think exactly as humans do, making them exactly as interpretable as humans, are misguided. They are not only reducing the capabilities of machines in that process, but also wasting their own time. They should use that time to scale attention better. They should use that time to design regimes that can do more efficient learning. All these kinds of things. Right? Yeah. I think it's a waste of time.

Nathan Labenz: (48:14)

Alright. I agree with the first 85% of that, but the last 15%, I wanna challenge.

Amit Jain: (48:20)

Go for it.

Nathan Labenz: (48:21)

When it comes to interpretability, in terms of why would you care? And the 85% does include the idea that I don't expect the AIs to be representing things in the same way that we are or thinking broadly in the same way that we are, and I don't think that should be used to discount them. I have a funny series of tweets where it's like, it's only a concept if it comes from the concept region of the human brain. Otherwise, it's just sparkling notions or whatever. So I'm with you in terms of viewing that as a straw man.

But I do think when you look at something like Golden Gate Claude, for example, where they were able to say, okay. We were able to go in, isolate what very strongly appears to be the Golden Gate Bridge concept. And now when we artificially turn that up, we get Golden Gate Claude. And then, yes, that's just a curiosity and in-a-tube demo, but I actually thought when you were talking about the concepts feature that you were building that maybe you were doing it that way where I could imagine that you might say, if you had taken an - and then I'd be interested to hear how you are doing it if it's - because it seems like it's not this. But what I had imagined you might be doing there is taking a similar approach, trying to identify directions in the latent space, and then injecting them or turning them up in order to enable these different concepts. And it's like you probably could do that and be like, turn on anime or turn on Toy Story style or whatever. Steven has a much better vocabulary for the different styles than I do. But I assumed that your concept work would be similar to Golden Gate Claude. And I guess the question there would be like, doesn't that seem like it would be a very useful thing to study and potentially be able to marshal? And if you're not doing that, maybe can you tell us how you are doing it?
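
For context on the technique Nathan is referencing, the rough Golden Gate Claude recipe is: find a direction in activation space associated with a concept, then add a scaled copy of it to a hidden layer during generation. The sketch below is a generic illustration only. GPT-2 is just a small stand-in model, Anthropic used sparse autoencoder features rather than the crude contrastive mean shown here, the layer index and scale are arbitrary, and, as Amit explains next, this is not how Luma's concepts feature works:

```python
# Generic sketch of activation steering ("Golden Gate Claude" style).
# GPT-2, the layer index, and the scale are illustrative placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
LAYER = 6

@torch.no_grad()
def layer_acts(prompt):
    """Hidden states at LAYER for one prompt, shape (seq_len, hidden)."""
    out = model(**tok(prompt, return_tensors="pt"), output_hidden_states=True)
    return out.hidden_states[LAYER][0]

def concept_direction(prompts_with, prompts_without):
    """Mean difference of activations between prompts that do and don't
    exhibit the concept (a crude stand-in for SAE feature finding)."""
    mean = lambda ps: torch.stack([layer_acts(p).mean(0) for p in ps]).mean(0)
    d = mean(prompts_with) - mean(prompts_without)
    return d / d.norm()

def turn_up(direction, scale=8.0):
    """Register a hook that adds the concept to every hidden state."""
    def hook(module, inputs, output):
        hidden, *rest = output          # GPT-2 blocks return a tuple
        return (hidden + scale * direction, *rest)
    return model.transformer.h[LAYER].register_forward_hook(hook)

handle = turn_up(concept_direction(
    ["a photo of the Golden Gate Bridge", "driving across the Golden Gate"],
    ["a photo of a city street", "driving across town"]))
# model.generate(...) now skews toward the concept; handle.remove() undoes it.
```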

Amit Jain: (50:03)

So there's a difference between treating something as a black box and being able to do empirical experiments on it. But we are in philosophical land again. Think about physics, right? A lot of people think our knowledge of the world is completely interpretable, that because we have an equation, we understand how the system works. That's not how any of the laws of physics actually are, right? Science is not the act of interviewing God, if that existed, right? I'm not saying one way or the other. Science is basically: we have an observation, can we derive a pattern out of it? That's about it. Science is empirical. Theoretical physics, again, is very much about building a mathematical model, an approximation, a representation. Machine learning is very much the same. The entire universe is very statistical, right? The best theory that we have of how the universe functions right now, quantum mechanics, is entirely uninterpretable. Today, we don't understand why things are the way they are at all. The measurements get verified up to like 6 sigma, 8 sigma, sometimes 10 sigma, as accurate as it gets. There are some measurements that are up to 23 sigma, right? But we don't understand why the world is that way. And that's what you're asking for when you ask for interpretation, right?

I'm saying ML models, because of their sheer scale, are just not grokable by the human brain. You're not going to come up with a coherent, consistent model of how the ML model functions, because it's just not that kind of a system. We can do empirical experiments on it after the fact, like the Golden Gate experiment that you're talking about. Like, oh, we found a cluster of this. We found this. It's like archaeology. We can do archaeology, all right? But here's something really powerful that we can do here that we can't do in the physical world. In the physical world, we can only do archaeology. Here, we are actually involved in the formation of the planet and all the eons it went through. That's called training. So we know what the model is going to do.

So instead of people spending their time on archaeology, I think a much better use of time is understanding data and training processes. You actually learn so much during training. At the risk, Jiaming, of giving something away here: when we train our models, we are actually designing these models with a great degree of efficiency. And in that process, we do so much information-theoretic work on our models to try to understand what sort of information flow architecture is in there. Like, what is happening to high-frequency details? What is happening to low-frequency details? What is it learning in the earlier stages, the later stages? When should we actually change the curriculum of what it is learning? This is all basically the interpretability people talk about, except this actually has real consequences. This changes what the model learns.
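
As a toy illustration of the kind of frequency-content probe Amit gestures at, one simple option is to track the radially averaged power spectrum of generated samples across training checkpoints. This sketch is our own generic illustration, not Luma's tooling:

```python
# Illustrative probe for "what is happening to high vs. low frequency
# details" during training: the radially averaged power spectrum of a
# batch of generated samples. Generic illustration, not Luma's tooling.
import torch

def radial_power_spectrum(images, n_bins=32):
    """images: (B, C, H, W) in [-1, 1]. Returns mean spectral power per
    radial bin, from low (bin 0) to high spatial frequencies."""
    spec = torch.fft.fftshift(torch.fft.fft2(images), dim=(-2, -1))
    power = spec.abs().pow(2).mean(dim=(0, 1))                  # (H, W)
    H, W = power.shape
    yy, xx = torch.meshgrid(torch.arange(H) - H // 2,
                            torch.arange(W) - W // 2, indexing="ij")
    radius = (yy.float() ** 2 + xx.float() ** 2).sqrt()
    bin_idx = (radius / radius.max() * (n_bins - 1)).long().flatten()
    sums = torch.zeros(n_bins).index_add_(0, bin_idx, power.flatten())
    counts = torch.zeros(n_bins).index_add_(
        0, bin_idx, torch.ones_like(bin_idx, dtype=torch.float))
    return sums / counts.clamp(min=1)

# Logged every few thousand steps, a curve like this shows whether the low
# frequencies (global layout) have converged while the high frequencies
# (fine detail) are still improving, or vice versa.
```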

And some of the things we have learned are really interesting, like the fact that the information-theoretic pathways are set in roughly the first 20,000 to 30,000 iterations. The model starts out as this uniform scalar field, really. A good example: think about a field of maize or corn. It's just plants everywhere, all looking uniform. As you train, pathways keep forming in between them, like someone walking through and pressing the plants down, right? Crop circles. When you look from the top, you're like, oh, there's a pattern, there's a circle, there are things like this over there. These form very early. And the distribution of data you have in that window, and the kind of learning-rate regimes you have used, decide a lot of what is going to happen in the later stages. In the later stages, a lot of different things happen, right? These pathways get emphasized or de-emphasized. Someone goes and undoes the crop circles - oh, we're not going to take that path as much.

This is the kind of interpretability that allows you to design models that give you the output you want. The post-hoc interpretability work, I think, is pretty interesting - don't get me wrong. Intellectually, it's empirical science, right? It's very good. And if you have an alien model in front of you, or something someone else made, it's very useful for understanding what's going on. But if you want to actually control the models, you want to control the data and the training process - and I mean the actual hyperparameters of the training process. There's so much that goes into that. So that's where actual interpretability comes into play, in my view. And that's where we do a lot of work.

Nathan Labenz: (54:37)

Yeah. That's super interesting. You could expand on any number of dimensions of that. One that I've been thinking about quite a bit recently is what you might call batch strategy. It sounds like you are getting pretty intentional, in those early phases of training, about feeding the model the right mix of data batch by batch. I've always looked at these loss curves and seen these occasional spikes and wondered: what's happening there? One interpretation is that it's a bad batch - some cluster of data the model wasn't prepared for. But then, if my interpretation is right, that batch is probably also sending a bad signal to the model, one that's not actually constructive for general learning purposes. So are you actually doing this sort of batch-by-batch construction of data to get the mix right at that very granular level, or am I over-interpreting what you're saying?

Amit Jain: (55:43)

I think Jiaming can answer this a bit better, but I can clarify my own statement. I don't mean handpicking to that degree. We come at it from the other direction, which is a lot of curation and filtering and removing garbage examples. The technique for training really great models is not to throw a billion samples of garbage data at them and hope they figure it out from the garbage, right? The technique is to show the model what you want it to learn - good examples, at large scale. So yeah, you do a lot of this preprocessing and filtering, that kind of stuff. But, Jiaming, I don't know if you have a different take on that.

Jiaming Song: (56:20)

Yeah. I think the interesting question here is basically: what is the right dataset? Interestingly enough, the statistical methods we use to train these machine learning algorithms are actually quite different from what the real world is. When you run these algorithms, whether on language models or diffusion models, you are mostly making IID assumptions: you have data points, and you assume they were drawn independently and identically from the world distribution. But the real question is whether there even is a world distribution - how you would even define the world distribution is a real question.

So regarding the question of what the best dataset is, it's actually very hard to have a right answer. I think people instead take a second approach that is less statistically driven and more objective driven: how do I control the dataset such that I get better quality or a better outcome? That has been deeply studied in the realm of, say, language modeling. How much English do you want in your corpus? To be honest, it is very hard to reason about this at a theoretical level - you wouldn't want to decide how much English to put into the dataset based purely on the number of people who speak English. That's probably not the right way of doing it. Instead, a more empirically focused approach is probably the right way of finding it. But the exact solution is probably very messy, and there aren't really a lot of concise key principles behind it, despite how easy the general idea of machine learning is to explain.

Nathan Labenz: (58:20)

Maybe, Jiaming, would it be too much to ask for you to give us a short masterclass in the intellectual history of diffusion models? Because I think you're probably the best person in the world to do that.

Jiaming Song: (58:33)

Yeah. Of course.

Nathan Labenz: (58:33)

So I sketched it out. The mantra I always say to myself about the original diffusion models is that the big unlock was: there are all these images out there, and it's really easy to programmatically add noise to them, step by step, until you get to pure noise. The insight is that if you train the model to do the reverse - learn a denoising step - then by running a bunch of those steps in a row, you can start from pure noise and eventually get to an image. The original versions of that were totally undirected and just generated an image out of nowhere. Then we got classifier guidance, and that was enough to steer us toward particular images. And from there, I want you to take over and tell us the major updates in the field that have brought us to where we are in terms of efficiency and control.

Jiaming Song: (59:18)

Yeah. Of course. So the first known method for diffusion models was actually reported in a 2015 ICML paper from Sohl-Dickstein and collaborators. It described the actual algorithm: how to formulate and train the forward and reverse processes of a diffusion model. However, it never really caught on, because back then the experiments were done on very small datasets like MNIST, and it had the same problems that plagued diffusion models early on. To researchers in that field it seemed like: I have GANs that work in one step and work much better at large scale - why am I bothering with this method that seems very hard to grasp and very slow to generate samples?

So the first real breakthrough in this field came in 2020, with the paper called Denoising Diffusion Probabilistic Models by Jonathan Ho et al. It validated the idea and made it as performant as GANs on certain tasks. It was still very inefficient, in the sense that you needed many steps to converge to the right image. But it was one of the first algorithms that did not have the unstable training problem that GANs have. For practitioners like us, a training algorithm being stable is very important. There's a huge difference between launching a job, being able to sleep at night, and waiting a few hours for the result to get better, versus the GAN case, where things just become unstable out of nowhere and you have to do a lot of digging. So the training algorithm being stable is a very big deal, and it's actually why people got interested in diffusion models even though they were very slow at the time.
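
To make the forward/reverse idea concrete, here is a minimal sketch of the DDPM training objective in its common epsilon-prediction form. Everything here is illustrative: `model(x_t, t)` is a hypothetical noise-prediction network and `alphas_cumprod` is a precomputed noise schedule tensor; this is the generic recipe from that family of papers, not Luma's code.

```python
import torch
import torch.nn.functional as F

def ddpm_loss(model, x0, alphas_cumprod):
    """One DDPM training step: noise a clean batch at random timesteps,
    then train the network to recover the noise that was added."""
    b = x0.shape[0]
    t = torch.randint(0, len(alphas_cumprod), (b,), device=x0.device)
    a_bar = alphas_cumprod[t].view(b, 1, 1, 1)           # cumulative signal level
    eps = torch.randn_like(x0)                           # the noise we will hide
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps # forward (noising) process
    return F.mse_loss(model(x_t, t), eps)                # epsilon-prediction loss
```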

I was one of the people working on these generative directions - I had worked on other types of generative models, including GANs, before. To me, diffusion models offered a very fresh perspective, because before that paper there was really no method that trained stably and generated high-quality samples. So I started thinking: now we have this model that generates high-quality samples - back then, on CIFAR - and is very stable to train. What is its problem? The problem is that it generates very slowly. It takes something like 1,000 neural network evaluations to get to a high-quality sample. So I focused on that particular angle of the problem, and that led to my work on denoising diffusion implicit models, DDIM, which accelerated sampling by something like 20 to 50x back in the day.
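
The speedup he describes comes from DDIM's deterministic update, which lets the sampler jump across timesteps. A sketch of one such update, under the same hypothetical `model` and schedule as above; in a sampling loop you would wrap calls like this in `torch.no_grad()`:

```python
def ddim_step(model, x_t, t, t_prev, alphas_cumprod):
    """One deterministic DDIM update: infer the clean image implied by the
    predicted noise, then jump directly to the (possibly much earlier) t_prev."""
    a_t, a_prev = alphas_cumprod[t], alphas_cumprod[t_prev]
    eps = model(x_t, t)                                       # predicted noise
    x0_pred = (x_t - (1.0 - a_t).sqrt() * eps) / a_t.sqrt()   # implied clean image
    return a_prev.sqrt() * x0_pred + (1.0 - a_prev).sqrt() * eps

```

Because the update is deterministic, you can walk a coarse subsequence of timesteps, say 50 instead of 1,000, which is roughly where the 20 to 50x figure comes from.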

Of course, simultaneously, Yang Song, who was also from our lab at Stanford, was working on a more generalized idea of diffusion models. His idea was: instead of having a fixed, discrete number of time steps, we can make the time step continuous and connect it to math from the 1980s - Anderson's work on reverse-time stochastic differential equations. So it was all in this stochastic differential equation framework and score matching. Yang was a very early advocate of score matching and denoising score matching, so it fit naturally into his framework, and he found this very interesting connection between diffusion models and denoising score matching. So those were the first two breakthroughs: one on the theoretical level, the other on the practical level.

And then, as you described, OpenAI did the work of training these models on ImageNet, which was again a big breakthrough at the time. It showed that diffusion was competitive with GANs in these cases, using the classifier guidance that people still use today. Then people started to use classifier-free guidance, once they realized we don't actually need to train an additional classifier: you can add an unconditional signal to the model itself so that it can play the role of the classifier. That is also from Jonathan Ho, with Tim Salimans.
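
The classifier-free trick he mentions is usually implemented as a two-pass extrapolation at sampling time. A hedged sketch, where the `cond=None` signature is an assumption about how a hypothetical conditional network exposes its null condition:

```python
def cfg_epsilon(model, x_t, t, cond, w=7.5):
    """Classifier-free guidance: run the same network with and without the
    condition, then extrapolate past the conditional prediction (w > 1)."""
    eps_uncond = model(x_t, t, cond=None)   # null condition, e.g. empty text
    eps_cond = model(x_t, t, cond=cond)     # conditional prediction
    return eps_uncond + w * (eps_cond - eps_uncond)
```

During training, the condition is randomly dropped some fraction of the time, so the one network learns both the conditional and unconditional predictions.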

After that, there were a few papers around 2022 that tried to scale this up beyond ImageNet. In early 2022, there was the GLIDE paper from OpenAI, which basically trained these models for text-to-image generation. A few months afterwards, the Imagen paper from Google came out, and a few months after that, Stable Diffusion got released. So basically, that is how we got to Stable Diffusion. And after Stable Diffusion, there has been an explosion of these techniques and, similarly, of open-source models.

The next things people cared about were, one, how to get higher-quality models; two, how to make them work on videos; and three, how to make them more efficient. I will talk more about the efficiency side of things. There has been a lot of effort on distillation, and there still is. Previously, the way people did distillation was kind of hacky. The idea is: I have this many-step sampling process. I train a student model so that one of its steps matches two consecutive steps of the teacher, which halves the number of steps. Then I train another model to match two steps of that one, so overall I get a 4x acceleration, and I repeat this again. This is called progressive distillation. It also came from Jonathan Ho. But it is pretty tricky to train, because the implementation gets a bit hairy.
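
Schematically, one round of that objective looks something like the sketch below, reusing the hypothetical `ddim_step` from earlier and treating `t` as a plain integer index with `t >= 2`. The actual recipe from Salimans and Ho parameterizes the target differently, so take this only as the shape of the idea:

```python
import torch
import torch.nn.functional as F

def progressive_distill_loss(student, teacher, x_t, t, alphas_cumprod):
    """Progressive distillation: one bigger student jump is trained to land
    where two consecutive teacher updates would land."""
    with torch.no_grad():
        x_mid = ddim_step(teacher, x_t, t, t - 1, alphas_cumprod)       # teacher step 1
        target = ddim_step(teacher, x_mid, t - 1, t - 2, alphas_cumprod)  # teacher step 2
    pred = ddim_step(student, x_t, t, t - 2, alphas_cumprod)            # one student step
    return F.mse_loss(pred, target)
```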

So in early 2023, there was this other paper called Consistency Models, which at the time aimed to be a replacement for diffusion models, in the sense that it is both efficient and can be trained in a single stage. The paper also described a method called consistency distillation, a distillation-based variant: you use an existing diffusion model as the base and distill it with consistency model ideas. What actually became more popular in the field was consistency distillation, because when people tried training consistency models from scratch on other use cases, it turned out not to be as easy to train stably as it seemed. So people came up with new methods to stabilize the training process for consistency models, which are usually still based on initializing the model from a diffusion model, so to speak.

So that is one family of distillation techniques, and there is another set of techniques based more on GANs, so to speak. The difference is that the consistency distillation techniques are in some sense more stable to train, while the GAN-based methods are less stable to train but may give higher quality if you run them at fewer steps. So again, there is this interesting trade-off between how stable the model is to train and how easy it is to get high-quality samples - back to the original story of GANs versus diffusion models. That, roughly, is the current status quo on diffusion models: the history, how people are scaling them up, and the key problems in making them even faster via this path of distillation.

Nathan Labenz: (1:06:48)

Could we do just a little bit more on classifier-free guidance, and maybe also on the intuition behind consistency models? When we're doing guidance, what exactly are we doing to guide the model? I get it in the classifier case - or at least I have an intuition: feed the sample to the classifier and take its feedback. But the classifier-free case is a little less intuitive. And the consistency model, I think, is also a little less intuitive than simple distillation.

Jiaming Song: (1:07:21)

Yeah. Sure. Before we talk about classifier-free guidance, we can talk a little more about classifier-based guidance. The idea is that a diffusion model tries to represent a score - the gradient of the log probability of a distribution, p(x). You want to guide it with some conditioning signal, call it y. So instead of sampling from p(x), you want to sample from p(x given y). You could train on that directly, of course, but another way to treat p(x given y) is to apply Bayes' rule: p(x given y) equals the joint p(x, y) divided by p(y). You can think of p(x) as the score of the diffusion model without any condition, and the p(x, y) part as the score of the model with the condition. And because p(y), the condition part, does not depend on the x you're sampling, its gradient drops out. So basically, that is why in classifier-free guidance we have this unconditional model, which plays the role of the denominator, and the conditional model, which plays the role of the numerator. All in all, it's an application of Bayes' rule.
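
In symbols, the standard way this argument is written (a textbook formulation rather than a quote from the speakers) is:

```latex
% Classifier guidance: Bayes' rule in score space. The p(y) term
% vanishes because \nabla_x \log p(y) = 0.
\nabla_x \log p(x \mid y) = \nabla_x \log p(x) + \nabla_x \log p(y \mid x)

% Classifier-free guidance replaces the explicit classifier gradient with
% the gap between a conditional and an unconditional model, scaled by w:
\tilde{\epsilon}_\theta(x_t, y) = \epsilon_\theta(x_t)
  + w \, \bigl[ \epsilon_\theta(x_t, y) - \epsilon_\theta(x_t) \bigr]
```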

A consistency model is basically something like this - maybe it's easier to explain consistency distillation first. The idea is that I want a one-step model that generates the right solution, and I want a way to bootstrap it from a regular diffusion model. Suppose I have a perfect one-step model that follows the trajectory of the diffusion model. The idea is to learn a model that distills the process a diffusion model would normally go through over many steps - we're just distilling that function. What a consistency model does, in the distillation case, is compute the one-step prediction at different time steps. Here, the time step is a concept correlated with how much noise you add: more noise corresponds to a higher time step.

So in a consistency model, the idea is to build a connection between two quantities. The first quantity is: at a given time step, what is the prediction you would make in one step? The second quantity uses two steps: first one regular diffusion model step, which takes you to a time step closer to the clean signal, and then the same consistency model applied from there to reach the final prediction. Consistency distillation is trained by minimizing the loss between these two quantities. Because your consistency model at earlier times - at time steps closer to the clean signal - has an easier time predicting what the real answer is, this lets the model bootstrap the connection and learn the function. But essentially, what a consistency model tries to do is use a single model to distill the otherwise hard-to-compute process that diffusion models run over many steps.
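
A minimal sketch of the consistency distillation loss he is describing, again reusing the hypothetical `ddim_step` and schedule from earlier. Here `f_theta` is the consistency model being trained and `f_ema` is an exponential-moving-average copy used as the target, as in the original paper; the paper also explored perceptual metrics in place of the plain MSE used here.

```python
import torch
import torch.nn.functional as F

def consistency_distill_loss(f_theta, f_ema, teacher, x_t, t, alphas_cumprod):
    """Consistency distillation: the one-step prediction from time t must
    agree with the one-step prediction from t-1, reached via one teacher step."""
    with torch.no_grad():
        x_prev = ddim_step(teacher, x_t, t, t - 1, alphas_cumprod)  # one ODE step back
        target = f_ema(x_prev, t - 1)   # prediction from the easier, cleaner point
    pred = f_theta(x_t, t)              # prediction from the noisier point
    return F.mse_loss(pred, target)
```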

Nathan Labenz: (1:10:41)

So there are so many directions we could go here. Let's do your latest contribution, inductive moment matching, which I basically take to be a best-of-all-of-these - the prior approaches combined into one. There are echoes of the consistency model idea and echoes of distillation, with an interesting twist. What jumped out at me most was the idea that the model is being optimized in distribution space. I mean, I guess it's always batch level, right? But it's at a higher level than just example by example?

Jiaming Song: (1:11:19)

Like I mentioned, there are three things we want from a generative modeling algorithm. One, high sample quality. Two, stable training. And three, relatively efficient sampling when you draw from it. All of the existing methods suffer from at least one drawback. For example, generative adversarial networks, GANs, are not that stable to train - and the same goes, to some degree, for consistency models. Diffusion models are stable to train and have high sample quality, but they can't generate high-quality samples in very few steps, so the inference cost is high.

So we want to find an algorithm that satisfies all three: high-quality generation, fast sampling, and a stable training process. To get there, we reasoned about a generalization of what consistency models are doing. Instead of matching point-wise samples - I have this function, and I want to match its outputs exactly - we only need to match at the distribution level, because we don't actually care which exact function we end up with. For example, if you have a bunch of samples and you're trying to push them onto another set of samples, there are many different solutions you could go with. But in consistency models, you are forced to follow one particular type of solution. That is more restrictive for the model, and the model may need more capacity to achieve what it is being asked to do. This is possibly one of the reasons the GAN-based methods have an advantage: they compare samples at the distribution level.

So instead we thought: rather than matching samples point-wise, we can match them at the distribution level. What is another algorithm, besides GANs, that can match distributions based on samples? It turns out this idea was discussed in the statistics community at least 15 years ago, and it is called maximum mean discrepancy. The concept may sound a bit scary, but it has a very interesting relationship with GANs. In GANs, you have a discriminator that tries to maximize the distance between its prediction on a real sample and its prediction on a generated sample. In maximum mean discrepancy, you do the same thing, except the discriminator is no longer a neural network. It is a function defined on a space called an RKHS, a reproducing kernel Hilbert space. You can think of it as a simple, closed-form function built on complicated, infinite-dimensional features.

What is interesting about MMD, or the RKHS choice in general, is that you don't actually have to optimize the discriminator. Once you define a particular space of functions to optimize over, the optimal solution can already be written down. That means you skip the inner loop of optimizing the discriminator - the part that makes the whole process unstable in GANs - and you end up with a very stable optimization process that still minimizes the distance between distributions as you train the generator. The downside is that you choose this space of functions a priori, so you lose the freedom to adapt it to the best-case scenario. But in our experiments, we didn't find that to be a huge problem.
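
A minimal sketch of the MMD statistic he is referring to, with an RBF kernel standing in for the RKHS feature map. This is the generic (biased) empirical estimator from the statistics literature, not Luma's inductive moment matching loss; `x` and `y` are assumed to be (n, d) and (m, d) sample matrices.

```python
import torch

def mmd_rbf(x, y, sigma=1.0):
    """Empirical MMD^2 between two sample sets under an RBF kernel. The fixed
    kernel plays the role of the discriminator: no inner optimization loop."""
    def k(a, b):
        return torch.exp(-torch.cdist(a, b) ** 2 / (2.0 * sigma ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2.0 * k(x, y).mean()
```

Minimizing `mmd_rbf(generated, real)` with respect to the generator's parameters gives a distribution-level training signal with no discriminator to optimize, which is the stability property he highlights.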

So essentially, the easier way to interpret our inductive moment matching approach versus what consistency models do is that it is a generalization of consistency models: it does distribution-level matching, whereas in consistency models, the matching is based on a single point. A single point is a degenerate way to represent a distribution, so consistency models become the degenerate case. This also explains why, in certain cases, consistency models are unstable to train. We actually did ablation studies controlling the number of samples used to compare distributions. With one sample, the model is unstable. With two samples, it is also a bit unstable, just later in training. But with four samples or more, the training process becomes stable. So that's how we get to this one-stage process with high-quality generation.

Nathan Labenz: (1:15:54)

Maybe just one last question, because I know we're at time. Looking forward: you guys are obviously deeply invested in multimodality. I would love to understand how you think about multimodality broadly and where you think it's going.

Amit Jain: (1:16:10)

We probably don't have time to explore these answers, to be honest with you - this is the entire foundation of the company. But I will say this: currently, what people call multimodality is not really it. These are language models that have been fine-tuned to work with images, audio, and video. They show some capabilities that are genuinely beneficial in the context of whatever the language model is doing, and that's very good. But as we're seeing, individual capabilities - object recognition, segmentation, just being able to understand what's going on in a scene, following long threads of events, causality - are still worse in multimodal models than in dedicated computer vision models. And that tells you this might not be the approach, right? We're not quite there. It would be as if we had built all these language models but they were still worse than RNNs at interpreting natural language. That's not the case - language models are fantastic at that.

What we're seeing, basically, is that this is a very promising direction, and obviously the bag-of-tokens approach is extremely helpful. But we need to look beyond language backbones and trying to retrofit everything onto them. Luma's approach is very different: we are coming at it from the direction of a unified, singular latent space where we can think and reason about all these different pieces of information as if they were one. And technically, they are one, right? Nature doesn't make a distinction between video and image and audio. These are just signals. They're all part of the same simulation, the same environment we are in. An action produces signals in all of these different modalities, exposing different facets of its existence and occurrence. We need to think about it in that same way.

And currently, we see that most of the industry is extremely shortsighted when it comes to thinking about multimodality - for good reason, by the way. There's so much to be done in language, and people should continue to do that work. But a new approach is necessary to actually solve multimodality, and that's the approach we are taking. That doesn't answer all your questions, but we don't have time. We'll come again next time.

Nathan Labenz: (1:18:31)

Okay. Cool. I'm looking forward to part two already. So for now, Amit Jain and Jiaming Song from Luma Labs. Thank you for being part of The Cognitive Revolution.

Nathan Labenz: (1:18:41)

It is both energizing and enlightening to hear why people listen and learn what they value about the show. So please don't hesitate to reach out via email at tcr@turpentine.co, or you can DM me on the social media platform of your choice.
