AI-Powered Filmmaking with Waymark's Stephen Parker and Josh Rubin
Nathan explores 'The Frost,' a 12-minute AI-powered short film, delving into the creators' process and the evolving landscape of AI art.
Watch Episode Here
Video Description
In this episode, Nathan sits down with Stephen Parker and Josh Rubin of Waymark, creators of The Frost, an AI-powered 12-minute short film. We get a behind-the-scenes look at their creative process, the prompting and creative techniques they used, and an overview of the current state of AI art. If you're looking for an ERP platform, check out our sponsor, NetSuite: http://netsuite.com/cognitive
TIMESTAMPS
(00:00) Episode Preview
(00:01:00) Nathan’s introduction for Stephen Parker and Josh Rubin
(00:05:01) The Frost is a 12-minute short film created using DALL-E 2 images.
(00:07:06) The Frost started as an experiment to see if a narrative film could be created completely from AI imagery.
(00:08:38) The filmmaking process was different because DALL-E images provided a starting point to build the story.
(00:10:38) Parker started generating images with DALL-E 2 when he got access to the early preview.
(00:12:26) Prompt technique to get consistent images by providing context about a hypothetical film.
(00:15:57) Sponsors: NetSuite | Omneky
(00:19:37) Compositional continuity, like shot-reverse shot, was hard to achieve through prompting.
(00:22:13) Rubin would request specific shots and the team would prompt DALL-E 2 to create them.
(00:25:24) Filmmaking with AI as opposed to traditional filmmaking
(00:32:25) Getting consistent facial features for characters was very difficult.
(00:39:03) The storytelling helped cover inconsistencies that viewers might not notice.
(00:40:15) Working with the images DALL-E provides
(00:41:54) MacGuffin object to tie scenes together
(00:44:53) Inpainting and compositing to refine DALL-E images
(00:45:41) Prompting for complex or novel compositions remains challenging.
(00:50:43) AI art is limited by what exists in the training data.
(01:02:05) Animating the human characters was challenging because of missing or incorrect appendages.
(01:02:24) Animating subtle human movement and emotion is still very difficult.
(01:06:35) A romantic comedy would be much harder to produce with current AI capabilities.
(01:07:36) The team had to find creative ways to convey emotion through the limited animation.
(01:12:17) For Frost 2, they are using text-to-video models like RunwayML.
(01:15:43) AI voicing advancements applied to filmmaking
(01:19:27) The future of AI in Hollywood and filmmaking: quality narratives still require human vision
LINKS:
The Frost: https://www.thefrostpart.one/
MIT Tech Review Feature Article: https://www.technologyreview.com/2023/06/01/1073858/surreal-ai-generative-video-changing-film/
Behind the Scenes Videos: https://www.youtube.com/watch?v=p31COxNbTWs and https://www.youtube.com/watch?v=F8k9MeXpSUU
The Frost Part 2 Trailer: https://www.youtube.com/watch?v=RcmwtRd_NIs
X/SOCIAL:
@Stephen_Parker (Stephen)
@bigkickcreative (Josh)
@Waymark
@labenz (Nathan)
@eriktorenberg
@CogRev_Podcast
SPONSORS: NetSuite | Omneky
NetSuite has spent 25 years providing financial software for all your business needs. More than 36,000 businesses have already upgraded to NetSuite by Oracle, gaining visibility and control over their financials, inventory, HR, eCommerce, and more. If you're looking for an ERP platform, head to NetSuite: http://netsuite.com/cognitive and download your own customized KPI checklist.
Omneky is an omnichannel creative generation platform that lets you launch hundreds of thousands of ad iterations that actually work, customized across all platforms, with the click of a button. Omneky combines generative AI and real-time advertising data. Mention "Cog Rev" for 10% off.
Music Credit: MusicLM
Full Transcript
Josh Rubin: (0:00) This stuff was exceptionally hard, maybe even harder than a traditional animation. You're working with an unknown artist in the room who can give you exactly what you want, or who can give you some random wonderfulness or, as some people say, a grotesque image. It's a huge challenge. In order for it to be good, there needs to be a big 500-foot human vision.
Nathan Labenz: (0:30) Hello, and welcome to the Cognitive Revolution, where we interview visionary researchers, entrepreneurs, and builders working on the frontier of artificial intelligence. Each week, we'll explore their revolutionary ideas, and together, we'll build a picture of how AI technology will transform work, life, and society in the coming years. I'm Nathan Labenz, joined by my cohost, Erik Torenberg. Hello, and welcome back to the Cognitive Revolution. Today, I'm thrilled to be speaking with my longtime friends and teammates, Stephen Parker and Josh Rubin, creative leads at Waymark and creators of The Frost, a groundbreaking short film made entirely with DALL-E 2 generated imagery. For a bit of context on the AI journey that Stephen, Josh, and I have been on together: from 2017 to 2021, Waymark had built the easiest-to-use video creation app on the market, and the quality of the video templates that Stephen, Josh, and the creative team produced was our standout feature. However, feedback showed that users wanted more than an easy-to-use DIY solution. What they really wanted was an app that could create content for them. Now, I had been interested in AI forever and was always looking for AI tools to enhance our product, but I'd only done a tiny bit of hands-on development because, frankly, none of the AI technology available at the time really worked for our purposes. That first started to change with OpenAI's release of GPT-3. And in September 2021, when we successfully fine-tuned the Curie model for the first time, I became convinced that generative AI was the solution. Some on the team, I think, thought that I'd lost it when I used my prerogative as CEO to pause just about everything else we were doing, up to and including board meetings, to organize a generative AI 101 crash course for the team and reorient our product road map entirely around generative AI. Stephen and Josh, to their credit, came along for the ride and have since caught the AI wave in their own unique way. While I've gotten my 10,000 hours of AI usage with a mix of language, computer vision, and text-to-speech models, Stephen and Josh have gone super deep on AI art. We got first-wave access to DALL-E 2 as an OpenAI innovation partner customer in early 2022, and since then, Stephen has personally generated over 1,000,000 images with DALL-E 2. The result? The Frost is not just a proof of concept, but a legitimate 12-minute film with a coherent narrative and a consistent aesthetic. In this episode, we get a behind-the-scenes look at their creative process, a sense for the challenges they faced and the strategies they used to overcome them, and overall, one of the most sophisticated accounts of the current state of AI art that you'll hear anywhere from a team using these tools with the highest level of taste, vision, and skill. Creating The Frost was not easy, but this project does show how transformative generative AI is likely to be as it continues to mature. Only seven people are credited on this film: Josh and Stephen, our Waymark team members Tommy Herman, Zach Polly, and Lexi Dietz, and collaborators Matt Sessions and Robert McFalls. As always, we appreciate your reviews and your online shares. But this time, I really want to encourage you to watch the film itself and also the trailer they've recently released for The Frost 2, which they are already making with an entirely new generation of AI tools.
We'll have links to these in the show notes, and also to some visual behind-the-scenes content and an MIT Tech Review write-up of the film as well. Now please enjoy this fascinating conversation with Stephen Parker and Josh Rubin, AI creative pioneers and makers of The Frost. Josh Rubin, Stephen Parker, welcome to the Cognitive Revolution.
Stephen Parker: (4:26) Thanks for having us.
Nathan Labenz: (4:28) Very excited to have you both. Obviously, we've worked together for a long time at Waymark, where you guys have led the creative department and brought a level of creative quality to the work that we do that was certainly previously inaccessible to the likes of me. So, very much appreciate that over the years. But today, we're here to talk about your recent project, which you've done at Waymark, but kind of for broader exploratory and creative purposes, and that is The Frost. So I guess for starters, tell us: what is The Frost?
Stephen Parker: (5:07) So The Frost is a short film, 12 minutes, part 1. It is a film that we created using DALL-E images essentially, which are still images that we prompt for, curate, and then take into After Effects, cut up, use puppetry and various styles of animation, run the images through D-ID, do a whole bunch of things to essentially create video from still imagery, and then assemble a short film out of that new quasi-video. I would say that Josh probably has more to say about what The Frost is than me. He is the director on the project. But it's just fundamentally about the exploration of what new AI can enable for creators.
Josh Rubin: (5:58) Yeah, The Frost was kind of a happy, wonderful accident. It was born out of the curiosity to see if we could make a film generated completely out of AI imagery. And by film, I mean more than just a montage of images set to music, more than seeing if we could animate them and get the images looking as good as possible. I think there's a lot of people out there in the world doing that, doing that well, and doing that probably better than us. What I mean by film is to create a narrative out of AI-generated imagery. Could it be possible? Is it possible? We saw the text-to-image generations as the beginning of AI cinema, which is the beginning of this whole new revolution in entertainment, really. And we just saw the beginning of it. The Frost was kind of an experiment put into action. Then, three and a half months and 13 minutes later, we had a narrative. For better or worse, we achieved our goal. So we're quite happy with it. Yeah.
Nathan Labenz: (7:28) It's been very well received. I think it's been really interesting to see how the media has kind of taken an interest in the project, and what seems to capture people's attention most is this kind of standard that you're speaking about, right, where you guys really set out to do something that is not a proof of concept, but instead saying, given these new tools, can we create something that we as creators and filmmakers would be proud of and that people would actually wanna watch, not as a pure AI curiosity, but as something that hopefully stands up against other entertainment on its merits. Right? I think that's a key difference in the way that you guys approached this project relative to so many things where people are like, look what I kinda spit out, or look what AI spit out for me. I guess I don't really know a lot about your process, and I don't really, to be honest, know a lot about filmmaking in general. So maybe you could take us through the process and maybe highlight the degree to which it differs from a traditional filmmaking process, obviously, in many ways, especially at the kind of technical execution level where it does. But even maybe just starting at the conceptualization: how differently did this play out relative to a traditional project when it came to just the first questions of what are we trying to make? What story are we trying to tell? Did you feel you could start with the same questions, or did you have to approach it in a fundamentally different way given the different tools that you're gonna be using?
Josh Rubin: (9:11) I thought the genesis for the project was... well, it's a very different process from when you're kind of ideating a completely original piece without any images to go off of, really. Normally, you start with a script or an idea. You start with an idea, then it evolves to the script phase, and then you storyboard things out. And then if you're lucky, if you have enough money, then you get to shoot the thing. And you go location scouting, you choose your environments and your sets and scenes, and then you kind of go out and film it. And with DALL-E, it kind of gives you a great starting point. It gives you some place to start. And whether you take to it or not, that's up to the creator. But what happened with our project was Stephen was extremely excited about this new technology and just went gung ho in terms of generating all these fantastic, extremely photorealistic, cinematic images that really lent themselves to fantasizing about, hey, this could become a movie. So it was easy, at least for me, to see Stephen's initial Frost series, which I think was a series of 20 or 30 images. Basically these images were faces and mountains and close-ups of gear and things like that. It was the beginning of a world, and all it needed was a story to tie it together. So we had this amazing starting point, which normally you don't get unless you're drawing inspiration from a bunch of different things, and then you ultimately have to make it your own. But the DALL-E image generation gave us this incredible starting point that we could just hit the ground running with.
Nathan Labenz: (11:26) So let's dig in on each of those a little bit more. I think people have seen a lot of cool stuff, I would say. You know, safe to say, if they're listening to this podcast, they've seen some cool AI art. Probably, though, it's often the case with different AI systems that you kind of, to a first approximation, get out of it what you put into it, in the sense that if you don't have any vocabulary or expertise in an area, then you kind of get amateurish stuff out. And that seems to be true both on the image creative side and also on the language models as well. Right? You know, some of the most creative and interesting projects in both realms are predicated on the person, the creator, really knowing deeply what they're doing. So I'd love to help people leave this conversation with a little bit better sense of how they can prompt for consistency or kind of create a world. You know, when I go in there myself, I feel like, okay, I want an image of this. It doesn't really come out how I'm hoping. I kind of mess around. If it's still not looking how I want it to look in my head, I often just give up or I try something else. But I don't know how many cuts there are in this 12-minute film; it seems like, typical or whatever, every two seconds or something. You've got a lot of shots over the course of 12 minutes. And so to sustain an aesthetic for a full 12 minutes, to make something that ultimately feels coherent, it seems you really pushed on this frontier of consistency, predictability. I'd love to hear more about how you conceptualize that and the techniques that you and the team developed.
Stephen Parker: (13:12) Right. So I think it's important to kind of imagine in your mind how you think these images were captioned originally. Because those captions are critical in the training of the dataset and are kind of a fundamental first base for how to think about getting images. So before everybody had access, I was taking a lot of prompts from people. People would just kind of give me random prompts and I would help to improve those prompts. Right. And they would say things like, I really wanna get something that looks like a photo, a real photograph of a gray alien skull. So they would just put gray alien skull into an AI image generator, and they would get back an illustration or painting or something that didn't look like a skull at all, or maybe it wasn't a gray alien. So I would say things to them like, hey, let's imagine this were real and there were a gray alien skull, where would it be? It would probably be somewhere like the Museum of Natural History, right? It would be in New York. And I want you to see it in your mind as that thing. And now we're going to prompt and we're going to say the skull of a gray alien circa 1946 or whatever in the Museum of Natural History, something that really sets up a kind of contextual framing for DALL-E or an AI image generator in a way that helps it to understand: what are we going for? What are we looking for? And where might we find that thing? And so you could also imagine gray alien skulls being captioned inside of something like National Geographic, right? Let's imagine there's a photograph of a gray alien skull held in a museum collection taken in the sixties. Now we wanna prompt for that image. That is really the way to think about achieving images for me inside of these generators.
Nathan Labenz: (15:14) Hey, we'll continue our interview in a moment after a word from our sponsors.
Stephen Parker: (15:18) So if we kind of take the pre-context and then apply it to a production like The Frost, the first thing we really wanna do is just kind of set up a contextual prompt structure that looks like: hey, what kind of image do you want? Portraits. I want portraits from the film. Then let's give it an imaginary film name. For The Frost, we called it Tundar, just because it's kind of a tweak on the word tundra and felt like something that wouldn't be in there, but was pointed enough to kind of set the mood. So we want portraits from the film Tundar, then we're gonna add a comma, and Oxford commas are a huge part of crafting the prompt, if you will. But the next section is kind of like, okay, what are we shooting? Right? So we're shooting people climbing a snowy mountainside, right? That's kind of the subject matter of the prompt. And then we put another comma and we say something like filmed by some famous Hollywood cinematographer or director. And now it's kind of like, okay, if you were to imagine the location of that image in a dataset, maybe it would be in a blog post about a film, or maybe it would be in an IMDb kind of still or caption, or maybe it would be on one of the tons of film sites out there. Right. And so you sort of imagine the imaginary caption for this film image, which is really kind of pointing to where that image might be hypothetically, and it's also kind of an art-directed structure. We have a structure that's like: hey, I want this type of image from this imaginary project. Here's the subject I want in this particular instance. And then we can add additional kind of consistency information to the end. So maybe it's always directed by the same director or maybe it's not, but we always want a consistent element to be in our shot, like the color yellow, or the colors blue and gray in the case of The Frost. By setting up what I keep referring to as a contextual wrapper like this, we kind of have a frame around the subject matter. And now what we do is we go in and we play with that subject matter to kind of achieve different shots. So if you go look at The Frost and you think of it not as kind of moving video, but just as one representative image from each of those video moments, that is basically what gets substituted into the subject matter portion of the prompt.
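To make the structure Stephen describes concrete, here is a minimal sketch of how a "contextual wrapper" prompt could be assembled and sent to an image generation API. The wrapper strings, helper names, and the use of OpenAI's DALL-E 2 images endpoint are illustrative assumptions, not the team's actual tooling; the point is that only the subject changes from shot to shot, while the surrounding frame stays constant.

```python
# Illustrative sketch of the "contextual wrapper" prompt structure described above.
# The wrapper strings and function names are assumptions for demonstration only.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

FILM_CONTEXT = "portraits from the film Tundar"  # imaginary film name sets the mood
STYLE_SUFFIX = "filmed by a famous Hollywood cinematographer, blue and gray color palette"


def frost_prompt(subject: str) -> str:
    """Wrap a per-shot subject in the consistent contextual frame."""
    return f"{FILM_CONTEXT}, {subject}, {STYLE_SUFFIX}"


def generate_shot(subject: str, n: int = 4) -> list[str]:
    """Request candidate stills for one shot; the team would then curate from these."""
    result = client.images.generate(
        model="dall-e-2",
        prompt=frost_prompt(subject),
        n=n,
        size="1024x1024",
    )
    return [img.url for img in result.data]


if __name__ == "__main__":
    # Swapping only the subject keeps the "wrapper" identical across shots.
    for url in generate_shot("a group of people climbing a snowy mountainside"):
        print(url)
```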
Josh Rubin: (18:12) That's what helped us get the look of The Frost. There's a couple of extra things that I think helped with continuity, to answer that question, that definitely helped us in terms of attaining that at-times seamless look. The whole thing took place outside in the snow. So when we're cutting from shot to shot, that's an easy transition for your brain to make. Like, okay, well, in this shot, it might not be exactly the same background. Those tents might not be the same as they were in that reverse shot, but I'm buying it because there's snow and there's mountains. So that was huge, creating an environment that was outside. Because once we got inside, that's a whole different ball of wax too. Also, another challenge with this stuff was to try and create the cinematic continuity. By that, I mean just the fundamentals of a conversation where you have a shot and a reverse shot. That kind of stuff was the most challenging. Once we had that look, then we had to go back in and refine. Okay, that's when we had to use our knowledge and apply it. Because it's like, okay, you can only have a certain amount of wide shots within a scene. You need to punch in or you need more wide shots, so you need a different angle. You need a profile. You need this. Having that knowledge of filmmaking essentially was a great help, because once we had our foundational shots, our big master shots that the team created, then it's like, all right, how do we enter this world? How do we explore this world in a cinematic way, in a way that audiences are used to experiencing? And that's where the real specific prompting came into play. And we got mixed results out of that. Sometimes we got exactly what we wanted, and then other times we kinda let DALL-E take the reins and got some great stuff there.
Stephen Parker: (20:39) Right. I think that's a good way to think about it, maybe, for the audience. It's like, not so much me, but my team, they're really trying to get out there and dig up the raw materials. And by raw material, think of cinematography, think of these shots. That's kinda my side of it. Then Josh's side of it is making that into a film, focusing on the story, figuring out how to build a story out of those raw materials. And then we kind of get into an iterative feedback loop where he's like, okay, this scene is working or this scene isn't working. I need more reverse shots here. You know, how do we, let's go take a crack at that. And then we try to go mine for some more of that type of shot and bring it back to him. Then maybe it is working or maybe it isn't working and we need to make a tweak. Definitely the biggest thing in all of this, and that's kind of a standard flow, or could be considered a standard flow in terms of iteration around a project like this, but DALL-E or any AI gen is kind of an artist on its own. It's almost a cinematographer that you can't fully control. I think what's interesting about all this is that we talk a lot about these things, or we refer to them quite often, as tools, but there's also sort of an artist component there. There's definitely an unknown with the AI; it's just not fully within your control. And so we have to treat those AIs as if they were themselves artists and sort of be willing to be receptive to what they give us in the moment, on the day, in this shot or any other shot. And then see what we like, see what new ideas come out of that. That's definitely the most new aspect of this, I would say, because in a typical production, you control everything: you go there, you art direct the scene, you set it up the way you want it. You have the talent you want. You've got a script with exactly what you want them to say. You may deviate here and there, as encouraged by the director, but fundamentally you're in control. In the case of AI image gen, it's much more like an active substrate that you're trying to sort of manage and mess around with in order to create something that fits into your project and works for you.
Nathan Labenz: (22:58) If this were real, let's imagine it were real: where would it exist? Who would have created it? How would it be described? So you're not just starting off with a kind of naive, purely descriptive thing, but you're trying to bring in all these other associations and kind of marshaling the vast knowledge base that the system can draw on. On the language model side too, we've had a couple moments like this in the past where I've been trying to use adjectives to get a language model to do something for me, like write better copy for our videos at Waymark. And then occasionally, you'll say something to me like, why don't you tell it to make it like David Ogilvy wrote it? And then I'll be like, oh, yeah, that's quite a different approach, but in some ways way more powerful. Right? Because I'm usually very literal and kind of sitting there, much like your gray alien skull example. I'll be like, I want concise copy, and I want vivid copy, and I want memorable copy, and I want all these action verbs, and I'm kind of just giving it a checklist. And you are instead invoking a master, which is, I think, something probably a lot of people could stand to benefit from in really any AI project that they're doing. So that's a bottom line that I think people should definitely take away and incorporate into their own usage. You know, there's this kind of iterative cycle. Not knowing a lot about filmmaking, to me, that sounds pretty profoundly different. Right? I imagine in some cases, especially with all sorts of technology that is used, you can kind of patch over things that you didn't actually film when you were doing the filming, later at the editing stage. But still, I also would imagine that you're pretty limited in most cases by what you actually captured when you were in the capture phase. And it's not easy, right, to go back and reset up the set and bring the actors back together. So this fundamentally does kind of change the flow from one that is much more planned out in advance and kind of waterfall, to use a software analogy, to one that is actually much more agile and allows you to spin back to the raw materials anytime you want. Am I overstating how different that is, or is that really a very different reality for making a film?
Stephen Parker: (25:30) No, it's super different. You can punch in, as you were referring to, on various projects, whether it's audio or something else with video, but yeah, it's very expensive and time consuming to go back and shoot this stuff. How I would set it up for the audience, though, is that it's really a trade, because yes, it's very iterative. You can go back anytime you want. You can revisit shots, get new material, stuff that would otherwise kind of cost you a fortune to go back and reshoot and rethink about. The trade, however, is that you're not fully in control. You still have to take those ideas again from DALL-E, from Midjourney, from whoever. And yes, you're getting more, but you're also getting more, again, that is not within your control. This is really where Josh, I think, has stuff to say about what that process actually feels like coming from the world of a traditional director to a newer project like this.
Josh Rubin: (26:31) That was probably the most revolutionary, most eye-opening, oh-shit moment that I had during this project: when we were actually in the cutting room. We had our stills. They were still yet to be animated. Just as in any kind of creative process, you're searching for more. You want to make it the best it could possibly be. Like, yeah, this is cool, and this shot is kind of cutting with that, and the music is hitting right, these two people are talking, and it's a scene. But wouldn't it be great if this could happen? And then, as you said, in a normal traditional production, even in animation, going back to shoot is a luxury that a lot of people can't afford. You've got to work with what you have a lot of the time, especially if it's already in the can. And with DALL-E, you can, in real time, say, all right, let's get this hand clutching this rope. That's going to heighten the drama. We want this scene to be super tense when the person is clinging for their life and has slipped off the mountain. It's like, okay, how do we heighten those moments? We could literally engage the team for a prompt. They can get us back an image within 20 minutes. And then we could put it in the timeline, watch it down. No, that's not working. The fingers are messed up, or let's pull back even more, have them outpaint it two times or something. Ah, that's it. Beautiful. And then you have your shot. That's how the polish phase happens, really. That's how this thing became polished. I think as these models start to develop even more, we're just going to see that polish phase get even faster and be more exhilarating.
Stephen Parker: (28:49) Yeah, in terms of practicality there, I think Josh mentioned the storyboard earlier, which we do use, but I think in our case, something that's different is ours is really an active storyboard, right? So we use just a whiteboard website that all the team members can go and see simultaneously and kind of work on collectively. We kind of have a row for each scene and images laid out in sequence there, but the images can be taken down and a new image gets put up, or we make space and put a new image in. Those images are not a hypothetical or concepted or idealized version of what we want; those images are actually what we're working with. So we go through this active storyboard phase where, without worrying about the animation or anything else, we're just kind of worried about a sequence of stills. And I do think that's not totally different from an animatic process or something, but it's nice to just be able to do that continually throughout the project. And so, as Josh was saying, okay, he's decided in the scene he really wants a tight close-up of somebody gripping a rope. All right. Seems straightforward enough. But now the variable part is: is that hand masculine or feminine, young or old? What does that rope look like? Is it a sleek, black mountain climbing rope of some sort, or is it an old nautical rope from a ship from the 1600s or whatever? And you're really getting all of it back at the same time, and then you kind of have to decide what fits here. And obviously you hope to hone it and get more specific than that, but hopefully that kind of helps to contextualize or provide an example for people of, even though you've ID'd this very simple thing, all the variability that comes, or explodes, out of that one individual request.
Nathan Labenz: (30:49) When it comes to faces, the continuity of the characters through the movie: you had mentioned the tents in the background can kind of change, and you can kind of get away with it because it seems coherent enough and people don't latch on to those kinds of details. But for things like the characters' faces, I would assume, first of all, a lot lower ability to slip past the viewer when those things change or are kind of inconsistent from scene to scene. And that seems like it would be really hard to get consistent across a bunch of different shots, a bunch of different contexts. So how did you approach that problem of getting these characters to feel coherent across all these different images when, in fact, presumably they're being generated from scratch each time. Right?
Stephen Parker: (31:46) Yeah, it was stupid hard in the context of our project, to be clear, because we're working with DALL-E, which is back in time in terms of the tech. Another caveat for now is that this is much easier within Midjourney, which you can do either by blending images together, which will kind of give you a consistent character, or by uploading an original image and then asking for subtle variations on that image. It's still not perfect, but it's a lot easier today than it was then. In terms of back then, yes, we really spent a lot of time banging our heads against the wall. Josh in particular. Yeah, I remember I was all the time trying to ask, couldn't we just kind of follow a group? Let's make the story about the group so that we don't have to care too much about any particular character. I was kind of twisting his arm in that direction the whole time, but there was this one character, Dr. Ulrich, where Josh was like, look, we really need time with a few characters. You know, you gotta find some way to give me a consistent character. He was very amenable to many different possibilities for how that might manifest. But ultimately the thing we kept coming back to is this idea of an archetype, right? So I think not to be understated is just the prevalence and significance of archetypes within these image generators, right? You can think about a common archetype like a statue of the Virgin Mary that we've all seen a hundred different times all over the world; DALL-E or any AI image gen has certainly seen a ton of those statues. You could also think of an archetype in terms of something like the Mona Lisa, which is essentially the same image, but a million different times in the dataset. There's a lot of different ways to think about this, but for the character of Ulrich, the archetype we set up is a gray-haired, Wiley-esque mad scientist in a white lab coat, with long gray hair and glasses, something archetypical enough that it's not the same person, but if you study the faces, obviously the faces are changing; the idea is really to kind of just establish continuity so that that character feels like the same character. Josh uses a lot of tricks there, but at the gen level, we're really working with archetypes and thinking about it in that way. Knowing that, again, we're not going to get the same thing. We're just kind of in an address field of possibility for what the generator may return.
Josh Rubin: (34:29) That was a huge thing, Dr. Ulrich, to try and get that character. There's a scene in the movie where this Dr. Ulrich character is making a speech at the United Nations, trying to save the world. And it's action packed. When you're hoping to get action and movement and meet the bar that's been set by modern day editing techniques, you have to go there. You have to cut. You have to make cuts. You got to move your camera. You got to play the game that's been set in motion. So we found this tropey archetype of the mad scientist. And at first I was like, yeah, it's not the same guy. Guys, how do we get the face transplanted onto this shot? It's just not going to happen. But after a while, you take a step back from the project, and you look at the scene, and when the voice of Dr. Ulrich was read, and that's a consistent voice, and the music is put into play, I bought it. It's like, oh, it's the same guy. But if you look at it from shot to shot, it's not. Some of them look very different. One guy looks like Santa Claus and he's probably 30, and the other guy is a little on the more trim side and just has a couple of different facial features. But we're relying on that suspension of disbelief that's just very prevalent in a lot of moviemaking anyways. So we were just really leaning on that with these characters, especially with that character, because he probably makes the most appearances. So that was a huge struggle, and I'm really proud of how that turned out.
Nathan Labenz: (36:28) Yeah. That's super interesting. I honestly had not studied it that closely, and I've watched it a few times. But in retrospect, hearing what you're saying now, it just kinda worked for me. Like, I didn't really notice those variations or think twice about it, frankly. Human perception actually accommodates a lot of things. Like, we're not really trained to be on the lookout for things like this. There's a really interesting potential lesson here for kind of human-AI interaction in general. Right? Certain things that we're not primed for are really easy to slip past us. We've got no prior context or reason to be on the lookout for a person to not be the same person from scene to scene. It just doesn't happen. So in reality, then, you can actually slip a lot past somebody because their prior, so to speak, on that happening is just so low that it has to be pretty egregious, I guess, or maybe egregious might be too strong, but it has to be significant to get over a threshold where people would see it if they didn't specifically come in looking for that.
Josh Rubin: (37:44) I think it speaks to the power of the story too. Not our story, but just the power of storytelling. And when people are in a story, you could forgive a lot of things. You know what I mean? There's countless Reddit boards dedicated to gaffes in movies, where you could see: did you spot the boom in the shot? Did you spot that this actor wasn't the same actor as in this scene? A lot of that stuff we don't catch just because, if you're watching a story, you want to be engaged. You want to feel. And so that did nothing but help our cause with this. It was a cool thing to see put into practice.
Nathan Labenz: (38:30) So going back to your comments about the shot with the rope. Right? What kind of hand is it? What kind of rope is it? All these little details. How much of that are you accomplishing through iterating on a single image? Like, people have seen these kinds of mask-and-fill techniques or outpainting techniques. When an image actually gets to the point where you're gonna use it, how often is that something that DALL-E spit out and you're like, great, we'll use it, versus it spits something out, but then you went and redrew the rope five times and the hand five times? How much of that kind of image-level, partial editing and reworking are you doing?
Stephen Parker: (39:14) Quite a bit. Quite a bit. Most of the images have some, particularly with DALL-E, because those images are starting out so small and then kind of scaling up from there in terms of the way the images are generated. You know, you're often going back and correcting an eye or painting over hands or something in order to try to get a better result. The rope example is a quality example in terms of adding a new shot that you want. But in many, many, many of the shots, there's just some aspect of it that you want to touch up or make different. Or sometimes DALL-E just does crazy things. Like it'll just throw a random object into the background. You know, it's kind of a fun creative occurrence, but also, when you're looking for consistency and there's a steel pallet or something in the background that just has no reason for being there, then Josh is making notes like, get rid of this, get rid of this, please change this. Another example is, okay, these outside shots are great, but I need a fire. Like, can I get some plumes of smoke? So things like that are happening all the time. And then in terms of the real cheating Photoshop work that we're doing in this process, the one thing we really needed was this sort of MacGuffin object. I think of it that way anyway. And it was this idea of a transponder that is sort of moving up the mountain with these characters over the course of the story. And so in order to achieve that, we went into DALL-E to first create that object, and went through many, many, many sort of prop iterations on what that thing might look like. Once we had it, though, in a few different angles, then that is an example of an object that literally gets composited, or you can think of it as stitched, into various shots over the course of the film, so that we have the one little consistent object that can follow you through the film and sort of help add another layer of continuity.
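As a concrete illustration of the touch-up step described above, here is a minimal inpainting sketch using OpenAI's DALL-E 2 image edit endpoint: transparent pixels in the mask mark the region to repaint (a malformed hand, a stray object, a missing fire), and the prompt describes the desired full image. The file names and prompt text are placeholders, and this is an assumed workflow rather than the team's actual pipeline, which also leaned heavily on Photoshop and After Effects compositing.

```python
# Minimal inpainting sketch: repaint only the masked region of an existing still.
# File names and prompt are placeholders; transparent pixels in the mask mark the
# area DALL-E 2 is allowed to repaint (e.g., adding plumes of smoke to a camp shot).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

result = client.images.edit(
    model="dall-e-2",
    image=open("camp_wide_shot.png", "rb"),
    mask=open("camp_wide_shot_mask.png", "rb"),
    prompt=(
        "portraits from the film Tundar, an overhead shot of a mountain camp with "
        "plumes of smoke rising from a fire, blue and gray color palette"
    ),
    n=4,
    size="1024x1024",
)

for img in result.data:
    print(img.url)  # candidate fills to review, then composite in After Effects
```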
Nathan Labenz: (41:23) Can you unpack that notion of a MacGuffin for folks that aren't familiar with it?
Stephen Parker: (41:28) The MacGuffin is kind of the idea of an answer, an object, or something that isn't really there. It's more the idea of it that's there. So I'm not exactly using it in the context that it would most often be used, in terms of a film where people are searching for a thing. That's the MacGuffin: it's the answer, it's the object. They may find it or not find it. It kind of gives meaning to the story without providing substantive meaning itself. The transponder is similar in that it's not a real transponder. It's not really there. It is something we're just kind of stitching throughout the film. I think of it that way because it's not really coming from the generator in place. Right? We're putting it in in order to add this kind of idea of consistency to the project. So maybe I'm equivocating a bit there, but it's how I think of the construction of the film.
Nathan Labenz: (42:31) So another kind of challenge: Josh had mentioned this kind of shot and counter shot, or was it reverse shot, characters in dialogue where you're gonna kinda see over each one's shoulder. Right? People can recall how that often happens in the movies and TV that they watch. It strikes me that if you want to go back and replace a rope, there are pretty good tools to do that, right, where you can mask it out and try it again and iterate on that. But for broader composition, that's obviously a lot tougher. Right? You can't just mask it; essentially you'd be masking out the whole thing: the layout, the whole kind of composition of the shot. You know, if it's wrong, it's kinda wrong. You can't just locally change that. So how did you work on that, and how would you describe what the tools can and can't do there? Right? There are actually benchmarks around this where people are like, can you create an image where there's one blue circle and a red square and the red square has to be below the blue circle, whatever these kinds of compositional, somewhat stilted, but clearly defined compositional tests are. Yours are obviously much more in the eye of the beholder in terms of whether they're gonna pass or not. But I imagine that must have been a real challenge. And what do you do to try to dial that in and get the hit rate up to an acceptable level?
Stephen Parker: (44:02) A couple of things to think about there. One, it's always getting better. You know, the ability to prompt for unique things and unique situations, which I think is ultimately what you're getting at there, is all the time improving. So I would just put that up front. Two, Josh is asking for any number of things. He's not just asking for small tweaks to a hand or a rope. He's very often saying, I want a different background here, or I want this character from the image, I want them combined with another AI image. And so there is kind of some compositing happening at the level of putting images together. If you want to think of them almost as collage from the DALL-E generated images, that's another way you can think of it. But then I guess at an artistic level, there are multiple things happening as well. One thing is that knowing what the form looks like to begin with is going to give you a much better idea of where you should be erasing, if you will, in order to sort of force a better image to arrive based on the amount of space you've taken away. And that is also contributing to the available space to paint back in. So if you want to refine a form, thinking about it as an artist, there's definitely a structure or a method of attack in terms of how you are inpainting and outpainting within an image. But then also you have to go back to the dataset and think again about the dataset. Like, I remember we had a number of shots that were daytime shots that Josh wanted to see fires in. And they were overhead shots of the camp. He wanted fires to be there. It's like, if you just go do a survey of film, or go do a survey of this type of image, there just won't be that many examples of images that are daytime with a brightly burning flame as well. You might get that at night, you might get that in other contexts. And so a lot of times it feels like you should be able to just ask for object X, a fire, right here in this place, but all of the informational context surrounding the place where that object goes is contributing to how the machine is going to paint into that space. And then that's also coupled with the sort of contextual information that lives in the dataset. There are other times it's the color grade of an image. I remember we had shots that were very, very, very blue. And we were trying to paint in something like a red scarf or a purple scarf or a yellow scarf, and with all of those colors, there is no way that after a film has been graded, that color is gonna show up bright in the final image. So there are many, many instances in film where someone is wearing a bright piece of costume or whatever, but after the grade happens, the color grade, those colors become much more faded. And so DALL-E is never going to paint an electric red or an electric purple into that scenario, because it's never seen any examples historically, in terms of the way lighting works in these images, that are gonna give it any indication that that should happen. That's an interesting mystery in terms of where we are with the training right now: kind of trying to twist the arm of these machines to reach into new or novel spaces, to create things where there really isn't great pre-context around what those things should look like in the images. That also is a bit of a tangent, but it's cashing out in many ways when you're interacting with these generators. If an idea is complex, well understood, but maybe without pre-context, you're going to get a fuzzier image.
It's almost like the machine is having to think about a lot of different things at once. And so you're really kind of getting something that's maybe a little fuzzier, maybe a little bit more painterly. When there's a really clear idea of what that thing is, it's almost like the image dials in to something more specific, much more detailed. That's a long way of saying all these things are kind of happening at the same time. At an iterative level, we're kind of just listening to what Josh wants and doing our best to get that thing. But we also have to be conscious of what is possible in this image, given what it already sort of looks like, such that we don't step wildly out of bounds and make a novel request that's just kind of impossible for the machine.
Nathan Labenz: (49:05) I think this is another really fundamental point for different kinds of AI systems as well. Right? It's the notion of, can AIs have breakthrough insights? You know? Can they have eureka moments? Can they suggest science experiments that are actually worth running because there's enough insight behind it that makes it a not previously well-trod hypothesis? In this case, can it create stuff that doesn't look like anything in the set? In general, my read is that's very tough and almost vanishingly rare. Certainly, when I've looked for things like, can it come up with good science experiment ideas? You know, with current systems, I would say not really. It can take a good science experiment idea and start to map out the steps, but I haven't seen any examples where it will actually give you a really good science experiment idea up front. But I guess, in the context of this image stuff, we have seen things like the avocado chair. How do you make sense of something like the avocado chair, which, maybe that exists somewhere out there, but that's pretty vanishingly rare? I don't know. I'm just kinda trying to get a little bit better understanding of what are the kinds of never-before-seen things that it can do versus the kinds of never-before-seen things that you're still feeling are impossible.
Stephen Parker: (50:38) This is definitely at the forefront of a lot of different machine learning research and insights being worked on, yet to be developed, etcetera. I don't have the definitive answer on this. You're gonna get lots of different answers from lots of different people. But I think there is important thinking within this realm of questioning. So the way I think about it, and the way I appreciate these image generators the most, is really celebrating the fact that there was a dataset that contributed to them. I don't like to interact with these machines apart from that thought and that appreciation. At a high level, this is really an exploration of human history that I feel I'm taking part in. The way I think about that sort of metaphorically or hypothetically is by thinking about that latent space and thinking about kind of a 3D space where a lot of points exist. And I think of those points as sort of all the things previously created, all the things that exist within that dataset. Another way to think about it is maybe a Library of Babel analogy, or a space where all possible things exist. But if we were to conceive of a thing like a space where all things exist, then the dataset is only the things we've found within that space of possibility. And then what we're actually interacting with in this generative approach is really the space between that set of coordinates. So to reframe it within a space of all possible things: there are all the things we've found, and now we're interacting with the space between all the objects we've found. So in the case of the avocado chair, no, there weren't a lot of avocado chairs, but there are a lot of avocados. There are a lot of chairs. There are also a lot of avocado pillows. There is a ton of different chair designs that look like many, many things. Certainly many of them have been close to the shape of an avocado. And if I were thinking about the space between those points where an avocado comes close to a chair, probably that is occurring many, many, many times. The fact that the machine is able to create an inference point between those two and manifest it as an image of a chair is not that surprising to me. And there are a lot of cool examples where that happens in ways that we just haven't thought of yet. And that's kind of where the novel happens here, right? It's like, what happens when we blend two, three, five points that we haven't really blended before? That's the sort of everything-is-a-remix philosophy. That's the idea that we're getting to new places by mixing the places where we've previously been. And there is great opportunity for that to happen. There are a ton of just purely abstract kind of artists who are producing novel work, I would say, just by kind of putting unique concepts together. But I think that is very, very different from asking the machine to fundamentally travel outside of the space of points that we know about and get to a new point, a new location, a novel location. That really doesn't happen. And it's my understanding that that is not really the space we're working with when we interact with a ChatGPT, a DALL-E, or a Midjourney.
Nathan Labenz: (54:23) I still struggle a little bit with why the scarf can't be super vivid red, or just things that don't seem too conceptually crazy. Maybe it's more just that you only have so much bandwidth. Right? You only have so much data you can communicate to the model in your prompt. And so it kind of has to... I don't know. I'm struggling, though, because it's like, you can get the avocado chair. Seems like you should be able to get that bright red scarf if you want it, even if it's not super coherent with the rest of the shot.
Stephen Parker: (54:58) Yeah. I'll just jump in for a second. The pre-context of the prompt is important there. Right? In the case of an avocado chair, and really with any prompt, the less specific it is, the more opportunity there is for what I would call open blending, just intersection between points, or forced intersection. The more we contextualize it, the more we say, oh, I'm looking for cinematic 8K portraiture from a hypothetical film, by this director, or lit this way, within this context, we are triangulating into a tighter and tighter realm of the space, right? The avocado chair is a great example of: if I could pull from any image in this dataset and put it together with any other image in this dataset, in this mathematical mind of the AI, then a lot more freedom is available to me. And also, to be certain, the Midjourneys of the world are all the time doing a better and better job of being able to make ad hoc demands, even within context, for things like the bright red scarf in a very, very blue-cast image. So that's kind of happening. It's happening through RLHF. It's happening through a bunch of other techniques for the type of images that get generated. And we are still developing and figuring out arm-twisty ways to have these machines generate the sort of images that we want to see them generate. But I also think it's worth thinking about being in the space, vectoring into more and more specific places in that space, and then contextually what that means for what the AI sees in terms of its local area of what it might grab from. It still does cool stuff all the time. Stuff that's really creative. DALL-E trolls with the Mona Lisa continuously. And I don't know that people know that if they haven't generated a 100,000 images with it or something, or they're not looking for it. And maybe the developers toss it in from time to time as a prepend or an append to the prompt. I don't know. Maybe DALL-E just does it because it's seen the Mona Lisa so many times that it will throw the Mona Lisa in here and there. But in the case of DALL-E, Mona Lisa is really kind of a character, a ghost in the machine, if you will, that just shows up in all kinds of different places and contexts: maybe as a background character, as a photoreal version, as a photobomb sort of character, in another instance as a drawing, as an illustration, as a sculpture. In many, many different scenarios, Mona Lisa just kind of pops up in the most creative ways. And that's just one example, but it is really, really fun and funny at times to see that level of creativity sort of coming from the machine. And that's happening all over the place. The suggestions that the AI is making through its lack of specific knowledge are also really, really cool. Like, I see a lot of clothing that the machine conceives of as kind of a mix of two different ideas. You know, it's a button-down shirt like yours, but maybe it ends right where that top button ends, and it's a poncho from there on out. And it's kind of like, the machine has no reason to care. Even though we talk about kind of bias in these machines, at the art level it's really willing to kind of mix and match and pull from so many different things that it's really creative. It's really interesting and it's really unique at times.
Nathan Labenz: (58:49) Fascinating. Okay. So, just being somewhat mindful of time, I wanna talk a little bit about the motion techniques that you guys used in this project. Then I kinda wanna get a little bit of an update on where you're going next. I understand there's a Frost 2 in the works. And, as you've mentioned several times, the tools are improving and what's available is changing, and you're probably gonna use significantly different tools and techniques for the second installment. So I'm really interested to hear about how that's going. And then, maybe just to conclude, we could kinda zoom out for a second and talk about the big picture of what all this means for content creation in general, and content consumption. You know, it seems it's gonna have a significant impact. So I'd love to hear your speculations about that. But with that road map, tell me about the motion. You know, one comment that we've heard from a reviewer is that the vibe of the film is, I'm quoting, grotesque and unsettling. And I think that obviously plays nicely with the story that you guys are telling. I wonder to what degree that was an early 2022, early 2023 constraint, where you were kind of like, hey, this technology is still kind of in the uncanny valley, so let's just make it uncanny valley and grotesque and unsettling because it plays that way. Could you have made something that wasn't grotesque and unsettling? And my understanding is that that is in the motion layer. Correct me if I'm wrong with that assumption, but it seems like DALL-E 2 is spitting out pretty realistic, not uncanny valley images, and I'm guessing it's the motion layer that was kind of introducing this sort of vibe. But correct my misconceptions there.
Josh Rubin: (1:00:45) Yeah. The motion was a challenge, because we have these very cinematic images that feature human beings. And human beings move a certain way. And human beings have appendages that look a certain way. Knees bend at a certain angle and elbows bend and fingers move. Sometimes DALL-E doesn't want to give you all of those appendages in anatomically correct places that would beget a traditional human movement. So we figured out quite early that our movement, at least with the human characters, was going to be a very rudimentary look. Especially with the climbing up of the mountain, that was probably where the motion was most evident and was going to be featured the most. You have to show them moving up the mountain. That was done in a couple of ways that Chuck can go into. But when we were using the DALL-E generated images, you kind of had to just go the coarse puppeteer way in terms of After Effects, literally taking an arm, or a knee, and moving it up and then down in a very time-consuming way. We leveraged our amazing animator. I wanna call him out. His name is Matt Sessions. He's been working with Waymark for a long time, and he did a tremendous job breathing life into this project with the motion. So I definitely wanna call him out. We were playing into the characters. We had no choice. This is what DALL-E was giving us, and this is what we're working with. We're working with characters that might not have a hand or might not have a complete foot. So it's like, how do you animate that? You know what I mean? Well, you have to keep them climbing up the mountain, so to speak. So we were doing our best there. And also, in terms of the climbing, we were really inspired by the Akira Kurosawa piece The Blizzard, which is featured in the anthology film Dreams, which is just now being remastered, I think, released this week by Criterion, which is really cool. But in that piece, the characters move a certain way. It's a very sluggish kind of movement that we felt good about. It's like, okay, Kurosawa is doing this up the mountain, we can do it. It was a great inspiration to have. Also, something that I was noticing when I was seeing storyboards come together and then seeing the early animations was that it was coming off as: this is a graphic novel come to life. And sometimes you just want to see the graphic novel, you want to see the comic book just move a little. It doesn't have to be realistic. It doesn't have to be 24-frames-per-second movement. You want to see a little movement and we're good. And that goes a long way. And that's kind of how we got that look. And whether it's grotesque or not, that's up to the viewer to decide. I'm not going to comment on that. There's also other motion animation that we set into play to really breathe life into this thing, to get the kind of scope that we wanted in terms of a group of people moving up a mountain, and not really concentrate on moving every single appendage. It would be too time consuming. We utilized 3D models from Mixamo, which are basically 3D characters that you can buy and animate in a coarse way and intersperse throughout the frame. That helped give us a little more fluid motion throughout.
Nathan Labenz: (1:05:09) So do you think you could have done a different genre at this stage? I'm wondering, what if we came back and said, all right, we want to do a romantic comedy with the same tech? Would that just be impossible given the sort of limitations of how realistic it is, you know? Would it be possible to make somebody the object of romantic desire with this tech, or is that just not quite there yet?
Stephen Parker: (1:05:35) Maybe if you just wanna do it as still images, honestly, we can get something very photoreal. But I think that's a very good question. I'll let Josh jump in, but my immediate reaction to romantic comedy is, I can't think of anything harder to try to do, where subtlety in facial expression and so much back and forth between people matter so much. We were definitely afforded a lot of things by the genre, by the scope, by the kind of interesting world building that we had around the project. A much more character-driven thing would just be really, really difficult right now.
Josh Rubin: (1:06:13) I agree. One of the biggest challenges of this piece was to try and mine emotion out of these characters. We could get really amazing photorealistic frozen nomads or whatever, but they'd be blank. For example, when we were building the avalanche scene: all right, we need people in here looking up at the mountain. Sometimes it would just give us people looking up in a romantic gaze, even though our prompt was in utter terror, they are about to die, panic, exclamation point, exclamation point, trying to throw everything at it to try and receive something back that resembled a human being in distress. And sometimes we just wouldn't get that back. You know what I mean? And we had to go in there manually sometimes and tweak eyebrows, tweak lip position. What's really cool, and what we started implementing at the very tail end of the project, was that Photoshop has a little AI emotion element to it. We were able to tweak some facial stuff using Photoshop and also some After Effects techniques as well. And also with inpainting, there's a lot of inpainting in this, trying to get a smile, trying to get a frown, trying to get that. So I think, yeah, to answer your question, a romantic comedy might prove to be difficult. Now, there's this new tech that we're just starting with Frost 2. I don't know if we wanna talk about that.
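For readers who want to picture the inpainting step Josh describes, here is a minimal sketch of asking DALL-E 2 to repaint only a masked facial region through the OpenAI Images API. The file names and prompt are hypothetical placeholders, and the team may well have done this in the DALL-E web editor rather than through code.

```python
# Hypothetical sketch: DALL-E 2 inpainting via the OpenAI Images API.
# "frame.png" is the full still; "mask.png" is the same image with the
# mouth/eyebrow region erased (transparent) so only that area is repainted.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

result = client.images.edit(
    model="dall-e-2",
    image=open("frame.png", "rb"),
    mask=open("mask.png", "rb"),
    prompt="close-up of a frozen nomad looking up in utter terror, 35mm film still",
    n=4,                 # generate a few candidates to pick from
    size="1024x1024",
)

for i, candidate in enumerate(result.data):
    print(f"candidate {i}: {candidate.url}")
```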
Nathan Labenz: (1:08:03) I wanna talk about it if you're open to talking about it. And, not to spoil the story or anything, but just the fact of how fast things are changing. It hasn't been that many months since you were doing the first version and released the first edition of this, and the tools are changing, things are getting easier. So I'd love to hear, what are the big advances that are making the second edition different in terms of your process and different in terms of the possibility of what you can create? What are you using tool-wise, and what more can you create than before? Voice is another one, too, that I'm really interested in. It seems like we're hitting some thresholds right now in terms of the viability of deepfake voices, for better or worse. Would you consider an AI voice for the next generation? There's a lot of different frontiers that could be considered.
Stephen Parker: (1:08:57) Josh just put together a trailer for The Frost 2, and I'll let him speak to all of the new tech therein. Maybe just a short preamble: we obviously started creating The Frost at a time when the tech, to us, was much more new. Then an interesting phenomenon working on the project is that we were totally eclipsed by the tech by the end of it. Also, The Frost ended up being a way bigger project than we thought it would be initially. We imagined it as just a couple-of-minute piece to begin with, and then ended up with a part 1 running at 13 minutes by the end of it. So we kind of have this thing where our first pancake ends up also being the whole brunch. And it's like, how do you do that effectively? It's been a bit of live iterate-and-react for us over the course of time. But one of the things lately that we've started to become attracted to is this idea of The Frost as an IP, a recognizable IP or storyline that people hopefully at some point become familiar with, and then can watch sort of morph through these AI technologies as new tech comes along. And for us, we love that idea, because the world of AI is moving so quickly. You very often only get one chance to see a thing, think about it, and then it's on to the next 10 things of the day. So the idea of an intellectual property of some sort, a storyline that you can follow along with and also experience new tech with, is becoming a really cool idea to us. So I'll let Josh talk about that new tech.
Josh Rubin: (1:10:49) Yeah. For The Frost 2, at least for the trailer and for a lot of the tests that we're running, we're using Runway Gen-2, which is basically a text-to-video model, which is something that, when we were starting The Frost back in December or January, we thought, oh, that's 2 years away, whatever it is. At least that's what I was thinking. But it's here. I think as I was finishing cutting this trailer, they just released a newer version of Runway Gen-2 where, instead of giving you a 4-second output, which is kind of limiting, it gives you an 18-second output, which is plenty of time for any filmmaker to tell the story of a shot or a scene if you want to just keep one shot running. It's a tremendous advancement, and it seems like it's only getting better. But just interacting with that was pretty different, in that the images were moving. We're no longer getting still images and then breathing life into them. The life is already breathed into them. With that comes more issues. It's like, no, well, now this person's head is resembling a tomato more than it is a human head. Yeah, we can't use that shot anymore. So there's different weird stuff. You're encountering different little weird quirks that it has. But I feel like that'll get ironed out, and as we work with it more, we'll be able to articulate our best practices there. But yeah, it's just super exciting to have worked with that.
Nathan Labenz: (1:13:01) So the new technique is now just straight text prompting with Gen-2. That's the focus at present. Yeah. That's interesting.
Stephen Parker: (1:13:08) There's also images involved in those prompts at times too, which is cool, because you can go over to DALL-E or Midjourney and craft the image that you want Runway to use in order to create that scene. And now you can do a couple of things. You can do a pure text prompt. You can do that image plus a text prompt, to ask for the type of movement you want. Or you can use that image only, and just see what Runway spits out as a result of that image. So there are kind of 3 things available to you there. Now, inpainting, outpainting, as Josh was saying, these things can be done, rotoscoping can be done, but it's way more taxing to do that than to just prompt for another generation. So there is kind of less flexibility at the level of video output. But we're building a flow now where we're achieving the image first, in many cases, not all, but in many cases, and using that as a starting point for the generation of these clips. You also mentioned AI voice. Josh, you are using AI voice.
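To make the three prompting modes Stephen lists concrete (text only, image plus text, image only), here is a minimal, purely illustrative sketch. The team worked through Runway's web interface; the endpoint, field names, and model identifier below are hypothetical stand-ins, not Runway's actual API.

```python
# Hypothetical sketch of the three Gen-2-style prompting modes.
# The URL, payload fields, and model name are illustrative placeholders.
import base64
import requests

API_URL = "https://example.com/v1/text-to-video"   # placeholder endpoint
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}

def generate_clip(prompt_text=None, image_path=None, seconds=4):
    """Request a short clip from a text prompt, a still image, or both."""
    payload = {"model": "gen-2", "duration_seconds": seconds}
    if prompt_text:
        payload["text_prompt"] = prompt_text             # text-driven motion
    if image_path:
        with open(image_path, "rb") as f:                # image-driven look
            payload["init_image_b64"] = base64.b64encode(f.read()).decode()
    response = requests.post(API_URL, headers=HEADERS, json=payload, timeout=600)
    response.raise_for_status()
    return response.json()["video_url"]

# Mode 1: text only. Mode 2: image plus text. Mode 3: image only.
generate_clip(prompt_text="nomads trudging up a frozen ridge at dusk")
generate_clip(prompt_text="slow push-in, blowing snow", image_path="frost_still.png")
generate_clip(image_path="frost_still.png")
```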
Josh Rubin: (1:14:26) When we were voicing Frost part 1, we auditioned a ton of AI voices, just because there are so many different characters. It's a job to go in and audition people, to hear different clips of people reading lines. It's a thing. We were hoping to maybe employ some AI VO, and it just wasn't happening. It didn't sound real. It was too quirky, too robotic sounding, whatever. But now, during our time with The Frost 2, the advancements are significant, especially applying them in a dramatic context. I don't know anything about anything outside of a drama, but within this dramatic context, some of them really, really excel. So yeah, we were blown away.
Nathan Labenz: (1:15:28) What are the tools that have jumped out to you the most right now in AI voice? For me, it's ElevenLabs. Their pro voice clones are insane recently. I have been just amazed by that. I'm waiting for mine, so that I can delegate the whole podcast adventure to the AI as well. Also, PlayHT, a former guest on the show, just released something in the last few days that I think we really wanna check out too, because what it adds is the ability to prompt emotion on top of just the text. I don't know if ElevenLabs has that, you can tell me if they do, but the ability to go in and be like, this is said in anger, versus said in surprise, versus whatever, is just a whole new layer of control that, as far as I know, has just come online with their last release. So your toolkit is expanding exponentially right now.
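For readers who want to try the kind of voice test Nathan mentions, here is a minimal sketch against ElevenLabs' public text-to-speech REST endpoint as documented around this time. The voice ID, model name, line of dialogue, and settings are placeholders, and the emotion-tagging control he attributes to PlayHT is a separate feature not shown here.

```python
# Minimal sketch: render one line of dialogue with ElevenLabs text-to-speech.
# Voice ID, model, and voice_settings values are placeholders to adjust.
import requests

VOICE_ID = "YOUR_VOICE_ID"        # e.g. a cloned character voice
URL = f"https://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}"

response = requests.post(
    URL,
    headers={"xi-api-key": "YOUR_API_KEY", "Content-Type": "application/json"},
    json={
        "text": "The frost is coming. We move at first light.",
        "model_id": "eleven_monolingual_v1",
        "voice_settings": {"stability": 0.35, "similarity_boost": 0.8},
    },
    timeout=120,
)
response.raise_for_status()

with open("line_01.mp3", "wb") as f:   # endpoint returns raw MP3 audio
    f.write(response.content)
```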
Stephen Parker: (1:16:27) Yeah. There is a bit of a sign-up, but it is really, really nice as artists to start to see these tools begin to incorporate more of the tool set that we need in order to get emotion out of things. Josh was talking about trying to mine for emotion out of DALL-E. I really think a lot of the emotion has just kind of been RLHF'd to death out of these datasets. And it's understandable at the level of creating images that feel scary to people, faking real-world events, disasters, that sort of thing. But at the level of drama, or an action movie, or sci-fi, you need people to look scared at times. You need to be able to put explosions into the scene. You need people to sound angry in their voice delivery, or happy. You need them to be able to cry, argue with each other, love each other. The full range of human emotion and visual expectation, we need that as artists in order to create effectively with these tools. So I think we're starting to see it. There's a delicate balance there. I can understand why it's not there initially, but we spend so much time talking about what's in these datasets, and we don't really spend a lot of time talking about what artists really need in order to get the full range of capability out.
Nathan Labenz: (1:17:55) I think that's maybe a good place to transition to the last thing I wanted to ask you guys about, which is just the broad future of all this stuff. Obviously, for context, we've got multiple Hollywood strikes going on right now, writers and actors, and in general people are kind of expecting a lot of disruption in this space. You could look at that from the standpoint of production: what roles become more or less important, how does the mix of jobs change, how do the budgets change. I think it's also really interesting to consider what gets created, and how does that change. An important point that our good friend and CEO Alex always makes is that we just never would have done this otherwise. Right? We didn't have the budget to do the traditional production and never would have dreamed of it. So this is something that simply could not and would not have existed before. And that's an additional layer to bring to the whole thing, because it's not just the way things are produced that's changing, but also what can be produced. But there's just so much fallout from this: economic, cultural, perhaps an atomization of people falling into their own worlds. We're already very concerned about echo chambers and information silos and algorithms curating content just for you in a way that may or may not always be healthy. But it seems like there's also a lot of potential for things to be created just for you in the future. So, you guys have been really at the forefront of this, and I wonder, if you try to peek around the corner beyond Frost 2, extrapolating to 3, 4, 5, where do you think this goes? Or can you even predict that at this point?
Stephen Parker: (1:19:59) A few things there. We did this for fun. You know, we're not a film production company. We're a technology company at Waymark. Our mission is fundamentally about people being able to make their own commercial. So we were able to pursue this project really as a luxurious byproduct of the fact that we're out there looking at new tech on behalf of our customers and trying to figure out the best ways for us to incorporate this stuff, and to be the best practitioners we can, so that when we do have times where we need to construct a prompt or something in our own workflow, in the case of our customers, we know how best to do that. So this is really about us understanding what's possible, and that's what got us to The Frost and what gives us this kind of fun area to play in. That said, this is fun, this is not for profit, this is kind of just to share, and hopefully share a lot of knowledge as well about the creation and the process. You know, my own opinion is that there's still room for a lot of things. There's still kind of room for all of us. The tech is certainly going to transform roles. It's going to create a lot of new, cool, and interesting roles that we haven't thought of yet. It's going to enable much, much smaller groups of people to pursue these projects. I like to think of it as a lot of people who previously didn't have the means to become filmmakers, or to do things creatively, are going to have the capability to do that very, very soon. And I think that is a very exciting idea. I also think some jobs are gonna go away. Certainly, that's just the reality of technology in our world. That worry is often couched at the level of the artist, and I don't necessarily buy into that argument so much. I think, yeah, maybe if you have only one modality of art creation and you never plan on expanding that or trying something new, okay, there are gonna be issues for you down the road. But if you embrace these tools and think about the use of them in your own workflow, I think just about every artist can find something cool and find new paradigms and new possibilities for themselves. Think about the ability now to fine-tune an AI on your own work. If I were an individual fine artist who maybe hasn't given this a shot yet, or hasn't really thought about it, the idea of going and fine-tuning on your own work, and then exploring with the AI within the realm of what you're already doing, sounds like a very new and exciting thing to think about. Also, when we prompt for things like a specific artist or a specific production type, we're only getting an amalgam from the AI that's based off of a myriad of things it's seen in the dataset. We're getting results back where an artist is noted properly, and we're also getting results where an artist isn't noted properly. So we're really getting a melange that, though it may say one artist, represents any number of people. And in the case of film, it represents everyone from screenwriter to grip, to just armies of people untold that are represented there in the dataset but are not equivalent to the artist name that someone might put into a prompt. So I think that is an important thing for us to consider.
And then, maybe to build on that, I'm excited about the idea of directors coming along and authoring, definitively, a fine-tuning that other people in society can maybe purchase or license the ability to use, exactly what they wanted the machine to understand their artistic thumbprint as, and for people to be able to mix and match those definitively authored fine-tunings with others. I think that is a new kind of economic explosion potential that I'm really trying to talk about more openly right now and be a big advocate for, because I think that is maybe a healthy new world where artists and creators can intersect and find a lot of benefit. And then the last thing I'll say, and then I'll shut up, is that, in terms of what I put out into the world versus what I've created, I'm going to hit 1,000,000 images generated this year in about 2 months, running my numbers. I put out a small fraction of that into the world in terms of what I let people see. That's okay with me. I have my own worlds that I explore, my own kind of stories I enjoy by myself. And that is going to happen for a lot of people in a lot of ways going forward. And I think we can just try to be a bit more understanding about that and appreciate it for what it is.
Josh Rubin: (1:25:30) That's a hot topic right now. I have friends and family in Hollywood who are worried about this and ask me about it all the time. It's definitely controversial. I can tell you from my experience that directing this movie and trying to see this production through was not easy. This stuff was exceptionally hard, maybe even harder than a traditional animation. You're working with an unknown artist in the room who can give you exactly what you want, or who can give you some random wonderfulness, or some, as other people say, grotesque image. It's a huge challenge. I do believe that, in order for it to be good, there needs to be a human vision, a big 500-foot vision that a human being has. At least, I'm talking about making narrative film. I don't know how far away we are from the human being not being engaged at all, because right now you can't really just ask ChatGPT to write you a screenplay, feed that screenplay into another AI, and then all of a sudden you have a movie. That's not really how it works. There's a lot of tweaking. There's a lot of human conversation. There's a lot of critical thought that needs to go into this thing. So I think it's a great tool, and it's improving every day. So 20 years from now, 10 years from now, that might happen. You might be able to type in a prompt and get your ready-made movie. But right now, you still need the human to kind of be the storyteller, in my opinion.
Nathan Labenz: (1:27:48) Well, the AI-enabled short film that you guys have created is The Frost. It has been reviewed all over the place on the Internet. You can watch it on YouTube, and we will certainly be keeping an eye out for The Frost 2, with a whole range of new techniques, and we look forward to continuing to follow your work as you guys continue to pioneer what it means to create truly high-level content with AI. This has been a ton of fun, guys. Josh Rubin and Stephen Parker, thank you for being part of the Cognitive Revolution. It is both energizing and enlightening to hear why people listen and learn what they value about the show. So please don't hesitate to reach out via email at tcr@turpentine.co, or you can DM me on the social media platform of your choice.