Nathan explores the future of AI-generated video with Joshua Xu, founder of HeyGen, and Victor Lazarte from Benchmark.
Nathan explores the future of AI-generated video with Joshua Xu, founder of HeyGen, and Victor Lazarte from Benchmark. In this episode of The Cognitive Revolution, we discuss HeyGen's success in practical AI video creation, serving over 40,000 businesses. Learn about the transformative potential of AI in video production, from content translation to personalized experiences, and HeyGen's industry-leading approach to trust and safety.
Apply to join over 400 founders and execs in the Turpentine Network: https://hmplogxqz0y.typeform.c...
RECOMMENDED PODCAST: Second Opinion
A new podcast for health-tech insiders from Christina Farr of the Second Opinion newsletter. Join Christina Farr, Luba Greenwood, and Ash Zenooz every week as they challenge industry experts with tough questions about the best bets in health-tech.
Apple Podcasts: https://podcasts.apple.com/us/...
Spotify: https://open.spotify.com/show/...
SPONSORS:
Building an enterprise-ready SaaS app? WorkOS has got you covered with easy-to-integrate APIs for SAML, SCIM, and more. Join top startups like Vercel, Perplexity, Jasper & Webflow in powering your app with WorkOS. Enjoy a free tier for up to 1M users! Start now at https://bit.ly/WorkOS-TCR
Oracle Cloud Infrastructure (OCI) is a single platform for your infrastructure, database, application development, and AI needs. OCI has four to eight times the bandwidth of other clouds, offers one consistent price, and nobody does data better than Oracle. If you want to do more and spend less, take a free test drive of OCI at https://oracle.com/cognitive
The Brave Search API can be used to assemble a data set to train your AI models and to help with retrieval augmentation at inference time, all while remaining affordable with developer-first pricing. Integrating the Brave Search API into your workflow translates to more ethical data sourcing and more human-representative data sets. Try the Brave Search API free for up to 2,000 queries per month at https://bit.ly/BraveTCR
Omneky is an omnichannel creative generation platform that lets you launch hundreds of thousands of ad iterations that actually work, customized across all platforms, with the click of a button. Omneky combines generative AI and real-time advertising data. Mention "Cog Rev" for 10% off: https://www.omneky.com/
Squad gives you access to global engineering without the headache and at a fraction of the cost: head to https://choosesquad.com/ and mention “Turpentine” to skip the waitlist.
CHAPTERS:
(00:00:00) About the Show
(00:00:22) Sponsor: WorkOS
(00:01:22) About the Episode
(00:05:25) Introduction
(00:06:15) Joshua's Background
(00:09:47) Video Consumption Trends
(00:10:49) Creating with HeyGen
(00:12:46) Localization Benefits
(00:14:02) Cost of Localization
(00:16:19) Sponsors: Oracle | Brave
(00:18:24) Content Creation
(00:19:32) User Journey
(00:23:56) Avatar Usage
(00:26:33) Engagement vs. Realism
(00:31:44) Future of Content
(00:33:50) Gaming Applications (Part 1)
(00:35:43) Sponsors: Omneky | Squad
(00:37:30) Gaming Applications (Part 2)
(00:39:27) Personalized Video Potential
(00:42:57) Future of HeyGen
(00:44:49) Improving Quality
(00:46:53) B-Roll Generation
(00:49:13) Creator Experience
(00:50:56) AI Tools Integration
(00:54:21) Trust and Safety
(00:59:35) Celebrity Restrictions
(01:01:34) Closing Remarks
(01:03:03) Outro
---
SOCIAL LINKS:
Website : https://www.cognitiverevolutio...
Twitter (Podcast) : https://x.com/cogrev_podcast
Twitter (Nathan) : https://x.com/labenz
LinkedIn : https://www.linkedin.com/in/na...
Youtube : https://www.youtube.com/@Cogni...
Apple : https://podcasts.apple.com/de/...
Spotify : https://open.spotify.com/show/...
Full Transcript
Nathan Labenz: (0:00) Hello and welcome to the Cognitive Revolution, where we interview visionary researchers, entrepreneurs, and builders working on the frontier of artificial intelligence. Each week, we'll explore their revolutionary ideas and together we'll build a picture of how AI technology will transform work, life, and society in the coming years. I'm Nathan Labenz, joined by my cohost, Erik Torenberg.
Nathan Labenz (AI Avatar): (0:23) Hello, and welcome back to the Cognitive Revolution. I'm Nathan's AI avatar, created on HeyGen, and I'm here to introduce Nathan's conversation with Joshua Xu, founder and CEO of HeyGen, a leader in AI-powered video creation, and Victor Lazarte, general partner at Benchmark, which recently invested $60,000,000 in the company. Over the last two years, while generative AI for images and text was crossing critical thresholds for usefulness and hitting mainstream adoption, AI-generated video has remained, for the most part, a future promise: something that AI watchers could clearly see coming, and which has provoked much speculative analysis, particularly about the future of the entertainment industry and the risks associated with convincing deepfakes, but still, for practical purposes, a novelty technology that produces uncanny valley outputs. That has started to change recently, with OpenAI showing off Sora's general world modeling capabilities, Google previewing Veo, China's Kuaishou Technology debuting their Kling model, and most recently, Runway's Gen-3, among many other options starting to hit the market. But even today, these products remain quite unwieldy and inconsistent, and as a result, impractical for most people to actually use in their day-to-day work. Against that backdrop, HeyGen's success stands out all the more. By focusing their product development work on practical business needs, emphasizing the value drivers of quality, consistency, and control, while taking a best-tool-available approach to technology, assembling their own AI avatar models with best-in-class tools for script writing and text-to-speech, they've established a market-leading position in the generative AI for video space, and are now serving more than 40,000 businesses and earning $35,000,000 in annual revenue. 
In this conversation, we discuss what customers are using HeyGen for today, which includes creating new video content that wouldn't otherwise have been possible, translating video into more than 100 local languages, and personalizing video down to the individual viewer level. And Joshua and Victor paint a picture of a world where video content is not just more easily produced, but fundamentally transformed. Imagine immersive video experiences that adapt in real time to viewer interactions, or AI avatars that can engage in live, dynamic conversations. These aren't just disruptive alternative production methods that allow us to make the same content with fewer resources. They are potential paradigm shifts in how we think about video in general. As someone who spent much of the last three years applying AI to video creation at Waymark, with business results that, while not as explosive as HeyGen's, are objectively strong in their own right, I can say that such reminders to zoom out and challenge oneself to think bigger and stranger are always worthwhile. As it happens, we had a recording issue in this episode that turned out to be the perfect opportunity to demonstrate the power of HeyGen's product. For whatever reason, Joshua's video track, which looked fine while recording, was corrupted. We considered releasing this episode in audio-only form, but ultimately decided to recreate Joshua's portion of the video with HeyGen. After this intro, the video you'll see of Nathan and Victor will be real, and the video you'll see of Joshua will be created with his HeyGen avatar based on the transcript of the original conversation. I have to say, it came out amazingly well. Toward the end of the episode, we also discuss HeyGen's industry-leading approach to trust and safety. 
At a moment when most AI application developers are just trying to make their products work, and many are ignoring risks of misuse and abuse entirely, HeyGen has done an outstanding job of taking seriously the responsibility that comes with the power of their products. They have implemented robust consent mechanisms to protect famous people from being impersonated, and other safeguards designed to protect the public from political misinformation and scams. Their thoughtful approach demonstrates that rapid innovation and ethical considerations can go hand in hand, and I recommend that any entrepreneur or AI product builder looking for inspiration in this area check out HeyGen's product experience, or read my Red Teaming in Public thread that breaks down their defense-in-depth strategy. And by the way, if anyone listening would be interested in contributing to the red teaming or other safety testing of public and soon-to-launch products, please send me a DM. The Red Teaming in Public project has been a bit quiet lately as another team member has taken the lead and is building some foundational infrastructure behind the scenes, but we are still very much interested in connecting with folks who enjoy testing and breaking AI product guardrails. As always, if you're finding value in the show, we'd appreciate it if you take a moment to review us on Apple Podcasts or Spotify, or just share online with your friends. Of course, we always welcome your feedback and your AI advisor and AI engineer resumes via our website, cognitiverevolution.ai, and you can always DM me on your favorite social network. For now, I hope you enjoy this glimpse into the transformative potential of AI-powered video with Victor Lazarte of Benchmark and Joshua Xu of HeyGen.
Nathan Labenz: (4:26) Joshua Xu, founder and CEO at HeyGen and Victor Lazarte, general partner at Benchmark. Welcome to the Cognitive Revolution.
Victor Lazarte: (4:33) Thank you. It's great to be here.
Joshua Xu: (4:34) Yeah. Thank you for having me.
Nathan Labenz: (4:36) My pleasure. This is gonna be fun. So for folks that don't know HeyGen, and I suspect most are probably at least passingly familiar, you guys do generative video, a space near and dear to my heart, though I come at it from a pretty different angle product-wise. And I thought it'd be interesting to start off with, briefly, how did you get into this business? I heard this story on the No Priors podcast, but give us the brief background as to how you got into HeyGen. We can talk about where the business is today. I'm really interested in then going more deeply into, like, use cases, technology, and especially the trust and safety stuff, given the nature of the product that you've built. But the floor is yours for the backstory.
Joshua Xu: (5:16) Yeah, sure. Before founding HeyGen, I was working at Snap from 2014 to late 2020. Initially, I was on the advertising team, basically building out a lot of our ad system at Snap, working with a lot of advertisers, helping them get the right ROI on the Snapchat ads platform. And then later on, I switched teams to work on the camera, the AI camera. You know, I still remember back then in 2018, there was nothing called generative AI, but there was some technology called GANs, where you can actually generate something that does not exist in the real world. That was the first time I saw a computer generate something that does not exist in the real world and still feels high quality and highly realistic. I was staying on the very front line of those technologies, and just had a feeling that they could potentially change the way people create content. So, you know, especially if you look at the past, I would say, almost 10 to 15 years, new content platforms evolved with the rise of the mobile camera. Since the iPhone 4 came out, applications such as Instagram, Snap, and TikTok allowed so many creators to be able to create good content. But especially working on the camera software, we still saw many people who are not able to create good content with a mobile camera. Either they are shy in front of the camera, they don't have time for the camera, or they're not good at performing in front of the camera. And we felt that if we had a way to really replace that component, replacing the camera, we could unlock visual storytelling for everybody. And that's how we founded HeyGen. We wanted to replace the camera, because we think AI can create content, AI could become the new camera. We want to use AI to generate the video instead of having people actually film it with a camera. That's how we got started, you know, founding HeyGen back in December 2020.
Nathan Labenz: (7:26) Cool. Well, your timing is certainly impeccable, from the original generative adversarial network technology to, obviously, the much more powerful set of tools we now have to choose from. And you've grown quite a business on this technology, with, I understand, 40,000-plus businesses that use it and $35,000,000 in annual revenue, which is pretty impressive. And obviously a big raise from Benchmark to, you know, put a further feather in your cap as well. So a lot of great traction. I'd love to dig in a little bit on: what do people use this product and platform for? I feel like, you know, people had told me for a long time, oh, you should do a podcast. I'm always told I have a voice for radio. And sometimes they also say I have a face for radio, but certainly the voice, they say. And I was always like, oh, man, I don't know. You know, the world has so much content in it, and I'm not, like, that charismatic. I'm not that funny. It was only when I got totally obsessed with AI that I felt like, well, actually, in this narrow domain, I might have something, you know, that would be interesting enough that people might wanna listen to it. So I guess for starters: in a world that's so awash with content, where do we need more content? What kind of content do you see not existing that you're allowing to come into existence, and why does that matter?
Joshua Xu: (8:48) So I guess, first of all, I will start with the audience, the consumer. We live in a video-first world, and everybody wants to watch videos. There are more than a billion hours of video being watched on YouTube every day. The example I used was that, you know, I had some issue with my car the other day, my Tesla. And I didn't actually look at the manual; I looked up a video on YouTube, because video is just much easier for me to get the information I want and for me to consume as a consumer. I would say one thing we learned from customers is that every business wants to make more videos. But the problem is that video is just very expensive to make. The problem we are trying to solve here is helping businesses be able to make more videos, to unlock their needs, to match the pace of what their customers need. So that's one angle to look at it. And Nathan asked a little bit about how people are using it today. I would generally categorize it into three: create, localize, and personalize, on the current HeyGen platform. What I mean by create is that people can come into the platform, they can either create their own avatar or use the stock avatars we have, and they'll pick a template and type the script to generate the video. That's the create piece. This is great for product explainers, product demos, product announcements, and some training and learning-and-development videos. The second piece is localization, where we can take your existing videos and translate them into more than 40 different languages. We have a feature called video translation, where we can take the original video, preserve the voice of the original speaker and also the facial expression, but in a different language. And, you know, lots of companies use it to localize their content, especially some large corporates that have a specialized localization team. 
We help them really shrink down the workflow cost and time by more than 10x. The last piece, I would say, is personalization, where you can actually highly personalize your message. We have seen personalization in email, right? Basically, you get an email that says, hey, do you need this service or that? But we're inventing a way that you can do that in video as well. I think personalizing on name is just one starting point, but we can also personalize the content itself, what actually works for you in your business, combining with the power of LLMs. A company like Publicis Groupe created more than 100,000 personalized emails in a video format and sent them to internal employees as thank-you videos. Those are the main three categories, I would say, in the use cases of HeyGen.
Nathan Labenz: (11:42) So create, localize, which is translate, and personalize. I'll start in the middle. The localization one to me seems like an absolute no-brainer. I think one of the things that's most exciting about the AI moment is just how many barriers it stands to break down and how much better we'll be able to communicate with people around the world. I'm going to Brazil next week, and I don't have to worry really at all about translation anymore. I can just have my phone do it. That's gonna be such a freeing experience relative to whatever I would do in the past, right? Get out my phrase book or whatever. So it makes a ton of sense to me that businesses would be like, sure, you're telling me I can affordably translate my content into however many languages and reach people all over the world? That seems like something that is probably a pretty easy sell. What does the cost look like on that? Just to make the ROI case for a business: if I'm, you know, in ecommerce or whatever, and I've got, let's say, 100 product videos and I wanna translate them into 100 languages, I'm gonna make 10,000 videos. I can only imagine what my cost would be if I was gonna do that old school; the obvious outcome would be just to not do it. In the HeyGen world, what would that cost me? Give me a little sense of the practicalities of doing a project like that.
Joshua Xu: (13:02) Yeah, sure. So I think the cost of using AI to do that is really the cost of the GPU, the GPU compute cost. I would put it two ways. Traditionally, when we talk about localization, we mainly talk about hiring a voice-over actor to dub it and then putting that back on the original video, right? That costs, well, it really depends on which region we hire the voice actor from, but let's say somewhere around 10 to 20 dollars per minute. But the problem is that with that approach, your video still doesn't have the engagement or facial expression that matches the voice, right? With our approach, our method of video translation, we actually make it engaging with the new language, the new voice. So I wouldn't see the value purely in reducing the cost. Yeah, we reduce the cost by more than 10x, and we make the time much, much faster, probably 100x faster in that case, because it usually takes days or weeks to really coordinate the voice actor, but now we can do it instantly with AI. I think the other important value prop here is that people always think that, you know, a video dubbed in a different language is always like the second version, the version that's less engaging, where I kind of want to watch the original video and just read the subtitles. But now you can really enjoy the native-language version, dubbed natively, with the speaker's lip-sync movement actually matching the new voice. I think there's another value there, where our version of the translated video is actually much more engaging than traditional dubbed videos.
Nathan Labenz: (14:56) That seems like a killer feature to me. I don't know if you're inclined to say, but is that, like, a big part of the business?
Joshua Xu: (15:04) I mean, I think it's one of the three main pillars, alongside create and personalize. I couldn't share the specific details on that, but that's definitely one of the most powerful and popular use cases among our customers.
Nathan Labenz: (15:20) Hey. We'll continue our interview in a moment after a word from our sponsors.
Nathan Labenz: (15:25) Cool. On the create side, I wonder how you think about... I mean, obviously, video is a vast world, right, with an infinite range of things that can be created. I wonder what sort of taxonomy you have for that: do you organize it by the kinds of content being created or the sorts of people doing the creating? How much do you see being created that would have been created previously, but with more time and money required, versus stuff that just couldn't have been created, or wouldn't have been created, before? And do you guys mostly create inputs for people to take to other projects, or do people actually leave the HeyGen product with a completed video? Like, I could imagine somebody might generate something with an avatar and then go use that in the Adobe suite as part of a bigger process. But I know you also have the studio portion of the product. So, yeah, just tell me everything about who's creating what.
Joshua Xu: (16:34) Yeah. So: who's creating, what are they creating, and maybe the last one is how are they using it after they create it on HeyGen. HeyGen is a very horizontal platform. There are many, many different use cases, ranging from marketing, sales, learning and development, and, as we mentioned, localization. And I think another aspect is that there's a variety of different segments of customers on HeyGen, ranging from some consumer customers, small businesses, mid-size, and some of the largest global, you know, Fortune 500 companies as well. We see ourselves as a creative tool, where we want to build a tool for everybody who wants to create videos. And in terms of the content actually being created, I think you asked a good question: is it more about unlocking their existing workflow, making it faster, making it cheaper, or is it more about unlocking the audience by making videos they couldn't have been making before? Right? We talked a little bit about the video translation example. I don't think translating a video is a thing people thought about previously. We, you know, did open up that new market, making it possible for people to localize their videos. And a very similar thing applies on the avatar video side. The way I put it is like this. You know, assume a business needs to make, let's say, 100 videos a month, just for example. And maybe 10 of them they could actually come to HeyGen to create, with some sort of spokesperson, the human presenter videos. And we make those 10 videos, even if they're only 10% of the total videos they're making, much, much cheaper and faster. And as an outcome of that, because there's a better tool out there, instead of making just 10 videos, they're making 100 videos out of that original 10-video demand. That's one way to look at it. 
I think another way, and I'll give you one example, would be, let's say, a small business customer. They usually don't have a lot of budget to work with a video agency to create videos, but they really want to leverage video as a format for go-to-market, for external marketing. And now HeyGen is a perfect tool for them to really unlock that. In that case, we can imagine that we are really unlocking videos they couldn't have been creating before. So I guess it comes back to the point: HeyGen is a creative tool. We help shrink down the creation costs out there, and that has value and effect on both sides. Yeah. The last piece is about how people are using it: either they, you know, create the final outcome on HeyGen, or they put it into other video editors. The way I look at it is, let's say we largely categorize the customers, the audience, into two groups. One group I call professional video creators, who have access to Adobe, Final Cut Pro; the other category is people who actually didn't know how to make a video before, who don't know how to use those sophisticated softwares, right? The first group of customers, I'll call that, let's say, 1% of the total business users. They will probably use HeyGen as a footage generator. And the value here is that HeyGen actually helps them replace the camera, the process where they used to have to use a professional camera crew to film the footage. Now they don't need to do that anymore. They can generate it on HeyGen and plug the footage into Adobe or whatever editing tool they're using. That's for that 1% of professional video creation players. But actually, for the rest, the 99% of users, they couldn't make a video before. In this case, they would just finish everything, not only the camera part but also the editing piece, on HeyGen, producing content directly that way. Yeah.
Nathan Labenz: (20:56) So do you see people creating stuff? I mean, obviously... and I'm thinking, actually, for the introduction to this episode, I might go clone myself and, like, have my avatar read the intro essay instead of doing it myself. You know, the company is certainly most well known for the avatars. And when you go into the product, which I've done, I always get deep on the products before the interviews, I noticed that there's, like, a handful of pre-made avatars that I can choose from, or I can clone my own. That seems to be the thing that people are initially hearing about, that they come to the product for. I guess, how often are they using, like, an off-the-shelf avatar, a cloned avatar, or no avatar? I imagine in some cases people might be making videos that don't even feature an avatar, but there's so much emphasis on, like, the deepfake technology of the clone. I wonder if that is actually what's driving most of the value, or if it's just kind of the convenience of video generally that's driving the value.
Joshua Xu: (21:57) Yeah. First of all, I would say the voice-only use case, where they don't have a spokesperson generated, that is not the major use case of HeyGen, even though you could probably still make that work to generate a video that way. The way we really look at it, whether that's a stock avatar or your own avatar, is really about how we design the user journey. And I think the stock avatars' major benefit is to help users get the value of this technology and quickly experience what's possible with it. The user journey is always trying to help customers create their own digital twin so that they can use that in their own businesses. So it's not about the end use case. It's actually about the user journey.
Nathan Labenz: (22:48) So I take that to mean most people are actually doing the clone. Is that right?
Joshua Xu: (22:54) I mean, I will put it this way. Usually, if we look at the customer journey, they would come to HeyGen, test it out with a stock avatar, and try to see the value, or even test how effective their content is in terms of performance. And then, as a next step, people will come to HeyGen to create their own avatar and continue to scale that, build a brand-specific identity around it. That's the user journey that drives it.
Nathan Labenz: (23:23) Gotcha. Okay, cool. Still staying on the avatars for a second: where do you see these, like, working well versus not working well, and how do you think that's gonna change over the short term? I heard on the No Priors podcast you went into some detail on... obviously, when you're making marketing videos, and advertising videos in particular, the ROI on that is, like, super determined by how good the content is, how engaging it is, you know, whether people swipe immediately or not, right? Versus something like a required corporate education experience, where you have to finish it kind of regardless of how engaging the presenter might be. So I wonder how you see it: where are we right now on the spectrum? I'm definitely someone who expects that wherever we are right now, we probably will get to things that are outside the uncanny valley. But, like, today, how do you think about the uncanny valley? Are we in it? Are we leaving it? Does it depend on the use case, like education versus, you know, top-of-funnel advertising? And where is the technology, like, good enough to win versus maybe not quite there yet? And, again, how's that gonna evolve?
Joshua Xu: (24:36) Yeah. So I think, yeah, you mentioned something about external use cases and internal use cases, and the quality bar there is different. Especially for external marketing and advertising, the quality bar would be much, much higher, because the brand or business has put a big budget behind running those ads. I would look at it this way. Assume we plot the quality, you know, as a spectrum, right? I think we have certainly surpassed the uncanny valley in a lot of aspects already. For example, if you look at some of my avatars, you probably couldn't tell whether that is an avatar or a real human presenting out there. But I think the tricky part is, let's come back to the question: why do people love watching videos instead of actually reading text or just, like, listening to a voice clip? Right? I think one important piece of a video is that it's just more engaging for the audience. So the question is not what makes a video look real; it's actually how to make a video engaging. And that piece is still something I think needs a lot of development and improvement. I'll give you one example. Nathan, you're probably really good at performing in front of the camera. You do podcast interviews, like, pretty much every week, right? What I mean is, if we were replacing that for you, having a realistic generation of you is not enough. I think there's something else out there. I was talking to another person about some feedback on the product, and she mentioned something about conversation flow. Okay, what does conversation flow mean in the video generation problem? That's actually very subtle, but it's very, very important to making a video engaging. And the other piece is that conversation flow usually comes with natural movement and gesture. Like, I'm just using my hands to help me articulate a problem. 
And I think that's another challenging piece. And then the last piece is generally more emotional expression, because you can say the same thing in a different tone and with different emotion, and it actually represents a different meaning behind it. Those are still areas we're actively working on improving. And that is actually the breaking point to where we can serve more high-end, high-quality external advertising or marketing use cases.
Nathan Labenz: (27:20) I think your comment about the target not necessarily being full realism, but rather high-engagement content, is a pretty profound one. It's all happened so fast over the last couple of years, but I remember, maybe a year ago or 18 months ago or something, as we were looking at some of these early models... image generation was already getting pretty good at that point, and video was, you know, definitely not as far along as it is today. And I recall having this hypothesis that maybe we'll sort of adjust as well. Like, of course, people are gonna try to make the video generators more lifelike in some respects, but also there's sort of a superstimulus feeling to some of the early stuff, where it has this almost psychedelic, wavy form. In a way, it also kind of recalls some of the interpretability results, where you look at what sort of visual input makes a certain neuron in a vision model light up. It's like, wow, it's sort of a trippy thing that maximizes the response of this one particular neuron. And I feel like people are kind of racing toward realism, and at times, like, forgetting that, in fact, you know, cartoons are really popular, and there's, like, lots of other form factors besides purely lifelike video that maybe can win. And this is maybe a good question for you, Victor, too, because I know you have a background in video games. Where do you guys think the future of content is? Is it about realism, or is that kind of a lazy person's way to think about what really matters?
Victor Lazarte: (29:03) I personally think that realism is not the most important thing. To Joshua's point, it's all about the engagement. Given the choice of consuming content in different formats, I think it's pretty clear that humans choose video. One way of making it more engaging is realism, but in entertainment, there's a lot more than just realism. One of the things that first got me super interested in HeyGen is, I was in a mobile gaming company for a long time. In mobile gaming, user acquisition was always a big part of the business. In the beginning, people would show static ads, and eventually it evolved to video ads. Video ads are a huge part of the mobile gaming industry. For example, at my company, we had 100,000,000 video views per month. Getting the video to convert at a higher rate is just so fundamental to the business; that's one of the largest KPIs. I think that translates to other businesses as well. Companies want to communicate with stakeholders, both internally and, as the quality of videos gets better, more and more with external stakeholders. Like, hey, how do you make these videos more engaging? And in my experience, what really worked was not necessarily, oh, how do we make this higher fidelity? It's more, how do we test a bunch of different things? And the key to testing is: how do you get the price lower, and how do you get the iteration speed faster? So I think that's a very powerful thing that HeyGen does for businesses. You're able to create videos faster at a lower cost, which means you can try a lot more stuff.
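Victor's test-many-variants point can be sketched as a simple epsilon-greedy loop over ad variants: cheaper, faster generation means more variants in the field, with traffic routed toward whichever converts best. Everything here, the class, the names, the conversion metric, is illustrative, not any real ad platform's API:

```python
import random

class VariantTester:
    """Epsilon-greedy selection among video ad variants (illustrative sketch)."""

    def __init__(self, variant_ids, epsilon=0.1, seed=0):
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        # Track impressions and conversions per variant.
        self.stats = {v: {"shown": 0, "converted": 0} for v in variant_ids}

    def conversion_rate(self, v):
        s = self.stats[v]
        return s["converted"] / s["shown"] if s["shown"] else 0.0

    def pick(self):
        # Explore a random variant with probability epsilon,
        # otherwise exploit the best-performing one so far.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(list(self.stats))
        return max(self.stats, key=self.conversion_rate)

    def record(self, v, converted):
        self.stats[v]["shown"] += 1
        if converted:
            self.stats[v]["converted"] += 1
```

The point of the sketch is the economics, not the algorithm: the more variants you can afford to generate, the more of this exploration you can run.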
Nathan Labenz: (30:52) Do you see an application in gaming? Because I'm also kind of thinking about the personalization side.
Victor Lazarte: (30:58) So I think what got me excited with HeyGen is that, with what we're seeing now, we're just scratching the surface. And the thing that impressed me about the company is the product velocity. When you look at some of the new products, for example, streaming avatar, which is your avatar, a digital clone of a person that talks to you in real time. Right? So instead of, hey, I want to create a video that someone's going to watch asynchronously, it's, no, someone's going to talk in real time with a bot, and HeyGen powers that technology. And no one has that technology. I think it's one of the most exciting things about the company, because this new wave of AI started with ChatGPT capturing people's attention. And what captures people's imagination is, okay, people are now talking with a bot. But in the first iteration of ChatGPT, you're talking with a bot, but you're typing, which is suboptimal. And with the GPT-4o announcement, what was really cool is, hey, now you're talking with voice; that's kind of cool. And the next step of that is not a voice call but a video call. And with streaming avatar, we have the best product in the world to do that. Tying back to your question around entertainment, I think it's just so much more pleasurable for people to engage in this way. We don't yet know, and I'm happy that we're a platform that provides the technology, but we don't yet know what companies will do with that technology. But if I had to guess, there will be huge implications in entertainment. It's like, hey, I want to talk to virtual friends. I want to talk to a pet. We have these avatars, some of them cartoons, some of them with these different features. So I guess it's a long way of saying that we're just getting started.
Nathan Labenz: (32:45) Hey. We'll continue our interview in a moment after a word from our sponsors.
Nathan Labenz: (32:50) Yeah. I think even the nonhuman avatars are a really interesting concept. When I think about mass personalization along the lines of email, I'm like, oh, man, I don't know. I guess a couple of things. One is, if I ever get to the point where I'm getting an AI-generated video from my boss, I'm gonna feel maybe a little bummed out about it. Just like, wait a second, you didn't have time to actually talk to me? Something feels weird about that. I also find, in what little exposure I've had to these large-agency, large-brand personalization projects in the past, that they love the idea, but, and I wonder if you've found this too, they've typically been very shy about actually meaningful personalization. They love to get that name in there, but then I've seen solutions in the past where it's like, we'll do the top 100 most frequent names, and after that, we'll just say, hi, friend, or whatever. And they're like, yeah, that's great, we love that. But then, are we actually gonna do any deeper customization? Well, we'll maybe put in how much you spent on your utility bill this month or something. But it's very, very local, very programmatic. It's almost, yeah, do you really even need AI for some of these things? But then I imagine this other world where you're on an adventure in a dynamic world, maybe the whole world is getting generated for you. And now it's a lot easier for me to see how that real-time personalization, tailored to whatever adventure or experience you're having, makes a lot more sense. Do you think I'm analyzing this in the right way, or do you see big brands as actually being more adventurous than I give them credit for?
Joshua Xu: (34:32) Yeah. I mean, one way I would look at it is that, first of all, looking at personalization combined with generation only at the name-variable level is still a pretty narrow view, in my opinion. The bigger implication behind it is, first of all, video can be generated on the fly. Second, video can be highly personalized. People started with the name because that's much easier to get started with, and it's an immediate use case. But it also means that you can personalize the content itself. What is the messaging hitting that customer? What is the story behind that video? Right? And for the past 20 years we have always been used to living in a world where, when you watch a video, you and I have to be watching the same one. But it doesn't need to be like that. Right? We could be watching two totally different angles on the same story, the same thing, the same topic behind it, and the video could be presented in different ways. One example I want to use, the example Victor brought up: let's say you have a video getting a hundred million views every month, and you want to optimize it. The conversation will always come down to, okay, I think most people would probably like this, so let's change the video that way. And then you keep iterating. And every time you optimize it, you sacrifice some set of customers. Right? You are basically optimizing for the broader audience and reaching a local maximum, but there's still only one video out there. But imagine a world where we break all this down. We can actually serve every single customer landing on this page a different version of that video, with AI behind it to support it that way. I think we can achieve the global maximum in terms of ROI and optimization there. And when we really look at how that's possible, highly personalized content is one technology behind it.
And second, I think what's really important is also: can you actually generate the video on the fly? Where I give you a variable and then you generate the story behind it. One example would be the streaming avatar. But streaming avatar is really just the starting point. Imagine a world where we can actually stream the entire video, not only streaming the watching experience, but also streaming the generation experience. I think that is something we look at and feel very excited about for the future of generating video.
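The idea of personalizing the story itself, not just the name, might be sketched like this, where a per-customer script is assembled before being handed to a video model. The segments, fields, and message angles are all invented for illustration, not any real HeyGen interface:

```python
from dataclasses import dataclass

@dataclass
class Customer:
    name: str
    segment: str       # e.g. "gamer", "small_business" (hypothetical segments)
    top_interest: str  # inferred from behavior, in a real system

# Hypothetical message angles per segment. The point is that the whole
# story varies per viewer, not just the greeting.
ANGLES = {
    "gamer": "Level up faster with {interest} tips picked for you.",
    "small_business": "See how {interest} can cut your costs this quarter.",
}

def build_script(customer: Customer) -> str:
    """Return a per-customer script; a real pipeline would pass this to
    an avatar/video generation model to render on the fly."""
    angle = ANGLES.get(customer.segment, "Here's what's new in {interest}.")
    body = angle.format(interest=customer.top_interest)
    return f"Hi {customer.name}! {body}"
```

With generation on the fly, each script like this could become a distinct video variant rather than one global video for everyone.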
Victor Lazarte: (37:09) Yeah. And just adding on this, the technology to produce these personalized videos is so new. For some of the applications people make, we have to test a bunch of things, and in the beginning, not everything resonates, right? So for example, you can receive a video from your boss where just your name is personalized, and that doesn't make you feel good. And at the same time, a video that is personalized just to make you feel like this was made exclusively for you, where that's the only value, no. Imagine instead a video where we take in all the information about the stuff you care about, and we personalize not to make you feel special, but because we actually are going to deliver the content you care most about. I think as we find more and more of those applications, people will start feeling like this technology is really in their favor, right? It's a process of iteration until people find the right use cases.
Nathan Labenz: (38:10) So I guess that's a good transition to the future of the technology, the future of the product. If I recall correctly from the No Priors episode, I believe you said 40 employees at the company today?
Joshua Xu: (38:23) Yeah, we have between 40 and 50, somewhere around that. Yeah.
Nathan Labenz: (38:27) So I'm just doing a little back-of-the-envelope math. If you're doing 35,000,000 in annual revenue, that would seem to be plenty to cover the employee base. And so one would then assume that a fresh slug of capital would be going into a lot of compute, maybe licensing of underlying content as well. What's the use of proceeds here, and what is the next big act for HeyGen?
Joshua Xu: (38:53) So when we initially founded the company, and you touched a little bit on this, when you look at HeyGen as a customer, as a user today, you see it mainly as featuring avatar videos. That is the main power of the HeyGen platform today. But we actually never see ourselves as an avatar company. What we really, really want is to solve the problem of generating video for businesses. And if you look at how we solve that, we sequenced it so that we solve the A-roll problem first. What I mean by A-roll is mainly the human spokesperson, the avatar, the actor piece. There's a big piece that is still not solved in the industry: it's actually the B-roll. Right? All this background footage, music, transition animation, stuff like that. For the next act of HeyGen, 100% we want to continue to improve the quality and improve the engagement on the A-roll piece. But also, putting a lot of investment into B-roll generation is another thing that will be critical for us to achieve our mission of generating the entire video end to end. I think that involves mostly model training, investing in talent, product development, stuff like that.
Nathan Labenz: (40:07) So maybe taking those two things in turn: in terms of improving the quality, is that something you have to scale up a base model for? Is there a scaling law for avatar video generation, or is it more about a reinforcement learning approach, where it's not so compute-intensive but more about actual user feedback? What is the key input to improving quality on the current margin?
Joshua Xu: (40:37) I would say it's a combination of both. It really depends on which problem we are trying to solve. For some of the quality problems, the more data you are able to collect, the more edge cases you are able to cover, and your model naturally gets better performance. But I want to bring up another point: solving the video generation problem is not only about solving a mathematical problem. It's very different, right? How would you formulate the conversation flow problem for actual AI training? That is unsolved yet. Being able to identify those areas, engineer them in a way that an AI model can actually help solve, figure out a mathematical framing, and then apply that in the product, I think that's another big topic for future development as well. Generally, I would say it's a combination of more data, more compute, more model breakthroughs, and also more product innovation.
Nathan Labenz: (41:41) On the B-roll side, right now you have the studio-type experience, where you can go in, edit stuff, and move it around a timeline a little bit. Does the more advanced version of that start to encroach on, like, Adobe territory, while the more large-scale foundation model maybe goes more in, like, a Sora type of direction? Maybe it's both there too, but I'm kinda wondering, how do you plan to tackle everything else after the avatars?
Joshua Xu: (42:13) So I think you mentioned two things: one is the traditional timeline editor, and the other is more like end-to-end text-to-video generation. I think the approach we believe in is different from both. First of all, I have a very aggressive, strong point of view that the timeline editor will be gone in five years. Today, the entire video workflow is defined by the timeline editor, which was invented probably 20 or 30 years ago. Right? But the whole reason the timeline editor works is that cameras are expensive. Because the camera is expensive, you film a shot 20 or 30 times, then you put the takes on the timeline and pick the best one. But when the foundation, the starting point, changes so that you can generate the footage on the fly, the editing experience will be massively different. Do we need a timeline anymore? I don't think so. And in order to enable, like I said, the other 99% of people to create content, the timeline editor is actually a very big learning curve for everybody. There must be a new editing paradigm that can embrace this new generation paradigm. That's one. The other thing is that HeyGen focuses a lot on business videos. And when we look at business videos, what do they need? They need quality control and consistency. Right? You can either solve that purely the end-to-end text-to-video model way, or, the approach we believe in, you build an acquisition engine to really understand what the brand needs. In this case, the AI avatar generation is one component behind it, and then voice, music, voice-over, B-roll, and stuff like that. The approach we believe in is building this acquisition engine combined with the editor to enable the experience down the line.
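The component-based approach Joshua describes, where one engine composes avatar A-roll, voice-over, and B-roll under brand constraints rather than relying on a single end-to-end model, could be sketched roughly like this. Every function and field name is a placeholder, not a real HeyGen interface; in a real system each generator would be a separate model:

```python
# Placeholder component generators; each stands in for a separate model
# (avatar rendering, TTS, B-roll generation). Names are illustrative.
def gen_a_roll(script, avatar_id):
    return {"kind": "a_roll", "avatar": avatar_id, "script": script}

def gen_voiceover(script, voice_id):
    return {"kind": "voice", "voice": voice_id, "text": script}

def gen_b_roll(topic, brand):
    # B-roll inherits brand constraints (palette, etc.) for consistency.
    return {"kind": "b_roll", "topic": topic, "palette": brand["palette"]}

def render_business_video(brief, brand):
    """Compose a business video from separate generation components,
    applying brand constraints throughout for quality control."""
    timeline = []
    for scene in brief["scenes"]:
        if scene["type"] == "talking_head":
            timeline.append(gen_a_roll(scene["script"], brand["avatar_id"]))
            timeline.append(gen_voiceover(scene["script"], brand["voice_id"]))
        else:
            timeline.append(gen_b_roll(scene["topic"], brand))
    return {"brand": brand["name"], "timeline": timeline}
```

The design point is that consistency comes from the engine applying brand constraints to each component, rather than hoping one end-to-end model reproduces them.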
Nathan Labenz: (44:15) So do you expect somebody like me to sit in front of a screen and say, that's good, but replace that background with green park background instead of the fall colors background and just wait a few seconds and then watch a new version and just kind of talk to the computer to iterate? Or, like, what does that actually look like from a creator experience standpoint?
Joshua Xu: (44:39) Yeah. I mean, looking forward a little bit, let's say the AI gets to know some of you, Nathan, from your past videos. It can naturally just learn who you are and how to regenerate talking videos of you like that. I don't think it has to be you talking to the computer to make the switch. It could be anything, a UI, talking, any of that. Being able to put a person in a different scene is very possible today. The question is whether it actually matches the quality we want in a professional setting, professional quality, and the problem there is actually the lighting. Right? It's not about background removal and background replacement. It's, let's say I take my view right now and put it onto a beach, and the lighting actually feels weird that way. Is there an approach that solves that and makes it work? And then the next question would be, once you solve the static version of the lighting problem, what about dynamic lighting? You know, you're moving, you're gesturing. How do you make sure the whole thing feels natural and realistic and engaging? I don't think we are very far from being able to generate A-roll footage in any of the settings you mentioned. I think that's quite possible.
Nathan Labenz: (45:54) So if I'm understanding correctly, it sounds like the vision you have is, leaving flexibility at the level of the user experience, a kind of Swiss Army knife of a lot of different AI tools under the hood that the user maybe doesn't even know about in detail, in terms of what model is being used or whatever. But it sounds like maybe a couple of core generate-the-footage-type models and then a lot of auxiliary models that do specific things, like change the lighting, do this, do that. And those get orchestrated through some user experience that presumably is high-level and qualitative. You get to say what you want; maybe it even learns from your past stuff. But under the hood, it's doing an intricate manipulation with a lot of different tools. And that would be, as you said, in contrast to something like Sora. You're not trying to create a general world modeler or an intuitive physics engine, but instead all these things that do their job really well, and your macro job is to make them work together.
Joshua Xu: (47:00) Yeah. I want to add one thing to that, though. When we talk about the different pieces of the system, right, the acquisition engine has a different model behind it to empower each feature or function as a tool, let's say. I do believe there is a world where a single model can solve a lot of these different tooling problems. I'm not entirely sure what that would look like, but with technology evolving this fast, I think we have a path to whittle it down to a couple of major models behind the scenes to enable that. And then another angle: we talked about lighting just now, but when we really look at it, the lighting problem was defined in the camera world. Right? Because we use a camera, we need to pay attention to lighting. That could be a problem, and if there's a problem, we can fix it in a downstream video editor. But in the new paradigm, lighting may not be a legit dimension we need to talk about. The dimension could be, hey, make it more engaging. It could be another dimension, make it more dynamic. Right? Make the conversation flow more fluently to express the excitement of the speaker. I think those dimensions will be new and become possible with this new paradigm of the no-camera world, where you can actually generate the entire thing. Yeah.
Nathan Labenz: (48:31) Yeah. Cool. That's interesting. I think this dramatic simplification of the underlying architecture is something I've personally experienced with Waymark. I don't know if you've seen what we do, but it's another kind of text-to-video generation experience, except we are basically making TV commercials most of the time. We partner with a lot of cable TV companies and broadcast TV companies, and so the constraints are super, super specific. It has to be exactly 30 seconds, yada yada yada. I started the company, actually, but now I'm the AI R&D person. And we've had a big simplification over the last year from an earlier form where, even just to figure out which images to choose out of a user's image library, it used to be an incredibly gnarly thing. We'd be captioning, but the captions were super generic, and we'd use an aesthetic evaluation model to try to figure out which ones looked good. Now that's largely been reduced to asking one of the latest large multimodal models which images are good to use, and we actually get better results from that. So it's a remarkable trend that these Swiss Army knives are seemingly evolving toward fewer features, and yet they work a lot better. I want to move on in the last few minutes; I know we're gonna be out of time before too long. Certainly anything else you wanna talk about in terms of technology and future direction is welcome, but I also wanted to get into the trust and safety stack, because that was actually the thing that most impressed me about HeyGen, and it really stands out in the broader market. So I'll just tell you what my experience has been. I was a GPT-4 red teamer. That was kind of how I initially got into the headspace of testing AI products for risk of misuse and abuse.
And more recently I've been noticing that a lot of products are being stood up, especially agent products, but also these video generation and voice generation products, especially the cloning type, with no guardrails at all. You can go into a lot of these products; I've cloned Donald Trump's and Joe Biden's and Taylor Swift's voices on many different platforms at this point, including some that allow you to make a call to an arbitrary number and, you know, solicit a donation from this person or whatever. Right? You get literally no constraints at all. It's kind of mind-blowing that people would set something up like this, put it out there, and have no guardrails. In contrast, HeyGen has, I would say, honestly, the best integrated security measures that I've seen. And I would love to get your take on how you guys have thought about that and how you prioritized it. I think a lot of businesses feel like that'll slow us down, or it'll annoy users, or for whatever reason they're not doing it. But you guys have demonstrated that you can both build a business pretty quickly and to a significant scale while prioritizing that in a meaningful way. So I appreciate that, and I wanna hear more about it.
Victor Lazarte: (51:29) Yeah. Maybe before Joshua comes in, just to reinforce that: I was also very interested in this video creation space with avatars, and there are a few teams going at it. Right? And in my head, it's like, hey, this technology for sure will exist, customers want it, and customers will use it. And the technology is so powerful that there's a huge risk of it not being used well. The thing that stood out for me with HeyGen is that this was central to their strategy. It's like, hey, we're going to be the most trusted brand. Joshua is going to come in and talk about the details, but one of the things that excited me most about the company is understanding how important trust and safety is.
Joshua Xu: (52:08) So I will start with this. We never see trust and safety as a factor that slows us down. I always see trust and safety as a critical piece of the business; it's actually part of our product. Every time we roll out a new feature or a new experience we want to share with customers, there's always a session on trust and safety: how do we actually roll this out safely to our customers? And I would say trust and safety is critical to our business because we are serving some of the largest companies in the world. I can share a little bit about how it is done today and how we really think about it. So, first of all, about creation: we don't allow any political or celebrity figure to be created on HeyGen. And every single avatar or digital twin created on HeyGen has to have first-party consent in a video format. We also do a randomly generated dynamic code verification, as well as human review operations in the back, to make sure the system is not being misused. Another thing we do is that for every single video you are creating or localizing on HeyGen today, we have a combination of AI models and human reviewers together to really make sure the content you're putting out there complies with our policies and the terms of the product. For example, no hate speech, no harassment, no fraud, no misinformation, and stuff like that. Those are some of the guardrails and product design decisions we put together at HeyGen today.
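The layered review Joshua outlines, consent verification first, then automated policy classifiers, then a human-review fallback, might look roughly like this in outline. The classifier set, field names, and status strings are invented for illustration, not HeyGen internals:

```python
def review_video(request, policy_classifiers, human_review_queue):
    """Defense-in-depth review before a video is released.

    request: dict describing the submission (illustrative fields).
    policy_classifiers: dict of name -> callable(script) -> bool (True = violation).
    human_review_queue: list collecting items needing manual review.
    """
    # Layer 1: every avatar needs first-party video consent on file.
    if not request.get("consent_verified"):
        return "rejected: missing consent"
    # Layer 2: automated classifiers (hate speech, fraud, misinformation, ...).
    for name, classifier in policy_classifiers.items():
        if classifier(request["script"]):
            return f"rejected: {name}"
    # Layer 3: borderline or high-risk content goes to a human reviewer.
    if request.get("high_risk"):
        human_review_queue.append(request)
        return "pending human review"
    return "approved"
```

The design point is that no single layer is trusted alone: automated checks catch the bulk, and humans back-stop whatever slips through.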
Nathan Labenz: (53:48) It is really well done. I went through it with the goal of actually trying to break it, and I never did end up breaking it, in the most flagrant ways anyway. I was able to get a little bit of offensive speech out of it here and there. But in terms of the big things, where I upload the Trump video or the Biden video and try to get the avatar of a public figure, I was not able to do that. And I encountered several different, I don't know if you would agree with this characterization, but it felt to me like a defense-in-depth strategy, where there were multiple different checks happening. I think I systematically found quite a few of them, ultimately landing on the final one being a human review that's just like, yeah, sorry, we're not gonna allow this. When you talk about celebrity faces, this is one interesting question I had, and I wonder if there's something here that others could learn from. Is there any special tech? Did you have to invent the technology? How are you doing that? Who's a celebrity? How many celebrities are there in the world? How many celebrities are not allowed on the platform? And is there anything you're tapping into that others should be aware of, that they might also be able to tap into?
Joshua Xu: (54:55) Yeah. So I'll put it this way. First of all, particularly for, you know, celebrities, I think there are legit use cases for that, for sure. We just don't want to expose the free-form creation process in that case. They do have to contact us, and then we can involve our operations team to help them create the avatar or digital twin that way. So it's a different approach, you know, a little bit heavier in terms of process. The way we do it is that, obviously, we work with some vendors who have, you know, celebrity data, or generally an API that provides us a database. We also constantly have our own team update the database to ensure that we have good coverage.
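A hedged sketch of the kind of check described, matching a submitted face against a vendor-provided database plus an internally maintained one. Here each "database" is just a dict of name to embedding, and the cosine similarity is a stand-in for a real face-recognition model and a real vendor API:

```python
def is_public_figure(face_embedding, vendor_db, own_db, threshold=0.9):
    """Return the matched public figure's name, or None if no match.

    vendor_db / own_db: dict of name -> embedding (illustrative stand-ins
    for a vendor API and an internally curated database).
    """
    def cosine(a, b):
        # Naive cosine similarity between two embedding vectors.
        dot = sum(x * y for x, y in zip(a, b))
        na = sum(x * x for x in a) ** 0.5
        nb = sum(y * y for y in b) ** 0.5
        return dot / (na * nb) if na and nb else 0.0

    # Check both databases; a hit in either blocks free-form creation.
    for db in (vendor_db, own_db):
        for name, emb in db.items():
            if cosine(face_embedding, emb) >= threshold:
                return name
    return None
```

A gate like this would sit in front of the free-form avatar creation flow, routing any match to the manual, consent-verified process instead.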
Nathan Labenz: (55:46) Interesting. Okay. So there is a company. You're paying another company to provide that service?
Joshua Xu: (55:50) Yeah. Yeah.
Nathan Labenz: (55:52) Okay. That's interesting. I've been looking for one, and I have not really found one, so I'm gonna have to go do a little more digging on this. The reason I wanna promote this is that I see so much product surface area out there, you know, not HeyGen, but most of the rest of the space, honestly, that is so wide open to abuse that I've been reluctant to even disclose it at times, because it feels irresponsible to put more attention on it before it's solved. But then I talk to the developers and say, hey, here's what I did on your product, just FYI, this looks like a problem.
Joshua Xu: (56:26) K.
Nathan Labenz: (56:26) A lot of times they come back to me and they're like, well, I don't really know what you want me to do. You know? How am I supposed to detect every celebrity voice that somebody might come through with? So, yeah, I don't know. Maybe I'll have to see what I can find out about that, or maybe you can leave me some hints somewhere, but I definitely want to popularize this. If there's a good solution for people to adopt, and all they have to do is pay for it and use it, then I'm definitely keen to figure out what that is and try to promote it. Cool. This is fascinating. I think the big takeaways I have from this, and it's a common theme, are that this is all still early. Don't look at an avatar today and think this is the end of the line. Not only are they gonna get better, they're gonna get faster. They're gonna get real-time. They're gonna get more responsive. They're gonna get more contextual. They're gonna create a lot of different kinds of experiences, where the paradigm of what a video is, which we have kind of intuitively developed, ultimately maybe gets blown wide open. As always, the future is a little weirder and maybe cooler than we initially dare to guess. So I think it's another great reminder of that. And I do appreciate the care with which you guys have layered in all the different safety mechanisms. I think that is something people should truly come study in your product, look at how it's done, and take inspiration from for their own products. And I don't say that lightly. I've tested a lot of things, so to come out on top of my personal power rankings there is actually a legitimately real accomplishment. Anything else you guys wanna cover before we break for today?
Victor Lazarte: (58:08) I think we covered the most important things.
Nathan Labenz: (58:10) Okay. Cool. Well, I really appreciate this, guys. Joshua Xu, founder and CEO of HeyGen, Victor Lazarte, general partner at Benchmark. Thank you both for being part of the Cognitive Revolution.
Victor Lazarte: (58:20) Thank you for having us.
Joshua Xu: (58:22) Thank you.
Nathan Labenz: (58:22) It is both energizing and enlightening to hear why people listen and learn what they value about the show. So please don't hesitate to reach out via email at tcr@turpentine.co, or you can DM me on the social media platform of your choice.