The development of ultra-realistic human voices is upon us, and Mahmoud Felfel's Play.ht is leading the next generation of text-to-voice models. In this episode we discuss the challenges and opportunities of automating a more human voice, as well as concerns about deep fakes and user safety.
Check out the debut of Erik Torenberg's new podcast Upstream. This coming season features interviews with Marc Andreessen (Episode 1 live now), David Sacks, Ezra Klein, Balaji Srinivasan, Katherine Boyle, and more. Subscribe here: https://www.youtube.com/@UpstreamwithErikTorenberg
Timestamps for E10: Mahmoud Felfel of Play.ht
(0:00) Preview of Mahmoud on this episode
(0:55) Sponsor: Omneky.com
(1:45) Nathan clones his voice using Play.ht
(6:11) Why Mahmoud started Play.ht and the problem they tried to solve
(13:08) The job to be done for Play.ht & how they’re thinking about APIs and models
(24:45) Mahmoud breaks down the architecture of Play.ht
(29:30) How the use cases have evolved
(30:00) New markets and opportunities with creators
(37:00) Are we all about to become prompt engineers/directors?
(44:50) Roadmap to other languages beyond English
(48:00) Managing the compute
(52:00) If AI-generated voices become a commodity, what will happen?
(55:00) Why bigger companies are late adopters of AI tools
(56:30) The long-term moat of Play.ht and other applications
(1:00:00) Controversial voice-cloning and potential for societal abuse
(1:10:32) Commonly abused voices
(1:12:36) Rapid fire questions
*Thank you Omneky for sponsoring The Cognitive Revolution. Omneky is an omnichannel creative generation platform that lets you launch hundreds of thousands of ad iterations that actually work, customized across all platforms, with a click of a button. Omneky combines generative AI and real-time advertising data. Mention "Cog Rev" for 10% off.*
Twitter:
@CogRev_Podcast
@_mfelfel (Mahmoud)
@labenz (Nathan)
Join thousands of subscribers to our Substack: https://cognitiverevolution.substack
Websites:
cognitiverevolution.ai
play.ht
omneky.com
Full Transcript
Mahmoud Felfel: (0:00) We're working with the South Park studios. They're making a new episode, and they'll be using one of our voices for one of the characters. That's super exciting. Oh my god. This voice is being used in production, actual shows. We started with traditional text to speech use cases. The main driver for us to do that investment, even though it was very risky, was because we wanted to get into that market of the human voice, all voice over actors, and the opportunities this can open. This was the first time ever you're automating the human voice.
Nathan Labenz: (0:32) Hello, and welcome to the Cognitive Revolution, where we interview visionary researchers, entrepreneurs, and builders working on the frontier of artificial intelligence. Each week, we'll explore their revolutionary ideas, and together we'll build a picture of how AI technology will transform work, life, and society in the coming years. I'm Nathan Labenz joined by my cohost Erik Torenberg.
Erik Torenberg: (0:55) Before we dive into the Cognitive Revolution, I want to tell you about my new interview show, Upstream. Upstream is where I go deeper with some of the world's most interesting thinkers to map the constellation of ideas that matter. On the first season of Upstream, you'll hear from Marc Andreessen, David Sacks, Balaji, Ezra Klein, Joe Lonsdale, and more. Make sure to subscribe and check out the first episode with a16z's Marc Andreessen. The link is in the description. Omneky uses generative AI to enable you to launch hundreds of thousands of ad iterations that actually work, customized across all platforms with a click of a button. I believe in Omneky so much that I invested in it, and I recommend you use it too. Use CogRev to get a 10% discount.
Nathan Labenz: (1:41) Mahmoud Felfel is the CEO of PlayHT, the AI powered text to voice generator. We've told you episode after episode that transformers are working for everything, and that includes generating what PlayHT calls ultra realistic human voices. When Mahmoud started PlayHT just 3 years ago, serviceable but obviously robotic text to speech services were the standard and neural voices were just getting started. Just 2 years ago, a custom voice with a leading provider required hours of audio and cost $10,000. Today, I can clone my voice with just 10 minutes of audio at minimal cost in less than an hour. I can hear myself reading content just like this. And just as in other subfields of AI, progress is not slowing down. Mahmoud talks about the next generation of text to voice models working more like image generators do today, with a rich prompt space that allows you to effectively direct the quality of voice output. If there's one AI development society has predicted and which is very intuitively dual use, it's deepfakes. But somehow they are already here, and we are definitely not prepared. Already on TikTok, I'm seeing creators say that their families have been scammed and starting to recommend family passwords that impersonators wouldn't know. Now I was genuinely impressed with Mahmoud's approach to user safety and preventing abuse. To his credit, he does have a multipronged approach, but it seems like there's only so much builders like Mahmoud can do to protect us from the downsides of AI, and we're likely going to have to figure out how to deal with the world of convincing deepfakes quickly. And as much as that does worry me, I also share Mahmoud's amazement and delight over the technology and the products that he's able to build with it. Making this was easy, quick, and genuinely fun. And I hope you enjoy my conversation with Mahmoud Felfel. Mahmoud Felfel, welcome to the Cognitive Revolution.
Mahmoud Felfel: (3:44) Thank you. Thanks for having me.
Nathan Labenz: (3:46) Yeah, my pleasure. I'm really looking forward to this conversation. You obviously are the founder and CEO of PlayHT, and you guys are in what would traditionally be called the text to speech business. You might also call it the voice generation business, as we'll get into a little bit later. You might even call it at least the audio side of the deepfake revolution that's coming our way. So I'm really excited to talk to you. I've personally been a customer, and I would say a pretty demanding, exacting customer, of the kind of technology that you're building. At my company, Waymark, we make TV commercials for local and small advertisers. And it's really appealing to use technology to do that: it gives speed, it gives cost effectiveness, which is hard to match with any sort of service. But no matter how small the business, no matter how small the budget, folks want to sound good. They do not want it to sound like it was made by a robot. So I've personally spent quite a few hours going around and shopping for all the products, testing all the APIs. And what you guys are doing with the ultra realistic voices is really among the very best products out there today in terms of how it sounds and how easy it is to use. So I'm really interested to get into all of that technology and start to pull apart how it works. But first, I just wanted to ask you a question about the start of the business, because I think there's maybe something pretty interesting here. When I was first shopping around and used PlayHT, it wasn't quite clear to me. But I got the sense from using the product that the first thing that you had launched was essentially an API wrapper around text to speech services that were already offered by the big tech companies: your Googles, Amazons, maybe even an IBM or Microsoft. I don't know who all had offerings that you guys were making available. I'd love to hear the backstory of that. That seemed like a really smart strategy to me because, obviously, those things are reliable and they were pretty close to state of the art, if not state of the art, and they were fast. Also, reselling them, you have an unbelievable window into how customers like or don't like those services and where they're falling short. So tell me about that origin story.
Mahmoud Felfel: (6:11) You're right about what you said. We started by using these third party APIs from AWS, Google Cloud, IBM.
Mahmoud Felfel: (6:21) IBM was the first one. IBM Watson, even before AWS released Polly and Azure also released the text to speech voices. We started with that one. The reason we started with this product is that me and my cofounder were working together as software engineers. We were always tinkering with different things, trying to solve different problems. We started maybe over a couple of years, 10 ideas all failed. And then at the end, we were literally taking a break of working on anything. I love listening to audiobooks. I have 2 subscriptions to Audible or something. But I didn't find the same with articles. At the time, that was maybe 4 years ago. Medium was pretty big, and there were many writers coming to Medium. They were finding this one place where all writers were there, and I'm reading a lot of Medium articles every day. I started to think, okay, I need to listen to this when I'm running or doing anything, like how I do with these audiobooks. Right? But there was no way to do that. And then I started to look into, okay, what's available? Can this be automated? And then I found IBM voices, and they were actually pretty good. Most of the voices in the market at that point were very robotic. They looked like you cannot consume them. They're not consumable for people. And I wouldn't stand listening to that for an hour or something. But the IBM voices were actually pretty good, and they're still robotic. They're Alexa-like voices or Siri, but you can tolerate listening to them for some time. And so we built the... I suggested that idea to my cofounder, and we started building it. We started with a Chrome extension for Medium, basically building it for myself. And we put it on Product Hunt. Actually, we didn't put it on Product Hunt. Someone found it in the Chrome store and put it there. And we got so many people using it, asking for mobile apps. We built mobile apps, and there was a lot of traction on that, listening to audio articles. We started to think about, okay, this is becoming maybe a little bit big. Let's just focus on this and make it a business. We started to think about monetizing it, and then we started to see it was very hard to ask people to pay for this. Consumers want to consume more and pay way less. And if you go the route of audio ads, that also wasn't very appealing to us to try to source ads. It's annoying even for us to have audio ads, so we didn't want to add this to the product. But at the same time, we noticed something. Many of the users of our application and Chrome extension were actually people who already have publications and have blogs, or even actually using Medium themselves as writers, and they have big audiences. And they're asking us, okay, this is very useful. I love it. I want to listen. I want my readers to be able to listen to my content. So we saw opportunities there for having this as a B2B product or a SaaS product where we offer this audio articles product, where writers and publications in general can add this. And now there are many publications. After that, many publications have this. Now you can listen to the article when you go there. So that was our target at that point. Yeah. And so we did this product. We started getting some growth there. It wasn't as big as what we hoped for, but we started to get a feeling of building the B2B product, SaaS business, and what people are expecting. And the problem there is that we started to find the audio articles product was more of this thing of vitamins versus painkillers. 
It was a nice-to-have. Most people weren't searching or trying to find a solution for an audio article problem. Right? It's a nice-to-have. If you have a publication, the writing itself, generating the actual articles, that's a problem. But once you have that, adding audio articles on top of all of this seemed like a nice-to-have thing. We didn't get huge growth there. And at that time, COVID hit, and we had built this on top of these APIs. We had built that editor that has basically all the capabilities of these APIs - customizing the SSML and everything there - taking all of this and making it visual, so people can easily create content using these text to speech engines. And we found that there were many customers coming and using this for other use cases. And when COVID hit, some of these use cases were in learning and development and others. We can get into this more, and actually into why we started investing in our own in-house model. It's related to the point that you just mentioned: regardless of whether it's a small or big company, people want the best quality. Right? And we started to see this in our users. That was the main reason we started to invest in training our own models, because we found, okay, these generic APIs from AWS or Azure, they are good, but they are not built for the use cases we're trying to serve, and they are still very robotic. So we saw that if we want this business to really grow, we either have to sell it and do something else, or we have to solve that quality issue. We have to make this voice human-like. And we started investing in training our model. We can get into this more.
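For readers curious what "an editor built on top of these APIs" looks like in practice, here is a minimal illustrative sketch of calling one such third-party engine, AWS Polly, with SSML markup via boto3. It assumes boto3 is installed and AWS credentials are configured; the voice and text are arbitrary examples, and this is not PlayHT's actual code.

```python
# Illustrative sketch: calling a third-party cloud TTS engine (AWS Polly) with SSML,
# the kind of API the early PlayHT editor wrapped and made visual. Not PlayHT's code.
import boto3

polly = boto3.client("polly", region_name="us-east-1")

ssml = """
<speak>
  Welcome back to the blog.
  <break time="400ms"/>
  Today we are talking about <emphasis level="strong">text to speech</emphasis>,
  read at a <prosody rate="95%">slightly slower pace</prosody>.
</speak>
"""

response = polly.synthesize_speech(
    Text=ssml,
    TextType="ssml",        # tell Polly the input is SSML markup, not plain text
    VoiceId="Joanna",       # one of Polly's stock voices
    OutputFormat="mp3",
)

with open("article_audio.mp3", "wb") as f:
    f.write(response["AudioStream"].read())
```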
Nathan Labenz: (12:03) Yeah. That's really interesting. You're one of a couple of different entrepreneurs now that we've talked to who actually started a company that was not originally an AI company and has become an AI company. You guys started with a problem, an observation, it sounds like, above all, creating an accessible product experience that people preferred to having to go to AWS Polly and figure that out directly. And then, you're midway and all of a sudden, what is possible with AI changes dramatically. So did you run into any challenges in reselling those things? Were the platform companies nice to you as you went about doing that? And how much of that is still the usage that you see in the product today versus all the new stuff that we'll talk about next?
Mahmoud Felfel: (12:57) Yeah, actually, we built a very good business. We're growing very fast on these APIs because we're mostly solving problems. I mean, at the end of the day, we are in this tech bubble, so we're really interested in tech and fancy stuff. But end of the day, consumers don't care about what's happening in the background - is it your model, is it an API, is it whatever. They have a problem, and they want it solved and solved well. There's a jobs to be done approach where the job here is to create good quality audio out of text. And these were good enough and fast enough to solve that problem to a point. And that point was the quality. Whenever we started to look into the high end use cases - with audio ads, with gaming - most gaming companies today, they don't use text to speech in production. They use it only for scratch audio. When they are developing the game, they just fill in some characters with voices. But when they go to production, they get actual voice over actors to get that feeling and expression and all of that. The same for media production in general. In all the high end use cases, even in audio articles, whenever we try to reach out to New York Times or any big publications, they care about their brand. They complained about quality. "This is not good enough for our users." So we found this is the biggest hurdle. If we manage to solve that problem, this will be huge because then you move from the traditional text to speech market - which is learning and development, IVR, all these things - to the human voice market. And that's a huge market. The use cases are huge there, and the potential also. These platforms, there was no problem at all. They provide an API. We were able to scale very well, and we're still using them until today. We launched our own model - I mean, we started working on it more than a year ago, but we launched it around 5 months ago. And over these 5 months now, our models are serving around 60% of our usage. And we have conversions every month, and now 60%. I mean, it didn't actually drop. The old model usage didn't drop. It has been growing, but also our models came up, and now we took over that. And the reason people are still using the old models, these APIs, is mainly because of languages. Our model now is still only in English, and these other models, they have 130-something languages and accents inside each language and everything. So that also has a lot of usage.
Nathan Labenz: (15:48) I'm reminded of the very first episode we did, which was with Suhail Doshi from Playground. He told us a couple of fascinating stats. One was that a full 10% of their users make more than 1000 images a day, which I still kind of shake my head at and try to imagine what that really means. But then, also, he said at one time, they found a latency bottleneck in their system, fixed it, and generation got twice as fast. And he said they immediately saw essentially a doubling of usage with that change, kind of a step change. People were just ready to use all of the generation that they would give them. So your story is kind of similar in a sense that you had this kind of breakthrough of quality with the new ultra realistic voices, and it just unlocks more use cases, more market opportunity. And it's really fascinating to know that the kind of big tech, last generation stuff not only continues to play a significant role, but even continues to grow. So that is very, very interesting.
Mahmoud Felfel: (17:04) Well, I mean, we have something similar also. I think people are starting really to use this and change their content creation flows completely because of these technologies - what Playground is doing and we're doing and others. We have some users that told me on calls, "We are spending 12 hours per day in your editor." I was so surprised. I have not spent 12 hours a day in an office. They're creating complete audiobooks, podcasts, and they're very careful, choosing characters and editing everything. They're really designing the experience. Initially, we thought people would use this just to replace, to be a quick thing to get something out. But people are very deliberate about these things and trying to think about the listener, how they will listen, and having conversations with different voices, and the voices they choose, and how the conversation would flow from one character to another and all of these things. And that's just so inspiring to see how people are using these tools.
Nathan Labenz: (18:06) So tell me about the process of, I guess, first of all, becoming an AI company. I'm sure you had to learn a lot yourself. I'm sure you had to go hire new people that brought critical skills to the team. And then it sounds like you spent more than half a year working on your first in house model before you were able to launch it. So tell me about the process, but then also the technology itself that you've built.
Mahmoud Felfel: (18:33) Actually, from the beginning, we have been very involved. Me and my co-founder are both software engineers. So the most natural thing was not to use the APIs from these platforms like AWS and others, but to just go and train and deploy our own models. But the problem there is - and many other companies in the market did exactly that, they just went and trained their own models - what we found is that whenever we took these models, FastSpeech, Tacotron, the ones available in the market, and trained them, the best results you could get were similar to what you get from the API. So it just didn't make sense. You invest, you have a machine learning team in house doing a lot of research and training and stuff like this, and at the end you get the same quality and the same value you're getting from these APIs. At the beginning, that didn't make sense, and we continuously - I mean, for almost every open source project or research paper that came out about text to speech in those 2 years, we were in touch with the person who did it, and we had a call with them. And whenever someone open sourced something, we had to try it. And we never found something that was really a breakthrough. But at that time, 2022, we started to see that shift in the architectures. DALL-E was there. GPT-3 had been there for some time, and diffusion models, and there was that shift. Now, the problem with standard text to speech models is that they are not self supervised. They're supervised learning, where you have, say, 20 hours of voice recorded in a studio, and then you take that and train a model on those 20 hours. So the model will generalize on how to speak like your voice, Nathan's voice, and that's it - not how to speak like a human. And that's why, with voice recorded in a studio from one person, it is really hard to make the model generalize to be able to speak like so many other voices and be very emotional. Basically, if you want the model to speak in a different accent, you have to get someone speaking that accent and train the model on that accent. If you want the model to speak with a specific emotion, you have to go and get a dataset of someone speaking that way, label it like this, and do some conditioning on the model to be able to generate that emotion, which was a very hard process, especially if you want to scale to so many voices. And then if you want to go into voice cloning, that's almost impossible. What we found at that time is that with self supervised learning, you can train this on a huge dataset of hundreds of thousands of voices, maybe. And the good thing is there are already very large datasets available, like Common Voice from Mozilla and many public domain audiobooks. And, I mean, you have the internet, right? You can also use a lot of data available on the internet. And with self supervised learning, you don't need that data to be the best quality. You're basically training the model how to speak like humans. And then we trained this large language model - it's around, I think, almost 300 million parameters, so it's actually a medium-sized model. Right now, we have another one that's actually large. I can't tell you every detail of why we trained another model. That first one actually wasn't that large.
But the reason it got very good is that the dataset is big, and then we started to see some stuff that we didn't expect to happen. The model started to have these representations about emotions. That stuff was mind blowing to us. You have the same voice, everything, and you have one text in a very sad tone, "Oh, no," and another text that's like, "Oh my god. Wow." And the voice would completely change between those two just by changing the text. The model had started to have these representations of what a sad voice sounds like, what a happy voice sounds like, what an excited voice is. Because in all the pairs of text and clips we trained it on, it must have seen that when someone is saying, "Oh, no," it's usually a very sad voice. So it started to attach these styles to that type of text, and now it can extract these emotions and this style just from the text. So when we saw that change in architecture, transformers and large language models, that was the reason we started to invest in this. And it took us - actually, we had the model very early. We had some good results, but it was very slow to generate. It took maybe 20 minutes to generate a minute of audio, which is impossible to use. Right? And we spent maybe 6 months - until now, actually, we're still optimizing the performance of the model. Actually, probably in the next couple of weeks, we will have a real time streaming API. That's something that we have been working on. It was very slow at the beginning, and now we're coming to real time streaming. It's just so exciting.
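As a rough illustration of what a real-time streaming text-to-speech API looks like from the client side, here is a hedged sketch using Python's requests library. The endpoint URL, header, and payload fields are hypothetical placeholders, not PlayHT's documented API.

```python
# Hedged sketch of consuming a streaming TTS endpoint from the client side.
# The URL, header, and payload fields are hypothetical placeholders.
import requests

API_URL = "https://api.example-tts.com/v1/stream"   # placeholder endpoint
payload = {"text": "Hello and welcome to the show.", "voice": "narrator-01"}
headers = {"Authorization": "Bearer YOUR_API_KEY"}

with requests.post(API_URL, json=payload, headers=headers, stream=True, timeout=60) as resp:
    resp.raise_for_status()
    with open("stream_output.mp3", "wb") as f:
        # Write audio chunks as they arrive instead of waiting for the whole file,
        # which is what makes near-real-time playback possible.
        for chunk in resp.iter_content(chunk_size=4096):
            if chunk:
                f.write(chunk)
```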
Nathan Labenz: (23:53) Yeah, I love your enthusiasm for this too. And I feel the same way. It's crazy just how much things have improved in such a short time. We're seeing multiple orders of magnitude improvement in a year or two years on some of these things. Just really incredible to see. What can you tell us kind of about the architecture? It sounds like, from what you're saying, that it's a transformer based approach, as increasingly everything is these days. Do you use any sort of off the shelf language model as kind of a base to get started? Do you have, for all the different voices that you have, is that kind of a single core model with different decoders at the end that kind of create the ultimately different sounding voices? You can go into as much detail on this show, as much detail as you want to share.
Mahmoud Felfel: (24:46) Yeah, it is a transformer based model. And honestly, it's not like it's a secret. I mean, I will get into a lot of details, because it's not one model. It's a pipeline of things that generates the audio. All the models that work in text to speech usually don't use waveforms. You don't use raw audio when training or using the model itself; you use mel spectrograms, which are a representation of audio that can be compressed and used much, much more efficiently. And then at the end, you use vocoders to turn that into high quality audio, and vocoders have been around for a very long time. As for the main model we're using, I've seen some of these papers - VALL-E and AudioLM and these models. I think people are starting to come up with similar ideas: transformers trained on a large dataset. I think VALL-E was trained on 40,000 hours. And I'm not sure about DALL-E; I don't think they released a lot of details. But I think these models will start to come up, and they're basically very similar ideas. There are a lot of details inside the model, inside this pipeline, to make it work, because one problem with these large language transformer models - which is maybe fine for images but is terrible for something like voice - is that they are nondeterministic. But with something like voice, you want it to be deterministic. You want a specific word to be pronounced almost the same way every time, or a specific acronym, for example. So this also introduced a lot of challenges that we have been solving and working around. But yeah, it is, as you say, a transformer based model, a large language model. And I think the trick is mostly also in the dataset: the type of data and how to load this data. Another problem we faced when training the model is that there is no infrastructure out there for training large language voice models. So we had to literally build our entire infrastructure from scratch: how to process this data and how to load it onto multiple GPUs for training. And the dataset also - the diversity in that dataset, what voices, what accents, how long the clips you're training the model on are. There are so many details there that are also important in generating something that's high quality. You asked how the different voices work. The way it works now is that you can sample from the model itself. The model can talk in basically 100,000 voices or something, so you can sample almost any voice from the model. But what we did to make sure we have the highest quality: we actually went and got some voice over actors who we know have a specific accent or something specific we need to have on our platform. We got 4 or 5 hours of each one, and then we fine tuned the model on these voices. So now there is a copy of the model that can speak like that person. And it needs only 1 hour to 4 hours, something like this. So these other voices we have now on the platform, they're all voice over actors that we worked with, and we got them to record their audio for us in different styles. Then we fine tune the model on that and make it available to users. But everything is built on the same base model.
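To make the pipeline described above concrete - a transformer acoustic model predicts a mel spectrogram from text, and a separate vocoder turns that spectrogram into a waveform - here is a minimal conceptual sketch. The AcousticModel and Vocoder classes are stand-ins that only fake the shapes involved; they are not real libraries or PlayHT's models.

```python
# Conceptual sketch of the described pipeline: text -> acoustic model -> mel spectrogram
# -> vocoder -> waveform. The classes below are stand-ins, not real models.
import numpy as np

class AcousticModel:
    """Stand-in for a large transformer trained on text/audio pairs."""
    def text_to_mel(self, text: str) -> np.ndarray:
        # Real models emit a (time_frames x mel_bins) spectrogram; we fake the shape.
        n_frames = max(1, len(text)) * 5
        return np.zeros((n_frames, 80), dtype=np.float32)

class Vocoder:
    """Stand-in for a neural vocoder that renders a spectrogram as audio samples."""
    def mel_to_waveform(self, mel: np.ndarray, hop_length: int = 256) -> np.ndarray:
        # Real vocoders upsample mel frames into audio; we fake silence of the right length.
        return np.zeros(mel.shape[0] * hop_length, dtype=np.float32)

def synthesize(text: str) -> np.ndarray:
    mel = AcousticModel().text_to_mel(text)   # text -> compressed audio representation
    return Vocoder().mel_to_waveform(mel)     # mel spectrogram -> playable waveform

print(synthesize("Oh no.").shape)   # in a real model, the same text carries its own emotion
```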
Nathan Labenz: (28:29) When you mentioned the spectrogram, some folks might have seen a recent text to music project that circulated quite a bit, where you could see the spectrogram that would then be translated into the music. It's quite a fascinating pipeline, that it ultimately goes through this visual representation that then gets translated into audio. It's cool to know that you're doing something similar there.
Mahmoud Felfel: (29:00) Amazing project to see. Yeah.
Nathan Labenz: (29:01) So what are the big use cases that you're now tapping into with the ultra realistic voices? I gotta give a plug for the podcast that you're helping to produce with Reid Hoffman. I think he calls it Chats with GPT, where he's literally sitting, virtually, metaphorically, whatever, alongside ChatGPT and having a conversation with it that turns into a podcast, PlayHT powering the audio of that. Is that a very random far out use case, or are you seeing more and more of that kind of stuff? What are people doing with it now?
Mahmoud Felfel: (29:38) Yeah. Actually, we see a lot of people using that for podcasting also. Another interesting thing is that, probably by the time this episode airs, it will already be live: we were working with the South Park studios. They're making a new episode, and they'll be using one of our voices for one of the characters. That's super exciting. This voice is now being used in production, actual shows, HBO. So this is just another example of these use cases. And I think this is exactly what we wanted to do. When we started with the use cases for L&D and IVR and the traditional text to speech use cases, the main driver for us to make that investment - even though it was very risky, no one had done something like this before, and when the model was taking 20 minutes to generate, we didn't know if it was even doable to reduce that to something usable - was that we wanted to get into that market of the human voice, all voice over actors, and the opportunities this can open, because this was the first time ever you're automating the human voice. Anything that can be done by a human voice can now be automated. And you can think about gaming, production use cases for gaming. Right now, we're working with multiple gaming studios where they want to create fully dynamic experiences. Usually in games, some characters have voices, and all the NPCs just get text bubbles. It doesn't make economic sense for them to get a hundred voice over actors recording all the characters. And now the other layer on top of this: they want players to have a conversation with the character, with the NPC in the game. And that's something that's just not doable. Even if you have all the resources, you can't make this work. Similarly, we're working with another very famous streaming service that creates podcasts. They want to use this to create dynamic experiences for people listening to the podcast: "Hi, Nathan. In today's episode..." So these markets, it's not something that already exists and we're replacing it with voice. It's just opening up completely new markets. So we're very excited to see these new use cases coming up. And since we launched this, we have been seeing a lot of creators. We're getting a lot of YouTube creators now coming and using these voices every day. We have usually seen two use cases there. One is that it's a non-English channel, or the creators of the channel are not English speakers. And now they're telling us it's like they've hired a couple of people with British and American accents on the team who can now create videos for them. It's so empowering. And there's another use case where creators are cloning their own voices, and they're starting to create more content. So instead of having to go to the studio a couple of times a week to do an interview or record a video or a product review or something like this, now you can just take the video, write the text, and then have the voice on top of the video in minutes - something that used to take a long time to do. And we are very interested in the use cases around voice cloning as well. That's one of the things that we have been investing in. And yeah, we're very excited to see what people are coming up with.
Nathan Labenz: (33:10) Well, I just want to talk a little bit about the challenges that you guys are still facing. Obviously, this technology is maturing rapidly, and the fact that you're doing stuff with South Park and HBO and all that, that's incredible. I want to just understand what some of the trade offs are that you are facing and what some of the challenges are as you go toward this more ultra realistic approach.
Mahmoud Felfel: (33:41) The reason these other models have all these features - SSML, and you can control styles and such - is because they are very mature. They have been in the market for a very long time, and people have been working on them for a very long time to add all these capabilities. But on the contrary, with the large language model approach, we literally just did this 5 months ago. And we have been so focused on performance, optimizing it to actually work in almost real time, or as close to the existing models as possible, so it will be usable for users. Because when I tell you someone is staying in the editor 12 hours a day working on this, if each generation takes a minute longer, that becomes a much longer process for them, and it breaks that creative iteration. The feedback loop is very important: you create something, you listen to it, you see how it sounds, you try a different voice, different characters. We wanted to make that feedback loop as fast as possible, and I think we did it. But right now - and this is the reason I told you we trained another model that's actually much bigger and solves many of the problems the first model had - one of those problems is pronunciations and control over the voices. Right now, we have a lot more control over adding things like emphasis to specific words, or adding pauses, or making the voice speak in a specific style, for example an audiobook narrative style, or a commercial style, or a training, instructional style - cases where it's still the same voice, the same everything, but it will not be as dramatic or as emotional if it's instructional, for example. And then there were some styles like meditation where it was actually very hard to get the voice to completely change its way of speaking. But this new model is solving most of these problems, and over time we're adding more and more of this control. Because if you're thinking about this model from the user's side, thinking about the job to be done, this model for the user is like a human voice over actor. And the user needs to tell it: no, say this thing with a different intonation. Be more excited. Raise your voice here. This pronunciation needs to be a little bit different. It's how you would actually direct a human voice over actor. And we are seeing that feedback from users now. We were surprised at the beginning - I mean, this is way better than the other voices, so why are people complaining about these things? Because now they are not comparing it with text to speech, they are comparing it with humans. And that was a complete shift in mentality: okay, actually this needs to be directable. You need to be able to direct that voice to do exactly what you want, like what you're saying with the SSML. Now we are actively working on these things, and I think in the next few months, all of that will be out, and it will be almost on par with the existing voices.
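As an illustration of the kind of "directable" controls described above - emphasis on specific words, deliberate pauses, an overall speaking style - here is a small sketch of what a structured request might look like, plus a helper that turns it into the sort of note you would give a human voice actor. All field names here are hypothetical, not PlayHT's actual schema.

```python
# Sketch of directable voice controls: per-word emphasis, pauses, and a speaking style.
# The request shape and field names below are hypothetical placeholders.
request = {
    "voice": "narrator-m-01",
    "style": "audiobook",    # e.g. "commercial", "instructional", "meditation"
    "text": "Welcome back. Today we go deeper than ever before.",
    "directions": [
        {"word": "deeper", "emphasis": "strong"},    # stress a specific word
        {"after_word": "back.", "pause_ms": 500},    # insert a deliberate pause
    ],
}

def direction_notes(req: dict) -> str:
    """Turn structured directions into a plain-language note for the 'actor'."""
    notes = [f"Read in a {req['style']} style."]
    for d in req["directions"]:
        if "emphasis" in d:
            notes.append(f"Put {d['emphasis']} emphasis on '{d['word']}'.")
        if "pause_ms" in d:
            notes.append(f"Pause {d['pause_ms']} ms after '{d['after_word']}'.")
    return " ".join(notes)

print(direction_notes(request))
```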
Nathan Labenz: (36:37) So what do I have to do to be a beta tester? That's my first question. I kind of hear from what you're saying there that your vision for this next generation is maybe less about SSML, which makes sense given what you've told me about the market, right? You don't see the big opportunity in terms of taking share from systems that are still in production that do support SSML; rather, it sounds more like we're headed for kind of prompt engineering for voices. So I'm imagining myself describing the moment that a grandfather first saw his grandbaby, and getting the voice to come out with the appropriate emotion from those kinds of qualitative, scenario-narrative type descriptions, which is, I think, what a director on set would say to an actor. They certainly don't give them SSML notation. So am I getting where you're going there? Are we all about to become prompt engineers slash directors?
Mahmoud Felfel: (37:42) Exactly. I think our goal eventually is that we have something where people can totally control their voice. And as you spoke earlier about the transition from a text-to-speech company to an AI company, in text-to-speech, you have a specific set of voices that turn any text into that voice's speech. But now you want to have a lot more control in designing the speakers, the actual voices. So, I want this voice to sound a certain way. This is actually what happens. I want a voice that sounds like Optimus Prime, the deep voice that's a little bit slower. This is how people actually think about content creation. And if you think about the future of content creation, it would be something like this. People are just becoming narrators, where they can just describe what they want, and then it will be generated, and they start using it instead of them actually sitting and doing the work themselves. They would become just describing it, similar to what's happening with images now. You're just describing what you want, and you generate it, then you iterate over it again and again until you have the exact things that you had in mind. The things that used to take days from artists or designers can be done in minutes if someone has the right ideas on how to iterate and guide this machine or this model in the right direction of generating what you want. And I think the same is happening with voice, where you will be able to design these speakers and create voices of exactly what you want and then start using them. So I totally agree with you. I think this is where it's going.
Nathan Labenz: (39:21) You mentioned briefly earlier the fact that these models are in some sense nondeterministic, and there is some level of just GPU indeterminacy, which is basically irresolvable at the software level right now, or at least would be extremely difficult to avoid. But I also understand that that's pretty rare and only comes up when values are very close to rounding points anyway. So is that what you mean by it? You can't make them deterministic?
Mahmoud Felfel: (39:55) Okay. There are two points there. The first thing, I'm comparing this with the previous generation of models, which are very deterministic. If you put in the same text with the same voice and generate it 10 times, it will be identical. There'll be no difference. But with this thing, if you put in the same voice and generate 10 samples, one sample will be different from the other. There might be a very slight difference. Sometimes it's a bigger difference in the way a specific word is said or something. And that nondeterminism with text-to-speech means that the user can't just come drop in an audiobook, click convert, and forget about it. They need to sit and make sure it's actually exactly what they want. And that introduces a problem, depending on the use case. In some use cases, people actually want to do exactly that. They want to come with a big document, drop it there, and have something that works 100% and says things exactly like what they want in the same specific styles. But in some other use cases, it's actually a feature, where you want to generate again and again and see which intonation you like. Compare it with images, where you generate 10 images, and you can maybe choose the third one because it's exactly like what you had in mind, and then you iterate more over this. It's the same case here. So we allow users to generate multiple samples, and they listen to them, and they choose the right one. And the good thing about this is that this data is also very valuable for learning something about what people prefer to use in different scenarios. So this is the point about nondeterminism there. Another point is with things like acronyms, where sometimes when you generate again, the model might mispronounce, because there are some words that the model never heard before - just a company name or some weird acronym or something that might be hard for it to get right every time. Now we have people on the team who have long experience in speech processing, and we started doing some stuff that the traditional models used to do: making sure we have a phonetic representation of words and pronunciations, and trying to freeze that stuff in the model so it will almost always generate those words that way, while still keeping the nondeterminism for the other stuff going on. And the good thing about voice, if you again compare it with humans - if you ask the same person to read the same sentence 10 times, there will actually be a difference. It will not be the same all 10 times. It might be slight, but there is a difference. And people are okay with that. But it becomes an issue with pronunciations, or if these random generations for specific words are not carrying the right emotions or styles they want. For that stuff, we need to give them the control to do this when they want. But by default, users should be able to discover these different styles in this randomness and select the ones that match what they think is the right way of saying it. Again, matching exactly how humans speak.
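Here is a minimal sketch of that workflow, assuming a placeholder synthesize() call: generate several takes of the same line so the user can pick one, and pin pronunciations for acronyms the model may not know. Nothing below is PlayHT's real API.

```python
# Sketch of the described workflow: multiple takes per line, plus pinned pronunciations
# for acronym-like words. synthesize() is a placeholder, not a real TTS function.
PRONUNCIATION_OVERRIDES = {
    "PlayHT": "play H T",   # spell out names the model may not have seen
    "IVR": "I V R",
}

def apply_overrides(text: str) -> str:
    for written, spoken in PRONUNCIATION_OVERRIDES.items():
        text = text.replace(written, spoken)
    return text

def synthesize(text: str, seed: int) -> bytes:
    # Placeholder: a real call would return audio; different seeds give different takes.
    return f"<take seed={seed}: {text}>".encode()

def generate_takes(text: str, n: int = 4) -> list:
    prepared = apply_overrides(text)
    return [synthesize(prepared, seed=i) for i in range(n)]

takes = generate_takes("Welcome to PlayHT. Your IVR just got friendlier.")
chosen = takes[2]   # in a real editor, the user listens and picks their favorite take
print(len(takes), chosen)
```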
Nathan Labenz: (43:14) On a technical level, is there a setting that you could also think about exposing that could be the equivalent of a temperature?
Mahmoud Felfel: (43:23) Actually, yes. And this new model that we're creating has actually a lot more than just temperature and specific state you want the model to use or something. Things like the voice guidance. When you try to clone a voice, for example, you want it to be identical. How much resemblance you want it to be to the original voice versus to be a little bit like it. You want it to preserve the same accent. You want it to be exactly like that voice or not. So there are a lot more when it comes to voice and some stuff related to some attributes of the voice itself, like its style, its gender. For example, you can add to the voice, you can add it to make it female, but if the main voice is male, it will become a little bit feminine. But it will still be a male voice, but there's a lot of good stuff that people can start using and tweaking exactly how they want the voice to sound and just start using this. So that's stuff that we're actually starting to expose to users to use and be able to have more control on designing these speakers that they'll be using.
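To illustrate how controls like temperature and voice guidance might be exposed to users, here is a small hypothetical settings object. The parameter names and ranges are assumptions for illustration only, not PlayHT's API.

```python
# Hypothetical settings object illustrating the kinds of controls mentioned above:
# temperature (how much takes vary) and voice_guidance (how closely a clone should
# stick to the reference voice). Names and ranges are assumptions.
from dataclasses import dataclass

@dataclass
class VoiceSettings:
    temperature: float = 0.7      # lower = more repeatable takes, higher = more variety
    voice_guidance: float = 0.9   # 1.0 = match the reference voice as closely as possible
    style: str = "conversational"

    def validate(self) -> None:
        if not 0.0 <= self.temperature <= 2.0:
            raise ValueError("temperature out of range")
        if not 0.0 <= self.voice_guidance <= 1.0:
            raise ValueError("voice_guidance out of range")

settings = VoiceSettings(temperature=0.4, voice_guidance=1.0, style="audiobook")
settings.validate()
print(settings)
```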
Nathan Labenz: (44:33) This conversation, I mean, just the way you're speaking about your product, does actually remind me a lot of talking to Suhail for the first episode just because control, control, control. We've got this amazing technology now, it's blowing people's minds. And now, how do we rein it in and really make it work for us? You guys obviously have different versions of that challenge, but it's fascinating to hear conceptually how much commonality there is. Let's talk about languages just for a second. Is it just a matter of lack of training data that, maybe also market, but from a technical standpoint, is it just lack of training data that prevents you from expanding the ultra realistic voices to other languages?
Mahmoud Felfel: (45:14) Actually, just the training data. You don't need the same amount of data for other languages. We have created internally some Japanese and Portuguese voices, and the result was actually better than what's available in AWS and these APIs only from maybe 10 hours of audio because the base model knows how to speak very well like humans. Now you just need to train it how to speak with a Portuguese accent, for example. The other challenge there is because our model wasn't trained on phonemes, so sometimes these other languages will have specific types of characters, especially with the non-English, the voices that don't have the Roman scripting. They will have Japanese or these languages. They will need to train what's called a G2P or a phoneme encoder to be able to understand that script when you give it to the model to translate. And once you do that and the model can just understand phonemes, it doesn't matter. You can just train it on any language or fine-tune it in any language. Another thing we're doing now is we are actually going to train a large language model from scratch that's multilingual. The reason for this is that we think that we will start to see some capabilities of cross-language speaking. Right now, we actually have it for English, that if you upload a voice, if you try to clone a voice that's speaking in, for example, French, and if you go there, you clone the voice, it's available in your studio, you can then start using it, you will get an English voice with a French accent. I honestly didn't expect that this would work. And I was mind-blown when I saw it, but it works so well. And we tried many voices across, and it works even a lot better with the new model we're training. Even with a few-shot training, you show it literally 30 seconds of a Japanese accent, and you will get an English person talking but with the same accent. Basically, the same voice but speaking English but with, it's as if a Japanese person does speak English and has the accent. And you can think about the capabilities of this with something like dubbing, for example. You can have the same actor across languages talking in the same voice, but in that different language. And so one of the reasons we're thinking of training that multilingual model is to start seeing this, not only from any language to English, like what you have now, but just across languages. You can have the same voice, so you design the voice for your game or your audiobook or whatever, and you love this voice and the way it talks and everything. And then you want to just take this and make it available in every language, in the same voice, but then speaking that language. So this is what we're trying to do.
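As a toy illustration of why non-Roman scripts need a grapheme-to-phoneme (G2P) step before the acoustic model can use them, here is a tiny hand-made lookup. A real system would use a trained G2P model rather than a dictionary; the entries below are only for demonstration.

```python
# Toy grapheme-to-phoneme (G2P) lookup: the acoustic model consumes phoneme sequences,
# not raw characters, so non-Roman scripts need this conversion step first.
# This table is a tiny hand-made stand-in, not a trained G2P model.
TOY_G2P = {
    "hello":    ["HH", "AH", "L", "OW"],
    "こんにちは": ["k", "o", "N", "n", "i", "ch", "i", "w", "a"],  # Japanese "konnichiwa"
}

def g2p(word: str) -> list:
    if word in TOY_G2P:
        return TOY_G2P[word]
    # A real system would fall back to a trained grapheme-to-phoneme model here.
    raise ValueError(f"no phoneme entry for {word!r}")

for w in ("hello", "こんにちは"):
    print(w, "->", " ".join(g2p(w)))
```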
Nathan Labenz: (48:19) Does the compute itself come with a huge price tag for you guys? And how do you manage that? Do you do it just on your own in the cloud, or do you work with Mosaic or some sort of specialist large model training company?
Mahmoud Felfel: (48:34) Currently, we're doing everything on our own, but usually, inference is not expensive. Inference is, I mean, users are paying for inference, so it's not a problem. And right now, the model is very optimized, that even for voice clones, we build infrastructures that can load and unload models on the fly in the CPU memory and handle hundreds of models per GPU. So we did a lot of work on optimizations on that infrastructure side, so it can scale really well now. So there's no problem with that usage side when users are using and scaling that. But the problem more is, or the cost comes more with training. This model we're training now, we're finishing, it actually cost us hundreds of thousands of dollars. I mean, it's not millions like these other large language models, but still, for a startup, that's a lot of money. And the good thing is that we have been in this market for a very long time, for three years now. We have a lot of customers, and we have very decent revenue that's covering everything so far. So I think that's not a challenge for us now. The challenge would be if we really want to scale this a lot more in training and experimentation. And also, one of the costs, what we found, is not only in training this one large model, but actually it's a lot more in experimentation, training much smaller models. For example, when we did this with Japanese and Portuguese, we wanted to know, is that actually doable? Can you just fine-tune this on another language and it works if you train this on phonemes? So this took a couple of weeks, not months to train. But this experimentation, and we do a lot of these experiments to improve the models, to do these things, what we have been talking about is control and phonemes and pronunciations and languages. So this continuous experimentation is actually, it's like an R&D department. It costs a lot of money. And I think that's still working fine with us. Over time, I think these models will reach a point where we'll exactly have what we want, and hopefully, we don't need to do the same investment again to get the same feature. Because right now, it's lacking on a lot of features in the existing models, and we are trying to get there in all these features. So one of the reasons for that also is that we are trying to train our own models. We're not using an API or something from another company or something that's already open source. So that also comes with that cost. You're just trying to build the operating system and the applications at the same time. So, yeah, I think this is where most of the cost is.
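Here is a sketch of the "load and unload models on the fly" idea: keep only the most recently used fine-tuned voice checkpoints resident and evict the least recently used one when the slot budget is full. The checkpoint loader is a placeholder and the slot budget is an arbitrary example, not PlayHT's actual infrastructure.

```python
# LRU cache sketch for serving many fine-tuned voice models without keeping them all
# loaded at once. load_checkpoint() is a placeholder, not real serving code.
from collections import OrderedDict

class VoiceModelCache:
    def __init__(self, max_resident: int = 8):
        self.max_resident = max_resident
        self._models = OrderedDict()   # voice_id -> loaded model (most recent last)

    def _load_checkpoint(self, voice_id: str):
        # Placeholder: a real implementation would read weights from disk onto a GPU.
        return {"voice_id": voice_id, "weights": "..."}

    def get(self, voice_id: str):
        if voice_id in self._models:
            self._models.move_to_end(voice_id)             # mark as most recently used
            return self._models[voice_id]
        if len(self._models) >= self.max_resident:
            evicted, _ = self._models.popitem(last=False)  # drop least recently used
            print(f"unloading {evicted}")
        self._models[voice_id] = self._load_checkpoint(voice_id)
        return self._models[voice_id]

cache = VoiceModelCache(max_resident=2)
for vid in ("clone-nathan", "clone-mahmoud", "clone-erik", "clone-nathan"):
    cache.get(vid)
```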
Nathan Labenz: (51:25) Just thinking long term, one of the things you said there triggered a thought for me, which is that voice, unlike other things, seems to hit a certain kind of plateau when it achieves human voice actor level. I know what it would mean to be superhuman at many tasks. But when it comes to producing human voice, almost definitionally, you can only be so good, and then you're beyond what is superhuman in that domain. So do you see things leveling off at a point where the models are so good that they can do what you want them to do and you can stage direct them and they're easy to use? Do you see that the core technology at some point becomes commoditized and even open source and just generally accessible?
Mahmoud Felfel: (52:28) There are two questions there. One, on the model itself, what will happen to this? Will this become a commodity? And another thing, if this becomes a commodity, what will happen to the applications? For the model itself, I think eventually the improvements you're trying to make on the base model will start to give you diminishing returns. Speech recognition, for example, reached a point where every speech recognition system is 95% good. To move from 95% to 98% needs a lot of work, but based on the use case, almost all of them are 95% accurate or something. I actually don't know, but I think they're all pretty good now. So I think the same will happen with voice, but then the difference we're seeing here will be this: yes, there might be some open source or an API available in a year or two from now that has similar quality to what we have today. But then for the specific use cases, to be able to generate voices and have these specific workflows for gamers, for example, or for people working on animation movies, I think we need to train some specialized models or fine tune on their data that is specialized that can give them exactly what they need. And that wouldn't make sense for, for example, OpenAI if they have a model that's available and very cheap or Stability if they have a model that's available and open source because they are trying to build something that's generic, that everyone is using. But we are interested in specific verticals, specific use cases where we can go and build these specialized models for these use cases that work really well for what they want.
And the other thing is, we are now talking about the use cases that have datasets available publicly. But if you think about things like customer support or sales calls, you cannot go anywhere and find 10,000 or 50,000 hours of customer support calls. We're working with some partners who have this data, private data from their customers, and then we can train that model and provide them the solutions that replace customer support, for example. But no one will be able to compete with this because this data is not available for anyone to train and have a variation. And it wouldn't make sense for everyone to have 100 models for all different use cases. So I think this is where we can start having something that's different from the market.
And the other thing is, for example, Copilot has been in the market for two years now, two and a half years. And I'm not sure if you know this, but at almost all the large companies, no one is using Copilot, even though it's good and it saves you a lot of time. Why are they not using it? Because they are afraid that if they use it, their data will be leaked to OpenAI or whoever, for privacy reasons. So that's another consideration for these companies. Right now, we're talking with some big production studios who are very careful about the voices. They have actual actors, and they want us to clone their voices to use in post production. And they're very careful about this. We would have to give them an in-house deployment of the model trained on their voices, on their actors, because if something like that leaked, that's a big problem for them. So for enterprises and big production companies, where you have to have a solution that's specialized and deployed internally to keep their privacy, I think that will be something that can protect the business from just opening this up for everyone.
But I agree with you that the individual use case over time will become very competitive. When OpenAI, for example, opens an API for this, everyone will start building applications, and so many startups will start building something similar to what we're doing, and the individual use case will become very competitive. But then the companies, production or gaming or the enterprise use cases, they will still need something specialized for them.
And the other thing, on the application side: this is where the difference between building the operating system and building the applications comes in. Right now, we are doing both, but eventually, going deep into these verticals and building more tools for them around what we're doing will also be something special about our offering. To give you an example - not even with this model, but with the old models we had - we had a team from Amazon using us, and they have AWS Polly. But the reason for that is they didn't want to go hire a developer and build their own solution. They just wanted something they could use and start getting their job done. And the same will happen here. If OpenAI or anyone else has an API tomorrow, the same will happen. Companies will continue to use the application that has a UI they can just use to get their job done quickly.
Nathan Labenz: (57:46) I think it's one of the best rundowns of long term moats that I've heard. And it really echoes what OpenAI is doing and the new market reality they're creating right now in the language model space. It is striking to see that that is now at $2 per million tokens, functionally free for a lot of use cases. But at the same time, they're doing exactly what you're alluding to, which is going and selling seven figure deals to major companies that have data. I don't know if they're doing any on-prem offerings at this point or not, but certainly walling things off from other people's data - all of that is going to be a huge concern for them, with their customers especially driving it. So it makes total sense that you would feel some of those same tugs and be moving in some of those same directions.
Mahmoud Felfel: (58:45) Everyone is asking this question. Will all these foundational models be monetized and will everyone have it and there will be no value in that. But I think what this question misses is that this is just an operating system that then there are two things there: these specialized things that you can build for enterprises and for specific use cases, and then the applications that can be built on top of this. Because of different layers of this. Exactly like in cloud, for example. There's AWS and Azure, but then there's so many application layers on top of that that still make a lot of money and make huge businesses.
Nathan Labenz: (59:26) Let's go back to cloning then for a second. I think this is going to be one of the most powerful features or use cases that you offer. It certainly, in time if not already, will become one of the more controversial as well. At this point, you can go into Play.ht and upload 10 minutes of audio of yourself speaking. And I was able to send you the first question from my list of questions in my voice, generated on Play.ht. It was honestly all very easy, very smooth. As much as I have shopped for this kind of product, I confess it was the first time I had cloned my own voice, and it was a wow moment. Hearing myself come out of the machine, with obviously a very high resemblance to my voice, was eye opening. So what are people doing with cloned voices? How much of the usage is that? And then, obviously, we've got to talk a little bit about the societal impact and the potential for abuse. I'm curious to know if you're already starting to see some of that as well.
Mahmoud Felfel: (1:00:39) You can get almost 100% resemblance. I've had multiple instances where users send us messages saying, "I sent this to my wife and she didn't know it's an AI voice." One of the things we're experimenting with is voice cloning for phone calls. We had this idea: what if we make phone calls to sell our product? We have SDRs on our team that we hire to make phone calls, reach out to people, go outbound, and sell our product to companies. When I listen to these calls, they seem very basic. "Hey, are you interested in this product? Yes, no. Okay, let's have a call." So we hooked up ChatGPT with our voices and started making these calls. We were very surprised that no one figured out it's an AI voice on the call. That was very surprising because we were thinking maybe people will feel like it's an AI voice and they'll hang up or something. But it's just normal conversations. I'm actually thinking about sharing some samples. But now this technology is becoming so good that, as you said, it's also dangerous.
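For readers curious what "hooking up ChatGPT with our voices" could look like in practice, here is a minimal sketch of that kind of pipeline under stated assumptions: the endpoint URLs, field names, and the overall request shapes below are hypothetical placeholders, not Play.ht's or OpenAI's actual APIs. The point is just the flow Mahmoud describes: get a text reply from a language model, then synthesize it in a cloned voice and stream it into the call.

```python
# Hypothetical sketch: LLM reply -> cloned-voice TTS -> audio for a phone call.
# None of these URLs, headers, or JSON fields are real Play.ht or OpenAI
# endpoints; they stand in for whatever chat and speech APIs are actually used.
import requests

CHAT_URL = "https://api.example-llm.com/v1/chat"        # hypothetical
TTS_URL = "https://api.example-tts.com/v1/synthesize"   # hypothetical


def llm_reply(conversation: list[dict], api_key: str) -> str:
    """Ask the chat model for the next thing the calling agent should say."""
    resp = requests.post(
        CHAT_URL,
        headers={"Authorization": f"Bearer {api_key}"},
        json={"messages": conversation},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["reply"]


def synthesize(text: str, voice_id: str, api_key: str) -> bytes:
    """Render the reply as audio in a specific cloned voice."""
    resp = requests.post(
        TTS_URL,
        headers={"Authorization": f"Bearer {api_key}"},
        json={"text": text, "voice": voice_id},
        timeout=60,
    )
    resp.raise_for_status()
    return resp.content  # e.g. WAV/MP3 bytes to stream to the callee


def handle_turn(conversation: list[dict], voice_id: str, chat_key: str, tts_key: str) -> bytes:
    """One conversational turn: text reply from the LLM, then cloned-voice audio."""
    reply = llm_reply(conversation, chat_key)
    conversation.append({"role": "assistant", "content": reply})
    return synthesize(reply, voice_id, tts_key)  # telephony layer (not shown) plays this
```

The design choice worth noting is that the language model and the voice model stay decoupled: either side can be swapped out without changing the call-handling loop.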
Right now, regarding the abuse issue, that's something we've thought about a lot. One of the reasons we spent so much time on this is that the model could clone any voice from day one. Actually, all the voices we have available now are just voice clones. We got voice over actors and cloned their voices, and now we're making them available for everyone to use. But we were very careful about releasing this. Voice cloning was a new market for us, and we wanted to know what people would be using it for, so we started by just asking people to sign up and have a call with us to understand their use cases. We spent probably a couple of months having a lot of those calls with customers. Out of this came a lot of YouTubers and podcasters.
When we launched it, at the beginning it was a very strict process with a lot of manual reviews. It wasn't scalable at all, because we were very careful about abuse. We only opened it up to what you use now after we did two things. First, we added moderation. Right now, everything going through the API goes through a very strict moderation policy. If you try to put something offensive or racist through, it's really hard to get it past. We would rather have some false positives than have someone harmed through this technology. We added this moderation across our APIs and our platform, and we tested it well.
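Mahmoud doesn't describe the moderation implementation, but the general pattern he outlines, screening text before it ever reaches the voice model and erring toward false positives, might look something like the sketch below. The keyword list, scoring function, and threshold are illustrative assumptions, not Play.ht's actual policy or code.

```python
# Illustrative sketch of a pre-synthesis moderation gate (not Play.ht's
# actual implementation). The scoring heuristic and threshold are assumptions.
from dataclasses import dataclass


@dataclass
class ModerationResult:
    allowed: bool
    reason: str = ""


BLOCKED_TERMS = {"example_slur", "example_threat"}  # placeholder list


def toxicity_score(text: str) -> float:
    """Stand-in for a learned classifier; here, a crude keyword heuristic."""
    words = set(text.lower().split())
    return 1.0 if words & BLOCKED_TERMS else 0.0


def moderate(text: str, threshold: float = 0.5) -> ModerationResult:
    # Err toward false positives: anything at or above the threshold is blocked.
    if toxicity_score(text) >= threshold:
        return ModerationResult(False, "flagged by content policy")
    return ModerationResult(True)


def synthesize_if_allowed(text: str, voice_id: str) -> dict:
    """Only hand text to the TTS model if it passes moderation."""
    result = moderate(text)
    if not result.allowed:
        raise PermissionError(f"Request blocked: {result.reason}")
    # ...the actual TTS call would go here; omitted in this sketch...
    return {"voice": voice_id, "text": text}
```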
The other thing we added is a classifier. Building that discriminator is a very hard problem to solve. I know OpenAI is also trying to build this for GPT, to tell whether content is human generated or AI generated. We built that classifier and it's available publicly on our website for free. Anyone can drop in an audio clip and see whether it's AI generated or human generated. I know this doesn't block abuse in itself, but we're hoping that as this classifier advances and more people use it, it will be integrated into social media, with banks or other entities that depend on voice for logging into your account or as a biometric, or on YouTube and other places where you have fake news. Hopefully, these platforms can show some message to inform the user, like what Twitter has now with Community Notes, where they have this message saying "This content is actually wrong, this is what's real." If they have something like this integrated, they can start to tell people, "This is actually wrong."
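The interview doesn't go into how Play.ht's public classifier is built. One common way to approach this kind of audio discriminator, offered here only as a generic baseline and not as a description of their detector, is to extract spectral features from labeled real and synthetic clips and train a binary classifier on them. The librosa and scikit-learn stack below is an assumption chosen for illustration.

```python
# Generic sketch of an "AI-generated vs. human" audio discriminator.
# This is NOT Play.ht's detector; it is a baseline spectral-feature approach.
import numpy as np
import librosa
from sklearn.ensemble import GradientBoostingClassifier


def clip_features(path: str, sr: int = 16000) -> np.ndarray:
    """Summarize a clip with the mean/std of its MFCCs, a cheap common baseline."""
    audio, _ = librosa.load(path, sr=sr, mono=True)
    mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=20)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])


def train_detector(real_paths: list[str], synthetic_paths: list[str]) -> GradientBoostingClassifier:
    """Fit a binary classifier: label 0 = human recordings, label 1 = synthetic."""
    X = np.stack([clip_features(p) for p in real_paths + synthetic_paths])
    y = np.array([0] * len(real_paths) + [1] * len(synthetic_paths))
    return GradientBoostingClassifier().fit(X, y)


def ai_probability(model: GradientBoostingClassifier, path: str) -> float:
    """Return the model's probability that the clip is AI-generated."""
    return float(model.predict_proba([clip_features(path)])[0, 1])
```

As Mahmoud notes later in the conversation, detectors like this tend to key on artifacts left by specific vocoders, so generalizing across many generation pipelines is the hard part.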
Right now, some people are using other services to create fake ads on TikTok and similar platforms with celebrities recommending products. It's very dangerous. I was reading an article a couple of days ago about someone who used a service like this to defraud a family by making them think their son was in jail and they needed to send money. That stuff is very dangerous.
I think what Sam Altman was saying, and I'm paraphrasing here, is that he hopes this technology, AGI, will take a little bit of time to arrive so that society gets used to it and adapts to it. I think that's even more true for voice. For images, society is already sort of ready for generated images. You already doubt anything you see on social media; that's been there for 10 years with Photoshop. But for voice, if you hear a voice saying something, if you get a call from your brother telling you something, your first reaction is not "Is this the real person?" That awareness, that these things can be so realistic, is not yet common in society. But it will happen over time. People will have doubt about everything. For any media you see, the first thing in your mind will be "Is that actually real, or is that a deepfake?" Because this will become so real that you cannot tell the difference, and it will only get harder over time.
I think it's a matter of responsible companies trying to have the tools in place and doing as much as possible to block abusers, while society, over time, gets used to the fact that voice cloning and deepfakes now exist. The same will happen soon with video. Right now, lip syncing is becoming very good. There are videos where lip syncing is combined with someone's voice saying something they never said, and that's very dangerous. But society will get used to that over time. People will stop trusting something just because they saw it from some source. I don't know what the full solution is for society, but we're trying to be in the middle: we mitigate the abuse as much as we can through these tools and the moderation, and at the same time offer a service that users find useful and easy to use.
Nathan Labenz: (1:08:15) Yeah, that all makes sense. It's a tough challenge, and you're not by any means alone in facing it. Two really quick follow ups there. One, on your discriminator: is that something that works across audio generated by other text to speech products as well? Or does it depend on a signal that you're embedding, or that maybe just naturally occurs, in the audio that you create?
Mahmoud Felfel: (1:08:41) It depends. From testing it on our models and on other products in the market, we noticed that it catches some things but is less effective on others. This is just the first version of it, and it can get a lot better at detecting, because these voices are generated through transformers, vocoders, and models like that, and those leave behind specific artifacts you can detect. Over time, if you really invest in this across different types of vocoders and different approaches to decoding audio into waveforms, you can reach a point where it's very accurate. But again, it's a very hard problem to have a discriminator that's very good at detecting any AI generated content. Still, having something that's accurate enough can at least mitigate a lot of abuse if it's deployed in the right places. It's the same idea as with fake news. If you can't mitigate 100%, at least make it 90%, so most of the abuse is mitigated, and try to do as much as you can.
Nathan Labenz: (1:10:05) Is there any way to detect commonly abused voices? I mean, I'm sure there would be a way if you had a small enough set. If it was just Joe Rogan and Joe Biden, you could probably solve that relatively easily. But is there a way to extend that? Do you think it would be realistic to extend that to the 100,000 most well known people that folks might want to clone and abuse a voice of?
Mahmoud Felfel: (1:10:32) Yeah, we actually thought about this. There are some datasets out there for celebrities, for example. You can build classifiers to detect those voices and flag it if a user tries to clone a celebrity voice. But the reason we didn't invest in this is that we found cloning celebrity voices is maybe not the most dangerous thing. The most dangerous thing is cloning politicians' voices or the voices of ordinary people and defrauding them. If someone is running a scam trying to get into a person's bank account with their voice, we will never be able to find that person's voice and put it in a classifier. That is actually the most dangerous thing, and that's why we went with the approach of asking: when someone clones a voice, what will they do with it? They will try to use our API or our editor, write some text, and generate audio in that voice. We thought having this moderation layer, and then flagging and blocking those users, would be a lot more effective than just having a database of voices that we can detect.
Nathan Labenz: (1:11:35) I've done a little bit of this kind of red teaming work on language models, and in that context, yeah, focusing on the content makes total sense. Someone asking for somebody's mother's maiden name is probably a pretty good red flag that they're up to no good.
Mahmoud Felfel: (1:11:54) We started to have some humans moderating these use cases, and we have a large language model that we can fine tune on that data to tell, with some probability, whether someone is trying to do something malicious. Sometimes, if someone is trying to get into another person's bank account, the moderation classifier might not catch it. But as you start to detect the use cases that slip through the cracks of moderation and feed them back into the detection, I think it will get better and better over time. That's a lot more effective than trying to detect the voices themselves and block users based on that.
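Mahmoud describes this as fine tuning a language model on human-moderated examples. A lighter-weight variant of the same idea, shown here only as an illustrative sketch and not as Play.ht's approach, is to ask a general language model to score each synthesis request for fraud or impersonation risk and escalate high-risk requests to human review. The endpoint, prompt, and JSON response shape below are all hypothetical.

```python
# Hypothetical sketch: scoring a TTS request for fraud/social-engineering risk
# with a language model. The endpoint and response format are placeholders.
import json
import requests

CLASSIFIER_URL = "https://api.example-llm.com/v1/chat"  # hypothetical

SYSTEM_PROMPT = (
    "You review text that will be spoken in a cloned voice. "
    'Return JSON like {"risk": 0.0, "reason": "..."} with risk between 0 and 1, '
    "where high risk means impersonation, fraud, or requests for credentials."
)


def malicious_intent_score(text: str, api_key: str) -> dict:
    """Ask the model to rate how likely the request is part of a scam."""
    resp = requests.post(
        CLASSIFIER_URL,
        headers={"Authorization": f"Bearer {api_key}"},
        json={"messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": text},
        ]},
        timeout=30,
    )
    resp.raise_for_status()
    return json.loads(resp.json()["reply"])  # e.g. {"risk": 0.92, "reason": "..."}


def should_escalate(text: str, api_key: str, threshold: float = 0.7) -> bool:
    """Route high-risk requests to human review rather than auto-approving."""
    return malicious_intent_score(text, api_key)["risk"] >= threshold
```

Flagged examples that slip through could then become training data for the fine-tuned classifier Mahmoud mentions, which is the feedback loop he describes.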
Nathan Labenz: (1:12:36) Really enjoyed learning a lot about how you're thinking about the business, how you're thinking about the technology. I'm very excited to see the next generation, larger model, and all of the prompt engineering or stage direction type of use cases and modes that it's going to unlock. It sounds like you're really on the verge of another big step forward, so I'm excited to see that. I've got three kind of quick hit closing questions for you. First one is, what are the AI tools that you are using in your day to day life right now that you would recommend that others check out?
Mahmoud Felfel: (1:13:16) Definitely ChatGPT. I'm using it now for summarizing things. If I have a very long email, I'll just drop it in and ask it to summarize it. Also, for writing any code now, it just doesn't make sense to do it another way. I use it and keep asking, "Okay, change this from Python to JavaScript" or whatever, and it does all the things you want. It is really good at this. I also use Grammarly. I think they were one of the biggest AI tools before ChatGPT, but few people think of them that way. I use it a lot. It gives a lot of good suggestions now when you're writing emails and things like that, based on the context you're writing in.
Nathan Labenz: (1:14:02) Yeah, ChatGPT is certainly the number one answer there. You're actually not the first to mention Grammarly, either. Do I infer that ChatGPT has kind of crowded out Copilot, that you're finding you just want to go to the better AI for longer code generations? I'm still using Copilot, but...
Mahmoud Felfel: (1:14:24) Mostly when I'm starting something new, writing something new. ChatGPT doesn't feel like the autocomplete you get from Copilot; it's like you have another developer next to you and you're thinking through different things together. Okay, let's add some rate limiting here. How would this work? Then you write that, send it, and wait for the response with code. Okay, let's change this from this language to that language. You go through these quick iterations until you reach something you want to use, then you take it into your editor, and you can use Copilot there. So that chat-based coding, I never expected this. If you had described it to me a year ago, I would have said I would never use it, but it is actually pretty good. It reminds me of earlier in my career, when I sometimes had interns working with me. It's very similar to that. It's like working with an intern: you ask them to do something, they go do it and come back with it, then you think through another iteration together, and they do the same. It feels very similar to that.
Nathan Labenz: (1:15:35) Yeah, I think that's very apt. I use that metaphor pretty often these days too. And it works on a couple of different levels, right? It needs a lot of context, and it really benefits from a couple of examples of what good looks like, aka few-shot learning. So there are, I think, a couple of pretty profound parallels to the intern that make it a useful metaphor for helping people, especially if they're new to the technology, understand how to get the most out of it. Hypothetical situation: I'm sure you're familiar with the company Neuralink. Let's imagine, some time from now, I don't know how far into the future, that a million people already have a Neuralink implant. So you're not going to be the first, but the question is, would you want to be the million and first, if getting the Neuralink implant would allow you to basically have thought-to-speech? You could type as fast as you can think; your thoughts would be translated to words and stored in the computer. Would you be interested in getting that implant for that capability?
Mahmoud Felfel: (1:16:49) I think I might try it before a million, even. It's amazing; it's exciting to see whether this works. Because the main thing there is that humans have a bandwidth limit. You think about something, and then you have to write it out, do this manual, physical thing. But when you're able to send your thoughts to a machine directly and have it interact with them, the bandwidth would be a lot bigger than if you're just using your hands to describe something. So I think it will be so powerful. And people who have this will start to advance really quickly in what they can do and in their capabilities. It has to be safe, of course. But yeah, I would try it for sure.
Nathan Labenz: (1:17:46) Just zooming out entirely, biggest picture, what are your biggest hopes for and also fears for the AI era that we are entering, as you think about how things will play out over the rest of the decade?
Mahmoud Felfel: (1:18:03) Yeah, I've been thinking a lot about AI safety. I won't comment too much on that, because there are so many opinions and so many people saying different things, trying to predict what will happen in the future. But what I hope is that society will get the chance to adapt to these changes. This will be very empowering. These tools will help us be a lot more productive, and there will be a lot more progress for humans in the next 10 years. I hope it will be much more positive than negative. What I hope doesn't happen is around the legislation of this technology. We're seeing it now with Stability getting sued over what they're trying to do. I hope that doesn't introduce another winter and delay the technology a lot, because if that happens in the markets that are trying to be responsible, other markets that are not as responsible will keep it open, and they will be able to advance a lot faster. So how the legislation around this plays out is very important. Think about the internet: platforms like Facebook and Twitter are not held responsible for what users post on their platforms. You can argue the benefits and disadvantages of that, but you cannot argue that it is one of the reasons these platforms became huge and really impactful and useful for a lot of people. If that weren't the case, if anyone could just go and sue Facebook over something a user posted, they would have died a long time ago, right? So I hope that society overall will be able to adapt and help this technology progress for the benefit of humans, because I think it can be really useful. People can become a lot more efficient and productive, and the overall gain for society from a technology like this can be huge.
Nathan Labenz: (1:20:24) Again, this has been a fantastic conversation. Mahmoud, thank you for being part of the Cognitive Revolution.
Mahmoud Felfel: (1:20:30) Thanks, Nathan. I enjoyed it. Thank you so much.
Erik Torenberg: (1:20:33) Omneky uses generative AI to enable you to launch hundreds of thousands of ad iterations that actually work, customized across all platforms with a click of a button. I believe in Omneky so much that I invested in it, and I recommend you use it too. Use CogRev to get a 10% discount.