The Pixel Revolution Part 2 with Suhail Doshi, Founder of Playground AI
Nathan and Suhail Doshi explore the evolving world of AI image generation and the development of Playground AI amidst emerging technology.
Video Description
In this episode, Nathan sits down with Suhail Doshi, founder of Playground AI, to discuss generative AI for images. They discuss the current state of AI image generation, how Suhail is building Playground while the technology for vision and image generation is still maturing, thought to image reconstruction, and more. If you need an ecommerce platform, check out our sponsor Shopify: https://shopify.com/cognitive for a $1/month trial period.
We're hiring across the board at Turpentine and for Erik's personal team on other projects he's incubating. He's hiring a Chief of Staff, EA, Head of Special Projects, Investment Associate, and more. For a list of JDs, check out: eriktorenberg.com.
---
LINKS:
- Playground AI: https://playgroundai.com/
- Pixel Revolution Part 1: https://www.youtube.com/watch?v=Waii0i1bBFY
SPONSORS:
The Brave Search API can be used to assemble a data set to train your AI models and help with retrieval augmentation at the time of inference, all while remaining affordable with developer-first pricing. Integrating the Brave Search API into your workflow translates to more ethical data sourcing and more human-representative data sets. Try the Brave Search API for free for up to 2000 queries per month at https://brave.com/api
Shopify is the global commerce platform that helps you sell at every stage of your business. Shopify powers 10% of ALL eCommerce in the US. And Shopify's the global force behind Allbirds, Rothy's, and Brooklinen, and 1,000,000s of other entrepreneurs across 175 countries. From their all-in-one e-commerce platform, to their in-person POS system – wherever and whatever you're selling, Shopify's got you covered. With free Shopify Magic, sell more with less effort by whipping up captivating content that converts – from blog posts to product descriptions using AI. Sign up for $1/month trial period: https://shopify.com/cognitive
Omneky is an omnichannel creative generation platform that lets you launch hundreds of thousands of ad iterations that actually work customized across all platforms, with a click of a button. Omneky combines generative AI and real-time advertising data. Mention "Cog Rev" for 10% off www.omneky.com
NetSuite has 25 years of providing financial software for all your business needs. More than 36,000 businesses have already upgraded to NetSuite by Oracle, gaining visibility and control over their financials, inventory, HR, eCommerce, and more. If you're looking for an ERP platform ✅ head to NetSuite: http://netsuite.com/cognitive and download your own customized KPI checklist.
X/SOCIALS:
@labenz (Nathan)
@Suhail
@CogRev_Podcast
TIMESTAMPS:
(00:00) Episode Preview
(00:44) The current state of AI image generation
(08:04) Are we currently at a GPT-2 level for image gen?
(15:40) Sponsor - Brave Search API | Shopify
(20:46) Shortcomings and use cases for GPT-4V
(22:46) Benefits of vision vs language
(28:30) Trajectory for what Playground will build next
(33:28) Sponsor - NetSuite by Oracle | Omneky
(34:48) How will the image generation experience change over time
(40:06) Thought to image reconstruction
(47:49) What if OpenAI fully focused on image
(50:09) Lack of training data in vision and the use of synthetic data
(51:30) Multimodal models increasing performance
(55:03) Images are information rich but lack the right annotation
(57:00) Building Playground while vision technology is maturing
(1:03:25) What should the rules for generative AI be?
(1:09:19) Parallel to the music industry and streaming for rev share
(1:12:21) What are the minimum standards that AI application developers should be expected to uphold?
(1:16:48) Wrap
Full Transcript
Suhail Doshi: (0:00) I try to sometimes put myself in the shoes of the artists or the people making these images, photographers, whoever. We were the first site ever, and I think the only site where if there was a prompt on our site and someone references his name, we directly link back to his page. It might be generally okay to make things. In fact, many brands are perfectly fine with fan art or whatever. And there's kind of this question of commercial use. That's where things kind of stop. If this whole thing is very gradual, then I think probably society would find some way to assimilate to it. If it's vastly faster than that, then I think that we definitely need to do something about that. I think that who it's impacting very much needs to be considered - some craft that they've been doing for a decade or two. You can't ask people to evolve and upskill.
Nathan Labenz: (0:45) Hello and welcome to the Cognitive Revolution where we interview visionary researchers, entrepreneurs, and builders working on the frontier of artificial intelligence. Each week we'll explore their revolutionary ideas and together we'll build a picture of how AI technology will transform work, life, and society in the coming years. I'm Nathan Labenz joined by my cohost Eric Torenberg. Hello and welcome back to the Cognitive Revolution. My guest today is Suhail Doshi, founder and CEO of Playground AI. This is a special episode for me because Suhail was our very first guest on the show almost a year ago and today he returns as roughly our 100th guest. Of course, it's been a busy year for everyone in AI, and Suhail and team have been hard at work building and releasing a huge number of new features designed to help anyone create and edit images like a pro. Most recently, they've trained and open sourced a new foundation image generation model called Playground v2, which is preferred to Stable Diffusion XL some 70% of the time while being built on the same exact architecture. A decision that Suhail made so that the open source community can easily adapt and apply all the surrounding tools that they've already developed. If you're wondering how it makes business sense for a company like Playground to give the fruits of such a major investment away for free, Suhail's perspective on the state of AI art in general and on image generation in particular might surprise you. Because, yes, of course, it's undeniable that the state of the art has continued to advance and today not only Midjourney and DALL-E but also Stable Diffusion XL and Playground V2 can generate excellent quality, nearly photorealistic images with increasingly fine-grained prompting controls. And yet, in Suhail's analysis, we're still only at roughly the GPT-2 stage of image generation AI development. Just scratching the surface of all the promise that AI-based image manipulation still holds. Most of the use cases, as Suhail has come to understand them in talking to Playground users, are not very well served by existing models. And the leading AI artists use a mix of models and different complementary tools to achieve their best results. Meaning that things overall are still a bit too complicated for everyday casual users. So what's missing from this picture? A unified vision model that can do it all. Understanding, generating, and also manipulating images in all sorts of useful discrete ways. This does not yet exist, and the leading large language model developers don't seem to be really focused on it. So Suhail and team are setting out to build it over the course of 2024. As always, if you're finding value in the show, we'd ask that you take a moment to share it with friends. It has been incredible to see how the show has traveled and how the audience has grown entirely through word-of-mouth. And I think this episode would be perfect for both the artists and the application developers in your life. Now I hope you enjoy this conversation about the state and future of AI vision and image generation with Suhail Doshi of Playground AI.
Nathan Labenz: (3:55) Suhail Doshi, founder and CEO of Playground AI online at playgroundai.com. Welcome back to the Cognitive Revolution.
Suhail Doshi: (4:04) Thanks for having me.
Nathan Labenz: (4:06) I'm excited. So you were our very first guest when we launched the show almost a year ago now, and I think you'll be roughly our 100th guest. So this is a cool way to celebrate having put out a lot of episodes and another trip around the sun. And I'm excited to catch up with you on everything that's been going on in the world of pixels both broadly and at your company over the last year. For starters, I just listened to another interview that you did with Swyx and Alessio on the Latent Space podcast. I thought that was really good and figured we'd try to cover, of course, there'll be some overlap, but try to cover largely some different topics today. I guess for starters, I'd love to hear what's new at Playground, what's new in image generation. I'm sure there's a lot, and I'll have some follow ups, but I would love to hear how you would summarize the journey over the last year.
Suhail Doshi: (5:00) Yeah, I think the journey over the last year has had some interesting highlights and some interesting lowlights. I'll start with the lowlights because even though it may sound slightly depressing, I actually think it's the most exciting bit. And of course, everybody knows that things like DALL-E 3 have come out, Midjourney v6 has come out, Stability AI released SDXL back in, I want to say, June, July. So there's been a lot of new foundation models that have kind of come out, but I would mostly say that I have been fairly disappointed by the progress. And one way that I would maybe articulate this is that mostly the models are used for art, and art has some value and utility in the world. But for the most part, we haven't really seen that general-model high utility that you get out of these language models, right? And so to kind of harken back a little bit, maybe three or four years ago, when we looked at language, we sort of said, what do these AI models really do? They might summarize something, hallucinate, maybe sentiment analysis was a use case that was touted a lot. Many startups were built just doing sentiment analysis, those sorts of things. The models couldn't rhyme, the models couldn't really write code, the models couldn't really do anything. So there were these very specific use cases. And then folks at Anthropic or OpenAI noted, with scaling laws, that they felt like maybe the models could get better and more generally useful. The only reason I'm taking us through this hopefully well-charted history - which people know if they're in the AI world - is that if you're not in the AI world, that's a rough snapshot of what happened with text, and now you have things like ChatGPT and it writes code and you can ask it questions and don't have to go to Stack Overflow anymore and whatnot. That really hasn't happened with vision at all. What we got out of vision was I can make really amazing art. I can make a meme. I can make something cool. I can show it to my friend. Maybe I can make a book cover. Maybe I can make - one use case for some of these models is making coloring books. But the models are really great at extrapolating interesting characters or subjects or environments and making art, but they're not very good at doing anything else. And they're not good at manipulating pixels in any other kind of way. And so I actually was maybe expecting last year that perhaps the pace and momentum would be a lot faster, but it just turned out that wasn't the case. And so maybe the thing that I'm most excited about is that it kind of feels like vision is like a year or two off from this big moment that language has had. And there are far greater quantities of vision data than text.
Nathan Labenz: (7:51) Yeah. No doubt about that. There's plenty of - my dad has this crazy dad joke of, we're running out of pixels and the pixel mines are getting depleted and the reality is obviously there's plenty of pixels flying around if you can figure out how to use them. That is an interesting perspective. Would you - this is maybe so strange as to not be useful, but would you say we're at like a GPT-2 level relative to some inflection point that you're expecting for vision?
Suhail Doshi: (8:21) Yeah. It's funny that you ask that question because that is exactly how I phrase it at our own company. If I'm talking to people that we're trying to hire, I'm like, I'm not sure we're even at GPT-3 level. Even though there was a feeling maybe last year that perhaps we were. But I actually don't think that we were at something greater than a GPT-2. And I think on the continuum from where we were two, three years ago, with GANs and stuff, things have significantly improved, but the overall utility hasn't really significantly improved. We've got our sort of three or four very simple use cases. You can make art, you can remove something - an object, like the Pixel phones removing things - you can remove a background, and that's about it. That's roughly what you can do with images. Whereas in the domain of text, it's like the long tail of what you feel like you can do, the value that you get, is just so much more significant. And so I say that this is mildly depressing in some ways, but I'm very excited and I'm working on it because I feel very excited about the possibilities - that actually it's this huge open field of unsolved problems. And now there's a lot of momentum, a lot of effort, and a lot of desire to make it better.
Nathan Labenz: (9:37) Yeah. That's really interesting. I mean, I think for a lot of people that probably even who are in the AI space pretty deeply, I think that would probably come as a bit of a surprise. As you were talking like, I have remarkably little information about who the Cognitive Revolution audience is, but we're definitely not like an AI art or image generation or like a pixel specialist show. So I would imagine most people have a lot more experience with your ChatGPTs and your Claudes as opposed to with image creation. But from the outside, and I use these tools like, I'm not an artist, but I use them periodically. It definitely feels like things have gotten a lot better. Right? I rewind to a couple years ago when I was seeing this stuff start to bubble up on Twitter and seeing Rivers Have Wings, the OG true legend accounts. Rivers Have Wings, if you're listening, check your DMs. I wanna have you on the show. But that stuff was pretty gnarly, and it was like, wow, it's amazing that you can do anything, but it definitely was not passing any sort of visual Turing test equivalent or whatever.
Suhail Doshi: (10:50) It was sort of like swirling colors in backgrounds and abstract shapes. And now we're getting this amazing photorealistic thing. We kind of saw the hands are starting - we're starting to see models with text synthesis and that kind of thing. So it's definitely on an amazing pace. It's definitely not hitting some odd weird asymptote quite yet.
Nathan Labenz: (11:13) Given that progress, you said utility hasn't grown as much as you would have hoped. Is it the case that we maybe naively misunderstand or kind of misconceive of what actually drives utility, like the fact that we're able to generate these photorealistic outputs today is just not enough for a lot of the utility that you're looking for?
Suhail Doshi: (11:33) I generally think for the most part it's just at the moment, mostly a lack of imagination. It is somewhat hard to imagine extrapolation from these big models. I think even at GPT-3, it wasn't like there was this massive audience thinking that - I think if you had gone back in time to maybe when GPT-3 came out, maybe six months later, I don't know that there was this very big audience that thought for sure, what we were getting out of GPT-4 was coming. I don't think that - I had access to GPT-4 maybe in October of the year before last. And the first thing I did with it was I wanted to see if it could find vulnerabilities in my computer programs. It's like, how good would this be at finding security issues? That was the use case I had in mind. But I'm not sure that I fully internalized even back that year that this thing would be so instrumental in writing code as good as it is. And I wasn't sure that - I thought that it could answer relatively difficult questions about my life or things happening in my life, or if I have a leak in a ceiling, how might I solve that? I don't think people thought that it could be, not very many people thought that it could be that powerful. And so I think that's just similarly true in images. You're just sort of like, okay, it can make this beautiful art, it's really fun and novel. But what do I use this for day to day? Because you actually look at images - it can't, it kind of struggles making logos. It sort of struggles with spatial reasoning. Certainly things like inpainting or editing are extremely difficult. We're still sort of talking like eyes are still sort of a bit strange, hands aren't always coherent. Sometimes it adds an extra letter if you're asking for text. So really these little things that we see are present in the models, but there are some, I think there are much bigger things about vision just generally that can lead to a very big general model. Like one of the things that we tend to think about at Playground is how we might achieve a unified vision model. Right? And there's like three pieces to at least graphics, which is just a sub component of vision. There's creating pixels, there's editing pixels, and then there's understanding them. So GPT-4V is a good example of a very rudimentary model that's getting better at understanding things. Right now, it's quite good at captioning, but there's other reasoning capabilities in vision just like understanding an image or even a video. Like you can't do video, for example. Right now, we're kind of stuck on creating, but we're not very good at editing things. One thing that hasn't really happened yet is there hasn't been a very big effort on manipulating real pixels of real images. Right? We're mostly dealing with synthetic images, but we're not doing a whole lot with real photographs where I think there's probably a lot more utility and a lot more value. Like, I'll give an example. If my son was smiling for a Christmas card that we wanted to send out late last year, it's really hard to get the dog and the kid, everybody in the picture, getting them smile appropriately. But it would be nice if I could just be like, hey, highlight his face and just say, could you make him smile, or take two images, one where he's smiling, one where I'm not, like kind of finding a way to merge those two things together. That's a very simple application. And so you can kind of imagine what the other kinds of general useful things that you could do with manipulating pixels. 
And that's maybe the very, very tip of the iceberg. So I just imagine that there's a lot more in vision, understanding the context of a video scene. Another example, like if you ask a model to look at three minutes of a video and what happened, could the model reason about that? So just a bunch of those kinds of things. And there's just this general lack of, I would say significant investment with regard to trying to make a great general vision model.
Nathan Labenz: (15:36) Hey, we'll continue our interview in a moment after a word from our sponsors. Yeah, I would love to hear a little bit more about the use cases and also I'm curious about what shortcomings you are finding in GPT-4V. I have used it pretty successfully actually, at least for the use cases that I've tried. One that was like a mind blowing moment was I took my kids - I can definitely relate to the difficulty of getting them all to smile in a photo because we're up to three. And two out of three feels like the best we could ever manage to do. But I took them to Salem, Massachusetts leading up to Halloween because we were out there for an event that my wife was organizing. And it was actually a fun little day, but I came across this parking sign and this had just - at the time there was the famous one on Twitter where somebody was like, I'm never getting a parking ticket again because he took the picture of the eight different parking signs and said, can I park here right now? My version of that was the Salem version where it had some message about special parking rules for October and at the bottom it said Salem PD. And I was like, what's going on here? And it was able from the Salem PD and the special rules for October to infer that it appears that you're in Salem, Massachusetts, which has this historical connection to witchcraft and whatever. And now it's like a holiday destination leading up to Halloween, and so it's probably that that you're dealing with. But I was like, man, this thing, there's a lot of knowledge in there that it's kind of able to tap into. And we've had really good luck with that at Waymark too, where we just use it to filter user images. We've long had this product experience where we'll pull in a ton of images like every image on your website if you're a small business just for your convenience. So you don't have to go load them into our product manually. So we kind of pull them in, but then we get all kinds of crap. Right? So how do we filter that to use the relevant stuff? And GPT-4V is awesome at that for us. Literally just saying, here's the profile of the business, which of these images would be good to use, which would not be good to use. I've done a little bit with breaking video into frames and trying to get it to understand that as well. I wouldn't say I've fully characterized how well that can do, but anyway, those have been my experiences. What things have you seen or done that you feel like leave something to be desired in the GPT-4V performance?
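For readers who want to try the image-filtering workflow Nathan describes, here is a minimal sketch using the OpenAI Python client. The model name, prompt wording, and helper function are illustrative assumptions, not Waymark's actual implementation.

```python
# Hypothetical sketch: ask a vision-capable model which of a business's images
# are worth using. Prompt, model name, and function are illustrative only.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def filter_images(business_profile: str, image_urls: list[str]) -> str:
    """Ask the model to judge each image against the business profile."""
    content = [{
        "type": "text",
        "text": (
            "Here is the profile of a business:\n"
            f"{business_profile}\n\n"
            "For each image below, say whether it would be good or bad to use "
            "in a promotional video for this business, and briefly why."
        ),
    }]
    # Attach every candidate image as an image_url part of the same message.
    content += [{"type": "image_url", "image_url": {"url": u}} for u in image_urls]

    response = client.chat.completions.create(
        model="gpt-4o",  # any vision-capable chat model works here
        messages=[{"role": "user", "content": content}],
    )
    return response.choices[0].message.content

# Example usage with a placeholder URL:
# print(filter_images("Family-owned bakery in Detroit", ["https://example.com/storefront.jpg"]))
```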
Suhail Doshi: (18:12) I think one small, easy one is that GPT-4V is not always great with certain tasks like image segmentation. It can be kind of weak at that occasionally and you need image segmentation or finding the right bounding box for certain images tends to be extremely useful. I mean, there are just better models than GPT-4V and arguably a lot faster. And Segment Anything is a really great example. You can use DINO plus Segment Anything together to achieve really interesting state of the art results. And that's a very simple use case that's been going on for a long time in AI that probably could greatly benefit from a general model. So I think the promise of something like GPT-4V is really, really outstanding. That task is so valuable for a whole bunch of use cases related to cameras. Anything related to cameras is a really good example of that. Sometimes, I still think there are some issues with hallucinations with its descriptions of some of the images, but for me, my favorite use case is taking screenshots on my desktop computer and then asking it to solve some problem of mine. Lately, I've been just doing this funny thing where I'll just take screenshots of errors in my code and asking it to basically help me fix whatever the problem is. It's almost like I can be really lazy about things. But I had been thinking that how valuable would it be to sort of have - surely it would be valuable if there was a model that was able to just stare at my computer with me and kind of help me along as I go. And I hope someone at OpenAI is sort of working on something that can be like, hey, I might be able to help you with that. I'm kind of pairing with you as you're working, but I think that's going to take a lot of work. And that's just one domain I think of like pixels and images. So that's just understanding what it's seeing. That's not even necessarily manipulating graphics in any kind of way.
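Here is a minimal sketch of the "DINO plus Segment Anything" combination Suhail mentions, using Meta's segment_anything package. The detection step is stubbed with a hard-coded box; in practice a text-grounded detector such as Grounding DINO would supply it, and the checkpoint filename and coordinates below are placeholders.

```python
# Box-prompted segmentation with SAM. The bounding box stands in for what a
# text-grounded detector (e.g. Grounding DINO) would return for a query like
# "the dog on the left". Checkpoint filename and box values are placeholders.
import numpy as np
from PIL import Image
from segment_anything import sam_model_registry, SamPredictor

image = np.array(Image.open("photo.jpg").convert("RGB"))

# Pretend this came from the detector: x0, y0, x1, y1 in pixel coordinates.
box = np.array([120, 80, 480, 400])

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)
predictor.set_image(image)

masks, scores, _ = predictor.predict(box=box, multimask_output=False)
print(masks.shape, scores)  # masks has shape (1, H, W); scores is the model's confidence
```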
Nathan Labenz: (20:10) Yeah. Certainly the fact that it is all text out is a major limiting factor in terms of what it can do with images.
Suhail Doshi: (20:17) Because all I do is spend my time on vision. I spend a lot of time about what are the major benefits of vision versus say language, not just in the training data, but also in the outputs. And I'll give you another example. One thing that's really challenging with text is it takes a lot of effort to read text. It takes a lot of patience. It takes a lot of focus. And we kind of continue to live in a world where there's even less and less of that. But one thing that vision or graphics is really good at is just a very amazing visual explanation of something. Often there are circumstances where you want a visual explanation of something. Right? We see this all around us. Right? We know when we read a book and we see a diagram or something that it can be very valuable, or a graph, right? A graph can be extremely valuable. But there are other modalities like a video that's explaining something to me is really valuable. Audio is obviously valuable, but there are major areas where things tend to lack in that regard. For example, there are a lot of people that are just kind of waiting around for these image models to figure out how to make diagrams, flowcharts, ways to be able to understand something about the text. That'd be one thing. We all can very much tell when something doesn't look like us. One thing we learned over the last year is that humans are really good at looking at faces. They can really tell when something is off. You might have 10 friends that tell you that that cartoon version of you looks like you, but sometimes when you look at it, you're like, that doesn't look like me. You're almost offended by it. And so I think that there are all these different kinds of use cases that are just very unsatisfied. And I mostly think that language doesn't accommodate as well as it could and it will never accommodate. And it's totally reasonable.
Nathan Labenz: (22:07) Yeah. Those are two good examples. I've tried to work around the graphic thing a little bit by asking GPT-4 to create a text-based, basically to follow - there's a bunch of as you're probably aware of. There's a bunch of different flowchart diagram kind of syntaxes that you can use. So I've asked it to create some diagrams that way. And there's one called Mermaid and there's VisGraph, I think is another one. That's worked okay, but yeah, definitely I did this for Waymark actually with the AI system that has a bunch of different models that even for us on the development team, it's sometimes a little bit hard to remember, wait, which has to finish before the next thing? What are we parallelizing and what's the dependency on what? So to just try to have that at a glance is something that we didn't have. And I was able to get a pretty decent version of it from this syntax that then could render deterministically into the visual form. But you're definitely right that it leaves a ton to the imagination compared to what a real infographic or proper designer would do.
Suhail Doshi: (23:19) Yeah. That's one that people commonly talk about. A less common one could be as simple as something like, here's a picture of my house, my room in my house, my bedroom, I need to figure out - I want to figure out different combinations. If I put my bed on this side of the wall, what might my room look like? If I put my bed on the other side of the wall, keeping all the same objects in place, how would you reorganize my bedroom? Right. And if you could copy and paste that, you could put it in GPT-4V and it will spit something out. Right. But it's very hard to imagine - you have to now read this thing, it's extremely hard to reimagine your bedroom. But imagine if there were an image model that could completely reorganize your bedroom and show you all the different ways that it could work. Right? Imagine that you could interact with it. You could say, hey, what if the dresser was made out of this other material? What if it was a little bit bigger? What if the nightstand was lower? There are a whole bunch of vision use cases. So I think if you can try to imagine these little simple things that feel very hard today. It's very hard. Where do I go? What website do I go to? Okay, I got to start drawing lines and stuff. Forget it. I don't want to do it, too much work. So you can kind of try to imagine these circumstances where you're iterating just like you are with language, but with something very visual, whether it's your home or something else. I think you start to get more out of it. You could do the same thing with logos. You could say, I want my logo to be a little like this. Get rid of the little swirl, could you change the circles to rounded rectangles, you're able to kind of actually work with something like an artifact together. That's virtually impossible to do. That's another example, something very hard to do with language. I think that there are even bigger use cases in vision, whether it's something understanding that it's your face versus an intruder in a camera system, something that learns that over time. There's just a whole limit to what we probably can do with language. I think it's underappreciated with vision and I think it's just strictly because we're kind of trapped in our world right now. And I think that's going to change very significantly this next year.
Nathan Labenz: (25:41) I have kind of two lines of thought going in my head at the same time. One is how's that gonna happen? You guys have just trained a foundation model from scratch yourselves. I'd be interested to hear kind of what you think the trajectory of foundation models and vision is gonna be. And then I'm also really interested in - and it's a very general problem, although you have a particular form of it for product owners. How do you build in a way that balances the now and what people need to get utility from your product today versus what is going to be needed or maybe no longer needed as the models themselves get better. I went into the product and was making some stuff in preparation for this. And I haven't - I have done it more than once between our episodes, but definitely was just catching up in the last couple days. I see that there's a lot of new features. I was gonna ask, what are some of your favorite new features? What are people getting the most utility from? And I still wanna ask that, but I also now kinda wanna ask, how are you thinking about what features to build that are patches for model weaknesses that may not be needed in the future versus what things do you think are always going to be non-model features. So there's a lot there. I'll shut up and you can take it all apart.
Suhail Doshi: (26:58) No, no, no, that's great. I think the last year has kind of made me see - the community has done such an amazing effort at fine tuning models, finding fixes, being extremely clever. And it is exactly how you describe it. For as amazing of these feats as they are, it is a lot of patchwork. It's not a true fix at a foundation model level. There are models that fix hands, fix eyes. There are models that do upscaling to try to get more detail out of them. There's tons of amazing tools, but they're all just patches. I think maybe the more surprising thing is we've been thinking a lot about what comes after just text to image? What else is left? How do you get to a much deeper understanding of these kinds of concepts? And so I think for the most part, we sort of think of it as these three pieces around, okay, how do we create something from nothing? How do we take existing pixels and manipulate them at a really high fidelity? And how do we understand the images a la GPT-4V? And then how do we bring the three together? Because ideally that model understands a lot more, and it's far less patchwork and it's far more general than what we have today. We're planning to be very focused this year on editing. So I think the first thing is that I think the community and everybody involved, all the researchers involved made this wonderful way of creating amazing art. Sometimes it's totally realistic. Sometimes it's just fun, entertaining fantasy art, what have you. And I think that has driven some utility, but definitely not enough. I think the next thing is probably our kind of not so secret plan we're happy to reveal is just that we're going to work on multitask editing. And multitask editing is just taking existing images and manipulating those pixels such that we can achieve anything. There have been some early versions of this that happened last year - should be anything from like, what would I look like if I were aged up at 65 years old, or my son, what would I look like with a different hairstyle? There are models out there from researchers that are trying to replace clothes in a really high fidelity way. That should really be encompassed in a general model as well. Being able to do all kinds of things like that, being able to manipulate some kind of interior design would be a good example of that. And then there should be more difficult things. So those would be kind of local edits. And then there's more global edits. Global edits would be like, you have a big scene that's clearly a wintery scene that you might want to make a spring or summer scene. And that means every object, everything in that scene has to be changed. So that'd be a big global edit. I think that we need to get better and better at those kinds of things because I think the utility will rise across customers who want to do more valuable things. And I think a lot of those things tend to be real imagery. And then I think we're also going to work on a model that's trying to understand everything that's going on in an image. We need good understanding because that actually helps us train the other two models. And so hopefully by the end of the year, we're looking at something where we can kind of try to unify these three things into a single model that can do more surprising things, more surprising use cases. Just strictly for 2D images - it's like we've kind of moved on to video, people have moved to video and stuff, and I think that's really wonderful. Makes really amazing art. 
But I think that we still haven't quite - we kind of haven't shown how big things could be even just in image, which is on its own a hard modality.
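One public precedent for the instruction-driven editing Suhail describes is InstructPix2Pix, which he brings up later in this conversation. Below is a minimal sketch via Hugging Face diffusers; the prompt, input image, and guidance values are illustrative, and this is of course not Playground's own editing model.

```python
# Instruction-based editing of a real photo with InstructPix2Pix (diffusers).
# A "global edit" in Suhail's terms: change the whole scene consistently.
import torch
from diffusers import StableDiffusionInstructPix2PixPipeline
from diffusers.utils import load_image

pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
    "timbrooks/instruct-pix2pix", torch_dtype=torch.float16
).to("cuda")

image = load_image("winter_scene.jpg")  # placeholder input photo

edited = pipe(
    "make it a sunny spring scene",
    image=image,
    num_inference_steps=30,
    image_guidance_scale=1.5,  # how closely to stick to the original pixels
    guidance_scale=7.5,        # how strongly to follow the text instruction
).images[0]
edited.save("spring_scene.jpg")
```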
Nathan Labenz: (30:50) Hey, we'll continue our interview in a moment after a word from our sponsors. So how do you think the experience changes over time? Like today, again, I go in there. I can do the classical text to image prompt and get something out. Now I can also bring my own image and ask for modifications that are kind of holistic. Like, one of my early semi-viral tweets from last year was taking the ultrasound image of our now 9-month-old baby that was still in my wife's belly at the time and saying, what would this look like if it was a newborn? And it did a reasonably good job of that. I wouldn't say it was - it was still in the uncanny valley, but it was good enough that people were intrigued to see it. And you got a lot more features too. Right now, I can outpaint. I can mask and inpaint. I can start with a line sketch. I can start with a depth map. I can start with probably a couple other things as well. You have kind of pre-rendering sort of features. There's the intensity of the guidance. So there's all these knobs and dials that have been added. Do some of those go away? Do we continue to add all these sort of control elements or does it become more of a just conversational UI again?
Suhail Doshi: (32:05) I think some of the elements go away, right? I think some of the elements go away because the model should have an understanding of depth. The model should have an understanding of edge lines. The model should have an understanding of color, lighting, right? The model shouldn't - ControlNet, for example, for these models is a really great innovative thing that came out last year. But the model should really understand that. You should be able to erase a part of an image and the model should understand that, actually, there is a big depth component here and that I have to consider the shadows or I have to consider what's in the foreground and the lighting related to a character or subject. Right? So I think definitely some of these things go away. And I think that for the most part, we find that there are definitely a whole bunch of power savvy users that use these tools, but a vast majority of people definitely struggle. Definitely it takes some time and expertise watching a couple of YouTube tutorials perhaps to understand these things. I think those things kind of go away. I don't think it ends up being just text. I think text is great. Text is a really great kind of absolute way that humans can compress a very high dimensional concept in their mind to something simpler that they can input into a model. It's very hard to imagine - how would you take something that you're imagining and give it to the model? Well, maybe the best answer for that would be another image. One of my favorite examples is the concept of shattered glass, right? Shattered glass is - we all have a version of what we can imagine that looks like, but actually it's very difficult to both have the same concept of what that means because there's so much entropy with shattered glass that you and I would have completely different imagination of how that would look and how the lighting would be and what color things would be like. Right? But maybe you gave the model something that you like, right? Maybe you give it a style. Even style is really hard to describe to these things. Style can be a combination of different things that you like together. If you're thinking about your brand, you might be inspired by multiple things and wish that something could look like all three of those things. Right? So I think my sense is that it's probably going to be a combination of language, which gives you some sense of that, but it's going to need something that's even higher dimensionality than that, so that it gets closer to what you want. A model that probably emits something visual probably should be very visual, right? It probably should be - maybe it's using your finger on your phone and masking something. Certainly it's going to be typing something. Certainly it will be like, I like this thing out there in the world. I think vision models will probably have to be extremely multimodal compared to maybe the starting point of language.
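As context for the "knobs and dials" under discussion, here is a minimal sketch of how that kind of structural control works today with ControlNet in Hugging Face diffusers. The edge map, prompt, and checkpoint names are illustrative; the point is that depth or edge understanding currently lives in a bolt-on module rather than in the base model itself.

```python
# Conditioning generation on a precomputed edge map via ControlNet (diffusers).
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from diffusers.utils import load_image

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

edges = load_image("room_edges.png")  # a Canny edge map of a reference photo

image = pipe(
    "a cozy bedroom, bed against the left wall, warm evening light",
    image=edges,  # the structural "knob": the output must follow these edges
    num_inference_steps=30,
).images[0]
image.save("rearranged_room.png")
```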
Nathan Labenz: (34:51) Yeah. When you said the bit about language and just the way that it allows us to compress what's in our head, my brain jumps to, well, what about just direct brain reading? I wonder if there is - because we've seen some pretty interesting results over the last year of, with still fairly cumbersome hardware, but less cumbersome gradually over time. Do you envision a future where somebody sits down and puts like the Playground crown on their head and sort of doesn't have to say a word, but can instead - unlike the classic scene from Back to the Future, suction cup the thing to your head and then just kind of focus on your own mental state, kind of commune with the system and get your ideas out that way. I would have thought that was insane to ask, by the way, even a year ago, but it feels less insane today somehow.
Suhail Doshi: (35:45) Yeah. I mean, we can - first we can get words out of our brain. If we can go from thought to word, that would be amazing. I don't know if we can do that. I actually haven't studied anything about brain - trying to get brain output to get to any kind of accuracy. I think I saw something maybe where maybe you can move an object, like there's ways to like left, right, up, down or something like that. But I haven't seen - I don't know if there's any research that's thought to text.
Nathan Labenz: (36:11) There's thought to text. There is also thought to image reconstruction. It doesn't necessarily work as well as it might need to work, but this is part of why I kind of have this AI scouting concept here. I'll give you two links and you can check these out at your convenience. But one, the first one, we had a guy named Tanishq...
Suhail Doshi: (36:31) Yeah, I know Tanishq.
Nathan Labenz: (36:34) What was really interesting about his was that only a small sample was required - they used existing foundation models and kind of figured out a way, with a relatively small per-user sample, to map the brain activity into the kind of latent space necessary to then diffuse into the image. But the reconstructions are really quite good. That one uses fMRI, if I recall correctly, which is far from consumer. And then this other one from Facebook or Meta is magnetoencephalography, which is definitely still cumbersome, but less so and also fast. With this one, you can kind of see the accuracy is lower, but if you can reconstruct what they're actually seeing, then if they close their eyes and just imagine stuff, you would imagine that could start to work as well. And I also do wonder just about the degree to which people - a big obsession of mine in general is kind of how the world begins to change in response to AI, and I definitely think that people can kind of train their own minds to work with these systems, and these early results don't have any of that going on yet.
Suhail Doshi: (37:51) Yeah, I mean, I think the ultimate multimodal model, I think at the end of the day, it just turns - it's not really about tokens and language. The vision transformers are just taking tokens of an image. They're trying to create some kind of codebook of images by taking patches of an image. And then the language stuff is all subword stuff. But it just seems like at some level, it's just gonna be something very byte level. And if you believe that it would just become byte level or something lower level than these two things, then it starts to not matter so much whether it's language or vision, then it becomes something else. Then it's like the models have - it's almost like you've given the model a representation that's very normalized across all of these different modalities to get to some kind of unification of them. And I do wonder where it's headed. I don't know right now. I don't necessarily know that actually, maybe the transformer is not the right thing or certainly not tokens of language are necessarily the right thing. It's probably too early to tell because maybe hardware is not even good enough yet. Maybe the compute performance is not good enough yet. And there's still a ways to go to scale text or vision. My first thought with the brain thing is I was like, I wonder if it would be too high dimensionality. I wonder if it'd just be such - vision is already a lot of input data so people tend to need to find a way to compress it down to something simpler or more encoded rather. So my first thought was I wonder if the brain has very, very high entropy because I don't know anything about this. I suppose you could encode it too. But yeah, it certainly would be cool if you could look at your room and sort of imagine what if it was there and somehow it kind of knew that you could do that. Perhaps text to whatever isn't maybe not the ultimate first input, maybe it is thoughts or something like that. Well, that's what I see. If you can do thought to text, then that's the right primitive, right? Then you can get from text to maybe anything, whether it's audio or images or other text, then you'd have the right maybe way to decode, go from thoughts to whatever you want to. Yeah. I wonder if anyone is working on thought to text.
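A minimal illustration of the patch "tokenization" Suhail describes for vision transformers: cut the image into fixed-size patches and project each one to an embedding, giving a token sequence analogous to subwords in language. The shapes and dimensions here are arbitrary examples.

```python
# Turn an image tensor into a sequence of patch tokens, ViT-style.
import torch

def patchify(images: torch.Tensor, patch_size: int = 16) -> torch.Tensor:
    """(B, C, H, W) -> (B, num_patches, C * patch_size * patch_size)."""
    b, c, h, w = images.shape
    patches = images.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
    # (B, C, H/p, W/p, p, p) -> flatten each p x p patch into one vector
    return patches.permute(0, 2, 3, 1, 4, 5).reshape(b, -1, c * patch_size * patch_size)

images = torch.randn(1, 3, 224, 224)             # a dummy RGB image
tokens = patchify(images)                        # shape (1, 196, 768)
project = torch.nn.Linear(tokens.shape[-1], 512) # learned projection to model width
print(project(tokens).shape)                     # torch.Size([1, 196, 512])
```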
Nathan Labenz: (40:12) There are some thought to text ones as well. I wouldn't say they're kind of in the same general ballpark as the thought to image where it's like, woah, that's striking, and they do it a very similar way. You read a sentence or a paragraph and as you're processing that language, they're kind of trying to reconstruct what you have been processing. And they do get a decent reconstruction. It's not perfect in the same way that these image reconstructions are not perfect, but they're definitely directionally right. You look at it and you're like, you're not guessing. That's for sure. The resolution is not there, but that they are onto something is pretty clearly true.
Suhail Doshi: (40:57) I'd love to spend more time on brain interfaces. I have no idea.
Nathan Labenz: (41:00) When in doubt, go back and see what Kurzweil said in the nineties, and then we're getting to the time when it was supposed to happen. And that guy has been - I've started to use the term Kurzweil's revenge because I came across his work in the mid 2000s. At that time, the exponential curve that he was projecting, we were still in the low part. And so it was like, he said this five years ago, but not much was supposed to happen in these five years. So whatever, maybe it'll happen. Then 10 years went by and it was 2015 and it was like, wow, he was wrong about everything. And this guy, what a dreamer. And then now here we are. And it's like, actually, exponentials are crazy. He's maybe a lot more right than not. So apparently he's got a new book coming and it's The Singularity Is Nearer or something, I think is the - I'm not sure if that was the joke title or the real title.
Suhail Doshi: (41:51) It's worth noting - most of the time when I talk to people about a unified vision model, even leaders in the field have often not thought about it. A lot of people are sort of trapped in this kind of, we're just trying to do text to image and they haven't really thought too much deeper about that. Or the other version of this is a very narrow use case, right? Like e-commerce product placement or swap, changing clothes or filters like on Snapchat, that kind of computer vision stuff. But that kind of computer vision stuff was very 2020, 2019 style AI. Right? That was pre generative AI, pre kind of big generative models that can - I mean, the promise of these generative models, foundation models specifically is that they generalize so well to these other kinds of things that we're surprised by every day that we use them. We're excited about GPT-5 because - and we don't even know what it will do, but because we're surprised - we're excited by what we'll be surprised about. And I think that same moment just hasn't happened with vision. So I feel like I've kind of caught you a little bit off guard during this chat because - but I just wanted to mention to you that most of the time when I'm talking to people about vision, they don't see it either. There are a few people who I've hired, those people that happen to have the ambition or goal to try to do something a lot bigger, but we do not have a general vision model of any sort. And you can tell by its limited utility. Maybe it will be clear towards the end of the year when we go, woah, I didn't know that we could do that.
Nathan Labenz: (43:40) Yeah, I think that is a good point of comparison. I would agree that the sort of surprising capability doesn't seem to have happened as much on the image side as it has on the language side. So it's a good contrast.
Suhail Doshi: (43:57) But just consider if everyone - if OpenAI - there's this wonderful huge language race. It's great. I love it. I want people to focus on that because we're focused on vision. But consider for a moment that all OpenAI was doing was vision. If we could imagine that there was actually a version of OpenAI that was super laser focused on vision. And right now there isn't. As far as I know, there might be, but I don't think there is. What would these models look like? I think they would be far superior than DALL-E 3, vastly better than that.
Nathan Labenz: (44:30) Yeah. What do you - I guess one interesting question is like, what does and doesn't exist in the world that, in text like everything-ish exists. Right? You may not have access to all of it, but it's all kind of out there. Feels like in one way or another. With images and with the text-image use case, the images were never captioned with the intent of allowing you to create images in a text to image sort of way. So it's sort of this found dataset concept that, oh, hey, look at all this web scale data, these captions are super noisy. Obviously we've gotten better at that, filtering etcetera over time. But largely it is still kind of derived from the fact that people happen to caption their images sometimes. I guess there's maybe just a lot of missing data where, you could say, here's an image and I want - it reminds me of the bad Photoshop memes online, right, where people are like, oh, I want my boyfriend to be over here or whatever. Then the person makes a mockery of their request. But more productively, it seems like what's missing is that we just don't have those before and after transformations in image almost at all. Right?
Suhail Doshi: (45:48) Kind of thing, is there a lack of training data basically relative to say, language where maybe there's vast quantities of structured and good training data? Is that kind of what you're saying?
Nathan Labenz: (45:58) Yeah, like the transformations that you would want to do now that you're starting to imagine what a robust AI tool would look like are just very hard for humans to do, very rare online, require advanced Photoshop skills historically and just don't exist and just aren't posted all that much, I would guess.
Suhail Doshi: (46:16) It turns out with vision it's perhaps more prone to being able to succeed with even synthetic data. One benefit of all these images is that you can make - it's very easy to generate lots of them and annotate them and change things with them. So it turns out that actually you can probably retrain a good model and ground truth stuff, but then there are all these wonderful other models that you can use to train on probably synthetic data to get a really good new kind of general model that's capable of new things. A really good example of this for folks is that maybe the thing that put me onto this most early was InstructPix2Pix early last year, where so much of it is actually just synthetic data. I think its overall performance was not world changing or anything, but it gives you a glimpse of what it could be if given - I mean, this was just one researcher at Berkeley, I think under Efros. So I think, given a team that really cares, what could it be? How could it be? So I think that's one thing that might be surprising regarding this debate around training data. The second thing I would say is that part because of the massive revolution with text and specifically some of these text-vision models, these multimodal models that are kind of getting built now, whether that's at Google, OpenAI with GPT-4V, or there's some really amazing open source ones by students at academia. I think this guy named - I don't know, sorry if I butchered the way to pronounce his name. I think his name is Haotian Liu or something like that. He did LLaVA with some other people. And so there's just this - I think going into this year, one thing that I think will probably be true and it's kind of a natural progression from where we are with language is, and you probably need to augment language somehow is these really great multimodal models. And it just turns out that if you scale up these multimodal models with text understanding and vision, then you can do even more capable things. It might be surprising how that can help with vision more significantly, maybe not too surprising. But certainly the language part was very important for these multimodal models to be as powerful as they are. For example, hopefully image things like image segmentation get better, but the captioning gets better. One thing everybody knows in the field for vision or at least with graphics is that that LAION dataset is really poorly annotated. It's sort of a best case effort, but it's very poorly done. Right? I think the folks that made it did the best they could, but the data is not great. And so if you have a multimodal model that's extremely good. The DALL-E 3 paper talked about how that team just completely recaptioned the dataset to be more accurate. And then that significantly improved prompt alignment for the image models. And now you get better prompt alignment. Now suddenly, if you ask the models to - one little test that I like is create an image where you've got four bottles and they're numbered one through four, but ordered backwards. Right? You try to do these little puzzle challenges for the model and suddenly these kinds of spatial reasoning tasks become possible. Or if you do recaptioning on text synthesis it starts to get better, performance starts to dramatically increase. Now we're starting to see Midjourney and DALL-E 3 be better at text. So it just turns out that actually the overall progression of the entire AI field helps quite significantly with vision. 
And so I think this is just going to keep happening.
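Here is a hedged sketch of the recaptioning idea discussed above: run an open captioning model over the images and replace their noisy alt-text with fresh, denser captions. BLIP is used for brevity; a stronger multimodal model like LLaVA would be the natural upgrade, and the model choice and file names are assumptions.

```python
# Recaption images with an open captioning model, ignoring the original alt text.
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-large")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-large")

def recaption(path: str) -> str:
    """Generate a new caption for one image."""
    image = Image.open(path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=50)
    return processor.decode(out[0], skip_special_tokens=True)

for path in ["dataset_000001.jpg", "dataset_000002.jpg"]:  # placeholder filenames
    print(path, "->", recaption(path))
```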
Nathan Labenz: (50:00) So if I'm understanding correctly how you are expecting things to develop, it's almost kind of a mirror image in some ways to language where the language progress - at least the canonical breakthroughs seem to be just scale it up, and then, oh my god, look at this, we've got few shot learning, these kind of quote unquote emerging capabilities that we didn't train for. And now of course we're refining that and we're doing curriculum learning and alignment and a million things. But it was kind of like first the brute force - that you just dump all the text in, run it and something amazing comes out and now we'll refine from there. On the vision side, the path is more of, okay, we have all these different specialist models and now those are going to help us kind of create the super general dataset which doesn't exist totally as needed in the wild. And then we can kind of go like Captain Planet with our forces combined, now we can sort of summon the truly generalist vision model that you're dreaming of.
Suhail Doshi: (51:08) They're kind of like these pros and cons, right? On one hand, images are so rich in terms of information relative to language. Just look outside your window for a minute and try to really look at every object. Right now I'm looking at plants and even just looking at a plant, you can kind of see how do the leaves fall, how do they interact together, how does lighting react to the plant. And some of this is graphics knowledge, but some of this is just world knowledge. Right? A plant on a roof or how buildings look or how they're situated. Right? So images on one hand are extremely rich in terms of data. They're so rich, but then they sort of lack the right annotation and labeling. And so on the other hand, text perhaps, maybe has really great, rich, wonderful labeling inherently. Right? That's just inherent in language anyway. But on the other hand, they're very lossy. Language is very lossy. We don't have that many words to describe things. Right? And the words certainly are not descriptive enough. So on the other hand, there's cons to this and they have different utilities in our life. I can't say that - it makes sense that this happened, but it just so happens to be that we've got these amazing text models that are starting to lead to amazing multimodal models with vision. And that's probably going to help us fix some of our understanding of the images, like what's really going on in some of these images, where are things located? Why are they the way they are? And so my guess is, if you think that vision is mind blowing, I kind of look at vision as maybe about a year or two behind where we are with language. And it's wonderful actually that these multimodal models are getting so powerful because they're going to be very helpful with vision. They're going to lead to vastly better use cases for vision than they do today.
Nathan Labenz: (53:02) That's a good guide to what the future might have in store for us. How are you thinking about building a business through this maturation of the technology? Are you trying to grow a lot of revenue today? How much do you care about adoption and your relative position in the market compared to other options? From a funding standpoint, I know you had some capital already on hand when you pivoted into this, but I imagine more would be helpful. How are you thinking about what do you need to do to raise more money if you need or want to do that? Just the timing of it. It's very weird. Right? Because it seems like unlike previous technology waves, so much is in the future relative to what you can achieve today. I wonder how you think about metrics, milestones, proof points versus this grand strategy, how you kind of balance those and try to make them work together.
Suhail Doshi: (54:00) Our overall main quest is to make a unified vision model, whether it's images, video, 3D, etcetera, something that has true general understanding of pixels and such. But I think a very simple thing we're doing this year is we're only going to be working on two things. We're going to work on making a really great graphics model for just static 2D images, because a vast majority of the utility there has still gone unsolved. And so, yeah, I think it's not that complicated, in the sense that right now, if you want to do things with Photoshop or Lightroom or Illustrator or whatnot - I'm not an expert at these tools. There was a time where I was doing Photoshop tutorials in high school, trying to compete in logo contests and stuff. Since then, my skill has greatly atrophied, but I'm not an expert at this and I think a vast majority of humanity is not. It seems like there's clearly people that buy graphics or manipulate pixels and wish they could and just don't. Take the picture to get a better one. And then there's people that have really amazing skill. They're expert color graders. They know how to get rid of dangly hairs in your wedding picture. They make logos in Photoshop, whatever, or they're really amazing illustrators. And it takes a lot of skill to be that good. And so I think to the extent to which we can kind of give more of humanity the ability to do that kind of work, but kind of own it - instead of going through a third party, be able to actually do some of these things on their own very easily - I think that will open up a lot of doors for us. And so I think we'll be very focused on just creating images and editing them throughout the year. And I hope that leads to a lot of happy users who I think have no alternative right now other than kind of asking a friend or going somewhere where there's someone that is talented - hiring somebody, basically.
Nathan Labenz: (56:06) That feels like a bit of a change from last year, and maybe from the usual advice of fast iteration and talking to users. I've heard similar comments from Sam Altman recently, where he said: at YC, I told everyone to launch super early and iterate super frequently, and then at OpenAI we took four years to launch anything, we had a massive capital outlay, and we didn't really know what the use case was gonna be. It sounds like you're shifting a little more toward that approach: build something truly awesome - we kind of know what that is in our gut - and then everything else will follow from there.
Suhail Doshi: (56:47) Maybe we didn't talk about it too much last year, but we've kind of had the same plan since the company started. But I do wanna say, over the last year we shipped like 100 things.
Nathan Labenz: (56:56) Yeah. The product has changed a lot. No doubt about that. Every time I've gone in there, it has been notably more feature rich for sure.
Suhail Doshi: (57:03) We have a Slack channel where we see every complaint from a user. "I don't like the hands." "You changed too much." "How do you..." - I get DMs from customers or even friends: "Hey, how do I make this character consistent?" And it turns out there's no quick fix for these things. There's a certain class of problems where you accumulate them and you go, well, actually there is no quick fix for this. The worst product we could make is one with a fix-the-hands button, then a fix-the-eyes button, then a character-consistency button. Nobody wants to use a product like that. That would be a very complicated product to use: you have to know where everything is, people get confused, and you have to figure out ways to train them.
Nathan Labenz: (57:49) You gotta train a whole language model on top of that.
Suhail Doshi: (57:53) Right. And then you have to know how your language model is choosing which model to use at what point. And of course, now your inference times are going up, because you're running each model in the pipeline to fix one thing, and this model fixes that thing but then makes some other thing worse. Right? This is not a great experience. This is all patchwork stuff that's just not going to lead to a world-class product. So my belief is that we've actually distilled what users are frustrated about every single day with the current state of the art. We bought a lot of GPUs and we found a lot of really amazing researchers. It's not like we're going to solve every imaginable editing and graphics problem on day one, or wait three years and then have something. As long as we can make a few things very useful - generally useful things that attract a lot of users who go, wow, I'm finally happy someone solved this problem - I think you can start from something kind of small that becomes bigger. A good example of this is Midjourney: if you look at what it looked like early last year compared to this year, it might seem surprising that anyone paid early last year. Right? But nonetheless, it was the best at the time. These things grow and mature. Maybe what we'll have is something like a PlayStation 1 or PS2, but still very fun to play - or use, rather. Then over time we'll keep making our models better and better, and I hope the use cases will expand.
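For illustration, here is a minimal sketch of the patchwork pipeline Suhail is arguing against: a router, standing in for the language model, picks single-purpose fixer models one at a time. Every model name and function here is a hypothetical placeholder; the point is that each added fixer is another model call, and any fixer can degrade what a previous one improved.

```python
# Illustrative sketch only: the "patchwork" editing pipeline Suhail argues
# against, where a router picks single-purpose fixer models one by one.
# Every model name and function here is a hypothetical placeholder.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Fixer:
    name: str
    apply: Callable[[bytes], bytes]  # takes image bytes, returns edited image bytes

def fix_hands(img: bytes) -> bytes:                   # hypothetical hand-repair model
    return img

def fix_eyes(img: bytes) -> bytes:                    # hypothetical eye-repair model
    return img

def make_character_consistent(img: bytes) -> bytes:   # hypothetical consistency model
    return img

FIXERS = {
    "hands": Fixer("fix hands", fix_hands),
    "eyes": Fixer("fix eyes", fix_eyes),
    "character": Fixer("character consistency", make_character_consistent),
}

def route(complaint: str) -> list[str]:
    """Stand-in for the language model deciding which fixers to run."""
    return [key for key in FIXERS if key in complaint.lower()]

def edit_image(img: bytes, complaint: str) -> bytes:
    # Each selected fixer is another full model call (latency stacks up),
    # and any fixer can undo what a previous one improved.
    for key in route(complaint):
        img = FIXERS[key].apply(img)
    return img
```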
Nathan Labenz: (59:30) One other big-picture question I wanted to get your take on: obviously we're in this moment where all this stuff is so new, the fallout is just beginning - mostly positive, some negative - and in many cases we really don't have new guiding rules or principles for how we should handle it. What should the rules be? I just did another episode on New York Times versus OpenAI. My super big-picture question to you is: do you have any thoughts at this point on what the rules should be? That could be at the training data level. I think Japan is going in the direction of making everything available for everybody to train on. You could imagine people should be compensated if they're included. You could imagine people should be able to opt out, but not necessarily be entitled to compensation. You could imagine I should not be able to ask for everything in the style of Greg Rutkowski, or maybe that's okay. Can I make Mario and Luigi on Playground AI? There are just so many questions. I don't expect you to answer them all, but I wonder what your emerging sense is of what the rules of the road for the generative AI era should be.
Suhail Doshi: (1:00:42) I try to sometimes put myself in the shoes of the artists or the people making these images, photographers, whoever, because obviously we have a rational self-interest in training on data like that. I read Greg Rutkowski's story, his op-ed, and tried to understand him better to the extent I can. I've not talked to him, but I looked at his page on DeviantArt and I get his style. And I think there are two things happening. One is that people don't really wanna just make art like Greg Rutkowski's; it's not very exciting, to me. The users of these products are actually people that think of themselves as artists in some ways, and great artists don't really want to copy another artist and claim their art. There's no pride in that. It's not like these people are making images and selling them for thousands of dollars or anything, so I don't think a vast majority are taking pride in that. The other thing is, after I read his article, we became the first site ever - and I think still the only site to this day, for whatever reason - where if a prompt on our site references his name, we directly link back to his page. We literally call it "additional credit" and, wherever possible, link back, because one of the things he said was that in Google search, people can't even find him anymore. And I thought, well, maybe one way we could at least help him is by helping people find his actual art, pay for it, buy it, whatever - at least give him some fame some other way in a world where, if the pain was growing for him, maybe we could ease that a little bit. I'm not saying that fixes or alleviates the issue for Greg in any shape or form, just that I think there are some small things we can do. On the bigger things, the other reality is that it's not like you can't make Mario and Luigi in Photoshop or Illustrator. The rules around that are: of course, you and I can gain the skills and make Mario and Luigi or Mickey Mouse or whatever, but we can't just run around selling t-shirts with Nintendo branding either. So I think there's a balance point where it might be generally okay to make things - in fact, many brands are perfectly fine with fan art - but then there's the question of commercial use, and that's where things stop. It's fair to say, oh, people are just having fun; but when they start selling t-shirts in the thousands with a Bizarro World looking Mickey Mouse, maybe that's not okay. And even if the models are not necessarily trained on Mickey Mouse or Mario and Luigi, it's going to be very hard to guarantee they have no representation of them. Even if you try your best to comb through the dataset, there'll always be something missing, or the model will just have some way of understanding the concept. So I think that's a very impractical thing to enforce. The genie is a little bit out of the bottle, but I think it might be better to talk about use as opposed to strictly training data. That's kind of my view.
I also think the last part of this is how fast it goes. If this whole thing is very gradual, then society will probably find some way to assimilate to it. If it's vastly faster than that, if it's a step function each time, then I think we definitely need to do something about it. Who it impacts very much needs to be considered. People cannot just be thrown away; you can't take some craft they've been training at for a decade or two and ask them to upskill, evolve, and catch up to the times within a couple of years. I don't think that's a rational path. So that's kind of my rough triangulation of things. My feeling on rev share for training data is that it seems like it's going to disappoint people more than it's going to be practical and helpful, so I'm less sure how we would really achieve that.
Nathan Labenz: (1:05:05) Yeah, it's tough. Right? If you're the artist, or even just the PowerPoint technician, who used to charge however many dollars for a job, and now that can be done for a cent or whatever on a technology tool, the rev share on that cent is not gonna buy too much.
Suhail Doshi: (1:05:24) Yeah. I make music on the side, and I sometimes think about, what if people trained on my music? What if I was a bigger artist and they trained on it - how would I feel? With that sort of rev share model, streaming revenues are already really bad for music artists. This penny-per-thing always makes the artist feel greatly undervalued; it leaves tons of resentment in the industry. So I'm not so sure about that. On the other hand, I think artists very much crave an audience, and copying art is definitely not the same as buying art. That's a very different relationship people have with artists. If I make a song that sounds exactly like Avicii, people will say, you just sound like Avicii, but that doesn't mean my song will become big in the world, because people have built a relationship with the artist. So I think something that really caters to giving a spotlight to the artists, the real people involved in doing this stuff, would be very helpful, because there's still a big audience of people that want that, even if you could reproduce the same exact image or something like it.
Nathan Labenz: (1:06:34) So how fast do you think it's gonna go and do you think we are headed for kind of a new social contract? I mean, I'm getting UBI vibes a little bit from a couple of your comments there.
Suhail Doshi: (1:06:46) No, I don't have any strong opinions on how to solve it. There are much better minds, people much better versed in understanding what happens with technological change: how do we retool people, how do we - maybe it becomes like a tax, I don't know. Maybe it is kind of a UBI-style tax for certain industries, and certainly investing in helping people upskill. One funny thing out of the art debate: I once DM'd with someone who really hated the AI art stuff, I think about a year and a half ago, and some of these people don't want to retool. One of them said, I just enjoy drawing. I don't want to retool. I don't wanna get new skills. I just enjoy this craft. So there's a practical reality about that: you can't ask people to retool when they don't enjoy the craft anymore, because they enjoyed it the way that it was. So I definitely think there'll be some version of this. I don't know how you distribute money like that; I don't have any sense of it. I do think the pace is going to be very fast, though. It can't be ignored. I think we'd be making a very big mistake in the technology industry if we just throw caution to the wind and say, we're going to do this very quickly, we don't care who it impacts. The last decade of technology, and the tech industry in particular, has shown that when we do that, we really disenfranchise people, and that creates a bad circumstance for our industry. I hope we've learned that lesson.
Nathan Labenz: (1:08:27) Yeah, I totally agree. Okay, one more little follow-up, because I can't resist. I've been thinking lately, not so much about the art side, but I think there are probably a lot of commonalities: what are the minimum standards that AI application developers should be expected to uphold? Meaning, if you didn't do this, you would be worthy of shame within the industry, even if there are no laws yet on the books to control what happens. I wonder if you have any thoughts as to what AI developers should demand of one another. Presumably there will be actual rules coming, but as we ease into that, hopefully we can even inform what they might look like so they don't end up being dumb, which is a very real risk, I think. What would be a good self-governing standard, or at least the kernel of one, that we can all point to and say: hey, we've gotta do this at least, guys?
Suhail Doshi: (1:09:27) There's already a pretty reasonable consensus around safety. Unfortunately, safety kind of got overloaded with alignment and stuff, so I try to come up with a different word sometimes for these situations. Anyone operating these models has seen extremely bad actors. We all have our version of the story that just seems extremely, extremely bad. There's just general evil out there in how people are using these tools. One of the things we did at Playground: there's an existing state-of-the-art safety filter that sort of stops this content - regardless of your point of view, we don't personally want nude content on our site, for example, and other forms of content like that - and we ended up training a new, sort of state-of-the-art safety filter that went further. Of course it helps us - we don't have to moderate as much - but it also reduces significant evil that tends to occur in the world. Not that nude images in particular are evil, but there is a definite class of images that is more evil, or illegal in our country, than other classes of images. People are using images for deepfakes, using them out of revenge; we're going into a new election year, things are going to be manipulated, and we're going to have to explain to our aunts or uncles or parents or a brother or sister that the thing they saw on the internet was actually not real, even if it fit their narrative. So I think just combating the sheer disinformation, misuse, and evil that might occur with these tools matters, because the tools tend to be extremely powerful. They have incredible amounts of value for the world, but it turns out they can totally be used for nefarious things. That seems like the most basic thing we can all do. And one thing I wish is that the safety parts of this business were a lot more open. OpenAI has a moderation model. We use it. It's free; they don't charge you anything unless you use too much of it, and too much of it is a very high threshold. But I just wish we had far more open collaboration around this kind of stuff, because we all have different kinds of data, and we would all do better if we worked a little more closely together on it; it would be better to have a state-of-the-art model. It's not that difficult for nefarious actors to test these things and try to work around them. I think we could win this cat-and-mouse race if there were a larger set of companies working on an open set of safety models. I don't think it would take away from anyone's core IP or anything like that; the world would be better off for it. And that model's performance would end up being kind of a definitive answer to your question about the minimum moral standard we should all uphold. The minimum moral standard is high-dimensional too; it could be encapsulated in a model that we all work on together, all contribute to, and that the best researchers in the field work on together.
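As a rough sketch of what wiring the moderation model Suhail mentions into a generation loop might look like, assuming the current OpenAI Python SDK: the prompt check uses OpenAI's real moderation endpoint, while the image-level classifier is a hypothetical placeholder, not Playground's actual filter.

```python
# Sketch of a prompt-level safety check using OpenAI's moderation endpoint,
# which Suhail mentions Playground uses. The image-level check below is a
# hypothetical placeholder, not Playground's actual filter.
from typing import Callable, Optional

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def prompt_is_allowed(prompt: str) -> bool:
    resp = client.moderations.create(input=prompt)
    return not resp.results[0].flagged

def image_is_allowed(image_bytes: bytes) -> bool:
    # Placeholder for an in-house image safety classifier like the one
    # Suhail describes training; assume it returns True if the image passes.
    raise NotImplementedError

def generate_safely(prompt: str, generate_fn: Callable[[str], bytes]) -> Optional[bytes]:
    if not prompt_is_allowed(prompt):
        return None                      # refuse before spending GPU time
    image = generate_fn(prompt)
    return image if image_is_allowed(image) else None
```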
Nathan Labenz: (1:12:45) It's a beautiful vision. The opportunity and the peril of generative AI, I think very well articulated there. I think it's a great note to end on. I will say, again, Suhail Doshi, thank you for being part of the Cognitive Revolution.
Suhail Doshi: (1:12:58) Thanks for having me.
Nathan Labenz: (1:12:59) It is both energizing and enlightening to hear why people listen and learn what they value about the show. So please don't hesitate to reach out via email at tcr@turpentine.co or you can DM me on the social media platform of your choice.