Training the AIs' Eyes: How Roboflow is Making the Real World Programmable, with CEO Joseph Nelson
Roboflow CEO Joseph Nelson discusses why computer vision still trails language models and how his team uses techniques like Neural Architecture Search to build efficient, task-specific models, exploring global competition, real-world deployments, and emerging regulation.
Watch Episode Here
Listen to Episode Here
Show Notes
Joseph Nelson, CEO of Roboflow, breaks down the current state of computer vision and why it still lags behind language models in real-world understanding, latency, and deployment. He explains how Roboflow distills frontier vision capabilities into efficient, task-specific models using techniques like Neural Architecture Search and RF-DETR. The conversation covers Chinese leadership in vision, Meta and NVIDIA’s roles in the ecosystem, coding agents, and emerging S-curves from world models to wearables. Nelson also explores aesthetic judgment in AI, real-world applications from agriculture to sports, and why outcome-focused regulation matters.
Sponsors:
Tasklet:
Build your own Cognitive Revolution monitoring agent in one click.
Try it for free and use code COGREV for 50% off your first month at https://tasklet.ai
VCX:
VCX, by Fundrise, is the public ticker for private tech, giving everyday investors access to high-growth private companies in AI, space, defense tech, and more. Learn how to invest at https://getvcx.com
Claude:
Claude is the AI collaborator that understands your entire workflow, from drafting and research to coding and complex problem-solving. Start tackling bigger problems with Claude and unlock Claude Pro’s full capabilities at https://claude.ai/tcr
CHAPTERS:
(00:00) About the Episode
(04:23) State of computer vision
(12:29) Is vision solved
(19:41) Frontier models and failures (Part 1)
(19:46) Sponsors: Tasklet | VCX
(22:39) Frontier models and failures (Part 2)
(32:16) From cloud to edge (Part 1)
(32:21) Sponsor: Claude
(34:33) From cloud to edge (Part 2)
(43:25) Data needs and scaling
(50:52) Open source vision race
(01:01:38) NAS and productization
(01:12:24) Aesthetic judgment challenges
(01:17:22) Future horizons in vision
(01:31:18) Wearables and daily life
(01:43:06) Regulating AI vision tools
(01:51:00) Episode Outro
(01:56:39) Outro
SOCIAL LINKS:
Website: https://www.cognitiverevolution.ai
Twitter (Podcast): https://x.com/cogrev_podcast
Twitter (Nathan): https://x.com/labenz
LinkedIn: https://linkedin.com/in/nathanlabenz/
YouTube: https://youtube.com/@CognitiveRevolutionPodcast
Spotify: https://open.spotify.com/show/6yHyok3M3BjqzR0VB5MSyk
Transcript
This transcript is automatically generated; we strive for accuracy, but errors in wording or speaker identification may occur. Please verify key details when needed.
Introduction
Hello, and welcome back to the Cognitive Revolution!
Today my guest is Joseph Nelson, CEO of Roboflow, a computer vision platform that supports more than 1 million engineers and more than half of the Fortune 100 as they seek to turn proprietary image and video data into a competitive advantage.
We begin with an overview of computer vision capabilities today. Joseph notes that while language is fundamentally a human construct and inherently optimized to be understood, the real world contains a fat tail of chaotic scenes, which are not-at-all optimized for understanding – and thus, just as the Vision Transformer came about 3 years after the original Transformer, computer vision today is roughly where language capabilities were 3 years ago with the introduction of GPT-4.
Which is to say that while frontier models can do amazing things, and most problems can be solved if you're willing to put in the work to fine-tune and pay any inference cost, we have a ways to go before foundation models will be able to do it all.
To make this concrete, Roboflow maintains a site called visioncheckup.com, which highlights the spatial reasoning, precision measurement, and grounding failures that still plague even the best multimodal models today.
And importantly, even when frontier models can solve a particular task, you can't wait 40 seconds for a reply when you're powering instant replay at Wimbledon or monitoring for defects on a high-throughput manufacturing line, and so there's often still a lot of work left to do to get vision models running efficiently enough to meet production latency and edge deployment requirements.
This is where Roboflow comes in, and I was super interested to hear Joseph describe what it looks like to go from an open-source vision model to deploying your own task-specific model today.
He emphasizes the importance of establishing clear requirements upfront, as the performance thresholds that different customers need to hit on their respective use cases vary dramatically.
From there, the process often involves distilling frontier model capabilities into much smaller models, like Roboflow's own RF-DETR model, which they derived from Meta's DINOv2 backbone, using a really interesting training process called Neural Architecture Search, which in turn uses a weight sharing technique to train thousands of network configurations at once, all within a single training run. This process ultimately produces a set of models of varying sizes that collectively map out a performance Pareto frontier, and today, Roboflow has productized this approach, so that anyone can now run it on their own dataset and come out the other end with an N-of-1 model optimized specifically for their problem.
From there, we cover a number of additional topics.
- Joseph explains that Chinese companies have consistently led in computer vision, how much the American open source ecosystem currently depends on Meta, and why he's optimistic that NVIDIA will fill the gap if Meta's new AI leadership changes priorities.
- He also describes how coding agents are expanding the market for Roboflow's tools, how skills are emerging as a new go-to-market vector, and how Roboflow plans to use a first-party agent to guide users through the process of building computer vision pipelines.
- We also discuss the state of AIs' aesthetic taste, and why the inherent subjectivity of aesthetic preferences makes this such a hard problem.
- We hear about the emerging S-curves Joseph is watching, including world models, Vision-Language-Action models being developed in robotics, inference-time scaling for vision, and wearables now selling millions of units per year.
- We get his vision for how computer vision contributes to a good life as AI matures, which includes everything from precision agriculture and food safety to self-driving commutes and real-time sports analytics;
- And finally, he explains why he worries that overly-opinionated regulation could accidentally stifle all sorts of surprising but valuable use cases and recommends that policy-makers focus on outcomes instead of trying to regulate the tools people are using.
When it comes to computer vision, Joseph has quite literally seen it all. So whether you're looking to catch up on computer vision, like I was, or looking for a practical framework with which to approach a specific challenge, I think you'll find a lot of value in, and I hope you enjoy, my conversation with Joseph Nelson, CEO of Roboflow.
Main Episode
Nathan Labenz: Joseph Nelson, CEO at Roboflow. Welcome to the Cognitive Revolution. I'm excited for this. So regular listeners will know I really got into AI in a full-time obsessive way in my role as founder of Waymark, and it was such an exciting time four years ago when things were just starting to work. I ended up going really deep on what was available in computer vision at the time, with CLIP and BLIP and BLIP-2 and CLIP embeddings, and trying to figure things out. The problem that we had at the time was we have all these small business users. We had developed a pretty good technique for scraping their websites and their kind of online presence and creating an image library for them. But then what to do with that image library, right? Initially it was just a total jumble of photos. We couldn't make any sense of it. We made very blind guesses as to what we would actually put into content for them. And that obviously had a long way to go before it really started to work. So I had a ton of fun in the 2022 into 2023 timeframe, getting deep into the weeds on that stuff. And obviously a lot has happened since, so I'm really excited to catch up on a few years of computer vision progress in 90 minutes or so. Maybe let's start by just setting the stage. Where are we today in terms of computer vision? You can come at that from a lot of different angles. Maybe start with use cases. What are the use cases out there that are really well-established, that are driving the most volume, that are driving the most value? Give us a survey, a lay of the land.
Joseph Nelson: Since you brought up CLIP, maybe we can start in terms of some of the research that's powering what's now possible and then the sorts of use cases that flow from there. With vision, it's funny because when you think about AI and the trends of machine learning, originally, for a lot of it, vision was home. You had ImageNet, you had MNIST, and deep learning gave rise to, is that photo a cat or a dog on the internet? And then you had language, I would say, almost jump out and take the lead in terms of wow factor and in terms of understanding with the introduction of the Attention Is All You Need paper in 2017. And then you almost have five years of language cooking with scaling laws and Chinchilla and GPT-2, and then you start to get language products. GPT-3 and 4 is really where things started to break out, with ChatGPT in 2022. That five-year time delay, from the introduction of the transformer to products that become used by nearly a billion users every single week, is now happening in vision, because you had the Vision Transformer get introduced in 2020. And so that ends up being another stepwise change of what's possible and what capabilities are easy, or easier, maybe out of the box. But to your point, what's interesting is historically there's been this divergence of, is this a language problem or is this a vision problem? And modalities are crashing together, because just like our brains, you get more context if you can use language and vision together. However, there's some pretty meaningful differences in visual understanding, both in the way that visual models work, but also the use cases of where it's most impactful. And I'd say actually one of the biggest ones is, I think about the analogy of the way our brains work as a useful way to inspire how our systems for visual reasoning will work. As in, we have this big LLM reasoning engine in our heads that is our brain, but we also have the rods and cones and visual cortex that operate and make decisions quickly. We jokingly call this your lizard brain, a fast-reaction way of understanding the world. And biology has evolved to having specialized systems for visual understanding, distinct from broad-scale reasoning, with the amount of neurons dedicated to that being more than any other sense. And I think the same will be true, from biological inspiration, for the systems that get used in machine learning. So what does that mean in practice, beyond that abstract idea? It means a lot of stuff runs at the edge. A lot of stuff runs low latency. A lot of stuff runs out in the real world. So for example, in a lot of language or multimodal or multi-agent reasoning problems, you can have the benefit of assuming you have maybe near-infinite compute, 'cause you can run a long-running job on a data center. A lot of visual tasks where vision is most useful tend to be where you don't already have a human or eyes on the problem. You're understanding an environment, maybe in a remote location, maybe a manufacturing line. Maybe it's you're shipping a product. Maybe it's you've got cells underneath the microscope.
Joseph Nelson: Maybe you're looking through a telescope and discovering new galaxies. Maybe you're building robots. And for a lot of those use cases, not all, but many, you need fast reaction times in addition to large-scale reasoning. And so you see this increasing divergence and specialization of where vision is especially helpful: for low latency tasks and for things that are, it feels maybe intuitive, out in the real world, like systems that we want to observe, where LLMs and language are inherently a human construct. But the visual world isn't inherently a human construct, right? Like language only exists where people do, and in systems that humans have crafted. The world's much bigger than just language, and anecdotally, the amount of distinct scenes in a day is more diverse than the number of unique words you probably read in a day. And so that heterogeneity, that richness, makes visual reasoning, I think, harder. It means that the long tails are fatter, and it means that the use cases tend to be out in the world, for lack of a better way of describing it. So the use cases that we see, we become a natural sample of where visual AI and computer vision is being used in the real world. About a million devs download our open source every 30 days, and half the Fortune 100 build on the platform. So we have this kind of insight into what's actually making its way to production, and where are people tinkering? It tends to be these things that are, I'd call them operationally complex problems, maybe in the enterprise sets of use cases. And from the platform, you get broad amounts of inspiration, which could be a hobbyist that wants to understand, I don't know, I like to play board games, the dice that you just threw. I swear every time I play Catan, my numbers don't get rolled the most, so I want a camera to prove to my friends I'm the most resource-efficient compared to the resources that I drew from the way the dice came out. That's somewhere between a serious and a joking example. Or there's this YouTuber that maintains a channel called Dave's Armoury out of Canada, and he's built a flame-throwing, weed-killing robot with Roboflow, or he built his son a self-driving couch that follows him around the yard, these silly things. And then you have more serious use cases, like powering instant replay at sports broadcasts at Wimbledon, or doing quality assurance on products that are being produced at Rivian, or any sort of physical-world thing. But one thing that I deeply believe, and I think the rest of the world's kind of coming around to this, is that visual AI, visual understanding, and at least even that part of multi-agent reasoning, is going to be bigger and more important than just language. As in, for AI to reach its most potential, it needs to be out in the world. It needs to understand, see, reason, because the world's a pretty big place, and the universe's even bigger. And for the systems that we wanna use and rely on, they need that type of capability. So linking that back to the research that's progressed, there's a lot of work to be done, but there's been a lot of progress from CLIP to present that we can talk about a bit, maybe in detail.
But to set the stage, I would just basically say I'm really optimistic that we're approaching the ChatGPT moment for vision, and the infrastructure to power all of that is coming online, which means you're about to see a Cambrian explosion of all the places it shows up, and consumer expectations are just going to be disappointed absent the ability for folks to have visual understanding in the products and services we use day to day. So that's maybe how I'd think about what's going on and what use cases underpin the types of ways the research is making its way to production so far.
Nathan Labenz: When you said that you think we're coming up on the ChatGPT moment, I thought that was quite interesting, because sometimes I give talks to kind of an audience that I'm trying to catch up on what's going on with AI. And I often give them the MNIST example as like, you know, look how simple this is for us. But we still can't write explicit code to identify these handwritten digits. As simple as that problem is, there's no explicit algorithm for it even today, right? Okay, that kind of gives people a sense of why it is that we need this sort of fuzzier kind of intelligence. And then I zoom forward to the GPT-4 system card and I show the image that I'm sure you're familiar with of the guy hanging off the taxi in New York doing ironing on the back of the taxi, and show them how we went from an ImageNet breakthrough in 2012 to that capability ten years later, where the model says, it's unusual to see this guy hanging off the back of the taxi doing ironing. So I was going to ask, and it sounds like your answer is going to be no, but I want to get a lot deeper under the hood on that. I was going to ask, like, to what degree could we consider vision almost a solved problem? And I know that not everything works immediately out of the box. Not everything is going to work maybe at the cost profile that you'd want or the latency requirement that you have. But if we started off with just, is there any vision problem that we couldn't solve if we really put our minds to it today? That was the working definition I had for solved problem. Do you think we are not there? And if we're not there, why aren't we there? What can't we do yet?
Joseph Nelson: The way I think about something being a solved problem is, I just ask the model and it almost impresses me. It delights me that it already understands and can do the thing I asked it to. And so I think that's why ChatGPT was such an aha moment for folks, because you no longer had to train a model to understand sentiment or describe text or whatever. Just like, I talk with it, it talks back, it feels like talking to, I don't know, a first grader perhaps. And so, in vision, solved problem? I think there's a subset of places where that's true, but it's not nearly as solved as language is. And I think the reason for this, in my mental model for what it takes to get us there, is what we were mentioning earlier, that the world's very heterogeneous compared to language. If you think about this in a very first-principles way: the amount of data that it takes to encode text with Unicode. Like, I can represent all of text in Unicode in memory much more efficiently than even representing a single image, because I have three color channels, 0 to 255, RGB, pixel by pixel. And so that data disparity, of how much more information it takes to even encode a visual scene, I think is maybe an anecdata example of why there's more heterogeneity in understanding visual scenes. But maybe more concretely, it's again, the number of scenes in a day is different than the number of words that you read in a day. So my mental model for this is I think about the world as your standard bell curve distribution. And at the center of that curve, what we're measuring is the frequency with which things exist out in the world. If you went out, let's say, and you took a walk, or maybe just throughout your full day, and you wrote down every object that you see, and then you went and looked back at your notes and how many objects you saw and maybe how long you looked at them, you would have something of a bell curve of the things that you saw that repeatedly showed up: person, car, food. And there would be some things that are longer tail, if you will. I don't know, maybe one day you were changing your oil, so you were under the hood of your car. Now, even that's something that you're not gonna do every day. So, vision, having a model that can reach into those long tails, because it's heterogeneous, and because I think those tails are fatter, it's taking just a bit longer to have the data represented, as well as the models that can reason about all the various different scenes and videos that exist out there. So what does that mean? Some things are, quote unquote, solved problems, like count people in this image, or increasingly OCR feels like a solved problem.
Joseph Nelson: There's a model, GLM, recently that we're really excited about; it can run in real time. You can query it with, hey, how much was my salsa from this receipt, or, from this Google Street View image, what's the house address on the left? And it's able to visually reason and extract and pull the correct answer almost always. And so something like that feels closer to a solved problem. But the nature of how diverse some scenes are is it's gonna take representation and probably some reasoning models to be able to reach into those long tails. And what's happening, I used to say in slow motion, but it feels faster, what's happening in vision is you're getting models, multimodal models, and this is the big LLMs like Gemini, just as much as open models like Molmo, just as much as models like the DETR family of transformers, that are increasingly pushing outwards on this visualized bell curve, where more and more of those things that you see on a given day are understood zero- or multi-shot. It becomes maybe a semantic question of what you consider a solved problem. Like if you're in the middle of that bell curve, yeah, it's a solved problem. But if it's something where it's so impressively surprising and delightful, it becomes a question of how long until someone starts to query and ask for things that wouldn't have been represented and trained. And so I think that we're riding that curve, like the expectation is getting faster. Now, one other complexity in vision is what we talked about earlier, which is a lot of vision is edge-constrained environments. You want answers now. You're running a webcam, it's on your phone, it's in the palm of your hand. And so that means you also don't have the benefit of maybe waiting 40 seconds for a reply from a model for the thing that you were interested in querying. Which also means, it doesn't mean that the problems are intractable, but heuristically I see maybe an 18-month delay between a SOTA capability from a multimodal cloud-available model and something that you can get to run on an edge device, which maybe here we could define as a Jetson Orin level of compute, or maybe even an iPhone, where it's opaque the exact GPU comparison you would make. So those things make vision feel unsolved. But I still think that what's going to continue to happen is the expanding nature of that bell curve. And if you think about that mental model for where we're going with visual capabilities, then I think that's a good way to think about where the field's headed and, frankly, the problems to solve.
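(To make the encoding-size point above concrete, here is a quick back-of-envelope sketch; the word count and frame dimensions are illustrative assumptions, not figures from the conversation.)

```python
# Rough comparison: bytes to store a heavy day of reading vs. one uncompressed camera frame.
words_read_per_day = 20_000            # assumed heavy-reader figure
avg_bytes_per_word = 6                 # roughly 5 characters plus a space in ASCII/UTF-8
text_bytes = words_read_per_day * avg_bytes_per_word

width, height, channels = 1920, 1080, 3  # one raw 1080p RGB frame, 1 byte (0-255) per channel
image_bytes = width * height * channels

print(f"day of text: {text_bytes / 1e6:.2f} MB")   # ~0.12 MB
print(f"one frame:   {image_bytes / 1e6:.2f} MB")  # ~6.22 MB, and video is many frames per second
```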
Nathan Labenz: So I want to work through that kind of Pareto frontier of performance and cost, latency, you know, and where-it-can-run trade-offs. But let's do one more double click on kind of the most expensive end of that curve, which is the cloud-available frontier models, I think you're describing them as. Obviously these things are like famously spiky. I would say it's been a while for me since I've had an image use case where I was like, oh, this thing like can't do it or can't see it. I do remember some of those not too long ago, for example, with the ARC-AGI puzzles. I remember trying... frontier models on some of those puzzles, getting strange results, and then kind of working my way back to, can you just describe what the starting state is? And then I was kind of like, oh, well, no wonder it can't do the problems. It can't see the starting state accurately. It can't accurately just define which boxes are colored what colors. But that's been a while. I guess I don't know. I'm sure you would know. If we just take ARC-AGI puzzles and put them in today, are they accurately perceived? And are there other things that would be good kind of representative examples of spikiness, where people might be surprised that, oh, I kind of wouldn't have guessed that a frontier LLM wouldn't be able to see this the right way? Also, do they work with few-shot prompting? Obviously, few-shot has been a huge unlock in general, but does it work for vision? I really don't know that. So I guess to sum that up, can we go one level deeper in terms of the capability profile of the frontier?
Joseph Nelson: Totally, yeah. We spent a bunch of time thinking about this and then trying to help people navigate what their expectations should be for the problem that they're solving, and where they may be able to have a zero-shot, multi-shot problem, or where you might be in a world where you need more representation of that problem before you can count on your model. So, the places where there are most frequently gaps. One thing that we maintain is, you'll see, just as you described with the ironing example, these vibes-based evals on Twitter. We've started rounding those up and throwing them on something called visioncheckup.com, which is, if you took your LLM to the optometrist, what would it do and not do? And so we continuously update that thing. You'll see things consistently ticking up on that, but they're not all at 100% in the types of tasks that people like to try, including ourselves. So what are the common types of failure and where are you most likely to be disappointed? One of them is in grounding in particular. Grounding referring to segmentation and detection, traditional tasks. But if you want to say, in your example, finding the starting position in ARC-AGI, or sometimes I'll try to do crosswords. And I'll be surprisingly disappointed at the model's ability to know where the word goes in the crossword. However, if I just treat it as just a text problem, and I say, hey, here's the clue, and here's some of the letters that I know are in the word, the model will almost always do better if it just thinks about it like text, if it doesn't also have to think about where in the puzzle that is. Or measurement. Basically, you can think about this subset of problems as things that are inherently very precise, where there's lots of precision involved. And some of this is, I think, the result of the post-training that's applied to these problems, and a lot of the labs, it's a little bit hearsay, but seems to be increasingly common knowledge, are not as interested in just solving the segmentation problem. They're interested in solving what was the user's intent, and whether segmentation is a tool call as a part of that intent. But even still, the segmentation portion of that chain of thought is pretty unsolved, because there's so many different things that you would want to measure, see, and have pixel-perfect representation of. You will see that when you take more time to reason, AKA you do more tool calling and you find more maybe specialized expert models for the scene you're looking at, you'll get better results. But in general, I would frame grounding as a still pretty difficult issue where there's a precise pixel-level need. The second place where I think there's disappointment on the Pareto that you describe of accuracy and speed is actually speed, still. I was using Gemini 3 the other day to try to automatically label a bunch of data for me, prompting it. And it would do it, but it would take 40 seconds each time. And interestingly, the non-deterministic nature of generative AI also led to some pretty difficult downstream results. Because for that example, I wanted really precise, consistent results, not necessarily what you think could be correct, or your best guess, or inconsistent results. And so that's another challenge, that you and I could go try the same problem and get different results from the same model at the same time of day.
We maintain this other property called Playground, at playground.roboflow.com, where you can do SAM3 versus Gemini versus Claude Opus. And what's really funny to me is that I'll find these failure cases and report them to our team, and then they don't reproduce. And it's actually not because of our use of the models. The model itself doesn't reproduce the same way. So that continues to be a little bit of a challenge after speed, I would say, the reproducibility challenge. And then, I mean, it's a little bit redundant, but we talked about just the representation. Like, you get into those long tails, and this becomes a function of the type of question you're asking, but there is still a lot of the world that's just not understood by models. And a close cousin to these problems is spatial reasoning, being able to articulate not just the pixelized segmentation, but where one thing is with respect to another thing. So those are common sorts of failure patterns we see.
Joseph Nelson: Now, you had a good follow-up, which was, how does few-shot help address these things versus zero-shot? And the answer is pretty good, but still not infallible. Like, it helps. And how much does it help, obviously, is an interesting question. So one way that we think about these problems is we introduced a benchmark at NeurIPS this last year called RF100-VL, Roboflow 100 Vision-Language. Basically, folks that use Roboflow for research will share their work in an effort to build upon others' work and bring the whole computer vision community forward. So there's a large set of open source datasets that folks can learn from and try to accelerate their own problem they're solving. We went and worked with users and researchers and the hundreds of thousands of folks that are sharing open source projects to move the whole community forward, and we created a basket of a hundred of them, of problems that seem to be representative in visual AI. And the domains that broke down into were problems like industrial, healthcare, flora, fauna, documents. There's a miscellaneous category, 'cause of course it's tough to put everything into a single bucket. And we evaluated Gemini and SAM3 and, OpenAI of course, a number of multimodal LLMs, and then also models like OWL-ViT, an open source model that supports few-shot prompting; OWLv2 is the most current one. And a model called Grounding DINO; the newest version of Grounding DINO is behind an API, but it's still more open in general. And basically, the net here is we evaluated them by saying, can you do successful segmentation the same way as if I passed a human annotator these same instructions? How does the model do, versus how would the person do, at finding all blank things in an image, based on those domains? And the best model at the time we published the work, which was Gemini 2, got 12.5%, out of 100%, across all domains. So the gap of how far these models have to go is large. And this data wasn't, again, it wasn't arbitrary data. What's really interesting is that this sample wasn't a perfectly, or imperfectly, curated research dataset like COCO or Objects 365 or these works that are very helpful contributions to the field. It was, these are the places that folks are using models. Now, a second thing we did. So the zero-shot performance, 12%, cool, interesting. The second thing we did is we ran a competition at CVPR on a 20-dataset subset, just for compute constraints, 'cause we thought we could get the point across. So RF20 instead of RF100. And we said, if you had a few shots, that is one, two, three, four, and five image examples, how much do you see the models improve and progress by comparison to one another? And the lift there, I think, maximally was around 10% for a single model. I'd have to check the average across all domains. Which is meaningful, especially when you're starting at 12%, but not a panacea, right? It's okay, great. Like, I'm helping ground the model with the domain that I'm looking at, but it doesn't solve all the problems. I will say that's a place that I'm bullish. Specifically, I'm bullish about few-shot for visual problems and providing prompts perhaps as image-text pairs, or even just as images with the task you're interested in, whether that's grounding or a description or a measurement or what have you. And so the story is clear, which is we need continuous, better representation of the real-world problems people are trying to do.
And we still have a bit of a ways to go as a community for it to be, quote unquote, totally solved, but progress is happening pretty fast. So that's the progression of maybe where things are.
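(As an illustration of the image-text few-shot prompting Joseph describes, here is a minimal sketch against a generic multimodal chat API, in this case the OpenAI Python SDK; the model name, image URLs, label schema, and prompt wording are placeholder assumptions rather than anything from RF100-VL itself.)

```python
# Few-shot visual prompting: show the model labeled example images before the query image,
# then ask for grounded output (boxes) in a parseable format.
from openai import OpenAI

client = OpenAI()

few_shot_examples = [
    ("https://example.com/defect_example_1.jpg", '[{"label": "scratch", "box": [0.12, 0.40, 0.21, 0.48]}]'),
    ("https://example.com/defect_example_2.jpg", '[{"label": "dent", "box": [0.55, 0.10, 0.63, 0.19]}]'),
]

content = [{"type": "text", "text": "You find surface defects. Boxes are [x1, y1, x2, y2], normalized 0-1."}]
for url, answer in few_shot_examples:
    content.append({"type": "image_url", "image_url": {"url": url}})
    content.append({"type": "text", "text": f"Defects in this image: {answer}"})

# The actual image we want analyzed.
content.append({"type": "image_url", "image_url": {"url": "https://example.com/query.jpg"}})
content.append({"type": "text", "text": "Defects in this image (JSON only):"})

response = client.chat.completions.create(
    model="gpt-4o",  # any multimodal chat model
    messages=[{"role": "user", "content": content}],
)
print(response.choices[0].message.content)
```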
Nathan Labenz: Okay, well, that's a good start at the top of the curve. Let's maybe work our way down the curve. I mean, there's obviously multiple reasons that one wants to go down the curve. You may even add to my list, but obviously faster response time is a huge one. Lower cost is another great one. Ability to run on the edge is another great one. Ability to not have to send your data over the wire is another great one. There may be even more beyond that that you would highlight, but take me from kind of, I can naively send images into one of a few Frontier APIs, maybe a few shot to like, and maybe you could take, I'm not sure which way makes more sense to organize it. You could kind of go from large models to like smallest, most like able to run on edge or maybe a different way. Maybe this, maybe they line up, but maybe there's a difference in how you would think about actually coaching people from starting with, well, why don't you just, you know, at least what I normally do when I have a new challenge is, well, let's just see what a Frontier model can do out of the box. Now, once I've calibrated myself there, I can approach optimization and fine tuning whatever, you know, in any number of ways. So I'd also be interested in, if it's not the same as the kind of the curve of trading off convenience for all these other goods of latency and cost, I'd also be interested to hear how you recommend people navigate the path from that kind of first naive baseline performance to where should they go next? What model should they go try to open source, try to fine tune? How much data are they going to need? What techniques should they use? If that doesn't work, then what do they do? Until they're finally in some happy place where they've got everything that they wanted.
Joseph Nelson: I appreciate the way you broke that down. You're like, there's the speed-accuracy curve, which folks know those things trade off: a bigger model takes more compute and is slower. And then there's these other dimensions that don't fit neatly on a graph, but might be really important to somebody. Like, you want to maybe own your AI. Maybe you want to be building your own IP as a business. Maybe it's important to you as an individual. Maybe there's a constraint of the business case itself, of the problem that you're solving, where, again, low latency is super, super critical. That could be also a privacy consideration, where, as you said, you want to keep the data local to your thing. There could also be security as a close cousin to that. So there's a number of things that kind of frame where someone is gonna fall along those axes. Now, how to navigate it, and what I think often matters, is it stems from the job to be done, if you will. So let's make that less generic and more real and something that people can actually think about. What I think about is many problems, especially in vision, require there to be an instantaneous response: you have something going down a line, you're watching a live sports broadcast, you need a decision right there, right then. And so of course that puts someone into a real-time category. And so usually there's the need to co-locate compute or run something on the edge, and you already know that you're in a class of models where, inherently, if you're gonna run it on the edge, you need to own it. You need to own the model. You need to have the weights. You need to put it into your environment. And that's where open source comes in, and why we invest a ton in open source for this reason, in terms of publishing our own models, in terms of supporting open source repositories. I'm very optimistic about the future of open source AI, both because I think it's an important way for everyone to realize the benefits from it, just as much as I think it helps discover, bottoms-up, all the ways this technology is going to be impactful. Now, a pattern that we see that's interesting for navigating these two, similar to what you described, is: is this problem tractable? Is this doable at all with the types of models and then maybe reasoning and nudges I would use for pre- and post-processing of the model? And then fitting it to where I want to run it. And so we see this rising trend of distillation. So for example, SAM3 is a promptable model where I can say, find all the sheep, find all the people, find all the hockey players in this image, and it'll do a pretty good job. It's not infallible. There might be perspectives you didn't include, but it's state-of-the-art for open vocabulary segmentation at a minimum. Let's take one of our customers, who does the instant replay clipping at Wimbledon and the US Open. They actually bring compute to the courts, because you have a live broadcast. You have sub-10 milliseconds to put something over the wire. And there is compute that could run SAM3, but in this case, it just wasn't economic to lug compute of that size out there. So basically, you're in the situation where you want a model that you can own and run at the edge, live over the wire, to, in their case, frame the instant replay that you want to put on broadcast networks, maybe what you see on ESPN+ or CBS or something.
Joseph Nelson: And the things they want to know aren't an open vocabulary list, right? So you don't need a model that can see everything; I could prompt SAM3 for poker chips, I could prompt it for deer. The odds that those things show up at Wimbledon are pretty limited, hopefully. And the odds that I need them for my replay model are even more limited. So what I could do is use video from a prior year, prompt SAM3 and say, hey, give me the people, give me the tennis ball, give me the court, give me the net, and have that then go and auto-label a dataset. Then I have a really high quality, curated, specific dataset, and I can train my own smaller model that I can run on the edge, a model like RF-DETR, which is the current state of the art for doing real-time segmentation and object detection. Not only does it run on the edge, but it's so efficient that I can run multiple streams on a single A100, in this case. And so I get the cost advantage. And then, of course, if something's cheaper, you open up more possibilities. And so that's a common pattern we'll see. Can I use a model, or does it fail? And if it does work, great, then can I make it be mine? Or can I fit it into a use case where I know I'm going to need to run on the edge? Now, the other thing is that even as models eat more of the overall task, it's still okay, of course, to put a model in a harness or do pre- and post-processing of the model to nudge it in the direction of what you would expect. There's no shame in still using traditional techniques for post-processing. For example, there's models where I could ask Gemini to say, count the tennis players on the court, and it would give me just the response of the count. But I couldn't just ask a detection transformer to count. I could say, give me the persons, and then it responds and there's two people found. And then I would add a tiny bit of logic, super fast code, that just counts the class outputs, right? And so there's no shame in continuing to use stitched-together post-processing logic for the purposes of optimizing speed or making something possible. It reminds me of the database wars; the most recent one was vector databases, but even before that, when you had SQL databases and NoSQL databases, where is it most useful to have document stores where you have unstructured data that references one another? Where is it most useful to have structured tables? And at some point in time, of course, you're gonna have to deal with sharding your databases if you have everything in those records, versus if you maybe had a NoSQL database that's gonna scale for you automatically. And there's trade-offs in both those worlds. Reminds me of that, where it's not a question of purely capability, it's a question of the constraints of the job to be done at hand. And again, running things on the edge in real time, or a model you own, or, we didn't talk about cost, but, man, streaming video to the cloud nonstop can be expensive if you have quite a few streams, versus maybe owned compute. So these are all things that drive reasons why you can use maybe max-ceiling intelligence and then apply it to a system that becomes one that you own and use. You've seen this trend in language and coding models too, of specialization and small models and expert models. In some ways, that's a place I think that language has drawn inspiration from vision, where there used to be a consensus that felt like it's one model to rule them all.
Now it's increasingly maybe flipped back to, actually, there's domain-specific models and optimization to be made. I think that vision is increasingly in the camp of you do want a domain-expertise model, because you might be compute-constrained in where you're going to run your system. So I don't know if that's the color you were looking for on navigating the Pareto and those considerations, but those are things that we at least commonly see when we see folks approach problems like this.
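(A minimal sketch of the "tiny bit of logic" post-processing Joseph mentions above: a fixed-class detector returns labeled boxes, and plain code turns them into the answer you actually need. The detection format here is a made-up placeholder, not a specific SDK's output.)

```python
# Count class outputs from a detector instead of asking a large model to count.
from collections import Counter

# Hypothetical detector output for one broadcast frame: (class_name, confidence, [x1, y1, x2, y2]).
detections = [
    ("person",      0.97, [120, 200, 180, 420]),
    ("person",      0.94, [640, 210, 700, 430]),
    ("tennis ball", 0.88, [400, 300, 410, 310]),
    ("net",         0.99, [  0, 350, 960, 380]),
]

CONFIDENCE_THRESHOLD = 0.5
counts = Counter(label for label, conf, _box in detections if conf >= CONFIDENCE_THRESHOLD)

print(counts["person"])       # -> 2 players in frame
print(counts["tennis ball"])  # -> 1
```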
Nathan Labenz: Yeah. That's great. Could you maybe give a little sketch of the scaling laws, so to speak? By which I mean, you know, this is a little dated at this point, but I used to have a talk where I kind of coach people through how to automate tasks, usually language tasks. And one of the big things was just that you kind of have to think in orders of magnitude of data, right? So if it's not working out of the box, then go get me like 10 good labeled examples. And you could probably put that into context and then it might work. And if that doesn't work, then you might need to think about 100 examples and you might have a small fine-tune on your hands. And if that doesn't work, maybe 1,000 will. But I wonder kind of what you see in terms of, I guess, first of all, what do people need in terms of reliability? What do they get out of the box? And then how many steps do they typically have to take up those orders of magnitude to actually get there? And maybe also, how does that relate to model size? Obviously bigger models in general kind of can do more, but especially if you're doing very narrow stuff, it seems like you probably can get everything squished into pretty small models. So yeah, what are the model sizes? And how much data do you need to kind of step up to where people are hitting the thresholds they need to actually deploy?
Joseph Nelson: The data question is one that, as you alluded to, gets informed by the business problem of, how many nines do you need before you're able to use a thing in production? For example, if you're building a system where your alternative is you have no eyes on the thing, then you're probably more accepting of a less accurate model. Like maybe, I don't know, you're trying to get a sense of attendance, or you're staffing your retail location. And absent vision, you have no idea how many folks are coming in day in, day out. Maybe you can go check your point of sale system as one source, but let's just admit you have to try to get to some source of truth for how many people are in the store at the same time, or not everyone maybe checks out, or there's a different way we want this information. Point is, in this case, we might not have any eyes on the problem. We might not know how many folks are going to be present in our store at a given point in time, or maybe in our museum is an even better example, because people don't explicitly check out. So it's, I just don't know. And so if you add a model that is counting and it's 80% accurate at counting, you might be like, great, ship it, put it in production. If we have a sense of, do we have a dozen people or 100 people at a given point in time, then I'm comfortable with that. That's totally valid. Whereas we have some other problems, where we have healthcare manufacturers that make critical life-saving products. You can think of the IV bags in hospitals. And for them, an escape defect, meaning a piece of particulate matter making its way into a product like that, is life-threatening, let alone the detrimental impacts it would have on the company's reputation and so forth. So with lives on the line, you need really high recall to find if there's any particulate matter, and you're probably comfortable with adding vision to augment whatever system you currently have, whether it's people or lab inspections or sampling methodology, and easing into where you can get more reliable vision systems. So ultimately, like a lot of things, it's a business question of what level of accuracy tolerance you accept. Now, in terms of the number of images or videos that it takes to get there, this becomes a function of how varied the scene of interest is. So on one end of the spectrum, you have self-driving cars that are in the big, wide-open, crazy world. I remember Karpathy's talk like five years ago at CVPR where he was like, finding a stop sign. How hard can it be? How hard can it be? It's the same red octagon everywhere, right? And he's like, wrong. And photo after photo, he shows: here's a stop sign that's blocked by a bush in a parking lot. Here's a stop sign that says only stop if you're going right at the intersection. Here's a stop sign that's on a school bus, a stop sign that appears for a temporary amount of time.
Joseph Nelson: Here's a stop sign that's on a gate, where when the gate is up, you can see the stop sign, but you don't need to stop. And you're like, man, something as simple and straightforward as a stop sign, there's tons and tons of edge cases to understand. And so that's navigating the world fully autonomously, and, of course, that shows up in how long it's taken for us to claim victory laps on self-driving cars. And now that they're here, it's almost like folks are surprisingly unexcited about it. Whereas you contrast that with something maybe like a manufacturing line, where you know the thing you want to make, and there might be a finite number of ways that thing is made wrong. Like maybe you produce batteries for an electric vehicle company. And the defects aren't always the same defects. So it's a tricky problem for traditional rules-based approaches, like look at this image, do some OpenCV, if there's a deviation, then flag it. And so machine learning is gonna be helpful, 'cause the defect could be a different length or it could present itself differently. But the amount of variation that you're gonna see in the scan of a battery, the cross-section scan of a battery, is much more finite compared to driving on the open road. And so for the orders of magnitude of data that you need: instead of talking about petabytes of video files (petabytes isn't even sufficient for a car), you can probably get away with hundreds, frankly, hundreds of images in the case of a controlled environment to be able to produce something of utility. The last part of what you mentioned is model size. And so, yeah, the intuition here holds. The smaller the model, probably the faster it is, and also perhaps the less recall and precision it's going to get. Maybe the way to think about this is the RF-DETR family of models, which is the current SOTA for doing real-time detection and segmentation, come in nano, small, medium, large, XL, 2XL. And at the 2XL size, if you do a fine-tune, it is more accurate than if you fine-tune SAM3, and 40x faster. Now, of course, if you're doing a fine-tune, you're inherently saying I want this fixed class list. So it's a different problem task type, right? It's not open vocab. It's, I know the things I want to see, and I want to know if those things are present or absent or how many of them there are. And then on the smaller side of the spectrum, you can get pico or nano models that are 180-plus frames per second on a Jetson Nano with four gigabytes of RAM in the current family. And again, based on the difficulty of the problem, if you're doing something simple like seeing oranges on a rack in a grocery store, compared to finding particulate matter in an IV bag on a manufacturing line, you can probably get away with a smaller model that still clears the floor of business utility while being more compute-efficient and delivering the results that you need. And actually, get away is probably the wrong framing; that's actually an optimization. That actually might be your most optimal strategy, because you're able to deploy at a higher scale with maybe less compute. So if you walk along the curve, the good news is I think intuition holds here. You would expect harder problem, more data, bigger model; all those things follow what your expectations probably would be.
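(For a sense of what the fine-tuning step looks like in practice, here is a rough sketch using Roboflow's open-source rfdetr package; the class name, keyword arguments, and dataset path below are recalled from memory of its README and are assumptions that may not match the current release, so treat this as the shape of the workflow rather than the exact API.)

```python
# Hedged sketch: fine-tune an RF-DETR variant on a fixed class list for an edge deployment.
# API details (RFDETRBase, train() keyword arguments) are assumptions; check the rfdetr docs.
from rfdetr import RFDETRBase

model = RFDETRBase()  # larger/smaller variants trade accuracy against speed
model.train(
    dataset_dir="battery_scans/",  # assumed COCO-format dataset: images plus annotation JSON
    epochs=20,
    batch_size=8,
    lr=1e-4,
)
# After training, benchmark the fine-tuned weights on the target device (e.g., a Jetson) and
# confirm recall clears the business floor before replacing whatever inspection process exists today.
```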
Nathan Labenz: You mentioned distillation from kind of foundational open source models as a way that people are bootstrapping their way into datasets and then obviously fine-tuning downstream of that. And you mentioned this RF-DETR model, where the RF stands for Roboflow, right? So maybe if I understand correctly, there's some, I don't know if it's on that model in particular or in other places, you've partnered with Meta. And I'm interested to hear the story. And I'm also interested to hear the lay of the land in terms of who is producing the open source models, why and how. In language, of course, there's been a lot of talk lately about Chinese companies distilling from Claude, et cetera, and Anthropic trying to shut them down, which I think there's obvious business reasons for. There's also questions around, what does that mean about the real strength that the Chinese companies have in terms of making their own models? Like, how much should we discount what they're able to produce? Does that explain why they're so spiky? And I kind of wonder to what degree this is also happening or understood to be happening, I'm not sure anybody really knows, on the vision side, because I would say aside from Meta, it sure seems like the public perception is that the Chinese companies are leading in both vision tasks and also, maybe not fully leading, but certainly in the open source category, leading in image and video generation. And so I'm wondering, is there some distillation going on there, that they're kind of taking a shortcut, or do they have just tremendously better chops in that area? If Meta were to have a change in strategy and not be so interested, and obviously they have had some changes in leadership, would the American side be kind of an empty bench? Are these projects so big that you as Roboflow maybe could still, you know, dig deep and fund them on your own? Or do you need a hyperscaler partner like Meta to really get to the scale that you need? I guess, geopolitically, strategically, and then you can dig into the partnership with Meta too: what should we expect in terms of, you know, are we just going to continue to get these great open source models like manna from heaven? Or is that maybe more precarious or more of a moment in time than people may appreciate?
Joseph Nelson: I think with open source AI, there are reasons to be concerned about its future relative to its past in terms of the number of open source models we're going to get. However, there's a lot of things that give me optimism, too. So in vision in particular, you bring up something that I think is under-discussed, which is that in visual AI in particular, the US has almost never led, whereas in language, we have consistently been ahead in closed models and open models alike. And there's a lot of reasons for that geopolitically, just as much as task emphasis and execution. But everything from the importance of manufacturing, and vision in manufacturing, and the importance of manufacturing in the Chinese economy, these are all trends that tell you why focusing on visual understanding as a domain is probably a high priority. But yeah, to name names of the folks that I think are in that mix: the Alibaba Qwen team has done phenomenal work, from Qwen-VL in the initial Qwen models onward. Qwen3-VL is world class and competitive with even closed models in its vision, language, reasoning, and scene understanding capability. The Qwen team has also had some recent leadership changes, or at least leading researcher changes, so that might be tenuous. The GLM team, we were talking earlier about their mixture of experts model, especially for OCR-specific tasks, but even just surpassing what's possible with closed models with their 9 billion parameter model, which is impressive work. And then there's the DeepSeek team, which, if you remember, published an OCR paper where the innovation was actually a data processing technique for LLMs, where they gave a model a screenshot of a page as a way to get more tokens versus just each individual word in the form of readable text. And the realization was that the degradation of understanding was much less than the compression that was gained. So it was basically a way to give more tokens to scale up training. In the US, we're not without folks that are doing incredible open source work. You mentioned Meta, who are the publishers of the Segment Anything family of models. And I'd say SAM3 is the best open vocabulary model globally, and Meta are the publishers of it. The Meta team goes all the way back to FAIR and Yann LeCun starting efforts with Detectron2, Faster R-CNN, introducing DETR, the DINO family of models. One thing that people dunk on Meta about is their lag in language models, and, in doing so, under-credit how good Meta has consistently been at visual AI in particular and at advancing computer vision consistently. And if you think about their business, this makes sense as well, of course, of making sense of photos and images that people share on social, just as much as the future of glasses and so forth. You also have Microsoft with the Phi family of models that are multimodal. The Allen Institute with Molmo, though if you want to talk about funding and turbulence, that's a topical thing that's taken place in the last little bit. On the diffusion side, you have Mistral doing some work, and Black Forest Labs out of Europe. And then I think the other one that is pretty exciting is NVIDIA. NVIDIA has put a ton of effort and investment into open source AI. I think they have the most open source model repositories now just by count, if that's your rough heuristic.
But the Nemotron family of models, Cosmos Reasoning: I was talking with one of their directors of open source recently about how much they're investing in making those models increasingly multimodal, and the Cosmos Reasoning team is doing great work to advance beyond just visual reasoning capability. So there is this geopolitical race, for sure, of wanting to have the best models possible. Now, you asked a question that is near and dear to me. Okay, so where does Roboflow fit into this, as you're not a foundation model company?
Joseph Nelson: You don't have a data center the size of Manhattan like Meta. You're not the publisher of GPUs like NVIDIA. So where do you all fit in this mix? This is actually something that gives me an intense amount of pride, actually. So we published RF-DETR, and RF-DETR retook state-of-the-art for the US in a very specific area of important tasks: real-time object detection and real-time instance segmentation. Before that, you had models like LW-DETR and the DEIM family of models, which are tougher to fine-tune, both of which are great work published out of labs in China. But RF-DETR, to tell you the story and how it ties these threads together: RF-DETR is the first real-time instance segmentation transformer, as well as the fastest and most accurate for doing pixel-wise segmentation and detection. The way that we made a bet, one that initially I wasn't sure would work, and have been delighted at how well it has worked, is we picked a narrow task and a small model that is useful on edge tasks, which, just as you and I have been discussing, I think is comparatively under-addressed. And so we basically had this novel area where we know people need things on the edge. We know people want models to be theirs. We know that open source AI is under attack. And we know that it's incredibly important to give people models that can run in environments that they might not always have. So the way that we did that is it marries these themes. We took a DINOv2 backbone, so a pre-training from the Meta family of models; they've since released DINOv3, but a DINOv2 backbone. And we noticed that there had been improvements from models in the transformer family on accuracy, but not speed, for detection-type tasks. Similarly, there were some models that weren't transformers that were faster, but not more accurate. And so we said, if we use a DINOv2 backbone and we use all the benefits of pre-training, and we use a shared-weights neural architecture search, NAS, strategy, can we intelligently search and find the most optimal speed-accuracy model from an Objects 365 pre-train that then works downstream on COCO and user fine-tuned tasks, and attach a segmentation head and a detection head? And at the time when we started these experiments, it was soon after we'd raised our Series B. So in total, we've raised about 63 million across all rounds, just to give you a sense of the size of resources we have available at our disposal. So not nothing, but also not all being spent on just this problem, of course, and it also pales in comparison, perhaps, to the billions able to be spent on foundation models. And through the training runs, we realized that this technique has promise. And so we invested further in it, and we introduced the first detection model last April, the segmentation model in the fall, and we continue to invest in making the developer experience really high quality there. And critically, something I'm super proud of: it's Apache 2.0. There's been the YOLO family of models, which we support and folks can use, but those are actually now not commercially permissible without a commercial license, which we're able to offer, which is awesome as a company, but I think there's places where people just want to build models where maybe they don't necessarily have commercial ambitions. And so it's world-class at what it does, and... Maybe to give a sneak preview, we already know some ways that it can extend to perhaps other task types and even have yet more accuracy.
The LW-DETR team in China has responded, but not beaten back some of our work. And so it's this cool kind of global arms race where your tiny friends at Roboflow are putting the US on the map in a pretty big way. Now, if Meta stopped publishing open source tomorrow, if NVIDIA stopped publishing open source tomorrow, just as I described to you, all of open source would take a hit, because a lot of improvements come from taking the best ideas, experimenting, running ablations, smashing them together, with smart minds, certainly smarter than me, thinking about how to solve these outstanding problems. So it's something that, I think, tells the story of what's going on in open source vision, and something we're proud of just as much as the problems that are yet to be done.
Nathan Labenz: Well, yeah, I'm interested to hear more, and I guess if I was going to do one double click there: we're obviously entering, according to many, and I'm among them, the era of recursive self-improvement broadly, you know, AIs doing the AI research. It sounds like you kind of dabbled in that a bit with this architecture search. And I wonder if there was anything, as you look back on that experience, that you found surprising. Did it feel like a brute-force grind, or are there any stories to tell about eureka moments coming out of that architecture search that felt somehow qualitatively different from a brute-force grind through architecture space?
Joseph Nelson: One thing that I think is really exciting is, I'll go deeper on this idea of weight sharing and neural architecture search. So a lot of the time you're doing very brute-force work: train a bunch of different models, compare the speed and accuracy of those models. And you're almost doing a grid search of different parameters that could help, an informed grid search, right? You're not going to do things that would be, you would think, naive. But it is fairly naive guess-and-check of train this model, see its speed, evaluate it, back and forth. And there's still, of course, a degree of that. And we published a paper, so the details here are open for anyone to dive into as well. What we did with weight sharing in neural architecture search is, rather than training a separate model for every accuracy-latency configuration, we use weight sharing in NAS to basically train thousands of subnetwork configurations in parallel with a single training run. And so at each train step, one subnetwork is sampled by randomly picking parameters like patch size, the number of decoders, the number of queries, the input resolution, the attention windowing. We use deformable attention in the model. And at inference time, you can actually sample any of those subnets. What that does is, it doesn't just mean that we've introduced maybe one model, we've actually introduced a framework by which we can repeatably produce open source models, as long as you can do NAS against the architecture. And a NAS training run isn't as efficient as a single training run. But it's also not 7,000 times less efficient, despite having the ability to compare all of those different configurations. That was a huge freaking unlock for us to be able to use our compute budget efficiently and release models like this. And so that's one huge part of the story. The other maybe notable unlock was rewriting deformable attention, which isn't supported in every inference engine. So we've had to rewrite some support for it, or wait, for example, for TensorRT in the NVIDIA ecosystem to support it. It now does. But that was a useful realization. And then I mentioned the DINOv2 backbone, which, now that DINOv3 is out, you can imagine what experiments we're running. Yeah, the weight sharing in NAS is something that's massive. And then, by the way, anyone can use NAS on your own dataset and almost create a one-of-one model for your problem. Because the way that NAS works is it's going to train and create an output, a Pareto frontier that you can then pick from of, where do I want to exist along the speed-accuracy trade-off? And so you can be like, yeah, I can be anywhere along that curve within my available compute budget. And you could obviously just max accuracy and give up some speed. And so when we saw that NAS worked on Objects365, we were interested in whether it worked on downstream tasks. And now we've actually rolled out the ability to run GPUs in the cloud that'll do hosted NAS on any given dataset. And to the theme of owning your own AI, if you NAS on your dataset, literally no other model architecture exists that mirrors your dataset. It's a one-of-one, which is like, I don't know, if I was smarter about crypto, there's some interesting crypto thing there of a one-of-one to give somebody, but that's outside my wheelhouse for sure.
But it is maybe the purest form of your model, 'cause literally no other model would've landed at those optimizations for the dataset that someone wanted to train on. So NAS was kind of the unsung hero that is a huge unlock for the efficiency gains we were able to see.
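To make the weight-sharing idea concrete, here is a minimal toy sketch of a supernet whose subnets share one set of weights, with a random configuration sampled at each training step. The architecture, the config fields, and the training loop are illustrative assumptions for intuition only, not Roboflow's actual RF-DETR NAS code.

```python
# A minimal sketch of weight-sharing NAS with a toy supernet.
# The config fields (here just depth) stand in for the knobs Joseph mentions
# (patch size, decoder count, queries, resolution, attention windowing).
import random
import torch
import torch.nn as nn

class SuperNet(nn.Module):
    """One set of shared weights; each subnet is a slice of it."""
    def __init__(self, max_layers=6, width=128):
        super().__init__()
        self.stem = nn.Linear(32, width)
        # All candidate layers live in one module list and share training.
        self.layers = nn.ModuleList(
            [nn.Sequential(nn.Linear(width, width), nn.ReLU()) for _ in range(max_layers)]
        )
        self.head = nn.Linear(width, 10)

    def forward(self, x, depth):
        x = self.stem(x)
        for layer in self.layers[:depth]:  # subnet = first `depth` layers
            x = layer(x)
        return self.head(x)

def sample_config(max_layers=6):
    # A real detector would also sample patch size, decoders, queries, etc.
    return {"depth": random.randint(2, max_layers)}

net = SuperNet()
opt = torch.optim.AdamW(net.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for step in range(100):                   # toy loop on random data
    x = torch.randn(16, 32)
    y = torch.randint(0, 10, (16,))
    cfg = sample_config()                 # one random subnet per step
    loss = loss_fn(net(x, **cfg), y)      # only that subnet's path runs...
    opt.zero_grad(); loss.backward(); opt.step()  # ...but weights are shared
```

After a run like this, the same shared weights can be evaluated under many configurations, and the surviving latency-accuracy points form the Pareto frontier Joseph describes picking from.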
Nathan Labenz: Yeah, that's really cool. I think one thing I noticed you have done, I don't know if you've done it exactly for this yet, but when people hear about this whole NAS thing, it sounds complicated, right? We're gonna have a Pareto frontier's worth of models. I would imagine, channeling myself, I guess I'll just speak for myself, that sounds both awesome and complicated. But I noticed that for at least some things you are following the trend that I'm seeing everywhere these days of, here's a skill that you can just give to your Claude Code and have it speedrun through the process of setting this up for you. So I'm interested in, like, how easy is it these days to get started? If I'm sitting on some esoteric problem and maybe some small amount of data, and I'm like, oh, this Roboflow guy sounds like he's got some pretty cool techniques, what's the path of least resistance to come out the other end of this tunnel, potentially not having done much work, and have my own one-of-one model that's got its own Pareto frontier of possibility and all that good stuff?
Joseph Nelson: For NAS specifically, I gave you the real stuff because I think the audience would want to dive deeper. And I'm a skeptical person myself, so I'm like, give me the real, real of what's going on under the hood. And that's why I mentioned the paper that's out there. But as someone that builds products, we also have created the easy button, where it's: run NAS on my dataset, and then boom, we spin up a bunch of subnets on GPUs, kick off the training job, show the results. And then what comes back for the user is, here's your Pareto curve, and it's one click to press where you want to be along that curve. That's for the human user. You mentioned the agent user, which is also an interesting place to spend some time. But the first thing I would note is that a core thesis of Roboflow, of why we build things the way we do and how we approach things, is that you want to be very interoperable and allow someone to progressively reveal complexity, but set good defaults. So you can almost think about the products that we build that wrap RF-DETR, or wrap our inference server, or use NAS. By all means, someone could set up their own infrastructure to do training and reimplement NAS. It's all out there. It's open. The thesis is that by making it easier and simpler, you actually inspire and engender trust, that the benchmarks can be reproduced, that folks know where things come from. And then ease of use as a guiding philosophy, that strong defaults can be set. So I can, awesome, have a model trained for me, or a dataset that helps get curated, or, on the inference side, which we haven't even spent a ton of time on but have made tons of investments into: if you're doing vision-specific inference, there are a lot of optimizations and assumptions you can make to most efficiently use the GPU for just the parts of the network that require it versus, for example, a resize, where you can run that on CPU. And all of that's open. Like inference: if you pip install inference, it's an open-source GitHub repository anyone can use, and we also provide that as good defaults in the service. And if we're worth our salt at all, then we should make products that are easy to use. That's for the human user. Now, you mentioned something that's really exciting in general, that's certainly bigger than any one company, which is agents becoming the biggest user. And what does that look like? And like a lot of companies, we're leaning into the idea that you expose CLIs, and maybe MCP, we haven't released a Roboflow-specific MCP yet, certainly we have lots of good CLIs, and there's this whole debate, is it MCP or CLI?
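As a concrete example of the "good defaults" path, here is a hedged sketch of running a pre-trained detector with the open-source `inference` package mentioned above. The specific model alias and the shape of the results object are assumptions to verify against the package's current documentation, not a guaranteed API.

```python
# A hedged sketch of local inference with Roboflow's open-source package
# (pip install inference). Model alias and result fields are illustrative.
from inference import get_model
import cv2

image = cv2.imread("intersection.jpg")   # any local image file
model = get_model("rfdetr-base")         # illustrative model alias
results = model.infer(image)             # run detection on the image
print(results)                           # inspect detections
```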
Joseph Nelson: At a minimum, there are CLIs so that all the actions that take place in the platform, Claude Code and Codex and so forth, can be taken on behalf of a user, so that you can say, go optimize a model for me, and make that easier for me. Now, something that we are investing in as well: you've seen a common trend of companies that build amazing infrastructure products. And here I'll give a lot of credit to Vercel, where they've done an awesome job of making a common set of products for the front-end cloud and increasingly the back-end cloud. And then they've layered on top of that now a v0 agent, right? Their product that you can chat with, and it'll build a website and choose good infrastructure for your problem. We've taken a lot of inspiration from that as a way to enable users similarly to chat with our workflows AI agent, where it's like, hey, I just wanna say, count people crossing the line, or watch cars in the intersection, or whatever it is. A lot of these problems, it's interesting: the hardest part is actually discerning what the user actually wants, what their intent is. Once you kind of have a sense of what the intent is, then models can intelligently be like, okay, you wanted to count. Nathan wanted to look at cars crossing the intersection, so is there a model that already knows cars? Probably SAM 3. Actually, car is a COCO class, so probably RF-DETR, and it'll be faster and more compute-efficient. Great, let's grab that, a pre-trained model that knows cars. Okay, you said crossing the line. Ask the user: the intersection has multiple places you could have meant by crossing. So which crossing did you mean? And increasingly you can be in this future where it's like, if Joseph were sitting down with you side by side and helping you construct your problem and pick the model and follow the architecture, can we expose that as an agent in a democratized way, for lack of a better term, so that anyone has access to folks who spend many hours a week thinking about these problems, being the guide and Sherpa for building a given pipeline? So in the scheme of things, it's ease of use with good defaults as a platform principle, with complexity revealed progressively. Secondly, it's agents using CLIs to get basically the same ease of use. And the third is a first-party agent, which we haven't released yet, but maybe by the time this comes out, folks will discover it, to guide folks down that journey. So that's some of the ways that I like to think about building products that balance giving someone the satisfaction and awareness that it's built on good primitives, while still being able to create products that are easy to use and allow folks to get to value quickly without needing to know everything about every subnet of a training run, for example.
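A toy illustration of the intent-routing idea Joseph describes: if the object the user asks about is already a COCO class, a small pre-trained real-time detector is likely the cheaper choice; otherwise fall back to an open-vocabulary model. The class list and return strings are placeholders, not Roboflow's actual agent logic.

```python
# Toy routing logic: COCO class -> specialized detector, else open-vocabulary.
COCO_CLASSES = {"person", "car", "bicycle", "dog", "traffic light"}  # subset

def pick_model(target_class: str) -> str:
    if target_class.lower() in COCO_CLASSES:
        return "small pre-trained detector (e.g. RF-DETR), real-time and cheap"
    return "open-vocabulary model (e.g. SAM 3 with a text prompt)"

print(pick_model("car"))        # -> small pre-trained detector ...
print(pick_model("forklift"))   # -> open-vocabulary model ...
```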
Nathan Labenz: A very particular question that comes to mind that you might help me with, or maybe set my expectations on a little bit, is... So my company, Waymark, we make videos for small businesses, specifically really focused on TV-quality advertising. So your classic 30-second TV spot is really our bread and butter. Customers have asked us from time to time, hey, could you help us with display ads too? We partner with a lot of cable companies, media companies, so they are ultimately selling inventory, advertising inventory, to the small business. We're helping enable that with the creative solution. They ask us about these display ads as well. And now we're getting to the point where we're like, well, maybe we could add that. We can vibe code all kinds of stuff much faster than we used to, certainly. But one challenge that I used to have a lot, and I'm not sure what the state of it will be today as I'm kind of digging into this, is aesthetic evaluation. Way back in the day, there was really just one open source dataset and a couple of open source models trained on it that seemed to do a halfway decent job of aesthetics. And by halfway decent, I mean I could tell which was the top and which was the bottom end of the distribution, but in the middle, it was very unclear which way I was headed a lot of the time. And there was one company that had one too, I forget, oh gosh, what was the name of that? EveryPixel, I think it was maybe called, that had trained their own aesthetic model. These days, we typically go to foundation models for that, and we say, what's suitable? What would make the business proud? How would you advise us, basically, on these available images to use? And they work pretty well. Definitely slow, definitely more than we'd like to spend, in many cases, to grind through a huge library of images that a small business might have. Is there anything in the kind of small, open source world that would be able to tackle a problem like that? Or is that just still so esoteric that nobody's gotten around to building that foundation for me?
Joseph Nelson: Aesthetics is a tough one for the reasons you described. The types of problems that models can recursively improve against are the ones you can benchmark. And the second you can benchmark it, then you can scale a bunch of compute and the bitter lesson takes hold. Aesthetics is maybe a little bit eye-of-the-beholder of what's good and what's bad. Now, there are some places where, for example, even if you just take diffusion models, some people just like the way Midjourney looks more than they like the way ChatGPT looks, more than they like the way Gemini, or Nano Banana, I should say, looks when it creates examples. And maybe the model you're talking about is from the LAION team, which had the aesthetics predictor model that helped evaluate some of these things, because they also did some generative image work, so they released their, I think, aesthetics evaluator as well.
Nathan Labenz: That wasn't out when I was first really struggling with this problem, and I think the timing of it was such that we had already kind of moved to foundation models, but that was definitely the best purpose-built thing I think I've seen to this day.
Joseph Nelson: One thing that I'm sure you're aware of, that your audience might find useful as a way to reason about this: in the context of display ads, there are some services like Facebook, for example, where you're not allowed to have text be more than X percent of the display ad. They find that it just reduces the quality of the ad for the end user, whatever the reasons are. And so that's a great example of the distinction between, does this ad feel good taste-wise, versus rules-based: is too much of this image text? Then there's the automation of taste and aesthetics and preference. If I could give you a thought on how I might approach that problem, it's a great RLHF problem, where if you have a given client and you know their brand guidelines, you know their style, perhaps there's enough history of display ads they've run that you can get almost like a vibe check model that has been tuned on what they've done. And with foundation models, like you said, perhaps you can do a few-shot approach of, hey, these are the ways that this person commonly likes to do things, this is similar. Again, the big problem even with that approach is so much of marketing is being different. So if you're adhering to the brand guidelines, you might be stylistically following what you should have done, but you might be failing the top-order task, which is to stand out from the noise. So short answer, I don't have a great zero-shot aesthetics model for you beyond, I think, the things you're probably already doing. But longer answer, I think it's a great example of the conversation we've been having about what distinguishes a task where you can train your way, post-train your way, to victory with objective metrics, versus ones where it's a little loosey-goosey to benchmark and therefore lives outside the range of toss compute at it, get better results.
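One way to sketch the "vibe check model" Joseph suggests, assuming a client's history of approved and rejected ads, is to train a tiny scoring head on frozen CLIP embeddings, in the spirit of the LAION aesthetics predictor. The file paths, labels, and hyperparameters below are placeholders, not a production recipe.

```python
# Hedged sketch: score images with a small head on frozen CLIP embeddings,
# trained on a client's historical liked (1) vs. rejected (0) display ads.
import torch
import torch.nn as nn
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
clip.eval()

def embed(paths):
    images = [Image.open(p).convert("RGB") for p in paths]
    inputs = proc(images=images, return_tensors="pt")
    with torch.no_grad():
        feats = clip.get_image_features(**inputs)   # frozen backbone
    return nn.functional.normalize(feats, dim=-1)

# Placeholder training data: ads the client approved vs. rejected.
paths = ["ad_good_1.jpg", "ad_good_2.jpg", "ad_bad_1.jpg"]
labels = torch.tensor([1.0, 1.0, 0.0])

head = nn.Linear(clip.config.projection_dim, 1)     # only trainable part
opt = torch.optim.AdamW(head.parameters(), lr=1e-3)
feats = embed(paths)

for _ in range(200):
    loss = nn.functional.binary_cross_entropy_with_logits(
        head(feats).squeeze(-1), labels)
    opt.zero_grad(); loss.backward(); opt.step()

# Higher score = more "on brand" according to this client's own history.
print(torch.sigmoid(head(embed(["candidate_ad.jpg"]))).item())
```

As Joseph notes, this kind of model captures conformity to past preferences, which can cut against the goal of standing out, so it is best treated as a filter rather than a final judge.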
Nathan Labenz: Okay, so moving on in the time we have left, let's talk about frontiers: what's coming next in any number of different directions. There are, of course, new architectures that people tend to get excited about, myself included, things like Mamba and, you know, state space models more generally. Once upon a time, there was an explosion of vision use cases there. World models are obviously a big deal. I'm not really sure how to think about how they will relate to vision. We've got increasingly credible claims that people are going to start to scale up humanoid robots and put those into presumably factories first, but then, you know, businesses and homes not too far into the future either. What are the things that you are most excited about? What are the biggest questions that you have, where you're kind of like, if this works, it's going to be a game changer, but I'm not sure if it will? We're all about scanning horizons here. What are the horizons you are scanning?
Joseph Nelson: There are things that I think are continuations of trends that are working, and then some newer S-curves that we're starting to ride. Trends that we're continuing to ride: transformer everything. We talked about how the Vision Transformer was in 2020, and Attention Is All You Need was 2017. So you've seen diffusion transformers, DiTs, and vision transformers, ViTs, continue to eat more and more tasks and achieve SOTA accuracy. RF-DETR is exactly that recipe applied to real time. That trend is known and going to continue. Another trend that's maybe more nascent is self-supervision, especially in the DINO family of models. And DINOv3 kind of showed that you could have good latent understanding of things as a backbone without having large amounts of supervised, labeled data. And then you can use that image understanding downstream for a task, whether that's detection or segmentation or captioning or whatever. So that's...
Nathan Labenz: Can you tell me what the unsupervised trick is there? I always tell people the big unlock for language was that language itself is structured. And so if you just have a ton of language, like, predicting, given some text, what comes next? We've got lots for you to work with. Similarly with CLIP, right, it turns out there were billions of captioned images. What is the unsupervised unlock for DINO?
Joseph Nelson: Okay, so the DINO family of models, you have one, two, three, and they're all riding on this trend of self-supervision. The DINOv3 model, I think, was trained on billion-scale data. I'd have to check the exact stat, but I remember seeing that it was similar to the number of images on Roboflow Universe. And I was like, huh, there's something there. So billion-scale-plus images. And the observation is, if you start to have sufficient representation of given domains, then maybe intuitively, if you think about just a human without being told what things are, you start to develop intuition for where and how structure should exist in a given scene. And that is understanding. Like, if you know that the lamp is on top of the side table and often in a room that's a bedroom, then you have a set of reasoning, or excuse me, a set of understanding of a given scene. And you can use that understanding. Again, it's a backbone. So you can't just use it alone; you can attach a classification head to DINOv3, you can attach a segmentation head to DINOv3, but alone it's just a backbone, which has really rich latent understanding of scenes. And so that's the unlock. If you think about just looking at a bunch of scenes, you're going to start to develop your own, maybe, pattern matching, as a crude way to think about it. To be a bit more specific--
Nathan Labenz: Does it do that by some sort of masking type thing, though? Or how is it creating a prediction task for itself that nobody needed to label data for?
Joseph Nelson: In training, and there are papers, fortunately, that we can falsify and understand, they use self-supervision techniques, where you typically take a student-teacher setup: you have a bigger model that's the teacher, which validates the output of the student. And as you see the student continue to do well with predicting, whether patches or Gram anchoring, then you continue to scale up the student-teacher training recipe to larger amounts of data to understand more kinds of scenes. So that was part of the training recipe. The way that understanding happens isn't that dissimilar from the vision transformer itself, where, and it's actually crazy that this works, these models, and there are different approaches, take patches of the image. And this felt very unintuitive, it still feels a little unintuitive to me, but if you have patches of an image, you can almost understand the rest of the image even from individual patches, even if you treat those patches independently. So it reminds me, way back in the day I used to do language stuff, it reminds me of bag of words, where you have a document, you count the number of times each word occurs in the document, and you can start to get a sense of what that document is about. It's the same thing that's happening with understanding patches of a given image. Now, there are other techniques they'll use, and that's also why, by the way, earlier we were talking about the struggles of spatial reasoning. That's why. So you'll have increasing techniques that use cross attention and get a better understanding of where things are in a given image with respect to one another and the overall image. But the core unlock is: if you have a high number of images of various scenes, and you run verifiable, falsifiable tasks of fill in the blank, or what would you also expect to be here, or diffusion generation, and then you have a teacher that's able to validate the student's work, great, now you have the recipe for a self-supervised loop. You can plug in more data and scale up. And that's what they did. They didn't release the dataset, but it was, I need to check that, I think a billion-plus images in the DINOv3 pre-training. So it's actually really cool that that works, honestly, and that it's open and there's a good technical report for it.
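For intuition, here is a stripped-down sketch of the student-teacher self-distillation loop being described, in the spirit of the DINO family. The tiny encoder, fake augmentations, and hyperparameters are toy assumptions; the real DINOv2/DINOv3 recipes add ViT backbones, multi-crop augmentation, centering and Gram-anchoring tricks, and billion-scale data.

```python
# Toy student-teacher self-distillation: no labels anywhere; the "task" is
# agreeing with a slowly moving teacher across different views of an image.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 256), nn.ReLU(),
                        nn.Linear(256, 64))           # student
teacher = copy.deepcopy(encoder)                      # teacher starts as a copy
for p in teacher.parameters():
    p.requires_grad_(False)                           # teacher is never backpropped

opt = torch.optim.AdamW(encoder.parameters(), lr=1e-3)

def two_views(batch):
    # Stand-in for real augmentations (crops, color jitter, masking).
    return batch + 0.1 * torch.randn_like(batch), batch + 0.1 * torch.randn_like(batch)

for step in range(100):
    images = torch.randn(16, 3, 32, 32)               # unlabeled images
    v1, v2 = two_views(images)
    with torch.no_grad():
        target = F.softmax(teacher(v1) / 0.04, dim=-1)     # sharpened teacher output
    pred = F.log_softmax(encoder(v2) / 0.1, dim=-1)        # student output
    loss = -(target * pred).sum(dim=-1).mean()             # cross-entropy to teacher
    opt.zero_grad(); loss.backward(); opt.step()
    with torch.no_grad():                                  # EMA update of the teacher
        for tp, sp in zip(teacher.parameters(), encoder.parameters()):
            tp.mul_(0.99).add_(0.01 * sp)
```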
Nathan Labenz: Okay. Sorry to take you, um, down that rabbit hole. Let's pop back up to just more horizon scanning.
Joseph Nelson: Yeah. Yeah.
Nathan Labenz: World models, JEPA-type things, those are always hotly debated as to whether they're the inspired future that few can understand, or whether they're kind of beside the point. I still don't know where I come down on that myself. But feel free to opine on that or any other... really, I'm most interested in what horizons you think are the most important ones to be watching.
Joseph Nelson: So we were talking about ones where we're already riding the known S-curve, of transformers everywhere and of self-supervision and how patch embeddings create understanding. New S-curves that I'm excited we, society collectively, are starting to ride. One is world models. Within that category, there are various techniques, like you described: the V-JEPA technique, the World Labs techniques. And the idea of a world model, and different labs have different approaches, but the objective of a world model is, can we understand and reason about, I'm trying not to use the word world in the definition, but scenes and places and out-in-the-world existence, it's a well-named thing, with a new architecture. And if you think about that, okay, so what's new, what's different? There are different approaches, but you're inherently multimodal by default. So some models will think about this as next-scene prediction of video: some approaches predict the next scene. Some will think about it a bit more like diffusion of a single viewpoint. The so-what for world models that I'm interested in is: will world models give us true understanding, with physics, with spatial reasoning, with open-ended tasks, that we can just start to use? And the answer is probably, but the more interesting answer is on what time horizon. And there, I'm not sure. I think what's interesting right now is we're using world models, and I would argue maybe even Cosmos Reason is an example of a world model. And you can use Cosmos Reason for, at a minimum, it's boring, but synthetic data generation. And at a maximum, perhaps you can use it to actually reason about and navigate a given space. So world models is one category. I think about vision problems as read-write access. And a lot of what we think about at Roboflow, for what it's worth, is mostly the read access. World models are a form of blending read-write access. Robotics is an example of write access to the real world: you are modifying and manipulating the real world with a robot. Of course, that requires understanding; you have to have read access to have write access in a scene. World models are exciting because they're an example of potentially blending that. And potentially you get the understanding zero-shot or multi-shot. You could even argue that some of Sora 2 was the underpinnings of a world model to understand what's taking place. But that's one trend. To give an overview of another trend we haven't mentioned, and we can certainly go deeper into some of these too, there's VLAs, vision-language-action models, which are incredibly popular in robotics. A vision-language-action task is: you provide an instruction, and a robot typically is able to act on that instruction. Move my computer centimeters to the left, or something, might be an example of an instruction you provide, and then a VLA would be able to act on that. And in that world, you have a number of emergent, younger startups that are thinking about that. You have NVIDIA's GR00T project working on that, Google's RT-2,
Joseph Nelson: where they're working on that set of VLA problems. And I think maybe another way to think about VLAs is that it's a new task type, a new paradigm, and we should expect the same things you and I were just discussing around different model sizes, different levels of generalizability. VLAs in some ways will need to be edge-ready, 'cause if you're gonna run it on an embedded device and have embedded intelligence, then you're going to need, of course, the thing to be at the edge and run in real time. So I think VLAs are a trend that is emergent and exciting and perhaps still under-indexed on. And then this next one isn't unique to vision, it happens in language too, but it's worth describing: providing visual understanding with inference-time scaling and multimodal reasoning and reasoning in general. In a lot of ways, vision can be a tool call of a broader agentic system that wants to understand and describe how to do stuff in a scene. For me, a thing I'm using Gemini for all the time is as a replacement for instruction manuals: what does this button on the remote do? The pilot light in my water heater went out recently, and I'm like, okay, tell me about this specific model. That's somewhat a high-stakes task that one would want to proceed with caution on. I grew up as the son of a farmer, so if I weren't able to figure out how to do that, I would probably be excised from the will. Fortunately, me and my friends Gemini and ChatGPT, we were able to solve the problem. That's a perfect example where I'm using visual reasoning in the real world, but interacting with it through language. But with the post-training and reasoning, there's probably a tool call there to do some search, figure out the water heater model, figure out the instructions you're going to provide. And so all of that is in the category of, you have a big compute budget and you can do post-training and inference-time scaling to give better results. That's just going to continue. That's just getting going. So you can start to think about that as giving rise to visual agents you can set off to go do a task for you: organize my images for me, or figure out, like in your case, perhaps, one that's aesthetically able to do good categorization of things that are brand-aligned and not brand-aligned across categories of display ads that you want to run. And so that's going to give rise to what we can learn from coding agents. When you can let something run unencumbered over a long duration with a model, then we'll get similar benefits from long-running vision agents that can understand scenes and do things for us, with all the caveats of speed and latency included. So those are some of the trends, the hypey ones, that I'm thinking about and paying attention to. Again, what I try to do when I break these down is I break them down into my normal-distribution bell curve of the world and figure out what's the impact, what's the implication, and where can people use them? Can they make it their own? Where is it going to be most useful? Broadly, I think the recording time of this episode is well timed, because I think the vibes, the pendulum, is swinging back to vision, because you hear the rise of physical AI, of multimodality, the rise of hardware, and questions about what's defensible in a world of SaaS always being rewritten and code generation being simpler.
And so that's putting more and more people thinking about the real world and hardware and ultimately cameras and getting things into those environments. And to me, it's welcome, the water's warm, been here all along, infrastructure's hot, let it rip, and we're fortunate to be able to power a lot of that sort of stuff. But hype-wise, it's something that has me pretty excited about the amount of activity that's about to enter the space. So those are some trends and maybe themes to track that I'm looking after.
Nathan Labenz: How about wearables as one other one? That seems to bring a lot of these challenges together, right? Because if you're gonna have something on your face, it can't be too heavy, it can't get too hot, but it has to understand what's going on around you well, or it's more annoying than it's valuable, right?
Joseph Nelson: Totally. We started a partnership with Meta for the Segment Anything family of models and now more general visual understanding, such that, for example, when they launch SAM models, they're on Roboflow with day-one support, and now we're helping them understand where the model can be improved and not improved, and some work like that. For Christmas, my significant other got me the Oakley Metas. She was like, if you're doing this awesome Meta work, you gotta be dogfooding their stuff. And it's my first pair of wearables that are mine. I've used ones like Spectacles, and I'm always tinkering with stuff. And when the Apple Vision Pro came out, of course, I gave that a run. I am pleasantly surprised. Wearables are going to inflect. There were 8 million pairs sold last year. By point of comparison, 60 million AirPods. So a pretty good amount of volume moved. And the Oakley ones in particular are targeted at active activity. So I like to cycle, and you typically already have a pair of sunglasses on your face that are a bit bigger for cycling. I like to run, and they do bone conduction for music. They understand the scene. The AI onboard is not there; you have to have your phone with you, and they're offloading, presumably, some amount of the heavy lifting to the phone. You can say, hey Meta, and then get some of the feedback. But again, like many things in AI, it's the famous expression: this is the worst it'll ever be. And now it's useful enough to be in a form factor where this is a pair of glasses where I went on a ride Sunday with some friends, they didn't even know that they were glasses that had the ability to play music and capture media. And then they gave them a spin for the first time and were like, man, we need to get these for our next ride. And that really gave me the sense that this technology has arrived. But yeah, there are the constraints of running on the edge, constraining the amount of power draw it's gonna have. What's funny is, I'm a bit like Charlie Brown running up to the football and swinging and missing on AR. Roboflow actually started as building AR apps. Before we even had a company, in 2017, we made AR apps just for fun. And I was like, oh man, we've arrived. And how wrong was I on the timing of that? And then we came back at it in 2019; we made more AR apps. And I think the big unlock is the form factor, so that you don't have to have the glass brick in your hand. You can have a different thing. So I'm pretty excited about wearables. Snap also has their Spectacles. They were the first publicly traded company to mention Roboflow in an earnings statement. So there was a special place for me when we did an integration with their Snap Spectacles for developers to create custom lenses and scenes you wanna understand. Actually, we had someone count the number of stop signs on their walk, which was a funny one, 'cause they wanted, I think, to prove to their neighborhood that it was safe or something. And so there's something brewing there. But the big change certainly is that the hardware has gotten good enough, and the consumer willingness to adopt is showing up in the numbers. I hope, and I think, this will happen. I hope that ecosystem stays open, or becomes more open, I should say, so that anyone can publish apps.
I don't have any inside information here beyond what I shared, but I would bet the strategy is that right now it's closed APIs because you want to curate the experience and have a high-quality first user experience with the apps people can use. But I would bet that the strategy will be to open that up, App Store-like, or maybe even Android-like, where anyone could sideload. And so I'm excited for that future. But I think the hardware platform adoption precedes the software adoption, and we're just now starting the S-curve of the hardware adoption of wearables.
Nathan Labenz: So if you had to zoom out, and this is a big ask, but if you had to zoom out from all these various horizons that we've just been scanning and try to tell a story of how vision impacts life in general in the next few years, and I do think it's hard to predict anything more than a few years out at this point, how do you think life changes? Like, are we all going around with always-on cameras? Is that normalized? Do we all have a sort of 24/7 retrospective video of our lives, maybe subject to some times when we choose to pause it? Are there other unexpected changes to life that happen as these technologies get deployed that you think people are kind of sleeping on? I think AI is going to change everything, to a first approximation. But I'm interested to hear your take on what particular contributions vision is going to make to that, and how it will feel as we are actually living it.
Joseph Nelson: Man, I would love to paint the optimistic future for you of what vision unlocks for us, step by step through everyone's day. From the moment you wake up, to having food that's been produced with higher quality, with fewer pesticides, because you didn't have to spray all parts of the field, you only sprayed where you saw weeds. To, I don't know, maybe you had eggs for breakfast, and you want those eggs to have been visually assured to be safe, from the hens and all through the supply chain that made its way to your grocery store and your house. Maybe you grab your clothes out of the washer-dryer, which for some reason you still have to tell whites or colors, which is very obviously a silly vision problem there for the taking. To your fridge auto-stocking itself, because it saw you were low on eggs in the first place, and so you didn't even have to go; it called the Instacart MCP so you automatically had the food in the fridge. You take your self-driving car. There are zero accidents, because all the cars are communicating with one another, and it's faster than you've ever been, because you don't have to worry about the unpredictability of someone else's actions, with networked systems talking to each other at an intersection. You have Wi-Fi along the way, so you're able to spend more time with your family, because your workday had already started on the way to the office itself. You're in the office with your colleagues all across the globe, and you have pixel-perfect diffusion representations of them in the room next to you: remote work, same work, same place. It's just all the same of what it feels like to collaborate, at least digitally. There's gonna be something to human connection still, but at least the representation from Zoom has taken leaps and bounds, orders of magnitude better understanding of meeting with other people. To, I don't know, you get home that night and you watch Thursday Night Football, and the stats are real-time, and your fantasy team wins because you had the best algorithms to know who was gonna play and who was gonna score, and you had your vision agent running in the background to do that better and faster than your friends. You have a package that showed up at the right time, in fact, same day, because all the vision systems in the factory and inventory made sure the product was made right and checked in at the right places. There was a bot that delivered it to your door so that it wasn't strewn about, and your Ring camera made sure that there was no porch theft or whoever that would take it. All the way to the moment that you brush your teeth with, it's not silly, but an AI-enabled camera that's also doing cavity scans and making sure that everything's right in your mouth at the time when you head to bed. This future is not in theory. Every part of that chain is something that Roboflow customers are working on. Now, to give you a thing that is top of mind for folks, the always-on camera question and what society is going to feel comfortable with, I want to give you some of my direct thoughts there as well. I think, even early on in smartphone territory, it made people uncomfortable that people always had cameras, that they could capture moments without people being aware that photos of them could have been taken.
And frankly, even still, that's a real consideration in public spaces, capturing photos, not capturing photos. And over time, what society, which is ultimately the judge of this, is willing to accept comes down to: is the benefit and life-quality increase going to outweigh the cost of the new societal behavior? And I would take the bet that yes, because it will start with simple things. Think about my riding with my cycling glasses. They don't have a heads-up display yet, but I'd love to have turn-by-turn directions. And then pretty soon I'm used to having that small little display, and other folks are interested in that. And to build technology companies, you have to inherently be optimistic, because you're giving tools to people, and that means the tool will be used as a reflection of what you think about humanity.
Joseph Nelson: So if humanity is inherently good, then you're able to amplify those attributes. And I do think humanity is inherently good, even if there are bad actors. I think the same thing will be true for glasses and consent. And you can use prior technologies in pretty icky ways. The internet can be used to communicate with friends or support a small business online, just as much as, I don't know, share photos that shouldn't be shared. And the same thing could be true of the next generation of technologies. I have optimism that the benefits will continue to be things that folks will want to adopt. Though, the great news is, frankly, it's not up to me. It's a jury of our peers, what others deem to be the case of where it's gonna be useful and not useful. And then on the governance front, I also think it's important that we have systems and society and institutions that govern the use of these things in a way that reflects the preferences of people around us. There's a reason that privacy rights should continue to be strongly enforced and applied to the change of the times, right? The protection against unreasonable search and seizure was written well before the existence of cars and phones, and so what defines unrightful access and entry? We should have the same sort of means-tested laws applied to new technologies for what is private, not private, public spaces, private spaces. And again, I remain optimistic that the principles we hold dear, around having a right to privacy and a right to use things the way we want to, at least in the country in which I live, are going to be the way that future products will be used and governed. So that's how I think about it as a participant in the system, just as much as someone that enables this future. But the good news is, man, the world is going to get so much better. We have folks that are accelerating cancer research, cleaning up the world's oceans, removing pesticides from foods that we produce, ensuring that electric vehicles are produced correctly, ensuring that stuff shows up on time. I like to joke that Roboflow powers Santa Claus, and that future is happening now. And so I think that those are all things where it won't be without bad actors and its own set of sticky issues. Those will happen, and there will be that case, that front page, that story, and we as a society will need to respond and ensure the frameworks and rights that we hold dear continue to be in place, even as the tools that exist continue to be there. So that's how I think about that, and I think we have a responsibility to ensure that the future we want to live in is one that we help foster. In a lot of ways, I like to show examples of vision and our results, where the vast majority of things are all about improving quality of life, not about some of the bad implications or bad actors that folks might sometimes be concerned about. Those are some long-ranging thoughts, but a lot of folks ask me that question, building the company that we get to build, so hopefully that gives you some color on how we've thought about it.
Nathan Labenz: Yeah, that's great. That could be a good place to leave it. If I was gonna ask one more follow-up question, 'cause I sometimes can't help myself, it would be: do you think that there are technical solutions, or rules that we could define in terms of technology properties, that would really help? And here I'm kind of thinking, because we've had a lot of this discussion about very general purpose models versus very specific ones, and I am struck increasingly by this notion of safety through narrowness, basically. And so, from a bunch of different angles, I'm kind of wondering right now: is there a social contract to be had around AI that is like, we want to, and we need to, and we all stand to benefit tremendously from solving very particular problems, but we also put ourselves at risk, perhaps, if we use fully general models everywhere to try to solve all these relatively narrow problems. So in the vision context, one that I could imagine is if you want to watch a public space for moments of violence or whatever, you could, you know, run that through a general purpose model that can tell you anything. And, you know, I think in some places we're identifying individuals by their gaits and their facial structure and whatnot. But an alternative would be, let's have a very narrow violence-detector model that doesn't really do much except sound an alarm when it has detected something that we want a higher-order response to. So I wonder if you have any thoughts on that. One could argue, I guess, that maybe that sort of happens in some ways naturally, because cost and efficiency kind of pull things that direction. Somehow I don't feel that comfortable betting on that, and I kind of think we might need a little bit more of a social contract or some sort of idea of a new right. I'm always on the lookout for what new rights make sense in the AI world, and one of them might be to be classified by the smallest and most narrow purpose-built model possible for the task at hand, as opposed to being processed by some general purpose reasoner, you know, that could answer any and all questions about me. Anyway, I'd love to hear your thoughts on that.
Joseph Nelson: Yeah, I spent some years in DC. I was an intern in the Senate once upon a time, so thinking about some of the institutional questions that affect this stuff is something I've spent some time on. My general thoughts, as someone who firmly believes in the importance and value of open source AI for freedom of use and discovery and use-case proliferation, all stem from this idea of giving people the right to tinker, if you will, and to use models where folks want to use them. And what does AI change in terms of which societal rights need protecting? Where I come down is, I think the outcomes that we want to have in society should continue to be enforced, and AI is a tool by which those outcomes can be realized or not realized. In other words, to be really specific, we have regulations that prevent fraud. We have regulations that prevent forms of violence, the actual outcome by which something happens. A scammer could use an LLM to make it really easy to impersonate someone else, and they should be prosecuted for committing a scam. They likely shouldn't be prosecuted for the size of model that they use, like the model they were using was too big or too small for a given example or task. So the idea of minimal, it's almost like minimally invasive use of minimal model size, I think where that gets into trouble is that the capabilities advance quickly enough, or you have distillation, such that you end up with accidental corner cases where you cast too broad a net, which might stymie innovation and stymie adoption, when in fact the goal was well intended. I'll give you another great example. One could very reasonably steelman the idea that AI in healthcare has such far-reaching implications that there ought to be some form of, if you're going to use AI for patient health, then you ought to have a governance body approve or inspect or allow that type of use of AI with patient health. And someone could be like, that sounds like a very reasonable, well-intended position. And then I think about users at Roboflow, like this user at UNC Chapel Hill who was using AI in their lab to automatically count the number of neutrophils that respond to a given experiment. And here you just have a lab postdoc that's accelerating the rate at which they can experiment and doing a fairly menial, frustrating thing of counting. There are hundreds of colonies of neutrophils that appear under this experiment, and how the proteins react allows you to know if the experiment was good or bad and whether you should do another round of treatment. And all of a sudden, that person that's just using AI in a fairly harmless, in fact quite useful, way would never endeavor to do that, because it technically touches patient health. And so you've caught yourself in this accidental position where you've got something that's well intended, I don't want to harm patient health, or I don't want a model of a given size or a given use case, when in reality, probably the way to address that is: you should be liable if you use procedures or things that, and there's plenty of this already in the medical system, you should be held accountable to practicing medicine correctly in a way that ensures patient health is respected. So I guess, to be succinct, I wouldn't think that a narrow model size nails the way that you and I probably would want this technology to unfold.
I do have optimism that the types of regulations which inhibit misuse of any technology, or of any behavior, should be applied to AI, and that regulating at the tool level is ripe for accidental slowdown and deceleration of what I view to be the modern Industrial Revolution, one that's going to have consequential quality-of-life improvements in ways that we won't be able to fully forecast. And so one of the best ways to get there is to let it flourish and stamp out the places where people engage in bad action. So that's a fairly general answer. Of course, there are individual things to think about, but that's how I've thought about where the field is today and where it shows a lot of promise.
Nathan Labenz: Yeah, I think that makes a lot of sense as well. Another thing I kind of obsess about all the time is, how do we avoid the nuclear outcome, where we get all the weapons and don't get the energy? And I'm certainly not wanting to stumble my way into that sort of scenario. I think this has been great. Do you have anything else that I didn't ask you about that I should have, or anything else you want to leave people with before we break?
Joseph Nelson: I don't think so. I really enjoyed the conversation. I appreciate the opportunity to talk about this, hear about some of the ways you've thought about visual AI, and almost give a refresh from CLIP in 2021 to visual AI in 2026. And at the rate at which this stuff moves, we could have a very different conversation six months from now about the same set of topics. So it's been fun to riff with you.
Nathan Labenz: Likewise, looking forward to it. Joseph Nelson, CEO of Roboflow, thank you for being part of the Cognitive Revolution.
Joseph Nelson: Thanks for having me.