Training the AIs' Eyes: How Roboflow is Making the Real World Programmable, with CEO Joseph Nelson
Hello, and welcome back to the Cognitive Revolution!
Today my guest is Joseph Nelson, CEO of Roboflow, a computer vision platform that supports more than 1 million engineers and more than half of the Fortune 100 as they seek to turn proprietary image and video data into a competitive advantage.
We begin with an overview of computer vision capabilities today. Joseph notes that while language is fundamentally a human construct and inherently optimized to be understood, the real world contains a fat tail of chaotic scenes, which are not at all optimized for understanding – and thus, just as the Vision Transformer came about 3 years after the original Transformer, computer vision today is roughly where language capabilities were 3 years ago, at the introduction of GPT-4.
Which is to say that while frontier models can do amazing things, and most problems can be solved if you're willing to put in the work to fine-tune and pay any inference cost, we have a ways to go before foundation models will be able to do it all.
To make this concrete, Roboflow maintains a site called visioncheckup.com, which highlights the spatial reasoning, precision measurement, and grounding failures that still plague even the best multimodal models today.
And importantly, even when frontier models can solve a particular task, you can't wait 40 seconds for a reply when you're powering instant replay at Wimbledon or monitoring for defects on a high-throughput manufacturing line, and so there's often still a lot of work left to do to get vision models running efficiently enough to meet production latency and edge deployment requirements.
This is where Roboflow comes in, and I was super interested to hear Joseph describe what it looks like to go from an open-source vision model to deploying your own task-specific model today.
He emphasizes the importance of establishing clear requirements upfront, as the performance thresholds that different customers need to hit on their respective use cases vary dramatically.
From there, the process often involves distilling frontier model capabilities into much smaller models, like Roboflow's own RF-DETR, which they derived from Meta's DINOv2 backbone using a really interesting training process called Neural Architecture Search. This in turn uses a weight-sharing technique to train thousands of network configurations at once, all within a single training run, and ultimately produces a set of models of varying sizes that collectively map out a performance Pareto frontier. Today, Roboflow has productized this approach, so that anyone can run it on their own dataset and come out the other end with an N-of-1 model optimized specifically for their problem.
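For listeners who want intuition for the weight-sharing idea, here is a minimal toy sketch (emphatically not Roboflow's actual training code, and a linear-regression stand-in for a neural network): a single shared weight vector is trained while a random sub-network "width" is sampled at each step, and evaluating every width afterward traces out a size-versus-error trade-off from one training run. All names and the setup are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dataset: a linear target with 8 informative features.
W_MAX = 8                                   # maximum sub-network "width"
X = rng.normal(size=(200, W_MAX))
w_true = rng.normal(size=W_MAX)
y = X @ w_true

# One shared weight vector: every sub-network of width k reuses w[:k].
w = np.zeros(W_MAX)

# Weight-sharing training loop: each step samples a random width and
# updates only the slice of shared weights that sub-network uses.
lr = 0.01
for step in range(3000):
    k = rng.integers(1, W_MAX + 1)          # sample a sub-network config
    pred = X[:, :k] @ w[:k]
    grad = X[:, :k].T @ (pred - y) / len(y)
    w[:k] -= lr * grad                      # update the shared slice

# Evaluate every width once: a size-vs-error trade-off from ONE run,
# rather than one full training run per candidate architecture.
def loss(k):
    return float(np.mean((X[:, :k] @ w[:k] - y) ** 2))

frontier = [(k, loss(k)) for k in range(1, W_MAX + 1)]
for k, mse in frontier:
    print(f"width={k}  mse={mse:.4f}")
```

The point of the sketch is the economics: because weights are shared, evaluating all eight candidate "architectures" costs one training run, after which you pick the point on the trade-off curve that meets your latency and accuracy requirements.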
From there, we cover a number of additional topics.
- Joseph explains how Chinese companies have consistently led in computer vision, how much the American open-source ecosystem currently depends on Meta, and why he's optimistic that NVIDIA will fill the gap if Meta's new AI leadership changes priorities.
- He also describes how coding agents are expanding the market for Roboflow's tools, how skills are emerging as a new go-to-market vector, and how Roboflow plans to use a first-party agent to guide users through the process of building computer vision pipelines.
- We also discuss the state of AIs' aesthetic taste, and why the inherent subjectivity of aesthetic preferences makes this such a hard problem.
- We hear about the emerging S-curves Joseph is watching, including world models, Vision-Language-Action models being developed in robotics, inference-time scaling for vision, and wearables now selling millions of units per year.
- We get his vision for how computer vision contributes to a good life as AI matures, which includes everything from precision agriculture and food safety to self-driving commutes and real-time sports analytics.
- And finally, he explains why he worries that overly-opinionated regulation could accidentally stifle all sorts of surprising but valuable use cases and recommends that policy-makers focus on outcomes instead of trying to regulate the tools people are using.
When it comes to computer vision, Joseph has quite literally seen it all. So whether you're looking to catch up on computer vision, like I was, or looking for a practical framework with which to approach a specific challenge, I think you'll find a lot of value in, and I hope you enjoy, my conversation with Joseph Nelson, CEO of Roboflow.
Watch now!
Thank you for being part of The Cognitive Revolution,
Nathan Labenz