The Data Factory: Inside the $100B Race for Post-Training Supremacy, with Labelbox CEO Manu Sharma



Manu Sharma, founder and CEO of Labelbox, explains how frontier AI training data has evolved far beyond simple labeling to sophisticated reinforcement learning environments where domain experts create "gyms" for models to develop complex skills. With every Western frontier lab now spending over a billion dollars annually on training data, the conversation traces the shift from supervised learning to reinforcement learning from verifiable rewards, particularly for coding, mathematical reasoning, and computer use. Sharma reveals how Labelbox operates as a vertically integrated data factory, conducting over 2,000 AI-powered expert interviews daily and paying top specialists more than $250,000 annually. The discussion provides essential insights into the red-hot training data market that's reshaping AI development following major deals like Meta's $15B investment in Scale AI.


Sponsors:
Oracle Cloud Infrastructure: Oracle Cloud Infrastructure (OCI) is the next-generation cloud that delivers better performance, faster speeds, and significantly lower costs, including up to 50% less for compute, 70% for storage, and 80% for networking. Run any workload, from infrastructure to AI, in a high-availability environment and try OCI for free with zero commitment at https://oracle.com/cognitive

The AGNTCY: The AGNTCY is an open-source collective dedicated to building the Internet of Agents, enabling AI agents to communicate and collaborate seamlessly across frameworks. Join a community of engineers focused on high-quality multi-agent software and support the initiative at https://agntcy.org

NetSuite by Oracle: NetSuite by Oracle is the AI-powered business management suite trusted by over 42,000 businesses, offering a unified platform for accounting, financial management, inventory, and HR. Gain total visibility and control to make quick decisions and automate everyday tasks—download the free ebook, Navigating Global Trade: Three Insights for Leaders, at https://netsuite.com/cognitive


PRODUCED BY:
https://aipodcast.ing

CHAPTERS:
(00:00) About the Episode
(03:23) Introduction and Industry Chaos
(04:25) AGI Race Components
(11:09) Post-Training Evolution (Part 1)
(11:15) Sponsors: Oracle Cloud Infrastructure | The AGNTCY
(13:15) Post-Training Evolution (Part 2)
(15:28) Compute Budget Shifts
(23:31) Human Data's Role
(25:35) Expert Data Importance
(31:52) Training Paradigm Shift (Part 1)
(31:57) Sponsor: NetSuite by Oracle
(33:21) Training Paradigm Shift (Part 2)
(34:41) Solution Evaluation Framework
(36:48) Long Context Challenges
(38:37) Testing Long Context
(42:17) Data Collection Evolution
(43:41) Fine-Tuning vs Context
(49:55) Context Engineering Dominance
(56:39) Popular Fine-Tuning Models
(57:43) Context Engineering Coaching
(01:03:29) Creative vs Automated
(01:06:32) Frontier vs Enterprise
(01:12:54) Enterprise Implementation Support
(01:15:29) Sovereign AI Strategy
(01:24:24) Computer Use Data
(01:28:08) Generalist Data Contributors
(01:29:05) AI Interviewing Lessons
(01:34:02) Industry Future Outlook
(01:38:30) AGI vs Superintelligence
(01:41:02) Closing Thoughts
(01:41:15) Outro

SOCIAL LINKS:
Website: https://www.cognitiverevolutio...
Twitter (Podcast): https://x.com/cogrev_podcast
Twitter (Nathan): https://x.com/labenz
LinkedIn: https://linkedin.com/in/nathan...
Youtube: https://youtube.com/@Cognitive...
Apple: https://podcasts.apple.com/de/...
Spotify: https://open.spotify.com/show/...


Full Transcript


Nathan Labenz (0:00) Hello, and welcome back to the Cognitive Revolution. Today, my guest is Manu Sharma, founder and CEO of Labelbox, a data factory that supplies frontier training data to all of the top Western AI labs and many enterprises that are pushing the performance frontier with task-specific fine-tuned models. This conversation really couldn't be more timely. In the wake of Meta's recent $15 billion deal with Scale AI, which, of course, saw Scale's former CEO Alex Wang join Meta to lead their superintelligence team, other frontier model developers have been left scrambling to secure their own training data pipelines, and the market overall is still in the process of realignment. These headline-making developments have really spotlighted just how critical super-high-quality training data is to today's frontier capabilities, and also how difficult and expensive it can be to create. Post-training budgets are growing rapidly as labs race to imbue their models with differentiated capabilities. And as Manu explains, every Western frontier lab is now spending over $1 billion annually on frontier training data. Such tremendous investment is required because modern data work has moved far beyond the simple labeling and preference-indication tasks that many are familiar with from years past. And so today, with Manu as our guide, we'll be tracing the evolution of post-training from supervised learning from human examples, to reinforcement learning from human feedback, to the reinforcement learning from verifiable rewards paradigm that's ascendant today, and unpacking how all of that has shifted data creation work away from tools that facilitated the collection of human preferences and reasoning, and toward environments, which Manu calls gyms, where models can develop new skills, starting with coding, mathematical reasoning, and computer use, through a process of trial, error, and highly automated feedback. The bottom line is that today, when you think about training data creation, you should envision not the data sweatshop of the past, but highly qualified and comfortable domain experts working to create sophisticated reinforcement learning environments and verifiers that teach frontier models to solve complex long-horizon tasks. A quick disclaimer before we begin: Labelbox will be sponsoring the show for at least the next month, and so this does qualify as a sponsored episode. Nevertheless, I can sincerely say that this is a conversation I'd have wanted to have anyway. Because whether you're tracking AI progress analytically or trying to achieve superhuman performance in your particular AI application, understanding the best practices for training data creation is essential. And besides, the scale and scope of Labelbox's operation is genuinely remarkable. Their most recent capital raise was $110 million in early 2022. And today, Labelbox operates as a vertically integrated data factory. They're scaling their expert network in part by conducting over 2,000 AI-powered interviews per day. And their most in-demand experts are already earning more than $250,000 per year on the platform. I haven't yet had the chance to sign up to earn some of that money myself, but I am looking forward to the AI interview experience. And assuming Manu's right that the system does recognize me as a qualified expert, I'll report back, subject, of course, to any required NDAs, on my experience as an AI trainer.
Now, without further ado, I hope you enjoy this deep dive into the ongoing evolution of the red-hot training data market with Manu Sharma, founder and CEO of Labelbox. Manu Sharma, welcome to the Cognitive Revolution.

Manu Sharma (3:29) Thank you.

Nathan Labenz (3:30) So I'm excited for this conversation. Your world is in chaos. The AI world is going through a bit of a reorganization or realignment right now. And when I say your world, I mean the world of data, data creation, human sources of data. You've been in this business with Labelbox for a number of years, and we'll have a chance to dig into all the different facets of that. But obviously, the news that has sent various forms of shockwaves through the industry in the last couple weeks is Zuck and Meta doing this weird deal with Scale, where they're getting Alex to come over and lead the superintelligence team, or something along those lines. Then, of course, we've got people leaving Scale, and it seems like, generally speaking, it's chaos. So what's your report from the front line? Is this driving a lot of opportunity for you? Are people confused? What are the takes you're hearing that you think are interesting? Just super interested in your gonzo report from the front to get started.

Manu Sharma (4:26) Yeah, so this is one of the most exciting times in the AI industry, just generally across the board. When you really look at the innovations and pace of progress, I believe we are experiencing the most innovation we've ever seen per day or per week. And it's an AGI race. It's a race among a number of companies and groups, and it's really interesting to see these big AI labs pushing out awesome capabilities of base models as well as products and product experiences. But then you have a number of teams that are just emerging with some new ideas and taking a bet on alternative techniques and so forth. One thing that's clear underneath all of that is that these teams need essentially three components to develop frontier AI systems. One of them, of course, is compute; we've been seeing just incredible investment in CapEx across the board. The second, obviously, is AI talent: the researchers who have all the know-how of crafting these neural networks and architectures, how to train them, what to train on, and so forth. And the third, equally important, piece is data. And we are very much in a regime where a lot of the emphasis is going toward post-training. The way these AI models are trained is that you have an initial first phase where you're essentially taking all of the data from the web and from the sources you can get access to, and you're training a base model that can understand the patterns of all the data generated by humans over, let's say, the last hundred years or so. And then after the base model is trained, there's this thing called post-training. Post-training is what really emphasizes certain aspects of the AI models that we interact with; it really turns these AI models into assistants. Within post-training, there have been a number of techniques over the last three or four years that have become emergent or very important. And to your question, it is really awesome to see a lot of limelight thrown at this industry, at how companies produce this post-training data. I think the news about Meta is essentially emphasizing that data is very important for building these AI models and AI capabilities. And our industry, generally speaking, has been going through a number of exciting step changes. When we started Labelbox in 2018, supervised learning was in full swing. Everybody was creating a lot of training data, essentially for supervised training. You had humans around the world who would tag images and videos or snippets of text, and they would essentially train these models to mimic that behavior. When transformers came out, we started to see the inflection toward unsupervised learning, where most of the learning happened on these large-scale datasets. However, as we emerged through that new paradigm, RLHF became a prominent way to actually make these models useful in everyday life. This also required yet new forms of data, very specialized datasets with expertise, because now these models have very strong base capabilities, and they need to learn the ways and the skills, if you will, to interact with knowledge workers. RLHF and SFT became the main thing, and now we are in yet another new paradigm of reinforcement learning.
Nearly every AI lab that I am aware of is really emphasizing creating datasets for reinforcement learning. Back in the day, we saw all these amazing innovations from DeepMind with AlphaGo and so forth, and those were really trained with reinforcement learning. It has reappeared now, but in a new flavor, a new form. At a high level, that's the trend we have seen, and all of these things require specialized datasets, and they need a lot of them. I think we are still in very early days of making these AI systems more and more capable of the everyday tasks that power the economy across the knowledge world. So that's what's going on at the macro level. And these AI players who are building these large flagship AI systems or models are very aware of how important it is to have access to ways of producing a whole bunch of datasets in all of these interesting specialized domains, domains that ultimately get into useful business areas where the AI can learn what professionals, let's say software engineers, are doing in their everyday tasks when they're building companies and software. Or if you take domains like healthcare, insurance, all these amazing industries that we have: how do you really learn from core workflows that produce datasets to create these models? It's generally a very interesting, exciting time, and it is very common to expect, in an industry like this, that M&As happen and certain teams make these decisions as they see fit for what the opportunities are. Labelbox has been behind the scenes providing these datasets to most of these AI labs, and the workloads have shifted a bit, sometimes quite quickly; sometimes it takes some time for teams to reorient to the new supply chains and so forth. So I think that's the macro context on what's going on, and I certainly expect even more changes in our industry over the next few years. There are new teams innovating on how people create datasets and how we can help these AI labs produce them. That's what's going on, but it is generally a very interesting and exciting time for people who are really keen on how these datasets are created and how these models are learning.

Nathan Labenz (11:10) Hey. We'll continue our interview in a moment after a word from our sponsors. In business, they say you can have better, cheaper, or faster, but you only get to pick two. But what if you could have all three at the same time? That's exactly what Cohere, Thomson Reuters, and Specialized Bikes have since they upgraded to the next generation of the cloud, Oracle Cloud Infrastructure. OCI is the blazing fast platform for your infrastructure, database, application development, and AI needs, where you can run any workload in a high availability, consistently high performance environment, and spend less than you would with other clouds. How is it faster? OCI's block storage gives you more operations per second. Cheaper? OCI costs up to 50% less for compute, 70% less for storage, and 80% less for networking. And better, in test after test, OCI customers report lower latency and higher bandwidth versus other clouds. It is the cloud built for AI and all of your biggest workloads. Right now, with zero commitment, try OCI for free. Head to oracle.com/cognitive. That's oracle.com/cognitive.

Nathan Labenz (12:25) Let's dig in a little bit more on the nature of post-training, the different techniques, and then maybe some of the subtle differences. So a year ago, maybe, the prevailing wisdom was that maybe 98, 99% of compute was going to pre-training. And then this post-training, while the data was obviously critical to make it work well, was a small thing that you could rapidly iterate on. It seems like that has changed, but it's unclear how much it's changed and whether everybody is changing in the same way. I'm always very confused by: are we experiencing convergence or divergence in the leading model providers' offerings? There's definitely striking convergence, but there's also maybe some divergence, and maybe more divergence to be expected in the future. And I think about this anecdote from the Claude 4 training where they had mistakenly left out, I'm sure you've seen this, that one harmful-system-prompt dataset, and then discovered behaviorally, as I understand it, that the model was following harmful system prompts. They're like, why was that? Ultimately they traced it back to: well, we had a typo in our config files, and this one dataset that was designed to teach the model not to follow harmful system prompts was omitted, and so that behavior was not learned as intended. And then they didn't go back and redo all of post-training with the right configuration. Instead, they patched it or made an on-the-fly adjustment later. So from that, I infer that post-training has now become a significant enough part of the overall compute that a company like Anthropic, as committed as they, I think, mostly credibly claim to be to getting this stuff right, didn't feel like they had the luxury of going back. So what can you tell us, without obviously sharing client secrets, but in a way abstracted from any one strategy: how much compute would you say is now going into post-training? Is it as simple as it used to be? I'm sure it's not: pretrain, then some SFT, then a little RLHF, then you're done. Or at least that was how the public, the AI public, understood it. I assume now there's more interleaving, more steps, and just more complication. So yeah, tell me more about what you think post-training has evolved into.

Manu Sharma (14:38) Yeah, so I'm confident that in all of the big leading labs, including many of the hyperscalers, they are actively researching and innovating across the whole front of end-to-end training recipes. I'm sure there are teams pursuing new ideas on how to make pre-training more efficient and things like that, and the right kinds of data mixtures and so forth. But you have to know that pre-training can take a long time. Let's say it takes a few months to actually train a very large model; in a release cycle, when these companies are shipping these products, that is a given. Okay, it's going to take X amount of time, six months, let's say, to really get something out the door. Of course, we are seeing that some of these teams are training gigantic models as base models, and we'll talk about it in a bit: these models are in many ways used as teacher models for the student models that actually get shipped. Now, I would say that rapidly growing budgets are going into post-training. The intuition I'll give here is that in many ways, these base models have very strong capabilities for understanding the patterns of data and knowledge across a wide range of modalities. However, what we are finding is that these models now have to produce useful work, in the form of, maybe you call it agents, in the form of long-horizon tasks. And to solve that problem, these research teams have to use techniques to teach that capability. There is sufficient knowledge and capability in the base model, but you then have to go teach these models how to do XYZ tasks. Just taking a very simple example, in software engineering, these models obviously are so awesome in terms of what they can do in coding, but there's so much still to be desired when it comes to professional software development at large scale: how can these models develop an entire suite of products with minimal supervision? I would say, I think we're starting to see that, but it's not there yet. To solve that problem, these research teams have to collect the right kinds of datasets and then do reinforcement learning on them. That is where a lot of focus is going across the board, for these teams to test new recipes and new techniques and see if their models are getting better at those capabilities. They generally have a team, or a number of teams, exploring that frontier: how do we get these models really good in all of these categories? I think the two obvious macro ones are coding and math, because of the fundamental way these RL systems work: in coding and math, you can verify, to a large extent, whether the model got the solution right or not. So let's take a few examples. In coding, let's say I come up with a PR: does the PR actually solve the stated objective, and do my tests pass in the code base, the prior tests and the new tests? In RL, essentially, the system will try millions of attempts to solve that problem. If it gets it right, it scores really well, and then the gradients change the weights in the model, and it keeps going to the next one. And similarly in math, there's a large number of problems that can be verified by a numerical answer. The very fast progress we saw in reasoning models and coding capabilities in the last, I would say, nine months or so is because of that.
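
To make the verifiable-reward mechanism Manu describes concrete, here is a minimal sketch of the kind of reward function an RL harness might use for coding: apply a model-generated patch, run the test suite, and return a binary score. The paths and tooling choices (git, pytest) are illustrative assumptions; actual lab harnesses are far more elaborate.

```python
import shutil
import subprocess
import tempfile

def coding_reward(repo_dir: str, candidate_patch: str) -> float:
    """Minimal verifiable reward for a coding task: apply a
    model-generated patch and return 1.0 only if the full test
    suite (prior tests plus new tests) passes."""
    workdir = tempfile.mkdtemp()
    try:
        shutil.copytree(repo_dir, workdir, dirs_exist_ok=True)
        with open(f"{workdir}/candidate.diff", "w") as f:
            f.write(candidate_patch)
        # The patch must at least apply cleanly.
        applied = subprocess.run(["git", "apply", "candidate.diff"], cwd=workdir)
        if applied.returncode != 0:
            return 0.0
        # Run the whole suite; any failure (or a hang) means zero reward.
        try:
            tests = subprocess.run(["pytest", "-q"], cwd=workdir, timeout=300)
        except subprocess.TimeoutExpired:
            return 0.0
        return 1.0 if tests.returncode == 0 else 0.0
    finally:
        shutil.rmtree(workdir, ignore_errors=True)
```

During RL, the trainer samples many candidate patches per problem, scores each with a function like this, and updates the weights toward the attempts that scored 1.0, which is exactly why progress came fastest in domains where such checks exist.
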
These teams are putting a lot of emphasis on just getting that right, and it is a very valuable thing to do, because coding could be one of the most lucrative forms of knowledge work in many ways, and some teams have a belief that if they solve that, they can actually accelerate their AI research and so forth. It certainly seems like Anthropic believes that; many others do. And then there are new innovations, new ways to train these systems with non-verifiable rewards, where, as in a lot of our day-to-day work, you don't really know what the right answer may be, and there may be multiple right answers. But how do you model the data so these RL systems can track their progress and whether they are getting it right? And if they create a candidate solution, how can they score that candidate solution effectively? We have some really promising techniques right now that are underpinning a lot of the latest capabilities you see across the models. Now, I do think that at a very high level, these models seem to be converging, in the sense that they interact and look and feel the same from the outside. However, there is this philosophical principle that to really critique a capability, you have to be at an expert level to understand the nuances of quality and then have a quality judgment: this model is really good at this thing but not that thing. There is convergence maybe at a very high level when you look at very generic capabilities, but there's actually a lot of divergence when you go into the details. I code quite a bit; it's awesome for me to be able to get back into coding now with AI systems. There are many times when I am working on a very tough problem in coding, and you're stuck, and you're iterating with these models, and you're able to develop an intuition. At least I get an intuition: I've hit the ceiling of this model; I'm on a trajectory where I'm not really going to get this model to solve this problem. I'm able to see that across a variety of models. I have an intuitive judgment like: Gemini could solve this problem, maybe, because it's very algorithmic and requires some math background to really solve. But maybe when it comes to really refactoring the software or architecting the distributed system, I might use Claude, because it has this amazing property where it asks a lot of questions, or in an internal state gets more knowledge from reading this file, that file, ultimately trying to make a better software decision. From a macro view, it looks like convergence, but when you actually go into the details, these models are getting optimized for very different goals, it seems. Now, I do believe that the ambition among these AI labs is to make generic systems that are just so great across the board. But in reality, what we are seeing is that companies are also starting to optimize experiences for where their user bases are. For example, in Anthropic's case, it clearly is that they're very focused, very intentional, on solving coding. When you look at some other models, maybe they're more optimized for consumer use cases and everyday tasks, and maybe some other labs are focused on making a general social assistant, an everyday assistant, and that would have different optimizations. That's what it feels like to me, where we are right now. Now, there is a very interesting question.
Ultimately, when these teams figure out how to make the best coding system, how to make the best researcher for scientific domains, are they able to bring it all together back into a single world model? I think that remains to be seen.

Nathan Labenz (22:40) Broadly, I guess I want to understand better: what is the role of human data today? Maybe one way to think about it would be, if we didn't have this human data, how would things be different? How would things fail? I'll just give you a little bit of a prompt. On the math side, it seems like we've seen from things like the DeepSeek R1-Zero paper that you can get a model to learn at least highly verifiable domains like math and coding pretty well even without the human traces that are typically used in supervised fine-tuning. But that comes with some very important downsides: it starts to do it in strange ways and just generally act very strangely, language switching being one example of odd behavior. And I think there's lots more where that came from if we detach ourselves from the human baseline and just spin the RL centrifuge more and more intensively. So I guess one value of human data is that it provides this anchor that gives us some hope for some sort of alignment by default, because we kind of know where the starting point is. Another interesting juxtaposition would be, I think GPT-4.5 versus o3-mini is the best contrast, where GPT-4.5 has really good trivia knowledge, blows o3-mini out of the water on the ability to answer random questions about the world, but o3-mini is way better when it comes to these sorts of reasoning challenges. So we can decouple reasoning, and behavior more broadly, from world knowledge. And then there's the difference again with supervised fine-tuning, where my general understanding has been that it's really important to record the trace, and you're training the model on the actual reasoning process as exemplified by the human. I've had a ton of value from that. But then in reinforcement learning, is that reasoning trace so important? So, yeah, break down the roles, flavors, and major drivers of impact of human data today, as opposed to if you were just doing the RL thing to the max.

Manu Sharma (24:45) So expert data continues to be extremely important for all of these capabilities that we are about to see and have seen in the past. In one way, it seems like we kind of underestimate how incredible human intelligence really is, and all the world and economy that we have built and all the things we do. This is extremely special, and in many ways we are trying to emulate it in AI systems. In many ways, these systems are superhuman, but on the other hand, they still cannot do many other everyday tasks yet. The way I think of it is that in domains like mathematics, we've taken that slice: okay, here's an interesting, awesome dataset about the truth of how logic and reasoning work in a purely scientific domain. We can train these systems to learn from that and adjust their critique and reasoning from that source. And with that, we've gotten these state-of-the-art systems today, and they're very incredible. Yet when you put them into agent-based products, I would say there's a baseline established. There's now a reasoning behavior that is basically a foundation for all these amazing capabilities we want to go build. To build these new capabilities, let's take the example of a sophisticated AI system. In the future, you'll be able to simply call any company. Let's say you want to change your Wi-Fi router or something like that. Previously, you would call a whole bunch of people, stay on the line for 40 minutes, and try to resolve some issue. We are very soon going to be interacting with purely AI systems. These AI systems will be agents, where there's a concierge who takes your call and drops you to the right specialized expert on how to solve that problem you have with the Wi-Fi router, and maybe the system has to interact with the entire database to perhaps provision a replacement, and things like that. It's a fairly sophisticated system: it's interacting with you by voice, but then it has these capabilities to read the databases and navigate a variety of software stacks inside a company to help you as a user. To really build that system reliably, there is just an incredible number of edge cases, which exist because we are now working with tens of millions of users and ways people could request these things. The base model is a starting point to develop these capabilities, but the architect of that sort of agent or solution would need a very good dataset that represents the whole distribution of use cases or interactions the system would have to go through. If you were to build that system today with a base off-the-shelf model, you're going to require a lot of orchestration on top of the base models. You would need a lot of software stack and agent frameworks and so forth, and you might still not get very reliable output. That's just one very narrow example. There's probably an infinite number of examples and use cases like this in our economy. All of these things are going to require datasets for these systems to be rock solid, superhuman, super reliable. And I don't see any other way to create these datasets. You can't really invent these datasets from algorithmic, synthetic approaches. You can bootstrap it; you can make the process easier with all of these synthetic techniques.
But you'll still need this quality judgment about how the authors or creators of the system want the AI to interact: how do they want the customer service to be, what personality does it need to have? These things are design parameters, and the people who build these systems have a choice in how the systems act on behalf of their company. All of this requires new forms of data, whether it's voice interactions, whether it's tool use, and things like that. Now, there was a period, maybe starting last year, when a number of research attempts were made to mimic reasoning traces. Like, hey, let's ask expert humans, all the academics at universities and so forth, to teach models how to think about whatever their scientific domain is, and break those problems down into reasoning steps and so forth. And I think what researchers are finding is that it's just really hard to ask humans to express themselves in step-by-step reasoning like that. In fact, there were some research papers going back to process reward models, where there was really an emphasis on teaching the model how to think step by step. I think they have largely moved on to essentially modeling the problem and a solution, and how you would grade the solution. That is the new paradigm. The reasoning is actually an emergent property of training these models with RL. I'm sure there are ongoing efforts to continue to make that much higher quality, and you might still need some sort of traces, or audits of traces, and so forth. But largely, where I see things going is in the framework I described: take any problem, then solve that problem, and express it in a form where you know the solution; and if you don't know the solution, you at least know how the AI system could be graded on how well it is performing the task. Everything else becomes an emergent property of achieving that goal.

Nathan Labenz (31:01) Hey. We'll continue our interview in a moment after a word from our sponsors. It is an interesting time for business. Tariff and trade policies are dynamic, supply chains squeezed, and cash flow tighter than ever. If your business can't adapt in real time, you are in a world of hurt. You need total visibility from global shipments to tariff impacts to real time cash flow, and that's NetSuite by Oracle, your AI powered business management suite trusted by over 42,000 businesses. NetSuite is the number one cloud ERP for many reasons. It brings accounting, financial management, inventory, and HR all together into one suite. That gives you one source of truth, giving you visibility and the control you need to make quick decisions. And with real time forecasting, you're peering into the future with actionable data. Plus with AI embedded throughout, you can automate a lot of those everyday tasks, letting your teams stay strategic. NetSuite helps you know what's stuck, what it's costing you, and how to pivot fast. Because in the AI era, there is nothing more important than speed of execution. It's one system, giving you full control and the ability to tame the chaos. That is NetSuite by Oracle. If your revenues are at least in the seven figures, download the free ebook, Navigating Global Trade, 3 Insights for Leaders at netsuite.com/cognitive. That's netsuite.com/cognitive.

Nathan Labenz (32:30) So the shift in both data and training, and I'll try to summarize it back to you, and tell me if I'm getting it right, is from a supervised fine-tuning mode, where the human is responsible for both the final answer and also the "how did you get there": spell out your reasoning, let's think step by step, etcetera. And then the model is learning to imitate that pattern of thinking, in hopes that learning the pattern of thinking will get you to the right answer at a high rate of reliability. That's the previous paradigm. The new paradigm, maybe because we've just found it really difficult to collect that data, maybe for some other reasons, is instead: now we're just focused on, give me a really good solution, and then let's develop a rubric for evaluating the AI solutions. And then we train with a reinforcement learning signal, presumably where the AI takes its shot, and then another grader model basically gets: here's what the AI just did, here's the gold-standard example, here's your rubric; score the new one, with the gold standard in mind, against this rubric. And then the reward signal given back is the sum of points on the rubric, or whatever. How'd I do on that? How would you elaborate on that?

Manu Sharma (33:51) That's right, actually. And in most domains, you would not know the right solution, but what you will know from the domain experts in that respective industry is what is good, great, or excellent. That quality judgment has to be somehow expressed in the data for the systems to learn. A lot of the work is going toward producing those kinds of datasets. You've had a number of speakers who are from these AI labs, and some of them are doing new, interesting work; you've talked in previous episodes about RL environments. The way I think about it is like a gym. You get the model to go into this gym and practice a whole bunch of tasks, practice a whole bunch of things, until it possesses a new skill. Think of any domain: you want to model it in some form of environment where these models can go and play and start honing that skill. The challenging part, actually, is how you craft RL environments that are representative of sometimes very generalized skills. One way to categorize things is that games could be much more generic planning gyms, but then, let's say, coding environments, or maybe other environments, are perhaps more specialized to some domains and certain kinds of use cases. So that's generally the way to go about it. Now, this also brings new sorts of challenges. To be able to build these systems and make this paradigm really work, you've got to have very reliable graders, auto-graders. These are specialized models that grade the AI system's progress in a way that is aligned with the experts, the domain experts. There's another workstream that goes into making sure that is rock solid. Once you get that, then you can really scale up these RL training environments in the post-training routine.
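
As a rough illustration of the rubric-plus-auto-grader loop just described, here is a minimal sketch in which `llm` is a placeholder for any chat-completion call, not a specific provider's API; production graders are specialized models that must themselves be validated against expert judgments.

```python
import json
from typing import Callable

def grade_with_rubric(
    llm: Callable[[str], str],  # placeholder for any completion API
    task: str,
    candidate: str,
    gold_solution: str,
    rubric: list[str],
) -> float:
    """Score a candidate answer against expert rubric criteria,
    using a gold-standard solution for calibration. The returned
    fraction in [0, 1] serves as the RL reward signal."""
    criteria = "\n".join(f"{i + 1}. {c}" for i, c in enumerate(rubric))
    prompt = (
        f"Task:\n{task}\n\n"
        f"Expert reference solution:\n{gold_solution}\n\n"
        f"Candidate solution:\n{candidate}\n\n"
        f"Rubric:\n{criteria}\n\n"
        'Judge each criterion. Reply as JSON: {"passes": [true, ...]}'
    )
    passes = json.loads(llm(prompt))["passes"]
    return sum(passes) / len(rubric)  # fraction of rubric points earned
```

This is what makes non-verifiable domains tractable: the expert's quality judgment is encoded once in the rubric and gold examples, then applied automatically at RL scale.
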

Nathan Labenz (35:59) One little aside, and maybe you'll have other examples too, but I've always wondered about super long context. That strikes me as one that would be just really hard. You mentioned it's hard to get people to record their thoughts. I've personally experienced that quite a bit, because I've advised a number of companies and coached people on: okay, you don't need that much, especially if you're fine-tuning for a narrow task. The dataset doesn't necessarily have to be super huge. Obviously, the broader the task and the more diversity of inputs, the bigger the dataset needs to be. But for a lot of these narrow, very purpose-built, well-defined workflows, you don't necessarily need that much data. And yet I've always had a hard time getting people to actually sit down and write out the traces. So I've lived that and can empathize with it. When it comes to long context, I haven't personally ever tried, but I've always wondered: if you're talking, in the Gemini case, right up to a million tokens, it seems very, very difficult to just drop a million tokens on someone and say, okay, you're responsible for the right answer. So I've always assumed that stuff was more programmatically done, needle in a haystack or 10 needles in a haystack or whatever, and then, I guess, just generalized. But maybe you have more insight. And especially now: Gemini not only can find a needle in a haystack, but it has really impressive command of, in my experience, 500,000-plus tokens. It seems to be not just finding the anomaly but really grokking what is going on over these long contexts that are obviously unseen, in code bases it has never had any access to. So, yeah, any insight into how that has come online, and what role, if any, the human data creation engine has played in that process?

Manu Sharma (37:48) Yeah, totally. So you're right. When it comes to extremely large context windows, how do you really verify or test, or in other words, create datasets through which these models can express how good they are at those long context windows? I think there are two broad camps that I've seen. One of them is, of course, that you can model needle-in-the-haystack programmatically: create a whole bunch of hashes and see if the model can actually find a particular hash in a large amount of noise. That camp of testing measures, perhaps at a very fundamental level, whether it can find the information in a vast ocean of data points. But then there's the other camp, which is when you test it on the real world. So we're actually creating a ton of datasets that test these long-context capabilities. Let's take the example of financial analysts. There's a very big industry where one of the jobs is to synthesize a vast amount of information, like SEC filings about a company and all of the information they may have released about their company performance. This can be 10 documents; it could also be audio transcripts of earnings calls and things like that. And the financial analyst at the top of their game would not simply answer a factual question like, hey, where is this thing in these 100 documents, which becomes in many ways a search problem. Rather, they would model the problem as: okay, given all of this, in different modalities, I want to accomplish a task. I want to analyze and create a model of the company so that I can forecast what the company's earning potential may be, and things like that. That is actually very representative of a real-world task; somebody may take days to do it. While the models are very good on many of these examples, largely there's more room for them to accomplish these tasks reliably across a variety of domains. So that is an example where you're really forcing the model to understand and reason across multiple modalities and pieces of information. What we call it internally is multi-hop capability, and there's some strong performance, but again, there's so much to be desired. So that's one of the ways these AI labs are testing long context and honing the reasoning and planning capabilities across large context windows. And if you throw in video and other streams of other modalities, those fundamentally take a lot of token space, because they have to be tokenized and so forth. In many ways, Gemini particularly is very good at, let's say, video understanding: in which frames did something happen, and so forth. But when it comes to real-world applications that try to mimic or plug into an actual industry domain expert's workflow, these teams still need to improve capabilities along those kinds of vectors or domains, if you will. There is a way to model these things and test these capabilities, and the devil is in the details: we can induce failures across all the frontier models on these kinds of tasks. And that data becomes very valuable for these AI labs to then use to hill climb.
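
The first camp Manu mentions, programmatic needle-in-a-haystack testing, is simple enough to sketch end to end; the filler text, sizes, and the model call at the bottom are all illustrative placeholders.

```python
import random
import string

def make_haystack(num_filler: int = 50_000, seed: int = 0):
    """Build a long-context retrieval test: filler sentences with one
    'needle' (a random hash) inserted at a random depth."""
    rng = random.Random(seed)
    needle_key = "".join(rng.choices(string.ascii_lowercase, k=8))
    needle_val = "".join(rng.choices(string.hexdigits, k=32))
    filler = [f"Log entry {i}: routine heartbeat, nothing to report."
              for i in range(num_filler)]
    position = rng.randrange(num_filler)
    filler.insert(position, f"The secret value for {needle_key} is {needle_val}.")
    context = "\n".join(filler)
    question = f"What is the secret value for {needle_key}?"
    return context, question, needle_val

context, question, answer = make_haystack()
# response = ask_model(context + "\n\n" + question)  # placeholder model call
# passed = answer in response
```

The second camp, realistic multi-hop tasks over filings and earnings-call transcripts, cannot be generated this cheaply, which is why expert-built long-context datasets remain the scarce ingredient.
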

Nathan Labenz (41:27) Could you describe maybe the different modalities of interaction, or of data collection? And again, I'll give you a little prompt from my own experience. This was actually summer 2022. I spent all summer doing fine-tuning of the text-davinci-002 series of models. That was never actually released; we had an early kind of preview of that product, but they never released it. What I found to work well for me, and there was no multimodality at the time, was, for any given task, do 10. At some point along the way I discovered the importance of the reasoning trace, but I didn't know that at first. So I would do 10, fine-tune on just those 10, then I would have 100 queued up. I would have the model do the next 100, and then I'd basically do a sort of rejection sampling or correction, and specifically emphasize the ones it got wrong, fix those, put those back into the dataset, do another 100, and basically try to get myself to some acceptable success rate through, depending on the task, two to N rounds of iteration. The qualitative modes were: initially, I'm totally on my own, I have to do the task; later, it's more about having the AI do the task while I review its outputs, find its flaws, and correct them. How has that evolved today?
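
Nathan's 2022 workflow amounts to an iterative bootstrapping loop over the training set; here is a sketch, with `fine_tune`, `generate`, and the human-review steps as hypothetical stand-ins for whatever provider and process you use.

```python
# Sketch of the iterative rejection-sampling workflow described above.
# `fine_tune` and `generate` are hypothetical stand-ins for a
# provider's fine-tuning and inference calls.

def build_dataset(seed_examples, task_queue, fine_tune, generate, rounds=3):
    dataset = list(seed_examples)          # start with ~10 hand-written examples
    model = fine_tune(dataset)
    for _ in range(rounds):
        batch, task_queue = task_queue[:100], task_queue[100:]
        for task in batch:
            output = generate(model, task)
            if is_acceptable(output):      # human review step
                dataset.append((task, output))
            else:
                fixed = correct(output)    # human fixes the failure
                dataset.append((task, fixed))  # emphasize corrected failures
        model = fine_tune(dataset)         # retrain on the grown dataset
    return model, dataset

def is_acceptable(output) -> bool:
    ...  # human judgment in the loop

def correct(output):
    ...  # human edits the model's attempt
```

The key property is that human effort shifts over rounds from authoring examples to reviewing and correcting model outputs, which is much cheaper per example.
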

Manu Sharma (42:48) In the context of fine-tuning these systems and models, you mean?

Nathan Labenz (42:56) Yeah. Well, actually, the next question I was going to ask is around segmenting the market: the hyperscaler frontier labs clearly have one set of needs, where they want to go out and get all the data they can from all the different experts and be amazing at everything. And then you mentioned financial services, and I also think of pharma and a bunch of different more specialist companies that I assume are more engaged in a fine-tuning exercise. So yeah, we could add another dimension to the breakdown there as well, if that is a fault line in how data is collected.

Manu Sharma (43:31) Yeah, so I think what you shared, that iterative creation of a small dataset, perhaps for eval, and an eval essentially is some sort of holdout set from the training data, is generally considered the best practice. It's really surprising how many teams and individuals across the industry get that wrong. Sometimes the hardest part in the enterprise space is to really think hard about what the actual workflows or things they want to automate or build are, and how to express that in some sort of dataset and eval. It is emerging as a craft in its own right to really understand that trace or trajectory and express it in datasets. Generally, the best teams, the successful outcomes we see, are where you're starting with something and iteratively adding to that evaluation set. And like anything else, there are known knowns and there are unknown unknowns. So on that journey, more often than not, we find that teams will uncover a completely new edge case that was never thought about, and they have to go back to the drawing board and say, what if we just do it this way or that way? Like anything else, iterative development is the cornerstone of success across the software industry; it's what makes software teams so agile and successful in all these endeavors. I think that is just as true in this new world, where rather than writing code, we are producing datasets and eval datasets for these AI systems to be evaluated and trained on. Fine-tuning is really interesting. I don't think that fine-tuning like we used to see before is going to last into the future. What I mean by that is, in the supervised learning paradigm, four or five years ago, generally speaking, most teams would take a base model and then specialize it. They would take the data they were collecting from their sensors or their companies, fine-tune the model, and make it superhuman on that use case. Now, in 2025, fine-tuning at best is helping you make the model efficient at whatever task you want to do. What I mean by that is, maybe it's very expensive to run a very large model at high frequency, millions of queries per hour, on a particular use case, and you can achieve really good state-of-the-art performance on it. Given that it's a fairly narrow task, you can produce the data from that large model and then distill it down to a much smaller specialized model. For those queries-per-minute, queries-per-hour operating parameters, there is a better, smaller model that is just going to be better across the board: it will achieve the same quality at a fraction of the cost. That, to me, seems like one obvious reason to do fine-tuning of models nowadays. Then there is probably a second bucket, where some problems truly are so unique that the base models don't have that capability, and you just have to get it; the only way to achieve that is through fine-tuning. But if you look at the last few years, we saw companies like Google release the Med-PaLM model, which was trained on healthcare-only data. My recollection, from something I was reading, is that the base Gemini 2.5 models outperform that hyper-specialized model that was trained simply on healthcare data. What does that say?
It is saying that these base models' reasoning capabilities, learned from vast amounts of other data, are actually useful for solving that domain problem. Maybe the idea many enterprises had, that they held very special datasets in their companies, petabytes of data, maybe that data actually is not that useful, because it is not going to help you improve the reasoning capabilities of the models; you need those broad capabilities for the models to learn from, so that reasoning behaviors emerge. But perhaps fine-tuning might help you achieve a particular goal more cost-effectively, per what I said earlier. Yeah, I think we don't see as much fine-tuning across the board among our customers. What we do see is a lot of context engineering. With fine-tuning, you're really training the weights of the model and so forth, but there are so many problems you can now solve by simply context engineering, and that is emerging as a very specialized craft in its own right. Prompt engineering emerged as people came up with all these ways to model a problem in a prompt, but I think the most effective systems we interact with today are actually very carefully context engineered: that includes not only the prompt but also all of the context you might give the model on that query through retrieval mechanisms and things like that. That is not an easy thing to do, but it is something that is tractable and becoming very effective.
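
The cost-efficiency case for fine-tuning that Manu describes is essentially distillation; here is a sketch, with `teacher_generate` and `fine_tune_student` as hypothetical stand-ins for a frontier-model API and any fine-tuning pipeline.

```python
from typing import Callable

def distill(
    task_inputs: list[str],
    teacher_generate: Callable[[str], str],        # large frontier model: slow, costly
    fine_tune_student: Callable[[list[dict]], object],  # any fine-tuning pipeline
):
    """Label a narrow task with the big model, then train a small
    model to imitate it on that distribution only."""
    pairs = [{"input": x, "target": teacher_generate(x)} for x in task_inputs]
    student = fine_tune_student(pairs)
    # The student can now serve high-frequency traffic at a fraction
    # of the teacher's cost, with comparable quality on this narrow
    # task (and only this task).
    return student
```

The design trade-off is exactly the one discussed above: the student inherits the teacher's behavior on the narrow distribution but freezes it, so it must be refreshed as base models improve.
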

Nathan Labenz (49:07) pass the Turing test on that coming back, You're saying across these, like, high value industries, finance, pharma, whatever, the base models are so good at reasoning that you are better off. I'd be interested to hear kind of the scale of this too. Like, if, you know, one of these you mentioned the financial analyst use case. We could collect a bunch of data. We could imagine fine tuning that. OpenAI offers that on their platform. Of course, you can do it open source. Notably and then I had this question in mind too, like, why doesn't Anthropic offer fine tuning in any way and anything similar to the OpenAI way and why doesn't Google? And it sounds like your suggestion is maybe and correct my numbers, but maybe you're gonna create 10,000 examples of what a great financial analysis looks like. And then at the time, you'll at runtime, you'll, like, choose the best 10 that are relevant to this. Maybe 10,000 is too much. You choose the best 10, choose the best 50, put those into context, and it's about what runtime examples you are using that is driving the end performance. But you can get better as good or perhaps even better results from that than actually fine tuning and changing model weights. Do I

Manu Sharma (50:25) have that generally right? Yes. Like a lot of the agents that we are seeing now in the industry does require you to have context, right? So let's just take an example of code. When you're interacting in cursor, one of the things that makes it so effective is that there is a way for base model to get all the context, like what are the coding files and what are the relevant snippets, what are the functions that are similar. So this is the one directory structure and all of that information is modeled and sent to the base model for a coder to accomplish the goal that they're working on. That example, that is a really good framework to think about in any other domain. So what is the task and how do you take all the right relevant information and provide that as a context for the model to accomplish that? And we're seeing it's very effective. And the technology wave is where the base models are actually going to get better and better, and you might not need as much maybe you won't need as much for both context, maybe the context efficiency becomes more improved over time. But generally, that is the arc of technology where these foundation models are racing to absorb the capabilities that other app players are trying to develop with all this context engineering and so forth. It may take some time for these based models to be just so great in everything. But meanwhile, if you want to accomplish a certain task and goal, you're really focused. Most effective way to do is to just context, to engineer that system. There are some cases that we fine tune models all the time ourselves, and we do that in scenarios where we have to optimize for cost and or where we have a very unique opinion on how we want the judgments of cases to be. I'll give you an example. In our industry, we work with millions of domain experts around the world. And these are canonical examples like physicists, mathematicians, Olympiad level, software engineers, and so forth, so across the 70 plus countries. One of the things we do is how do you assess these individuals and expertise? Would be maybe 10 years ago, if you were to approach that problem, one, teams would try to solve that by creating tests for, hey, here's a physics test, here's a math test, and try to perform all those things. But then you can see it's just not scalable. You have to go manually craft these things. What we do is we actually have an AI conversational system that powered by the frontier models, and when these experts interview with us, they're interviewing with an AI, and we know about their resume, we know about their research and so forth, and they're having a conversation for 30 minutes or so about whatever domain there is. Then there's an AI system that assesses how good the interview went, and you're essentially trying to make a judgment like, is this person going to be a great producer of data set for teaching the frontier model new capabilities? A base model will have its own properties or characteristics of that judgment, and we don't necessarily align with that. We think that that is not good enough, and so for us to improve that judgment based on all the things we see, the outcomes of these experts and the data sets, we have to close that loop. One of the ways we do that is by finding these models because that capability just simply don't exist, that quality judgment. That's hopefully some examples on how we're fine tuning is helping and use cases for it. 
By and large, most of the industry I see across the enterprise space has moved toward context engineering. We see some examples of fine-tuning here and there, but it's largely context engineering across the board. In many ways, that's really good, because the technology is enabling enterprises to intuitively think of the most valuable workflows they want to solve, express them as multi-step agentic trajectories along with the principles for solving them effectively, and build that agentic architecture or system. It might require retrieval; it might require multiple function and tool calls; and in many ways it is a software engineering task combined with domain task engineering. That's a lot more tractable for the vast number of enterprises that may not have the skill set to be machine learning experts and train models and get into the nitty-gritty of model training and so forth. That's, I think, one of the reasons why AI models and the industry have taken off so quickly: most of the world is now finding ways to implement it in a more frictionless way.
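
Here is a self-contained sketch of the runtime example selection Nathan and Manu are discussing: rank stored exemplars against the incoming query (with a toy bag-of-words cosine here; a real system would likely use embedding vectors) and pack the top k into the prompt instead of changing any weights.

```python
import math
from collections import Counter

def cosine(a: str, b: str) -> float:
    """Toy bag-of-words cosine similarity; a real system would
    likely use embedding vectors instead."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = (math.sqrt(sum(v * v for v in va.values()))
            * math.sqrt(sum(v * v for v in vb.values())))
    return dot / norm if norm else 0.0

def build_prompt(query: str, exemplars: list[str], k: int = 5) -> str:
    """Select the k most relevant stored examples at runtime and pack
    them into the context, instead of fine-tuning the weights."""
    ranked = sorted(exemplars, key=lambda ex: cosine(query, ex), reverse=True)
    shots = "\n\n---\n\n".join(ranked[:k])
    return (
        "Adopting the tone, style, and structure of the examples below, "
        "complete the new task.\n\n"
        f"Examples:\n{shots}\n\nNew task:\n{query}"
    )
```

The appeal, as Manu notes, is that this whole layer rides along for free as base models improve, whereas a fine-tuned checkpoint freezes today's behavior.
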

Nathan Labenz (55:48) So real quick on fine-tuning, since it may be a mode of attack, even in decline, on some of these quite highly specialized problems: what models do you see people mostly choosing to fine-tune today?

Manu Sharma (56:02) I would say generally three models: Qwen, Llama, Mistral. I think those are the three categories we see in the open-source world. I don't have much visibility into the private models, but it usually comes down to the nature of the business at the company. In highly specialized industries, maybe the preference is to use open-source models because of operational constraints: they are either in an air-gapped situation or they want to have things on their own servers and so forth. But if you go into different industries, let's say high-tech digital natives, companies like Airbnb and Pinterest and so forth, they might actually be using the state-of-the-art models from the cloud providers or OpenAI and so forth.

Nathan Labenz (56:52) So on context engineering then, maybe you can coach me on a task that I've been working on for a while. Every episode of the podcast, I write, and now Claude often does, a very admirable first draft of my intro essay, which I put at the top. And the basic approach there has been pretty much constant. I'd say since Claude 2, which wasn't very good at doing it, I would do the same thing: I would just take the transcript of the current podcast and then a bunch of examples of previous intro essays, which originally I was just writing totally freehand. And then I have a very simple prompt that's like: adopting the tone, style, voice, perspective, structure, etcetera, of the attached example essays, write a new intro essay for the attached podcast transcript. What I've always felt a little weird about, wondering if I could do better, is, notably, for all those examples, I don't have the source, because I don't have enough context to provide a bunch of transcripts and their resulting essays. So it's not like I have input and output pairs. Instead, I just have a bunch of outputs that I consider good and then the current input. So how would you advise me to think about refining my context engineering to get better first drafts? I do always still, for now, edit them. I've been waiting for the day when Claude just nails it, but how would you suggest I improve that approach?

Manu Sharma (58:31) Where do you think the system is currently lacking? Like, do you think it's not producing good quality, or do you expect it to do other things that it's not doing yet?

Nathan Labenz (58:42) That's a great question. I mean, it obviously writes very nicely. It has no trouble generally following my structure. I feel like it gets my cadence. I have this, like, persona, which is genuine, it's not an act, where I'm both very enthused about the AI upside and legitimately fearful of, you know, could this get out of control, etcetera. It usually does a pretty good job of picking up on that and finding that balance point. The things it doesn't do super well, and this is where I have this lingering idea that maybe if I was able to give it more of the inputs for those previous outputs it might help, but I don't have the context to do that: it doesn't seem to, in some cases, pull out what's actually most unique and interesting about the current conversation. It sort of gives me something where it's like, yeah, that's fine. But in terms of how I would really wanna tee this up and frame it for people: why is this relevant now? What is most interesting about this? How does this relate to broader understanding? It's that kind of stuff where it's often not quite doing it. Sometimes, often really, I get better results these days when I elaborate on the prompt as well. So in addition to the adopt-the-style-tone-voice-whatever instruction, I'll say: in this case, I wanna emphasize X, or, I thought what was most interesting about this was Y, try to bring that to the fore, or whatever. And it can follow those instructions reasonably well. I still don't typically feel like it quite hits the heart of the matter a lot of the time. Obviously, that's a pretty subtle thing, right? So it's amazing that they've come this far. I'm certainly not taking it for granted. But, yeah, that's maybe the best I can describe impromptu where it's not quite hitting for me.

Manu Sharma (1:00:23) It might be that you have to stay persistent with the context engineering. Like, hey, what are the examples where it didn't get it right, or where it was lacking, the gotchas and so forth? Maybe there's a better way to express those as rules and things like that. But in a use case like this, I would argue that it is a lot better for you to continue to context engineer and try different base models. Perhaps try different ways to acquire new information or knowledge with tools and so forth: get the transcript, get maybe the background of the people, and things like that. The reason is that every couple of weeks, you're going to see new models and new basic capabilities emerge, including, now that we are seeing assistants emerge, memory as a feature. Things that you liked or didn't like in a given instance can be captured automatically in the context, in a way. Maybe things are not great right now, but they will continue to get better as the base models improve. Versus, let's say, if you were to architecturally fine tune something right now, you're basically freezing an investment. You have to do it right now, and you have to keep redoing it as the base models get better and better. One thing for sure is that the base models are improving very fast across reasoning and these empathetic aspects of things. We even saw that with OpenAI's 4.5 model; it is remarkable in many aspects that are not necessarily reasoning. Maybe that is the element that is missing, that creativity and that sheer touch of humanness and so forth. And I think there is a macro question: do you really want to hand this off? This is a very creative domain. I think what makes this thing so unique is that you are acting as editor in chief in this particular workflow. Maybe you actually do want to be the editor in chief, and you want things set up correctly so the models help you do your best work on what makes you unique. In this case, maybe the outcome is not a fully automated situation. But there are cases, in completely different domains, where it's just mundane work and you would rather have it fully solved by something. In those scenarios, maybe fine tuning might be the better technique for solving that problem.
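For reference, here is a sketch of the outputs-only few-shot setup Nathan describes: past intro essays (outputs judged good, without their source transcripts) plus the current transcript, assembled into a single prompt. The function name, tag format, and placeholder strings are hypothetical illustrations, not a prescribed method.

```python
def build_intro_prompt(example_essays: list[str], transcript: str,
                       emphasis: str | None = None) -> str:
    # Outputs-only few-shot: we have good example essays but not the
    # transcripts that produced them, so only the outputs are shown.
    parts = [
        "Adopt the tone, style, voice, perspective, and structure of the "
        "example essays below, then write a new intro essay for the "
        "attached podcast transcript."
    ]
    for i, essay in enumerate(example_essays, 1):
        parts.append(f"<example_essay_{i}>\n{essay}\n</example_essay_{i}>")
    if emphasis:
        # The per-episode steering Nathan describes adding by hand.
        parts.append(f"In this case, emphasize: {emphasis}")
    parts.append(f"<transcript>\n{transcript}\n</transcript>")
    return "\n\n".join(parts)


prompt = build_intro_prompt(
    ["<past essay 1>", "<past essay 2>"],
    transcript="<current episode transcript>",
    emphasis="what is most unique about this conversation",
)
print(prompt.splitlines()[0])  # ready to send to any chat model API
```

Manu's advice amounts to iterating on exactly this assembly function (rules, examples, per-episode emphasis, extra background via tools) rather than freezing the behavior into fine-tuned weights.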

Nathan Labenz (1:02:38) Yeah. That's interesting. I am not looking forward to being replaced by AI, but I at least wanna know when it's happening. I do expect NotebookLM and similar to compete pretty effectively with me in the not too distant future, if not already. And in an asymmetric way: I do recognize that there's, hopefully, an appeal of a personality that you get to know over time and watch evolve. At the same time, NotebookLM can take any topic that you wanna have a podcast on at the drop of a hat, right? And I certainly can't match that. It'll be interesting to see which relative strengths and weaknesses win out. I often also think, and I have a friend who really emphasizes this point all the time, if I just read the Claude output, I'm not sure anybody would really notice all that much difference or think it's any worse. So I think, to some degree, I am really just being precious. You know? I have a sense of what is me, and what is me matters to me, and it matters because I'm reading it and I'm putting my name on it, and I want it to be genuine to me. But that's a bit of a different question from: what is quality? What serves the audience best? I can confidently say that when I edit Claude's output to make my own final thing, it is more true to me. I cannot confidently say, at this point, that it is truly better serving or better informing the audience. I haven't really run that test. I hope it is, but it's not entirely clear. That also might fall into the category of: do I really wanna know? So far, I haven't run that experiment, but it would be an interesting one, to literally just sometimes read the Claude essay and see if anybody comments that the thing was worse. I'm interested to just compare and contrast a little bit: frontier developers, what they're doing with data, and then everybody else, and you can subsegment that if you want. My general sense is, at the frontier, they want as much data as they can get, as many domains as they can get, the highest value, highest quality they can get. I've heard numbers of $300 an hour being paid out to physicists or biologists around the world to do this sort of stuff. When it comes to that everybody-else layer, which can still include obviously very sophisticated companies, they have a more narrow focus. I'm interested in what else is different. Do they, for example, primarily use their own team members to do the data work, or are they still interested in going outside? If you're a pharma company, my naive sense would be, I'd want my own people to do this stuff as opposed to having you go source talent around the world, but maybe that's wrong. And then I'm also interested in the scale. Do they have their own people do it? Do they still wanna go outside? How big of a data set do you need for something like that?

Manu Sharma (1:05:25) How do you know?

Nathan Labenz (1:05:26) Are there, like, rules of thumb or guidance that you can give people? How much does that kind of thing cost? In a way, this could be like your intro sales pitch or orientation for a new customer. But, yeah, I wanna get the lay of the land, both at that top tier, though I think I understand that better, and especially at that everybody-else tier.

Manu Sharma (1:05:43) Yeah. So in frontier AI labs, we work with almost everybody at this point, and there is, I would say, an insatiable appetite to get access to new, novel, specialized data sets in the regime of reinforcement learning, data that helps them teach models longer-horizon tasks, reliably, across the board, across a variety of domains, right? Not just mathematics or coding, but all the other things as well. So that is certainly the case, and it's actually getting more in demand over time. The budgets across the frontier labs on data are increasing. To give you a sense, each of the frontier labs is probably spending over $1,000,000,000 a year on data, and that is fundamentally increasing across the board. These frontier labs are trying all these different ways to produce these datasets. Some teams try to hire contractors directly, and so there's an emergence of these staffing agencies that we are seeing in the world; all they do is simply help labs hire domain experts very quickly, in bulk. Staffing agency models have become very exciting for investors in Silicon Valley nowadays, just because now they're actually doing things in the AI space. But more often than not, the AI labs do ultimately require or want really high quality data delivered to them very quickly. To produce the data, you have to operate a data factory. Labelbox is essentially a data factory. We are fully verticalized, we have a very vast network of domain experts, and we use awesome ways to screen and vet these experts. But that is just one part of the story. You have to actually go build tools and technology to then produce these data sets. Actually, today, the state of the art is that most of the data sets are essentially hybrid data sets, where we have to use all of these amazing techniques with AI systems and synthetic data approaches, infused with the human experts, to produce these novel data sets. To your question: we are actually going to release a study in the coming weeks. We looked at our network and at the earnings across the board. Guess what the yearly earnings are for a top-quartile or top AI trainer. What would you guess they're making in a year?

Nathan Labenz (1:08:19) Am I assuming they're working full time, 2,000 hours?

Manu Sharma (1:08:23) You can assume that.

Nathan Labenz (1:08:24) I mean, the $300 an hour that I quoted seems high, but $100 an hour doesn't seem high. So maybe I'll say if they're doing a full 2,000, that would be $200,000, and then I'll discount because maybe they can't get that many hours. I'll put them at $140,000.

Manu Sharma (1:08:42) Yeah. That's a good guess. So our top contributors are earning well north of $250,000 a year. There's a power law to it, and you see it asymptote around maybe $40,000-$50,000 a year for other domains and other countries and so forth, but the best, most highly specialized individuals are earning over $250,000 a year, and we actually expect that to increase as the AI frontier goes deeper into agents and business workflows and so forth. This is pretty amazing to see. Just five years ago, the average hourly rates for these data tasks were much lower than that; we were operating in a very different regime and domain at the time. Now you have these domain experts who are maybe working part time on these AI training tasks, and some of them are really changing their lifestyle to be able to do this basically full time and have the freedom to work whenever they want and so forth. So that is the state of play with human data at the frontier, expanding the frontier. These are extremely complex tasks; they're no longer tasks that a person would do for 5 minutes or so. These individuals are actually creating RL environments. These are endeavors to develop solvers and verifiers for these environments, or to play in these environments to produce an instance of what a good activity would look like and so forth. Now, we also have a very interesting, very big business in the enterprise space. We are one of the leaders in building the software tools and platform for producing training data. And so we have customers like some of the biggest pharmaceutical companies, or robotics and medical imaging companies and so forth. In those cases, they are developing models where they have to work with their own experts. Medical coding is a really great example. It's something I think will be solved with AI, and I'm excited to see companies doing that. In these scenarios, you have to have individuals who are really good experts in that domain. It is such a nuanced situation: the insurance codes, the medical codes, particularly in The United States, it's an industry of its own, and so you have to tap into those individuals from that industry or train new experts to do that task really well. I'm sure the companies that are starting to do really well and build these AI systems are finding all these leverage points. They understand the nuance of how it's done, and they're probably trying to train a larger workforce to actually go do that in a factory sense. In these scenarios, these teams have to develop the entire data factory themselves, if you will. They have to build the technology and the tools. Then they have to figure out who the experts are going to be and how they are going to work in the factory to produce these data sets. Labelbox operates a large scale data factory primarily for frontier labs, and for enterprises, what we do is offer them a technology platform to do it themselves. In those scenarios, they are using their own domain experts to run the data factory.

Nathan Labenz (1:12:05) Do you have to give them a sort of Palantir style forward deployed expert to help them develop the process? I assume that they need that help in most cases.

Manu Sharma (1:12:17) In some cases, yes. In some other cases, no. It's really fact-based; it really depends on the sophistication of the customer. If you're, let's say, a super high-tech digital AI application startup, perhaps you wanna have full control and architect and operate the system yourself, but in some other industries, you're getting more help from our company.

Nathan Labenz (1:12:40) That makes sense. One other kind of random market segment, and I don't know if this really exists or not, but I just did an episode not long ago on the concept of sovereign AI. There's a lot of different facets to what that might mean, but one big one is countries around the world with different languages, different cultures, different value systems. They might want to make an investment in whatever they can make an investment in. It's not always entirely clear what they should do to get models to be more fluent in their language, more familiar with their values, responding in more culturally sensitive and appropriate ways, and just having more local knowledge. Right? Probably all the dimensions at once, they'd like to improve. I wonder if there's any business there for you, and whether there is or isn't, a theory that I've had in the past is that if you are Brazil, for example, or maybe even India, although India is big enough that maybe they want their own national champion, maybe what you should do by default is just make an investment in data collection and then go give that data to the frontier developers and be like: hey, you can be better in Brazil. We've done the data collection work as the Brazilian government. Here's a big data dump. Please use it, and then you'll be better for us in Brazil. And that way they don't have to worry about, oh my god, what are we gonna do about data centers, and what are we gonna do about frontier researchers, and how can we afford the $100,000,000 to compete with what Zuckerberg is reportedly offering people? Although I'm a little skeptical of that number except in the most rare cases. But yeah, is there any sovereign AI demand coming your way? And if you were gonna advise the president of Brazil or Mexico, or for that matter India, what would you tell them they should do on the data front?

Manu Sharma (1:14:39) Yeah, so I believe there is tremendous opportunity in government for what would essentially look like a leapfrog, an overhaul of entire ways of doing things into a new, super efficient way. Different countries have different systems. In some countries, the government is providing certain services and owning the full stack of those services itself. In some countries, it's a more privatized model, where they're essentially creating the environment and creating a spec: hey, we want private companies to be able to do XYZ things. So in the scenarios where governments are in the business of offering services to their citizens, which I think is most governments really, in some shape or form, there is a lot of opportunity to rethink what an AI-native experience would look like. Across probably all dimensions, it would mean they could render those services as a much better experience at a fraction of the cost, and get out of the legacy systems. Finally, we have a technology that allows them to do that. Many of these governments are probably running on legacy Fortran and COBOL and things like that; I believe I was learning somewhere from the DOGE team that they uncovered a massive amount of things that just run on legacy systems and so forth. So there are tremendous opportunities to rethink experiences in a pure digital native, AI first manner. Think about India, where so many citizens could benefit from a basic understanding of: how do these things work? What amenities and services are available to me in all these different towns and villages and so forth? So there is that. Now, on the investments, I would argue that governments should be focusing on the outcomes, what experience they want, and then mapping back to how to achieve that in the most efficient way, whether that's letting private companies develop and implement those solutions, or, in some other areas, based on the particular use case, looking at how to re-architect or develop those services themselves. In many cases, it becomes a question of: how do you get those three components together, compute, data, and talent, to go build those capabilities? And you would certainly need all three of them to make that effective.

Nathan Labenz (1:17:12) So no shortcuts, in other words. Like, you don't think they could just come to you and say, hey, we wanna go deliver data on a silver platter to the frontier developers? Where does that break down in your mind?

Manu Sharma (1:17:28) There are obviously many industries and many players in our market who would say: no, let's go invest in the fundamentals and we'll figure out the use cases later, but you're going to have to get your data in the right place, and things like that. I am not so sure that actually helps these entities achieve the outcomes. Rather, let's walk backwards, because now, finally, we have a technology that can be wielded and molded to achieve the goal. Actually, the hardest part is coming up with the intention, the goal, and the outlines of what a great experience or service would look like, and funding that. Once you have that clarity, all the other things are actually fairly tractable: okay, we can help them produce the right data sets, we can help them build the end to end systems and evaluate those things and so forth. I think more often than not, projects fail because they didn't have that end goal in sight; they were operating in a world where they were acquiring a whole bunch of technologies and seeing what they could do, rather than walking backwards from the customer experience.

Nathan Labenz (1:18:41) Yeah. I think that's a great reminder and something I also always emphasize in my very ad hoc, you know, AI consulting for businesses: what problem are you trying to solve? Because there are so many amazing new technologies that you can experiment with and get lost in and have tons of fun with and spend a lot of money on. And yet if you don't have a clear idea of what it is you are trying to accomplish, it's probably not gonna go super well for you. Sometimes people are like, we need to build a platform first and then we can build an application. We need to build our AI platform. And I'm often like, I would do a spike. Let's get one thing working first. We'll learn a lot of lessons that way. We'll learn what data we do and don't have. There are all sorts of problems that we can identify in the course of one spike. Probably all of that gets thrown away, and then, you know, at some point, maybe you mature into a more robust platform. But if you try to build that without specific problems in mind, I think you do set yourself up for a lot of frustration. So I totally agree with that analysis. I do give the frontier developers a lot of credit, because I think they have taken a number of steps to close these gaps over time on their own. One really notable one was when GPT-4o dropped: the tokenization of non-Roman-alphabet languages, like, got dramatically better, which brought costs down, sped things up, and, you know, let a lot more fit in context windows. So they've definitely prioritized this to a degree. But there's still a way in which it feels like if you're in India or Brazil or wherever, you are a second-tier user, because things are so English- and American-centric in many ways. If you could just do a giant data dump, I imagine the frontier developers would take it, and it might lead to more parity between the American experience and the Brazilian or Indian experience of using the models off the shelf. But, yeah, who knows?

Manu Sharma (1:20:29) I don't know. It will be really interesting to see how it all emerges. Before Labelbox, I worked at Planet Labs. It's a company that is now public, and it operates about 400-plus satellites in low Earth orbit and scans the entire Earth every day. I was involved in developing capabilities where we would use computer vision to extract insights from this dataset. So, for example, the government of Brazil would be very interested in understanding the level of deforestation they are seeing every year, and there was simply no alternative way to really understand that on a daily or weekly or monthly basis. But as a company, we could develop that insight because we would scan the entire earth every day. We would basically get an image of all of Brazil every day at a certain resolution, apply these algorithms, and essentially feed that as an insight to the government to make the right policy decisions. That's an example where only certain companies can do that and share that insight. It wouldn't make sense for Brazil to operate its own satellites just to solve that problem. There are categories of capabilities or use cases where you have to get the private companies that are best at what they do, bring in those capabilities, and then work on integration with other systems to inform policy decisions or services given to citizens. And there are some cases where governments are very uniquely positioned to curate an intended experience for their citizens. What could be an example? Maybe the experience with tax could be an interesting one: hey, can the government offer a much more intuitive way to file taxes? Or, for countries where there is a centralized health service, can they make it super AI driven, where citizens can query and understand or get basic healthcare services conversationally, perhaps, and so forth? So it really becomes case by case: what are the capabilities that the government can invest in and completely rethink, and what are the areas where it will have to rely on private industry to offer these solutions? I do think we seem to be going in a direction where there are going to be these foundation models, developed privately perhaps, or maybe the open source versions, used as base models in the sovereign context. Governments are then going to either develop applications on top that are very uniquely for sovereign use cases, or in some cases fine tune or develop custom models on top, because a government has the scale and intent to produce or gather specialist data sets that companies might not be able to.

Nathan Labenz (1:23:34) Different topic. Why has nobody offered to pay me to install software on my computer to watch me use my computer all day? It seems like one of the obvious relative deficiencies right now of the frontier models. Like, OpenAI's Operator is getting decent, certainly. But you go watch the AI Village and you see a lot of stumbling around in various nooks and crannies of UIs. It just seems like this race for data acquisition has not extended to: let's just go watch people use their computers, record that, and then bring that kind of behavioral data into the fold. I've never heard an answer to this question that has satisfied me. So what's your take on why nobody's made me that offer?

Manu Sharma (1:24:24) It's coming. And I have very clear visibility into this because we power a lot of the computer use agents that the labs are building right now. We are seeing the first versions of truly useful computer use agents or computer use capabilities; it's really the first innings of it, where, perhaps with inspiration from the movie Her, these systems are becoming an AI companion. They can listen to you, they can observe what you're seeing every day, maybe from the camera on your phone, and if you're doing work on a computer, you could just turn that on and it will know everything you're seeing on the screen. I would say that these models are very poor in that capability right now. If you were to turn on, let's say, a live AI model that can understand a lot of these things on your screen and so forth, the understanding piece is really good, particularly in some areas like UI or text. But when you're really doing some interesting domain work, they often fail. And they will fail in very unintuitive ways. They will fail maybe three minutes into the session because the model did not understand, or did not keep the memory of, what was said or what the intended goals were in the first minute. So there are failures happening across the board, whether in understanding a screen at a point in time (for example, most models have very poor spatial understanding of shapes and geometry) or failures over the time horizon. As part of the rollout of these AI systems, I would not be surprised at all if the freemium model is that companies are able to use the sessions and the data sets to then train and improve the models. But again, there's a lot of work happening through companies like Labelbox, where we are producing these sessions and these data sets at very large scale across different languages and domain expertise. I guess in a way you can say the capabilities are being cooked. It's not rolled out yet, but as part of the product experience, companies are going to be able to take data from users and improve these systems and make them more reliable. I'm sure they will make choices on privacy, and some users will intentionally allow their sessions to be used by the companies to improve their models.
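For a sense of what "producing these sessions" might look like as training data, here is a toy sketch of a recorded computer-use trajectory: timestamped observations and actions, plus the user's stated goal, so long-horizon intent is preserved alongside each screen. The schema is purely illustrative, not Labelbox's actual format.

```python
from dataclasses import dataclass


@dataclass
class SessionEvent:
    t_ms: int            # milliseconds since session start
    screenshot_ref: str  # pointer to the captured frame
    action: str          # e.g. "click", "type", "scroll"
    target: str          # UI element or coordinates acted on


@dataclass
class ComputerUseSession:
    # Keeping the goal attached to every session is one way to train
    # against the failure Manu describes: losing the intent stated in
    # the first minute by minute three.
    goal: str
    events: list[SessionEvent]


session = ComputerUseSession(
    goal="File an expense report for the March trip",
    events=[
        SessionEvent(0, "frame_0001.png", "click", "Expenses tab"),
        SessionEvent(2150, "frame_0002.png", "type", "Amount: 412.50"),
    ],
)
print(f"{len(session.events)} events toward goal: {session.goal!r}")
```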

Nathan Labenz (1:27:00) Do you think I have any expertise that would allow me to be a data contributor via Labelbox? I wouldn't aim for 2,000 hours a year, but I wonder if I even qualify, because I'm kind of a highly generalist, jack-of-all-trades sort of person. Is that profile even useful anymore?

Manu Sharma (1:27:20) Absolutely. There's a lot of talk in human data where we take examples of doctors, lawyers, that kind of thing. I actually don't think that framing is all that helpful; it's helpful maybe for a new audience to understand that we're looking for domain expertise. But the way we think about it internally at Labelbox is that we are simply looking for individuals who have high agency and IQ across the board. That ultimately correlates with being able to do very generalized tasks, even in specific domains. You would be able to learn completely new tasks, maybe even coding, or research where you're able to do long horizon tasks far better than any of these models can right now. That means you can produce the training data, the signal, for these models to learn from. So for sure, absolutely, I think I'll let you in.

Nathan Labenz (1:28:15) Alright. I'm looking forward to experiencing the AI interview as well, because I've heard about a lot of those things getting at least experimented with. It sounds like you've scaled yours. And actually, while we're on that, any lessons from the AI interviewing at scale that you would highlight for people?

Manu Sharma (1:28:32) Yeah. I don't know about the others, but we believe we have one of the largest production AI interviewing deployments right now. We are conducting well north of 2,000 sessions or interviews a day; right now, probably 50 people are interviewing with our Zara AI interviewer. And there are a lot of lessons. First of all, we were very surprised by how much people loved it, how much they enjoy interacting with an AI about their experience and so forth. When we ask our contributors, hey, how did the AI interview go, they give it like a 4.6, 4.7 average satisfaction score across the board. That is because, A, they can do the interview on their own time, whenever they feel they're in a moment to have that conversation. B, it's incredibly patient, in the sense that you're able to just dive into a variety of topics and so forth. In many ways, if you, let's say, have research experience, it will ask you about a very nuanced thing in your research paper that a human recruiter may not even get to in a first 30-minute session with you. People just love sharing their experiences, especially if they can convey the greatness of their work at that fidelity, in a very condensed manner. So that's one. Then people are using it to practice their real world skills. We just rolled out practice sessions of our AI interviews, where people are simply using it as a way to do interview prep for other things. We rolled it out with the intention, hey, our goal is not to go into that category or market; our contributors asked for it: hey, this is so great, I want to be able to do practice runs for things in my real world jobs and so forth. And I also find really clever the ways people try to cheat the system. I've literally seen people just put up an iPhone with ChatGPT in advanced voice mode and have the AIs talk to each other, and things like that. We just see this human ingenuity in gaming the system, and I take it in a positive way: we have to build technology and systems to stay ahead of that. If we had a clueless system, that would mean we weren't doing our job; we have to be really good at assessment and at finding really solid, good players with the right intent. That is always a pursuit, because human ingenuity is just so vast, and you just get surprised by the ways people employ all these different tools and techniques.

Nathan Labenz (1:31:18) Yeah. Interesting. I'm looking forward to seeing what that experience is like and how helpful the AI cheating assistant is.

Manu Sharma (1:31:24) Well, another thing that really surprised me about the AI interviews is not only how good it is at going deeper into your context, your resume and experiences or papers and things like that, but how good it is at having conversations in all these different languages. It is actually able to assess how good you are, how natively fluent you are, let's say, in a particular language and so forth. We are able to assess all of those things: your different language skills, your domain expert skills. I was honestly very shocked to see how effective it is.

Nathan Labenz (1:31:58) Are you using OpenAI's voice models for that, or something else?

Manu Sharma (1:32:03) We use multiple providers behind the scenes, and then, as I told you earlier, part of our system is fine tuned capabilities, especially around assessing the conversation and scoring or grading these things. It's all a data engine that we have built. The more interviews we do and the more signals we get about how good these human experts are at producing data, the more we use those insights to improve the grading system, if that makes sense, and actually the entire format of the interviews and so forth.

Nathan Labenz (1:32:36) Cool. I'm looking forward to checking it out. Maybe going back to the topic we started with for a second: what is your expectation for the future of the industry? Is this a one-off where Zuck is just doing something crazy, or should we expect this to be, like, the first domino of what could be a number of frontier developer and data factory company pairings? What would you expect there? And I understand you've got a lot of skin in the game, so I'm not sure exactly how much you wanna speculate on that, but...

Manu Sharma (1:33:12) Right now, one way to think about it is that we are slicing aspects of human intelligence and emulating them in AI. The question then is how many slices are left in the vastness of human knowledge. I would argue that there are probably infinite slices out there, because we don't really understand how the human brain works yet, and so far we haven't seen systems that work like the human brain, though I think we're getting closer. We've hopped multiple paradigms since supervised learning; when I started my education, I used to train neural networks with three neurons and three layers in MATLAB and Simulink. Just in a decade or so, across all those paradigm shifts, time and time again we have come back to data being the critical, essential ingredient in making highly capable models. Now the frontier is that we are trying to teach these models very sophisticated knowledge work. And if you look at it, they are very capable in certain ways, but they're not so capable at all in actual knowledge work tasks yet. The question is, what will it take to do that? I think in the RL paradigm, we are now at least looking at how to create or emulate a variety of these different domain tasks, express them in RL environments or data sets, and have these models learn. So I think there's a long road ahead in terms of making these AI systems more capable, more integrated into our day to day lives, and more reliable on long horizon tasks, and all of this will require data, one way or the other. There are some really exciting things happening in synthetic data; I think that is always going to be a pursuit, because of just how much potential there could be. But you still have to have some human judgment to ground the synthetic data, if that makes sense, because these AI systems have to be actually useful and interact with humans at the end of the day. So we are in very exciting times, but I also believe data is going to continue to be the primary mode by which humans supervise these AI models, intentionally, in the sense that we want to be in the driving seat and we want to be able to command or manage millions of these AIs to accomplish a task. Managing those millions of AIs will be imparted or exchanged through some data. Labelbox is positioned to continue to become a bigger and bigger data infrastructure provider for all of those use cases: the ones that exist right now, and the ones yet to be invented.
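To illustrate the "RL environments and verifiers" idea in miniature: below is a toy environment where the reward comes from a programmatic verifier rather than a human rater. Real environments of the kind discussed here (code execution, math checkers, computer-use sandboxes) are far richer; this sketch only shows the shape, and every name in it is hypothetical.

```python
import random


class ArithmeticEnv:
    """One-step toy environment: the task is a question, and the reward
    comes from an exact-match verifier, not a human judgment."""

    def reset(self) -> str:
        self.a, self.b = random.randint(2, 99), random.randint(2, 99)
        return f"What is {self.a} * {self.b}?"

    def step(self, answer: str) -> float:
        # Verifiable reward: 1.0 if and only if the answer is correct.
        try:
            return 1.0 if int(answer.strip()) == self.a * self.b else 0.0
        except ValueError:
            return 0.0


def policy(question: str) -> str:
    # Stand-in for the model being trained; a real setup would sample
    # an LLM and update it on the verifier's reward signal.
    nums = [int(tok.rstrip("?")) for tok in question.split()
            if tok.rstrip("?").isdigit()]
    return str(nums[0] * nums[1])


env = ArithmeticEnv()
rewards = [env.step(policy(env.reset())) for _ in range(5)]
print(f"mean verifiable reward: {sum(rewards) / len(rewards):.2f}")
```

The domain experts Manu describes are, in effect, authoring much richer versions of `ArithmeticEnv` and its verifier for tasks where correctness is harder to check than multiplication.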

Nathan Labenz (1:36:10) That's a perfect place to end, but I have one more question, which is maybe the hardest one of all. If you peer into your AI crystal ball, where does that leave you on the question of a plateau around a sort of AGI level, versus the prospect of a takeoff from an AGI level into a superintelligence level? I know all the arguments in all the different directions. But some of what you've said, about positive transfer in general, and potentially the sort of human-level coder that, if it's as good as the best human coders and can be run super parallel... when I imagine that, I'm like, jeez, once you get to that level, you might take off pretty quick. You might really find all these new architectural innovations at a faster rate than we have been, and the whole thing could get crazy. But then some of the other things that you've said, when I focus my attention on those, like there being all these million slices and you gotta have data for each one, leave me feeling more like maybe we hit a plateau around the best humans. Which, by the way, I think I would prefer for safety reasons, for time to adjust; there'll be time for superintelligence, is my view. But what is your expectation? Do you think we'll have any sort of extended period at a kind of human expert level, without a fast takeoff from there?

Manu Sharma (1:37:41) So I think, in a way, both things are actually true. I think we are already in an accelerated takeoff by all means. We just get adjusted to it day to day: hey, why are all these AIs not so great yet? It hasn't done my work yet. But if you'd shown me this five years ago, man, it would have looked like some crazy capability from a very distant future. So in many ways, right now, we are in fast progress. I think the future is going to be much more synergistic. Think about it: there are 30,000,000 developers now, or something like that, and I think we're probably onboarding billions of coders over the next few years with this AI assistance. That's really interesting and exciting, and it has its own properties. So there's that. The fundamental question, which also teases out into the philosophy realm, is: how can these AIs ascertain quality judgment? There are a lot of things about the world that humans are able to figure out and put a quality judgment on, and in many, many scenarios we're not able to express why a certain thing is so good or bad, but we know it when we see it. It's this visceral part, an emotion that appears from the experience of reality in the human mind. These are concepts explored very heavily in books like Zen and the Art of Motorcycle Maintenance, which brings in a lot of really interesting ideas about quality from other philosophers and so forth. I think the question comes down to how that quality judgment will be imparted into AIs. So far, the way we are doing it is with humans teaching it in all these meta ways: rubrics, grading, things like that. I'm sure a year or two from now there will be very different techniques. But I think that's an open question. I'm very curious about it; I think a lot about it, obviously. And that would be really key for, as you described it, the sort of even faster progress. If you somehow can magically figure out, across the vast space of human knowledge, what makes things excellent, good, or bad, that is going to be key to making these AI systems progress very quickly.

Nathan Labenz (1:40:02) Yeah. Taste is maybe a domain to watch. This has been excellent. Is there any other closing thought or anything we didn't touch on that you wanted to leave people with?

Manu Sharma (1:40:12) No. Thank you for having me. It was an honor to be here.

Nathan Labenz (1:40:15) My pleasure. That's very kind. So thank you very much as well. Manu Sharma, founder and CEO of Labelbox. Thank you for being part of the Cognitive Revolution.

Manu Sharma (1:40:23) Thank you.

Nathan Labenz (1:40:25) If you're finding value in the show, we'd appreciate it if you'd take a moment to share it with friends, post online, write a review on Apple Podcasts or Spotify, or just leave us a comment on YouTube. Of course, we always welcome your feedback, guest and topic suggestions, and sponsorship inquiries either via our website, cognitiverevolution.ai, or by DMing me on your favorite social network. The Cognitive Revolution is part of the Turpentine Network, a network of podcasts, which is now part of a16z, where experts talk technology, business, economics, geopolitics, culture, and more. We're produced by AI Podcasting. If you're looking for podcast production help for everything from the moment you stop recording to the moment your audience starts listening, check them out and see my endorsement at aipodcast.ing. And thank you to everyone who listens for being part of the Cognitive Revolution.
