In this episode of The Cognitive Revolution, Nathan welcomes back Div Garg, founder and CEO of MultiOn, for his third appearance to discuss the evolving landscape of AI agents.

Watch Episode Here

Read Episode Description

In this episode of The Cognitive Revolution, Nathan welcomes back Div Garg, founder and CEO of MultiOn, for his third appearance to discuss the evolving landscape of AI agents. We explore how agent development has shifted from open-ended frameworks to intelligent workflows, MultiOn's unique approach to agent development, and their journey toward achieving human-level performance. Dive into fascinating insights about data collection strategies, model fine-tuning techniques, and the future of agent authentication. Join us for an in-depth conversation about why 2025 might be the breakthrough year for AI agents.

Check out MultiOn: https://www.multion.ai/

Help shape our show by taking our quick listener survey at https://bit.ly/TurpentinePulse

SPONSORS:
Oracle Cloud Infrastructure (OCI): Oracle's next-generation cloud platform delivers blazing-fast AI and ML performance with 50% less for compute and 80% less for outbound networking compared to other cloud providers13. OCI powers industry leaders with secure infrastructure and application development capabilities. New U.S. customers can get their cloud bill cut in half by switching to OCI before December 31, 2024 at https://oracle.com/cognitive

SelectQuote: Finding the right life insurance shouldn't be another task you put off. SelectQuote compares top-rated policies to get you the best coverage at the right price. Even in our AI-driven world, protecting your family's future remains essential. Get your personalized quote at https://selectquote.com/cognit...

Weights & Biases RAG++: Advanced training for building production-ready RAG applications. Learn from experts to overcome LLM challenges, evaluate systematically, and integrate advanced features. Includes free Cohere credits. Visit https://wandb.me/cr to start the RAG++ course today.

RECOMMENDED PODCAST:
Unpack Pricing - Dive into the dark arts of SaaS pricing with Metronome CEO Scott Woody and tech leaders. Learn how strategic pricing drives explosive revenue growth in today's biggest companies like Snowflake, Cockroach Labs, Dropbox and more.
Apple: https://podcasts.apple.com/us/...
Spotify: https://open.spotify.com/show/...

CHAPTERS:
(00:00:00) Teaser
(00:00:40) About the Episode
(00:04:10) The Rise of AI Agents
(00:06:33) Open-Ended vs On-Rails
(00:10:00) Agent Architecture
(00:12:01) AI Learning & Feedback
(00:14:01) Data Collection (Part 1)
(00:18:27) Sponsors: Oracle Cloud Infrastructure (OCI) | SelectQuote
(00:20:51) Data Collection (Part 2)
(00:22:25) Self-Play & Rewards
(00:25:04) Model Strategy & Agent Q
(00:33:28) Sponsors: Weights & Biases RAG++
(00:34:39) Understanding Agent Q
(00:43:16) Search & Learning
(00:45:39) Benchmarks vs Reality
(00:50:18) Positive Transfer & Scale
(00:51:47) Fine-Tuning Strategies
(00:55:16) Vision Strategy
(01:00:16) Authentication & Security
(01:03:48) Future of AI Agents
(01:16:14) Cost, Latency, Reliability
(01:19:30) Avoiding the Bitter Lesson
(01:25:58) Agent-Assisted Future
(01:27:11) Outro

SOCIAL LINKS:
Website: https://www.cognitiverevolutio...
Twitter (Podcast): https://x.com/cogrev_podcast
Twitter (Nathan): https://x.com/labenz
LinkedIn: https://www.linkedin.com/in/na...
Youtube: https://www.youtube.com/@Cogni...
Apple: https://podcasts.apple.com/de/...
Spotify: https://open.spotify.com/show/...

Full Transcript

Transcript

Div Garg: (0:00) I think OpenAI has definitely lost a lot of the lead they used to have with GPT-four where, like, they were kind of the sole winner and no one could catch up with them. At this point, it does seem like very homogeneous market where everyone is kind of close. When anything is disruptive, I think that it takes a lot of time for that to catch on. And I think we are in the start of the disruptive era where, like, a lot of the online communication and interaction will get disrupted by agents. Humans are pretty good at, like, navigating websites with UI. And so, like, theoretically, like, AI can also become very, very good, even better. I think I would call it the explosion of applications, which I don't think has happened so far. If you think about agent applications, I think it's still the rarity.

Nathan Labenz: (0:41) Hello, and welcome back to the Cognitive Revolution. Today, Div Garg, founder and CEO of MultiOn, returns for his third appearance on the show. A lot has changed in the AI agent landscape since I first spoke with Div in mid 2023. As you might remember, at that time, the AI community was abuzz about the potential for AI agents. With projects like baby AGI giving large language models access to tools and really only very minimal guidance, and then stepping back to watch and see what they could accomplish on their own. Of course, as it turned out, they didn't accomplish all that much. While there were many amazing moments, those were the outliers. On average, the step by step error rate, including on very mundane microtasks, was too high agent frameworks to successfully string many step sequences together all that often. And we also learned that while large language models can improve in many areas through self critique, they have a tendency to get stuck on obstacles that humans quickly find ways around. For that reason, much of the last 18 months of work on agents has gone into developing better and more prescriptive scaffolding, with many companies ultimately delivering platforms for what I call intelligent workflows. That is workflows that a human has designed and where the AI is needed to do some important subtask which requires intelligence, but where the AI is not given freedom to choose its own adventure. As of January, Div and the MultiOn team were still among the most bullish on open ended agents. And as you'll hear in this conversation, they have continued at least partially to buck that trend. They have built some new scaffolding and they have developed interesting techniques for domain specific fine tuning, but their agent continues to take arbitrary natural language requests and gamely does its best to fulfill them. The progress I found in my testing is pretty obvious. And in some context, the company claims human level performance. But still, the system as a whole is not a viable substitute for a human assistant. With that in mind, I was excited to pepper Div with questions about what he's learned from all of this activity. And so in this conversation, we unpack the latest in agent development, including the company's data collection strategy, the seemingly missing market for human computer use data, and the role of synthetic data in bridging that gap. The company's model strategy, including what models they've chosen as base, what fine tuning techniques they're using, and how their computer vision approaches have evolved over time. Why benchmarks so often show human level performance while the real world results are clearly not as strong. The future of agent authentication, as well as which parts of the Internet at large will compete to serve agents versus which parts will try to exclude them. And finally, what sorts of customers MultiOn is looking to partner with now, as well as how they're thinking about competing with hyperscalers in light of Claude's new computer use capability. Overall, it's clear to me that while it's taken longer than I had expected, reliable agents that can perform a very large percentage of routine computer use tasks are coming. It's only a matter of time. And as you'll hear, Div agrees with recent suggestions from both OpenAI and Anthropic that 2025 will be the year. That, of course, makes Div a very busy man, and so I very much appreciated his time and how open he was willing to be about the path that MultiOn has taken and the lessons they've learned along the way. As always, if you're finding value in the show, we'd appreciate it if you'd share it online, write a review on your podcast app, or leave a comment on YouTube. Of course, we welcome your feedback and suggestions via our website, cognitiverevolution.ai, or by DMing me on your favorite social network. Now here's my conversation with Div Garg of MultiOn, catching up on the last year in AI agents. Div Garg, founder and CEO of MultiOn, welcome back to the Cognitive Revolution.

Div Garg: (4:16) Yeah. Thank you, Nathan. Excited to be back.

Nathan Labenz: (4:19) So agents, agents everywhere. The last 18 months has been quite the saga when it comes to AI agents, their promise, and whether or not they fulfilled that promise. And obviously, you've been right in the thick of it as the founder of this company. How would you tell the story of agents over the last 18 months since, like, the launch of GPT-four? And where do you think we are now in the grand saga of AI agents?

Div Garg: (4:45) Yeah. I think it's been interesting to see. I would still say we're very early. I will call it, like, it's similar to the Internet explosion. So we are kinda seeing, like, the first wave of the Internet in a sense. Right? It's still, like, glitchy. It's slow. Most people have not used it, but the next thing that's gonna happen is, like, this will become, like, more mainstream. Like, the quality and the reliability will go really up. And now over time, this will become, like, a rise to a lot of, like, opportunities where a lot of, like, the things that happen in their daily life will become more agent taking over time. So I do think, like, agents is still very early. I think computer with GPT-4 definitely I think at that time, it was kind of a promise. They were gonna create these kind of things that are gonna happen. You will have this kind of automatic capabilities. Now I think we're starting to see the early versions of that, where, like, maybe people have found some, like, use cases in, specific verticals, some business use cases. I think it's still it's very early. Think most people have not created the agentic product, so I think it's still, like, haven't explored it fully, and I think that will take more time.

Nathan Labenz: (5:40) Tell me about your agent lifestyle today. In the past, you've posted some viral demos on Twitter of like having an AI agent order your lunch or do these kind of various transactions for you? I think you had one of the first demos of it booking a flight. How is your day to day experience of AI agents? Like, what are they doing for you right now?

Div Garg: (6:06) Yeah. Nothing that's been pervasive. Think, like, I'm still using agents a lot for, like shopping has been a biggest case, like Instacart, groceries, definitely calendar invites, and then maybe a bunch of stuff on, like, LinkedIn or Twitter or a bunch of these kind of places where like you want to have bots that can automate things. For me personally, I definitely have this like a lot of my scheduling is now based on agents. A lot of my groceries and some of my online shopping in the API is also based on agent.

Nathan Labenz: (6:33) So let's talk about kind of a few different approaches to building agents. It seems like in the beginning, the first wave of things that were launched often was just like open source, check this out and mess around with the kind of projects I'm thinking here of baby AGI and that sort of early wave. Were very much just like, hey. GPT-four looks really smart. Let's let it figure everything out and kind of give it very open ended environment, very sort of open instructions, and kinda hope for the best. It seemed like by and large, with notable and kind of exciting demos as the exception rather than the rule, those approaches led to disappointment when people found like, oh, well, it doesn't really work most of the time. It gets, you know, stuck in various ways and whatever. And so the counter narrative, I would say, has been or the counter, you know, the sort of compensating approach, maybe is a better term for it, has been to put the agents on rails to give them much less sort of autonomous decision making authority and instead build out like very prescribed workflows where the step by step is kind of designed by the human workflow designer and the AI is kind of there for the key bits that require intelligence, but is not like left to choose its own adventure. You shared a preview version of the latest MultiOn agent and I was messing around with that. It seems like you guys are still more in the open ended frame of development, but I'm sure, you know, you've tried a bunch of stuff and have perspectives on those two different approaches. So tell me everything about the open ended versus the on rails approach.

Div Garg: (8:20) Yeah. And I think that's a good question. I totally agree. Like, I think on rails is the right way to start from, especially if you know exactly what use case you wanna target. If you do anything like enterprise or B2B SaaS, whatever. I do think, like, this kind of, like, more linear, constrained workflows. I think it's the right paradigm to start with. And I think over time, I think you can have more open ended use cases which allow for more general purpose experiences. Now I think for us also, we are trying to take a road in the middle where we don't want to be too constrained because I think we're still targeting a lot of everyday use. And what happens, we have surveyed, like, I don't know, like, at this point, we have 3,000 plus people from our, like, all the user data we have. And I do think the consumer behavior is just very different for every point. So depending on demographics and age tropes and, like, where you live, bunch of things. I think consumer behavior changes a lot, so you can have very constrained linear workflows because then you have to build, like, millions of these, and, like, that's not possible. And then you also don't wanna be fully open ended. Right? Because, like, if you already have fully merged something like the computer use API that came out for Anthropic, It's very general purpose, but, like, it's, like, it's very hard to find a utility for them. And so I think we're trying to take a road in the middle where, like, how can you build more constrained workflows for, like, tasks that you can define? And then we are building a lot of, like, verifiers and, like, one thing in the direction was the agent queue research group permitted, where, like, the agent can learn to improve and, like, improve its behavior on new websites it has never seen. And so we're starting with, like, more domain specific models and the models, like, know how to, like, do, I mean, like, vertical workflows or task delivery. How these models can improve and, like, generalize over time. But we don't want to start with models that are fully general because that seems to be, like, a big effort and also doesn't seem something that will I think for a startup like us will pay off in the next 6 months.

Nathan Labenz: (10:00) How does that look behind the scenes? I mean, in my experience of the product, it does still present as a very open ended thing when talk about kind of having these linear or closer to linear on rails workflows in the background. The Voyager project comes to mind from NVIDIA. I'm sure you're very familiar with that, where they would have the agent in a, like, a virtual video game environment. I think it was Minecraft. Go out and, like, figure out new skills and then sort of cache those skills for future use so it didn't have to constantly reinvent the wheel. Obviously, that's like a pretty big open ended environment, not nearly as big and as wide ranging as doing stuff broadly on the Internet. But what have you found to be the right architecture for or the right balance point between, like, what is prescribed versus what is left to the agent to decide at runtime?

Div Garg: (10:54) Yeah. That's a good question. I think what paradigm we have landed on is kind of like user choices. So it's like you don't want the agent to be fully autonomous. I think example I like to use a lot is kind of like flight booking. So you don't want to have the agent go and book you like a random flight and waste thousands of dollars without, like, sort of, like, checking in with you, like, do you actually are you fine with this time? So do you wanna fly in the evening? Do want nonstop or, like, this airline? Stuff like that. So preferences matter a lot. And I think, like, the paradigm we have landed on is, like, choices where, like, you might give some potentially ambiguous, like, user query to the agent. Like, go book me a flight to this weekend for a trip to Paris, and then the agent then you want the agent to kind of fill in the blanks where, like, okay, like, that may require some, like, a lot of choices that have to be implicitly being made. And then, like, we are building a lot of these workflows where, okay, like, how do we capture these preferences? How do we make, like, involve the human in the loop? And I think that becomes start that starts becoming complicated where, like, you do are not relying on just the actions, but also maybe, like, how to personalize, how to maybe like really get to know like what to do. And then based on that, build like a really solid product experience.

Nathan Labenz: (12:02) What are the experiences that I recall from an earlier version of the product that I don't know if it's still there. Maybe I missed it this time around in my testing was the opportunity to essentially correct the AI if it made a mistake, you know, to sort of demonstrate or like teach, I think was the word that you used in the past. The AI how to do a particular skill. You know, I think probably most people would find it reasonably intuitive to imagine how that was supposed to work. Right? You would have the agent go do stuff when it makes mistakes. The human could teach it what to do. Then, of course, it would learn what to do and you'd maybe fold that into your training data later. I wonder if, like, that has played out as you've thought it would, or are you now doing that in a more implicit way, do you just find that like that human sort of teaching paradigm is just not so useful as anticipated? And if that's not working, like what is the alternative in terms of how do you source data to, you know, to teach the models what to do?

Div Garg: (13:01) I mean, we've had a good success with them. I think we collected, like, some millions of trajectories using that method. The problem with any sort of crowdsourcing is, like, you can't really trust the quality of the data, and I think, like, Tesla is a good example for that. I think Tesla has been trying this for almost every week where they crowdsource a lot of data. But, like, filtering, like, the quality of the data is is, like, actually good data versus bad data. And that has been, like, a thing where, like, you have to build a lot of filtering pipeline and build a lot of, like, AI logic around that. So I don't think we have been doing a lot of that where we've seen some payoff, but I think it's still like I think it's just like a very noisy. So that makes it like a hard problem. The second way we use, like, basically just working with, like, high quality annotation data where, like, I won't say, like, too much here, but, like, if you have annotators or, like, people who can, like, sort of, like, help people like expert data. And I think that seems to be like one paradigm, maybe a more frontier that's pushing towards where like you can go on like train these agents on a lot of these general purpose use cases and then like train get models that are working really well.

Nathan Labenz: (14:02) I hear you on things being noisy. I also imagine, of course, the counterpoint would be like human annotators are expensive and limited in supply. One question I have is why is nobody propositioning me to pay to watch me use my computer all the time? Like, I feel like there should be a market now in install a general observer that just kind of watches a person use a computer. Maybe you could have a couple different modes of it, including one where you like talk through what you're doing and explain yourself. Because obviously most of the time, the chain of thought is not actually said out loud and recorded. You know, you'd have to clean up that data. Of course, people are doing random stuff. You'd also have to anonymize that data. I would wanna know who you are, and I'd wanna make sure I trust you before I let you install that on my computer. But I feel like with the cost of annotation as people get into like higher and higher end knowledge workers or even like scientific experts in some cases for the annotation stuff, it feels like that's pretty valuable and I feel like I should be getting like 1000 dollars a month at least for somebody to watch me use my computer. So why isn't that happening? Or if it is, you know, point me to where I sign up.

Div Garg: (15:16) Yeah. I totally agree. I think the market is there now. I think it wasn't there, I mean, like, even a month or two ago. I do think it's a new market. I just I think it just depends, like, who's willing to purchase the data because I think at the end of the day, getting this kind of data is not that expensive because I think there's enough people who are willing to generate this data for very cheap, especially if we can, like, use other countries and, like, whatever, like, outsource a lot of this kind of data collection. I think a lot of this data can get very cheap. It does become, like, if the company is really interested in your personal data, and then they might be willing to offer, like, more, like, 1000 dollars a month just to be able to, like, get access to very, like, more, like, personal. Okay. Like, what makes Nathan what does it look like when he's using his computer?

Nathan Labenz: (16:00) Yeah. I mean, it's an interesting paradigm. It kinda reminds me of we did two episodes. They were separated by some number of months on MindEye and MindEye 2. And in the first case and this was, like, reconstructing what somebody was looking at, like, what image they were looking at by the fMRI data scan, from a scan of their brain at the time that they were looking at this image. In the first version, it was patient specific modeling that was being done and you had to have something like, I don't know, 20 hours or whatever of these scans with a person looking at a new image every few seconds to then finally get to the point where you could train a model on that person. And then in the second edition, they had figured out how to kind of, let's say, map all of these different users, each of which have, like, a different anatomy of their brain, literally, like, different brain sizes. There's a lot, of course, difference between people. Map that all into essentially a shared space, train one model on all of those data points, and then from that base model, it would only take, like, an hour of one individual's data to kind of fine tune it to their particular anatomy and activation patterns. So I can see that here too where you might say, you know, there's vast bulk data to be collected out there. And then we just on the margin, it's really about fine tuning to each individual person. So I guess what you're saying is like that market is international, and it's just a it's a lower price point because, I guess, people feel like they wanna separate out, like, scientific expertise from just, like, general routine computer use, and it's better to buy those separately at different market prices.

Div Garg: (17:49) Yeah. Because if you think what the generic computer use, it's not that expensive. It's also, like, I think, like, very common patterns. So I think you can just outsource a lot of that, get that very cheap. I do think there's value of personal data, which is, like, if you're willing to, like, share all your personal data and, like, exactly how you are actually doing things and you don't care about privacy, then I think people are willing to pay a lot. Right? Like, if they cannot get access to a lot of this private data, and then they can, like, maybe use that fully and then maybe, like, I do think, like, that is a market that doesn't exist, that can be, like, very but a market that people are willing to pay a lot of money if they can get access to this private data. And then you're willing to, like, sort of, like, live with that kind of, like, loss of privacy in a sense.

Nathan Labenz: (18:28) Hey. We'll continue our interview in a moment after a word from our sponsors.

Nathan Labenz: Do you know what those data I assume you're not doing this directly yourselves to the degree that you are sourcing this kind of data. I assume you're working with partners to do it. Are there like marketplaces that exist or you know go to companies? And are there I'm also curious as to what degree people are sort of logging in and doing like discrete tasks like here's your task, go do it on the browser, tick, tick, tick versus just kind of open ended observation that would be more of like an imitation learning paradigm? How would you describe the, you don't have to give us any secret sources unless you want to, but let's say I wanted to buy this data. What's out there, you know, that I can buy and who do I buy from and how much should I expect to pay?

Div Garg: (19:16) I would have to say there's lot of like data labeling companies. A lot of them we are working with. Again, I can't tell really well, like, who we are working with just for, like, more, like, confidentiality purposes. But I would definitely say, like, we've tried a lot of things that even if you use something like MTurk or whatever, like, online solutions. Anything you can, like, create task. Okay. Like, I want someone who's doing this particular whatever, like, the computer use workflow, collect a lot of data, and then get a lot of, like, data that you can actually train on. I'm pretty sure that's what maybe, like, Anthropic is doing, much of the that's what we're doing. We have definitely done that, but then we're also, like, thinking about a lot of, like, smart tricks where, how do we find data in, like, the regimes we care about? So how can we incentivize people to give us, like, the data that we don't have? And then make sure that we can, like, keep improving, figure out, like, okay, like, where is the agent most deficient on and get the right quality of data and keep them making better and better.

Nathan Labenz: (20:06) Are we at the point of self play? Another thing that I sort of expect we'll have to tip soon is just objective feedback from reality. Right? I mean, we've obviously seen that work in all these game playing environments. It has worked but hasn't quite maybe hit the critical tipping point yet for code generation, but I would assume that there's some analogous version of feedback from reality that you could do in an open ended computer use context too. Right?

Div Garg: (20:37) Totally agree. I think, like, the thing that's missing the most right now is just having a really great quality reward that you can use from the environment. So when you're in a game based setting, like, Minecraft or something, you get a good reward or, like, even, like, when you're playing chess, like, AlphaZero. You have, like, you know, exactly, like, what are the rewards for any moves you make in the game space. And then it's easy to optimize the policy and make it, like, yeah, this is what we wanna do to win. What happens though is, like, if you are in, like, a more of a real world scenario, there's no environment reward. And then you basically it makes it challenging because you don't know what is the objective that you're optimizing. And then you maybe need to train another, like, model that is kind of, like, giving, like, some sort of proxy reward that you can use to, like, sort of, like, optimize the model and make it better and better. I think for coding, I think it's easier because you can create, like, unit tests where you can have some sort of, like, static checking, like, so you can get a numeric score on the quality of the code and, like, does this code actually function well? And then you can do some little self play. Okay. Like, based on this kind of, like, score, can we optimize the score to actually become better and better and optimize it? I don't think it's possible, like, there'll be, like, some sort of, like the thing with these kind of algorithms is, like, there's always ways to cheat in a sense. So whatever metric or score you come up with, like, the algorithm might find a way to cheat. Like, it might find some sort of, like, I just find out, like, you can just create a lot of, like, null strings or something, and, like, that actually solves the problem and, like, gets you, like, infinity score or something like that. And people have seen that reinforcement learning a lot when you use that in game playing environments. So, yeah, so I do think it becomes a question of can you obtain, like, high quality rewards on the setting you are in? And I think computer use is, like, a hard one there, but, like, maybe a human can give you, like, the right proxy where you're if you're doing something like human feedback, much of the things things, I think that is possible, but I think it's a harder problem than just, like, putting self play on a game based environment.

Nathan Labenz: (22:19) I always remember, you know, that visual of I think it was a DeepMind video game player from, like, 2017 where the little boat is circling around and around the same space and, like, picking up infinite points, but not actually advancing in the game in the way that it was meant to. So, yeah, lots of vivid examples of mode collapse or other strange solutions in the reinforcement learning world. So what's your model strategy today? I read through the agent queue paper that you guys put out a couple months back. At that time, you were reporting results on a Llama 3 70B, which you were doing a number of things to enhance obviously from its base version. You know, is that still the baseline for what you guys are using in production, or how would you characterize the available, I guess, you know, commercial and noncommercial open source options that are available to an agent developer today?

Div Garg: (23:18) Yeah. At this point, I think we've got a lot of different models. So closed source, open source, commercial. Like, Llama 70B has been, I think, a good compromise in terms of speed versus the model reasoning capabilities. And I think right now, I think we're starting to go more towards what we're calling, like, GPT-four style where can you do, like, inference time compute? Can you optimize the model to do, like, chain of thought during inference and be like, become a better reasoner? So I think that seems to be the way to solve a lot of these complex reasoning problems. So that's also what we're going towards where, like I think at this point, the base model doesn't matter that much because I think there's enough base models. Most of them are performing, like, similar-ish. The recipe and, like, the data they train on are also, like, more or less the same. So the trick kinda becomes, okay. Like, what is your application? And how do you optimize a model on that application? And I think that's what we did with agent queue where, like, can we care about this kind of web interface that's kind of like a shopping environment? How can we get the right environment, like, feedback? How can we train on the feedback? How do we make sure that this base model can, like, optimize and become better and better there? And then can you combine that with some sort of, like, inference time compute tricks to make it, like, a better reasoner? So I don't think, like, at this point, architectures seems to start becoming irrelevant where, like, I think we might see some sort of, like at least at the current scale. It's possible, like, if you are able to, like, 10x the current size, the parameters of the model, then a different architecture might, like, start shining. Right now, think, like, most architecture seems to, like, level out at similar performance where, like, there's not too much diversity in terms of the models out there, whether it's closed source or open source and, like, how much difference you can get. So even if you look at the leaderboards, like, everything seems to be very tied up. It's everything's starting to come very, very close. No one has I think OpenAI has definitely lost a lot of the lead they used to have with GPT-four where, like, they were kind of the sole winner and no one could catch up with them. At this point, it does seem like very homogeneous market where everyone's kind of close and the base model architecture kind of seems to become, like, not relevant as long as you can use it.

Nathan Labenz: (25:14) Interesting. o1 would seem to be a possible exception to that or no? I mean, obviously, o1 is, like, expensive if you're trying to do mouse by mouse click, and I guess there's also a question right now of what is available is o1 preview, which doesn't have the vision aspects enabled yet, but does that sort of does that summary of everything sort of being on a roughly even level include o1 or not include o1?

Div Garg: (25:46) I think o1 is interesting, I think. I will say it's very good at math. People have found out, like, it's not very good outside the mathematical domain, but I think it's very highly specifically trained on. So I'll say, o1, I think it's not too different from what we have right now. Maybe, like, at least for the preview, maybe, like, the full o1 model is much better, and then it can shine and, like, work in a lot of different regimes. I do think, like, this kind of paradigm where you instead of, like, training then compute where, like, last year, GPT-four, everything was you just throw more compute during training, more parameters, more compute, more data, and then you get a better model. That's better reasoning. And I think we're starting to see this type of, like, instead of compute during training, it's for compute during inference. And then, like, the more chain of thought you do, the better you get. And so for o1, it's kind of like, if I throw less compute, maybe it's kind of, like, maybe even worse than GPT-four, but if I throw a lot more compute, maybe it's much, much better. And then so it kind of becomes, like, how much time and compute can I throw during inference depending on application? But I do think that becomes like a different class of how to think about these problems, especially if you think about this as kind of like how much latency can you target at inference and like how much resources are you willing to like give to the model?

Nathan Labenz: (26:51) Yeah, I'm interested. Maybe we'll come back in a little bit to talk about kind of user experience and how much you think some of that stuff matters. Like I'm always unsure about, you know, how much cost matters, how much latency matters. Obviously, it depends on the user experience and depends what the agent is trying to accomplish, but let's circle back to that. For a moment, let's stay on the agent queue research that you did. I've got a table here that I pulled up from the paper. There's basically two kind of environments, right, that are used in the paper. One is an e-commerce benchmark that was created by academics and put out there in sort of a self contained, you know, Amazon like environment for shopping. And then the other one is actually going and having the agent do things on OpenTable. If I understand correctly, in both cases, you get to about 95% success through a variety of techniques. And I thought maybe you could just like walk us through this graph, which people can pull up from the paper. But it's a nice sort of explanation of how you build up to the success starting from a Llama 3 70B instruct base model, which for the graph is coming in on the OpenTable task at not even 20% success. That compares to GPT-4o which is over 60%, but still leaves like 6 more bars on the graph where you're layering on these additional techniques and showing what the contribution of each one is. So you wanna take us from, again, under 20% on Llama 3 70B to a little over 60% GPT-four o to through 6, you know, alternatives all the way to the 95%.

Div Garg: (28:35) Yeah. No. Definitely. Yeah. I think this is what I'll say, like, is possible with vertical specialization of models. Because I think what happens is like, most of these models are just trained on very generic intent data, and then and they're not, like, really, really super great at one thing. And I think that's very obvious when you look at, like, Llama. Like, Llama's, like, base performance, like, 20% because I think it's probably never been trained on this kind of, like, tasks before. But then you start looking GPT-four. I think GPT-four maybe has more browsing kind of, like, capabilities or I think the OpenAI still has a browsing tool. So they maybe have more data on this kind of interfaces, and that makes it, like, a better reasoner and a better, like, better accuracy on this kind of, like, action tasks. But, again, I think what happens is that you just this is very generic. There's no, like, specific fine tuning towards a particular vertical. I think we'll see a lot of loss of accuracy. And I think that's one thing we've proven with AgentQ where we're, like, now suppose actually you have a lot more model that performs 20%, but then based on like a lot of the data that we collect, can we actually boost the performance to make it like essentially solves the environment almost all the time? And then we try this experiment where we're like, let's use a lot of like techniques like DPO, maybe doing reinforcement learning using, like, human feedback. Let's combine that with maybe, like, stuff like Monte Carlo tree search where it can search the space of the websites so we can figure out we can explore the websites, figure out, like, if we go down this route or this link, would it work or not? And if you're like, okay. This didn't work. This won't work, and then keep doing that until we can kind of, like, keep fine tuning, make it better and better. And then we over time, we have, like, a very specific, like, vertical agent capability Like, now this agent just knows, like, clearly, I've been with the subset. And the great thing is, like, this thing just took us one day, so it was very fast. And it's kind of a self improvement cycle so that you can just keep doing it more and more. So And there's some limitation on, like, how much can you improve the performance as long as you have a good quality feedback. So as long as your feedback signal is very, very high quality, I think you can just, like, keep making this better. And then I think that was, like, a very interesting learning. Like, we were actually also very surprised. Like, okay. Like, that's sort of almost, like, a 4.5x improvement. We were like, yep. That's kind of crazy. Like, just in one day, we were able to, like, boost our model that's performing at 20% to, like, get this all the way to where we were able to push it with techniques we used. And I think that's kind of the good starting point here where how do you explore these moments, and then how do you keep have this capability where you can, like, self learn and optimize?

Nathan Labenz: (31:09) Hey. We'll continue our interview in a moment after a word from our sponsors.

Nathan Labenz: So let's go one by one. I think it is worth just breaking it down. The first one on the chart is Llama 3 70B instruct RFT. Is that a reinforcement learning fine tuning or a different kind of usually, it's in just simple, like, instruction fine tuning is, like, the first layer of post training. Right? But I'm it wasn't clear to me what the RFT stood for in that case.

Div Garg: (31:42) Yeah. I don't think it's, like, an older algorithm. I think it's a research paper. I think it's a reinforcement fine tuning. I think it's research paper, like, someone the algorithms that can read, I think, that came out last year. So we use this as a baseline, and then we compared to the bunch of other methods, and there's a couple of work from, like, Salesforce, the groups. And then the interesting thing was, like, no one kind of, like, thought about like the direction and stuff. Thought about where like, how can we do more efficient search and like a main pillar of this kind of like learning. And then like, how do you self optimize?

Nathan Labenz: (32:11) So that's reinforcement learning. DPO is next. I know you're right in the heart of DPO country there in Palo Alto. As maybe an aside, can you help me develop my intuition for the DPO algorithm? I feel like I'm on a quest, you know, I can look at the equation and that doesn't jump out, you know, as like superintuitive to me. How would you describe in a qualitative or intuitive sense what DPO is doing and like how a set of preferences on different generations is ultimately being translated back into, like, updates to the weights of the model.

Div Garg: (32:50) Yeah. So I would say DPO is a very intuitive algorithm at the end of the day. So when you do, like, supervised fine tuning, what you usually do is, like, you have a bunch of expert data. This is, like, kind of the optimal data. And then training the model to be, like, here's my ground truth optimal data. And then you want to, like, sort of, like, make the model predict behavior that's similar to it. And then you're doing this learning where you're doing, like, gradient descent to, like, go more closer to imitate, like, what the data looks like so the model's predictions are similar to it. I think, DPO, I think the thing they do is I will call it, like they also use negative data in a sense. So they do, like, this kind of contrastive learning where they're like, we have some positive feedback data. We have some negative feedback data. And then we wanna do gradient descent towards the positive data. So we wanna model it to, like, go closer to the behavior in terms of its predictions towards the positive outputs, but we want to do, like, gradient ascent on the negative data. So you want it to go away from the negative, like, behavior. And I think that actually makes it work better because, like, if you're just kinda, like if you think about, like, maybe, like, the positive behavior is kind of a circle and the model is kind of, like, trying to get close to it, It's possible like the model might be very widespread. So it's like it can cover the circle, but might also be very widespread outside. And now when you're doing DPO, you're saying, okay, like, here's the positive stuff. There's the negative stuff outside. And you're telling the model, the model has to kind of become close to circle. It will basically have all alterations to be here, so it can't be widespread. And so I think that's a one good way to think about it. Like, you're kind of, like, giving it more what to do and what not to do. And the what not to do, I think that kind of, like, thing becomes very useful, especially when you're doing this kind of reinforcement learning where you're, like, this is plus one, this is minus. And then so I think that at the end of the day, it's a very intuitive algorithm, and that means just a bit of fully formulated using, like, a lot of reinforcement learning, like, algorithmic literature and, like, principles. And then overall things like kind of like gradient descent towards the positive samples, gradient ascent away from the negative samples, and then you kind of like counterbalance that by summing the sum of like all the samples and dividing that to normalize the factor. I think that's basically it.

Nathan Labenz: (34:54) I know in like typical instruction tuning much like the pre training, it's literally just token by token evaluation, right, of the output. And then for PPO, I know there's like a reward model that is responsible for sort of scoring all the tokens and that reward model ultimately gives a signal of like, this token was really good, this token was not good, whatever. With DPO, I understand that there is no reward model. And yet if I understand correctly, it's like, it's not just training the model to predict like exactly what the tokens were. Right? Cause with the negative examples, you can't just say, don't do that token. You have to say like, okay, well, so what? Right? So can you give a little more intuition for how, if I have whole generations that are scored positively and negatively, how is the model understanding or how is the algorithm translating a score that is like not token by token into something that can be ultimately applied token wise to update weights through back propagation?

Div Garg: (36:06) So I would say first the initial data that was formulated, I think that's token wise. So that first, the original data from RLHF papers. So I do think that one you could be giving a positive and negative feedback for each generation of the model. And so it basically becomes like supervised fine tuning. So you're saying, like, okay. Like, here's this generation. This gets positive score. Here's this generation. This gets negative score. And then you can kind of, like, just directly put that in, like, a simple loss equation and train on that. So you don't have to actually think about trajectories in the original DPO. The one modification we did in our AgentQ paper was kind of, like, come up with trajectory level DPO. So when you're, like, working on environment, you're taking multiple steps. So you are not, like, just outputting, like, maybe a single token and, like, finishing that. You have to keep taking a lot of steps until you reach your end state. And then you can kind of generate a lot of trajectories. And, like, once you generate a lot of these trajectories, some trajectories have positive scores and negative scores. And that's more close to maybe what PPO does usually take in their reinforcement learning setting. And then for this to work, then you have to, like, think about, like, how can we apply this DPO algorithm on a more like, a trajectory level. Like, all these scores are for trajectories, not for individual steps. And I think then we propose this, like, trajectory level DPO that you can find in the AgentQ paper where, like, how can this work and, like, how can we balance, so if you have, like, different trajectory steps. And I don't think that's there's, yep. So the original, think I was called, like, very simple. Like, it's working on, like, per token or per step basis. So you don't have to actually think about trajectories.

Nathan Labenz: (37:34) So in the original version, you're saying basically my positive score or my negative score is just applied to all tokens equally and then that signal is propagated and that's it. Can you give a little more intuition for how the next generation works? You know, this is maybe more tedious than some we're interested in, but I really do want to continue to develop my intuition for what exactly in this as precisely as I can understand it, what is the signal that we are actually sending into the model? Cause I feel like that's really helpful for at least having some foundation on which to make like guesses or on which to interpret the downstream behavior that comes from that.

Div Garg: (38:20) Yeah. No, that makes sense. Also at the end of the day, the signal is kind of, like, the positive or negative score per generation. So I suppose, like, you're applying DPO to a large language model, and then you say that, like, maybe, like, I want you to maybe, like, say, like, who's the current president of the United States or something. And I suppose, like, the model knows the right answer, and you can give it, like, this was a good generation, so you it a plus one score. But if it comes with a wrong answer, you can be, like, correct. This is not correct, and then you can give it a negative one. And then you can do this multiple times where, like, you can ask her. You can generate 10 generations, and then you can have different voters different people who are, like, giving the score. It can be the one person who's giving the score. And once you have enough of this, like, here's all the positive generations, here's all the negative generations. When you put that in DPO and you're like, now the goal is to improve the sort of like optimize the probability of outputting the positive generations and minimize the probability of outputting the negative generations.

Nathan Labenz: (39:14) Okay. I still want to keep studying this a little bit more, but maybe I'll leave the rest for another day. Returning to the techniques in the paper, can we first just talk about like what is the definition of AgentQ? Like just to be sure I'm clear on that. And then I see, you know, jumps off the chart that enabling the Monte Carlo tree search drives a big boost in performance, but maybe you could sort of talk about what are the, in addition to that, like what are the biggest drivers of improved performance? But first just give us like what is the bundle that constitutes AgentQ?

Div Garg: (39:51) Yeah. Like I would say AgentQ again is a simple idea. I think it's kind of like it borrows from I think Richard Sutton's bitter lesson where, like, the only thing that seems to work is, like, search and learning. And I think, like, we were kind of very inspired by that. Like, even if you look at, like, Noam Brown's work at, like, the work he did on diplomacy at Meta and then, like, the stuff he's working on currently at OpenAI. A lot of that is, like, how do you combine search and, like, how do you make that and, like, use that with learning process to make it more intelligent, like, algorithms. And so we did a similar thing where we're, like, search just seems to be very underexplored. Like, no one has really thought about it. Like, how do you do search for agents where, like, that seems to be, like, an obvious thing. And we're like, let's can we now, like, combine a lot of this, like, Monte Carlo tree search where, like, it's easy for the agent to go explore the state of different environments? And then once we can explore these environments, can we get a lot of, like, this kind of, like, positive, negative reward? We're, like, so the positive generations, positive, like, trajectories, which are actually the goal. These are the negative trajectories which fail which fail to reach the goal. And once we have that, then can we put that in a learning process and then, like, keep optimizing. So it's like, maximize the probability of positive trajectories that actually reach the goal, minimize the probability of trajectories that don't reach the goal. And how the MCTS helps is it's like we could be kind of giving, like, okay, like explore as much as possible. So we're kind of, like, adding a lot of entropy where, like, it's possible if you're not exploring the environment, maybe just like a very narrow path that the model has learned, and then it just thinks like, this is just the one thing that can do on the environment space. And so it just doesn't know how to navigate the environment. But we try to add a lot of entropy to the initial exploration for search. So, like, the model is trying to go and, like, take as many routes as possible. But then over time, it's finding out, okay. Like, this route should work like that ends. Okay. This didn't actually reach the goal. So let's avoid that in the future. These are the positive ones. Let's keep doing that. And that way, we, like, can train a model that has kind of seen the whole environment splits and has learned to collect exactly, like, how to reach the goal. And we keep doing that again and again until we can make this model like self optimize and become better and better to maximize the quality of success.

Nathan Labenz: (41:54) So by applying all these things, and I didn't realize until this conversation that this was a relatively quick project with like not a ton of compute presumably put into it. Although maybe, I mean, you can spend a lot of compute in a day, but I assume you didn't spend that much compute in just a quick sprint of a project like this. You gradually climb all the way up, you get to 95-ish percent on the one benchmark that is better than human performance. And this sort of gets philosophical pretty quick, but we see this all over the place, right? Where it's like, oh, AIs beat humans at this benchmark and that benchmark and all these benchmarks. And I do take that very seriously as a signal of how far we've come. And yet at the same time, if you just looked at all the benchmarks, you would think like we definitely have some sort of AGI running around and yet somehow we don't quite yet. So I guess, first of all, how do you think about that divergence between the AI performance on a benchmark versus a human performance on a benchmark? Like it's still, I assume you would agree and I would say candidly in my testing of the agent, it can go do stuff, you know, it can accomplish some tasks, but I'll bet on myself in a Paul Bunyan versus machine style competition between me and the MultiOn agent still. So why do we not why do we see these benchmarks with the AIs winning when it obviously people design them to try to represent the real world and I actually did go look at the web shop thing and it looks like a pretty, a little bit bare bones, but like a pretty normal-ish e-commerce environment. Are people not trying,

Div Garg: (43:39) I don't

Nathan Labenz: know, I've come out, you know, one idea that comes to mind is maybe people aren't trying that hard on the benchmarks, but I don't think that's probably usually the case. I don't know. What's your theory of the apparent divergence between what the benchmark seem to be telling us and what we actually then see in practice?

Div Garg: (43:55) I'll again call it like a difference between real world scenario versus the mock scenario. It's like a benchmark is a more like a narrow domain. Like, it'll be like a couple of, the web shop mean, the web shop environment has, like, maybe, like, 5, 6 different, like, environments. And what we also do is we are training on each different environment each different site. So we can optimize the model to be very good at that site. When we do more, like, general purpose agents, and that's working in the real world, like, until we have the leeway to go and, like, optimize this agent on every single website, I think, like, the performance won't match. Right? Because, like, a benchmark is just a very narrow domain, so it's just to do well on the benchmark. For us to do well on the entire Internet, that is just a much more challenging problem. I think it's possible. Think about it. But I think it's just much more complex and, like, it requires more resources. Now I do think that's kind of, like, where things start emerging where, like, a benchmark is a, like, a simplified version of what you will see in the real world. But if you think about it as a vertical, like, because I like, here's this one narrow domain that I really care about. And then we get we take a look at, like, the average human accuracy on the domain. And then we take a look at, like, okay, like, what can we do with agents. Then the surprising result is that, okay, like, on this narrow domain, we can actually beat a human. And that becomes, like, sort of, like, a very interesting learning. Okay. Like, now it's gonna be, like, beat a human in a very narrow domain. Then, okay, like, over time, you can, like, keep generalizing and generalizing that you can, like, start beating humans and, like, better in more generalized domain. But the question then becomes, like, okay, like, how much data do you need? How much feedback do you need? How much scale do you need? How much compute is needed? I think that's kind of, like, an interesting scale problem. They're, like, I don't think we've kind of figured out like the fundamental base to solve the problem, but then now you have to apply scale to make this like actually really shine.

Nathan Labenz: (45:37) Are you to the point now? I mean, of course, this is like the big. There's obviously multiple big stories, but I feel like a good candidate for what's different about the current era of AI versus previous years of AI is that we see positive transfer across lots of things, right? Like it used to be the case that if you tried to train a model on two tasks, it wouldn't be as good as two models trained on one task each. Now you have with foundation models, this sort of like foundational ability that seems to be, you know, rather quick to generalize to other things and training on a million tasks turns out to make the million and first easier to learn. Do you have a sense for where you go from negative to positive transfer as you scale up an agent like this? Like, I presumably it doesn't happen if you go from just OpenTable to just OpenTable plus Amazon. But if you keep if you just kept adding and I'm sure you've, you know, explored this sort of thing. If you just kept adding different sites, is there a point where you start to see positive transfer? Like, do you have a sense for where the threshold or the tipping point is there?

Div Garg: (46:52) Yeah. Definitely, it's correct. Especially, suppose you have like a lot of shopping domain websites. So you can have this like Amazon or Target or Best Buy, Walmart, whatever. Most of these websites actually look similar because it's kind of like a search bar, product catalog, and you have some sort of, like, detailed product view, checkout. And there's a lot of positive transfer, like, even if you train on one website and then you were, like, generalize it to, like, similar domains. I think you actually get a very, very good transfer. So I think within each category, I think there's a really good positive transfer. And overall too I think, like, the Internet is not it's I think it's made for humans to use. It's not that diverse. If you think about, like, the different UIs and, like, elements that you encounter, you basically see, like, a lot of, like, drop downs or drawers or, like, navigation bars and whatever, like, text fields. But I think, like, there's a lot of commonality. So the model can, like, learn how to work on, like, a lot of these interfaces. I think it does a lot of positive transfer where you can work on new websites and then, like, generalize. And that's also, like, one thing we're very bullish on in what we're doing with web agents. That we can keep making these agents better and better. And over time, I think I do think there's a scaling law here. But, like, once you do hit enough scale in terms of like the number of websites you've trained on, then you can generalize to the remaining websites you're working on.

Nathan Labenz: (48:08) Let's come back to the scale, the question of scale and the scaling laws and the possibility of bitter lessons and for whom the lessons will be bitter again in a minute as well. Just staying on your model strategy for a minute longer, curious what you've learned about fine tuning. I feel like there are you know, obviously, lots of different strategies for fine tuning. LoRA seems to have kind of become the default thing because it's efficient. I was a big fan or at least I was really very excited to read a paper called MoRA from not too long ago, which was an alternative that was, like, similar number of parameters but higher rank. And the math on that escapes me a little bit, but conceptually, they reported sort of a denser fine tuning, denser use, more intensive use of the available parameters. And they did report that it was better at learning facts, which I thought was really interesting. So interested if you've experimented with that or if you're just biting the bullet and doing, like, full weight training.

Div Garg: (49:11) Yeah. I don't think we see, like, lot of improvement just with LoRA. I think LoRA is, like, very efficient. It's a very good way to train these models and fine tune them on your application. I do think there's also, like, a lot of variants for LoRA. I think there's, like, patch, LoRA, and, like, there's much of, like, these kind of, like, variations, and, like, a lot of this works really well. I do think, like, LoRA works really well. There's definitely a lot of this new innovation that's happening where people are coming from more alternatives. So I think that's interesting space to watch for. I think full training can help, but it again, is about how much data you have. So if you have, like, a crazy amount of tokens, have, like, billions of tokens, then definitely you should be, like, full refund training. But if you have, like, 100,000 tokens or in that range, I think then LoRA is actually, like, a better method because you have less parameters to optimize. And so if you have less tokens less data, then I think that's actually a better use of the data. And then you actually get better performance. And but if you have, like, billions of tokens, then I think, like, you should be optimizing more parameters. Yeah. I think a good way to also think about this is Chinchilla scaling laws. So, like, what is the optimal data for optimal number of, like, the size of a model? And so if you have, like, this many parameter model, this is the amount of data you should be throwing at the model. And I think that's a good way to think about, like, LoRA versus not LoRA.

Nathan Labenz: (50:24) I guess I would maybe expect you would have billions of tokens. It sounds like you're not it sounds like you're not using huge datasets. Obviously, quality is a super important dimension. Should I infer from this that you have been just trying to optimize the quality of the data set that you're working with over quantity and it's just not that big that you the LoRA approach still is enough?

Div Garg: (50:50) Obviously, we've tried a couple of things. One thing we definitely care about is speed. And then if you're training over, like, billions of tokens, that is a very long process. And I think so that we have some experiments we've been running, but for us, it's about, like, quick iterations. And so what can we do with, like, a less amount of data, and how can we make more efficient use of models that we can deploy in production and then, like, make use of for our customers. So in that scenario, we have found, like, well, if you look even if you look at AgentQ, like, we did, like, maybe a couple of days of training, and we were able to get better performance on this, like, narrow domains. It actually is a better thing if you want to build product. If you're, like, doing, like, purist research, then I think you should be just spending, like, 6 months training, like, hundreds of billions of tokens and, like, kind of train, like, the best general agent possible. But yeah. But the learning so far for us sorry. I also will say for the space is, it's much easier to build a narrow agent and it's also very easy to much, much easier to productize unless you want to be like a foundation model company.

Nathan Labenz: (51:49) Yeah. Okay. Gotcha. What is your vision strategy right now? Is that all, I mean, of course, by vision, I mean like interpreting what the agent encounters on the computer screen. In the early days, we were seeing lots of examples where people were trying to parse the DOM of the website and figure out how to like strip all the crap out of the HTML so that it would fit into the context window. And now of course we've got much more in the way of multi modality that we can take advantage of. I mean, in my testing, did find a couple of weird places where the agent sort of got stuck and like was saying that it couldn't find something that was like just plainly on the screen. It wasn't clear to me if that was like just, you know, I don't know. I don't know what was happening there. Like, was the model not able to see well or was there some sort of weird situation like an iframe that would could possibly be causing some visibility issues? So I guess two parts of that question, like, what's your overall strategy for interpreting the visuals and maybe what are some examples of just weird challenges that pose problems, you know, relative to kind of the happy path that people would think of first?

Div Garg: (53:02) Yeah. So I would say at this point, like, I think we're doing a lot of hybrid pipelines, but I think there's a lot of, like, useful stuff that you get from the DOM, where there's a lot of, like, meta tags and ARIA labels, which is kind of, like, optimized for bots. And then you have a lot of, like, the visual data, which has a lot of layouts and states. And we're doing a combination of both, so we try to be hybrid. Where, like, can we get the best of both worlds? And then can we use to train the models and do things? One challenge we always I think this is something that throws off a lot of people is because if you're a human, you kind of expect these things to fully work like a human. Right? But what's happening is, like, the representation of the screen of the UI that the agent is seeing is different from what you are seeing. And then the and that makes, like, the behavior of the agent could be, like, a bit weird where maybe it's able to click on invisible elements. So it's able to maybe, like it doesn't need to scroll down. It can just, like, kind of, like, see the whole, like if there's a website with 10 pages, maybe it just says it's able to see, like, all the 10 pages without needing to scroll down like a human. And I think that kind of, like, throws off, like, people from their experiments where you have to artificially figure out how to constrain it to be more like a human. And then there's also these things where maybe, like, you might have some sort of elements which are tricky to detect or identify, which maybe, like, are very obvious for a human, but, like, like, maybe the agent is not able to see that. Maybe it's some sort of, like, complex JavaScript issue or some sort of iframe issue. I think, like, right now, we don't see a lot of that. I don't think, like, in our new prototypes, I think we have been just, like, trying a lot of different things. So then they might have some deficiencies there. But I do think like a lot of the major products we have, I think like we have been able to solve a lot of these things over the last couple of months where like now our pipeline for processing like the information on websites. I think it's very robust.

Nathan Labenz: (54:48) I realized too there's probably somewhat of an adversarial situation happening because what I was trying when I observed that was and just to give people a little bit of a sense for the feel. So you added me to a TestFlight where I could go download the in development MultiOn app and then basically open the app, tell the agent what you wanted to do. It then has kind of a dual interface where it's like telling you what it's doing and allowing you to pause it or give it additional instructions, feedback, whatever as it goes. And then if you minimize that, you're just watching the screen and it can also talk out loud to you. So you can basically hear it narrate its progress and you're watching the, you know, the state of the browser evolve as it goes on and does its thing. Well, I'm in Detroit and it's the day before Thanksgiving as we're recording. So we've got the Lions playing their annual Thanksgiving Day game tomorrow, and the Michigan Ohio State is also this Saturday. So I just asked it to go get me tickets for each of these games. I tried both. And it was it was successful in terms of, like, understanding my query. It was successful in terms of doing, like, an initial search. It actually did a couple different approaches, you know, and different trials. It in some cases, like, went to Google and searched. In other cases, it went like direct to a ticket site from just prior knowledge. It was able to search successfully on the ticket sites. And then there were a couple of instances where it was like, those tickets were now on the screen, but it would say, I'm having a hard time finding the tickets. And especially if you're saying that's rare overall, I suspect that possibly part of what's going on is, you know, obviously these ticket companies have bot wars going on for many years now at this point where they're trying to prevent people from buying up and reselling and whatever. I don't know. No expert in the ticket market, but I know it's complicated. I guess all that is to say, what do you think is the how do you describe the dynamic right now? Maybe I didn't I wasn't thinking about this at the time, but as I am thinking about it now, I'm thinking you could imagine this cutting multiple different ways. Right? You might imagine new auth frameworks for agents coming online. And by the way, I also another test, I didn't have to log in for these ticket sites, but another test where I did need to log in to create an account, I tried shopping for Thanksgiving dinner on Instacart as well. To do that, I had to you know, at some point I got to a, you must create an account screen and I just kind of paused and like I put in my email And then, you know, when it sent me the confirmation, I had to go get that out of my email and provide that to the agent so it could provide it. It was able to fill those steps with me putting that stuff into the chat. But you could imagine, like, different parts of the world evolving very differently where Instacart might say we want to be an aggressive early adopter of whatever sort of auth paradigm is coming for agents because we obviously want people to shop however they want to shop. And if they want to have an agent shop for them, more power to them. Whereas a Ticketmaster or a StubHub might say, we want to ban all agents effectively because in some way or other, they're eating into our margins. And so whatever countermeasures they might put up. That's a long prompt, but I'd love to hear your thoughts on if you see any good auth frameworks starting to emerge, if you are seeing like countermeasures in the wild that you feel are kind of anticipating an AI agent wave and trying to resist proactively? And I guess just kind of generally, you know, where do you see that sort of thing shaking out?

Div Garg: (58:36) So let's see. First one, definitely authentication, I think it's getting more mature. I'm adding there's a lot of, like, authentication providers. I think like Anon is one. I think that's been trying to build a lot of agent identity for browsing kind of stuff. I think as there was a new one that came out agent, auto something for APIs and, like, plugins integrations. So I don't think that's it's starting to get mature where people have been thinking about auth. And I don't think that's become major enough that, like, everyone can go and, like, work with, like, some sort of, like, authentication provider and, like, plug it in and, like, you're good to go. But the second question becomes like, okay, like are the current services and websites gonna be adversarial or cooperative? I think long term they will be cooperative because I think it's kind of a win win over time. Short term it's hard to say because it's possible like some people might just take the direction or the positioning. More to this is kind of like, maybe harmful behavior. It's kind of like, more like spammy. And that kind of becomes like, how do you build trust? How do you make sure that you feel like you are actually helping the website owners and the services and the merchants? And I don't think there's been a situation where, like, as I said, to make sure I think it's transforming a lot of the experiences. But it's like, if a merchant used to care about, like, the revenue and number of users, And I think that's something that agents can help. I just like I think it's like and when anything's disruptive, I think that it takes a lot of time for that to catch on. And I think that we are in the start of the disruptive era where, like, a lot of the online communication and interaction will get disrupted by agents. And I think, like, there might be a lot of, like, initial backlash where, like, this is kind of dangerous, and this is not safe, and this is causing whatever this issue is. And but I think, like, over time, I think it's a win win. It's just a different paradigm and it's an unexpected paradigm and journey.

Nathan Labenz: (1:00:23) So, yeah, let's look ahead a little bit to the future. I've got kind of multiple related questions. I mean, you guys have been one of the most, you know, quick to put things out there and try anything with this and see what happens and very open ended. As you've described in this conversation, you know, you've not bucked the trend when it comes to getting somewhat more narrow and being like, okay, you know, let's really nail some dialed in use cases. So where are you today on that? Are you working with, like, businesses? And, you know, if so, I'd be really interested to know what sort of use cases they are finding agents to be valuable for. I think that's probably like the the probably the most important question for all of this stuff is like, where are people finding value today? And then and maybe that also kind of answers like why the auth stuff isn't so much of a focus at the moment if it's like, well, we're if we're working with enterprises, we're in their authed environment and it's a whole, wouldn't have to worry about signing into StubHub or whatever. So, yeah, I guess I'm really interested in, like, where you see you know, you've mentioned, like, a 6 month time frame. What bets are you making right now to, you know, start to bring real value to the world in the not too distant future?

Div Garg: (1:01:43) So I was gonna say, like, verticals. I think, like, more, like, vertical use cases, that's what we're going doing in. Internally, I think I think that's the direction I'm bullish on. I think the AgentQ was kinda like our first step in the direction of, like, can you have vertical agents that you can, like, train and learn on? So you don't want, like, something that's purely hardcoded, so you don't want, like, something like playwright, script products, a lot of scripting language. That's it. Because it's purely hardcoded. It's very, like, brittle. It will just break all the time. But you don't want, like, some sort of, like, like if you want the model to be able to adapt, so that maybe, like, the interface change is able to, like, recover. It's able to, like, improve. And but, like, I still wanted to be, like, narrow now. And I think, like, AgentQ was kind like, showing, okay, this is kind of possible. And that's one thing we've been working with a lot of our clients on where, like, can we now ship this kind of, like, more narrow agents for specific use cases, which could be maybe and I think it'll explore a lot of different sectors. Some would say, like, a good example is maybe, like, restaurant reservations or if you wanna do, like, some, like, travel kind of things, if you wanna do, like, scheduling. And I think there's a lot of these kind of sectors, and I can't say too much about the bets we're making right now. And we feel like outside of, like, how can you kind of build this kind of, like, agentic, like, verticals where, like, this agent becomes very good in this particular vertical, and then keep optimizing it and making it into, like, a solid product experience. And then there's a lot of have to think about reliability and then, like, optimizing for the user interaction and human in the loop and bunch of other things. I think people are starting to think about it, but I think it's still very early. And I think, like, building a product is just a complex thing because it's just not the agent. I think it is just like 10 or more other things.

Nathan Labenz: (1:03:20) Yeah. I think what I'm kind of inferring from your comments is that you're kind of finding a third space that I hadn't really considered as much where one approach would be to say, here's a here's, you know, consumer, here's the do whatever you want. Here's our open ended agent. Another one would be to say, we want to kind of compete in like the business process space where it's like back office, not visible, super structured. Then OpenTable kind of represents a sort of a middle ground where you might say, we want to, and maybe Instacart would be another, you know, great example of this or kayak. We want to be the partner that creates an agent for your site that could still be consumer facing, but because we can partner with you in a deeper way, we can really dial in that reliability and get that product experience to be exactly what you want it to be for the kayak assistant agent or the Instacart agent or the OpenTable agent or obviously people can fill out the list for themselves. But am I headed the right direction there? That's what I'm reading between the lines from the AgentQ paper.

Div Garg: (1:04:36) Yeah. Yeah. I do think, like, this can be applied even for business processes too because there too I think, like, if you look at the current issues people have with UiPath or other automation software, I think it's just, like, it's very brittle, takes, like, a lot of, like, onboarding ramp, but it requires, like, special, like, engineers who are, like, just really good at building this kind of, like, automations. And then now I think if you can decrease the complexity, that also is a lot of use cases where, like, if we can enable anyone to build this kind of, like, automation, even for business processes, and this automation is resilient, then I think that also is a big unlock. So I don't think it can be applied there, and I think there's, like, with, like, interesting use cases there that also we might look into. But we do at the end of the day, think, like, we're just trying to build something that's more like everyday purpose in a sense.

Nathan Labenz: (1:05:24) When you think about being that layer for some of these complicated products, you know, like a kayak, right? There's a lot of UI. If you were going to do a partnership with a kayak or with an OpenTable, would it still make sense to work through the UI or would you start to augment the UI action space with like a specific sort of more API like, you know, or tool kind of modality? Because you could imagine the agent might want to, you know, again thinking of kayak, right? All the little UIs in that sidebar, it seems like the AIs would have a better job if they know that they're the kayak agent and they don't have to generalize beyond that. It seems like they might not want to go like drag the sliders for the time interval that the user requested, but rather just like make a sort of declarative statement to the system that like, I want to narrow the search in this way more kind of function calling like. I realize I'm kind of a couple of layers of maybe speculation deep here beyond what you've explicitly confirmed, but how do you think about the hybrid between the fully general UI path to, you know, taking the next step and in some cases what would seem to me more reliable of making like a function call?

Div Garg: (1:06:49) Yeah. I think we've been thinking about that a lot. I think we have some things we're working on. I don't think that's interesting, I think it just takes a lot of time. I would say good analogy here is like think about self driving cars. Right? So it's you can imagine, like, if it was to build, like, a self driving car from, like, scratch today. My I can just say, like, let's why not just go and construct special roads on every highway? And this is special, like, there's a special lane, and that's the lane is for, like, the self driving car. And the self driving car just uses the lane, and the lane has sensors and everything ready, and the self driving car just has to follow the lane. And that's it. Like, you basically don't have to do any R&D. Will basically just build the lane, and you kinda put it to go. It's a very simple problem in terms of engineering research. The thing is, like, when you wanna do something like that, it's just, like, very complex. That's if you change a lot of behavior, it's a very big infrastructure project, and we wanna start off this kind of, like, new thing, and that will cause, like hey. If you about this, it take, like, 10 years and, like, a lot of, like, politics and, like, convincing people and, like, getting there, And only that can you do that. And then, like, a lot of self driving car startups decided, like, okay. Like, that's probably not gonna happen anytime soon. Let's just build a car that can, like, let's build autonomous car that can drive, like, a human car on any lane, and let's go solve this problem using the car R&D and, like, how we made that phone. And because that's how humans, like, can navigate the world. I think we have a similar analogy here because I think, like, over time, you might have this kind of, like, a special internet infrastructure that allows agents to come and get to websites and then, like, do more reliable function calling. But, again, that becomes, like, the, what's the incentives so people, like, move over that. I think that it just takes a lot of time. And, like, I think I do think that this is, like, a multiyear thing if even if you were to, like, someone wants to start this. And I think we've actually been looking at that. So I do think that could be, like, what the future looks like. But in the meantime, until, like, everyone agrees on, like, here's this new protocol and, like or here's how we'll do this communication and, like, convince everyone to adopt that protocol, I think, like, basically, being able to, like, use these websites similar to how humans use them. I think that's the right way to start from because, like, I think that's also the bet with AI. The bet with AI is, like, whatever a human can do, like, with AI, at the end of the day machine learning can be able to do that. And so that's how we're trying to do. We're like, humans are pretty good at, like, navigating websites with UI. And so, like, theoretically, like, AI can also become very, very good, even better. And so if we can have this kind of AI, which is, like, as good as human or better, I think that's a good way to solve the problem immediately, and this is just, like, how fast can we get there? Once you get there and then you're like, okay. Like, now maybe it's time for something, like, different right. Just want to paradigm change and flip the paradigm. And then I think, like, that will help. But I think that becomes, like, a timing problem where, like it's just, like, a massive change in behavior, and I think that takes a lot of time.

Nathan Labenz: (1:09:24) A lot of these sites though do have like a I'm just pushing a little bit in the middle space. I mean, I hear you on like the road you know, the dedicated lane for the self drivers. I've been calling for that since I was in school. And maybe that'll be one of the pleasant surprises that we get from a new Trump administration, although I'm not holding my breath. But there's been at least a little talk. I totally get it when you map that onto an enterprise. It's like, hey guys, who wants to implement this new agent protocol? You know, no hands go up. Or maybe one does, but you know, the boss doesn't like it, whatever. I get that complexity. But a lot of these sites do have like an internal API. Right? I mean, isn't there something in the middle in a lot of cases that does exist? Like what the UI talks to in many cases is an API, right? So is there a does it make sense in some cases to just be like, okay, kayak, you've got a real bear of a UI, but this data structure that manipulates is easier. And this is like an actual live question for myself too. Mean, I'm I actually have an episode coming up that I'm still putting some final touches on for people that want to do what I call building software that uses itself, kind of trying to coach people that are building apps to create like the, you know, the magic AI functions that their users have always wanted, even if they didn't know to ask for. And that's kind of my paradigm. So I wonder how you react to, you know, I sort of see this equivalence between UI and AI, not everywhere, but in like a lot of places and encourage developers to take advantage of that. To say like, okay, instead of asking your user to use all this UI, have one button, ask them what they want and then have the AI translate that to the structured data that sits behind that UI. Now that doesn't scale super well, right? Because that wouldn't be something you could do for every website. But if you did want to do a kayak or an OpenTable partnership, seems like it would be viable. And certainly if I'm like talking to developers that are working on their own app, you know, it should be viable for them. So I don't know. Any other reactions to that that sort of attempt to find the Goldilocks zone?

Div Garg: (1:11:33) No. I do agree. Like, if this is again something we've thought about yeah. I've been thinking about it by actually proposing some sort of like open standard around this where you can have, like, some sort of, like, a standardized way to communicate between, like, agents and websites. And I didn't think that can make a lot of sense where, like, all of these websites have backend APIs that would, like, directly interface with. I think the biggest issue becomes, like, security and, like, okay. Like, who calls these APIs? Is it gonna call these APIs? Identification, where, like, how do you identify who's calling these APIs? And there becomes, like, patterns of use. Because, like, think, like, if you have website like Airbnb or DoorDash, I think they have all the patterns of use where, like, they know, like, this is basically what the user will, like, use a website and it's super optimized a website for that particular pattern of use. And what happens is, like, if you transform yourself into, like, an API first business, I think that just changes a lot of paradigm because now you have to kind of fight against the product teams and maybe, like, a lot of your front end teams. And I think, like, the back end engineers also don't want this. Like, you're exposing a lot of, like, security loops. You have to think about SOC 2 and a bunch of things. So it becomes very hard for enterprise to navigate that landscape. But I don't think I get to think about engineering wise, yes, it's possible. But I think it's just like a very different business patterns, and I think very different security and, like, identity patterns, which just complicates the landscape a lot.

Nathan Labenz: (1:12:47) Yeah. Gotcha. How about I put a pin earlier in just like cost and latency. This is very common question for kind of anyone, building in this space. What have you learned about what really matters? And, you said you care about speed, certainly in the context of like your own development iteration. How much do you think this matters for the end user, you know, for your customers if they're partner businesses and the end users? What's the sort of optimization problem look like cost, reliability, and speed?

Div Garg: (1:13:24) And it's like for our agents or like I'm just curious exactly what you mean by that.

Nathan Labenz: (1:13:30) Yeah. I mean, you can contextualize it however you want, but I mean, in some sense, that's the, you know, the oldest question in software. Right? You're good, fast, cheap, pick two. Some would say in AI, the magic is you can maybe have all three, but you're still obviously making some trade offs. And, know, you said kind of 70B is like a good compromise. Is that because you, like, have users, like, sitting there waiting and you wanna return for them quickly? Is it yeah. So I'm just kinda curious as to how you're thinking about, you know, what you're willing to pay in terms of financial cost and latency for, you know, marginal improvements to reliability.

Div Garg: (1:14:07) Yeah. I think it comes down to the use case. If it's a high risk use case, like, you wanna do some sort of purchase. And you're like, I really want to confirm and, like, make sure that we buy the right thing. You don't actually make the wrong purchase or, like, you're doing travel or something, just make sure that everything's correct. So how much does that matter? And then because, like, it's a is it a soft boundary versus a hard boundary? So a hard boundary is something that if you do it, it's irreversible, and if something's irreversible, it's very costly to the user. And you want to be, like, very careful. Be like, maybe you don't really care about how much time it takes. It only takes you, like, maybe, like, 20 minutes or something potentially. Maybe it's it can be like more costly if you have to do more reasoning to more compute, but the job gets reliably done all the time. Then I think a lot of people will want to live with that world compared to just like, yeah, maybe it's very fast, but it's very low reliability for like higher like this kind of like hard decision boundaries. There's some things which are like more soft decision boundaries. Like chat is a good example where like if you are talking to a chatbot, it's a soft decision boundary. There's nothing irreversible about it. And there's like some patterns like that. Like suppose maybe like you're like doing some scheduling kind of things, maybe like you occasionally have some small things that you can, like, fix. But a lot of agentic behavior is, like, more hard decision making. And then so you want to be careful about, okay, like, before you fully actuate the action and, like, complete the task for the user, that the task is actually, like, correctly done and, like, before you do something irreversible. And detecting if something is reversible versus irreversible, that is also something that I think we've spent a good amount of time thinking about. Because, like, if something's reversible, we can just have, like, a weak, like, a small model and do it, and then it's fine. But if something's irreversible, then you're like, you need to build a lot of verification and make sure that everything's, like, as good as possible and there's no edge cases and then only done, take that irreversible action.

Nathan Labenz: (1:15:51) Yeah. That makes sense to me. Okay, one more big question and then I'll invite any other comments on things we didn't get to. And I should have probably said this through at the top, but full disclosure, I'm a very small investor in the company and I mostly don't really invest for returns so much as to support things and people that I think are cool and want to see come into existence. So I've been fascinated by what you've been building basically since the beginning. I do wonder how you think about kind of positioning yourself to avoid being a casualty of the bitter lesson here. We see, like, Claude, of course, now has computer use out, and it is still not super reliable when it comes to step by step advancing through tasks on the web. It does bring though some, like, really nice advantages in terms of being, like, an outstanding writer. You know, there are some, like, qualities of Claude that are really hard to get anywhere else. And then of course, OpenAI is always rumored to have something coming soon. And right now the rumors and hints, would say are suggesting that there's an agent framework coming soon too. So what is an agent company to do that hasn't raised, you know, $10,000,000,000 to compete effectively when, you know, they're clearly coming for you with incredible scale on some timeline?

Div Garg: (1:17:18) Like, again, I would say it's kind of like finding the niche that you are that you wanna focus on and then, like, going after that really well. Now I do think that's how, like, if you look at a lot of the successful AI products, Perplexity is a good example, and they just chose, like, search is the niche and they just focused on hallucination as the one problem they were trying to solve. They're like, yeah, like, these models have issues with hallucination. I just go fix hallucination. And that's basically, like, what they were focusing on fully more or less and then building a product experience around that. Cursor's a good example, where they're like, they're like, yeah, like, let's build this coding IDE and, like, that's where we're strong. And then they kind of go similar thing with agent. Like, there's computer use, and the end of the day, it's capability but not the product. And so we'll have this action as a capability, but then it's kind of like, like, what's the one thing that you can really clearly do to differentiate? And then how do you optimize the loop? How do you build a product experience? And the thing is, like, most frontier AI labs are actually not interested in this because, like, if you have hundreds of billions of dollars, you don't want to go after, like, a narrow domain. So you're like, I just wanna throw in compute and, sort of, like, build a general enough thing. I mean, that is great because that's kind of, like, a foundation block. But then I think, like, someone needs to take that general purpose thing and build out, like, the right product experience. I think that kind of becomes, like, that unlocks a lot of, like, market opportunities where, like, there will be a lot of, like, different businesses and, like, different use cases that are built that satisfy actual user need or, like, product need that is missing right now using this as a building block. And I do think that's something that will try to happen maybe, like, early 2025 because I think, like, this year, the technology was too early. I think, like, we are probably, like, one of the first companies that were providing, like, this kind of any kind of, like, agentic capability, and no one was able to build an application where, like, unless we are there in terms of the technology frontier. Like, it is hard to build a product. It's kind of like if you wanna build a cart, you have to first invent the wheel. But if you feel like, okay. Like, we don't know how to build a wheel properly or it's like it's just not there, then it's like you can't really, like, build up a product experience on that. And I think that's where we have been. Like, the technology has been, like, just very new, and then that makes it hard to bring a full experience, which is, like, obviously, reliable. But now I think that as the technology becomes better and better, I think, like, early 2025, we've seen a, like, a I would call it the explosion of applications, which I don't think has happened so far. If you think about agentic app, I think it's still the rarity.

Nathan Labenz: (1:19:29) So does that suggest that you would be open to starting to use these? Like, is the future of MultiOn, like, potentially powered by Claude computer use? You know, with that being sort of, you know, the core capability and you being the product around it?

Div Garg: (1:19:47) Yeah. Again, I won't say too much here. I think there's a lot things that are possible. We have a close relationship with the Anthropic team. I know a lot of people there, research team and the marketing teams. Yes. So I think I would say, I know the deviate of our company, and then it's kind of like what gets you across. And I think we are also very innovative. Like, so we have AgentQ, and those are the kind of things that we are able to come up with. So I do think with how this had this, like, kind of, like, an R&D advantage. Like, we are able to come up with a lot of new things. And then, like, the question for us becomes as the ecosystem matures. How can we combine a lot of our expertise to have a, like, a strategic advantage and then use that as a way to, like, sort of, like, solve a lot of specific problems? And I like to call this kind of, like, product focused research, which is, like, you have a, like, a problem in and then you're doing the research to solve this problem. Because what happens, like, most of research you do, even like foundation AI labs are in vacuum, it's kind of like, hey. Here's the research problem. This will solve it. Research is not directed towards anything. But if you're like, here's this one problem that I really care about. It's just like there's a missing block. We just don't know how to fill the block. Like, let's go and do the required R&D to build that block so we can, like, fit it in and, like, sort of, like, complete the piece of the puzzle so we can reach to where we want to go. And I think we that's how we think about it. Like, we have a lot of this capability where we're like, a lot of things are missing. If anything, well, right now in the models and even the computer use, there's lot of things that are missing. And that's like, how do you solve the problems?

Nathan Labenz: (1:21:09) Cool. Well, never a dull moment. It's a fast evolving landscape. I think that's all the questions that I had for you. Is there anything else that you wanted to touch on before we break? Great time to pitch if you're hiring for any particular roles or, you know, looking for any kinds of customers or just anything else that any other wisdom you want to impart?

Div Garg: (1:21:28) Sure. No. We definitely are open for hiring. I think we always think for, like, great people who can, like, raise the bar here. So we're always open for researchers and engineers and product folks who want to, like, sort of work in this kind of space. And I think I'm just very excited about, like, how to select applications, especially, like, a lot of the things we're working internally. I think, like, it gives us to to one of the prototypes. That's still, like, a very early prototype, but I think I'm very excited about, like, what will the right user interaction and the right ways to, like, use this products look like. Because I think that's just the next wave. Right? Where, like, it's like, now you have the agentic product where, like, okay. Things are happening automatically, and you don't have to watch or do things all the time. But maybe sometimes you wanna take control. Sometimes you wanna, like, maybe do things manually. Sometimes you want maybe, like, you want to, like, learn and improve. And I think just just takes a really unique paradigm. And I think, like, building up really solid product there, I think that's a hard challenge. So I think that's something that we are very, very spending a lot of time on in a sense. Okay. Like, how do you actually solve this problem? And, like, how do you make this into the best interaction and, like, the best experience? And, like how do you fit everything together.

Nathan Labenz: (1:22:32) Maybe one more question. Where are we a year from now? That could be a MultiOn specific answer. It could be an agents generally answer. You kind of tease that a little bit by saying, you know, '25 is the year, but it's what it's the year for what? You know, can you paint a picture of my agent assisted life a year from now?

Div Garg: (1:22:54) Yeah. I would say we might start to see, like, something like, say, like, Jarvis from Iron Man kind of things where, like, now you have this assistant, this agent. Okay. Like, look up my files or find me this documents or maybe, like, book this, like, whatever, like, a flight for me or, like, do this sort of a boring job, dentist appointment. I don't think, like, people kind of trying this. Even, like, last year, people were very bullish on, like, trying this kind of things, but just, again, like, the technology was not there. And now if you fast forward to 2025, I think a lot of these things might start working, and so you might start seeing that actual assistants that are, like, useful for you are helping you in a lot of your day to day. So I'm starting to get mainstream products based on agents and a lot of, like, kind of, like, explosion of, like, vertical applications, which has not happened yet.

Nathan Labenz: (1:23:39) Div Garg, founder and CEO of MultiOn, thank you for being part of the Cognitive Revolution.

Div Garg: (1:23:45) Thanks.

Nathan Labenz: (1:23:46) It is both energizing and enlightening to hear why people listen and learn what they value about the show. So please don't hesitate to reach out via email at tcr@turpentine.co, or you can DM me on the social media platform of your choice.

Situational Awareness in Government, with UK AISI Chief Scientist Geoffrey Irving

Situational Awareness in Government, with UK AISI Chief Scientist Geoffrey Irving

Universal Medical Intelligence: OpenAI's Plan to Elevate Human Health, with Karan Singhal

The Evolution of AI Agents: Lessons from 2024, with MultiOn CEO Div Garg

Watch Episode Here

Read Episode Description

Full Transcript

Transcript

Read next

Situational Awareness in Government, with UK AISI Chief Scientist Geoffrey Irving

Situational Awareness in Government, with UK AISI Chief Scientist Geoffrey Irving

Universal Medical Intelligence: OpenAI's Plan to Elevate Human Health, with Karan Singhal

The Evolution of AI Agents: Lessons from 2024, with MultiOn CEO Div Garg

Watch Episode Here

Read Episode Description

Full Transcript

Transcript

Read next

Situational Awareness in Government, with UK AISI Chief Scientist Geoffrey Irving

Universal Medical Intelligence: OpenAI's Plan to Elevate Human Health, with Karan Singhal

Intelligence with Everyone: RL @ MiniMax, with Olive Song, from AIE NYC & Inference by Turing Post

Mathematical Superintelligence: Harmonic's Vlad & Tudor on IMO Gold & Theories of Everything