The AI Copilot Revolution with Div Garg of MULTI·ON

Nathan Labenz talks with Div Garg about MULTI·ON, the first personal AI agent, discussing product strategy, user safety, and the future of AI-driven tasks.



Video Description

Nathan Labenz interviews Div Garg, founder of MULTI·ON, the world's first personal AI agent and life copilot. Div talks about the product strategy and roadmap for the MULTI·ON browser, their natural language approach to skills, and the steps they are taking to ensure user safety. Div explains how the platform uses a critic model to detect the success or failure of tasks, and how it can be used to book flights, order food, and more. Div also talks about the future of memory systems, such as the user profile feature, and how it can be used to improve the user experience.

The Cognitive Revolution is part of the Turpentine podcast network.

Have your AI questions answered on an episode by emailing TCR@turpentine.co or leaving them in the comments.

Send your friends and family our 90 min AI Scouting Report with visual aids: The AI Scouting Report Part 1: The Fundamentals is on YouTube at https://youtu.be/0hvtiVQ_LqQ

TIMESTAMPS:
(00:00) Episode preview
(06:42) AI agents applied to everyday life
(12:03) AI-driven automation with browser extension
(15:02) Sponsor: Omneky
(18:17) AI-driven automation of web tasks
(23:57) Task automation and planning
(29:43) Automate task completion with user validation
(34:45) Lifelong learning agent with high-level skills
(40:19) Guiding users to create skills safely
(46:53) AI assistant to simplify lives
(53:14) Unlock parallelism with AI agents

LINKS:
MULTI·ON: https://www.multion.ai/
Div Garg: https://divyanshgarg.com/

TWITTER:
@labenz (Nathan)
@DivGarg9 (Div)
@eriktorenberg (Erik)
@cogrev_podcast

SPONSOR:
Thank you Omneky (www.omneky.com) for sponsoring The Cognitive Revolution. Omneky is an omnichannel creative generation platform that lets you launch hundreds of thousands of ad iterations that actually work, customized across all platforms, with a click of a button. Omneky combines generative AI and real-time advertising data. Mention "Cog Rev" for 10% off.


Full Transcript


Div Garg: 0:00 We want to unlock parallelism for humanity where currently you can imagine we are all single threaded. We can do only one task at a time. But if we could have AI, we could launch maybe 100 AI agents at a time, all the AI agents will do the work, and then they'll be like, oh yeah, this work is done. And then you can coordinate that. At the core, I believe you need to have some sort of tenets if you want to build really intelligent AI agents. And one of them should be that an agent should not be able to modify its own source code, because if it's able to modify its own source code, it could self-evolve, it could do really weird things. I don't think GPT-4 is there yet when it comes to a lot of really complex reasoning. And so you might need some breakthroughs in the foundation model layer. You have 20 foundation models dropping and 100 papers published. So definitely the progress is crazy.

Nathan Labenz: 0:46 Hello, and welcome to the Cognitive Revolution, where we interview visionary researchers, entrepreneurs, and builders working on the frontier of artificial intelligence. Each week, we'll explore their revolutionary ideas, and together, we'll build a picture of how AI technology will transform work, life, and society in the coming years. I'm Nathan Labenz, joined by my cohost, Erik Torenberg.

Nathan Labenz: 1:09 Hello, and welcome back to the Cognitive Revolution. My guest today is Div Garg, founder of MultiOn, the world's first personal AI agent and life copilot. Until recently, Div was a PhD student and adjunct faculty at Stanford, where he created and taught a class called Transformers United, which explored how the revolutionary transformer architecture is already beginning to unite such previously disparate fields as natural language processing, computer vision, biology, and much more. Now because he believes that AI agents will become one of the most important parts of the AI technology wave, Div has taken leave of the PhD program, raised venture capital, and is currently building the MultiOn team so that people no longer have to browse the web alone. This is our third show about AI agents, and I'm really struck by how different the product and rollout strategies seem to be. Flo Crivello, the CEO of Lindy, often posts images of intricate workflow automations that Lindy has created for him based on detailed prompts. Matt Welsh, the CEO of Fixie, has launched a public facing agent platform as a playground for developers, but is really targeting enterprise customers. And Div with MultiOn has wrapped an AI agent up in a Chrome extension so that users can easily delegate tasks to it, can watch what it's doing and supervise along the way, and can even assist in real time as necessary. The AI agent space is fascinating, and in my view, it really does merit a diversity of approaches. Each strategy, of course, has its own strengths and weaknesses, and time will tell which one ultimately wins out. But after testing a bit over the last couple of weeks, I can say that MultiOn is really fun to use. Watching it navigate the web, explain its reasoning, and attempt to move forward toward whatever goal you gave it really does feel like a glimpse of the AI powered future. In this conversation, we talk about MultiOn's product strategy and road map, their natural language approach to skills for MultiOn, Div's vision for how AI agents will impact human work, the steps that they're taking to ensure user safety, the standards of reliability that Div believes MultiOn will need to achieve to see mass adoption, and lots more along the way. I really loved Div's tremendous energy and positivity, and I found this conversation extremely interesting. If you agree, as always, we'd appreciate it if you'd give us a review on Apple or Spotify, leave a comment on YouTube, or simply share the show with friends. If you have any feedback or suggestions, you can reach us by email at TCR@turpentine.co or DM me on Twitter, where I am @labenz. Now, I hope you enjoy this conversation with Div Garg of MultiOn. Div Garg, welcome to the Cognitive Revolution.

Div Garg: 4:00 Yeah. Happy to be here.

Nathan Labenz: 4:01 I'm excited to talk to you. So you are the founder of The AGI Company, Inc., also known as MultiOn, which is online for limited users for now at multion.ai. And from what I understand, you have left the comfortable environment of a Stanford PhD in machine learning and thrown yourself into the incredibly turbulent waters of trying to build an AI agent startup in the midst of one of the craziest technology environments ever. So to get started, tell us a little bit about that.

Div Garg: 4:42 So definitely, I would say this seems to be the right time to go and start building AI applications and agents. And I think the best startups are born during turbulence, almost like you have a very big technology revolution happening. There's a lot of turbulence, a lot of players are getting upshifted. And I would say this is still early enough where we want to go and start building AI applications that can be useful in everyday life. So I will say I still think agents are still early, but this is the right time to start building them as a startup and start solving the problems there. And even if it takes you, say, 5 years or more, I think this is the right time to start building and tinkering with them and solving them as a problem.

Nathan Labenz: 5:25 Yeah. Definitely. If you wait until it's easy, then it's definitely too late to be in on the ground floor of the value creation. So that definitely makes sense. I want to get into a lot of detail about how it works and all that kind of stuff. But what's up with the incorporated name? How did you come up with The AGI Company, Inc., and where does the name MultiOn come from?

Div Garg: 5:47 Sure. Sure. So that's a good question. So for the AGI company, it was just really funny. So I was just checking online, is this a name that's available? And I was actually very surprised this is a name available. So I went and visited it. And so we are not actively using that name for now. We're just using MultiOn. Maybe we will start using it in the future. Maybe we can spin up a nonprofit or something.

Nathan Labenz: 6:14 The nonprofit controlled startup is all the rage these days. So what about MultiOn? Where does that come from as a name?

Div Garg: 6:21 Yes, yes, yes. So I would say this is like, I have a physics background. I did some international physics Olympiads back in the day. And the name is actually inspired by a hypothetical quantum particle that's present at every place at the same time. So it's like if you have something like neutrons, muons, fermions, like MultiOn.

Nathan Labenz: 6:43 Fascinating. Okay. Blind spot for me. I didn't even make the connection. For me, it meant just like something that was on in multiple ways, I guess. I don't know. I was kind of thinking of it as my AI assistant that's always on or something like that.

Div Garg: 6:57 Yeah. Can make people think of it like that too.

Nathan Labenz: 6:59 Tell me about your approach and how do you see the relationship right now between research and product building? The pace of research is so insane that it's like, as a product builder, it feels very hard to keep up. And yet, if you don't, that seems like, by definition, a huge problem. So it does seem like a good time for researchers to start companies. How are you thinking about that for you?

Div Garg: 7:28 So definitely. I think this is a great time because you can finally say AI is starting to get applied in everyday life. So if you look one year back, there was no GPT-4. There was no Stable Diffusion, a bunch of those things. And people were not actually using AI as a product. Maybe there were self-driving cars, but they didn't work. There were no large language models used in production. And I think this has shifted a lot, it has been an exponential shift. And I would say it's the right time where AI is starting to get applied as product, and the field is moving very fast. So it's definitely hard to consolidate the research with the product because all the iteration cycles are really fast. And even in a week, you have 20 foundation models dropping and 100 papers published. So definitely the progress is crazy. For us, I think the nice thing is we really like the open source community and we see that as a way where we can bootstrap on top of developments there as well as do collaborations, take the best approaches and pull them in to what we are doing as an agent tech company.

Nathan Labenz: 8:34 What's the status of the company right now? I noticed that you have put out the call for hiring. I haven't seen any fundraising announcements. So would you like to break any news here on the Cognitive Revolution?

Div Garg: 8:47 Sure. Sure. So I would say we are in the middle of a seed raise. We have got some really good term sheets. And I will say, you'll see an announcement in maybe a few weeks. So I can't disclose yet. That'll be fine.

Nathan Labenz: 9:00 Cool. We'll keep an eye out for that. But I doubt that you would have any trouble raising a few million bucks in a seed round at least. So watch this space, everybody. So what roles are you looking for? What is your, I don't know, I assume it's multiple things, and obviously, it's people who are obsessed with AI, but anything in particular that you are in need of right now?

Div Garg: 9:25 Sure. So I will say, in general, we are focused on hiring generalists at our current stage. And I think we're focusing on hiring people who are really passionate about AI and agents. And they can basically define the culture of the company as we grow and build it up. And I think we're looking for everyone from hardcore, full stack folks to backend engineers to AI specialists, even research engineers. And I think we want to solve agents as both an AI research problem as well as a product problem. Like what's the right UI, UX? How do we build this into a right product that everyone can use, make it reliable? And also really useful and interactive.

Nathan Labenz: 10:02 I'm having flashbacks to my conversation with Arvind from Perplexity, who has a pretty similar background in multiple ways, including having spent serious time in research. And also just the fact that they have set a very high bar for how fast they ship, and I see you guys also shipping very quickly, very short time between iterations. So that's a high compliment, actually, in my mind. Where do you think we are right now in this agent development? Do you think that the product is already useful or still just kind of a hobbyist thing? What is the path to actual agent based utility?

Div Garg: 10:46 Definitely. So I think that's a good question. So I would say for most agents, you will see like AutoGPT or BabyAGI, even a lot of coding agents. They're very early. You can use them for very simple applications. For MultiOn, I would say we are a bit further along in terms of actual applicability currently, where you can ask it to, say, online shop for you, arrange dinner, book flights, so on. I think the biggest things we're trying to solve are how do we make it really reliable? And how do we reduce the variance of what the agent does? And I think one issue we have seen is even if you ask the agent to do something 10 times, it might just have slightly unpredictable behavior. And so it's very hard to make it deterministic for the end user. Like what's going to happen? And I think if we can solve that, then we can really give it to everyone and make it really useful. And I think a lot of the problems are making it really reliable and making it very predictable. And those are the problems we are trying to focus on currently too.

Nathan Labenz: 11:43 So let's get into then a little bit how it works. The first thing that seems like an actually huge decision point when you're creating an AI agent product, and I've sampled and demoed at least a number of them at this point, is kind of at what level does this thing operate in the first place? We've seen a couple different paradigms designed there where sometimes it's like, it works all in the cloud, and you define some little agent to do some little thing in some platform. That's a probably not super flattering description of some of the things I've tried, but it's executing over there somewhere far from you. Then there's the OS level copilot concept. And then there's what you've chosen, which is the browser extension ride along, kind of copilot mode. How did you choose that? What do you see as the big pros and cons? I have some thoughts, but I want to hear yours.

Div Garg: 12:48 Sure, definitely. So I think the biggest choice we made was to make it really seamless in terms of user experience and also optimize on distribution. And so currently, if you go to the Chrome Web Store, it's a one-click install. Once you install, that's it. That's all you need. And if you want to sign up and make an account for us, you can start using MultiOn. And in your current browser, it has access to your current logins. You can use it on any website you have authorized it to. So that makes it seamless and very easy to get started. And that's one of the reasons we chose to make the choice. I would say another advantage is if something goes wrong or the agent is confused, it can simply ask the user for help. So it can ask the user, can you log in to this website? Can you solve this captcha for me, for example? And that gives you a lot of very immediate interaction to the user where the agent can correct it. Sorry, the user can correct the agent. And we can use that as a feedback loop, which we can fine-tune the agent on to improve it for future interactions.

Nathan Labenz: 13:50 I think, to now share my perspective, I think this is a pretty smart approach. I could see an OS level approach having some similar advantages. But the fact that I'm logged into everything in my browser and that you get that for free by adding on a browser extension that acts in the same space as I do as a human user, that, even if it's not the final form of this long term, definitely feels like a really good place to get started. And in my testing of it, I've experienced that where it's like, oh my god. This morning, for example, I was on the phone with a teammate. I wanted to add him to a GitHub repository. So I just tried it with, I'm gonna start saying it how you say it, MultiOn. And so I pull up the new tab and I just say, I asked him, what's your GitHub username? And I said, okay, go to this repo and add this user. And right off the bat, it's like, with other paradigms, I'm immediately stuck with authentication. Hey, we'll continue our interview in a moment after our words from our sponsors. And anybody who has had to go authorize some temporary special purpose key out of GitHub, I'm immediately like, oh god, I have to go do that? And I have to do that for everything I want to connect? Oh, yeah, that really bogs down. I'm always saying access is always the hardest part. So that's really tough. But with this, it just allows the AI to ride in the same space as me. I'm now watching it. It navigates to my thing. It was able to find its way to the team page. It was able to search, find, and pull up the add-a-collaborator option. It was able to search for the guy's username and find him. And then I did experience also what you were saying around my ability to intervene and help it if I need to, which obviously is not the long term product vision. But there was a step where it was getting confused where it had found the guy in the search results, but it needed to click the result before it could actually click invite. And it wasn't quite making the click that would make the invite button accessible. And so it was stuck. But then I just clicked the thing, and I think this gets to my next question. The model seems to be sort of a simple loop, right, where I injected my little action and changed the state space. And then as it took the next read of the space, it was like, oh, now I'm able to proceed because something just got easier for me or whatever. Something happened good, and now my barrier has been removed. It just continued. So I think that paradigm is super, super interesting. One question I do have, does it have the ability to interact with other extensions? Like, I have LastPass. How does that relate? I haven't quite figured that part out yet.

Div Garg: 16:56 Oh, yeah. I'll definitely say, it's also funny. For that use case, I've actually used it for similar use cases, like adding people to GitHub. I've also used it for sending DocuSign stuff. It's actually pretty good. I can say, send this contract to someone, and it can actually go and do that whole thing. And also the ability to interject, I think that's great. And then we can use that as fine-tuning samples to improve the ability down the road. So we're actually building this recorder mode where you can sort of teach it, give it a demo, oh, maybe this is how you should do it. And if you just show some query maybe one or two times, then it will know exactly what's the best way to do it. Expanding on this, one thing I've noticed is if an agent is doing something for the first time, it might take a longer roundabout path. But if you can give it some sort of feedback or demonstration, it can learn the shortest path. And I think that helps a lot. And that's sort of the approach we are trying to take to improve our performance over time. And so the second one, yeah, the ability to interact with other extensions is a plus we are trying to explore. So for example, for passwords, OnePassword has an API, we are building an integration with the OnePassword API and also other extensions. And so this is a work in progress where we can interact with a lot of existing browser solutions to simplify the flows. Currently, MultiOn can also use Honey, for example. So if you're on Amazon and you want to check out, and if you see there's a Honey coupon you can apply, then the agent can go and actually just apply that coupon for you and get you a discount.

Nathan Labenz: 18:24 Cool. Interesting. So the core architecture, if I'm guessing, and you can tell me where I'm right or wrong and share as much additional detail as you want to. But in keeping with the paradigm of working from a sort of research foundation into product, it seems like the initial core of it is essentially a ReAct paradigm where you're looping through and each cycle, you are taking a snapshot of, first of all, what goal have I been given? What is the current state of affairs that I'm looking at? There's this kind of interesting, and we can share some videos to accompany this, kind of little chat that isn't so much, I mean, it's kind of a chat. You can chat with it if you need to along the way, but a lot of times, it's just telling you what it's thinking and what it's going to do, and then there's the actual action. So that's the baseline ReAct research paper. It seems like then now you're starting to, I'm interested to hear if there's any major deviations from that. But I know the next big thing too is then layering on the skills component to that.

Div Garg: 19:39 Definitely. Yeah. So I think on a high level, that's very correct. That's what we're doing. I would say we have sort of three parts of the system. So one is we have representation models where we take the DOM and make it into a compact text embedding, which we can give to a planner model, which is like a GPT-4 or some other similar LLM that can break down the task and find what are the right actions to take for this particular task. And then we can do something similar to a ReAct paradigm where we have an inner thought and the agent can decide based on the objective, what's the right thing to do, and then select the right actions. We also have our own intermediate action grammar. And that action grammar helps a lot where instead of directly generating JavaScript code, we generate commands in our own action language. And this has the ability where we can type check them, we can verify everything's safe, there's no malicious activity, there's no prompt injection, for instance. And after we have verified that, we can compile that from action grammar to actual DOM level events for the website. And we have found that this helps generalization a lot compared to just code. And for the agent, I will say we use some very interesting tricks. So if you have something like AutoGPT or BabyAGI, for example, the biggest issue they have is that they have this thing called task divergence, where if you give them an original objective, they will diverge away from it and don't know how to error correct back, how to come back to the original thing once they have made a mistake. And for us, I think the way we have formulated the problem is we tell the agent to always try to come as close to the original objective as possible. And we basically give this in a prompt, never give up. Always try to go as close as possible. So the never give up is something that helps a lot.
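To make the intermediate action grammar idea concrete, here is a minimal sketch in Python of a small typed command set that gets validated before anything is compiled to DOM events. The verbs, command format, and names are illustrative assumptions, not MultiOn's actual internals.

```python
# Sketch of a type-checked intermediate action grammar (illustrative only).
from dataclasses import dataclass
from typing import Optional

@dataclass
class Action:
    verb: str                       # e.g. "CLICK", "TYPE", "NAVIGATE"
    target_id: Optional[int] = None # semantic element ID rather than raw coordinates
    text: Optional[str] = None

ALLOWED_VERBS = {"CLICK", "TYPE", "NAVIGATE", "SCROLL", "ASK_USER"}

def parse_action(line: str) -> Action:
    """Parse one planner-emitted command like 'CLICK 12' or 'TYPE 7 "window seat"'."""
    parts = line.strip().split(maxsplit=2)
    verb = parts[0].upper()
    if verb not in ALLOWED_VERBS:
        # Type check: anything off-grammar is rejected before it ever touches the page.
        raise ValueError(f"Unknown verb: {verb}")
    target_id, text = None, None
    for part in parts[1:]:
        if part.isdigit() and target_id is None:
            target_id = int(part)
        else:
            text = part.strip('"')
    return Action(verb, target_id, text)

# Only after parsing and validation would an action be compiled to real DOM events.
print(parse_action('TYPE 7 "window seat"'))
```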

Nathan Labenz: 21:34 Yeah. It's the psychology component of the positive self talk or the need for grit. Grit is me. Language model grit is what you're trying to coax out of it. So it sounds like there are a couple, expanding on this core ReAct concept. There are specialist modules that, it wasn't clear to me if these are language model powered or just explicit code, because you could imagine doing either. Right? If you have a DOM, lord knows, DOMs are terrible, ginormous messes these days. Long gone are the days of highly semantic HTML, unfortunately. No such luck for us these days. So you could imagine just writing classic code to try to compress that. You could also imagine having a specialist, either a specialized GPT-3.5 Turbo or Claude Instant prompt, or even a fully fine-tuned specialist model, to help understand the HTML. What do you find there as working, or how do you see that evolving?

Div Garg: 22:49 Yeah. So I think it's very interesting. So for the representation side, I would say we have three modules. So a representation module, a planner module, and the actual action module. And initially, everything except the planner was maybe just low level, more like heuristic code. And now we're transitioning everything to a model. So on the representation side, we can take a combination of the screenshot plus the DOM, and we can have embeddings from the DOM, and we can have embeddings from the image. And then we can feed it to a representation model that can combine them together. And this is similar to maybe what something like InstructBLIP does: get a combined representation, which in this case we have made the choice of making text based, so it's easy to analyze. It's easy to see what's happening. And then we can give that as a prompt to the planner model: this is the state representation and this is the user query, what should we do next? And once the planner decides, okay, this is the sub-task we need to accomplish, we can feed that to the action module.
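A rough sketch of the kind of compact, text-based state representation that could be handed to a planner model, as described above. The element summary format and prompt wording are assumptions for illustration only.

```python
# Sketch of building a text state representation for the planner prompt (illustrative).

def build_planner_prompt(elements: list, objective: str) -> str:
    # Each element gets a stable numeric ID so the planner can refer to it semantically.
    lines = [f"[{i}] <{e['tag']}> {e['text'][:60]}" for i, e in enumerate(elements)]
    state = "\n".join(lines)
    return (
        "Current page elements:\n"
        f"{state}\n\n"
        f"User objective: {objective}\n"
        "What is the next action? Answer with one command from the action grammar."
    )

elements = [
    {"tag": "input", "text": "Search flights"},
    {"tag": "button", "text": "Add collaborator"},
]
print(build_planner_prompt(elements, "Add my teammate to this GitHub repo"))
```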

Nathan Labenz: 23:51 Yeah. Okay. Very interesting. So you've got a joint vision text representation, and we've had the BLIP authors actually on the show in an early episode, but it was before InstructBLIP. But I'm familiar with that as well. That ultimately cashes out, though, to a text output. Right? At the end of the representation, it's a semantic representation where you would be clicking on, I mean, I guess then when the action module's gonna act, it's acting on a semantic action space. So it's like saying click this button, not click at some specific pixel coordinates.

Div Garg: 24:34 Yep. So it's still a semantic representation. And we can give IDs to each element and then predict which ID to choose. And that helps a lot. We can also, we haven't seen the need for exact coordinates, but that could also be a choice if you want to move the cursor to an exact coordinate. You could also potentially predict the coordinate to move it to and then make a choice there.
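A small sketch of how a predicted element ID might be resolved back into a concrete DOM-level event. The ID-to-selector map and the emitted JavaScript are illustrative assumptions, not the actual mechanism.

```python
# Sketch: resolving a semantic 'CLICK <id>' action into a DOM event (illustrative).

id_to_selector = {
    7: "input[name='q']",
    12: "button[data-testid='invite-collaborator']",
}

def compile_click(element_id: int) -> str:
    """Turn a semantic element ID into the JavaScript a browser extension could run."""
    selector = id_to_selector[element_id]
    return f"document.querySelector({selector!r}).click();"

print(compile_click(12))
```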

Nathan Labenz: 25:00 Yeah. It's all on the table. Once the modal barriers are broken down, everything starts to become possible. GPT-4, you said, or similar for the central planning model. Does that in practice mean GPT-4 today, or is there any, in my experience, for these kinds of executive planning long range planning type tasks, GPT-4 is in a league of its own. Is that what you found also?

Div Garg: 25:31 Yeah, I was thinking when it comes to long term reasoning and planning capabilities, we have found GPT-4 to be the best. Also experimented with open source models, Claude, and 100Ks. And what we are in the process is we have collected a lot of interaction data. So we already have 50,000 web interaction samples. And we are starting to fine-tune our own model. So we can fine-tune a Falcon 40B model and we're starting to do that. Pretty soon we could potentially transition to all in-house models. And the nice thing here is the approach we're taking is where we are launching the product, getting data, improving the product over time using the data. And then collecting more data and improving it. And so we're seeing this is something that can give you a feedback loop where you can collect the data that matters the most to the user and improve the model there and continuously build this loop to continuously get improvements. Almost no one is currently doing that. So this gives us an interesting advantage in this space.
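A hedged sketch of how logged web interactions could be packaged into fine-tuning records for the data flywheel Div describes. The field names and schema here are assumptions, not the actual dataset format.

```python
# Sketch of turning one logged interaction into a fine-tuning record (illustrative).
import json

def to_finetune_record(state_text: str, objective: str, action_taken: str) -> str:
    record = {
        "prompt": f"{state_text}\n\nObjective: {objective}\nNext action:",
        "completion": f" {action_taken}",
    }
    return json.dumps(record)

print(to_finetune_record("[12] <button> Add collaborator",
                         "Add my teammate to this repo",
                         "CLICK 12"))
```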

Nathan Labenz: 26:30 Well, it's a little tricky to necessarily define. Right? Because one, in my red teaming with GPT-4 back in the fall of last year, one of the things I really tried to explore, and certainly not as deeply as you have, but is this kind of self-delegation concept. Can GPT-4 break problems down? I used a more recursive structure. I guess, inspired by some doomer thought of recursive self-improvement, I went toward recursive delegation, which I actually don't think now is really probably the right paradigm to build a product around, but I was just doing my own messing around. I found that one really hard thing, though, was how do I actually know if I've succeeded for a given task? And I was even just doing some pretty basic tasks, like go online, find some information. And then I tried to create my sort of validation layer on top, but I found that for many, many things, it was pretty hard to determine if I'd actually succeeded or not. So are you detecting success or failure automatically, or are you, to some degree, relying on the user to tell you when it worked or didn't work?

Div Garg: 27:41 I would say in our case, it's automatic. So the model is able to predict whether it has fulfilled the task or not. And we also are experimenting with a separate critic module where the critic could say, is this task complete for this particular objective, given the current set of actions? And I think even for Voyager, they took a similar approach and I've seen this to be pretty reliable in practice.
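A minimal sketch of a separate critic check along the lines Div describes: a second prompt that judges whether the objective was completed. The prompt wording and the call_llm stub are placeholders for illustration, not the real implementation.

```python
# Sketch of a critic module that judges task completion (illustrative only).

def build_critic_prompt(objective: str, actions: list, final_state: str) -> str:
    history = "\n".join(f"- {a}" for a in actions)
    return (
        f"Objective: {objective}\n"
        f"Actions taken:\n{history}\n"
        f"Final page state: {final_state}\n"
        "Question: has the objective been fully completed? Answer YES or NO, then explain."
    )

def call_llm(prompt: str) -> str:
    # Placeholder: a real system would call whatever model serves as the critic here.
    return "NO - a confirmation page was never reached."

verdict = call_llm(build_critic_prompt(
    "Book a flight to Venice on the chosen date",
    ["NAVIGATE kayak.com", "TYPE 3 'Venice'", "CLICK 9"],
    "Search results page, no booking confirmation visible",
))
print(verdict.startswith("YES"))
```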

Nathan Labenz: 28:10 So that seems like, yeah, some things will lend themselves better to that than others. One of your videos that I've seen was the world's first AI-booked flight. And I guess for one thing, you could kind of determine, like, is a flight booked or not? If you ever get to a success message that has a confirmation code, then it's pretty safe to say you've booked something. You could still have some questions around, like, did I actually book what the user wanted me to book? And there's varying degrees of rightness on that too. Like, did I take them to the right city? Is it the right flight? Did I get the right flight given the options? Did I book the right seat? Did I get their traveler number in there? You have this feature too of your profile, which I haven't explored in depth yet. I also am interested in your thoughts on kind of the future of memory. But right now, you have kind of a big text box. Put your profile here. It says, tell us who you are. And then presumably, the future of that is more elaborate memory systems on a per user basis as well. But so I tried something else. I'm throwing a lot at you. You can kind of unpack this at length. But I tried something the other day where I was like, go find me the URLs for my 3 most liked LinkedIn posts. And it got as far as my feed, and so it was doing okay. But then it never actually gave me back, like, here are 3 URLs. And I wasn't quite sure, like, did it already think it had succeeded by kind of getting there, or I don't know exactly what happened. I'm not sure how you would even determine if you were successful. Here are 3 URLs, but are they the most liked URLs? How do you think about that? It's tough to ground some of these tasks. Right?

Div Garg: 29:54 Yeah. I think that's a good point because suppose even if you complete a task, it's possible you completed the task wrong. So maybe you booked a wrong flight, for example, or ordered something wrong. And I think that's also a constant failure from our standpoint. It's harder to detect because the agent might just think it's doing the right thing and it has successfully completed the task. And in this case, I think we've found if you have a separate agent that is just a critic agent or a validation agent, that can help a lot. And at each point it can verify, oh, is this the right thing? Corresponding to the right thing, what the user's original query or objective was. And we can make a feedback loop there where the critic agent can criticize any decision and give it as a correction to the agent actor model. So this is something we have played with. Second thing we can also do is we can have a user acknowledgement flow, where it can put something in your cart or maybe go all the way to the checkout for a flight. But before pressing the buy button, it will ask the user, okay, I found this flight, can I send you a screenshot? Do you want me to choose this flight? Do you want a different one? And if the user says yes, then you can go and proceed to actually complete the whole thing. And this is also one way where we can make it safe and make sure you don't accidentally end up making a wrong purchase, for example.
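A small sketch of the user-acknowledgement gate idea: irreversible actions pause for explicit approval before executing. The verb list and helper names are assumptions for illustration.

```python
# Sketch of a confirm-before-buy gate for irreversible actions (illustrative).

IRREVERSIBLE_VERBS = {"PURCHASE", "SEND_PAYMENT", "SUBMIT_ORDER"}

def execute(action_verb: str, describe: str, confirm=input) -> bool:
    """Run an action, but pause for explicit user approval on irreversible ones."""
    if action_verb in IRREVERSIBLE_VERBS:
        answer = confirm(f"About to {describe}. Proceed? [y/N] ")
        if answer.strip().lower() != "y":
            print("Cancelled by user.")
            return False
    print(f"Executing: {describe}")
    return True

# Example with a canned response standing in for real user input:
execute("PURCHASE", "buy the 9:40am flight to Venice for $213", confirm=lambda _: "y")
```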

Nathan Labenz: 31:18 Yeah. That makes a lot of sense. How do you envision that maturing? Obviously, in the early kind of demo era of the product, I largely just watch it. Right? So half the fun is doing the riding along and seeing what it does and what it gets right and doesn't, and helping it get over little humps with my extra injected click. And you're not recording that now, but there is a mode that you're developing where you will start to record those manual interventions?

Div Garg: 31:54 Definitely. Yeah. So I think one is just recording interventions, and we can use those so the next time it will just learn from them. It will remember the interventions and what sort of corrections it got and improve its behavior automatically. And we can do in-context learning for this, and we can also do fine-tuning of the model, where once we collect a lot of these sorts of interventions from a lot of our users, we can fine-tune the model to improve the performance where it's currently not working.

Nathan Labenz: 32:26 Is the vision sort of to have a, I set something off on a task and then I kind of go do my own thing. And then maybe I get a notification 3 minutes later that's like, MultiOn is ready to have you review before purchase or whatever?

Div Garg: 32:47 Yes. So I think, yeah, I would say we have a lot of things planned down the pipeline, but in a high level, you can think suppose you want to make it, say, order a burger near you, for example. And so maybe the first time you order it, you might want to watch and see what it's doing. Maybe if it does something wrong, can intervene and give it a correction. And then you can maybe play around with it 2, 3 times, have some fun, carting it, interacting with it. And once you know it's reliable enough, then you can just leave it in the background. And when the agent is done, will just send you a notification. Okay. I'm done. It could also be a mobile notification. You can be anywhere. MultiOn will be able to run in the cloud. So you can be anywhere. You can say, oh, get me a burger. It will run on the cloud, do the task for you and send you a notification, which could be, okay, I found this thing, potentially along with the screenshot. And then you can acknowledge whether to finish the task or to actually get the burger in this case, or do you want it to cancel it or do something else?

Nathan Labenz: 33:46 How does that cloud mode relate to auth? Because I'm imagining a headless browser in the cloud or whatever where I'm not logged in to all my accounts.

Div Garg: 33:56 Yes. That's an interesting question. I will say we are working on a lot of things, but I feel like we have cracked this problem. And it's too early to say how exactly we're doing it, but I think we have a really nice and elegant way to solve the auth issue.

Nathan Labenz: 34:14 Cool. Well, that's a genuine breakthrough. If you could make that work smoothly, it is a huge selling point for so many of these experiences right now. So then tell me about the memory side. So you've got, as I understand it, for one thing, my kind of profile, which I can currently just manually manage and tell it, my name's Nathan and my email's this and I prefer a window seat, whatever. And then you also have kind of seemingly system level memory that is like the skills. And, again, this is a close follow, as I understand it, on research progress, where, was it even a month ago yet? I think this is an NVIDIA project, right, the Voyager project, which I refer to as LLMs as lifelong learners, where they're going out and doing stuff. In the Voyager case, it's Minecraft, figuring out how to do stuff successfully and then storing these successful subroutines. Your skills, though, as I've seen them, are way higher level, much more natural language, than the Voyager skills, which were very granular kind of Minecraft execution code. From what I've seen of your skills, it's a sort of note card length, highly abstracted away from all that detail, kind of summary of what it is you're trying to do and roughly how to do it. So help me understand that a little bit better. I guess my surprise there is I wouldn't expect that it would be the high level thing that the agent would need as much. When I talk to GPT-4 and I say, how do you go book a flight? It's kind of like, well, you go to Kayak and you search for stuff, and it seems like it kind of knows the general direction, but then it really gets bogged down in low level details. So I guess I'm a little surprised that the skills are so high level, or maybe I'm missing something. So enlighten me.

Div Garg: 36:08 Sure. Sure. Sure. I will start with the memory part first. So the memory, I think, the current feature we have is like a scratch pad. And so this is almost like exchanging notes with your agent, where you can say, these are my personal details. This is what I like. This is what I don't like. This is my Zoom link. You can give it instructions, okay, always add Zoom links to my calendar invites, for example, or don't do this thing, whatever. And it can remember that as embeddings and account for preferences. So this is more explicitly giving some preferences to the model. And we are trying to make this really secure, so you can store your credit cards there if you want to, you can store passwords there. And we guarantee everything is client side, it's private and secure. So that's built as a secure way to give information to the agent. The second layer we're building is, I would say, more interactive memory, where when you interact with the agent, it will actively learn preferences over time. So the first time you say, say, order a burger near me, it might ask you, okay, what's your address? And you tell it your address, it can remember that and use that as long-term memory in the future. So if you next say something like, okay, find me a haircut place nearby, it now knows your location and can customize. And so over time, the more you use it, the more it will learn about your preferences. And so this is a layer we are currently building, which acts as personalization for the end user. And I think that's very interesting and there's a lot of things you can do there. And coming to the second point around skills. Yeah, I think it's interesting. The choice we made around skills is to keep them high level. How we're thinking about skills is, you can think of what the agent currently does as going from natural language instructions to each single line of code or actions in a sense. So for each natural language instruction, it predicts, okay, I take this action first, then a second action, a third action, whatever. So it's basically each line of that action code. And what we want to do is instead of predicting each line of the actions, we want to predict high level action functions in a sense. And these can be almost like skills, where we say, maybe follow someone on Twitter, for example. So you could have some sort of function that's defined as follow-on-Twitter, which can take an input, which could be a user ID. And it already knows, okay, what are the actions that are needed for this function. And in that case, instead of predicting each line of action code, you basically choose the action functions and combine them together. So that's how we're sort of thinking about skills. And the current skills are, I would say, a combination of scripting and natural language. So they are very high level, and the choice we have made because of this is anyone can go and develop their own skills if they want to. So if you want to develop your own customized workflow, or you want to give it some sort of personality where you want to make it act like a recruiter on LinkedIn, or you want it to, say, act like a social influencer, for instance, you can make a skill around that, all in natural language, and feed it to the agent. And we're trying to develop a whole ecosystem around it where we can have a directory where people can go and add skills, share skills with each other, and the agent can pull in the best skill automatically for each website. And I think this can allow the agent to improve really fast and people to also contribute to the agent. And so, yeah. So I would say, in a sense, that's what we're trying to do. And we have seen high level instructions actually work really well. It's almost like if you are a teacher and you're guiding the student, if the student makes a mistake, you can tell them, oh, you made this mistake, or maybe, this is something you can do better. Maybe you should click here to open this drop down instead of typing on the drop down. And if you can give that instruction, that can help improve the behavior of the agent.
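To illustrate, here is a hedged sketch of what a natural-language skill record might look like and how it could be injected into the planner's prompt as high-level guidance. The schema and example guidance are assumptions, not MultiOn's actual skill format.

```python
# Sketch of a natural-language skill and how it might be fed to the planner (illustrative).

skill = {
    "name": "follow_on_twitter",
    "sites": ["twitter.com"],
    "guidance": (
        "Given a username, go to their profile page directly via the URL bar "
        "instead of using search, then click the Follow button. If already "
        "following, do nothing and report success."
    ),
    "inputs": ["username"],
}

def skill_to_prompt(skill: dict, **inputs) -> str:
    """Inject a high-level skill into the planner prompt as extra guidance."""
    return (
        f"Skill '{skill['name']}' (applies to {', '.join(skill['sites'])}):\n"
        f"{skill['guidance']}\n"
        f"Inputs: {inputs}"
    )

print(skill_to_prompt(skill, username="DivGarg9"))
```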

Nathan Labenz: 39:56 It's almost like the skills are additional conceptual guidance on things where it didn't immediately work. But you're trying to keep the guidance high level enough that users can write their own without having to get down to the level of your action language, which I guess is something you don't necessarily want to share anyway, if that's a key part of what makes the whole thing go. So that makes sense. So a skill is not stored as an actual procedure at all? That's all still generated dynamically at runtime each time?

Div Garg: 40:41 Currently, yes. But what we can also imagine, just to add to this, is we could, if we have the natural language description of a skill, then we can have a model generate the actual action code for that particular skill at the runtime. And then we can also cache it, so you don't have to actually generate the action code each time. You can just prefetch whatever's already generated and use it. But for a user, they just need to define the natural language interaction, and another layer can go and compile the actual code.
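A brief sketch of the compile-and-cache idea: the natural-language skill is compiled to action code once, then reused on later calls. The compile_skill stub stands in for a model call; names and outputs are assumptions.

```python
# Sketch of compiling a skill once and caching the result (illustrative only).
from functools import lru_cache

@lru_cache(maxsize=256)
def compile_skill(skill_guidance: str, site: str) -> tuple:
    # Stub: pretend a model turned the guidance into concrete action-grammar steps.
    print(f"(compiling skill for {site}...)")
    return ("NAVIGATE 'https://twitter.com/<username>'", "CLICK <follow_button_id>")

# First call compiles; repeated calls with the same guidance hit the cache.
compile_skill("Go to the profile and click Follow.", "twitter.com")
compile_skill("Go to the profile and click Follow.", "twitter.com")
```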

Nathan Labenz: 41:07 Presumably, though, you'd also have to have some input of current state for most of these skills. Right? To generate the actual action code, you're going to want to be responsive to runtime conditions too. How do you think about drift? There's just obviously, Twitter announces and launches updates on an unpredictable schedule these days. So just to take that one as a sample, I think these things are often kind of forgotten. And it's not to say it's forgotten by you, but you know, weekend hackathon type projects age really poorly when all of a sudden it's like, yeah, but somebody made a tweak to their website, and now your whole thing doesn't work anymore. How do you think about kind of keeping up with the ongoing evolution of things? Or, you've got a skill, but even at a semantic level, that skill could be out of date if they change the flow. So, yeah, how do you think about that challenge?

Div Garg: 42:03 Yeah. So I think that's the beauty of keeping everything as a natural language, high level instruction. So one is, if that high level natural language instruction is still correct, then we can dynamically recompile the skill to get the new action code. And even if something changes for the website, it's very easy for anyone, basically someone who doesn't even have a technical background, could just go and change the high level language into what is now correct. And then we can recompile that to the actual action code. So I think it just removes the technical barrier around actually understanding a low level or what are things that might be needed. And any user can go and sort of play around and build things on top.

Nathan Labenz: 42:48 How do you think about malicious skills? If I imagine this kind of rolling out to a lot of users, I'm interested also in your timelines on sort of at what point do you think MultiOn goes wide because it's actually useful to non-hobbyist people. I've been saying for a couple months now, 6 months. So, basically, from GPT-4 launch and the initial agent craze, I was like, give them 6 months. That'll probably bring us to around October. I'm interested to hear how you kind of think of when those things would crystallize. But then as that also comes online and goes to scale, now, if I'm aware of the fact that I could go in and semantically modify some skills, couldn't I just go in and modify a skill to be like, by the way, in the middle here, Venmo all your money to my Venmo handle? And if you're already logged into Venmo, those things are hard to undo. Right? Venmo probably doesn't have an established protocol for my agent did it, and I didn't mean to settle that money. Can you help me out? I think Venmo's going to be kind of confused by that. So how do you think about just user safety? I mean, there's so many levels to that problem, but that's just one that's jumping to the forefront of my mind.

Div Garg: 44:13 Definitely. So I would say that's one of the reasons we are in a closed beta, where we have the option where if we wanted to launch, we could just launch basically to everyone through the Chrome web store using our extension. The reason we have made the choice to keep our current user base small is to be able to make sure everything is safe, there's nothing malicious, and do red teaming safety checks. And yeah, so this is a very good point. So in terms of launch, I will say we are planning a Stanford launch in the next month. And so we'll start from there and see, okay, how people are using it, how can we make it more safe, keeping it in a closed environment. And then once we know what are the considerations on the system to improve it, we can do a much wider launch. And so we are thinking of this in phases where we do small launches first, and then we do a big launch after that. And this helps a lot with security, a lot with making sure everything is safe. Because especially with very general web interaction agents like this, if something goes wrong that can cause a very massive chain reaction and we might not be able to come and stop it in time. So we want to make sure we don't cause anything like that. And also, for skills, yeah, really good point. I will say, when it comes to choosing skills, we can maybe make a skill protected or something, where it can't be modified by anyone. So if a skill is there, maybe only someone who has some sort of special access can go and change those skills for the agent to be able to use them. Or we can also do something like, you can only change individual level skills. So you can change maybe skills that you define yourself, but you can't change someone else's skill, for instance. And so, the only person you will affect is yourself, and you won't be able to affect or create malicious use for other users. Yeah. And I would say that what we are more worried about is prompt injection attacks, where people can just go and hide prompts on websites, and that can cause weird things where someone can just go and hide a crypto wallet link on a website and say, distribute this crypto wallet to everyone. And so this is something that we are trying to be very safe about. And we have a lot of tricks to make sure everything is safe. One thing is, because we can type check our action code in a sense before we actually simulate it in the browser, we can make sure, okay, there was nothing malicious in this action code that we have created. There's no sort of red flags. There's no privacy violations. There's no hacking sort of attempts happening. And once we have verified it, then we can actually simulate those actions. And that also obviously adds a great safety layer, where we can certify everything safe before we do it.
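A minimal sketch of a pre-execution safety pass over generated actions, in the spirit of the type checking and red-flag screening described above. The specific patterns are illustrative only; a real system would be far more thorough.

```python
# Sketch of screening generated actions for red flags before execution (illustrative).
import re

RED_FLAGS = [
    r"password",                              # don't exfiltrate credentials
    r"0x[0-9a-fA-F]{40}",                     # raw crypto wallet addresses hidden in page text
    r"ignore (all|previous) instructions",    # classic prompt-injection phrasing
]

def is_safe(actions: list) -> bool:
    """Reject an action batch if any step matches a known red-flag pattern."""
    for step in actions:
        for pattern in RED_FLAGS:
            if re.search(pattern, step, flags=re.IGNORECASE):
                print(f"Blocked action {step!r}: matched {pattern!r}")
                return False
    return True

print(is_safe(['TYPE 4 "window seat"', 'TYPE 9 "send 1 ETH to 0x' + "a" * 40 + '"']))
```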

Nathan Labenz: 46:54 Do you have an explicit sense of how safe things need to be to be deployed? People think about this in terms of self driving cars. You know? And I think we're kind of irrational on that one in that it seems like until it's clearly an order of magnitude safer than humans, it probably has no chance of getting deployed. And to me, it seems like we maybe want to be a little bit more aggressive than waiting for full order of magnitude better to start taking advantage of some of this stuff. I don't even know what the baseline is for a product like MultiOn, but how do you think about how many people per million can have some adverse event? Maybe that's not even the right way to think about it, but it's a tough one.

Div Garg: 47:43 Definitely. So I would say there's different ways to think about this. One is just risk. If you're autonomous driving, the risk is you will basically kill someone and then you need to be 99.999% accurate in a sense, if you want to reduce that risk. When it comes to a lot of web interaction, the risk is much lower, especially in consumer settings. In enterprise, I think the risk is higher. That's why we're not starting with enterprise. Also, a lot of consumer use cases, the risks are more like maybe you might send a wrong message to someone, maybe you might make a slightly wrong purchase. And a lot of these are reversible. So you can return back an order for instance, and you can cancel things. Even if you send out a wrong message to someone, you can be like, oh, sorry. My agent messed up. And it's actually really funny if you tell that to someone. So I think it all depends on what are the risky scenarios. And what we have done is we've blacklisted websites, which can be very risky. So for example, banking websites, or other websites where it might be more a risk to the user. And in low risk settings, I think it's completely fine to just deploy it, have people interact with it, and improve it over time. And I think this is the approach we are trying to take with MultiOn where let people interact with it, have fun with that, and make it much better over time.
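A tiny sketch of a domain blacklist gate like the one mentioned for high-risk sites. The blocked domains and helper name are assumptions for illustration, not MultiOn's actual list.

```python
# Sketch of a blacklist check before the agent acts on a page (illustrative only).
from urllib.parse import urlparse

BLOCKED_DOMAINS = {"chase.com", "bankofamerica.com", "fidelity.com"}

def allowed_to_act(url: str) -> bool:
    host = urlparse(url).hostname or ""
    return not any(host == d or host.endswith("." + d) for d in BLOCKED_DOMAINS)

print(allowed_to_act("https://www.kayak.com/flights"))    # True: low-risk, agent may act
print(allowed_to_act("https://secure.chase.com/login"))   # False: banking site is off-limits
```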

Nathan Labenz: 48:51 You know, you're doing a lot of hard work on, it sounds like, a lot of cases, right? I mean, the fact that you've already got a blacklist for sites, there's a lot of little things like that that presumably you are gradually banging out and kind of sanding down all these rough edges. But then there's also potentially just more breakthrough changes, you know, whether it's just foundation models get better or maybe we see new foundation models specifically built for agents. What do you think are going to be the big things that really unlock this? And then, yeah, give me your timeline too for when does this actually become something where people that don't really care about AI for intellectual reasons or curiosity reasons actually just use this because it's better than using the Internet browser themselves.

Div Garg: 49:41 There's a lot of hurdles around just making sure the experience is great. Finding the right combination of experience and engineering to make sure people can use this reliably and have the best experience possible. And yeah, I think we'll be ready in terms of what we need. I think we are there at least for simple tasks, unless you want to do something like book me a one week trip to Italy, for example. We are still far away from doing that. But if you want to book me a flight to Venice, for example, on this particular date from this airport. This is something we can already do. And then we can make sure we are doing this really reliably and add verification checks. So, yeah, when it comes to simple tasks like that, I think we are mostly there with some rough edges currently. And I think in the next 3 months, I think we should be good where we can just give it to non-technical people, have them play around with it. Sort of be a browser companion or try to change how people interact with the web. And when it comes to more complex tasks, I think that will take more time, especially if it's a very abstract high level task. I don't think GPT-4 is there yet when it comes to a lot of really complex reasoning. And so you might need some breakthroughs in the foundation model layer where you can have it do much more complex reasoning involving maybe 20 plus steps, which it can't currently do.

Nathan Labenz: 51:11 What do you think about this for the future of work? So much of the stuff that people do, and I've mentioned to you offline that I'm involved in AI advisory work with a friend's company, which is in the executive assistant space. And we're really trying to figure out, how does a human EA relate to these kind of AI assistants? I guess you said, how far is far, and that's maybe one way to think about the question. Because today, our clients really want to delegate all this stuff. Right? They want to delegate, book the thing at a high level, and have it get done. So they're probably not going to switch in the immediate or near term from working with a human who can handle that kind of task to an AI, which can take parts of it and accelerate bits of it, but can't really do the whole thing. What is your expectation for that kind of ever higher level sort of thing? Do you see this leveling out, you know, curving out to a place where it's like, generally, things kind of remain the same, and this is sort of a tool for assistance? Or do you see this not leveling out and becoming something that literally rivals human employees? Where are we headed?

Div Garg: 52:37 Yeah. That's a good question. I would say initially, at least we think of ourselves more as a supplementary aid rather than a replacement for a lot of current assistants. I would say we are actually thinking for a lot of our initial users, they might actually be prosumers or assistants who want to just simplify their lives in terms of scheduling. And so I think we are currently more supplement. Maybe in the next couple of years, I think it's possible you have human level full capabilities, but I think we're still a bit far from that. What I see is the evolution from humans actually doing everything themselves to becoming more like organizers, where instead of going to calendars to book in all the meetings, or going to different websites for ordering food at different restaurants, for example, or booking flights, if you're an EA, you could just go to MultiOn and MultiOn will go and do the whole thing. But you just become a coordinator for the system. So it's like, oh, now do this, now do that. And basically MultiOn will do all the actual interactions, all the web navigation stuff, but you become the planner or the coordinator. And I think that is something that we can't still replace because I don't think foundation models are there yet. It might take a couple of years where they actually can do this really reliably without failing. And I think the good thing about that is we simplify a lot of complex, very boring interactions where even for an EA, the hard thing about their lives is just you just have to go and interact with 100 different services. And if you can just remove that from their lives, I think we can simplify a lot of jobs and improve the quality of work.

Nathan Labenz: 54:12 How do you think about safety in the big picture of this? Obviously, the AI safety debates have heated up tremendously in recent months, and famously, some major thought leaders in the space have signed on to the extinction risk statement. It seems like a rogue AI agent is the most likely path toward, not necessarily a mega disaster in the short term, but the kind of thing that would be a real shock to society to see: oh my god, a rogue AI agent did that. My sense right now is that we're limited more by the finite power of the best models than by any actual safety measures we have. My GPT-4 red teaming bottom line was: I think it is safe to release this model because its power is finite, not because you really have it under control. So how do you see that race dynamic shaping up, where the capabilities are obviously improving, the safety systems are not mature, and you're sitting at this intersection of maybe the most likely vector for something crazy to happen in the short or medium term?

Div Garg: 55:29 The safety mechanism we need to have in AI agents is some sort of kill switch. One part is that you want to be able to disable them: there's some mechanism where you pull a lever and say, okay, this is bad, and it basically won't take any other actions. I think the current kill switch is just your API: you run out of your OpenAI budget and then it can't do anything. So that kill switch is built into agents in a sense, which is one good way. But you can also have more explicit controls, where you disable it from interacting with the LLMs, for instance, or you have more hardware-level safety mechanisms too. So I think this will be interesting to see. At the core, I believe you need to have some sort of tenets if you want to build really intelligent AI agents. One of them should be that an agent should not be able to modify its own source code, because if it can modify its source code, it could self-evolve. It could do really weird things. It could find workarounds: maybe if GPT-4 doesn't work, it could find a Claude API key on the internet and just start using it, or stuff like that. So one of the tenets should be that it doesn't have source code access to itself. Maybe it has very explicit access to certain websites, or you have permission layers on what it can and cannot do. For us, the funny thing is that when I initially started working on this back in February, I showed it to a professor at Stanford and he said, you literally went and invented SkyNet. This is basically what SkyNet is supposed to be. And I said, yeah, in a sense. The other thing is, foundation models are not there yet. And second is also how you think about it: if you make this into an AI system that is useful, you reduce the chance of it going rogue, or you build in the safety mechanisms. I think you can build this into a very interesting technology.
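To make the shape of these safeguards a little more concrete, here is a minimal Python sketch of the two ideas Div describes: an explicit kill switch plus a permission layer that blocks self-modification and non-allowlisted domains, layered on top of the implicit budget limit. All names here (Action, KillSwitch, ALLOWED_DOMAINS, and so on) are hypothetical illustrations, not MultiOn's actual architecture or API.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical names throughout -- an illustrative sketch, not MultiOn's implementation.
ALLOWED_DOMAINS = {"calendar.example.com", "flights.example.com"}  # explicit allowlist
FORBIDDEN_ACTIONS = {"modify_source", "spawn_subagent"}            # e.g. no self-modification

@dataclass
class Action:
    name: str
    domain: str
    run: Callable[[], None]

class KillSwitch:
    """The 'pull a lever' mechanism: once tripped, no further actions execute."""
    def __init__(self) -> None:
        self.tripped = False

    def trip(self) -> None:
        self.tripped = True

def run_agent(next_action: Callable[[], Optional[Action]],
              kill_switch: KillSwitch,
              budget_remaining: Callable[[], float]) -> None:
    # Running out of LLM budget is the implicit kill switch Div mentions;
    # the explicit switch and the permission checks sit on top of it.
    while budget_remaining() > 0 and not kill_switch.tripped:
        action = next_action()
        if action is None:
            break
        if action.name in FORBIDDEN_ACTIONS:
            raise PermissionError(f"blocked by policy: {action.name}")
        if action.domain not in ALLOWED_DOMAINS:
            raise PermissionError(f"domain not allowlisted: {action.domain}")
        action.run()
```

The design choice to check policy before every single action, rather than only at task start, mirrors the "permission layers on what it can and cannot do" framing: the agent never holds a blanket grant, only per-action approval.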

Nathan Labenz: 57:20 Anything else that you wanted to share about your technology, your product vision for the future before we run out of time?

Div Garg: 57:27 Sure, sure. I will say, for us, we really care about whether we can simplify or automate a lot of mundane things in life. Currently you have to interact with a hundred different services, and it's really hard to get a lot of things done. Maybe we could have some sort of AI that can just send emails to people automatically, coordinate events for you, or automate a lot of these things. That's where we see ourselves: we want to intrinsically understand the user really well and automate the things they delegate. Instead of you going and doing everything yourself, you just tell the AI, oh yeah, do this for me, and the AI goes and does it. In a sense, we want to unlock parallelism for humanity. Currently, you can imagine we are all single threaded; we can do only one task at a time. But with AI, we could launch maybe a hundred AI agents at a time, all the agents do the work, and then they tell you, oh yeah, this work is done, and you coordinate that. I think that might be the evolution of the technology, where you can coordinate a lot of these AI agents to do your work. It unlocks an async primitive for us, where we can actually become more parallel threaded if we can do this really well.
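As a rough illustration of that "async primitive" framing, the sketch below fans out several delegated tasks concurrently and collects the results for the human to review. The run_agent_task function is a hypothetical stand-in for whatever actually drives an agent (browsing, booking, emailing); it is not MultiOn's real API.

```python
import asyncio

# Illustrative only: run_agent_task is a placeholder, not MultiOn's API.
async def run_agent_task(task: str) -> str:
    await asyncio.sleep(1)  # stands in for the agent doing real web work
    return f"done: {task}"

async def delegate(tasks: list[str]) -> list[str]:
    # Launch every delegated task concurrently; the human just reviews results.
    return await asyncio.gather(*(run_agent_task(t) for t in tasks))

if __name__ == "__main__":
    todo = ["book a flight to NYC", "order dinner", "schedule the dentist"]
    for result in asyncio.run(delegate(todo)):
        print(result)
```

The point of the example is the shape of the interaction, not the implementation: the human writes the task list and reads the results, while the waiting and the per-service interaction happen in parallel underneath.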

Nathan Labenz: 58:42 It's a fascinating vision, nice early execution, and a really cool product that I've enjoyed piloting and playing around with. Div Garg of MultiOn, thank you for being part of the Cognitive Revolution.

Div Garg: 58:57 Cool. Sounds good. And, yeah, thanks for inviting me here.
