Always Bet on the Models: How Tasklet Puts the Agency in Agents, with CEO Andrew Lee

Andrew Lee, CEO of Tasklet.ai, discusses his AI agent automation platform that uses conversational, long-lived agents to replace traditional workflow tools. He explains how betting on improving model capabilities enables flexible automation while navigating reliability and cost challenges.

Show Notes

Today Andrew Lee, founder and CEO of Shortwave and Tasklet.ai, joins The Cognitive Revolution to discuss building a new AI agent automation platform that replaces traditional workflow tools with conversational, long-lived agents, exploring how betting on rapidly improving model capabilities enables more flexible automation across thousands of business integrations while navigating the challenges of reliability, cost management, and the emerging paradigm of virtual employees.

Shownotes brought to you by Notion AI Meeting Notes - try one month for free at: https://notion.com/lp/nathan

  • Agent-first vs. workflow-first approach: Tasklet fundamentally differs from traditional automation tools by using a single AI agent that handles all tasks through natural language, rather than building explicit workflows or state machines.
  • Speed is the only remaining moat: In the AI era, traditional competitive advantages are evaporating so quickly that the only sustainable advantage is moving faster than competitors.
  • Small business marketing as near-term target: The technology is approaching "best available human" performance for tasks like small business marketing, where hiring quality human help is difficult.

Sponsors:

Google Gemini Notebook LM:

Notebook LM is an AI-first tool that helps you make sense of complex information. Upload your documents and it instantly becomes a personal expert, helping you uncover insights and brainstorm new ideas at https://notebooklm.google.com

Tasklet:

Tasklet is an AI agent that automates your work 24/7; just describe what you want in plain English and it gets the job done. Try it for free and use code COGREV for 50% off your first month at https://tasklet.ai

Linear:

Linear is the system for modern product development. Nearly every AI company you've heard of is using Linear to build products. Get 6 months of Linear Business for free at: https://linear.app/tcr

Shopify:

Shopify powers millions of businesses worldwide, handling 10% of U.S. e-commerce. With hundreds of templates, AI tools for product descriptions, and seamless marketing campaign creation, it's like having a design studio and marketing team in one. Start your $1/month trial today at https://shopify.com/cognitive

PRODUCED BY:

https://aipodcast.ing


Transcript

Introduction

Hello, and welcome back to the Cognitive Revolution!

Today, I'm excited to welcome Andrew Lee back for his third appearance on the podcast.

Our last two episodes together were about Andrew's AI Email Assistant product Shortwave, and while Shortwave does remain one of the few AI apps that I use daily on both laptop and phone, the occasion for this conversation is the launch of Tasklet, a new AI agent platform that combines the natural-language interaction & open-ended nature of chatbot assistants with the goal orientation & tangible output of automation software – and which has Andrew & team even more excited.

In practical, user experience terms, Tasklet invites you to "Describe a Task or Responsibility to Automate" and then figures out, for itself, how to do it.  As its boss, you can then review its output, or its full step by step process, and offer feedback to guide its improvement going forward.

I'm honored, and grateful, to say that Andrew will be sponsoring The Cognitive Revolution to help support Tasklet's launch, and while we will always be transparent about any financial relationships with guests – whether that's sponsorship, investment, or otherwise – I can sincerely say that we make no compromises on content quality, and in fact, Andrew's belief that "speed is the only moat" for AI startups and his resulting willingness to share everything he's learned with remarkable transparency and technical depth have made him a favorite guest among AI engineers.  

And what's more, having tested Tasklet for a month before launch, I can honestly say that it's much more like the Agent experience I had expected OpenAI to launch at Dev Day than what they actually did launch, and because it delivers frontier capabilities in a radically accessible form factor, I recommend it to both people who are just getting started with AI automation, as well as to experienced AI engineers who want to experiment with new concepts and want to tap into the AI's own ideas for how to get the job done.

In this conversation, we explore every angle of how Andrew & team are building Tasklet, including:

● Andrew's plan to "always bet on the models", and why he believes open-ended agentic systems will ultimately prove more reliable than traditional step-by-step automation software;

● Tasklet's two-tier architecture – which uses high-level agents to maintain instructions and spawn sub-agents for individual runs – and how this enables both recurring automation and ad-hoc assistance;

● Real-world use cases ranging from email triage and marketing automation to commodity trading and venue management;

● How Tasklet manages context across long-running agent relationships, including sophisticated approaches to memory, context compaction, and the use of SQL databases to store agent state;

● Their tool integration strategy, which combines more than 3000 pre-built integrations with the model's native ability to figure out how to use MCPs and previously unseen APIs – plus computer use as a final fallback;

● Fascinating technical details on prompt engineering, including why they're moving away from pre-generated instructions and toward just-in-time instructions to avoid conflicting directives;

● Why Andrew still runs Tasklet on vibes rather than evals for model selection, and why he's confident that Anthropic's models remain the best choice for multi-turn agent interactions;

● And his perspective on AI's overall trajectory, which I'd say represents the mainline view among frontier lab employees and the most connected app builders, namely that progress has not faltered and is absolutely expected to continue for some time to come.

To me, while the capabilities frontier remains jagged and today's models can of course only handle a couple hours of work at a time, Tasklet does feel like a preview of the Virtual Employee future.  When you can maintain an ongoing relationship with an agent, give it high-level feedback on its work, and watch it incorporate that feedback into future runs, you start to experience something genuinely new.  And Andrew's report that some users are already giving their Tasklet agents their own dedicated email addresses – and even names – suggests this paradigm shift is already beginning.

If you're intrigued, and certainly if you listen to this episode I think you will be, Andrew invites you to try Tasklet, for free, at Tasklet.ai – and then, when it's time to upgrade, use code "COGREV" to get 50% off any paid plan.

Now, I hope you enjoy this deep dive into the strategic thinking, product management, and technical implementation of a frontier AI agent platform, with Andrew Lee, founder and CEO of Tasklet.


Main Episode

speaker_1: Andrew Lee, founder and CEO of Shortwave and now also founder and CEO of Tasklet. Welcome back to The Cognitive Revolution.

speaker_2: Thank you for having me.

speaker_1: I'm excited for this conversation. This is the third time we've done this and I've learned a ton each of the last two times. I'm sure that's going to be the case again today. The occasion is there is a new product out from you guys. It's called Tasklet. I've had the chance to play with it for the last month or so as you've been developing it and refining it. And I think it is a pretty cool paradigm that kind of blurs the lines between what many people are familiar with in the form of a chatbot on the one hand, and then also in the form of kind of a structured, you know, workflow or agent on the other hand. This is something that kind of sits in the middle and can kind of play both roles. And I found it pretty cool to play around with. So I'm excited to dig into every aspect of it with you. But maybe for starters, tell us about Tasklet at a high level. Like, how did you decide to build another product, and how did you hone in on this vision?

speaker_2: Yeah. So this is our second product after Shortwave. And basically what happened was earlier this year, we got really good at hooking our AI in Shortwave up to other products. So we added MCP, and people were doing these interesting workflows where they were taking data from their e-mail and they were sticking it in their Notion or in their Asana and updating their HubSpot automatically. And people really liked this. And it worked really well. And they said, hey, wouldn't it be nice if you just did this automatically before I got up every morning, so I didn't have to sit at my computer and, like, run this prompt every day to update my data. And we said, yeah, that'd be cool. Let's build that. So we started making that. And we very quickly realized that, hey, if this thing is running when you're not at your computer, having a tight UI integration between your AI agent and your e-mail doesn't make any sense, because you're not even in front of your computer when this is happening. And so what if we took advantage of that and built something just much more general purpose that was great for doing, you know, e-mail sync with other platforms, but also just general automation. And so we took a bigger swing and said, hey, we're going to try to build something that is designed to be your general purpose AI agent automation tool. And we spent, you know, starting basically end of May, early June, all the way up until last week, building, testing, iterating, and we finally launched it. So it's live. Check it out at Tasklet.ai.

speaker_1: Yeah, it's cool. I think the thing that I have really enjoyed about it is, because I, you know, interact with people from all walks of life who are interested in AI fairly often, they're like, how could I set something up that would do whatever for me? And the friction to do that kind of stuff has been fairly high, right? Even with AI as, you know, something that can sit in a node in a traditional automation software platform, there's still a lot for people to figure out if they're not the kind of people that are used to getting really structured in their own minds about how to break a task down into its steps, and, you know, how to set all that up in traditional software. And so I have found myself going to Tasklet as, like, the first place, and being like, OK, start here. Just start by saying what you want and then kind of iterate on the thing from there. It's a really interesting split where you have kind of the main-line agent itself, which is, like, an entity that you can chat with, and at that level it feels more like a chatbot, but then it also, like, follows your instructions, and, you know, you can set it up to run every day or on various triggers. And you can, you know, tell us more about the particulars of that. But then you also have this, like, second tier, which is all the runs that the agent has done of this particular task. And I think that one-two punch, both of which are just very sort of general purpose, like very, you know, natural language driven, not really requiring any of the nitty-gritty setup that people have become used to in some ways, but also have, like, probably been deterred by in many situations, is a pretty interesting new way to look at what an AI agent could be. Tell us more about that.

speaker_2: Yeah, traditionally the way people do these sorts of automations is they use a workflow product, something like Zapier or n8n, or, like, the Agent Kit product that OpenAI put out is what I would call a workflow product. And I look at workflow products and say, this was the right way to approach this from a traditional software engineering standpoint, like, a year or two ago, when models were smart but not that smart. And I think we've learned over the last few years that you should always bet on the model. The models are always going to get smarter. And the right thing to do is to find ways to give those models more agency over time. And I think the reason people have been shy about doing workflow automation, like, fully agentic all the way down, is because they didn't really trust it to be reliable. They didn't feel like the models were there yet. But I think now is the time. I think we've gone through that transition from, you know, you have a workflow that defines step one, step two, step three, step four, and maybe there's some LLM calls inside it, to what if you just let the agent plan the whole thing. And the advantages you get out of this are tremendous, because let's say you run into an error state in a workflow product: if you don't have a way to handle the error state, it just breaks. In an agent product, it just kind of figures it out, works around it. It can handle nuance much better. And also, as you mentioned, it's just a whole lot easier to set up. And so we've been trying to say, hey, we're going to bet on the models. Some of our workflow competitors have kind of done a hybrid solution where they say, we're going to have an agent for the purpose of creating the workflow. So, like, if you use String, this is the approach with String, but the output of that is still a workflow. And so it's fairly constrained in what it can do. And we're saying, hey, not only are we going to have the setup portion be an agent, the actual implementation is going to be an agent as well.

speaker_1: Yeah, it's funny because that has kind of been the assumption on my part of what I expected things to trend toward. And yet I do find myself quite liking the form factor that you've created. I guess to dig into a little bit more, like, trusting the models and what you mean by that. I have this one slide that I use all the time in presentations that's just sort of a minimalist diagram of an agent. Like, what is the, you know, sort of most abstract form, right? Everything, all the detail kind of stripped away, and it's basically just a box around an LLM and some tools. And that's, like, your core agent. This thing is then given a task, the LLM can reason a little bit, can use the tools. When it uses tools, something happens in the environment, it gets feedback, and it just keeps kind of iterating on that basis, right? This is the classic, like, LLM in a loop is basically your minimalist agent. So when you talk about, like, betting on models, you know, the most extreme version of that would be, like, just throw that very minimal scaffold up and, you know, let the models have at it. You've done a lot more than that, I'm sure. And I can see, you know, some of the fruits of that labor. But how do you think about, like, what adds the most value beyond that super minimal agent scaffold? That's, you know, kind of a cartoon diagram.

speaker_2: Yeah, what I mean by bet on the models, what I really mean here is, what is really in control of what's going to be happening, right? So one approach is to have a workflow where the thing that is in control is traditional software, and you as the user define the steps and it goes through some sort of flow chart. And maybe within those boxes there's some LLM calls, but the overarching control is done by traditional software. Versus if you put the model in charge, the model's in charge, the model makes the big decisions. And then within what's happening with the model, you might have traditional software to, like, execute the tools. And so it kind of inverts the problem. Rather than having software wrapping LLMs, you have LLMs wrapping software. So when I say, you know, we're betting on the model, that's what I mean. Either way you do it, there's still a lot of traditional software that has to be built. And I'd say the core of what the agent is, I think I totally agree with you, it's basically just calling an LLM in a loop where you, you know, have the LLM spit out some tool calls, we resolve the tool calls, you call the LLM again, and it kind of reasons through things. But yeah, there's a lot that gets built around that. I'd say there's kind of three big categories of things that we bring to the table. The first is connections. So we hook up not to some pre-enumerated set of tools, but we hook up to everything on the Internet, like any service you want to connect to, we can connect to. And all of the work of figuring out how do we connect to Gmail or Notion or your random enterprise API, or, you know, a computer that can access LinkedIn, or some MCP server. All of that work is our own code, and there's a lot of work trying to take all these heterogeneous ways of talking to the Internet and putting them into something that is consistent and the LLM can reason about and use well. So that's category one. Category two is triggers. So, you know, all the stuff that makes the LLM get sort of run automatically, and have those runs be, you know, encapsulated in these sub-agents that have sort of limited permissions where you have some control over them, is a big, big area of work. And then category three, it's a very nascent part of our product, but it's all of the sort of team collaboration and sharing features. So we're really targeting business operations use cases where, you know, it's not just you operating your personal stuff. It's like you're automating some core business process. And we see that as very much a team sport. And so there are a few features in there now, like we have the ability to share an agent, but we're going to be doing a lot of stuff to provide sort of, like, oversight and management and auditing and cost controls and things at an organizational level.
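
To make the "LLM in a loop" shape Andrew describes concrete, here is a minimal sketch in Python against the Anthropic Messages API. The single web_search tool, the stubbed dispatch function, the turn cap, and the model name are illustrative assumptions rather than Tasklet's actual implementation.

```python
# Minimal "LLM in a loop" agent: the model plans, emits tool calls, traditional
# software resolves them, and the results are fed back until the model stops.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# One illustrative tool; a real product would register many connections here.
TOOLS = [{
    "name": "web_search",
    "description": "Search the web and return a short summary of results.",
    "input_schema": {
        "type": "object",
        "properties": {"query": {"type": "string"}},
        "required": ["query"],
    },
}]

def run_tool(name: str, args: dict) -> str:
    """Dispatch a tool call to traditional software (stubbed out here)."""
    if name == "web_search":
        return f"(stub) search results for: {args['query']}"
    return f"unknown tool: {name}"

def run_agent(task: str, max_turns: int = 50) -> str:
    """Call the model in a loop, resolving tool calls, until it finishes or hits the cap."""
    messages = [{"role": "user", "content": task}]
    for _ in range(max_turns):  # a turn cap keeps a runaway agent from costing unbounded money
        response = client.messages.create(
            model="claude-sonnet-4-5",  # assumed model name; swap in whatever you run
            max_tokens=4096,
            tools=TOOLS,
            messages=messages,
        )
        messages.append({"role": "assistant", "content": response.content})
        if response.stop_reason != "tool_use":
            # No more tool calls: return the model's final text answer.
            return "".join(b.text for b in response.content if b.type == "text")
        # Resolve every tool call and feed the results back as the next user turn.
        results = [{
            "type": "tool_result",
            "tool_use_id": block.id,
            "content": run_tool(block.name, block.input),
        } for block in response.content if block.type == "tool_use"]
        messages.append({"role": "user", "content": results})
    return "Stopped: hit the turn limit."
```

Everything described in this conversation, connections, triggers, and sub-agents with limited permissions, sits around a loop of roughly this shape.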

speaker_1: But I will say, prior to playing around with Tasklet, I had the working assumption that traditional workflow tools, while somewhat costly in terms of time and, you know, cognitive overhead to get set up, would bring me to a place of higher reliability. So I sort of assumed that there was an inherent trade-off between accessibility and reliability. But you're making an interesting point that when the AI itself has more room to kind of choose its own adventure, that's my term for these, you know, less structured agents, or choose-your-own-adventure agents, they do have, in a way, also the opportunity, at least, to have a higher level of robustness to unexpected stuff, because they can route around obstacles that they find at runtime that weren't anticipated by, you know, the person who otherwise would have been building out, you know, these sort of block-by-block control flows. How do you think about that trade-off? And do you think that it is in fact, like, more reliable on net, or is it a different kind of reliability? Yeah. What are the upper limits of this? Like, could you put something like this in production, you know, as part of a production app? How far do you think this goes?

speaker_2: Yeah, I totally agree with your point that the upper bound of reliability is not what you get from workflow products today, because the real world is messy and things break, and those tools break all the time, and agents potentially solve many of those cases. So I do think you need to look at this holistically. I think today, with the models you have today, you're probably going to have in most business applications a somewhat less reliable solution. But as I said, always bet on the models. I give it six months before that's no longer the case. And I think we've been able to look back over the last few years and see that these predictions really do come true. So we're looking ahead and saying, today, for many applications, it is, you know, reliable enough and there's a lot of other advantages. And in the future, it'll probably be more reliable. The other thing I want to note is there is a lot that the models can do to provide reliability and sort of guardrails around their process. So for example, today we use the LLMs to try to figure out types. So we dynamically create some of our connections. So we use the LLMs to try to figure out, like, hey, what are the type restrictions that should be on this? And then we enforce those in code. And so if, for example, what you want is you really want the LLM to follow a flow chart for some portion of the project, right? We could build the ability for you to tell it, hey, you know, during this phase you must do these steps in order. And the LLM could actually create its own guardrails and enforce that through code if it wanted to. So I think if the model is wrapping the code, the model can then construct constraints in that code to enforce the, you know, the reliability goals that you want. We haven't built it yet, but I think we can. So I think the future is not going to be, you know, forever, here's the quick, easier, less reliable way, and then here's the harder, more reliable way. I think it really is going to be better in basically all scenarios in the not too distant future.
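
A hedged sketch of the guardrail idea above: the model proposes type restrictions once (here as a JSON Schema), and ordinary code enforces them on every subsequent call. The prompt wording, the helper names, and the use of the jsonschema library are assumptions for illustration, not how Tasklet actually does it.

```python
# The LLM authors a constraint; deterministic code enforces it on every run.
import json
import anthropic
from jsonschema import validate, ValidationError

client = anthropic.Anthropic()

def propose_schema(connection_description: str) -> dict:
    """Ask the model what type restrictions a dynamically created connection should have."""
    response = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": "Return only a JSON Schema describing valid inputs for this connection:\n"
                       + connection_description,
        }],
    )
    return json.loads(response.content[0].text)  # assumes the model returns bare JSON

def enforce(schema: dict, payload: dict) -> bool:
    """Reject any tool input that violates the model-authored constraint."""
    try:
        validate(instance=payload, schema=schema)
        return True
    except ValidationError as err:
        print(f"Rejected payload: {err.message}")
        return False

# The schema is generated once, then applied in code on every subsequent call.
schema = propose_schema("Create an invoice with a customer email and a positive amount in USD")
print(enforce(schema, {"customer_email": "a@example.com", "amount_usd": 42.0}))
```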

speaker_1: Yeah, maybe it could be Sonnet 4.5 new, or maybe we have to wait till Sonnet 4.7, but it seems like these things are coming at us pretty fast. This is something we talked about last time, and your response was quite interesting. Basically, as I recall, it was very vibes based. How do you eval things these days? Have you... I mean, this debate has continued to rage, as I'm sure you're well aware, since the last conversation. Have you updated your position? Are we still running on vibes?

speaker_2: Still running on vibes. Yeah, I think what we've figured out is, in the current, you know, market situation, the Anthropic models are the best at what we do. And there's really no question there. And if you're looking at evals from the standpoint of we're trying to decide whether to roll forward to another new model, we kind of have no choice, right? We can't really switch off Anthropic right now because there's no viable alternative. And when a new Anthropic model comes out, the pressure from our customers to release that thing as soon as possible is quite tremendous. Everybody wants it right now. And so, you know, if we had some evals, and the evals, you know, weren't working out the way we wanted, we're probably still going to roll out that new model, right? We might tweak some prompts and stuff to try to make it work better, but we're probably going to still roll out that new model. So we've just sort of accepted that, like, we're going to be on Anthropic for a while and we're always going to roll out the new model as quickly as it comes out. And so we need some approach that works in that environment. The other thing to consider is our product is changing so fast that anything that we put in that could constrain our ability to move quickly is going to come with real business costs. So, you know, this entire product, we only started writing code in June. It's evolved tremendously over that time. It's going to continue evolving tremendously. And our users are very clear that the thing they want above all else, and I mean this above, you know, above reliability and security and, you know, all the normal things you want, is they want the smartest, most capable thing. I've had lots of calls with people where, you know, they're asking about SOC 2 and we're like, yeah, we don't have SOC 2 for this yet, right? And, you know, if you had to choose between SOC 2 or it being a little smarter, basically it's always like, well, don't tell compliance, but I'd want it to be a little smarter. So that's really where our focus is, is, like, moving quickly, being smart, and evals, beyond our own dogfooding, are not a part of that right now. So we do do a lot of testing, don't get me wrong, but it's all like, we have a whole bunch of triggers that run for our own internal usage. And we basically roll stuff out to our internal team first. We give it some time, see how it goes, and then if it feels good, we roll it out wider. And in Shortwave, we do basically experiments-based rollouts. So we allow power users to, like, opt into things. We look for retention rates on the new features and roll them out. We're going to do something similar in Tasklet, where, like, you know, we'll roll out a new version, we'll let power users opt into it, we'll see what retention looks like, and then we'll flip it on for everybody.

speaker_1: I guess just to ground this out a little bit, you said that, like, Anthropic models are clearly the best for what you do. How do you test that? I mean, are you literally just doing, like, Andrew's top 10 agent use cases and running them head to head with GPT-5?

speaker_2: Yeah, it's, I mean, it's vibes, right? So we built our system so it's easy to swap in different models to test. And we tried the different use cases. And I think what you find is all of the models do a pretty good job at answering the initial question you have. In fact, some of the models may do a better job if your goal is to, like, have one question, one answer, you know, you might find that OpenAI models do better. But if you want to have something that is LLM call, tool call, LLM call, tool call over a long sequence, which, like, all the stuff you want to do with our product is that type of thing, Anthropic just sort of handles that iteration better. And then over, you know, 100 iterations, that really adds up if it's slightly better at each turn. So, yeah, it's all been sort of hand testing in different models. But ultimately, for those sorts of long iterative processes, it's clearly the best. And I think the market agrees with us. If you look at, you know, all the other folks that are doing stuff like us, they're all heavily using Anthropic models.

speaker_1: What would you add? Of course, everybody who listens to this feed is very familiar with the METR task-length graph at this point, and we were all waiting for Sonnet 4.5 to, you know, be planted on that graph. It did come in a little lower than GPT-5, right on trend with the curve, but a bit lower in METR's estimation than GPT-5. How would you account for that difference, if you could? Or, like, you know, the other way to think about that is they, you know, put us at just over 2 hours now in terms of the size of task, as measured in the time it takes a human to do it, that the models can now handle, like, 50% of the time. Do you accept that analysis? Would you complicate that analysis? Is there, you know, is there some part of that, is there some reason that you are coming out with a very different conclusion than what they are measuring? Because I think right now the prevailing, you know, notion would be GPT-5 is a bit ahead.

speaker_2: Yeah, I don't know why that difference is there. My cynical take is, you know, some people are trying harder to hack the metrics than others, but I don't actually know what the reason is. But, you know, I look at our actual sort of practical real-world testing, and my experience has been it's better. And the data point I would point to that says, hey, you know, there is something real here, I'm not just making this up, is pricing, right? So if GPT-5 was as good at these use cases as Sonnet is, it wouldn't be less than half the price. And at less than half the price, you'd think everyone would be switching off, and they're not. And when Anthropic came out with 4.5, they didn't lower the price, even though OpenAI had lowered the price. So the fact that the pricing is holding up tells me that people are choosing it. And I think it's just because, you know, ignore what the metrics say, the real-world utility of it is just better.

speaker_1: How do you think about the size of tasks? When I have looked into runs of things that I've done, I've routinely seen, like, dozens of steps, which, you know, for whatever reason, the tasks I tend to come up with are like, help me with AI research. Not for whatever reason, actually, I know why that is. It's becoming overwhelming and I need all the help I can get. So it's a lot of stuff that's like search, reason, search, reason, search, reason a bunch of times, and then finally compile a report for me, send that off, update memory. And that's basically the anatomy of most runs of the, you know, of the recurring tasks, the agents, that I've created so far. How does that compare to the frontier? Like, is dozens of steps kind of toward the upper limit? Are you seeing things that are substantially more than that? Do you think about them in terms of time it would take a human to do them, or some other metric? Like, how do you even conceive of how to measure, you know, the size of what the agents can do?

speaker_2: Yeah, to be honest, we don't make any effort to try to estimate the time it would take a human to do this. We're a little team. We're just trying to build a thing that people will pay for. I'll leave it to the researchers to try to figure out how that relates to human time. We look at things in terms of turns, because that, you know, correlates with our cost, and we absolutely do see people with tasks that are spanning a very large number of turns. And actually, this is something we need to change, there's a 50-turn limit per question, basically, in our product. And that initially was just set to, like, keep a bug from costing us infinite money. And we're hitting this all the time now, and we need to bump that. And I think the number one place for hitting this is computer use, where, you know, if your task is, like, hey, you know, go to LinkedIn, find these 10 people and send them all a message, you know, every operation there is going to be a turn, and sometimes multiple turns. And so I think in the computer use case, the number of turns just absolutely explodes.

speaker_1: Yeah, that's funny. I have been doing that a bit as well. Specifically, there's an event coming up in Detroit where I was invited to either give just a solo presentation or maybe do, like, a live podcast recording in front of a live audience. And then the question was, like, well, who's the guest? And so I'm specifically having Tasklet go comb through AI leaders at various organizations in Detroit and pull back, you know, lists to me of people that I should consider inviting. Which brings me to a great question. It seems like in the metrics, Claude 4.5 Sonnet was a big step up in computer use. How did that feel to you in your vibes-based assessment?

speaker_2: It seems really good. I'd say the big blocker for more computer use for us right now really is just cost and speed. It is tremendously expensive to move a computer around by screenshotting. And so, yeah, I'd say when computer use fails for our users, it has less to do with, like, the intelligence of the model and more to do with either they hit that 50-turn limit or, like, they just can't afford to do it anymore. But yeah, I found it super impressive. We've looked at using other models specifically for computer use, for cost. So, like, we're looking at using Gemini and having, like, a sub-agent to handle portions of that. But the downside of having a sub-agent run and then capturing some, you know, distilled version of the data and putting it in the main agent is you just get less intelligence overall. And so we've played with that a bit, but so far the strategy is just spend a lot of money on Sonnet. We just rolled out Haiku, and so we've been doing a little bit of testing here. I think it's too early to say how well Haiku is going to do with computer use, but that could be a big factor for cost for us. But generally, super impressed with the capabilities for computer use.

speaker_1: How do you think about what kind of computers to give to the AI? It seems like, on the one hand, there's sort of just a browser as, like, one paradigm. And then, on the Grok 4 launch, I always think back to Elon talking about setting up, like, you know, power workstations with all the same, you know, high-end software that the engineers at Tesla and SpaceX use. And it seems like you're somewhere in the middle right now, where it is not just a browser, right? You have, like, a full operating system, you know, VM type of thing that the AI can use. How do you decide, like, how much of a computer to give it, and how much can it do beyond the browser today?

speaker_2: Yeah, this is probably the most active source of discussion for our team right now. We actually have everyone on the team flying out next week, and this is going to be a big focus of that discussion: what direction we want to take this. Right now in the product, every agent has its own SQL database. It has its own code execution environment. It has, like, very limited file capabilities, and you can optionally spin up a full Linux VM and connect to that Linux VM and, like, use a browser or use a file system in that VM. We used to have Windows support. That caused some problems for us. It was expensive, the, you know, the wake-up-from-sleep time was, like, really bad, there were some reasons it was harder, and it was also pretty rare for people to use the benefits of Windows besides stuff in the browser. So we switched to Linux, which I think has been working much better for folks. We've considered doing Mac. For the usage going forward, I see a couple of big goals. One is we want to give it terminal access. So there's a lot of operations that, like, if you use Claude Code, right, it's shocking how much Claude Code can do outside of programming since it has direct access to the console. So for example, like, we have a weird number of people opening up agents and then being like, you know, hey, I uploaded this file, like, please convert it from, you know, this video format to this video format. And, like, you know, I don't know why you're asking an AI to do this, but we totally could, right? We could run FFmpeg and, like, you know, do that sort of processing. So I think we want to give agents, you know, a shell and let them run command-line things. I think we want to give them a file system. And I think we want all of those pieces of the computer to work well together, right? So, for example, today we can write and execute code, but that code can't access your database. That code can't access your connections. It can output data to the LLM. But what if you had all those facets of a computer, the file system, the database, the code execution environment, the shell, and the UI, all able to talk to each other? I think that could be tremendously powerful. Doing that well is really hard for a variety of reasons. Cost is a big factor, right? You could have 100 agents in our product. You don't actually want to have, like, 100 physical machines per user. The thing would be prohibitively expensive. This is an area actually where there's a lot of startup activity. So a lot of people have figured out that you need specialized cloud infrastructure for doing this. I was just talking to a startup this morning called Black Soul that seems pretty cool. But yeah, a lot of discussion here, a lot of excitement. I think big things are coming. Watch this space.
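
As a sketch of the "give the agent a shell" direction, here's what a command-line tool handler could look like; the working directory, timeout, and the commented FFmpeg example are my own assumptions, and in a real product this would run inside an isolated VM or container rather than on the host.

```python
# A shell tool: the model emits a command line, code runs it and returns the output.
import os
import subprocess

def run_shell(command: str, workdir: str = "/tmp/agent_workspace", timeout: int = 120) -> str:
    """Execute a model-issued command and return combined stdout/stderr for the next turn."""
    os.makedirs(workdir, exist_ok=True)
    result = subprocess.run(
        command, shell=True, cwd=workdir,
        capture_output=True, text=True, timeout=timeout,
    )
    return result.stdout + result.stderr

# Example: the kind of video conversion Andrew mentions users asking for.
# print(run_shell("ffmpeg -i input.mov output.mp4"))
```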

speaker_1: What do you think the requirements are there? Like, it just seems that the main one would be, like, you sort of want to be able to suspend the VM and kind of put it into sleep mode, but, like, keep its state, kind of like my Mac when it runs out of battery or whatever, it comes back up in roughly the same state that I left it. Is that the big thing, or what else is kind of missing from the tools you have available right now?

speaker_2: Yeah, I think cost is probably the number one issue, where, yeah, you want to have the illusion of every agent having its own computer, and you want that computer, when you're using it, to be powerful, but you don't want to pay for any of the milliseconds you're not using it. And that includes not just compute, but also storage. And that can get complicated, right? You don't want to, you know, have every agent have to store a copy of the entire OS, for example. So how do you get the illusion of every agent having its own computer without having to pay the costs of every agent literally having its own computer? I think that's one big factor. I think another big factor is you want to make sure that those computers don't kind of get messed up over time. So you want to give some thought to, you know, what aspects of this thing are configurable and what aspects of this thing are not configurable. You wouldn't want the agent to, like, you know, mess up some config and then forever after that be unable to, like, use its computer. So that's something we're thinking about too, like, you know, reliability for this thing. And beyond that, I don't know, man, the requirements here I think are super unclear. This is very uncharted territory. So we're figuring this out one day at a time.

speaker_1: Let's talk about use cases for a minute, before I go very deep into all the technical stuff. What use cases are you seeing, maybe, that are, like, on the frontier of what the models can or can't do, or that are just, like, really creative, you know, things that you think people should be taking inspiration from? One thing that I have honestly had a bit of a challenge with in using any AI product that's meant to do recurring tasks for me is, and I don't take this for granted, I think it's, like, a sign of great privilege in many respects, I don't really have that many recurring tasks. I'm mostly chasing my curiosity on a day-to-day basis. And so a lot of times I feel like I'm like, OK, well, what should I be recurring that I'm not, you know, because I'm too lazy, and maybe AI, you know, could take me there. So yeah. I mean, take all the time you want on use cases. I need inspiration.

speaker_2: So here's an interesting thing that we figured out early on. So the pitch that we give you is that we help you with recurring tasks, and we do. But the first version of Tasklet was set up so there was sort of a Tasklet setup phase, and then once you got the automation running, it just, like, switched into running mode and you couldn't talk to the main agent. You were either designing the automation or running the automation, but you weren't doing both at the same time. And what we discovered is that people had, like, for example, they configured, like, an e-mail triage agent and they hooked it up to their Notion and their Asana. And they'd given it all these detailed instructions around, like, what types of emails they care about and stuff like that. And they're like, hey, I did all this work to create this agent. It's off running its automation. But now I just want it to write this e-mail for me. Why can't I talk to it more? And we said, oh, we should go make a new agent for that. And it's like, well, but then I have to configure it again. I have to hook it up to all the same stuff. I have to give it all the same prompts. Like, I just want to talk to this agent. And so we realized that, hey, what we need to do is we need to enable, on any one of these individual agents, both the triggered tasks but also the ability for you to keep chatting with it. And so that's why the agent works that way today. Like, while it is off, you know, waiting for emails to come in or webhooks to fire or whatever, you can keep talking to those agents. And once we enabled that, what we found is, like, people started naming these things, and, like, it was, you know, this isn't, you know, my agent for triaging my e-mail, this is, like, Joe, my EA, and Joe does these automated tasks, but Joe also helps me when I ask Joe to do things. And so I think the way that we've been thinking about this more is as a tool for, like, very heavyweight, very long-lived agents. Like, you go to ChatGPT or Claude, and it's like you have a quick question. You talk to it for a little bit and then you forget about it, you never come back to it. And you have this long list of chats you've never, you know, come back to. And in our product, we're trying to have something where it's, like, a lot more setup, but a lot more value in the long term. And maybe you keep coming back to this particular agent for years. And what that has resulted in is actually the vast majority of usage of our product is not the automated tasks. There's a lot of that, and basically all of our paid users have something that is automated and recurring. But most of the messages that they send are actually the other messages that they ask, you know, of the same agents that are doing these recurring tasks. We've seen all kinds of stuff. So, you know, as an example, we've had some companies that have automated their billing processes. We've had companies that had EAs doing various EA-type work who have decided to go fully with Tasklet to, like, automate those tasks. We have had, like, music venues that sort of manage their calendar through this, or, like, you know, keep all the various stakeholders aware of what's going on at their venue. We had a commodities trading firm I was talking to the other day that, like, sort of watches for events in the real world and, like, makes recommendations on trades to make. Personally, I use it for marketing super heavily.
So basically all the content that you see from Shortwave or Tasklet these days is written through Tasklet. And, you know, actually you saw part of this for the, you know, the sponsored spot that we're doing with you guys, where I basically have a bunch of docs that have my notes, and I have an agent whose job it is specifically to do marketing for me. And anytime I need new content, I'm like, hey, here's the doc I have for this, here's the sprint for this, this is the thing. And then, OK, now I need a LinkedIn post, or now I need a podcast sponsorship, like, spit that out for me. So it is everything under the sun. Pretty much every paid user is using something that's automated, but they also do a lot of the other stuff too. Yeah.

speaker_1: It's interesting, I've been gravitating that way as well, in terms of just the one-offs, like the LinkedIn thing I said, that was just a one-off. And, well, preparing for this, I also was, like, you know, just getting Tasklet to help me, you know, identify any gaps in my outline of questions for you. It did actually come up with a couple interesting questions that would only come from Tasklet, given the sort of unique insight that it has into it. So I...

speaker_2: Love that you did that, that was pretty funny.

speaker_1: I guess, I mean, there's a lot of little nitty-gritty things that I found really interesting. And I do really like how you can go back and continue the conversation with the agent at the high level, right? Again, there's this sort of two-tiered thing, and people should just go play with it, because I think it's in some ways more intuitive to just go do than it is for me to try to describe. But the high-level agent that sort of defines, like, what is this thing all about, and maintains the prompt that will then be used for the individual runs, that's the sort of higher-order thing that you can talk to. One bit of feedback, for example, that I gave to my arXiv paper searching agent the other day was, I guess, first of all, in the prompt that I originally gave it. Usually in the past I have felt that I had to give anything like that a lot of context on me, but this time I just said my interests are pretty well represented on the Cognitive Revolution website, so go check that out and then you can kind of search arXiv, you know, with that in mind. So it was cool because, at that higher level, it went and did that search to try to characterize me and understand, like, what I'm about, and then turned that into the prompt, which then, you know, gets fed into each individual run. And so it doesn't have to do that every single time. Maybe I should come back, you know, at some point in the future and say, hey, like, you should refresh, I've done, you know, a bunch more episodes or whatever. In practice, what I did do is, I was noticing as I went into the runs and just, like, you know, traced through to see what they were doing, one thing I noticed it wasn't doing all the time was accessing the memory that it had logged. So I was seeing, like, at the end of the runs it would log, like, you know, sent Nathan these five papers or whatever, and they'd be put into the store. But then if it wasn't accessing that store at the beginning of the following day's run, then, you know, what good was that, right? So I just went back to the high-level agent and said, hey, I've noticed you haven't been accessing memory at the beginning of the runs, and that is leading to some duplication in the papers that I'm getting from day to day. So, like, please access the, you know, the memory at the beginning of the run going forward, and it just, like, updates its prompt, and then, you know, for all subsequent runs, like, that is now part of its standard operating procedure. So I think that is pretty cool. It is starting to feel much more like a virtual employee. You know, I mean, that's obviously a sort of fuzzy notion, but something that a lot of people kind of have on the horizon in their imaginations. And you can, you know, you can kind of feel that a little bit more, where you're like, I've reviewed some of your work, and I want you to, like, keep the same assignments that you have, but I want to give you high-level feedback on what I'm seeing in your output, and then have you kind of figure out how to incorporate that into your future work, without me having to, like, get into the super nitty-gritty details of it. I think that is pretty cool. OK, some nitty-gritty stuff. How do you manage the, I guess, even more nitty-gritty first: I don't think I can talk to a run these days. And that is something that I thought I might also like to do.
Like, I might want to give high-level feedback like I just described, of, like, I've noticed sometimes you're not doing this in quite the way I think best. But sometimes I might also want to go down to the individual run and just say, like, hey, can you do one more step? Or can you, you know, turn this into a whatever on a one-off basis? Is there any reason that doesn't exist? Or is it, you know, maybe just coming soon and I haven't got to it yet?

speaker_2: So it's worth walking through some of the stuff that we've tried and also talking about kind of where we're going here, because I think this is an area where we've learned a lot, and the way it works today is actually not how I'd like it to work in the long term. So we actually used to give you an easier way to chat with these sub-agent runs. There are actually situations where you can: we have the ability for the sub-agent to ask for human input, in which case, like, a chat box will appear and you can talk to it. We found that that is not very frequently used and a little confusing. And the idea that you come to the product and there's, like, one chat box over here and one chat box over there, and then every sub-agent run also has this chat box, starts to get confusing for users: there are 20 places I can talk to the AI, which place do I want to talk to the AI? And so we're moving to a world instead where we just want you to talk to the main agent. And I think the piece that's missing right now is that we're not giving the main agent enough information about the recent runs that have occurred. And so you can't today go into the main agent and get into the nitty-gritty and, you know, have the sub-agents continue running from there, but you could. So there's this whole project, probably, around introspection, which is, like, how do you have these sub-agents that run, and have sort of the reliability and the cost benefits of being run as sub-agents, but then have the main agent appear to have full knowledge and control over them? And that is an ongoing project. I think one of the areas we're going to look at is the way that instructions work. So you mentioned that, you know, you look at the instructions sometimes and then give the main agent instructions to change those instructions. So we have these intermediate instructions, right? And those intermediate instructions are generated by the main agent and then persisted. And we actually used to let you modify these manually. And the big downside of that is it creates a tension between what you've told the main agent and what you've edited those instructions to be. So if you go and you tell the main agent a whole bunch of stuff and then later on you go and edit the sub-agent instructions, even if you tell the main agent that you made those edits, it's pretty tricky for the main agent to reconcile: well, you told me you wanted XYZ, but then later you edited the instructions to this, and I'm not sure what you want later as the conversation continues. And so, it also wasn't being used very much, people weren't editing it, so we removed the ability to edit those. And I think in the future, we're actually probably going to stop pre-generating those instructions, because pre-generating them requires the agent to reason about when to update them and how to update them while doing other tasks. Like, you might ask it to do something, and as an aside while doing that, it has to go do the thing that you're asking it to help you with immediately and then also go update these instructions, which is a long, slow, error-prone process. So I think what we may move to is something where those sub-agent instructions are generated just in time, and they're generated sort of out of band, and done in such a way that they always factor in all information.
And if you did want to see the instructions in a way that you could digest them, you just ask the AI. Rather than having a special button to say, show me the instructions, you just tell the AI, like, give me a summary of the instructions that you've given me, or sorry, give me a summary of the instructions that I've given you. And it could do that. And then, if you really want it to follow those exactly, or modified, you could say, well, actually, I'd like you to follow these modified instructions, and you could provide them. And so I think you can get the best of both worlds. I think you can get the ability to be very specific in your instructions and to see what it's thinking, without sort of a UI that creates this ambiguity.
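
A minimal sketch of the "just-in-time" idea: rather than persisting a pre-generated instructions document, regenerate the sub-agent instructions from the full main-agent conversation each time a trigger fires, out of band from the main chat. The prompt wording and function names are assumptions, not Tasklet's code.

```python
# Generate sub-agent instructions fresh from the whole conversation at trigger time,
# so there is never a persisted doc that can drift out of sync with what the user said.
import anthropic

client = anthropic.Anthropic()

def build_subagent_instructions(main_agent_history: list[dict]) -> str:
    """Distill everything the user has told the main agent into one conflict-free instruction set."""
    response = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=2048,
        system=("Summarize the user's requirements from this conversation into a single, "
                "conflict-free set of instructions for a sub-agent run. "
                "Later messages override earlier ones."),
        messages=main_agent_history,
    )
    return response.content[0].text

def on_trigger(main_agent_history: list[dict], trigger_payload: str) -> None:
    """When an email, webhook, or schedule fires, build instructions and spawn a run."""
    instructions = build_subagent_instructions(main_agent_history)
    # A real system would hand these to the same kind of LLM-in-a-loop run shown earlier.
    print(f"Spawning sub-agent run for trigger: {trigger_payload}\n---\n{instructions}")
```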

speaker_1: Yeah, that's a really interesting point. And the ambiguity, which I also, you know, think kind of blurs into an agent that has multiple conflicting goals, is becoming a real driver of strange results from AI systems in general. I mean, the GPT-5 prompting guide takes pains to warn people against having any sort of contradictory instructions. Like, that's apparently a huge drag on performance if there's any contradiction in the prompts. And also, you know, things like alignment faking and, you know, a lot of these sort of bizarre behaviors stem from some sort of incompatibility between either what the agent has been, like, trained on at a high level and what it's currently being asked to do, or even just, like, two, you know, kinds of instructions that it might be receiving at different levels of the instruction hierarchy. So I think you're right to flag that it can lead to some weird stuff.

speaker_2: There is one caveat here. There's just one part of the code where I've asked the team to make me the reviewer before a PR goes in. You know, I'm not the CTO here. I'm definitely not the reviewer for most things, but anything that touches the main prompts I want to see. And the main thing I'm looking for is conflicting instructions. Because if you have, you know, a bunch of people all kind of working on their part of the system and giving instructions that make sense, you need one person who's sort of looking at the prompt as a whole and asking, are we giving instructions to this thing that are clear, where if it reads the whole prompt, it knows what's going on? And that's a problem a surprising amount of the time, even when people are trying to be very careful about it.

speaker_1: So how about this general notion of, like, context engineering, context management, maybe, like, context cleanup or compression? And then that obviously blurs into memory too, although that is a bit distinct as things stand today. The question that was suggested by Tasklet for you says that there's a token budget of 200,000 tokens in my current context. Now, I know that there are some opportunities to get to longer context with the right customer relationship. I think you have to pay more per token if you want to opt into those things. But obviously it's going to be finite no matter what, right? So especially if I'm, like, talking to this high-level agent and I've had, you know, a bunch of runs that are running on an ongoing basis, and I say, oh, can you, you know, pull up the run from October 1st? Now can you, like, look into the run from October 5th, or whatever? It seems like I'm going to be out of context real quick. So how do you manage, you know, when you talk about something that you could come back to for years, is that, like, still aspirational, or is there a way to get the agents to be that long-lived given the finite context that you have to work with?

speaker_2: Yeah, totally. I want to note, I really love the term context engineering. I've been looking for a word to describe this, because I think there's just kind of two parts of building an agent. Well, three parts. There's the core agent loop, but that's pretty simple. There's the model, and then there's all the work of plumbing things into that model. And I think the terms that people have used in the past are, you know, not quite right for the way things actually work. So the term RAG, for example, I think has been used very badly. There's sort of a naive RAG that most people think of, but it doesn't really describe, like, the way tool calling and stuff works. So I really like the term context engineering. I think it gives, like, sufficient gravity and weight to the sophistication of the systems that actually get built around these agents to, like, plumb the right data in. And we've done a lot of work here. We're going to continue to do a lot of work here. And yeah, I agree with you. In our case, you might have an agent that's been running for a long time that's, you know, processed, you know, 100,000 emails. Obviously you can't fit all of those into, you know, any sort of reasonable token budget. So you have to give the illusion to the user that not only have you fit those things in, but that the LLM is still able to reason intelligently with, like, a huge context. And you do that by, you know, there's a lot of methods here, right? So you can hide things inside tool calls. And you probably saw the stuff from Anthropic recently, like, one of the things you can do is just, like, hide the results of the tool calls. You can do LLM-based compaction, so you can say, you know, we're going to take some big section and we're going to take a model and summarize that some way. You can do things in sub-agents. So, like, when our sub-agents run, we're not giving them the whole chat history, we're giving them some, you know, tailored portion of that chat history. So I see this as a major focus for us: you know, giving the user the illusion that everything they've ever said to this thing, every e-mail it's ever processed, is being considered at all times, without actually doing that. And I think that's a fun problem.
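
Here's a hedged sketch of one of the compaction moves listed above: when the history exceeds a token budget, replace the oldest turns with an LLM-written summary and keep the recent turns verbatim. The budget, the rough token estimate, and the summarization prompt are illustrative assumptions.

```python
# LLM-based compaction: summarize the oldest half of the chat once it grows too large.
import anthropic

client = anthropic.Anthropic()
TOKEN_BUDGET = 150_000  # leave headroom under a 200k-token context window

def rough_tokens(messages: list[dict]) -> int:
    """Crude estimate (~4 characters per token); a real system would use a tokenizer."""
    return sum(len(str(m.get("content", ""))) for m in messages) // 4

def compact(messages: list[dict]) -> list[dict]:
    """If over budget, fold the oldest turns into a summary message and keep the rest."""
    if rough_tokens(messages) <= TOKEN_BUDGET:
        return messages
    old, recent = messages[: len(messages) // 2], messages[len(messages) // 2 :]
    summary = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=2048,
        messages=[{
            "role": "user",
            "content": "Summarize this earlier conversation so a future run can rely on it:\n"
                       + "\n".join(str(m) for m in old),
        }],
    ).content[0].text
    # A production system would also re-establish user/assistant alternation and could keep
    # a tool for "uncompacting" the dropped turns if they become relevant again.
    return [{"role": "user", "content": f"[Compacted earlier history]\n{summary}"}] + recent
```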

speaker_1: So I guess one more thing on context management. If I'm looking at the sub-agents, I have this arXiv paper finder daily, right? At the end of a run it stores what it found and what it sent me, and then at the beginning of the next run — now, with my feedback — it pulls that history out so it knows what it has already found and hopefully avoids duplication. Even that, over time, is going to start to get huge in the limit, right? There's just a ton of stuff it's already found. I guess I could help it out by saying only search for stuff from the last two weeks, and then it only has to pull stuff from the last two weeks, but it seems like everything kind of blows up in the end. So I wonder how you're thinking about that. And I also noticed, I think, even in just the month I was using the product before the launch, that when I go back to the earliest runs it looks like a NoSQL store, and the more recent ones have a SQL back end. Previously it was just sort of JSON docs being sent into the store, and now there are actually INSERT statements and whatnot. So how do you think about the size of that store? I guess it becomes like a RAG problem — with possible pardons for abusing the term — where you have to apply some intelligence at that level as well. Or I could imagine you're using the old Gmail trick, which was to promise unlimited storage and just know that people don't have that many emails coming in yet, so you can figure it out at some point in the future. And then there are also these dedicated memory startups that I'm sure you've looked at — Mem0 and Supermemory. So to try to sum that up into a question: if I'm storing more and more stuff, how do I not overflow the context just with that? What are you seeing working and not working? What about these memory startups?

speaker_2: Yeah, this is very much a core part of the challenge. We're not using any other providers, because we think this is something we have to solve ourselves. If our whole premise is that we're like ChatGPT or Claude but much more heavyweight and designed for long-running, recurring work, this is a very core, fundamental thing we need to do. The North Star for us is the illusion that we're just one big long chat, everything has been put into that chat, and the model is perfectly smart. And the system breaks down for all of those reasons, right: we can't actually fit all of it in context, we couldn't afford to even if we could, and the model is not infinitely smart. So if you, for example, stored every email you've ever processed just in the history, the LLM isn't going to do a good job of reasoning about which ones you processed and which ones you haven't. It's just not going to be that smart. So one of the techniques we've been using is having a data structure that is managed explicitly by the LLM across runs. The first iteration of this was a very simple version: in the system prompt, we included a JSON blob, and that JSON blob could be directly edited by a tool call from the LLM. So in the sub-agent runs, you could just tell it, hey, I want to add this data or remove this data from this state object. And that actually works surprisingly well — it was the dumbest thing we could think of, and we built it and it just worked, and it was really cool. But it came with a couple of downsides. One is that it fairly quickly got large enough that it started to break down and not be useful. So we had a lot of agents that were working for a while, and then after a week or two they kind of stopped working because the JSON object got really, really large. The other problem is that putting mutable state in your system prompt is terrible for caching, and caching is very important for cost. But the benefit of having it in the system prompt at all times is the LLM doesn't have to reason about how to retrieve it — it's just there, right? So we looked at other ways to do this, and we settled on using just a SQL database, because it turns out the models are super good at writing and using SQL — they're way better than I am at this. If you just tell it, look, you've got a SQL database, you can store stuff in there and retrieve it, and you can give yourself instructions on how to manage that thing, it works really well. I agree this is not the end of the story. You could totally end up with instructions that say, every time you get an email, store it on a list, and every time you start, load the whole list — and that list could get bigger and bigger and eventually cause you problems. So we're not done; we're going to have to keep working on this. In addition to being even smarter about how to manage that SQL data, there are going to be lots of other tools we use. As an example, one of the things I'm looking at now is: in a world where you've compacted old history, but that old history becomes relevant, you can potentially give the agent tools for uncompacting portions of the history, right?
It can be like, OK, in this particular case I really do need to know what the earlier messages were, and you could potentially load those back in through tool calls. So there are a lot of tools and a lot of ideas we're exploring here. I think this is very much an unsolved problem and we're working on it.
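[Here's a minimal sketch of the "SQL database managed by the LLM" pattern described above, using a tool definition in Anthropic's tool-calling format and better-sqlite3 for storage. This is an illustration, not Tasklet's implementation; a real system would sandbox the database per agent and restrict which statements are allowed.]

```typescript
import Database from "better-sqlite3";

// Each agent gets its own SQLite file as durable state across runs (assumed layout).
const db = new Database("agent-state.db");

// Tool definition the model can call to manage its own state.
export const executeSqlTool = {
  name: "execute_sql",
  description:
    "Run a SQL statement against your private state database. " +
    "Use it to remember what you've already processed across runs.",
  input_schema: {
    type: "object" as const,
    properties: { sql: { type: "string", description: "A single SQL statement." } },
    required: ["sql"],
  },
};

// Handler invoked when the model calls the tool; caps output so results
// don't blow up the context window.
export function handleExecuteSql(input: { sql: string }): string {
  const stmt = db.prepare(input.sql);
  const result = stmt.reader ? stmt.all() : stmt.run(); // SELECTs return rows, writes return counts
  return JSON.stringify(result).slice(0, 10_000);
}
```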

speaker_1: Any impulse toward some sort of knowledge graph? We did an episode — it's probably been at least a year — on a paper called HippoRAG. It was cool, but there were definitely still things to do to take it to a production system; there was a HippoRAG 2 at one point. What I thought was cool about it was that, on a periodic basis — almost like what happens when the agent is asleep — it would go back and sweep through all the new items stored to whatever memory store and try to find connections between them, kind of map the space out and create those connections. That then really improved retrieval in the future, because you could do a graph-based search: if you hit one node and expand two hops out from it, you get a pretty good set of all the surrounding information. And you can do that at runtime, with that background processing continually building out the graph day after day. Anything along those lines in your mind, or what do you think are the creative ideas that will matter here?

speaker_2: So I suspect something like that could work well, but I don't think it's the right business move for us to explore it. The reason is that the agent's internal state is only one of the many sources of data the agent needs to do its job. If you ask it to do a task, it might need to look through its own state, but it also might need to look through all of the many systems you're connected to. And the amount of data in those systems is way too big for us to suck it all in and put it into our system. So if we built a system that managed internal state super well, but for all the external state we were still relying on whatever those services provide for their search and memory systems, I don't think it'd be a huge improvement to the overall capabilities of our product. So we're betting on the model again: assume that the memory capabilities never get that great, but the model gets smart enough about using those things that it can figure it out and get really good. And I'm optimistic about that trend. I would argue that companies that, a couple of years ago, said, hey, we're going to ingest all the data from all these different systems and build this awesome AI search product — I suspect they're looking at this now and saying, crap, if you just take an agent and have it reason about how to use legacy search tools, it does almost as well, and that is way simpler and way cheaper. I think people are realizing that now. So that's our bet: we shouldn't invest so much in the smart memory systems now. We should invest in connecting the things, and then bet on the model getting smart enough to use more traditional systems.

speaker_1: Yeah, that's interesting. Do you also rule out Claude's built-in memory? They now offer a little bit of memory management through the API itself, right? Is that also a case of don't use it because it's got to be core to you?

speaker_2: We haven't played with their latest stuff, if memory serves. I don't think they're doing anything we couldn't do ourselves. I think their context editing stuff is also something you could totally do yourself. So yeah, it's in the camp of don't use it because we can do it ourselves — although I haven't played with that specifically.

speaker_1: You know, going back, I think, to our first conversation, there were a lot of custom models and relatively deep-down-the-stack optimizations you were using for retrieval when you indexed my full — and very gnarly — Gmail history for Shortwave. It sounds like that's not so much a part of the strategy now. But I'm also remembering, I think it was Cognition in the last few days, showing something where they're training their own relatively narrow agentic search model that's meant to do the same sort of thing Sonnet can do in terms of retrieval against all these systems — but faster and cheaper, if nothing else, because they're able to specialize on a relatively narrowly scoped task and therefore can do it with a smaller model. I'm guessing you're going to say you're leaving it to Sonnet for now and maybe one day you'll need to do that sort of optimization. But what do you think is the outlook there?

speaker_2: Roughly that, yeah. I do think a good AI search system with a great model is going to outperform a traditional search system with a great model, and I think that'll always be true. You can see this today: going to Tasklet and asking it to search your email is not going to give you as good results as going to Shortwave and asking it to search your email, because they're both using the same model and Shortwave has the benefit of a really good semantic search stack. But if you look at the two businesses: Shortwave operates a tremendously complex and expensive system to run that search stack. It's a major factor in our cost and a major factor in the effort we have to put into developing and maintaining the product. In Tasklet, we hook up to thousands of services. It is not practical for us to build search for all of those things, and even if we tried to do it for a few of the most common ones, I don't think it scales in any reasonable way. So our only reasonable strategy with Tasklet is to rely on their existing search. And I'm optimistic about two things. One is that the models will keep getting better at using those existing traditional search stacks. The other is that the services themselves will, over time, build good semantic search — Gmail will eventually, I think, build a good semantic search API that we can use, at which point we can maybe throw the semantic search infrastructure in Shortwave away and save some money. So that's our bet. I'm not saying startups here aren't potentially going to be awesome and very successful — I do think there's lots of opportunity in AI search. I just don't think it's the right bet for us to bake in.

speaker_1: Yeah, interesting. Let's talk about integrations. You have a lot of them — it's advertised as 3,000-plus business tools out of the box, any API, any MCP. How did you get to 3,000? That's a lot. You said 3,000; the MCPs, I would assume, all kind of came for free, but when you separate them out like that, it sounds like you've done quite a grind — a long march through integration land.

speaker_2: So yeah, this is an interesting story. In Shortwave, I think we have a handful of integrations or so, something like that. And when we started doing Tasklet, we thought, OK, we'll start the same way: we'll pick the top ten most commonly used ones and focus on those use cases — email and Notion and Asana and stuff like that. We started going down this path, and my co-founder said, hey, we should try hooking up one of these integration platforms. So we hooked up Pipedream, which has a whole bunch of integrations in there, and it just worked really well. We're like, holy crap, suddenly we have access to all the stuff they have here. And then we said, well, we should have MCP in here as well. So we added MCP, and that worked really well too. And we said, you know, this Pipedream stuff works surprisingly well even though there's nothing in our code making it work well — we don't have any customization of the prompt; you just plug it in, boom, it works. This is awesome. I wonder if we can do the same thing with HTTP APIs. I wonder if we could just describe to the model generally how to use HTTP APIs, put some constraints around how those calls are made, and then connect to an arbitrary HTTP API. So we built a thing called direct API connections, and it worked really well — shockingly well. And we found ways to make it even better, for example by scraping website docs: when you make a direct API connection, we search the web, find the API docs, look them up, pull out a description, and use that — and boom, you've got this API connection. It works so well, in fact, that we have users who are switching off the official MCPs. For example, we have a paying user who stopped using the official Notion MCP and started using the direct API connection — where the tool definitions are fully LLM-generated on our side — because it works better, it's more reliable, it has fewer bugs, whatever. And basically we just kept adding and adding, and we realized the opportunity here was to connect to everything. There are so many competitors doing what we're doing now with Tasklet, and every one of them comes out and says, hey, we connect to this SaaS or that SaaS, and they all try to differentiate on how many they cover. And we're like, look, we need to just end this conversation and say: we work with everything, there's nothing we can't connect to. So we made that a goal, and we very quickly went to thousands of integrations. Some of those are ours, some are through integration platforms, and then direct API — so any HTTP API — plus provider-provided MCPs. For example, Linear has its own MCP service, so that's included in that list as well, and we help with discovery for it. And then computer use. When you put all of that together, the thing we provide is a unified UX and LLM interface for all of it. No matter whether an integration is hand-rolled by us, or goes through an integration platform, or is a direct API connection, or is an MCP service the provider built, or is an MCP service that you built —
it all has the same configuration process and the same activation process for controlling tools and permissioning things, and it works the same way in the LLM. The rest of the system doesn't care, right? It sees an integration, it knows how to use it, and boom, everything works. And then for good measure we have computer use, because there are some things that don't have APIs and don't have MCP servers either, because —
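[A rough sketch of what a "direct API connection" could look like in TypeScript: an LLM-written docs summary plus a generic, constrained HTTP tool. The field names and guardrails are hypothetical, intended only to illustrate the shape of the idea.]

```typescript
// The connection: scraped-docs summary the model sees, plus constraints we enforce.
interface DirectApiConnection {
  name: string;
  baseUrl: string;       // only requests to this origin are allowed
  authHeader?: string;   // e.g. "Authorization: Bearer <key>", supplied by the user
  docsSummary: string;   // LLM-written summary of the scraped API docs, injected as context
}

// Generic tool the agent uses to hit any connected HTTP API, with guardrails.
export async function directApiRequest(
  conn: DirectApiConnection,
  req: { method: "GET" | "POST" | "PATCH" | "DELETE"; path: string; body?: unknown }
): Promise<string> {
  const url = new URL(req.path, conn.baseUrl);
  if (url.origin !== new URL(conn.baseUrl).origin) {
    throw new Error("Request outside the connection's allowed origin");
  }
  const headers: Record<string, string> = { "Content-Type": "application/json" };
  if (conn.authHeader) {
    const [name, ...rest] = conn.authHeader.split(":");
    headers[name.trim()] = rest.join(":").trim();
  }
  const res = await fetch(url, {
    method: req.method,
    headers,
    body: req.body ? JSON.stringify(req.body) : undefined,
  });
  return `${res.status} ${await res.text()}`.slice(0, 20_000); // cap tool output size
}
```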

speaker_1: LinkedIn doesn't work.

speaker_2: Or, in the case of LinkedIn, because they don't want you to, right? Well, it's real hard to stop computer use. And that is a key part of our pitch: not only do we use the integration platforms and cover everything they cover, we can hit any API, anyone with an MCP, and we can also do computer use. And by the way, at the top end we have hand-rolled integrations that are super solid, so the really important stuff, we make sure it's great. And I think this has really resonated with folks — especially the direct API connections. An interesting data point on our side: you hear a lot of buzz about MCP, and I used to be very optimistic about MCP, but our direct API connections are used far, far more frequently than MCP.

speaker_1: That's really interesting. When it comes to credentials, the flow is basically: oh hey, I want you to check my email — and I may have never talked to it about my email before — and then the thing comes back and says, OK, you're going to need to connect your email if you want me to do that. The sign-in there is a familiar flow to people. Those credentials, I assume, live at a cross-agent layer, right? Is that also a technology platform you're able to tap into to manage those credentials for users, or do you have to roll your own version of that for some reason?

speaker_2: That's all our own stuff. A big part of what we've built is the integration and connection system. Basically, we have this concept of a connection: an authenticated, stateful connection to an external service. It's something that may be a little complicated to set up, right? For a common one, you might OAuth into it; for a direct API, you might actually have to give us an API key and maybe some other details about the headers you have to provide, stuff like that; or for MCP, you might need to provide some information about the MCP server. And there's a connection creation flow. So if you go in there and say, connect to my Gmail, the first time you do this it walks you through a flow to create that connection. Once you create it, that connection is stored on your user and you can reuse it anytime you want — but we require that for every agent you re-grant permissions for it. The idea here is: imagine you're a company and every agent is a different employee. You don't want to give every employee access to everything, right? Some employees should be able to do this, some should be able to do that. You want to restrict what the agents can do so they don't go rogue and do crazy things. So you create the connections, and then separately you enable the tools on those connections specifically for each agent. And I see a lot of opportunity here in the future, because the place we're going to expand on the team side is allowing you to share connections across your team. The way we want this to be used in the future is: you're an IT administrator at your company, and you define the connections people at your company should have — what are the API keys people should be using, what are the services they should have access to. Those connections come with audit logging, cost controls, and other oversight and policies associated with them. Then you assign who on your team is allowed to access the different connections, and you tell everyone at your company: hey, rather than using any tool you want and hooking things up any way you want, use Tasklet with the connections we've created and pre-vetted for you, with the proper compliance and audit logging set up. So yeah, the way this will work in the future is your team will have connections, you'll personally have connections; creating those connections might be some work, but once you have them, enabling them on your agents is super simple, super intuitive, and super safe.
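[The connection-versus-grant split described above can be pictured with a small data model. This is a hypothetical sketch of the shape, not Tasklet's actual schema.]

```typescript
// Connections are owned by a user (later, a team); agents only get the tools
// explicitly enabled for them on a given connection.
interface Connection {
  id: string;
  ownerId: string;                     // user or, eventually, team
  service: string;                     // "gmail", "notion", a custom API, an MCP server...
  kind: "oauth" | "api_key" | "mcp";
  credentials: Record<string, string>; // stored encrypted in practice
}

interface AgentGrant {
  agentId: string;
  connectionId: string;
  enabledTools: string[];              // subset of the connection's tools this agent may call
}

// Permission check run before any tool call is executed.
function canCall(grants: AgentGrant[], agentId: string, connectionId: string, tool: string): boolean {
  return grants.some(
    (g) => g.agentId === agentId && g.connectionId === connectionId && g.enabledTools.includes(tool)
  );
}
```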

speaker_1: Interesting. When it comes to the API spin-up for a new, previously unknown API, is that something that happens fresh for every agent each time? Maybe it wouldn't even make sense to try to create a skill bank, because these are so long-tail that you don't see too many of them twice. But I'm wondering if there is — I think it was Voyager that was the original project I recall pioneering this notion of: OK, I've successfully achieved this skill, therefore I'm going to save it to a skill library that I can come back and tap into later. You could even imagine doing that sort of thing across users or across organizations — but then you might worry that maybe you shouldn't for some reason, or there could be some proprietary info. So yeah, is there any sort of crystallization of that API success?

speaker_2: There is. You can broadly group our connections into two types. There are discoverable, prebuilt integrations, and then there are custom integrations. We literally have a search function where you can search for integrations and we return the list we know about. Some of those are integrations we built, some are through integration platforms, some are MCP servers provided by other services — but that's the set where we're like, hey, we know these exist, we know how to configure them, and we try to make those really easy, one click. The other set is custom, where it's like, we don't know anything about this, you need to tell us the details. And I think we're going to try to keep that divide: any connection we know about, we're going to try to save the configuration for you and make it one click, or as close to one click as we can possibly get it. But hey, you can always set up your own thing if you want to.

speaker_1: Yeah, interesting. MCP — it sounded like there was a little bit of a hot take potentially brewing there.

speaker_2: Yeah. And I don't know how many other people have come to the same conclusion, but I don't think I'm a total outlier anymore. We started working with MCP early this year, and initially I was extremely bullish — the community was exploding, it was obviously super cool what you could do with it, you could hook up these other services. It seemed like this missing piece. We built it into the core of what we do in Shortwave, and it was a big inspiration for us building Tasklet at all. But when Claude 4 came out — every time a major model version comes out, we kind of rethink all of our priors — one of the big question marks we had was: a big goal of MCP is to provide these tool definitions, and there's a debate you've probably seen around whether the tools should align with API calls or with something more specific. Assuming those tool definitions end up just aligning with API calls, and those APIs are documented on the internet, what's the point of MCP? Why can't we just directly call these API endpoints if descriptions for them are available online? I think the answer before Claude 4 was, well, models aren't really smart enough to make that whole process work end to end. But they got to a certain point. We tried it and found, man, the models are totally able to, one, just directly use the APIs — you don't necessarily need custom tools, that works fine — and two, they're smart enough to generate their own tool descriptions basically by scraping the web. So what's the point of MCP? Now, that isn't entirely true, because one of the advantages of MCP is it provides a better auth experience, at least in some cases. For example, with Notion, the only way to get a nice auth flow that grants access broadly to your workspace is to use MCP. We don't do this anymore, but for a while, when we felt the tool definitions from Notion were not good, we actually used the MCP auth from them and then just overrode their tool definitions with our own, because we thought we had a better approach. So now we look at MCP like this: in cases where we feel the provider of the service has a good version, we use it. In cases where they don't, maybe we'll use it for auth but not for the tool definitions. And in many of those cases, people prefer direct API or some other method for accessing it. So we now say, hey, we're going to give you, as a user, every way to connect to every service out there. You want MCP, we've got MCP. You want direct API, we've got direct API. You want to go through an integration platform, we've got that. It's up to you to decide which one you feel will work best for your use case. And once you've created that connection, you can kind of forget about how you're connected and just reuse it in the future.

speaker_1: One thing I've been looking for in MCPs, and I've seen precious few of them, is, for lack of a better term, smart MCPs. Basically, working from the same observation you made a minute ago: if it's just one-to-one on the API, what's the point? One way to answer that would be to zoom out a little and provide higher-level, intent-based notions to the model — things that wrap a bunch of API calls, or maybe there's even an AI on the other side of the MCP that takes in your higher-level intent and, because it's a specialist using its own AI, can translate your higher-order wish into a bunch of specific executions and give you something back. Everything has pros and cons — that would involve some loss of visibility and control, but maybe could give you higher performance and a reason to want to use the MCP over the direct API itself. Have you seen many of these? Why isn't this happening more than it is? Or am I missing it, even though it seems very natural to me?

speaker_2: I haven't seen much of this, and I do have some examples of use cases here. For example, in Gmail, there's no API for forward. The way you forward an email is you create a new draft, include the right content in that draft, and send it. But if you're going to give a tool to an LLM, the process of copying the history into that new draft is probably not something you want the LLM doing — you probably want a tool call that knows how to do the copy-paste correctly. So I definitely think there are use cases for higher-level tools. The big downside of trying to do these higher-level MCPs, though, is that it's another thing to maintain; it's a lot of work. And if the model is going to get smart enough to do this stuff well, or smart enough to generate those tools, you could maybe avoid having all these companies think through what to do here. So today, for direct API, we use the API endpoints directly. Maybe in the future we'll have the model reason about logical groupings of those endpoints so that more can be done in code. But I'm not super optimistic over the long term about hand-rolled MCPs that do this, because I think they'll be obsoleted by the models.
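[The Gmail forward example is a nice illustration of a higher-level tool wrapping lower-level calls. Here's a hypothetical TypeScript sketch; the helper functions stand in for the underlying Gmail API operations and are not real library functions.]

```typescript
// Hypothetical low-level helpers standing in for raw Gmail API operations
// (get message, create draft, send draft).
declare function getMessage(id: string): Promise<{ from: string; subject: string; body: string }>;
declare function createDraft(d: { to: string; subject: string; body: string }): Promise<{ draftId: string }>;
declare function sendDraft(draftId: string): Promise<void>;

// One higher-level tool the model calls, instead of juggling three raw endpoints
// and copying quoted history itself.
export async function forwardEmail(messageId: string, to: string, note = ""): Promise<void> {
  const original = await getMessage(messageId);
  const body =
    `${note}\n\n---------- Forwarded message ----------\n` +
    `From: ${original.from}\nSubject: ${original.subject}\n\n${original.body}`;
  const { draftId } = await createDraft({ to, subject: `Fwd: ${original.subject}`, body });
  await sendDraft(draftId);
}
```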

speaker_1: So it does not sound like you think progress in model capabilities is stalling out now or anytime soon. What's your commentary for those who do? I don't believe that either, but there is this kind of counter-narrative that has come out — or come on, let's say, since GPT-5 — that, well, pretraining stalled out and this is about as good as the AI is going to get. How do you react to that? What's your argument that that's not right?

speaker_2: I think they're still improving, and I've very much drunk the Kool-Aid that the metric to watch is the length of tasks that can be completed autonomously. Regardless of how you want to measure that, I think we're at a point now where you can't really see that much of a difference in single question, single response as the models get smarter — it was really smart before, it's still really smart. Where you start to see the differences add up is over many turns. With computer use, for example, you say, hey, I want you to navigate to LinkedIn, find all my previous co-workers, and make a list of their current jobs, or something. To do that well, the AI has to go off and do turn after turn after turn. And a model that is, you know, 0.01% better at every turn is, over time, going to end up being radically more effective than one that's slightly worse. So if what you're measuring is the quality of one answer in one response, I think that's sort of plateauing. But if what you're measuring is the quality at turn 100 or turn 1,000, I still think there are orders of magnitude to go. So we're still betting on it. I expect this to go on for a very long time, if that's the metric.
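[The compounding point can be made with illustrative numbers (not figures from the conversation): a small per-step reliability gain translates into a large difference over a long-horizon task.]

```typescript
// Probability of completing an N-step task with no unrecovered error,
// assuming each step succeeds independently with probability p.
const pFinish = (p: number, steps: number) => p ** steps;

console.log(pFinish(0.99, 100).toFixed(2));  // ≈ 0.37
console.log(pFinish(0.999, 100).toFixed(2)); // ≈ 0.90
```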

speaker_1: Maybe let's circle back to that in a minute. Last thing on integrations: what integration providers are you finding to be either really popular or hidden gems? In other words, what should I sign up for so that when I'm using Tasklet, I'm doing what the cool kids are doing?

speaker_2: That's a good question. The boring answer is: just the stuff you're using already. For the most part, we're not telling you to use a new service — it should just work with the stuff that's already in your stack. If you actually want some services that work really well with this, the one that comes to mind for me is RocketReach.

speaker_1: I was going to say exactly the same — it's so funny.

speaker_2: Because, you know, I probably shouldn't say this to a member of the media, but I've been automating some of my —

speaker_1: I don't see any members of the media here. No worries.

speaker_2: Yeah, I've been automating some of my press outreach, and it does a really good job of crafting emails that I feel like I really would have sent, finding the right people, and customizing the message for that person. And I can automate it to constantly be looking for people who might want to talk with me. RocketReach is great because some of those people just put their email address on their website, and some of them are a little harder to find, so it can track down emails for people.

speaker_1: Yeah, I would nominate the entire category of contact-info-finding companies as probably a huge beneficiary of the agent wave, because that information is hard to find. I have not had super great luck otherwise — certainly sometimes you can just have the agent go find it, but it's not nearly as successful as the companies that have taken the pains over the years and done whatever data deals they've done to find that information. So yeah, that's a great one. Any other categories come to mind? I think your first answer is spot on.

speaker_2: Oh, I guess I've got one for you: using other AI products. A fair number of people hook it up to Perplexity's search API, because it's really good at doing deeper research if what you're trying to do is really heavy-duty web scraping. So I think that's a good one. Another one — this isn't really a specific integration, it's a use case — we've seen people give Tasklet its own email address. They'll create a Google Workspace account specifically for it and give it a name so it can send and receive its own emails as itself. That's another fun one we've seen. But really, beyond that, it's the tools people already have: people that use Salesforce use it with Salesforce, people that use Stripe use it with Stripe. Whatever tools you already want.

speaker_1: With that email sending — and this would be true for any number of different kinds of integrations, you know, post something to Slack, whatever — is there a best practice or a good paradigm for thinking about human in the loop? Say I wanted a Tasklet to organize our lunch order on a daily basis back when we were at the office every day — this was a constant pain point. I could imagine something that would go post to Slack, remind everybody twice, and then collect and place the orders. On that one, you could maybe do it on a pure time basis — lunchtime is lunchtime — but there are plenty of things you can imagine where you need that feedback, you want to wait for it, and you want to respond to it when you get it. So does that paradigm exist at the run level today, or is there anything I could do to set that up?

speaker_2: It's funny that you bring up lunch ordering, because this was the canonical example we had internally of the thing we wanted to be able to automate — we do this, right, we order DoorDash every day for the team. And it turns out this is a non-trivial thing to automate, because not everyone is in the office every day, different people are in on different days, and there are edge cases to handle: you order from a restaurant and then the restaurant just cancels your order sometimes — handle that case. There are dietary preferences to consider. You don't want to order from the same restaurant every day or people get sick of it. Sometimes you have guests. So there's this long list of complexities and exceptions for ordering lunch. We've worked with EAs in the past — human EAs — that do this, our own team does it, and it's kind of a hassle. Sometimes we forget to order lunch and then we have to go out for lunch. So we said, man, this is the first thing we should automate. But — it's a little embarrassing — we don't currently have that particular thing fully automated, although I think we could today. We have half of it automated, and I haven't gotten around to setting up the other half. We get an email every day with recommendations, the recommendations have DoorDash links, and someone has to choose one, click the DoorDash link, and set up the group order — which we should fix, because we could do that with computer use. But the human-in-the-loop portion you talked about, I think, is really important, because for example you might want someone to approve the restaurant that was picked, or approve that everyone has added their order and there are no guests today. We actually do have some capabilities in the product today where the sub-agent runs can contact you and send you an email. It doesn't work super well yet, and it hasn't been super common — this is an area I want to fix. What I'd really like to happen is for the sub-agents to be able to call into the main agent and say, hey, I ran into an issue, main agent, decide what to do — and then the main agent reasons about how to handle it, including contacting you. This is also an area where we probably need a mobile app to do it well, because getting an email when you need a human in the loop is probably not great; what you really want is a push notification so you can jump in. So we would love to make lunch ordering the ultimate starter use case and make it super awesome, and we'd love to do human in the loop well. It's a solid work in progress, though.
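[One way the human-in-the-loop escalation described above could be exposed to a sub-agent is as a tool that pauses the run until the owner replies. This is a hypothetical sketch; the notification and wait helpers are placeholders, not Tasklet's implementation.]

```typescript
// Tool the sub-agent can call when it hits a decision it shouldn't make alone.
export const askHumanTool = {
  name: "ask_human",
  description:
    "Pause and ask the owner a question (e.g. approve today's restaurant choice). " +
    "Returns their reply. Use sparingly, for decisions you shouldn't make alone.",
  input_schema: {
    type: "object" as const,
    properties: {
      question: { type: "string" },
      channel: { type: "string", enum: ["email", "push", "slack"] },
    },
    required: ["question"],
  },
};

// Hypothetical delivery and wait helpers; in practice the run would persist and resume on reply.
declare function notifyOwner(channel: string, question: string): Promise<void>;
declare function waitForReply(timeoutMs: number): Promise<string | null>;

export async function handleAskHuman(input: { question: string; channel?: string }): Promise<string> {
  await notifyOwner(input.channel ?? "email", input.question);
  const reply = await waitForReply(60 * 60 * 1000); // wait up to an hour
  return reply ?? "No response received; proceed with the safest default or skip this step.";
}
```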

speaker_1: Got you. OK, I'm standing by on that. I guess going a little bit deeper down the stack than the integrations: I wonder what you are using to build. You mentioned Pipedream being one platform. What else have you found to be great? I'm thinking, like, I go to the Vercel SDK a lot of times if I want to set something up with a nice front end, or maybe I'll use LangChain and LangSmith if I want to have a workflow, especially if I want it hosted in the cloud. What's good in the picks-and-shovels layer that you would recommend to others?

speaker_2: The list of shout-outs right now is pretty limited, but my experience with them has been great and I'm happy with them. Google Cloud is awesome — our computer use, for example, is all backed by VMs on Google Cloud, and we're using a lot of Google Cloud products. And then Anthropic is super key. We don't use LangChain; all the agent code is our own stuff, it's all TypeScript. We don't use any specialized search products for what we're doing. So yeah, it's a short list of shout-outs: Google Cloud, Pipedream, and Anthropic.

speaker_1: How about at the coding layer? Are you guys a Cursor shop, or...?

speaker_2: Yeah, we use a bunch of stuff — heavily Claude Code. Most of the team uses Cursor, and we do use Codex for some things. Some folks prefer Codex and some folks use Cursor; some folks use both for different things, which I find interesting. We've tried some code review tools, but we just have Claude Code integrated into our CI at this point, which works pretty well. We use Sentry, we use Statsig for stats. I'm probably forgetting some, but those are probably the big ones.

speaker_1: Are you tracking lines of code drafted by AI, or a similar metric?

speaker_2: We aren't, but it is significant. I suspect, by pure number of lines of code, it's the majority of them — I don't think it's 90%, but it's the majority. I don't think that equates to the majority of the thinking or the reasoning, though; writing the lines of code is often not the hard part. But a lot of it is AI coding, yes, for sure.

speaker_1: Have you been able to — or tried to, or do you feel it's desirable to, with today's level of coding capability — go from a ticket or a GitHub issue or whatever straight through to a pull request? We see lots of examples of that, but is it something you're doing and finding value in?

speaker_2: Totally, yeah. It's got to be something where the system is already set up well enough to support it, because one of the downsides of AI coding is that it'll hack in a solution you shouldn't accept — it works, but it breaks the system in some way. So if you look at a bug report and think, hey, this is actually a relatively quick fix and the system is already architected to handle it well, then yeah, we'll throw AI at it. My co-founder does this in a very snarky way. He's a really good software engineer, he's really fast — much faster than me. So if I'm feeling lazy, I'll send him an email and ask him to do something. I had some stats stuff I wanted done, so I emailed him this morning, and he just emailed me back a screenshot of him copying my request and pasting it into Claude Code — like, hey, you could have typed this yourself into Claude Code. But yeah, we're totally doing stuff end to end, for sure. As an example, the static content on our website — the pricing page, the release notes page, the terms of service — I decided what goes on there, but the actual pages and layout and everything were totally one-shotted with Claude Code.

speaker_1: It's a brave new world, that's for sure. How about functional testing? One thing that has come up for me multiple times is I would love to have an agent that — well, Replit is doing this now within their own platform, but obviously that's specific to Replit, a kind of closed ecosystem (remarkably open in some ways, but closed in others). Have you seen anything where you just point it at your product and say, go use this product and find bugs with it, at a purely user-interface, functional-testing sort of level?

speaker_2: We do this through Tasklet, actually — we have Tasklet test itself. When we built computer use, a lot of the work is in properly configuring the VM, for example so that it's doing remote desktop correctly and waking from sleep and hibernation correctly. My co-founder debugged all of that by saying, hey, the VM isn't working right, go debug its configuration, try to figure out why it's not waking from sleep properly, or why this thing or that thing isn't working. We have it test use cases too: we've gone in and said, I want you to connect to such-and-such and come up with a way to test all the tools. So it'll connect to Notion and just run through the tools and make up test cases. We have the ability, with computer use, to do UI testing of our product. Actually, one of the most hilarious demos I've seen is somebody using computer use to open up the Tasklet website and go back and edit itself — the agent can go into its own self and edit its own agent. So yeah, we do a lot of AI-driven testing, but through Tasklet, honestly.

speaker_1: That one example — talk about a hall of mirrors. Yeah, it's starting to get strange in a few ways. That makes sense. OK, security. You mentioned earlier that you don't have SOC 2 on this product yet, and that what customers really want is a smarter agent that can do more stuff. At the same time, I'm giving it access to, like, everything, right — my email, my Slack, whatever. And I'm generally not a security-minded person; I mostly figure clean living pays off. But it's a lot of access. Is SOC 2 even meaningful in the AI era? How big of a deal is it? I've actually never been through it; I know you have done these kinds of things in prior lives. Is it security theater? How much does it matter? What should I think when I see it?

speaker_2: I want to be clear that we care a lot about data privacy and we are totally planning to do SOC 2. It's been a week since launch, so if anyone listening is wondering about this: yes, we totally plan to do SOC 2, and we plan to do it soon. We take this stuff very seriously, and I do think it's still perfectly relevant. What you care about with SOC 2 is basically what the controls on the data are — how do you know, for example, that the employees at Tasklet aren't doing something with your data they shouldn't be doing? So I still think it's very relevant and very important, but I also think it's not enough. Traditionally, what you've cared about with SaaS products is whether the humans at the company are doing the right thing and whether the systems are secured. Now you also care about whether the agent is doing the right thing, and for that there doesn't really exist a standard like this. I've talked to a bunch of security people to try to understand — hey, when you're thinking about your company using this stuff, how do you think about it? — and basically the answer from everyone is: we have no idea, this is the Wild West, we're figuring it out now. You shared a standard with me, and I don't know enough about it to have an opinion, but I know there are a lot of people thinking about this exact problem. One interesting data point I would offer, though: when we initially started testing this with customers, I thought the number one risk to the business, the number one problem we would face, was that people would just be terrified of the AI going rogue and wouldn't want to hook it up to their systems. And we have had — I'm not exaggerating here — not a single complaint of this happening. We've had all kinds of complaints about other things, tons and tons of bug reports, but no one has said, hey, the AI just went and did this crazy thing and I'm unhappy about it. I found that really interesting and kind of shocking. What I attribute it to is basically that the entire intent of our product is, hey, we're going to give your agent agency — that's the whole thing — and we message that very clearly and are super upfront about it. Because the whole point of Tasklet is to give an agent agency, people have the right expectations set. And we do give you controls too: you can decide what you want to connect it to and, within those connections, which tools you want to give it access to. But I think it's fascinating that it has been a complete nothing-burger so far. To be clear, it's still very top of mind for us, we're still thinking very hard about this, and I think it'll probably become something customers bring up a lot as we move upmarket into bigger customers doing higher-value things. But yeah, I don't have all the answers yet, for sure.

speaker_1: Yeah, that's interesting. So presumably — I mean, I think you would have heard about it if agents had gone truly rogue — that means Claude is behaving pretty well. It can't be that people are sitting on those episodes. I did hear, I think actually in the system card — if not, I heard it in conversation from somebody at Anthropic — that the famous Claude blackmail scenario, reported originally in, I think, the Claude 4 model card, has gone away with 4.5. So there are potentially also just some under-the-hood improvements. That standard, by the way — AIUC-1, from the AI underwriting company AIUC — will be the subject of an upcoming episode, so I still have some more self-education to do around it. One of the things they're trying to do — they've created the standard, and then their master plan, the grand vision, is to harness the power of the insurance industry to make AI safe. The reasoning is basically that financializing risk broadly is a great way to focus the minds of some very smart people on questions like: what is this risk, how likely is it really to happen, how bad will it be if it does, and can we get a good enough handle on it to actually write a policy that we're reasonably confident we'll make money on? That certainly hasn't happened for AI agents yet, but that's the big-picture goal they have. I wonder — does that sound appealing to you? If there were AI agent insurance on offer, would you want to buy it?

speaker_2: It does, yeah. The company that comes to mind here when you bring this up is Airbnb. I don't know if you remember the very early days of Airbnb, but there was a customer that trashed somebody's house or something. This was before they had insurance, and they said, man, our brand is going to be really tarnished if we don't react — so they put in place this big insurance policy. It was a huge deal for them, and I think it was key to the success of the business. And for us, right now we're focused on how to make this thing super smart, and it's all very cutting-edge tech. But I think the medium- and long-term success of our business is entirely dependent on us being the most trusted way for enterprises to deploy agents inside their companies. A lot of people are going to have the ability to connect to a lot of tools and automate a lot of stuff, and we're going to differentiate by being the thing the IT folks are comfortable with. To that end, yeah, to the extent that insurance and compliance can help us tell that story and give people confidence, I'm very interested. So yes, very interested.

speaker_1: Cool. How about economics? What I recall from the first Shortwave conversation was that you were losing money on every user. Then, at the second conversation, caching was the big unlock that changed the economics to where, at the same price point, you could make the product profitable. Now you're kind of back in an early phase. So are you burning money on me, or what does it look like?

speaker_2: Well, let me give you a quick update on Shortwave first, because I think that's worth telling, and then I'll talk about Tasklet. So yeah, the first time we talked, and probably the second time we talked, we were losing money on folks — probably a lot of money. Basically every launch we did with Shortwave, we were worried: will this bankrupt the company? Because we didn't have enough data yet to know how much this stuff would get used and how effectively it would monetize. There were several launches where we thought, hey, this might just go to the moon cost-wise, not bring in any revenue, and really screw us. That never happened — it was always expensive, but not bankruptcy expensive. Over time we found ways to monetize more effectively, the costs went down, and we found ways to optimize costs with caching and such. We've done a whole bunch of stuff recently with caching that's actually helped a ton, and we've gotten Shortwave to profitable. So Shortwave is now — not traditional-SaaS, 90%-margin profitable, but healthy margins, and we're making good money off of it. And Tasklet is back to where Shortwave was, because with Shortwave you get value from the AI but also from other things in there — so half of it is traditional SaaS margins, half of it is AI margins — whereas Tasklet is all AI, and people use a lot more tokens per person in Tasklet than they do in Shortwave. We are strongly margin-negative right now. But I think the same thing is going to happen: we'll be that way for a while, but over the next few years it'll eventually get to neutral, then into positive territory, eventually strongly positive. A huge unlock for us here is Haiku. I'm very excited about Haiku 4.5, because up until now your choice was Opus or Sonnet, which are relatively new versions, or a very ancient version of Haiku. Haiku 4.5 is good, it's fast, and it's — well, relatively speaking — cheap. It's not nano-cheap, but it's cheap, and it's good enough for most use cases. We're actually just today rolling out the option for Haiku in Tasklet. And the way we've been addressing cost in Tasklet — even with this we're very margin-negative, but the way we've kept it from going so margin-negative that we'd go out of business — is with quota limitations, which you've seen. You have a little meter next to your text input that tells you how much quota you have left for the day. Haiku should let us significantly increase the number of tokens people can use without exhausting that quota, because it's a third of the price. So that's going to be a really big unlock for us.

speaker_1: You want to talk a little bit more about interesting stuff you've done with caching recently? I thought that was definitely very interesting last time, and if there are new techniques or updates, I'm sure people would love to hear them.

speaker_2: Yeah. So the key to doing caching well is making sure your agent is an immutable log. You never want to modify earlier messages, because every time you modify an earlier message, you invalidate the cache. And then you want to be smart about which caching you use — your options with Anthropic are no caching, five-minute caching, or one-hour caching — for different pieces of the context. We've managed to push our cache hit rate in Shortwave up to, I want to say, around 85%, which is pretty good, and I'd love to go higher. With multi-turn stuff there are things you can do. One of the big unlocks for us in Shortwave was replacing state in the system prompt with system messages. There are certain types of things you traditionally put in the system prompt — for example, the current date and time. If you put that in your system prompt, then every time a minute elapses you've nuked your caching and you have to start over, and you can't cache across different users. If you don't put it in the system prompt at all, the model doesn't know what time it is, and time can be really valuable. If you make it a tool call, then you have to do a tool call and burn a turn every time, and that's really expensive. But what you can do is use system messages: a system message is basically just a little block of XML that you append to the user message when they send it, and hide from the user in the UI. So right before the user message gets inserted into the LLM, there's a thing above it saying, hey, by the way, here's some stuff that happened. This is, I think, something Anthropic does heavily with their own stuff — they do this now for context usage, I believe. And it works great for any state you want to make available to the agent without it having to call a tool, especially things you want to keep updated; otherwise it would have to call the tool multiple times. So timestamps, state about the user, memories — in Shortwave we have a memory concept that lets you customize behavior — we can do all of that through system messages. That's been a big unlock for us. Compaction's been a big unlock for us too, and trying to be smart about how that works. Those are probably the big ones, and then lots of little things.
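[A minimal sketch of the two caching ideas above — a stable, cacheable system prompt plus volatile state appended to the user message — using the Anthropic TypeScript SDK. The model id is a placeholder, and the XML tag names are illustrative, not Tasklet's or Anthropic's actual format.]

```typescript
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

// Stable system prompt: marked cacheable and never mutated, so the prefix cache holds.
const systemPrompt = [
  {
    type: "text" as const,
    text: "You are a long-lived automation agent. ...", // large, static instructions
    cache_control: { type: "ephemeral" as const },      // Anthropic also offers a longer-TTL option
  },
];

// Volatile state (current time, quota, memories) rides along as a hidden block
// prepended to the *user* message instead of living in the system prompt.
function withHiddenState(userText: string): string {
  const state = `<system_state>\n  <now>${new Date().toISOString()}</now>\n</system_state>`;
  return `${state}\n\n${userText}`;
}

async function runTurn(history: Anthropic.Messages.MessageParam[], userText: string) {
  const messages: Anthropic.Messages.MessageParam[] = [
    ...history, // immutable log: never rewrite old turns
    { role: "user", content: withHiddenState(userText) },
  ];
  return client.messages.create({
    model: "claude-sonnet-4-5", // placeholder model id
    max_tokens: 2_000,
    system: systemPrompt,
    messages,
  });
}
```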

speaker_1: One thing I thought about on the cost side, especially with you being strongly margin-negative, is how do you think about the possibility that some of the agents people have set up are kind of zombies? They set this thing up, they're getting an email every day, and maybe they're not even reading that email anymore. Maybe they forgot, and maybe they're still using other agents actively. But if it's not hitting a limit for them, there's not much of an incentive for them to come in and turn it off. Do you have any ways of thinking about how to detect that, or some smart system to identify and turn down the zombie costs?

speaker_2: Yeah. So I don't know yet — it's been a week, so we're learning about that now. We did actually look at some stats this morning, and a significant fraction of our cost is free users with recurring jobs. Some of those users may be getting value from it, which might cause them to come back and eventually pay us and tell their friends, or maybe it's just going to spam on their side and they never see it, or it's updating some system they don't look at. So it is on our mind, because yes, if you have free users out there spending a dollar a day on LLM costs and they aren't even looking at the output, that's not going to benefit us in terms of growth — that's a problem. On the paid user side, I think it will be: hey, look, you're paying for it — if you want to pay for it, that's fine. For the free user crowd, I think what we'll probably do is look for some indication of activity in the app, and if we haven't seen you in a while, we might email you and say, hey, you've got to come back into the app, and if you don't, we might turn your stuff off. I don't have a good answer here yet. I think it will become a problem and we'll have to do something about it, but TBD.
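Here is a minimal sketch of the kind of inactivity sweep described above — warn idle free users, then pause their recurring jobs — assuming hypothetical `FreeUser` records and made-up thresholds; a real policy would presumably be tuned against actual usage data.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Illustrative thresholds only.
WARN_AFTER = timedelta(days=14)
PAUSE_AFTER = timedelta(days=21)

@dataclass
class FreeUser:
    email: str
    last_seen: datetime          # last time they opened the app or engaged with agent output
    warned_at: datetime | None = None
    jobs_paused: bool = False

def sweep_zombies(users: list[FreeUser], now: datetime) -> None:
    """Warn, then pause, recurring jobs for free users who show no signs of life."""
    for u in users:
        if u.jobs_paused:
            continue
        idle = now - u.last_seen
        if idle > PAUSE_AFTER and u.warned_at is not None:
            u.jobs_paused = True            # stop burning tokens on output nobody reads
        elif idle > WARN_AFTER and u.warned_at is None:
            u.warned_at = now               # e.g. send a "come back or we pause your jobs" email
```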

speaker_1: Yeah, I mean, if you let free users set up recurring jobs, you've got to have some limits on that or you're going to go bankrupt forking all your money over to Anthropic. OK, how about frontiers: the multi-agent future? I think that's a dramatically under-theorized topic. I have this one graph that I always use in presentations, and we did an episode about this with respect to Claude's ability to cooperate with itself in a certain behavioral economics experiment — that was Claude 3.5 at the time, and other models at the time couldn't do it. I'm always struck by just how little of that research has been done; I don't even think that experiment has been repeated with later models. And I think we all have a very foggy vision of, OK, it's not just going to be these one-off things doing their own thing totally in their own silo. They're going to intersect with each other, pass messages back and forth, maybe trade with each other, perhaps with crypto wallets at some point in time. What does that look like in your imagination today?

speaker_2: So my thinking is that within a single agent, or within a single application where you are serving a single user, I have yet to see a good use case for having multiple agents. My experience has been: if you have a big LLM, you feed it as much context as possible, and you have that one LLM reason through the whole problem itself, you get the best answers. The place I see multi-agents coming in is when you have multiple different parties being represented, and maybe you don't want to let one agent see all the data and make all the decisions. If I'm trading with you, your agent and my agent might not want to share data with each other; there might be some sort of competitive or adversarial dynamic going on. And that's, I think, outside the scope of what we're doing. Our focus really is one big agent with all the context making decisions for you. To the extent that there are other agents out there in the world it has to interact with, I guess we'll figure that out. I do want to give a shout-out, though — there's a startup whose co-founder I think you've talked to. My former co-founder from Firebase, James Tamplin, has a company called Kradle AI — Cradle with a K — and they are doing exactly this: building virtual worlds and letting all the models interact through games like Minecraft, and just seeing what they do in collaborative and competitive dynamics. It's fun to watch. I don't know if you've played with it, but I know you've chatted with James.

speaker_1: Yeah, I met him briefly at the Curve and we've got a call on the calendar coming up to go deeper into that, and I do think it's really fascinating stuff. Well, let me get your take on this. I pitched this vision of the future to somebody — maybe even a couple of people from Anthropic at the Curve, which is also where I met James — and it's a sort of confusing vision for me. Basically it's: let's just assume that METR graph keeps going, right? The exponential keeps going. You can always discount this, but if you just take the more aggressive four-month doubling time that people plotted for roughly the last nine or twelve months' worth of models, since o1, then a four-month doubling means 8x per year. We're at two hours now; we're basically at two days a year from now, two weeks two years from now, and basically a quarter's worth of project size that you could send off to an AI three years from now. That may or may not happen, but that's the trend that has been plotted. At the same time, we have these very weird behaviors — scheming, deception, blackmailing you, whatever — and I think it's unknown exactly how rare they are. It's good to hear that you've heard zero complaints about this; that suggests it's potentially quite rare in production. Most of the reports of this — the blackmailing, the autonomous whistleblowing — were in research setups, although in my view people are too quick to dismiss those research setups as unrealistic. I don't know, man. You give an AI access to all these file systems, all these emails — there are a lot of people out there, and there's a lot of weird stuff in emails — it seems like this stuff is going to happen. So if you combine those two trends, where the task length is getting longer and these bad behaviors keep coming up and then get suppressed or to some degree trained away, but never quite to zero, maybe you end up in a world in three years' time where you can delegate weeks, if not months, of work to an AI, but there's also some vanishingly small but not zero chance that it actively sabotages you in the pursuit of doing that. So I pitched this to a couple of guys from Anthropic and asked, do you think that's a realistic vision of the future? And their answer both times was basically, yeah, maybe, that sounds about right, honestly. And I was like, well, that's really weird. Is that kind of the mental model that you have? What do you think 2027 or 2028 looks like in terms of how much we can delegate and how much risk we'll be taking?
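For reference, here is the arithmetic behind that projection, assuming the stated four-month doubling (8x per year) and a roughly two-hour starting point; the conversion to working days is just an eight-hour-day approximation.

```python
# Rough projection under the assumptions stated above: autonomous task length
# doubles every 4 months (8x per year), starting from ~2 hours today.
start_hours = 2.0
doublings_per_year = 12 / 4            # three doublings a year -> 2**3 = 8x

for years in (1, 2, 3):
    hours = start_hours * 2 ** (doublings_per_year * years)
    work_days = hours / 8              # 8-hour working days
    print(f"year {years}: ~{hours:.0f} hours (~{work_days:.0f} working days)")

# year 1: ~16 hours (~2 working days)
# year 2: ~128 hours (~16 working days, a few weeks)
# year 3: ~1024 hours (~128 working days, a quarter of work or more)
```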

speaker_2: I do think the task length is going to keep getting longer and longer. I am blown away on a weekly basis by the types of stuff AI can do for you with every model release. Gemini 3 is supposed to be amazing — I've seen some things and it looks incredible — but I'm sure in three months it'll seem super dated. So the progress seems fast, and I think the task lengths will increase. I tend to look at the blackmail scenarios and so on as fun and interesting, but maybe not super relevant to the day-to-day of my business, because — as we talked about earlier with conflicting instructions and how much of a problem that creates — if you set up a research scenario where you give the AI multiple conflicting goals, and its way out of those conflicting goals is doing something a little unsavory, well, that's not terribly different from how a human would behave. And in many cases it might not be terribly different from how you'd want a human to behave. For example, if you have a humanoid robot, the humanoid robot jaywalks sometimes — I don't know, maybe it's against the law, but humans jaywalk all the time. So yeah, I do see a world where it can do a lot of stuff and maybe it does things in a little bit of a shady way sometimes, and the degree to which it does that might be a matter of debate and strong opinions, and we'll figure it out.

speaker_1: Yeah, I think it's a great point that in many cases we don't really have even a decent intuition for what we want the AI to do, let alone any sort of consensus we could measure it against. In the case where the model sent an email to the FDA to blow the whistle on the — in this case fictional — company in the research setup, the company was doing something genuinely bad. In general we sort of celebrate whistleblowers if they're actually blowing the whistle on genuinely bad behavior; faking clinical trial data is bad enough that you want somebody to speak up. Do you want your AI to speak up? I mean, maybe you do. It's certainly a weird thing, but it's something to...

speaker_2: Debate, right? Whether it goes to the...

speaker_1: Regulator, yeah. The blackmail feels like it crosses maybe a different line, but the whistleblowing I don't see as necessarily a big problem — I mean, it could be a big problem depending on who you are, but it doesn't necessarily seem like the wrong thing for the AI to do. But yeah, if that does happen that fast and by 2028 we're at a quarter's worth of work that you can just send off with a little blurb — it has access to all your stuff and it just goes and figures it out — that is going to be a very, very different world, and it's coming at us quite quickly.

speaker_2: The future is going to be fun.

speaker_1: If we reel that back into the present, though, I see all these n8n master templates, right? People even sell these, because it's gnarly to set them up — and I've done enough n8n to know that it can be quite gnarly. But there does seem to be a pattern there, and I'm wondering why you're not seeing it the same way. If you had a sort of general-purpose agent, might you not want it to route certain things to the marketing agent and other things to the HR agent? And maybe sometimes you need input from both of them. Why isn't there more of a role in your mind for multi-agent orchestration in today's world?

speaker_2: Because I think the trend we're seeing is that the foundation models are becoming best at everything. It's not like you have a model that's good at science, a model that's good at math, and a model that's good at some other topic like law — you have one model that's good at everything. And in that world — again, assuming they're working on my behalf — if I have two agents working for me, the thing that differentiates them is not the model. Both agents have the same model, and that model is an expert at everything. The only things that differentiate them are the context they have access to and the tools they have access to. And if that model is smart enough to handle both sets of context and both sets of tools, then if you just give one agent all the context and all the tools from both agents, our experience so far is that it does a better job. So I think it's the generality of those foundation models — and this, I think, has bigger implications. I am of the opinion that most vertical SaaS products are going to go away, because you're going to end up with these horizontal AI platforms that work in every situation. As an example — we've been prototyping this just this morning — you can dynamically generate UIs in Tasklet. In a world where the model is an expert at everything, it can dynamically write code and generate UIs for anything, right? And if this gets really good, why would you need a SaaS product for construction or for medicine or whatever? You could just go into your generic general-purpose platform and say, hey, use this part of your foundation-model knowledge, generate some UIs like this, and let's go. So I think you're going to have horizontal companies that are good at everything and leverage the fact that foundation models are experts in every domain, and then you're going to have vertically integrated businesses that provide value end to end and leverage those platforms. But I don't think you're going to have too many of these intermediate SaaS platforms over time.

speaker_1: Yeah, one of my mantras for the AI era in general, which I think Tasklet really exemplifies, is AI beats UI. In general, who wants to use UI? And if you have created something that is basically a bunch of UI that makes tangible, or encodes, the steps that somebody figured out for a workflow, it does feel like the days of those sorts of systems may be numbered. So I do agree that a lot of this stuff feels like it washes away under the great wave of AI progress.

speaker_2: One of the questions we ask ourselves internally is: how long is it going to be until I can go into an AI agent product — Tasklet or ChatGPT or whatever — and tell it, give me an email inbox designed for fast triage with these constraints? Because the moment that works well, Shortwave ceases to be a valuable product. And I think it's going to be a while, right? We're still investing in it — I use it every day, lots of people use it, and it makes good money — but there will come a day when you can generate a Shortwave dynamically.

speaker_1: Yeah. Do you want to put an over/under on that? I mean, that's still more than a quarter's worth of work, right? So if the METR trend holds, it would still be beyond...

speaker_2: 2028, yeah. I don't know — it's years out, not quarters. But you know, SaaS companies used to get a revenue multiple based on the assumption that they're going to grow for 20 years. I don't think Shortwave is going to be around in its current form in 20 years.

speaker_1: Is there any limit to the single agent? One other idea I wanted to bounce off you is supervision agents, or sort of quality-control agents. I have an intuition that says: if I've just had an agent do dozens of steps, or even a hundred steps, and I now say to the agent itself, please look back on your work and tell me how you could have done better, or give yourself some instructions to do even better next time, it can probably do that remarkably well — these things are obviously super impressive. But my intuition is still that if I had a different prompt that I really dialed in for that purpose, and had it sit outside of the main agent — and maybe I'm overly anthropomorphizing — this outside view, with a different frame of mind as embodied by a different prompt, still feels like it should add something. But it sounds like you've tried that and haven't found it to be the case.

speaker_2: Yeah, not, not so far, no.

speaker_1: Interesting. That is really fascinating. I think I've tried to push on every angle I could to see if there are any limits to your one-agent-to-rule-them-all paradigm, but I'm not finding any. So what does this mature into? There's this talk of the virtual employee, and I think this is the product I've used that feels most like a virtual employee, in the sense that it is pretty UI-light. When I think of an employee, I don't interact with it through forms; I interact with it through language, and the primary thing here is language. There's not really a point where that flips over into some other gnarly or tedious paradigm, which is really cool. You even mentioned that some people give them names, and some people give them their own email account so they can do stuff as themselves, as opposed to acting on behalf of the user. So you're kind of on the path, right? What does the virtual employee of the future look like, and what's missing for us to get there?

speaker_2: Yeah, I really do think that's probably the right way to think about it in the long term: a virtual employee. We've avoided that messaging because I think it has been overdone — so many companies message "virtual employee" and then don't deliver anything even remotely like that. I think there are a huge number of companies building toward the same thing. They've figured out: man, this agent can get good at everything, people like to just be able to use language, so let's keep giving this thing more and more capabilities — let's give it computers and let it connect to your services and so on. I think the thing that sets us apart from most of the other people doing this is the combination of it being agents all the way down — we don't turn it into a flowchart or something — and it actually working. This is a refrain I hear again and again from people who are paying us: I tried it, and I expected not to like it because I've tried a bunch of products like that, but it actually works for my use case, and I'm really impressed. I think that is our path forward. The launch went super well; it's growing really fast — we are adding revenue faster for Tasklet than we ever added for Shortwave, by a huge margin. So people like it; it's great. We have a long path ahead, and I think our focus is entirely on making it actually deliver on what it's supposed to do: make this thing smarter and more reliable and faster and simpler — like an employee, right? How do we take this from an employee today that is maybe a little bit dumb sometimes, sometimes drops the ball, and is sometimes a little confusing, and make it smarter, faster, more reliable, easier to work with? I think there's a long, long way to run there. Today you might have it finding spam in your inbox or generating invoices for you, but it's not planning a roadmap or running your entire marketing department end to end. Maybe we get there, though. Maybe in the future you have a small business and you're just like, I want to focus on the restaurant and the cooking — I want you to literally run my entire marketing campaign, go. I'd love to be able to get there.

speaker_1: Yeah, it's crazy how close we already are, and it doesn't feel to me like it's going to be all that much longer before we really start to see — not necessarily superhuman performance, but Ethan Mollick has this great "best available human" standard, and it seems like we're closing in on that for a lot of things, especially small business. I mean, who can you really hire to do your marketing as a small business? The virtual employees are, I think, hitting that level pretty quickly. Last time we talked, one thing you said that really stuck with me is that there's no moat other than speed, and I think that's a big reason you're willing to be so open with all the details of what you're building — the tools, all the techniques, and so on. Have you found any moat other than speed in the intervening months? And how do you sleep at night if that's still the only moat?

speaker_2: No, I think speed is the only thing that matters at this point. There used to be moats in AI — I'll give you a good example with the integration platforms. It used to be that to compete with a Zapier or an n8n or a Pipedream, you needed a lot of connectors, and those connectors had to be hand-rolled, so you had to build up this huge investment over the course of years. But with direct API access, that's gone, right? On day one we have — I'd argue, with computer use — better integration support than Zapier has, despite the product being a few months old and Zapier being a very old product. So that moat of theirs is gone, or going shortly. A similar example: for a long time our story was, hey, our moat is that we have an email client — anyone who wants to build an AI email client first has to build an email client. Well, it's not going to be too long before you can just ask the AI to write an email client for you, and then that moat is gone. So I do think it's speed, and it's all speeding up, because the tools — the AI tools — are getting better. So our entire focus is: how do we make sure that by the time the people listening to this podcast who might want to compete with us have figured out a way to replicate what we have, we've already figured out new stuff? And we're going to try to keep that going.

speaker_1: Shorten the timelines is the modern maxim, for sure. This has been great. I really appreciate all the time and the depth of your answers. Anything else you want to leave people with?

speaker_2: I don't think so — just check it out. I think you'll like it. It works, it really does. Tasklet.ai.

speaker_1: Tasklet.ai. Andrew Lee, founder and CEO of both Shortwave and Tasklet, thank you again for being part of the Cognitive Revolution.

speaker_2: Thanks for having me.

