Three Kinds of Software Survive: Tasklet's Andrew Lee on Competing to be a Horizontal Platform

Tasklet CEO Andrew Lee discusses rebuilding the company's agent stack around file-system context, agentic search, and summarization. He also examines platform strategy, competition with model providers, and which software companies may survive the AI transition.


Watch Episode Here


Listen to Episode Here


Show Notes

Andrew Lee, CEO of Tasklet, returns for his fourth appearance to share how his team has once again rewritten their entire agent stack, now emphasizing file system context, agentic search, and multi-resolution summarization. The conversation digs into the strategic tension of competing with your own supplier, as Anthropic's Claude Max accounts offer direct customers far more tokens than API partners get at the same price. Andrew also lays out his framework for the only three types of software companies that will survive the AI transition and discusses Tasklet's evolution toward becoming a model-agnostic horizontal platform.

LINKS:

Sponsors:

Brave Search API:

Brave Search API gives AI agents a fast, independent search index for research, RAG pipelines, images, places, and fewer hallucinations. Get $5 in free credits at https://brave.com/search/api/?mtm_campaign=q2-26-cognitive-revolution

Sequence:

Sequence handles the full revenue workflow for complex pricing, from quoting and metering to invoicing, revenue recognition, and collections. Book a public demo at https://sequencehq.com and use code COGNISM in the source field to save 20% off year one

Roboflow:

Roboflow is an end-to-end visual AI platform that lets you turn raw ideas into fully deployed applications in just hours, powering breakthroughs like Blueprint Pro's floor-plan understanding tool. Read the full Blueprint Pro story and see how over a million engineers are building the next wave of visual AI at https://roboflow.com

Claude:

Claude by Anthropic is an AI collaborator that understands your workflow and helps you tackle research, writing, coding, and organization with deep context. Get started with Claude and explore Claude Pro at https://claude.ai/tcr

CHAPTERS:

(00:00) About the Episode

(02:43) Tasklet rebuilt everything

(08:06) Context compaction strategy

(15:05) Model progress updates (Part 1)

(15:13) Sponsors: Brave Search API | Sequence

(17:37) Model progress updates (Part 2)

(23:22) Anthropic competition dynamics (Part 1)

(34:10) Sponsors: Roboflow | Claude

(36:51) Anthropic competition dynamics (Part 2)

(36:51) Harness versus models

(47:30) Frontier model choices

(55:19) OpenAI platform risks

(01:03:17) Shared organizational brain

(01:08:35) Future software platforms

(01:19:06) Reliability and oversight

(01:23:46) Lightning round insights

(01:30:49) Episode Outro

(01:34:04) Outro

PRODUCED BY:

https://aipodcast.ing

SOCIAL LINKS:

Website: https://www.cognitiverevolution.ai

Twitter (Podcast): https://x.com/cogrev_podcast

Twitter (Nathan): https://x.com/labenz

LinkedIn: https://linkedin.com/in/nathanlabenz/

Youtube: https://youtube.com/@CognitiveRevolutionPodcast

Apple: https://podcasts.apple.com/de/podcast/the-cognitive-revolution-ai-builders-researchers-and/id1669813431

Spotify: https://open.spotify.com/show/6yHyok3M3BjqzR0VB5MSyk


Transcript

This transcript is automatically generated; we strive for accuracy, but errors in wording or speaker identification may occur. Please verify key details when needed.


Introduction

[00:00] Hello, and welcome back to the Cognitive Revolution!

Today, I'm pleased to welcome audience favorite Andrew Lee, CEO of Tasklet, back for his fourth appearance on the podcast.

Andrew has always been extremely transparent and candid – his belief that "Speed is the only moat" has made him comfortable sharing intimate details of Tasklet's agent architecture, and as you'll hear, in the 6 months since we last spoke, Tasklet has indeed once again entirely re-written their stack.

Today, there's much more use of file system context and agentic search to leverage available information while conserving tokens, and a huge emphasis on summarization at several levels of resolution.

This time around, we also dig into the delicate strategic situation that Andrew and Tasklet face.  While their product strategy of "always betting on the models" has proven correct and his choice of Claude has been rewarded, Andrew observes that these days "everyone is building the same thing", and today his most intense competition is actually coming from his critical supplier, Anthropic, which, with Claude Max accounts, gives its direct customers an estimated 5 times as many tokens as Tasklet can purchase at the same price via the API.

In micro terms, the relatively high cost of tokens has caused Tasklet to stick with Opus 4.6 rather than moving to the new 4.7, and in macro terms, it's pushing Andrew and team to become a horizontal platform that's capable of harnessing – or, as Andrew describes it, outfitting with a mecha suit – frontier models from any provider.

This evolution, which I do think Andrew has played about as well as possible, is critical, because horizontal platforms are 1 of only 3 types of software company Andrew believes will survive the AI transition, the others being API-first companies like Stripe, and companies that develop solutions and sell outcomes – best exemplified, perhaps, by Fin's model of "99 cents per customer service ticket resolved."

We get into lots more besides, including Tasklet's instant apps feature, how they are thinking about deep personal and shared organizational context, the cloud container company Andrew endorses, Tasklet's token-to-labor cost ratio, and whether Zuckerberg has come calling after his Manus acquisition was cancelled by the Chinese government.

It's a fun one, with lots of valuable details, from somebody who's in the arena, competing to become one of the few general-purpose AI agent platform winners, and actually willing to tell us all about it.

Please enjoy my conversation with Andrew Lee, founder and CEO of Tasklet.

Main Episode

[02:43] Nathan Labenz: Andrew Lee, returning champion and CEO of Tasklet. Welcome back to The Cognitive Revolution.

[02:49] Andrew Lee: Thank you. Glad to be here.

[02:52] Nathan Labenz: You are a fan favorite, and I'm going to just try to pepper you with a bunch of questions and make sure we get as much alpha for all the builders in the audience, myself included, as we can. So first question: it's been about six months since we last spoke, and your "speed is the only moat" mantra has rung in my head probably weekly. What have you rebuilt in the last six months since we talked? Or maybe more to the point, what have you not rebuilt in the last six months since we talked?

[03:24] Andrew Lee: Yeah, the mantra has stayed the same, and we've rebuilt basically everything. I was thinking about this earlier, and even for the pieces where I'm like, "Oh, this has stayed the same" — actually, no, they've been totally rebuilt. So from a product perspective, the product is actually very different now. When we launched this thing in October, it was all focused on workflow automation. We thought, hey, it would be really cool if people could come in, describe a workflow, and we'd run the workflow for you. But basically, the feedback we got right out of the gate was: hey, once I've given this agent all of my context and hooked it up to all my stuff, I don't want it just running my workflows. I also want to be able to talk to it synchronously too. So it's not just a workflow automation tool anymore. Now it is a very general-purpose agent. It's great for doing workflows, but it's also great for doing other types of stuff. That required basically a total rebuild of the product experience, and we've developed a lot of technology behind it. As an example, in the previous iteration of the product, a workflow automation tool, you basically had a main agent that you'd talk to for a brief period to set up your workflow. Once you were done setting it up, you basically stopped talking to that agent. The chats were relatively short, and then our system would kick off runs of what we call the task agent on a periodic basis when events happened. Every agent run was a pretty short thing, and you could do pretty simple context engineering to make that work. In a world where you want this general-purpose agent that you can talk to synchronously and that runs these automations, the product experience people want is just one big linear chat where everything is in one place. It works really well as a product experience, but the engineering gets really complicated, because you can't just have an infinitely long chat history that you feed into the LLM. Even if you could, it'd be really, really expensive, and you wouldn't want every automation to have to send in 10 million tokens from all the previous runs. So we had to totally rethink the way context engineering works and say: what if, instead of the history being the thing we send to the LLM, the history is in the file system? What if the files are the agent? Marc Andreessen had a thing about this, and I think people have figured this out, but we made the switch in November, where we said: okay, what we need is a file system that has your history. Then we need the thing we actually send to the LLM to be just hints at what's in the file system and what things you need to read to get the work done. That way, we can scale the agent up, including the number of chat messages that are sent. There's a lot of stuff we'll scale up in the future, but basically you go from what fits in the context window to what fits in the file system, and you can fit a lot more stuff in the file system.

There's a bunch of other stuff we've rebuilt. Another big thing is computer use. When we launched, computer use was this add-on. You could have a Linux machine tacked on (actually, initially it was a Windows machine, then a Linux machine). It really was an afterthought. You'd use it for certain things. It worked okay, but most things you did with the agent didn't touch it. At this point, computer use is the absolute core of the product. Basically everything you do is running shell commands, touching a file system, touching a database. We have a very tightly integrated browser-use experience now, where every agent has a headless VM and a browser VM that persist state across runs and allow you to do lots of really cool stuff. And it's in the critical path: now, if our computer use goes down, everything goes down, versus before it was this afterthought.

We've also rethought the way our integrations work. This is something where the product experience maybe doesn't look terribly different to folks, but the basic architecture of how we plug other systems into the Tasklet agents has been completely rethought, basically to allow the agent to have more control and management of those connections. A simple example of a product experience that's improved: you used to not be able to connect multiple instances of the same type of thing. You couldn't have three different Gmail accounts connected to an agent, and now you can. The base architecture there has been rebuilt. I'd say basically every line of code has been touched in the last six months, and most of our fundamental assumptions were thrown out. The product can still do many of the same things, but hopefully much better now.
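The "files are the agent" idea Andrew describes — full history on disk, with the model receiving only hints about what's there — can be sketched roughly like this. This is a toy illustration, not Tasklet's actual implementation; the file layout, index format, and summary strings are all assumptions:

```python
import tempfile
from pathlib import Path

def append_turn(history_dir: Path, turn_id: int, content: str, summary: str) -> None:
    """Persist the full turn to the file system; only a one-line hint goes to the model."""
    (history_dir / f"turn_{turn_id:05d}.md").write_text(content)
    with (history_dir / "index.md").open("a") as f:
        f.write(f"turn_{turn_id:05d}.md: {summary}\n")

def build_prompt(history_dir: Path) -> str:
    """The prompt carries pointers to what's on disk; the agent reads files on demand."""
    index = (history_dir / "index.md").read_text()
    return (
        "Your full history lives in the file system. Index of past turns:\n"
        + index
        + "Use your read-file tool to pull in any turn you need."
    )

history = Path(tempfile.mkdtemp())
for i in range(3):
    # stand-in for a real turn that might be tens of thousands of tokens
    append_turn(history, i, f"turn {i} body text " * 500, f"user asked about topic {i}")

prompt = build_prompt(history)
```

The point of the sketch is the size asymmetry: the prompt stays a few hundred characters no matter how large the on-disk history grows, because the agent is trusted to look things up rather than being handed everything.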

[07:43] Nathan Labenz: Yeah, that's cool. So literally, you can't think of anything that has survived the last six months.

[07:49] Andrew Lee: Nothing substantial. The visual design is totally different. The structure of the app is totally different. Yeah, the connections system is different. The way we use computers is different. The core of the agent is different. The context management is different. The way we do compaction is different. So it's all new.

[08:07] Nathan Labenz: Yeah, okay, let's talk about that compaction, because one of the big takeaways, it might have been two conversations ago, was the critical importance of caching. You had said at the time that long context is quite effective but obviously expensive, and that caching, especially Claude's 90%-off caching pricing, was critical to enabling certain things to really work in a way that wasn't tanking your margins. What is the classic meme? "My company is dying." Making it work without killing your company. So it sounds like that's changed quite a bit. Now it's much more of a pointers-and-hints type of thing. How is this working? Obviously people are spending a lot on tokens these days, and I'm increasingly hitting my limits even on the highest Tasklet plan. So how are you managing context for me? What should I know? What lessons should I take away from your experience on how best to manage context in the modern moment?

[09:18] Andrew Lee: Yeah, so caching has actually become a much bigger deal, because now that the real context is in the file system, there are just a lot more tool calls needed to do the basic operations of the agent, since you're loading in a bunch of files and stuff. So we really have to make that caching work if we don't want this thing to be crazy expensive. That's been very much at the forefront. We came up with a new approach to context management that we shipped in December that basically works like this. You take your whole chat history and you put it in the file system, so it's all accessible there. Then you find a way to summarize that whole history in a roughly fixed-length number of tokens by including recent stuff at high granularity — the last thing you sent, you'll probably have most or all of it there — and having older things decrease in fidelity as you go back. So if you have a very long chat, the stuff in your current turn, the current thing that's running, is probably all going to be there: all the thinking blocks and all the tool call responses and all the files are probably going to be sent to the LLM, depending on how long the run is. For most short runs, that'll be the case. The previous turn will probably mostly be there: you're going to have the full user message, and you'll probably have the assistant response, the tool call arguments, the tool call responses, and the thinking blocks. But as you go farther back, we start stripping the thinking blocks. We start stripping the tool call responses, or at least truncating them and then stripping them. We start truncating and then stripping the tool call arguments.

Then we start collapsing tool calls, then we start shrinking down the assistant messages, and then finally we get to some LLM-based summarization. And we do this in buckets, moving back, so that we have minimal impact on caching. Basically, you want to avoid messing with prefixes as much as you can. So as you go back, you get into these buckets with different levels of compression. Those buckets, as they get older, tend to get added to very slowly, and once they hit a certain threshold, we shrink them down. This system has actually worked pretty well. The core thesis is that you generally care a lot more about recent stuff, and you trust the agent to go and look things up when it needs to. I would say it's not perfect. We definitely do have people say that agents forget things, and it definitely does still cost us a lot of money to run, but I think it's generally worked. Our plan is to double down on this type of architecture. We have lots of ideas for how to improve it, but the basic approach of decreasing fidelity as you go back, with these bucketed cache-aware chunks, I think is the right approach.
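The decreasing-fidelity scheme Andrew outlines can be sketched as a simple bucketing function. This is a minimal sketch, not Tasklet's code: the bucket sizes, field names, and the truncation-based stand-in for LLM summarization of the oldest bucket are all assumptions.

```python
def compact_history(turns: list[dict], recent_full: int = 2, mid: int = 4) -> list[dict]:
    """Newest turns keep everything; middle-aged turns lose thinking blocks and
    tool results; the oldest turns collapse to short summaries. Because older
    buckets change rarely, the serialized prefix stays stable and cache-friendly."""
    out = []
    n = len(turns)
    for i, t in enumerate(turns):
        age = n - 1 - i  # 0 = newest turn
        if age < recent_full:
            # full fidelity: user message, assistant, thinking, tool calls + results
            out.append(t)
        elif age < recent_full + mid:
            # middle bucket: keep the conversation, strip thinking and tool results
            out.append({
                "user": t["user"],
                "assistant": t["assistant"],
                "tool_calls": [c["name"] for c in t.get("tool_calls", [])],
            })
        else:
            # oldest bucket: stand-in for LLM-based summarization
            out.append({"summary": t["user"][:40] + " -> " + t["assistant"][:40]})
    return out

turns = [
    {"user": f"request {i}", "assistant": f"answer {i}",
     "thinking": "reasoning...", "tool_calls": [{"name": "search", "result": "big blob"}]}
    for i in range(10)
]
compacted = compact_history(turns)
```

The real system moves turns between buckets in chunks rather than one at a time, precisely so that the token prefix sent to the model changes as rarely as possible.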

[12:09] Nathan Labenz: How often does that get updated? If I have an agent that runs on a daily basis, do you try to keep the cache active from one day to the next? Or does every day get a fresh cache that runs through that whole session and all the interactions, and then tomorrow we begin again? On what frequency do we begin again?

[12:34] Andrew Lee: Well, there are two pieces there. There's when things get updated on our side — that is, when we decide what the compressed history that we put into the LLM looks like — and then what caching we do on the LLM side. The answer to the former is: every time you do anything, it's incrementally updated, including in the middle of runs. If you have a very long turn that uses a lot of tokens, it might actually start compressing inside that turn. The reason we persist that is that calculating it can be really expensive. Running an LLM-based compaction of an older section eats a lot of tokens; you don't want to do that every time the thing starts up. If you have a trigger running every hour and every run has to compress a bunch of history, that can be very expensive, so we keep all of that around. On the model side, caching depends on the provider. In the case of Anthropic, we're using five-minute caching, so it doesn't stick around very long. The assumption there is that you're probably either in an active session or in the middle of a turn, in which case a five-minute cache is enough, or you're waiting for the next trigger to run. And most people's triggers are not running every half hour; they're running every few hours or every day. So the assumption is that cross-run cache hits are not so common. Different providers have different possibilities there. OpenAI, for example, has much nicer caching primitives; I'm happy to talk about those too.
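The two ideas in this answer — persist the expensive compaction results so a trigger run doesn't redo them, and keep the old-history prefix byte-stable so provider-side prompt caching can hit — can be combined in a small sketch. Class name, thresholds, and serialization here are illustrative assumptions, not Tasklet's design:

```python
import json

class PersistedCompactor:
    """Persist expensive compaction results so each trigger run doesn't redo them.
    The summary section changes slowly, keeping the cached prompt prefix stable."""

    def __init__(self, bucket_limit: int = 3):
        self.bucket_limit = bucket_limit
        self.summaries: list[str] = []   # persisted, already-compacted old turns
        self.recent: list[dict] = []     # full-fidelity tail
        self.compactions_run = 0         # counts the expensive summarization steps

    def add_turn(self, turn: dict) -> None:
        self.recent.append(turn)
        if len(self.recent) > self.bucket_limit:
            # expensive step (LLM summarization in the real system); runs rarely,
            # and its result is stored rather than recomputed on the next run
            oldest = self.recent.pop(0)
            self.summaries.append(f"summary of: {oldest['user'][:30]}")
            self.compactions_run += 1

    def prompt_prefix(self) -> str:
        # only the slow-changing summaries; recent turns are appended after this
        return json.dumps(self.summaries)

c = PersistedCompactor()
before = c.prompt_prefix()
c.add_turn({"user": "first", "assistant": "ok"})
c.add_turn({"user": "second", "assistant": "ok"})
stable = (c.prompt_prefix() == before)  # prefix unchanged: cache-hit territory
```

Appending turns leaves the serialized prefix untouched until the bucket threshold trips, which is the property that makes provider-side prefix caching pay off.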

[14:01] Nathan Labenz: Yeah, okay, that's interesting. So it's basically constant maintenance of the higher-level summaries that get fed into the LLM, and then pretty short, single-burst-style caching to actually reduce the cost of incremental calls within one agent run. And it sounds like, at least for Anthropic, the cache is typically hit within one run, but not hit across runs for the most part.

[14:34] Andrew Lee: Yep, that's the current approach. One thing I want to note is that the way our system is built today, we basically get no cache benefit across users; it's basically caching per agent. There are some changes we can make to do a lot of cache optimization across agents, and potentially even across organizations and users. I don't want to get into the specifics of that because it's still upcoming, but I think there's a lot of potential to save money across agents as well.

[15:06] Nathan Labenz: Yeah. Okay. Well, that'll be important. I do want to circle back to the OpenAI question, because you guys have been Claude maximalists, and your other mantra that rings in my head a lot is "always bet on the models." I'd say it's safe to say that bet has gone well over the last six months. Obviously we've seen some of the most notable model releases, in the sense that the community has flagged 4.5 and 4.6 as qualitative shifts where things went from not working to working, and people are like, oh, I can really get pretty general-purpose knowledge work out of these things on a pretty consistent basis now. I would love to hear how you would characterize the advances we've seen. Maybe you could do that in terms of what new use cases have opened up, things that have surprised you, and possibly things that are still not working, which might be surprising given all the things that do work. Then we can get to the latest models. So give me the 4.5 and 4.6 history, and then we can go to the 4.7 and 5.5 present.

[15:13] Brave Search API: Brave Search API gives AI agents a fast, independent search index for research, RAG pipelines, images, places, and fewer hallucinations. Get $5 in free credits at https://brave.com/search/api/?mtm_campaign=q2-26-cognitive-revolution

[16:30] Sequence: Sequence handles the full revenue workflow for complex pricing, from quoting and metering to invoicing, revenue recognition, and collections. Book a public demo at https://sequencehq.com and use code COGNISM in the source field to save 20% off year one

Main Episode

[18:52] Andrew Lee: Yeah, I think the overall approach of always betting on the models has totally held up. When we started working on Tasklet, we were using Claude 4, and that actually could get you a long way; it worked pretty well. 4.5 was a big unlock. It started to work much better for computer use, and it could manage navigating through the various connections and the tool-enablement process and so on much better. That came out very early in the lifetime of Tasklet, which was, I think, a big bonus for us. And the cost reductions around Opus were huge for us, in, I want to say, December. Initially, we could really only have people on Sonnet, and then Opus dropped from 15 to 5, and that was a big unlock as well. I think 4.6 was a solid incremental change. It made things nicer again for computer use, which has become increasingly important for us, both headless and headful, as well as for codegen, and that enabled our Instant Apps feature, which I think we'll probably talk about today; it's a very cool feature. 4.7 we actually haven't rolled out yet. I think 4.7 is much better in certain areas. It's much better at code and one-shotting long projects. But for the types of iterative knowledge work that we support, it doesn't actually seem like a huge boost, and it's a lot more expensive. The tokenizer changes they made increased our costs by like 30%, and costs are huge for us because we essentially pass them on to users. So we actually opted not to ship 4.7 as the default recommended model. We are going to ship it, but as an advanced-user option, and we're going to note that, hey, this actually costs a lot more. So I think the progress there has been great.
The reason we started on Anthropic, and the reason I've been so Anthropic-focused for so long, was just that the basic core of our agent, the ability to navigate through a discovery process of connections, activate the right tools in your agents, and then manage its context the way we manage it, the core inner workings, requires a base level of intelligence that the other models just couldn't deliver. You basically couldn't use the same harness with them and expect it to work well. That has changed, which is really exciting for us. It's kind of scary to be totally dependent on this one vendor, even though they're great. The models are amazing, don't get me wrong.

[21:33] Nathan Labenz: Supply chain risks are everywhere if you take that approach.

[21:37] Andrew Lee: Exactly. But more recently, GPT-5.5 has gotten really good. I think it's a huge step up for our use case over 5.4. By the time this podcast comes out, we'll probably have announced this publicly, so, hey, it's going to be there very soon. It can navigate our harness super well, and I think it gives Opus 4.6 a run for its money for most use cases. That's really exciting. I'm pretty optimistic about the OpenAI roadmap this year. They made a huge bet on compute last year, I think that's starting to show, and I think that's where they're going to have a lead for a while. It's also clear that they've refocused their business much more on these types of use cases. You saw the progress Codex made over six months, and if they put that level of effort into tuning the models around these types of agentic use cases, I think it'll be huge. So we signed a deal with OpenAI the other day, we'll be launching stuff there, and we're making a pretty big bet there as well. There's progress in other places, though. The latest Google models are pretty solid. They're not at the level of Anthropic or OpenAI yet, in my opinion, but they're making very fast progress, and they're much closer. And we've been playing with DeepSeek and with Kimi. The latest Kimi is, as far as we can tell, maybe better than Haiku, and cheaper. So I think we're going to see those make their way into Tasklet soon. I would expect that within the next few months we're going to have Anthropic models, OpenAI models, open-source models, and Google models. I'll bet the Anthropic ones will still be the best, and probably the recommended ones in most cases, but people will have a variety of choices and some good cost-optimization options for certain things.

[23:23] Nathan Labenz: So many follow-ups there. Let's maybe start with what I imagine has been a bit of a delicate dance with Anthropic, and then we can compare and contrast that with what OpenAI is now bringing to the table. I don't know, you probably know what the ratio is of API cost to effective token cost when you buy a Claude Max subscription and max it out. And obviously in the intervening time since we spoke last, there's been the whole OpenClaw phenomenon, and that's had its own bunch of drama: you can, you can't, you sort of can; you've got to pay the API price; we're lowering our limits; we're buying compute from xAI; we're raising our limits back again. What has it been like from your perspective to be building on a platform that you're also sort of competing with, one that is kind of undercutting you at various levels at different times on pricing?

[24:23] Andrew Lee: Yeah, it's definitely an interesting relationship. On the one hand, the models are amazing. They're super good for our use case. Their team has been really helpful and responsive. We talk to them on a very regular basis, they're trying hard to support us, we get early access to stuff, and they take our feedback and all of that. They're definitely totally enabling our business, making it happen, and working hard to do it, which is wonderful. They're a great partner; I don't want anyone to think I think otherwise. But at the same time, if you look at our stats on where someone goes when they churn off Tasklet, like 80% of those users go to an Anthropic product. So they are a very direct competitor. There are different use cases where their products shine, but it's clear that they're a very direct competitor. The number one reason people do that is that they already have a Max plan, and they don't want to have to spend additional money on Tasklet. Basically, every time Anthropic releases a new model update, we're like, this is great, this is awesome. Every time they send an email saying, now your Max plan is even cooler, we're like, crap, this is just going to make it harder. They definitely subsidize it, and it has definitely set some distorted expectations with folks around what you can get for a certain price. So cost is a constant struggle for us: trying to help users use the product more efficiently and helping them understand that, hey, we're actually working at some pretty razor-thin margins here and trying to make this cheap for you guys. So yeah, it's an interesting dance for sure.

[26:01] Nathan Labenz: Do you know what the ratio is or is it opaque even to you?

[26:06] Andrew Lee: I don't know what the ratio is, no.

[26:09] Nathan Labenz: Yeah, interesting. It feels like it's not insignificant. My intuitive gut guess would be it's like five to one or something, but I don't really know.

[26:20] Andrew Lee: That would be my guess too. Like five to one, or maybe even more. It does seem pretty substantial, yeah.

[26:26] Nathan Labenz: Yeah, that's a big deal. So, okay, one more thing on the Anthropic dance. Obviously this gives you a lot of incentive to broaden out and try to position yourself a little differently. But they have an inside lane when it comes to building product experiences that make the most of their models' capabilities. Increasingly, we're seeing the model being trained in the first-party harness, and I have another question about the word "harness" and whether that's even the right paradigm to be thinking about this anymore. But how do you position yourself? You said there are some use cases where you feel Tasklet exceeds what you get out of the first-party Claude products. For one thing, how is that even possible? And how do you think about trying to compete with what they themselves are going to build, given all the inside knowledge, advance prep, and close-coupling advantages they have?

[27:35] Andrew Lee: At a high level, I think that everyone is building the same thing. You have all of these different agent companies, and basically over time, as the models get smarter and the agents build in more general-purpose tools, computer use and file systems and whatnot, you can do very similar things in many of them. You can go into Claude Code and do all kinds of non-coding things. Codex and Claude Code and many other startup products are all able to do coding and non-coding things pretty well. Where you start to differentiate, and I do think you can differentiate within this space to some extent, is really around what you're optimizing for and what the ergonomics are. Take Tasklet. You can totally write code with Tasklet. You can hook it up to your GitHub and have it generate PRs, and it does that just fine. We do this for marketing, for example: if we put up a new blog post or whatever, I write the content in Tasklet, and then I just have it generate its own PRs, and it works fine. But it's not going to be as smart, and definitely not as cost-effective, for heavy-duty coding as going and using an actual coding harness. And it's definitely not going to be as nice to use, because the actual coding harness is going to be in Conductor or something that's designed for a coding workflow, and our product is not set up that way. So I see a future where you can pick up any AI agent and do anything, but different ones are going to have different cost and performance trade-offs and different ergonomics for different types of work. Where we really shine is 24/7 automation of knowledge work for companies, especially knowledge work that is not your personal work but something the company owns.
A simple example: say you're a company with some complex invoicing process. You don't want to be running that in your local Cowork. If you close your laptop and the company can't invoice people anymore, that's bad. You don't want to put it in OpenClaw on a Mac Mini in the corner, because again, if something trips over the power cord, well, you can't run your invoicing. What you really want is something that's running in the cloud and is manageable by many people, where you have a lot of infrastructure around it to manage and provide oversight, and you have audit logs and guardrails around the thing, and you can control costs across your different agents. There are a lot of team-enablement features you care about. That's where we really shine. A lot of the work to make that work well is actually fundamental to the way the agent is built.

[30:12] Andrew Lee: I talked about our context manager. The reason our context manager is built the way it is is because you want to have triggers as regular messages into the agent. Which means, if you have an agent that's running a trigger every time you get an e-mail, that agent might fire 10,000 times this year. And so you need an agent that can fire 10,000 times and still remember things at the beginning of the chat and still behave in a reasonable way. That's a pretty different thing to optimize for than a coding session. The way Claude Code resets context makes total sense in a coding environment. It doesn't really make sense in a world where it's processing all your emails. So that's how I see us differentiating. A couple of other notes I want to make here on differentiation. One, the market is just freaking huge. So if you look at coding agents, you might say, ah, clearly Claude Code and Codex have won, but Cursor is going to sell for like $60 billion. And even the fourth and fifth and sixth: Cognition's doing just fine. Factory's doing just fine. Even Windsurf, which had to sell, was a pretty good exit. I'd love to be number one here, but if we end up being number four or five or six, that could still be a very significant exit. And the last thing I want to note, and I think this is probably the most important point: when we go and pitch a business, what we're trying to help them do is deploy AI for real inside their company to automate stuff. Those typical companies don't want to have to spend all their time researching AI models and placing bets on which lab is going to win. They want to choose a platform that's going to serve them well, and they want to benefit from everybody's improvements over time. We can go in there and say, Hey, a bet on us is not a bet on Anthropic or OpenAI or anyone else. It's a bet on us and a bet on everybody.
We're going to give you Anthropic models and OpenAI models and Google and all the open source models, and then we will be a neutral arbiter of what you use. To the extent that we can build features to help you choose the right model for the job and optimize your costs across the different things, you can trust us, because none of these are ours. We're getting the same margins on everything. We're a neutral party. Versus if you go to Anthropic right now, it's just Anthropic products. Even if they decided to provide other models in their products, which they could, although I don't think they're going to, I don't know if you'd really trust them to do that in a neutral way. I think that's a pretty compelling part of our sales pitch.
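The trigger-driven context design Andrew describes above, an email trigger firing 10,000 times while the agent still remembers the beginning of the chat, can be sketched roughly as a rolling context with multi-resolution summaries. This is a hypothetical illustration, not Tasklet's actual context manager; all class names, thresholds, and the stubbed summarizer are invented for the sketch.

```python
def summarize(messages):
    # Stand-in for an LLM summarization call; here we just count and label.
    return {"role": "system",
            "content": f"[summary of {len(messages)} earlier messages]"}

class TriggerContext:
    """Hypothetical context manager for an agent that fires on every trigger."""

    def __init__(self, keep_verbatim=50, compact_batch=100):
        self.keep_verbatim = keep_verbatim    # recent turns kept word-for-word
        self.compact_batch = compact_batch    # older turns folded per summary
        self.summaries = []                   # coarse history, oldest first
        self.recent = []                      # full-resolution tail

    def on_trigger(self, event):
        """Each trigger (e.g. an inbound email) is just another message."""
        self.recent.append({"role": "user", "content": event})
        # Fold the oldest verbatim turns into a summary once the tail is long.
        while len(self.recent) > self.keep_verbatim + self.compact_batch:
            batch = self.recent[:self.compact_batch]
            self.recent = self.recent[self.compact_batch:]
            self.summaries.append(summarize(batch))

    def prompt(self):
        # The model always sees coarse summaries first, then the verbatim tail.
        return self.summaries + self.recent
```

The point of the sketch is that context grows logarithmically rather than linearly with firings: after thousands of triggers, the prompt is still a bounded list of summaries plus a recent tail, instead of resetting wholesale the way a coding-session harness might.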

[32:51] Nathan Labenz: Yeah. I think you've maybe navigated this about as well as anyone could, in the sense of betting on Anthropic and kind of going all in on whatever the best model is, which has been Claude, to make it work as well as possible while the capabilities curve was getting to critical thresholds. And then pivoting to being a more neutral abstraction above the model layer now that there are multiple options that seem able to deliver the kind of performance that people want. It wasn't obvious. I don't know if it was obvious to you that that's how it was always going to play out, but it wasn't obvious to me. I would have said about your position six months ago, like, yikes, it is pretty tenuous to be all in on Claude. But I think you timed it pretty well on a couple different levels. So how much of that would you say is foresight and genius, and how much is good luck?

[33:56] Andrew Lee: Yeah, I'm glad you think so. This was very much the plan and I do think it's worked out really well. So, yeah, I'm happy. I'm happy with how it's turned out.

[34:10]Roboflow: Roboflow is an end-to-end visual AI platform that lets you turn raw ideas into fully deployed applications in just hours, powering breakthroughs like Blueprint Pro's floor-plan understanding tool. Read the full Blueprint Pro story and see how over a million engineers are building the next wave of visual AI at https://roboflow.com

[34:59]Claude: Claude by Anthropic is an AI collaborator that understands your workflow and helps you tackle research, writing, coding, and organization with deep context. Get started with Claude and explore Claude Pro at https://claude.ai/tcr

Main Episode

[36:51] Nathan Labenz: So, okay, let's do this harness thing for a second. The word harness itself always makes me think of trying to control and direct some sort of wild animal to get useful work out of it when it might rather be doing something else. I just took a long road trip with my kids in a Tesla FSD-enabled car over the last 10 days. We went to a lot of historical sites, and the juxtaposition of the horses and my FSD was pretty funny. But I kind of think of trying to get this unruly animal that is a model to stay on track, right? And do what you want it to do. These days, as you said, there's a lot more ability to give the model hints and say, here's a file system, and you can kind of go get what you need here. And I'm starting to feel like the concept of a harness is maybe a little anachronistic already, and maybe what we're doing now is more saying: here's the world you get to play in. It's not so much about trying to narrow what the model can do, but more about broadening what it can do. How do you think about that narrowing and focusing versus broadening, giving access, unlocking new possibilities, which might in some cases even surprise users, given the model capabilities we have now?

[38:16] Andrew Lee: I hadn't actually thought of a harness as being a constraining thing, but yeah, you make a good point that that would be the normal way you'd think about that word. I kind of think of it as a mecha suit. I agree with your thesis. The goal is to let that agent, let that LLM, actually do things in the world. To do that, it's going to need storage, it needs compute, it needs to be able to reach out and connect to APIs, it needs to be able to talk to the user. There's a lot there. When I talk to people who are not deep into the harness world, I think most people assume when they play with an LLM product that it's a very raw thing on top of the model, that what they see on the screen is just being sent to the model, and the model's doing everything. That is becoming increasingly less true. The complexity of the code that translates what you see into the LLM calls is getting greater and greater, and I think that's going to keep going. The sophistication of these harnesses is going to get 10 times as complex. I think there are going to be some pretty major breakthroughs here that increase the capabilities of these things pretty substantially: in the way that we handle memories, the way that we handle oversight and control, and the way we connect to other tools. So I'm very bullish on the opportunity here, and I think these things are just going to get more and more complicated. And yeah, maybe we need a new name. Maybe it's a mecha suit and not a harness. Yeah.

[39:48] Nathan Labenz: How much do you think... So this is, I think, one of the more interesting debates right now in the AI builder community broadly: what matters more, model or harness? And you see pretty extreme positions on both ends. I get emails that are like, models don't matter anymore, it's all about the harness, and vice versa. Obviously, either of those extreme positions is not going to be right. But I have historically come down somewhat informed by... I don't know if you've seen this graph from the UK AI Security Institute, where they do a capability plot over time with a minimalist harness, whatever kind of basic vanilla thing, and then the best available harness. And of course, both are going up. A year ago, though, the time delta between what level of capability you could get with the best available harness versus the vanilla harness was longer. Now it's gotten shorter. Some of that is maybe just due to more frequent model releases, which shortens every window of advantage. Some of it is maybe because the models are getting more deeply trained to use harnesses, so they're just good at it out of the box, and you don't have to compensate for their weaknesses so much. But I guess my overall summary would be: models seem to matter more, and the question is how far into the future the best available harness can take you for any given model. It seems like it's not a huge amount. But it sounds like you maybe see that differently. If you do, what's the case that that's wrong?

[41:39] Andrew Lee: I think, as models get better, they can replace good harnesses. A model today with a crappy harness is going to be better than a model from a year ago with a really good harness. I agree with that, and I think that trend is going to continue. But I also think the effects are multiplicative, and they're orthogonal disciplines, and there's no reason not to take the best model and put it in the best harness, and I think we should. You might argue that, given the exponential, that actually only buys us six months or something. And okay, fine, but it's six months. More importantly, intelligence is not the only metric that matters, right? In these real production systems, intelligence is one piece, but take Tasklet, for example. Much of what we do is automating specific workflows. Once the model plus harness is smart enough to, I don't know, order us lunch every day, which it does, and we've been able to do that for like six months without going in there and messing with it very much, incremental improvements to intelligence don't really matter. But performance and cost do. So if you look at the harness and say, Hey, the only point is to make the thing smarter, fine, it buys you a fixed amount of time over the model exponential, which is maybe cool, but not amazing. But it might make a significant difference in cost and other attributes: cost, and reliability, and the ability to do oversight, and speed. And I think those things matter a ton for a commercial product. In our case, with our harness, the benefits you get are: you have a nice UI, and the sidebar pops out at the right time to show you things, and you get nice indications of working states.
You can see what it's doing at the time, you get the ability to have things persisted across long periods of time, and you get nice performance and cost trade-offs. I think those things should not be underestimated for a commercial product.

[43:49] Nathan Labenz: Yeah, if you can make it work with Haiku instead of Opus, for example, that moves the needle quite a bit. For sure. Especially in a compute scarce world, which we increasingly seem to be in.

[44:05] Andrew Lee: If I can interject here, a good example of this, and I don't even know if you'd call this a harness, but you see what Anthropic is doing with, I think they call it a supervisor agent? I forget exactly the language they use. But basically, they have a system where you can inject a tool that allows a smaller model to call up into a bigger model. And this is a relatively new thing that they've been talking about. You basically can get close to the bigger model's performance but do the vast majority of the work on a smaller model. And that's a huge win. And if you have those capabilities, why not?
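The escalation pattern Andrew describes, a small model doing most of the work with a tool for consulting a larger model on hard steps, can be sketched like this. Both "models" here are stubs, and the tool name `ask_stronger_model` and routing logic are illustrative inventions, not Anthropic's actual supervisor-agent API.

```python
# Record of escalations, so we can see how often the big model is used.
BIG_MODEL_CALLS = []

def small_model(task):
    # Pretend the cheap model can tell when it is out of its depth and
    # responds with a tool call instead of an answer.
    if "ambiguous" in task:
        return {"tool": "ask_stronger_model", "input": task}
    return {"answer": f"small-model answer to: {task}"}

def big_model(task):
    # Expensive model, consulted only on escalation.
    BIG_MODEL_CALLS.append(task)
    return {"answer": f"big-model answer to: {task}"}

def run(task):
    """Route a task: small model first, escalate only when it asks to."""
    result = small_model(task)
    if result.get("tool") == "ask_stronger_model":
        return big_model(result["input"])
    return result
```

The economics follow directly: if, say, 95% of tasks never trigger the tool, the blended cost sits close to the small model's price while hard cases still get near-big-model quality.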

[44:38] Nathan Labenz: Yeah. Yeah, that makes sense. So when you think about the best available harness and what that looks like, especially as you go to a multi-provider paradigm: how much do you think you're going to be building a harness per model versus trying to keep everything the same across models? Traditionally, one would think, no way can we build this complicated product in a bespoke way for all these different models; we've got to keep it consistent. But obviously the old rules don't apply anymore. So what's your strategy? How much do you tailor the harness to each new model that you want to launch?

[45:23] Andrew Lee: Yeah, this is very much on my mind right now. Ideally as little as possible, because we want to support a lot of models, and it's hard to maintain a thing across all of them. But we also want these agents to be able to switch between models. If you can run on Opus one way and it persists this state, but then you switch the same agent over to some other model and suddenly you have to find a way to translate things, it gets really complicated. We'd like to keep them as similar as possible. I think so far we've been able to, and our approach has been: maybe we'll make some prompting tweaks to try to address issues in one model while not breaking the other model. And so far that's mostly worked. Over time, the APIs of these things have converged, and the basic capabilities have converged. So my hope is that it'll get easier over time, not harder. But I could see us having some very model-specific harness things potentially, and then thinking about ways to do that in a really modular way so it's not a huge amount of overhead. But yeah, definitely something on my mind.

[46:30] Nathan Labenz: Even beyond model capabilities, you alluded earlier to caching primitives being different across providers. So presumably on that level, at a minimum, you kind of have no choice... I mean, you could maybe have the same context, but you're going to have some sort of different implementation, right, for certain things that are just different, that are inseparable from the models.

[46:54] Andrew Lee: Yep. In the example of Anthropic and OpenAI: for OpenAI, they have a very simple caching API, which is basically, they'll just cache any prefix for 24 hours, and they do it automatically. Anthropic has a much more explicit caching API, and you can only set four cache breakpoints in your call, so there's a lot more code making it happen. In this case, we're kind of lucky in that once you've done the work to make Anthropic work, making OpenAI work is pretty easy. But yeah, we do have different code to translate our context into a cacheable context in each case.
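The difference Andrew describes shows up directly in the request payloads. The Anthropic shape below (explicit `cache_control` markers, at most four breakpoints per request) matches their published prompt-caching API; OpenAI caches long shared prefixes automatically, so its payload carries no cache markers at all. Treat the model names and field values as a sketch, not authoritative API documentation.

```python
# A long, stable prefix is the thing worth caching in an agent harness.
LONG_SYSTEM_PROMPT = "You are an agent that automates knowledge work. " * 100

# Anthropic: mark explicit breakpoints (up to four) with cache_control.
anthropic_request = {
    "model": "claude-opus-4",  # illustrative model name
    "system": [
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,
            "cache_control": {"type": "ephemeral"},  # breakpoint 1 of <= 4
        }
    ],
    "messages": [{"role": "user", "content": "Process today's invoices."}],
}

# OpenAI: no cache markers; identical prefixes are cached automatically,
# so the payload is just an ordinary chat request.
openai_request = {
    "model": "gpt-5",  # illustrative model name
    "messages": [
        {"role": "system", "content": LONG_SYSTEM_PROMPT},
        {"role": "user", "content": "Process today's invoices."},
    ],
}
```

This is why the translation work is asymmetric: code that already decides where the stable, cacheable prefix of the context ends (for Anthropic's breakpoints) gets OpenAI caching essentially for free, since any request sharing that prefix is cached without annotation.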

[47:31] Nathan Labenz: So you mentioned, I think, five providers: Anthropic, OpenAI, Gemini, DeepSeek, and Kimi. Not on that list were Grok, whatever the new Meta models are called, and GLM or MiniMax. Are there any others? Where are you drawing the line? How are you thinking about who's in and who's out?

[47:58] Andrew Lee: It is so hard to stay up to date on this stuff. We have the ability internally to test models pretty quickly. It's harder to actually ship things in production because, for example, the way thinking blocks work is different across different providers, and if you have bugs, you might have to tune the prompts and things. So we haven't shipped that many, but we've tested GLM internally. We've tested the Google models, Kimi, DeepSeek, probably some others I'm not thinking of. Most of this is initially vibes, right? You go in there and play around with it and ask: is this close enough to the frontier that we want to put some effort in here? And usually the answer is no. The ones where the answer was, okay, actually this is pretty close to the frontier, so it's worth doing, are Kimi, DeepSeek, and Google, plus obviously OpenAI. But there'll probably be others on that list. I have not been paying a huge amount of attention to Grok. Maybe I should be paying more attention to them. I don't hear a lot of other developers using their models, but they sure seem to be investing a lot. So I don't know, maybe that'll change.

[49:15] Nathan Labenz: Yeah, in my view, we can't count Elon out of any race until he bows out himself. But I would also agree I don't use it much. I just had occasion to use it a fair amount while riding in the Tesla over the last week, and it's not bad, and the voice mode is pretty good. It definitely still feels a little rough, and that's also partly because it's not just the model, it's also the integration. But I would say my experience using Grok in the console of the Tesla is definitely rougher than my experience using Anthropic and OpenAI and Google models.

[49:53] Andrew Lee: Our users are pretty savvy about this stuff. Not everyone, but enough users try this stuff that we start to see requests. I remember back in the day, this is pre-Tasklet, Shortwave days, when we were using GPT-4. At the time, we thought we were on the best model. We'd tested a bunch of stuff. Within a very short amount of time of Claude 3.5 coming out, we started getting a bunch of people emailing us, like, Why are you guys on this old model? And we're like, these people are just misinformed, we're on the best model out there. It turns out they were totally right. We've also watched our users here, and I have yet to see a user being like, Hey, you've got to get on Grok, that's the best model. Although some people have asked for OpenAI stuff.

[50:43] Nathan Labenz: Where do you think things are most likely to diverge? This is another big question. You said a minute ago that broadly you think things are converging in terms of capabilities, which hopefully makes it manageably complex for you to support all these different providers. I also hear the other narrative, that we're starting to see more and more meaningful differentiation, and I honestly don't know which is right. I sometimes feel both ways myself. But if you had to zoom in on particular areas where models would most likely meaningfully diverge over the next period, what would that be? One candidate that comes to mind for me is how sub-agent and team delegation across instances works. One kind of meta point would be that the things nobody's really figured out yet might be where people take the most different strategies, and then they'll converge once there's a winner. Right now it doesn't seem like anybody's got a super awesome way to have many different instances of a model work together. So that's one idea, but what's on your mind as to where they're most likely to be majorly different?

[52:08] Andrew Lee: Within the major labs, everything I've seen tells me that they are converging, and that they are converging because they're watching each other. So, take Opus 4.7. I think basically what happened, this is my flippant response, is that they started to realize that Codex was better than Claude Code for many things, and they were like, Hey, how do we make our model more like Codex? So they made a bunch of RL tweaks to give it a bit of a different personality and make it a little more precise, and then 4.7 feels a bit more like talking to Codex. And I think when Codex got good, it was because of improvements in the models on the OpenAI side, and I think they were watching Anthropic and being like, Oh man, Claude Code got really good at writing code. How do we do that? It seems to me that those two labs are watching each other and trying to mirror each other. 5.5 is much better at general-purpose, long-form agentic tool calling, and I think it's because they're watching over their shoulders. At least those two labs, I think, are watching each other very closely. I see a back and forth there. I am excited about the number of neo-labs that have raised a lot of money and are doing totally different stuff. It would be awesome if somebody came out of left field with a totally different approach. I don't know if you've learned anything about JEPA, Yann LeCun's thing. I finally watched a long-form video on it yesterday. It seems really fascinating, and quite different. I have no idea if it's going to pan out, but there's a billion dollars riding on the idea that this totally different approach to LLMs is going to pan out. I guess we'll find out. And then there are the folks taking the flapping-airplanes approach of using a lot less data.
So it feels to me like all the big major labs are watching over each other's shoulders, and there's a bunch of neo-labs trying totally radically different stuff. That's kind of how I see the lay of the land.

[54:25] Nathan Labenz: So convergence unless somebody manages to shake the snow globe with some sort of algorithmic insight-driven breakthrough.

[54:35] Andrew Lee: That's my guess. On the harness side, actually, I think the harnesses are also kind of converging in terms of capabilities, and largely that's because it turns out the best harness is just low-level primitives. In our case, we don't have super specific stuff for doing e-mail. We have a file system, and a database, and a shell, and a browser that it can use, and some simple primitives around writing to-dos and setting up triggers, but it's all very low-level. There's nothing workflow-specific in there, and I think that's the right approach. So the places where we differentiate are not capabilities; cost and ergonomics and speed are the differentiators.
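The low-level-primitives design Andrew describes can be sketched as a tool registry: the agent gets only general tools (files, to-dos, triggers) and composes workflows itself, rather than being handed a hypothetical `send_invoice`-style tool. Everything here is illustrative; these are not Tasklet's actual tool names or implementations.

```python
class InMemoryFS:
    """Stand-in for the agent's file system primitive."""
    def __init__(self):
        self.files = {}
    def read(self, path):
        return self.files.get(path, "")
    def write(self, path, text):
        self.files[path] = text

# Side-effect logs so a harness (or a test) can inspect what the agent did.
TODOS, TRIGGERS = [], []

def make_toolset(fs):
    """General primitives only. Note what's absent: no email tool, no
    invoicing tool, nothing workflow-specific."""
    return {
        "read_file": fs.read,
        "write_file": fs.write,
        "add_todo": TODOS.append,        # simple planning primitive
        "set_trigger": TRIGGERS.append,  # schedule a future firing
    }
```

The design choice this illustrates: a small orthogonal toolset stays stable as models improve, whereas workflow-specific tools have to be rebuilt for every new use case.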

[55:20] Nathan Labenz: So you mentioned having signed a deal with OpenAI. I'm sure the precise details of that are under an NDA or whatever. But one thing I'm interested in watching, as a point of apparent divergence, is the way they are positioning themselves with respect to products like Tasklet and also open source toolkits like OpenClaw, where OpenAI seems to be really leaning into: you can use your core OpenAI account in these other contexts. So what is that going to look like, and how is that going to complicate life for you? For one thing, if I can log in with OpenAI and bring my own tokens, that totally changes your pricing model, right? Because now you've got more of a traditional SaaS type of business where the intelligence COGS are not flowing through you. I don't know exactly where they are on that, though. I know they allow me to do it with OpenClaw; I haven't seen too many other things around the web. I honestly expected it to come sooner. Maybe they were just compute-constrained enough that they didn't prioritize it, though I've learned that compute-constrained is a good answer for anything. Sometimes it's real, maybe sometimes it's not, but it certainly passes muster as an answer. So should we expect a future where I come to Tasklet and can just connect my OpenAI account and bring my tokens? And how would that complicate or change what you're doing?

[56:56] Andrew Lee: Yeah, it's a good question. Obviously, Anthropic has decided to go the exact opposite direction, and I'm glad we weren't in that situation as they were cutting off people's API access. I guess we want to see how this plays out, and if this is something that's popular and we feel like OpenAI is going to do for a long time, it totally makes sense for us to integrate and let people use their tokens. I do think we provide a lot more value than just being a token reseller, so I don't think it's necessarily a threat. It could actually be a nice onboarding experience for folks. From a competitive position, is there a concern that OpenAI then owns the user relationship, and if they already have an OpenAI account, why do they have an account with us? I think we are maybe a little more concerned now than we used to be. Up until they killed off Sora, I don't know if you remember the big leak around Sora, the impression we had gotten was that they were very focused on their models and on consumer, but they weren't really very focused on business productivity. And you can see that, in my opinion, with AgentKit: when they came out with it last fall, it didn't really feel like they were bringing their A game. So we thought, great, we're competing hard with Anthropic, but OpenAI is focused on consumer and models, and we can run with it for a while on this front. When they killed off Sora and had that leak around, Hey, we're going after business productivity, the scenario that we were worried about, or are worried about a bit, is basically what happened with Codex, where Codex went from an also-ran to arguably the best coding agent in a relatively short amount of time.
If they've brought their A-players over to focus on this stuff, and it seems like a very potentially competitive area, they might start to compete with us in a real big way. That said, we have seen none of this so far. I have yet to talk to a customer who's like, I left Tasklet to go use OpenAI products. So we'll see if that actually shows up, but it could.

[59:09] Nathan Labenz: Yeah, there are so many strange alliances and strange bedfellows and coopetition all around.

[59:20] Andrew Lee: The weirdest to me is the Anthropic-SpaceX announcement: after Elon badmouthing them, and the two clearly competing very hard, they do this big commercial deal. So it's a weird time to be doing deals.

[59:36] Nathan Labenz: Yeah, no doubt. I love to see that, for what it's worth. I have mixed feelings, certainly, about Anthropic. I echo all the positive things you said earlier. I do think their work on the safety front, on multiple subfronts of the safety front, is second to none, and that's pretty much uncontested. The Constitution, it's only a slight exaggeration to say I almost cried when I read it, because I really think that's a beautiful document. The interpretability work that they do is amazing. And yet, if somebody launches a recursive self-improvement loop that gets out of control, I would have to say they're probably the most likely candidate to do it at this point. So it's a very weird thing. But I do love to see closer ties between the leading companies, because if nothing else, it takes the edge off the competition a little bit, right? To the degree that they can share in each other's success, even on a marginal basis, that's a huge win for me. So as much as it's weird, I encourage all this tying of cap tables together. We're all going to rise or sink together, is kind of my bottom line for humanity. So let's start making those deals in anticipation of that reality, and I think that'll probably, in the end, serve us pretty well. Anyway, that's just an editorial aside. One thing that has been counter-narrative recently: I'm sure you've seen the Andon Labs guys that do Vending-Bench, and now they've launched a couple of actual brick-and-mortar, real-world retail stores managed by AI models. They've got the retail store in San Francisco that's operated by Claude, and a cafe in Stockholm that's operated by Gemini. A huge surprise was that they said 5.5 is what they called clean in the way that it runs its business.
Whereas Opus 4.6 and 4.7 they've described as ruthless: willing to lie to suppliers, to do stuff that's not necessarily illegal but questionable, in pursuit of its goal. Where 5.5, they said they didn't see any sign of that. Do you have any interesting commentary on the character of models? Is this something you have to take into account as you build? You could imagine that if one model's ruthless or willing to cut corners and another's clean, that could impact what sort of supervisory systems you might want to have in the harness. So yeah, any observations, any plans on that front?

[1:02:30] Andrew Lee: I had not heard that particular note from them, and this is all purely anecdotal, I've not done any research here, just my own experience, but it kind of doesn't surprise me. My experience with the Anthropic models is that they are much more creative, much more empathetic. They understand the human experience better. The OpenAI models are a bit more clinical, and that comes with its pros and cons. I guess it doesn't surprise me that the one that understands humanity is also the one that maybe shows some of the worst traits. We have not run into any problems here that I'm aware of. No user has been like, Hey, this thing went and did something unethical. Nothing's cropped up. But the personalities, that aligns with my experience too.

[1:03:18] Nathan Labenz: Yeah, that's interesting. They're the most creature-like, for better or possibly for worse. Okay, so one big thing, and I'm using everything, right? I've got a Tasklet max account that I'm maxing out. I've got a Claude Code Max that is kind of my on-my-laptop terminal thing. I do have the Mac mini sitting over on this side that's got another Claude Code and an OpenClaw. And I'm really interested in context beyond the single agent. This is, I think, a frontier for you, but maybe not; I'm not sure if it's something you feel is as important as it has been in my own personal hacking. Do you think you're going to need to build a sort of second-brain type of feature for users that sits at a level, I guess you can think of it as above or below the individual agents, that gives the broader context? Right? I've got ten Tasklet agents running. For the most part, they kind of stay in their lane. They may access some of the same context via tool calls, but they don't have a shared meta-state that's like: here's Nathan, here's all the things he's trying to do, here's what he cares about, here's the people in his life, and in case you run into these people, you can kind of know what's up. And obviously that's really important at organizations too, right? The general situational awareness of: who's on the team, what are our priorities, what did we say no to in the past. Is that something you aspire to tackle?

[1:04:57] Andrew Lee: Yeah. And I swear to your listeners that I didn't prime you to ask this one. So yes, absolutely. We actually have some organizational features that are the start of this live in the product today. We just haven't announced them yet. If you go and look in your settings, you may see that it now says organizations and workspaces, and there's some stuff you can configure in there. We've been laying the foundation for what you described for quite a while, and we're going to have a launch with a bit of fanfare, and there'll be some stuff on Twitter when we feel like it's really ready to talk about, which hasn't happened yet. But you actually can use it now if you want. You can go invite your team and get them on here. The way that we're thinking about it is that there's kind of a hierarchy of context. If you're in an organization, some things are at the organizational level, right? So you might have: what is our company, and what does it do, what's its mission statement, what are its values, and some basic things that you want to control at the organizational level. So you might set some context there. Then you have additional context that might happen at the team level, where you say, Hey, the marketing team, they have access to these resources. They have these goals. These are the OKRs for the quarter. Here are some skills that define the various business processes that we have. Here are some files that are important to consider when doing different things. Here's our brand voice, or whatever. And then in the individual agents, you have very specific things: this is the plan for running this particular workflow, this is a file that was uploaded to this agent, these are the instructions someone gave me specifically for this conversation.
So organization is company-level stuff, workspace is team-level stuff, and the agent has stuff for the specific workflow. We're building everything around this, and today most of the work has gone into the agent. At the workspace level, the only context we share today is your connections. And this is actually super powerful. If you have a company, you can have the lead on your team go and configure connections with all the API keys and headers and whatever to hook up your API access, and then give that to other users. Someone new comes to the team and they don't have to find all the API keys; they can just go in and start talking to their agents right away, and it already knows how to connect to stuff. But we want to add shared skills. We want to add some form of cross-agent memory, so if I talk to one agent and explain something to it, it can remember that for other agents. We want to add probably some shared file system stuff, so you can have documents that are available across any agent. You can do that now if you're connected to Google Drive or something, but we could probably make a much nicer native experience here. So that stuff's all coming. And I think, yeah, shared brain is the way to look at it. This is literally what Zapier launched; I don't know if you saw the product they announced the other day, I think they called it Shared Brain. A lot of what they announced is very in line with the vision that we have as well. I haven't tried it; my hunch is they're farther along on the brain side, but the agents are not as good. That's just my hunch. Hopefully we can catch up and surpass them on the brain side and maintain a lead on the agent side as well. But yeah, it's a huge priority for us, and we're very excited about what we can do here.
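The layered-context model Andrew describes (organization, then workspace, then agent) can be sketched as a simple override merge, with the most specific layer winning. This is an illustrative sketch only; the class names, field names, and merge behavior are assumptions, not Tasklet's actual implementation.

```python
# Sketch of hierarchical context resolution: organization -> workspace -> agent.
# All keys and values here are hypothetical examples.

def resolve_context(org: dict, workspace: dict, agent: dict) -> dict:
    """Merge context layers; more specific layers override broader ones."""
    merged = {}
    for layer in (org, workspace, agent):  # broad -> specific
        merged.update(layer)
    return merged

org = {"mission": "Automate knowledge work", "brand_voice": "friendly"}
team = {"brand_voice": "technical", "okrs": ["Ship shared memory"]}
agent = {"workflow_plan": "Triage inbound leads daily"}

ctx = resolve_context(org, team, agent)
# The team's brand voice overrides the org default, while org-level
# context (the mission) still flows down to the agent.
```

A real system would presumably merge richer objects (files, skills, connections) rather than flat key-value pairs, but the override ordering is the core idea.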

[1:08:36] Nathan Labenz: Yeah. Okay. Cool. I guess maybe let's do a zoom-out, and then we can end with kind of a lightning round of some lower-level esoterica that the real ones will want to hear about, but that's not necessarily as important as the big picture. Where is this all going? I mean, we're in this weird transition point, on a couple of dimensions, right? We've talked about computer use a couple of times, and you've kind of bundled command-line-style computer use in with UI-mediated computer use. And that feels like its own paradigm shift happening under one label, right? Everything is kind of going headless, but at the same time, the models are getting really good at using UIs. So which is going to win? Are all UIs going to go away, or are the models just going to be really good at them? And, you know, maybe it's both. Then, similarly, you mentioned everybody's kind of competing to build the same thing. I've never felt that as strongly as I do right now, where you could probably name 10,000 companies that are, in some not-super-indirect way, competitive, right? You're competing with Claude, but you're also competing with Microsoft Word, and you're competing with Zapier, and you're competing with everything under the sun, and you're competing with human labor. It's endless. So how do you conceptualize where this is all headed? What's the big vision? Where are we 18 months from now, just before the singularity hits?

[1:10:38] Andrew Lee: So a year ago, right before we started the pivot, here's the big thing we were seeing. For context, for people who maybe don't know, we had a product called Shortwave, which was an AI e-mail client. We still have it, actually, but it's not the focus of the company anymore. It had this really nice embedded agent inside, and you could do really cool e-mail stuff. And we realized it wasn't going to be too long before you could take a product like ChatGPT and say, show me my inbox, and it would just generate a UI for your e-mail on the spot. Once that worked well, you wouldn't need an AI e-mail client, right? Because the whole e-mail part would go away. So our entire concept of differentiation, embedding this agent inside a custom-built UI, had a shelf life. The product is actually still growing. It's still doing reasonably well. But in 10 years, I don't think it's going to be around. Probably in much less than 10 years it won't be around, at least not in this form. So we said, man, we can't build a business around an AI agent embedded in a UI. We need to do something else. We said: hey, we're going to build a very general-purpose agent that isn't relying on this, and we're going to go after an agent for a specific type of workflow, these trigger-based knowledge work workflows. Then we built the thing we launched in October, and the feedback from people was: hey, we don't want one tool for workflow automation and another tool for doing our day-to-day work, because we want them both to have the same context. I don't want to maintain two systems that both have all the stuff from the shared brain. I just want one system. So we said, okay, I guess we need to do not just the workflow stuff but the synchronous stuff as well.
And again, when we pivoted out of e-mail, it was: okay, actually, there's going to be some more general product that's going to encompass this stuff. And then again it was: oh, I guess there's going to be some even more general product that's going to encompass this stuff. In March, we launched our Instant Apps feature, which is basically a generative UI feature. The idea is: what if you could generate any UI you want, hooked up to the data in any of your connections, that just works instantly? In a single prompt, you can one-shot anything. Turns out this works really well. It's a super popular feature. Our team just uses the crap out of it. For example, if we do any sort of data science work, we're no longer going into the BigQuery UI or using dashboard tools. We just go into Tasklet and say: generate an explorer dashboard to help us analyze how these pricing changes would affect our users. And it'll just make a thing, and there'll be toggles, and you can tweak thresholds and things. It works. It's amazing. And we said, man, that fear we had a year ago about what would happen with e-mail, that's actually here. You could go into Tasklet today and say, give me an e-mail UI that works. And it will, and it'll work. You can do your inbox in a UI inside Tasklet. It's not as good as Shortwave yet, but it's not going to be that long. So the timeline of these things has actually been much faster than we expected, and it's clear that each area where we felt there could be differentiation has fallen away. I look forward and see no reason why this trend of the general-purpose tool isn't going to continue.

[1:13:54] Andrew Lee: This is all driven by the fact that the models are general purpose. If the best model is best at everything, which I think is increasingly true, essentially for economic reasons, then I think the best harness is going to be intelligent at everything. There'll be some differences in ergonomics, but it will be intelligent at everything. We basically need to assume that the number of these products that win is going to be relatively small. I don't think we're going to have many, many tools that all have AI embedded in them. I think we're going to have a few very horizontal platforms. What we're trying to do is be the AI agent platform that replaces your SaaS products for knowledge workers. Today, most knowledge workers work by switching between tabs, or switching between apps in their dock. Sometimes they're using Word, then Notion, then Linear, then Gmail, going from tab to tab to tab for different things. We think that entire world is going away. Instead, you're going to have one app that has a UI, and it's going to be your AI agent. Hopefully, it's Tasklet. If you want to access some data from one of these tools, you connect to it through an API. If you want to do some interesting analysis, rather than being done by some bespoke business logic in the tool, it's done via code gen: the agent generates the code and runs the analysis. If you want a UI, the agent generates the UI, one-shotting it with a prompt, and gives you the UI you need. We think it can cover basically all of your productivity software. In this world, I basically think there are going to be three types of companies left in the software world. There are going to be the horizontal platforms, of which I think there'll be very few winners, because people don't want to maintain context and connections across a bunch of platforms.
They'll probably just have one for knowledge work and one for coding and maybe one for personal use, but not very many of these, right? So there'll be the horizontal platforms, and we're going to try to be one of those. There'll be headless companies. To give you an example: Stripe. I still think you need to do payments. Payments is really complicated; payments is really important. So you'll probably still use Stripe, but you may not have the Stripe dashboard anymore. There may be no reason to ever go to the Stripe UI. It'll really just be an API tool. And then you're going to have solutions companies, where the software is totally hidden and they're selling you a product. For example, I think you'll still have lawyers and real estate agents. They'll still exist, and they may use AI heavily, but you may not see that. They're going to sell you: hey, we're going to help you sell a house or buy a house, rather than selling you software. So yeah, I think it'll be those three: horizontal platforms, where there'll be only a very small number of winners; headless products; and solutions companies.

[1:17:11] Nathan Labenz: So what happens to something like a Salesforce? They would obviously fall into that, and they just made this big move to go headless. But I wonder. Payments, yeah, there's a lot of depth there: a lot of compliance across jurisdictions, a lot of risk management. It doesn't seem like a general-purpose agent is going to eat that anytime soon. Salesforce, on the other hand, I'm like, what is it really? It's a schema, a very, very complicated schema, that came from the era when you could only maintain one, so you had to make it fully general across all your customers and everything they might plausibly want to do. But most people don't need anywhere near everything Salesforce has built for them to possibly want to do. So it does seem much more realistic for many people to have Tasklet whip it up for them, right?

[1:18:18] Andrew Lee: I think Salesforce is in real trouble. I think a huge amount of the code they have built up over the years is probably obsolete. The value of being a system of record in a world where you have agents goes down a lot, because moving data around between systems suddenly gets a lot easier. There are probably still many headless things you can do that are pretty useful, but the ability to build competing products has gotten a lot easier. They have a lot more competition because you can now vibe-code some of that stuff. So yeah, a huge amount of what they built is obsolete, it's now easier to move to competitors, and there are going to be more competitors. I don't think they're going to die, but I think you're likely going to have a much smaller Salesforce in the future than you do today.

[1:19:07] Nathan Labenz: It strikes me that system of record and really reliable storage are not the same thing, but really reliable storage is a key part of what drives system-of-record value. I have had instances, in building my personal Claude Code local AI productivity stack, where it has, in fact, dropped a bunch of data. I'm trying to export stuff out of Slack, for example, and it realizes: oh, we didn't quite export it right the first time, I'll just delete everything and try again, not realizing that the rate limits meant it took like four days to export what it previously exported. So I certainly value the fact that Slack is not about to delete all my stuff by accident. But that also suggests there's maybe an opportunity for the horizontal platforms, and I know you're a database guy historically, right? Is there an opportunity, or a paradigm shift, where the horizontal platforms say: here's why you can trust us with your data, even if the agents make mistakes, even if this or that goes bad. We're going to have snapshotting, rollback, durability guarantees, such that mistakes can't lead to data loss. It seems like if you could make that guarantee for people, they could get much more comfortable with the idea that they don't necessarily need Salesforce anymore.

[1:20:45] Andrew Lee: Totally. And I think this is a huge place where harnesses matter. Whether the harness makes the LLM smarter, we can debate whether that's true or whether it matters. But can the harness do this sort of thing? Totally. Let me give you a few examples of how I think we can help. One, as you mentioned, is versioning. There's a whole bunch of startups working on file systems for agents right now, and some of those folks are working on versioning. The basic idea is that if your agent goes rogue, you want to roll back to some previous state. In a simple chatbot, you can just throw away the messages at the end. But in something that's touching the world, you've got to be able to roll back the world. For a file system, you can just revert the file system, but if it's touched APIs and stuff, you might need to keep logs of things, which I think is pretty key. So there's a lot you can do there. Another area is oversight and logging, so you have the ability to put a human in the loop in places where it matters, and to do that in smart ways. With our product today, you have to activate tools. One of the things we're going to add soon is the ability to have some tools that you approve on every run. E-mail is an example of this, where people are pretty confident saying: hey, you can read my e-mail as much as you want, you can make as many drafts as you want, but you can't send anything unless I say yes. We want to get to the point where that is really ergonomic. For example, the agent can go crazy reading and searching and making drafts, and then when it's ready to send an e-mail, you get a push notification like: hey, do you want to review this before it goes? And you can say yes, and that's all pushed to you. So I think permissioning can be another big area.
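The approval pattern Andrew describes (reads and drafts run freely, sends wait for sign-off) can be sketched as a per-tool gate. This is a hypothetical sketch; the tool names, the `REQUIRES_APPROVAL` set, and the approval callback are illustrative assumptions, not Tasklet's API.

```python
# Sketch of per-tool human-in-the-loop gating: unrestricted tools execute
# immediately, while gated tools are held until a user approves the call
# (e.g. via a push notification). Names are hypothetical.

REQUIRES_APPROVAL = {"send_email"}  # tools that need explicit sign-off

def run_tool(name: str, args: dict, approve) -> dict:
    """Execute a tool call, pausing for human approval where configured."""
    if name in REQUIRES_APPROVAL and not approve(name, args):
        return {"status": "held", "tool": name}  # draft kept, nothing sent
    return {"status": "executed", "tool": name}

# A user who declines everything: reads still work, sends are held.
always_no = lambda name, args: False
read_result = run_tool("read_email", {"folder": "inbox"}, always_no)
send_result = run_tool("send_email", {"to": "a@example.com"}, always_no)
```

In a real harness, `approve` would block on an asynchronous notification rather than a synchronous callback, but the control flow is the same.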
I think another big area is using code better, in a more reliable way. Take data migration from one system to another. The naive way to do this is to load that data through an API, feed it into the LLM, and have the LLM call some tools to put it somewhere else. When you do that, every time you're putting the data through language model context and trusting it not to hallucinate while reproducing it. The models get better at this over time, but it's very hard to have a lot of confidence there. The better way is to have the model just generate a migration script, and then run the migration script. That gives you an artifact in the middle that you can test and get human approval for. So yeah, if you're moving data from one system to the other, you still want an agent that's thinking through how to solve the problem. But what it should probably do is generate a migration script, generate some tests, run the tests, and then send the thing to the human: here's the migration plan and the code and the tests, and this is why we think it's going to work. Are you okay with this? You say yes, and then we run it. You could even have test environments. The ability to have tools within the agent that let it do really high-reliability stuff, with approval, there's a lot of opportunity there.
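The migration workflow Andrew outlines (generate a script, test it, get approval, only then run it on real data) can be sketched like this. Everything here is an illustrative stand-in: in practice the transform would be model-generated code and the approval step would be an interactive review, not a callback.

```python
# Sketch of the "generate a migration script" pattern: a deterministic
# transform is tested and human-approved before it touches real records,
# instead of piping every record through LLM context. Hypothetical names.

def migrate(records, transform, run_tests, human_approves):
    """Apply a generated transform only after tests and approval pass."""
    if not run_tests(transform):
        raise RuntimeError("generated migration failed its tests")
    if not human_approves(transform):
        return None  # plan rejected; no data is touched
    # Deterministic execution: no hallucination risk per record.
    return [transform(r) for r in records]

# Stand-in for model-generated migration code and its generated test.
transform = lambda r: {"email": r["email"].lower()}
passes = lambda t: t({"email": "A@B.COM"}) == {"email": "a@b.com"}

out = migrate([{"email": "Foo@Bar.com"}], transform, passes, lambda t: True)
```

The key property is that the LLM produces an inspectable artifact in the middle, so correctness rests on tests and review rather than on the model faithfully reproducing every record.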

[1:23:47] Nathan Labenz: Okay, I know time is short. Lightning round. I've got to prioritize. First of all, any vendor shoutouts you'd want to make? You alluded to companies doing rollback-the-world-type storage. Who's out there that you're using, if anybody, that you think is underappreciated?

[1:24:07] Andrew Lee: Yeah, that's a good question. The one vendor we use in a pretty big way that we've been pretty pleased with is Blaxel, which is a sandbox vendor. They have really fast cold starts and good performance, and that allows us to have sandboxes at the very core of our product. So Blaxel's been pretty great. We also use Firecrawl for crawling, and they have some nice performance characteristics. We have looked at a bunch of these storage tech companies, some of the people doing databases and file systems. So far, we've opted to have our own infrastructure here. I don't know if that'll always be true, but there's a trade-off: we think this is pretty core, and if we're going to go with some vendor, they'd better provide a lot of value and be somebody we have a lot of confidence in, with a good roadmap and so on. So far, we've decided to do it all ourselves. Then obviously, the labs, right? The models are amazing. We would not be where we are today without Anthropic.

[1:25:13] Nathan Labenz: How about the possibility of reselling, perhaps on a fractional basis, other services? There are lots of connections, right, where I can go connect my Gmail and connect my personal stuff. Then there's this whole broader universe of tools that I could go have an account with, but I maybe don't, and I don't necessarily want to create one, or they make it somehow difficult to do what I want to do. The classic example for me is Suno. I'm loving generating music these days, but it's not very agent-friendly, and I constantly end up in their UI thinking, this UI should have been an API call. I just want to hear the music. But I also kind of think maybe I could use my Tasklet credits to fund generations with these other services, where it's not a highly personalized service. It doesn't matter if it's my Suno account or somebody else's. Maybe they think long-term it could, but as of now, it doesn't really. So is that something you plan to do, open up a Swiss army knife of things that are paid, but that I access through you via the credits I've bought?

[1:26:17] Andrew Lee: Yeah, I do think we will do that eventually. We've made some very small forays into this already. One is web browsing and search: we use Firecrawl, and you could argue that's us reselling an API. Another one likely to come very soon is image gen. Today you can connect us to Nano Banana and it can make images, but this is such a common use case that we'll probably have some native image gen where you just use your credits and don't have to have an account. I'd love eventually to have something a bit more open here. And, I swear, 10,000 people have emailed me about x402; it just hasn't been a priority yet. So I'd like this to happen. One thing I want to note is that we intentionally have this credit system. The reason we have credits, rather than some fixed number of tokens you can use, is that we'd like to be able to spend on many different types of things. When you spend tokens, fine, that costs you credits. If you generate an image, that costs you credits too. When you search a web page, that costs you credits. When you make a song, that costs you credits. It gives us a nice intermediate currency we can spend on a variety of things.
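The credits-as-intermediate-currency idea can be sketched as a small metering ledger that prices heterogeneous actions in one unit. The prices and action names below are made up for illustration; Tasklet's actual rates are not stated in the conversation.

```python
# Sketch of a credit ledger: one currency meters tokens, images, searches,
# and other paid actions. All prices here are hypothetical.

PRICES = {"tokens_1k": 1, "image": 20, "web_search": 2, "song": 50}

class CreditLedger:
    def __init__(self, balance: int):
        self.balance = balance

    def charge(self, action: str, qty: int = 1) -> int:
        """Debit credits for an action; refuse if the balance is too low."""
        cost = PRICES[action] * qty
        if cost > self.balance:
            raise ValueError("insufficient credits")
        self.balance -= cost
        return cost

ledger = CreditLedger(100)
ledger.charge("tokens_1k", 5)   # LLM usage: 5 credits
ledger.charge("image")          # native image generation: 20 credits
ledger.charge("web_search", 3)  # a few crawled pages: 6 credits
# Remaining balance: 100 - 5 - 20 - 6 = 69
```

The design benefit Andrew points to is exactly this decoupling: the platform can add new billable action types by extending the price table, without changing how users buy or reason about credits.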

[1:27:44] Nathan Labenz: Okay, three more. I'll keep it super quick. What is the ratio right now of your token spend for the purpose of Tasklet development to your payroll? So leaving aside what users are costing you in terms of API calls, just what you are spending via APIs versus on humans?

[1:28:10] Andrew Lee: So let me do some quick math here. I want to note that we have at least three products where we do a lot of internal token spend: Claude, obviously, Codex, and then Tasklet, actually. We spend a lot of money on tokens through Tasklet for our internal processes. I would guess we're at about 5 to 10% of payroll right now in terms of internal token spend.

[1:28:41] Nathan Labenz: How excited are you for Mythos and how big of a difference do you think it's going to make for what you can do and what the trajectory of the business will be?

[1:28:50] Andrew Lee: You know, it's hard. I haven't tried it, right? Not no one, but most people haven't tried it. So it's hard to get too excited about a thing you can't touch. It feels a little bit to me like a marketing stunt: hey, we don't have the compute to actually serve this thing, so let's get some marketing benefit out of it even if we can't. It obviously sounds amazing. The benchmarks look really cool. It claims it can find all these zero-days and stuff. So I'd love to play with it, but I'd be more impressed if I could.

[1:29:27] Nathan Labenz: All right, last question. I'm sure you've taken an interest in the recent CCP-forced unwinding of the Meta acquisition of Manus. And a fun fact about me: I was in the same dorm as Mark Zuckerberg and the other Facebook founders way back when. Not to date myself as we wrap up this podcast, but our 20-year reunion is coming up. He famously didn't graduate, but I think he's probably still invited if he wants to come. If I run into him, how many billion dollars should I tell him is the price tag for Tasklet?

[1:30:01] Andrew Lee: I mean, we've obviously been watching this pretty closely. I actually got a note from Nat shortly before the Manus deal got announced, and we were supposed to get coffee, and then he just never followed up and it never happened. And then the unwinding. I'm very curious how that's even going to work; I don't know what it means to unwind something after they've already been working there for a while. That'll be wild. But I sent another follow-up, like, okay, do you still want to get coffee? He has not responded. So I don't know if they want to chat. It's not hard to find my e-mail address. I'd be happy to talk.

[1:30:38] Nathan Labenz: I'll see if I can plant a seed for you at the reunion. Andrew Lee, CEO of Tasklet, this has been amazing. Thank you for being part of the Cognitive Revolution.

[1:30:47] Andrew Lee: Thanks for having me again.

Outro

[1:34:04] If you're finding value in the show, we'd appreciate it if you'd take a moment to share it with friends, post online, write a review on Apple Podcasts or Spotify, or just leave us a comment on YouTube. Of course, we always welcome your feedback, guest and topic suggestions, and sponsorship inquiries, either via our website, cognitiverevolution.ai, or by DMing me on your favorite social network. The Cognitive Revolution is part of the Turpentine Network, a network of podcasts, which is now part of A16Z, where experts talk technology, business, economics, geopolitics, culture, and more. We're produced by AI Podcasting. If you're looking for podcast production help for everything from the moment you stop recording to the moment your audience starts listening, check them out and see my endorsement at aipodcast.ing. And thank you to everyone who listens for being part of the Cognitive Revolution.

