Watch Episode Here

Read Episode Description

Flo Crivello, CEO of AI agent platform Lindy, provides a candid deep dive into the current state of AI agents, cutting through hype to reveal what's actually working in production versus what remains challenging. The conversation explores practical implementation details including model selection, fine-tuning, RAG systems, tool design philosophy, and why most successful "AI agents" today are better described as intelligent workflows with human-designed structure. Flo shares insights on emerging capabilities like more open-ended agents, discusses his skepticism about extrapolating current progress trends too far into the future, and explains why scaffolding will remain critical even as we approach AGI. This technical discussion is packed with practical nuggets for AI engineers and builders working on agent systems.

Sponsors:
Google Gemini: Google Gemini features VEO3, a state-of-the-art AI video generation model in the Gemini app. Sign up at https://gemini.google.com

Oracle Cloud Infrastructure: Oracle Cloud Infrastructure (OCI) is the next-generation cloud that delivers better performance, faster speeds, and significantly lower costs, including up to 50% less for compute, 70% for storage, and 80% for networking. Run any workload, from infrastructure to AI, in a high-availability environment and try OCI for free with zero commitment at https://oracle.com/cognitive

The AGNTCY (Cisco): The AGNTCY is an open-source collective dedicated to building the Internet of Agents, enabling AI agents to communicate and collaborate seamlessly across frameworks. Join a community of engineers focused on high-quality multi-agent software and support the initiative at https://agntcy.org/?utmcampaig...

NetSuite by Oracle: NetSuite by Oracle is the AI-powered business management suite trusted by over 41,000 businesses, offering a unified platform for accounting, financial management, inventory, and HR. Gain total visibility and control to make quick decisions and automate everyday tasks—download the free ebook, Navigating Global Trade: Three Insights for Leaders, at https://netsuite.com/cognitive

PRODUCED BY:
https://aipodcast.ing

CHAPTERS:
(00:00) Sponsor: Google Gemini [Pre Roll]
(00:31) About the Episode
(05:36) Introduction and Agent Definitions
(09:40) Living the Automated Lifestyle
(14:20) Agent Architecture and Tools
(18:48) Multi-Agent Systems Discussion (Part 1)
(19:02) Sponsors: Google Gemini [Mid-Roll] | Oracle Cloud Infrastructure
(20:41) Multi-Agent Systems Discussion (Part 2)
(23:42) Task Complexity and Ambiguity
(27:45) Performance Optimization Strategies
(31:23) High-Volume Use Cases (Part 1)
(35:17) Sponsors: The AGNTCY (Cisco) | NetSuite by Oracle
(37:31) High-Volume Use Cases (Part 2)
(38:49) Building Better Chatbots
(42:43) Context and Knowledge Management
(45:46) Memory Systems and Simplicity
(49:51) Big Tech Competition
(53:27) Model Tasting Notes
(01:00:15) Agent Architecture Strategies
(01:04:17) Commercial Solutions Lightning Round
(01:08:48) AI-First Product Disappointments
(01:13:47) AI Safety Concerns
(01:19:04) Proof of Personhood
(01:21:04) Future of Scaffolding
(01:22:12) AGI and ASI Future
(01:25:24) Outro

SOCIAL LINKS:
Website: https://www.cognitiverevolutio...
Twitter (Podcast): https://x.com/cogrev_podcast
Twitter (Nathan): https://x.com/labenz
LinkedIn: https://linkedin.com/in/nathan...
Youtube: https://youtube.com/@Cognitive...
Apple: https://podcasts.apple.com/de/...
Spotify: https://open.spotify.com/show/...

TRANSCRIPT

Introduction

Hello, and welcome back to the Cognitive Revolution!

Today my guest is Flo Crivello, CEO of AI Agent platform Lindy. This is Flo's 6th appearance on the podcast, but only his 2nd that goes deep on Lindy specifically.

I reached out to Flo to request this conversation while preparing for my recent presentation on AI Agents, at Imagine AI Live, because I know that Flo lives & breathes AI agents as much as anyone, and yet he always cuts through the hype and shoots me straight on what's working and what's not working, yet.

For comparison, where that presentation, which we ran as our last episode, is a higher-level overview that I hope people will find valuable for structuring their thinking, and helping bring leaders in their organizations up to speed, this conversation is a much deeper dive, fit for AI engineers, tinkerers, and builders of all types, and full of practical nuggets.

Now, if you've been following the AI agent space, you know it's been a bit of a rollercoaster. We've gone from mind-blowing autonomous agent demos like BabyAGI and AutoGPT around the time of the GPT-4 launch in 2023, to a world today in which most "AI agents", and certainly the vast majority of "AI agents" in production, are better described as intelligent workflows in which AIs perform specific tasks that do require fluid intelligence, but nevertheless work within a structure and control flow that was designed and built by humans.

That while it's taken longer to get here than I expected, two things are now becoming very clear:

First, even in this more constrained paradigm, the practical value of AI Agents is becoming undeniable, for use cases across customer service, sales, recruiting, inbox management, scheduling, and information synthesis.

And Second, the more open-ended, choose-your-own-adventure form of AI Agents are also starting to work. Flo describes a few that he's built on the Lindy platform, but you can see this happening everywhere from OpenAI's Operator to Claude Code to Google's Spreadsheet Integration of Gemini, which I recently used for the first time in months and found to be infinitely improved.

At the same time, these are only starting to work, and how quickly capabilities will improve is of course hard to say – Flo, as you'll hear, isn't convinced that we should extrapolate METRs historical trend analysis of AI task length doubling times all that far into the future, and perhaps as a result he's not ready to go all-in on multi-agent systems either.

Beyond that, you'll definitely want to hear Flo's takes on model choice and managing model upgrades, Fine-Tuning, RAG, whether the tools that AI Agents are given to use should themselves be intelligent & dynamic or dumb & deterministic, how Lindy helps users maximize performance by curating successful examples, and how and why … even assuming we get AGI soon, Flo believes that scaffolding will continue to play a critical role in helping humans understand and control AI activity.

On a personal note, I'm also glad to share that I do now have a Lindy doing meaningful work for me. I've previously mentioned that we're experimenting with Sponsored Episodes, but to date they've been very few and far between. Now, with podcast guest pitches from AI companies' PR firms ramping up to an unmanageable volume, I've finally decided to enlist the help of an AI agent to help process them. And so I built a relatively simple agent on Lindy that takes a PR firm email as input, researches the company they represent using Perplexity, evaluates whether they're legitimate AI leaders that our audience would actually want to hear from – which, btw, filters out a large majority – and then for those that pass review, finds contact information for key executives via RocketReach, and drafts personalized outreach emails explaining our sponsored episode offering, which are placed directly into the Turpentine business team's Gmail DRAFTS folder for review and sending.

One could build such a workflow on many platforms, of course – but I do find value in seeing how an AI-obsessed team is thinking about these things, and Lindy does a particularly nice job of having AI set all the configuration values for you, of making sure that the AI has full context at each step, and of allowing you to chat with individual instances of a task.

In any case, this is exactly the kind of task that's perfect for current-generation agents and that every business should be experimenting: well-defined, multi-step, requiring both research and judgment, and ultimately allowing us to scale a valuable activity that we previously couldn't. And of course, I'll be transparent with you, the audience, by always identifying sponsored episodes – today's is NOT one – and I'll maintain my own agency by remaining personally responsible for the companies we choose to feature on the feed.

As always, if you're finding value in the show, we'd appreciate it if you'd share it with friends, write a review on Apple Podcasts or Spotify, or leave a comment on YouTube. And we welcome your feedback via our website, cognitiverevolution.ai, or by DM'ing me on your favorite social network.

Now, I hope you enjoy this no-BS conversation about the state of AI Agents, with Flo Crivello, CEO of Lindy.

Main Episode

Nathan Labenz: Hello, Cremelo, CEO at Lindy. Welcome back to The Cognitive Revolution.

Flo Crivello: Yes, thanks for having me.

Nathan Labenz: Let's talk about agents. It's on everybody's minds. I've been studying the subject from a bunch of different angles, and I knew who I wanted to call to get an honest, real talk assessment of where we are in the development of agents. So, for starters, a rudimentary question, but I wanted to set the terms and hopefully de-confuse the subsequent discussion. What is an agent? What is it? There are so many different definitions, right? Everybody's putting forward their own definition. How much does that matter? And what's the definition that you work with?

Flo Crivello: My favorite definition is Harrison Chase's definition. First of all, I don't think the definition matters. I think you know it when you say it, right? You can operate perfectly fine without being too nitpicky about the definition. But if you really insist upon the definition, I really like Harrison Chase's, the CEO and founder of LangChain's definition. He says it is software which at least part of the control flow is defined by an LLM. That's it. And so I think part of what I really like about that definition, and I think it also pinpoints why it is sometimes hard for people to really define what is an agent, is it's a spectrum, right? The more of the control flow of the software is defined by an LLM, the more agentic the software is. Which I think is the same as for humans, right? So sometimes some humans in companies operate within very tight guardrails, and they're not very agentic. They're not high agency. Some others are life players or whatnot, and have very high agency, right? And if you think of a Jen Chen Wang, the CEO has ultimate agency, right? Yes, that's my definition.

Nathan Labenz: Okay, so that would mean and I think the spectrum of definitions that I've heard maybe ranges from Dharmesh on the one hand said basically anything that you go to that's an AI that helps you get stuff done is an agent. He doesn't care if it's a fully deterministic workflow. And then on the other end, you've got, I heard a good one from Amjad, which is that it's an agent when it decides when to halt. Yours is, I would say, a little closer to Dharmesh's in that any tool call would count, right? And any fork in the flow, like if-else type of logic. As long as that's entrusted to an LLM, you would count that.

Flo Crivello: Yes. I think where my definition overlaps with Amjad's is 'decides'. To me, 'decides' means there is this brain, right? To me, the brain is the LLM. And so if you introduce the decision power made by the LLM inside your workflow, software, whatever you want to call it, it's agentic.

Nathan Labenz: Okay, so not. I think this is something we'll return to as we go. Not too much there in terms of autonomy or open-endedness required, and I think that is. One of the reasons I'm digging in on this is because I feel when I talk to people who aren't super deep down the AI rabbit hole every day, as both you and I are, they have seen your baby AGI type demos or maybe your chaos GPT type demos, and they've latched onto that in their minds, "Oh, wouldn't it be amazing if I could just give a simple instruction to an AI and it could just go do everything and figure it out and come back to me when it's done," right? The old 'come back to me when it's done'. That I feel is leading people often quite astray in terms of what is actually realistic today and also in terms of how much work is really needed to, if nothing else, assemble context for your agent or guide it to the necessary context so that it has a chance of doing what you want it to do accurately. But maybe you're going to destroy my worldview here in a second with telling me that you actually have tremendously open-ended Lindys running your life. So, keep that in mind.

Nathan Labenz: Tell me, what's it like living the automated Lindy lifestyle today?

Flo Crivello: Yes, I wouldn't say running my life, but yes, I have a couple of very open-ended Lindys running. I think this definition of agentic is just on the far extreme of that spectrum that I just defined. When you let LLMs take a lot of the decisions, then it's very agentic. And so that's just this definition. I was thinking about it the other day, because the very first version of Lindy perhaps was overestimating the LLM's capabilities. It was definitely overestimating the LLM's capabilities, and was just these open-ended agents. And so since then, we've actually backtracked and let you set up more of that deterministic scaffolding where you can really force the agent to go through a deterministic set of steps where, suppose you want a customer support agent. You're going to receive a support ticket on Intercom or Zendesk or whatnot, and you want it to check your knowledge base on Notion before entering the support ticket. You're not asking it, you're not asking the LLM, "Please," right? "I beg you, LLM god, go ahead and check this knowledge base." You want that to happen deterministically all the time. You want it to be hard-baked into the cognitive structure of the agent. So now, the current version of our product lets you decide however tight you want your guardrails to be, when you let the agents roam free and so forth. I have some agents. It's funny actually, my meeting scheduling agent is just a pure agent. It's got very little to no guardrails. It's just a very big prompt where I tell it the rules and I tell it how I like my meetings scheduled and so forth. But within these boundaries, it does pretty much anything it wants. I have another agent that, this one's going to sound funny, or very minor, but it wakes up every week. So, every Monday morning. It woke up a couple of hours ago and it checks whether there is a new podcast from my favorite podcasters, who are Darkash and Lenny and obviously, The Cognitive Revolution. And it looks for the podcasts, and then once it finds them, summarizes them, and sends me the summary. That has to be an open-ended agent because this concept of 'look for the podcasts' because there's no one source of truth. So you have to just look for the podcasts, go on YouTube, go on iTunes, just figure it out. And then once you've found it, ping me. So that's an open-ended agent.

Nathan Labenz: One thing I noticed using the product recently is that you've built a bunch of primitives. There are many integrations now. As time has gone on, it's become much more like Zapier, meaning there's an increasingly good chance the app you want to use will be available. But then you have your own primitives, such as "search the web for something." Not too much detail is given, obviously, to make it seem free and easy for the user, and perhaps to keep some secrets. I have a couple questions on that. First, is that an agent? If so, aren't we running into trouble with how we define these things? This is something I'm looking at from many angles, including liability, testing, and even design. What's my responsibility? What's a tool? When I use your Lindy official "search the web" function, should I think of that as an open-ended agent? It seems like it might be doing many steps, and I don't really know what they are, and it just comes back to me when it's done.

Flo Crivello: No, it's literally a Google search. We try to be careful. There's an agent, and then there are the tools the agent uses. In the early days, when we were more exploratory and trying to figure out what an agent was, we experimented a lot with agentic tools. We learned that's a bad idea. You should draw a sharp line between your agents and your tools, and your tools should not be agentic. The reason is, empirically, it just doesn't work well. Second, it makes it very hard to reason about your system because you're introducing complexity. It's hard enough to make one agent work. Now you need to make two agents work and figure out how to get them to interface with each other, which is tricky. So, the search web action is just a Google search.

Nathan Labenz: Okay, interesting. Does that make you bearish on new agent protocols coming out? People are talking about A2A from Google. MCP is meant to be more of a tool, but I'm increasingly seeing smart MCPs. Not often yet, but a familiar example I've mentioned is Augment, which created a clone of Claude Code.

Flo Crivello: Yes.

Nathan Labenz: In the Claude Code blog post, they say, "We have a planning tool, and it can call the planning tool." Augment thought, "We don't have a planning tool. What should we use? Should we make our own?" They found Pietro's sequential thinking tool already packaged as an MCP. So they're using that. They have the coding agent, but it's tapping into this other smart planning MCP, which raises interesting questions about what context it's fed and how much of its thinking you get back. But would you call yourself bearish on all these multi-agent frameworks at this point?

Flo Crivello: No, I wouldn't call myself bearish. I've been very excited about multi-agent systems for a long time. I do think they're a lot younger. It's much harder to make a multi-agent system work than it is to make a one-agent system work with a bunch of tools. It's just much, much harder. I think part of that is the models haven't been ready. I think part of that is the tooling hasn't been ready. And I think perhaps part of that is we haven't really had... I'm sure these protocols are going to help, and I am excited about Google's work here. It seems like something like this will be necessary. It really reminds me of EDI, a very old protocol in supply chains, e-commerce, shipping, and logistics. It lets you formalize your relationship with a supplier. For example, "I bought this from you." Then you confirm the purchase order went through, and then confirm shipping, and if the quantity is different. It's very formally defined. That's the backbone of modern logistics; the whole world runs on EDI. So I think we will need something like that for intelligent communication. I think it will help a lot to make these systems more sturdy.

Nathan Labenz: Okay.

Flo Crivello: Okay.

Nathan Labenz: We're just not quite there yet. Maybe they are in the same spot you were a year ago, a little bit ahead of the game.

Flo Crivello: It's early, for sure. But I think it's close enough to start thinking about them. I have some multi-agent systems that I use daily in production.

Nathan Labenz: Okay, tell me more.

Flo Crivello: It's going to sound like toy examples, but Lindy is in all my meetings. She sits in the meeting, takes notes, and does a bunch of other things. She's my meeting note-taker, which is huge. One thing she does, which might sound weird, is related to candidate interviews. When I chat with a candidate and decide they're not a fit, I decide to pass on them during the interview. Sometimes it's more ambiguous, and I'll talk with the team because I'm the last round of interviews. But sometimes, I'll just veto the candidate. So, what I do is, after the candidate leaves the meeting, my Lindy is still there. I talk to my Lindy and say, "Lindy, let's pass on this guy. Send him an email." So, that's my Lindy note-taker. I have another Lindy I call my Lindy Chief of Staff, and that Lindy does many things for me, exactly this kind of task. She knows what I mean when I say, "Pass on the candidate." She does several things. First, she doesn't send the rejection email immediately; she waits a couple of days. If this candidate was introduced by a recruiter, she also sends a note to the recruiters to let them know. There's a lot she does behind the scenes. So, when I do that, my Lindy meeting note-taker sends a message to my Lindy Chief of Staff to tell her, "Hey, Flo wants you to pass on this candidate. Can you please do so?" That's a simple example, but it's real. I use it weekly.

Nathan Labenz: How gnarly do those things get in terms of the overall control flow?

Flo Crivello: The agent delegation and collaboration itself is just an extra step. Agents as a whole, regardless of the multi-agent system, can get very gnarly and complex, leading to frustration.

Nathan Labenz: Before digging into another example or two, everyone has seen the meter graph that's been the talk of the town. I think we were together the weekend that dropped, or maybe it was at re:Invent when they put it out, and then they circled back to their doubling time for agent task length a little while later. I was thinking of the curve. I know you know it, but they've gone back in history, looked at the task length agents could do 50% of the time, plotted a straight line on a graph, and determined the doubling time is every seven months. More recently, there's been talk that maybe that's even sloping up a bit. Perhaps the doubling time now, if we are indeed in a different regime, looks like maybe four months. What's your thought on the increasingly infamous meter graph?

Flo Crivello: I don't have a very strong opinion on it. I think it's very dangerous. I say that as someone who is very appealed by AGI, and I'm always the one to talk about exponential takeoff and all of that. But I will acknowledge it is dangerous to draw conclusions and drive these lines on a log graph when you have so few data points. In the case of AGI, you actually do have another data point. You have 60 years of Moore's Law, so you know that compute will keep increasing. In the case of AI, at this point, we've gone through I don't know how many, five to 10 orders of magnitude, and we see the scaling law just keeps working. So I think we have enough data points to draw that line on that log graph for AGI. I don't think we have that for agents yet.

Nathan Labenz: So how would you say that that lines up with what you have seen as you've created 1001 Lindys over the last year?

Flo Crivello: I share the empirical observations so far over the last two years. I have seen the same trend described by this line. I don't dispute the past. It is the future projection that I have a question mark on. I just don't know if I can keep drawing this line. I don't know if it's going to be linear or exponential. When we started this, we started pre-GPT-4. We started with GPT-3.5. In hindsight, we were... I wouldn't say too early because it was only two years too early, which I actually think is the right moment to start a startup. The agents definitely didn't work. They really didn't work. It was so dumb. 3.5 was profoundly dumb. Then GPT-4 came out, and things changed. GPT-4 was too expensive, and it wasn't as good as the new models. Now we have Claude 3.7 Sonnet and Gemini 2.5 Pro. These models are incredible, very fast, smart, cost-effective, with huge context windows. I have seen all of that happen over the last two years. Yes, I have very high expectations for the next two years. I don't know if they'll be exponential or as strong as what happened over the last two years, but I do expect agents to get better and better.

Nathan Labenz: So, it's important to keep in mind that 50% versus 99% is a pretty big gulf. Where would you say, in terms of task length or complexity, people should aim if they want to put points on the board as a new user of Lindy? What's something that offers maximum practical value while being confident that you can get this to work? How do you guide people?

Flo Crivello: It's a slight reframe of the question, but I think of it more as task ambiguity rather than task length. I call it a reframe because fundamentally it's the same thing. If you can describe something as a sequence of steps, that's a low ambiguity workflow. I understand this workflow, and I can describe it in this sequence. Again, that's essentially the equivalent of saying it's a short task length because it's just a succession of short tasks. But thinking of it in terms of this succession is important because it allows you to cover a much broader set of work tasks. Few work tasks are just 30-second tasks in and of themselves, but a surprising amount are a succession of two-minute long tasks. With that in mind, I think anything you would feel comfortable giving to an intern with a Google Doc that describes a succession of steps. Honestly, the length of the Google Doc doesn't even matter, because all that matters is the maximum complexity of any of these steps. And I would say even that doesn't matter a whole lot because modern AI agent builders, certainly Lindy, have this concept of human in the loop. So you could just build your Lindy, essentially turning that SOP, that Google Doc, into a Lindy. Then you have a huge Lindy. If you detect that one step is particularly risky, you just toggle human in the loop on that step. Now you insert yourself and likely adapt to something like computer IC or LHF, in-context reinforcement learning from human feedback, such that if you ask for human confirmation on any step in your Lindy, she learns from your feedback little by little. She actually learns really quickly. You've seen the same papers as I have about in-context learning; the in-context learning ability of these models is surprising.

Nathan Labenz: Interesting. So that's a feature that helps people curate gold standard examples by doing it

Flo Crivello: Right.

Nathan Labenz: bit by bit over time?

Flo Crivello: That's right. That's exactly right. You're thinking about it in the right way.

Nathan Labenz: I really like that because it's hard to get people to sit down and create gold standard examples, I find. I have had quite a few adventures on that front, and it's been really eye-opening how some people, you cannot get them to focus and do it. It's very, very weird.

Flo Crivello: Yeah.

Nathan Labenz: That remains my number one tip for performance optimization. Obviously, good clear instructions are key, but presumably most people are at least able to sit down and write a couple paragraphs of instructions for what they want. The gold standard examples typically don't exist, I find, or they're so fragmented across context, or the chain of thought was always in their heads. All those problems really hold people back. So I really like this idea of starting off in a human-in-the-loop paradigm, having people come in and review or fix, then compiling those and building up longer prompts that drive performance by leaning on those examples. That's really good. Is that the number one driver? Is few-shot prompting still the biggest thing?

Flo Crivello: Big time. Absolutely. Much better than instructions.

Nathan Labenz: Any push into fine-tuning? That would obviously be the next step at some point, right?

Flo Crivello: No. Maybe at some point. I just think the models have become so good. I feel like fine-tuning is a bit of a thing of the past, isn't it? We heard a lot more talk about it a year and a half ago than we do today. We used to have a fine-tuning team that we cut, frankly, because part of the issue was the open source models did not deliver in the way we hoped. Fine-tuning went from a world where Lindy didn't work, and we thought the models weren't ready and needed fine-tuning for agentic behavior. Then the models worked, so it just became a nice-to-have. I just don't think the juice is worth the squeeze for fine-tuning for the vast majority of use cases.

Nathan Labenz: OpenAI just put out their reinforcement fine-tuning to a lot of accounts this week. So it seems like they haven't given up on it yet. This has been one of the biggest divergences between them and everybody else, right? Google has made a token effort at most. Claude allegedly was going to allow you to fine-tune Haiku at one point. I still don't think I've ever been accepted into that program. So OpenAI is leaps and bounds ahead of anybody else in terms of their fine-tuning offering, but they must still be seeing something from it, right? To be pushing something like that all the way to production.

Flo Crivello: First of all, it's a very big company. They have a lot of things on the stove. It's hard to infer too much. Also, I'm not saying fine-tuning is completely useless. First, if you operate at scale, that's the first requirement because there's a very high fixed cost to fine-tuning that you have to amortize over a large volume. So if you operate at scale and if you have an important, critical part of your workflow that you are looking to make faster, cheaper, more reliable, and if that part is sufficiently narrow on the task it's trying to perform, such as RAG use cases often have in terms of re-ranking and prioritizing, then you probably ought to fine-tune a small model and insert it in that workflow. I think I have heard that Cursor and Windsurf have at least part of their workflows using a fine-tuned model, but I'm not sure.

Nathan Labenz: I think the point on narrowness is really key. It's tempting in some cases to try to imagine creating a fine-tuned model for our company that does everything for our company, and that, in my experience, is not the way to go. It's much more about nailing down with clarity what the desired behavior is on something that really matters. A good example from the reinforcement fine-tuning docs I was reading this weekend was from healthcare: doctor notes, transcript of the appointment to a diagnostic or even a billing code, which is super gnarly stuff and obviously accuracy is really important there. So that kind of thing often, I think, will work and hopefully will push the frontier of what people can actually do in these various frameworks. But the narrowness piece definitely resonates a lot with me. What would you say are the most, if you were to weight by the actual number of tasks as opposed to the number of Lindys, what's driving the bulk of the value through the system today?

Flo Crivello: If you look at it by task, it will almost be ironic. If you look at it by task, you will see the least important use cases because almost by definition, if there is a Lindy... well, it's not the least important, but it will be very high volume, very small tasks. So most likely, if we do that, it will be an email task or a Slack task. It will be one of those two things because those two things are such high volume. We see people deploy Lindy to automate their email workflows. That's a big use case of ours. So, email triaging, email drafting, if you receive a lot of proposals by email, Lindy can look at the proposal and reject it practically if it's not worth you looking at. We've got a variety of use cases here. That's probably going to be the biggest use case.

Nathan Labenz: That does resonate with me because

Flo Crivello: Yeah.

Nathan Labenz: That is often where I tell people to start: something simple, relatively low risk, high volume, put some points on the board. What if you reweighted by credits consumed?

Flo Crivello: If you reweighted by credits consumed, I think it's going to be one of two things. The first one, built into our credit system, is that we use prospecting APIs for lead generation, and those are very expensive. We charge a lot of credits for that. I could show you, I have a recruiter Lindy that I talk to, and I say, "Hey, find me 30 engineers working in San Francisco at this or that company." She uses these prospecting APIs to find these 30 engineers, and it's 40 cents per engineer per lead. So, right there, that's $12. Then she says, "Okay, I found them." And I say, "Okay, send them all an email." So right there, if it costs me 10 credits per outreach, that's going to cost me $3. That's one. I think the deep research use cases are quite big, and I'm using it as a portmanteau for a very broad category of use cases. Anytime you want your agent to review, consume a large amount of data and then do something with it, I think agents are excellent at that. It's one such killer use case because it's good at reading tokens fast. If a human had to read these tokens, it would be very slow and very expensive. Then she can write a report about that. One of my favorite use cases for Lindy is we have this Lindy that you can basically think of as sitting at the interface between the company and the outside world. She reviews every customer call we have, every prospective customer call, every support ticket we answer, and at the end of the day, she writes a report based on that interface between the company and the rest of the world, which I think is a very important interface. She says, "Hey, this is what's happening. This is what's happening in the sales pipeline. This is what customers are saying. These are the issues we're having in the support inbox," and so on. That's hundreds of thousands of tokens every time.

Nathan Labenz: Is that the same one I interact with when I talk to the chatbot on the site for help?

Flo Crivello: It's not exactly the same, but yes, that Lindy in question does also ingest these interactions.

Nathan Labenz: Gotcha. So when I talk to that, that log becomes an input to the higher level summarizer.

Flo Crivello: That's exactly right. If a lot of people talk about the same thing to that Lindy, that's going to come up in the end-of-day digest. That's awesome. By the way, it sends the digest in the general channel on Slack. It's such an awesome... It's like the heartbeat of the company, right? You can think of it as: ingest all the context, broadcast it back, ingest, broadcast it back. It does that every 24 hours, and as a result, the whole team is in sync. It's really powerful.

Nathan Labenz: My compliments to the chef, I guess I'll say, on the onsite chatbot. It was actually helpful and able to respond in a way that felt like I was actually talking to something intelligent. It strikes me today that I'm not really sure why this is. Obviously, inertia is a powerful force, the old Tyler Cowen, "You are all the bottlenecks." But chat still sucks on most sites, right? It's not good. When I went to yours, I was quite impressed that it was actually a natural conversation. It had relevant answers to the questions, and then at the end, I said, "Can you forward this to the team?" And it said, "Yes. Okay, I've done that. I've forwarded it on to the team." I thought, wow, that's pretty... It felt definitely much more like the future and where I think a lot more people I would have expected to be at this time. Why aren't more people here? Aside from just the general slowness of life, I feel like people have tried, but they've often failed to make these things work as well as a few people have demonstrated they can.

Flo Crivello: Yeah.

Nathan Labenz: What accounts for that in your mind?

Flo Crivello: Well, first of all, and I don't say that to peddle my shit, but that chat is Lindy. We've just spent a long time crafting the platform as builders of Lindy, and then as users of the platform, we've also invested a lot of time in that Lindy. We know how to build good Lindys because we built Lindy. So it's a really good Lindy. It's big, it's got a lot of prompts, the whole scaffolding makes a lot of sense. It injects the right context at the right time from the right source. It's just a complex Lindy that we've spent a long time crafting, and it uses good models. I sometimes suspect that companies, in a misguided effort to save money, are using really bad models for these chatbots, and I think you should not. I think you should just have the best possible model. Well, not the best possible because today there's o1o3 and that's just going to be very expensive, but come on, give your customers a Gemini 2.5 Pro. It's not that expensive and it just performs extremely well. So, yeah, we just know how to build good agents. But I think, thank you. I will take the compliments and I will pass them to the chef, which is me. It's not me, it's the team.

Nathan Labenz: And I think that's an indicator of the current capabilities of these systems, which I agree with you, I think are under-packed. It's really crazy just what is possible today that is not yet really exploited by 99% of businesses.

Flo Crivello: So tell me a little bit more about context. You said it injects the right context at the right times. That's along with the difficulty of getting people to actually buckle down and write some gold standard examples. Generally speaking, the challenge of assembling context or accessing context also seems a constant theme when I talk to people who are trying to implement stuff.

Nathan Labenz: Yeah.

Flo Crivello: Aside from many iterations, what lessons would you say you've learned, or what tips would you give to new users about how to muster the right context at the right time? I think it is a lot of iteration. I think you do enough reps that you end up building an intuition, and I think that intuition is on... There is a balance between using just similarity, vector search, BM 2.5 and all that stuff, to search your knowledge base. And, on the other hand, handcrafting exactly what to search and when and in what knowledge base. The more you use these products, the more you understand where that balance lies. So very concretely, we've got that Lindy chatbot that assists our customers, and the customers ask it all sorts of questions. So we know that if they ask a question about billing, so refunds and how credit works and all that stuff, we've got a specific portion, a segment of our knowledge base that's specifically about billing. So we're going to have a branch there that's like, "Here's a key billing question. Okay, now you consult this knowledge base and this is the kind of query that you draft for this knowledge base." I'll also say, another intuition you build is when to not even use a knowledge base at all. When you are not very conscious about or worried about saving money, I think more and more, look, I hate to be the guy who says RAG is dead, but it's not dead, but it's limping. There's a lot of use cases that we have where we don't use knowledge bases anymore. We just... Or, hey, these are the five or 10 pages we have about billing. It's not that much. It's a couple thousand tokens. These are the tokens, just insert them all at once in your context window. We don't do it for the support part because it does get expensive, at least for now. But anyway, it basically becomes this hybrid between a handcrafted pipeline, a handcrafted RAG pipeline, and a BM 2.5, vector similarity search.

Nathan Labenz: So maybe people should be thinking, "How can I 80/20 or 90/10 this, where I will actually create top categories of situations I want to handle, branch into those, hand curate relevant context, whether it's the five or 10 pages about billing or what have you, and then have one catchall bucket at the end that's, okay for that you can just search through this knowledge base." But then you maybe increasingly pull out of that, and you sort of minimize that bucket as you go. Is that what you would recommend

Flo Crivello: Yeah.

Nathan Labenz: in terms of the iteration cycle?

Flo Crivello: And in the end, it ends up happening very naturally because what ends up happening is, you create your Lindy agent and you deploy it, and then you monitor it. Every so often, you check in on it, you look at what it's done, and then you're like, "Ah, this was really dumb. This is not how you should answer this question." So you go back, right? And you edit the Lindy, you edit the prompt, you add steps, you modify the knowledge base, you tweak it around the edges, and you rinse and repeat. I find it funny that there is a natural instinctive reluctance that people have to go through this loop. There's something about it. I think it's just not instinctive. But when you consider the time that you invest to onboard a new teammate, a human teammate, it's a lot. Training a human takes weeks for a human to fully ramp up. So I actually think agents are easier than humans to onboard. It's just a less natural mode of interaction because a human, you can just go to them and you're like, "Hey, don't do this, do that, move it forward." With an agent, you have to know how to use this editor and you've got to build that intuition that I just mentioned. That's not always going to be the case. Soon we're going to announce something big that's going to make it a lot more natural to iterate and improve on your agents. But I would say, it's just iteration. That almost sounds like a memory module. That's been a space that I've been watching really closely. What's your take on... I mean, there's been a lot, right, of different frameworks for memory, whether it's graph databases. I did an episode on HippoRAG. There's HippoRAG2 out now. Then, of course, there's more inherently neural structures, which could just be a vector database, but... We've got an episode as well on Titans, which is building an MLP into the thing and updating that MLP so that it can retrieve from history. ChatGPT is doing its own thing, we don't know exactly what it's doing, but it's currently got... It's at least got a mix of the explicit saved memories that you can go and read, and then some more vague, nebulous... It'll check in with your chat history and they don't really tell you exactly how that's working under the hood. What paradigms for memory are you most excited about?

Nathan Labenz: Um, I think this is one of these things... Like, I read all those papers. I've seen like the Hippo and Hippo2 papers and so forth, and it's, it's very exciting, but like, I think this is one of these things where number one is the bitter lesson comes for us all. I think as, as models become better at having more context and that's fully utilizing this context, I think...

Flo Crivello: ... like all of these systems just become moot because you can just throw it all in a context window and I think this can be just fine. And also, like, I'm, I'm a big believer in simplicity when it comes to these systems because the more moving parts you introduce into the systems, the harder they are to reason about and to debug. There's this principle of engineering that I really like that states you need to be twice as intelligent to debug a system as you do to, uh, design it in the first place. So if you are operating at full intelligence when you're designing the system, you're gonna be unable to debug it. And I think that's the case of like the... all of these fancy memory systems. It's like, you guys are operating at full intelligence here. I can't figure this out. Like, I have to really sit down to understand these systems in the first place. Like, I can't debug it. I don't think you can either. By the way, that's always the, the problem with, like, these academic, uh, papers, is like none of them is really building with that constraint in mind, which in my experience, when you're building systems that go into production is actually the defining constraint that you need to keep in mind. So with that said, like, my understanding of what ChatGPT is doing, memory system there, is that it's actually the simplest system out there that also is operating at the greatest scale and I don't think, again, to my point, that is a coincidence. And I think what they do is literally just, like, they take conversations, they determine whether there is a memory that's save-worthy in that conversation. If so, they use an LLM to distill down the memory into like a short sentence or two, and then they just inject all of that into the context window. They may go one step further, but honestly I don't think so. They may go one step further which is perhaps assign a, an importance score to the memory, and so you could imagine, like, hey, you've got so many tokens worth of budget in the context window for your past memories, and you're going to prioritize based on like that, that, that, uh, priority score that you've defined before. And then you could imagine, again, you can go slightly more fancy, you can imagine a sort of like decay with time, you know, and so you could, you could come up with like a composite score between like the priority score and like the time score and maybe like the older the memory is or the lower the priority is, like you just, like, allocates, like, fewer and fewer tokens to the memories. Maybe you save, like, multiple lengths worth of representations of each memory. Like that's the kind of thing I'm thinking about, but even like that incarnation of that system, which by the way is, is purely con- just conjecture, um, it's pretty simple. I think it's, it's really simple and I, I think that's just how it works.

Nathan Labenz: It's pretty similar to what O3 guessed when I asked it, um, it, it, it guessed that it was doing some sort of vector search. Well, it kind of went back and forth between, like, distilled and then vector search or just, like, chat history direct into vector search, but it did have a vector search component in its guess.

Flo Crivello: It's not derivable. I would bet you, I'd bet you a lot of money that there is a vector search in there. If there was vector search, it wouldn't be able to retrieve when you say, "What do you know about me?" It wouldn't, it wouldn't be able to retrieve it. Vector search won't let you retrieve that. Unless it's a really fancy rack pipeline, like a high power.

Nathan Labenz: It'd be a custom retriever.

Flo Crivello: It's simple. It's simple. I think it's a very-

Nathan Labenz: They do have a tend- I mean, certainly I think that's a good prior for all the things that the leading companies do. They definitely have a strong bias toward doing the very simplest thing and just applying a lot of compute. So I think you're certainly right to use that as the jumping off point.

Flo Crivello: Yeah.

Nathan Labenz: Um-

Flo Crivello: But, you know, look, th- the other thing is, like, they're all building on shifting ground because, like, the entire underlying paradigm is changing every three months. And so the more complexity you bake into these systems, the more assumptions you bake into these systems and so the more brittle they are to future paradigmatic changes.

Nathan Labenz: Yeah. Interesting. How do you think that, um, will impact the frontier lab vers- uh, you know, API, uh, powered developer as we go into the future, right? There's, there's of course been multiple rounds of the debates around like who has moats, where does value accrue, et cetera, et cetera. It seems like the... I mean, take OpenAI specifically, like they're both going toward chips on the one hand and toward, like, buying Windsurf on the other hand, right, and kind of trying to be a real full stack vertically integrated provider. Do they... you know, is this... do we... how do we escape like total big tech victory, uh, you know, big tech black hole value?

Flo Crivello: Yeah. Um, I mean, I, I really think of Sam Altman as like Bill Gates 2.0 basically. I think in the, the scope and the breadth and the nature of his ambition, he's very similar to Bill Gates. Like, if you study Microsoft's history, it's remarkable, right? Like they started as this like basic compiler and then they almost stumbled upon the operating system. But they didn't... it, it's not like it was just like, oh, pure luck. Like it's... Bill Gates, his modus operandi is very much like we want to own the whole stack. So he was philosophically open to the operating system as well as to the compiler as well to the applications and so forth, right? And so he really studied as his whole charter to like own computers, right? Personal computing, like we own the whole thing, right? And so pair referrals, yep, we're gonna do it. We're gonna do the mice and the keyboard. Like operating system? Absolutely. Like modeling software and like, like defen- like security software and like application layer and like yep, yep, yep, yep, yep, yep, we're gonna, we're gonna do it all. We're gonna own the whole fucking thing, you know? We're gonna be like an index stock on computers. If you believe in computers, you've got to believe in Microsoft, you know? So I think that's what Sam Altman is going for. He's like, "Yep, that too." You know, we're gonna do the compute, we're gonna do the, the API, we're gonna do the applications, we're gonna do the code, we're gonna do it all. We're gonna do it. Now, um, look, I mean history doesn't repeat but it does rhyme. There are patterns in here and in the end, Microsoft, like, did very well for itself but...It's just too big for one single company to own it all. And, like, certainly that's what's happening right now. Like, there is this, like, 800 pounds gorilla, and then there's, like, a lot of, like, smaller players, like, all beating around it. And, like, you know, Cursor is doing very well, Replit is doing very well, Lovable is doing very well. We're doing quite well. Like, the market is just ginormous. Like, this is by far the biggest opportunity in the history of tech and s- software of computing. So yeah, I think it's gonna pan out exactly like that. You're gonna have a couple of very, very, very big players, and then you're going to have a thriving ecosystem around them.

Nathan Labenz: How about some tasting notes on models? You said earlier, at least give your customers Gemini 2.5 Pro. I might say Gemini 2.5 Pro is my favorite model today. That might be a little strong. I certainly wouldn't want to be without any of the leaders at this point. I use Claude and o3 increasingly. I think I pretty much use all three of those on a daily basis. Give me your tasting notes first, especially with an eye toward what is working in the agentic context, and then we can trade notes from there.

Flo Crivello: Yes, I agree with everything you said. I love Gemini 2.5 Pro. It's delightful to see Google finally wake up. I wouldn't even say wake up because they've been aware of this, it's just the machine takes a very long time to get going. I think you should be... What's the saying Gen Zs use? You need to be model maxing. You should talk to all models. I like o3 a lot for very beefy... I use it as a thought partner as well, and it's quite good for that. I think o3 is the first model I've talked to, because I use models a lot as a thought partner, and o3 is the first one that has really blown my mind where I've been like, "Goddamn, this is insightful." It has really opened my eyes to some deep insights that I've really appreciated.

Nathan Labenz: Do you do that just directly in ChatGPT, or is there any other

Flo Crivello: Yes.

Nathan Labenz: intermediate interface?

Flo Crivello: Just ChatGPT. The memory system in ChatGPT is also killer. It's so good.

Nathan Labenz: So the default in Lindy is Claude. How do you decide to make the default? I believe it's called three five, right, as opposed to

Flo Crivello: Three seven, I think. I thought we switched it to three seven. We are considering switching to Gemini 2.5 Pro. We're looking deeply into it.

Nathan Labenz: Okay, unpack that a little more. I've been exploring different agent products lately, and I've noticed there's starting to be a division between three five and three seven. It seems three five is more reliable; we can trust it to do what it's told. Three seven, however, is a little overambitious sometimes, hard to wrangle. Amjad told me a couple of funny stories about what three seven was doing in the context of their

Flo Crivello: Yes.

Nathan Labenz: agent, their app building agent, specifically when they tried to get it to not edit a certain config file and the multiple ways it still attempted to do so despite being told not to, and despite actual barriers being put in its way. I was interested to see that that seemed to be, although I might be wrong, I thought that was the default in Lindy. I guess, one way to put it is, how automated or well developed is your eval machine at this point? Are you going on a set of 1,000 tasks across a bunch of categories where you're able to say, "We know exactly how these things compare on a rubric," or how much room is there still for the proverbial vibe check?

Flo Crivello: A lot of room, and more and more room. I think we've not invested as much as we should have into our eval suite, so as a result, today, we have limited trust in it. It is a signal that we look at, but I think that's also a function of the business. People are using Lindy for so many use cases now, more and more, and frankly more than we foresaw initially, that we're very careful about changing the default model because it's basically hard swapping the brains of your AI employees. It's a big deal. It's like all of a sudden your entire AI headquarters is operating on a different brain. So we're very careful about it.

Nathan Labenz: Yes, that's an interesting challenge. Would you go back if I have a Lindy that's working and I just accepted the default, whatever it was at the time, and you want to upgrade the model? I could see a strong case for, hey, let's go back and upgrade the model that everybody's using, where they just accepted the default anyway. Let's give them whatever we currently think is the best. On the other hand, I could also imagine that could create a lot of chaos, and maybe the alternative would be to freeze all that stuff and set the default to the new one for people going forward. But that sounds like a pretty hard decision to make because you want to bring people into the future. You don't want to have so many versions you have to maintain or worry about. How do you think about how much to change when somebody's not even aware that you might be making a change for them?

Flo Crivello: We take it seriously, for sure. We do it all the time though, so if you create a Lindy and you pick the default model or you don't change the default model, it's not like the default model when you created the Lindy was classic 1.5 Sonnet, hence that Lindy is on 2.5 Sonnet. No, that Lindy is on the default model, and we change the default model all the time. So when we change it, it's not like we have to go back. The Lindys that are using the default model use the new default model. We actually have what we call model labels. So we have default, then we have fastest, which currently is Gemini 2.5 Flash, 2.0 Flash perhaps. Then we have most balanced, which right now is Claude 3.7 Sonnet. Then we have smartest, which right now is o3. And then, if you want, you can also manually pin your Lindy on any one specific model. It's like, I know what I'm doing, I want O3, I want specifically O3. But most of the time when you want O3, you don't really want O3, do you? You really just want the smartest model possible. So we actually recommend that you use the model labels, and then trust us to do our job well, which we do. We've done it all the time, and only once did it go wrong, and that caused us to upgrade our protocols here. It was the very first release of O3, and this is when we also updated our priors on the validity of our eval suite, because I don't know if you remember when O3 first came out, it was very clearly just a reaction to the DeepSeek blowup that weekend. And O3 was not ready. It was simply not ready. It was not a good model. Our evaluation suite was weird. Overall it showed a superior model, but it showed a lot of variance. So we went ahead and swapped out the model, and it did not go well. Our customers who were using that smartest model label reported issues. So we rolled it back same day. It was very fast. So yes, we do it all the time.

Nathan Labenz: Reminds me of the Sycophantocalypse episode that we've recently seen.

Flo Crivello: I think they took much too long to roll back this one. I think that should be part of the post-mortem, right? There's always a time to detection and time to mitigation. The detection was very fast, the mitigation was much too slow. I want to add one more thing about this idea of swapping out the models. Again, that's part of the value proposition. Just imagine if you were still running on GPT 3.5. You shouldn't have to think about that. You should trust us to pick the best model. And sometimes we actually save you money. If and when we swap our default model from Claude 3.7 to Gemini 2.5 Pro, you're going to save money. Your agents are going to be more cost-effective.

Nathan Labenz: We've touched on this a little bit, but maybe just to double-click on it for a second, see if you have any additional thoughts. You could put this in the context of building Lindys or other product builders who are building agents. I've recently seen it seems right now we're in the proliferation of strategies phase still, and I recently did an episode with Andrew Lee of Shortwave who basically said, "We just trust Claude. In Claude we trust, and he said, "we do a very careful job with the caching, because that's critical to make the whole thing economical for us." And they have the best cache hit savings rate in the game. Although Gemini just got into that game in a meaningful way too. But aside from a very careful implementation of the Claude cache, he basically said, "We just load the thing up with tools, let it go to town, and basically have really long episodes. No sub-agents, no handoffs back and forth." And that, he said, gives them the best results. Then on the flip side, you have the OpenAI Agents SDK where a handoff from agent to agent is one of the core abstractions in that toolkit. And I thought Harrison from LangChain also had an interesting point of view on this recently where he was basically, and a little bit more like the OpenAI side, he said there are two kinds of agents. One is the task-specific, dialed in, highly curated context, and maybe you have a bunch of those. Then in front of that you have a different kind of agent that's your facade, the one that faces the outer world, the one that chooses which of those task-specific agents to call on for any given interaction that it might have. And that one maybe also can be a little longer running and have a more global sense of your history, whereas the task-specific one, you don't want to distract with all that. You just want to localize it, give it everything it needs to know, but not too much so that it becomes overwhelmed, distracted, whatever. Any thoughts on is one of those right, wrong? It depends. What do you think?

Flo Crivello: I think it's all the above. I think there's a spectrum of maturity of these different approaches. And today, the most mature, and it's really being deployed pretty fast right now, is this single-agent system that's using some tools and sometimes put on some deterministic scaffolding, and that just works. Then on the other side are many-agent systems, and that is still being defined and does not work nearly as reliably. And then there's another approach which I surmise is the one that Harrison from LangChain is talking about, and it's also the one I believe that OpenAI makes available through its recent SDK, which is somewhat in the middle because nominally, it's a multi-agent system. You've got this passing of the baton from agent to agent as the workflow. But actually when you do that, the agents share the same context. So you can almost think of it, at that point, if you share the same context, you're just one agent going through multiple states and multiple stages of your life cycle. And at that point, it almost seems like a matter of terminology. Is it a multi-agent system? Is it just one agent going through multiple steps and it's just one of the graph-based agent systems? I don't know. But that is also, I would actually say, closer to that side of the spectrum where it's also mature enough to be put into production.

Nathan Labenz: How about a lightning round on commercial solutions that you possibly use or don't use because you rolled your own before they came out or whatever? But I think one of the things people are always looking for is what's a good solution for these different parts of the overall build-out? So let's imagine you're advising an enterprise, let's say, and they're trying to build some stuff. Data acquisition. I don't know if you guys do any data acquisition partnering. Who would you trust? Who would you look to? Anybody in that category?

Flo Crivello: With respect, Scale, Cert, Invisible are the three main players right now. I suspect this is going to be an underwhelming exercise for you, because we got started before much of that ecosystem bloomed. So we had to build, unfortunately, a lot of automation. I don't recommend people to do it, but we had to do it out of necessity and it is not good. I would rather we use what's available because it's better and cheaper.

Nathan Labenz: Are there any parts of what you've built that are top of mind to replace with something commercial?

Flo Crivello: The evaluation suite is zero. We had to build it initially ourselves. I hate it. It's not good because it's not our job to build an evaluation suite. So right now we're looking into Braintrust, and there's also this new French startup, called Basalt, B-A-S-A-L-T. They're doing a really good job so far.

Nathan Labenz: Okay. Say the first one again too?

Flo Crivello: Braintrust. Braintrust and Basalt, B-A-S-A-L-T.

Nathan Labenz: That's right. Yeah.

Flo Crivello: Yeah.

Nathan Labenz: Okay, interesting.

Flo Crivello: Barat.

Nathan Labenz: So I assume you're not using anything such as LangChain, LangGraph, any observability, nothing like that? Everything in-house?

Flo Crivello: Nothing like that. No.

Nathan Labenz: Is there any

Flo Crivello: I don't think we're great building in-house. I would do it again, because I think it's too close for comfort to give it to an outside party.

Nathan Labenz: Do you do your own guardrailing, or is there any sort of mechanism? If I tell Lindy to do something bad, are you just relying on the foundation models to refuse? Or do you have any additional layers? How do you think about that?

Flo Crivello: We also built a feature where you can toggle 'ask me for confirmation' at any point in your Lindy. So we trust the users a lot on that. If you don't want Lindy to send an email with that, don't ask her to send an email with that. If you want her to ask for confirmation, there's one click. You click on 'send email', and then you toggle 'ask for confirmation' and it just works.

Nathan Labenz: How about voice? There's some stuff with calling now as well, right?

Flo Crivello: We do voice. We use ElevenLabs for that. We use Deepgram for transcription. We use Twilio for the phone infrastructure. We don't use any higher level solutions. There's Vapi, Blend, and I forgot the other players, but there are a couple of players there. We just rolled that out because we really did care about meeting a lot of the flexibility that we needed, because that's the beauty of Lindy. You can really create your agent. Every time we looked into the solutions, which we did, they were too opinionated, too high level to be useful for us.

Nathan Labenz: So for ElevenLabs, you're using their voice models for synthesis, but you're not using their call scaffolding? They have call scaffolding as well at this point, but you have your own internal scaffolding.

Flo Crivello: Not even that. We really care about the model agnosticity of Lindy. So in any of the Lindys, and in any steps of your Lindys, you can override the model that this Lindy is using. We really care about that. If we used ElevenLabs full-blown scaffolding, you wouldn't be able anymore to define what model you want to use.

Nathan Labenz: That makes sense. Any other providers in any category that you would shout out?

Flo Crivello: Providers? We're very close to the metal here.

Nathan Labenz: How about any

Flo Crivello: From my corner of the world, I am bearish on LLM apps and agent apps as a category. I don't view a nearly big enough pain point and I don't view a big enough market. I think the market is going to end up being concentrated by a couple dozen players. I could be wrong. I hope I'm wrong. And insofar as there is a pain point, I view it as too closely related to what a Sentry is already doing, for example.

Nathan Labenz: Are there any of those sorts of things where you have seen an AI first or an AI evolution? I recently pitched something that was, 'Oh, it's an AI first Sentry.' I've been out of that game myself for a little while, so I don't know. Maybe Sentry now is an AI first Sentry. But have you seen or have you adopted any products in your technology stack that you would say are notably next gen in their application of AI to these classic product info problems?

Flo Crivello: Right. Obligatory notes about Lindy. We'll set this one aside once and for all, but I use Lindy all day, every day, and it's a life changer. I really like Whisper Flow. I use it all day, every day. It's a life changer. It's basically replaced my keyboard. So for those who don't know, Whisper Flow is this software on Mac, they also released on the iOS app recently, that you dictate to your Mac and it's next level in the quality of its dictation. It also tweaks what you said to match more closely what you would have typed if you'd typed, because people speak differently than they type. So Whisper Flow is incredible. I have built my own shortcut on iOS using the shortcuts app that taps into the Whisper API, and I mapped it to my action button on my iPhone. I use my phone, I press the button on the side and I can dictate. Even though I have a French accent, as you can probably hear, it's subtle, it's flawless.

Nathan Labenz: I had no. That hadn't.

Flo Crivello: I know, right? I'm basically American. It's flawless. It's really good. I was thinking I've been very disappointed by the incumbents here, so very disappointed. I think there are so many apps that are basically begging for LLMs. For example, the Kindle and the books app obviously have no LLM. It's just so obvious. I'm sure there's some IP reason why there's no LLM here, but okay. Social media. I don't understand. I'm part of all these group chats, and I'm sure you are as well, but they're much too active for me. I can't keep track of them. There's way too much going on. Where are all the LLMs? Why isn't there an LLM in there that summarizes the group chat so far? Take Twitter, for example. Why isn't there an LLM? I just tweeted yesterday something that went viral, and there are all these people with very low reading comprehension in your mentions that say something that simply is not what you said. They're attacking a point that you simply did not make. Why doesn't Twitter have a feature that says, "Hey, before you send a tweet..." You can still send the tweet, but maybe there should be a little message here that says, "Hey, this is not what he said." Also, when a tweet goes viral, which is an experience everyone with a modest following on Twitter has had, you get the same points back again and again. It doesn't matter how many times you address the point; people don't read the mentions, which they can't be blamed for. Why doesn't Twitter do that? For example, "Hey, you're making a point that was made and addressed 20 times by the author in the mentions." So now maybe you can respond to the answer he made, and maybe that answer to the answer was also answered. That's my point. So no, I have been very disappointed at the slowness of adoption here in what I perceive to be just obvious opportunities.

Nathan Labenz: Yes. I agree, broadly speaking. Gamma comes to mind for me as one notable exception. I think they've done a really nice job of having a super high shipping velocity and trying every conceivable AI feature, almost. They just released a big update that I actually haven't used yet, but I suspect that they've consolidated a little bit because they had an AI at literally every touchpoint in the product, so much so that I compiled them into a slide at one point. It was like, "Here's all the ways you can integrate AI into your existing product." Maybe a little bit much, but very effective. And it's really worked for them. They've got one of those cursor-like growth curves recently. Okay, so last little stretch here. You are, as we've covered in previous episodes, concerned about the big picture of AI safety. What have you seen, if anything, from the latest models in the wild in terms of bad behavior? We've got the trend, obviously, of jailbreaks being down, but these higher-order bad behaviors seem to be on the rise, whether you want to call those deception or scheming. I think recently with O3 it's been termed hallucinations, but I've been trying to draw a distinction between a hallucination of the old kind, where it would fill in a small detail that wasn't real, versus some of these, what I would call lies from O3. For example, that was not a small detail. I asked you, I gave you some guidance on what kind of Airbnb I might like, and you made things up outright. That was actually my first experience with O3, and O3 and I have been very gradually rebuilding trust since that first loss of trust interaction. Have you seen any of that in the wild or any odd stories to tell or anything that's got your hackles up at all?

Flo Crivello: Yes, I think mostly we're on track for the worst-case scenario, frankly. I think things are getting more concerning, not less. Open source is delivering. The one thing here that puts us on track for the worst-case scenario is that Meta is not doing well in open source. Something's happening, I don't know what, but obviously DeepSeek is crushing it, and they're on the curve. So open source is delivering, and DeepSeek is a Chinese company, and I think we cannot let China win this race, period. I think they're catching up. That's number one. Number two, O3 lying through its teeth. It's insane how much that model likes to lie. It will tell you things. Sometimes you talk to it and it says something credible, and then you ask, "Do you have a source about this?" It says, "Oh, yeah, check out this paper." And then you realize, "No way. The paper is not... This is not at all what the paper says." And then it says, "Oh, yeah, look, I must confess, I heard it in a conversation in the corridor of this seminar." I'm like, "What are you talking about?" So that's another cause for concern: just lying a lot, which is weird. The sycophantic debacle in GPT-4O I think was really bad. If there is one cause for hope throughout it all, it's that we are making really good progress on interpretability. I think the work that Anthropic is doing here is really good, but they're not the only ones doing really good work here. So that's good. But overall, I remain very concerned.

Nathan Labenz: Are you seeing instances, object level, in the Mindy platform? Are users coming to you and saying, "Hey, I selected smartest and that meant O3, and now I got crazy shit?"

Flo Crivello: No, not yet. Knock on wood.

Nathan Labenz: What do you make of that? I expected that answer.

Flo Crivello: That's a good question. I will say that is one thing that makes me update my priors a little bit. If you told me in 2019 or 2020, if you'd given me access to a computer that has Gemini 2.5 Pro on it, or Claude 4.7 Sonnet or O3, and that's all I can do, it's like a glimpse into 2025. And then you'd ask me what's going to happen in the world while these models exist, I would have predicted all hell to break loose. And I would have been wrong. So I don't know. I don't know what's going on. I don't know if it's just a case of slow diffusion of innovation. I suspect that's what it is. It just takes a little bit longer for people to really exploit the systems. Or I don't know if there is something deeper about the world that we're missing here.

Nathan Labenz: Yeah, I'm confused by that. The most flagrant example I've seen from real life was when Sakana published their CUDA engineer and then came back a couple of days later and said, "We got reward hacked." That was a pretty notable one from a company that can do serious work.

Flo Crivello: The concerning thing is lots of the doomer concerns are based on the clear ideas of reinforcement learning. Reinforcement learning really likes to reward hack. If there is an easier way for it to get to its reward, even if it's cheating, it doesn't care about cheating. It doesn't understand the concept of cheating; it just wants the reward. And so that's why a lot of doomers were concerned about the monomaniacal properties of these systems and so forth. That at first did not happen, because at first it was just supervised fine-tuning and all of that. Now actually more and more of these models were back in reinforcement learning. And now all the researchers at these frontier labs talk about and think about how to scale reinforcement learning for reasoning large language models. So that is what is giving rise to the reasoning abilities of the O-class of models, O1 and O3, and even Claude. Much of the improvements in the latest few generations of Claude is because they have beefed up their reinforcement learning part with their training pipeline, in particular for code. Claude is really good for code and so is Gemini 2.5 Pro. It is because they have a part in their training pipeline that is dedicated to reinforcement learning for code. Now if you look at what's happening with Claude 4.5-3.7 Sonnet, you can actually see the reward hacking. You can actually see it. "Hey, can you please fix this unit test that's failing for me?" And it says, "Yes, no problem. It's not true." Which basically removes the unit test. Or, "Hey, the code doesn't transpile. The TypeScript doesn't pass because there's a type issue." It says, "Oh, no problem. Type A." So it basically removes the types. It's like, "Hey, this is not what I was going to do."

Nathan Labenz: Yeah.

Flo Crivello: I've seen it many times myself. "Hey, I'm vibe coding and there's an issue with this component." It says, "No problem," and then just removes the component. So it's reward hacking, plain and simple. So again, I think that should nudge us a couple of points in the direction of the doomer concerns being at least somewhat warranted.

Nathan Labenz: Do you have a point of view on how close we are to meeting things like proof of personhood and various other schemes to say, "Whose agent is this?"

Flo Crivello: Yeah, I think we're pretty close. I actually think there's a big business opportunity. I was having dinner with a friend of mine a couple of days ago, and he had this business idea which I'm not betraying his trust. He's got his hands full. I think he would be glad for someone else to do it. But there's a very big opportunity here. He wanted to build a USB stick that would be a YubiKey. It would have a mic in it, a camera, a fingerprint reader. And it would allow you to jump on a Zoom call. On the receiving end of the Zoom call, you would also need a piece of software. And what this would do is it would correlate the actual sound waves that are captured both by your computer's microphone and the microphone into that USB stick. And it would correlate a bunch of these things to be, "Hey, most likely, it's not going to be fully foolproof, but most likely this is a human on the other side of the line." I think if you did that, you could sell it to a bank or a massive airline. There are a lot of people here that care about identity verification. And you could probably grow into a pretty sizable revenue pretty quickly.

Nathan Labenz: What, if any, questions are burning in your mind around agent dynamics? The first simplest mental model is, "The world is the world. I'll deploy an agent here and then I'll be efficient and it'll be great." Obviously if everybody's doing that and we start to have agents negotiating with agents, or my agent talks to your agent, what have you, negotiations between agents, that seems like a very dynamic system that we don't have great models for. I recently did an episode on the study of Claude learning to cooperate and pay it forward to itself. The flip side of that, of course, would be if it starts to collude with itself. But do you have any, if you could put out a request for research or the biggest questions that you have about what the giga-agent future might look like, what are the big questions you'd like to see answered?

Flo Crivello: The question that is most top of mind for me, because of the nature of what we're working on, is the relative importance of the scaffolding into the model over time. Because what we're doing is we're building the scaffolding. So, is the scaffolding going to grow in importance or is it going to shrink in importance? That is one of the top questions on my mind. So far it seems it will grow in importance because, at least in absolute importance, models and AI are going to become more and more absolutely important. In relative importance, I'm not sure yet. I'm still deciding on this.

Nathan Labenz: That's a good transition to my last question: what does Lindy look like in an AGI or an early ASI world, if you can extrapolate that far into the future? Somebody might say, "A super intelligence, what does it need scaffolding for?" As you said earlier, you're very AGI-pilled, so I'm sure this is something you're thinking about actively.

Flo Crivello: Every day.

Nathan Labenz: How do you? Are we in a wait-and-see mode, or do you have a vision for how you can be a channel by which people access AI that might be legitimately more capable than they are?

Flo Crivello: We definitely think of it all the time. I think the drop-in replacement human worker is coming. But I think that speaks more to the user interface than to the underlying paradigm. I do think AIs will have voices, perhaps faces. You'll be able to talk to them, ask them to do things, and they'll do it very reliably. That doesn't mean the underlying paradigm is just an end-to-end agent or one very big model with a very big prompt. To be convinced this happens, the two research areas I watch closely are new attention systems. Specifically, attention systems that make attention much cheaper, so they are not polynomial. Everything I've seen so far resembles more a hack than a fundamental innovation that truly makes attention much cheaper. That's the first thing. If you can get infinite attention, you truly get an immediate drop-in human worker. Second, dynamic compute. Models that will decide at inference time which of their weights to activate. Gone may be the days of having all these different classes and sizes of models. Maybe you just have one very big model, and you can pass it a parameter for how smart you want it to be, or it decides how smart it needs to be depending on the task. That's also coming; there's a lot of activity in that research area. If both of these things happen maximally well, I think there's a stronger case for the end-to-end agent versus the scaffolding agent. Even then, I still think there might be room for scaffolding for other reasons. Scaffolding will always provide something: extra reliability, extra speed. It will provide certain benefits. But if these things don't happen, I'm then very bullish on the value of scaffolding. So, in this world, I'm thinking of it as you chat with your AI employee, and then something happens. Much of that will depend on the LLM and the model paradigm we're running under at that moment. At the end of that black box—and again, we have many ideas of how that black box will work—what you want to see happen, happens.

Nathan Labenz: It's almost like when the unhobblings become the hobblings again, when the model becomes more capable, it may no longer need the scaffolding, but instead, humans need the guardrails. Maybe the scaffolding serves a future duty as a guardrail, when it's more about limiting what the model can do as opposed to maximizing what it can do. It's an interesting paradigm. All right, we're out of time. Anything else you want to share before we break?

Flo Crivello: No, this was great.

Nathan Labenz: Thanks for doing it. Flo Krevelo, CEO of Lindy. Thanks for being part of The Cognitive Revolution.

Flo Crivello: Thank you so much, Nathan.

The Dawn of Dynamic AI: RFT Comes Online, w/ Predibase CEO Dev Rishi, from Inference by Turing Post

Cracking the Medical Code: Why Cleveland Clinic Doctors Love Their Ambience Healthcare AI Scribe

The Data Factory: Inside the $100B Race for Post-Training Supremacy, with Labelbox CEO Manu Sharma

Living Lindy: a No-BS Conversation on AI Agents with Flo Crivello

Watch Episode Here

Read Episode Description

TRANSCRIPT

Introduction

Main Episode

Read next

The Dawn of Dynamic AI: RFT Comes Online, w/ Predibase CEO Dev Rishi, from Inference by Turing Post

Cracking the Medical Code: Why Cleveland Clinic Doctors Love Their Ambience Healthcare AI Scribe

The Data Factory: Inside the $100B Race for Post-Training Supremacy, with Labelbox CEO Manu Sharma

Living Lindy: a No-BS Conversation on AI Agents with Flo Crivello

Watch Episode Here

Read Episode Description

TRANSCRIPT

Introduction

Main Episode

Read next

The Dawn of Dynamic AI: RFT Comes Online, w/ Predibase CEO Dev Rishi, from Inference by Turing Post

Cracking the Medical Code: Why Cleveland Clinic Doctors Love Their Ambience Healthcare AI Scribe

The Data Factory: Inside the $100B Race for Post-Training Supremacy, with Labelbox CEO Manu Sharma

Cheat on Everything: Cluely's Vision for Always-On AI Assistance