Stripe's Payments Foundation Model: How Data & Infra Create Compounding Advantage, w/ Emily Sands
Today, Emily Sands, head of data and AI at Stripe, joins The Cognitive Revolution to discuss how the company built a payments foundation model that processes tens of billions of transactions into dense embeddings, exploring the technical architecture behind fraud detection improvements and the modular approach that enables rapid deployment of AI across Stripe's $1.4 trillion payment network.
Check out our sponsors: Google Gemini Notebook LM, Linear, AGNCY, Claude, Oracle Cloud Infrastructure.
Read the full transcript: https://storage.aipodcast.ing/transcripts/episode/tcr/d857a4bc-25d2-4859-a7fc-9ad788004690/combined_transcript.html
Shownotes below brought to you by Notion AI Meeting Notes - try one month for free at: https://notion.com/lp/nathan
- Stripe's Scale: Processes $1.4 trillion annually, serving both small businesses and large enterprises (over half of Fortune 100 companies).
- Business Growth Impact: Businesses using Stripe grew 7x faster than the S&P 500 last year through optimizations across the payments lifecycle.
- Payments Foundation Model: Stripe has developed a domain-specific foundation model that converts transactions into vector embeddings, creating a specialized AI for the payments ecosystem.
- Data Advantage: Stripe's competitive edge comes not just from data volume (1.3 trillion annually, growing 38% YoY) but from the compounding feedback loop where better models deliver more value to businesses, driving further growth.
- "Business in a Box": There's emerging potential for AI to handle end-to-end business creation, including selection of third-party tools, payment providers, and other essential services.
- AI Company Partnerships: Two-thirds of the Forbes AI 50 companies already run on Stripe, as they focus on helping AI companies monetize effectively and scale globally.
Sponsors:
Google Gemini Notebook LM: Notebook LM is an AI-first tool that helps you make sense of complex information. Upload your documents and it instantly becomes a personal expert, helping you uncover insights and brainstorm new ideas at https://notebooklm.google.com
Linear: Linear is the system for modern product development. Nearly every AI company you've heard of is using Linear to build products. Get 6 months of Linear Business for free at: https://linear.app/tcr
AGNCY: AGNCY is dropping code, specs, and services.
Visit AGNTCY.org: https://agntcy.org/?utm_campai...
Visit Outshift Internet of Agents: https://outshift.cisco.com/the...
Claude: Claude is the AI collaborator that understands your entire workflow and thinks with you to tackle complex problems like coding and business strategy. Sign up and get 50% off your first three months of Claude Pro at https://claude.ai/tcr
Oracle Cloud Infrastructure: Oracle Cloud Infrastructure (OCI) is the next-generation cloud that delivers better performance, faster speeds, and significantly lower costs. Run any workload, from infrastructure to AI, in a high-availability environment and try OCI for free with zero commitment at https://oracle.com/cognitive
PRODUCED BY:
https://aipodcast.ing
CHAPTERS:
(00:00) Sponsor: Google Gemini Notebook LM
(00:31) About the Episode
(05:49) Introduction and Context
(08:16) Payments Foundation Model
(16:22) Multi-Entity Transaction Patterns (Part 1)
(20:09) Sponsors: Linear | AGNCY
(22:42) Multi-Entity Transaction Patterns (Part 2)
(24:47) Model Architecture Details
(34:02) Ground Truth Challenges (Part 1)
(38:45) Sponsor: Oracle Cloud Infrastructure
(39:55) Ground Truth Challenges (Part 2)
(45:24) Iteration Loop Optimization
(54:32) Multimodal AI Applications
(01:05:11) Building on Stripe
(01:11:54) Agentic Commerce Evolution
(01:16:16) Platform Competition Dynamics
(01:20:42) Outro
SOCIAL LINKS:
Website: https://www.cognitiverevolutio...
Twitter (Podcast): https://x.com/cogrev_podcast
Twitter (Nathan): https://x.com/labenz
LinkedIn: https://linkedin.com/in/nathan...
Youtube: https://youtube.com/@Cognitive...
Apple: https://podcasts.apple.com/de/...
Spotify: https://open.spotify.com/show/...
Full Transcript
Nathan Labenz: This podcast is supported by Google. Hey folks, Steven Johnson here, co-founder of Notebook LM. As an author, I've always been obsessed with how software could help organize ideas and make connections. So we built Notebook LM as an AI-first tool for anyone trying to make sense of complex information. Upload your documents, and Notebook LM instantly becomes your personal expert, uncovering insights and helping you brainstorm. Try it at notebooklm.google.com.
Hello, and welcome back to the Cognitive Revolution. Today, my guest is Emily Sands, head of data and AI at Stripe, the programmable financial infrastructure company that in 2024 processed $1.4 trillion in payments, or roughly 1.3% of global GDP, for everyone from solo entrepreneurs to the Fortune 100, and which continues to grow at a blistering pace.
We begin by discussing the many fascinating details of Stripe's new foundation model for payments and how Stripe is using this model to deliver improved performance across their broad suite of products. While it might seem unassuming at first glance, I would argue that the payments foundation model has several important lessons to teach us.
First, while payments are represented in text, the payments foundation model is not a language model in the familiar sense. On the contrary, payments are treated as a distinct modality, and importantly, no payment is an island. To properly understand a single payment requires Stripe to assemble extensive context, including recent activity associated with multiple entities: the buyer, the card, the device used to make the purchase, and the merchant. So much context quickly becomes overwhelming to humans, but this is exactly where neural networks can shine. And indeed, when Stripe first deployed this model to detect card testing—which is a process that fraudsters use to determine which stolen cards actually work—they saw a jump in their detection rate from 59% to 97%. Obviously, a massive win, not just for Stripe, but for the entire ecommerce ecosystem that collectively bears the cost of fraud.
Now if you've listened to this show for a while, you know that one of my pet theories is that the surest path to superintelligence is to integrate today's reasoning models with models that are trained on other modalities that humans aren't well adapted to understand. I'd say it's safe to say that the payments foundation model is superhuman when it comes to understanding payments, and this conversation left me wondering how many other businesses are training foundation models on their own modalities, as well as how many other interesting modalities might still currently be hiding in plain text. I can imagine that this proprietary modality strategy might work on any number of domains, including health, cybersecurity, logistics, energy, and insurance. But to be honest, I haven't found too many other examples of this strategy being used today.
So if you happen to know of any other foundation models being trained on any interesting proprietary modalities, please do ping me and let me know as I would love to do more episodes exploring this theme.
The next lesson, perhaps as important to Stripe's success as the model itself, is the way they are using it. Rather than trying to design the foundation model to support all use cases directly, they are exposing payment foundation model representations and thus allowing engineers to use them as additional inputs to the many classification and other ML systems that they've already developed. The richness of the foundation model signal makes everything else work better, but doesn't require a major rethinking of existing systems.
Again, outside of social network companies, who I do believe make their user and content representations available in this way, I've not heard of other companies taking this approach, and it seems to me now that more of them should consider it.
Finally, the most important lesson from a societal standpoint might be that AI strongly favors the incumbent platforms that have the data necessary to train such differentiated models. The flywheel that Stripe has created here, which translates their incredible scale to commercial advantage, is allowing them to reduce the cost of fraud for their customers even as fraud is rising across the broader ecosystem. This makes Stripe the obvious choice going forward, which in turn further strengthens their data advantage and product lead. It is genuinely hard for me to imagine how anyone, aside from a few of the world's largest tech companies, could ever compete with Stripe—meaning that even as history begins to unfold at a dizzying pace in many respects, competition in many key markets may effectively come to an end.
This isn't necessarily a problem. I've never supported punishing companies for their excellence, and I've never been convinced that we should break up American tech companies. But it does seem like something that policymakers will need to think long and hard about as they envision the AI future and hopefully begin to imagine a new social contract.
There's a lot more in this episode besides these key strategic insights, including how Stripe is designing processes to iterate quickly enough to stay ahead of fraudsters, including by using LLM-as-judge to fill in missing data, how they ensure reliability in their LLM-powered Talk To Your Data product experiences, how developers can accelerate product development by treating Stripe as their payments database of record, what Emily and team are seeing in agentic commerce today, and how they think about scoping their AI ambitions and investments.
All in all, as you might expect from Stripe, it's a high alpha episode with practical lessons for rank and file AI engineers and big picture implications for executive level AI strategists. So without further ado, I hope you enjoy this deep dive into how smart use of AI is transforming one of the world's most critical financial infrastructure companies with Emily Sands, head of data and AI at Stripe.
Nathan Labenz: Emily Sands, head of data and AI at Stripe. Welcome to the Cognitive Revolution.
Emily Sands: Thanks for having me.
Nathan Labenz: I'm excited for this conversation. Stripe obviously is a globally recognized leader in payments and doing some really interesting things in AI with high standards everywhere, and obviously a lot of shared DNA with some of the big frontier AI developers. So there's a lot to get into today. For folks who want to do a deeper dive into Stripe and the payments ecosystem, you did maybe six months ago now a podcast with our sister pod Complex Systems with Patio11.
Emily Sands: Patio11.
Nathan Labenz: And I would definitely recommend that for folks who want to do a deeper primer on the payments world, which is a fascinating and Byzantine one with many rabbit holes to go down. We won't do nearly as much of that today. We'll stay more focused on some of the cool new AI stuff that you guys are doing. But maybe just for a super quick primer, how would you describe the role that Stripe plays in the economy? And then we'll use that as jumping off point to get into the AI stuff.
Emily Sands: You said payments infrastructure. We started as payments infrastructure, absolutely true. We now build broader programmable financial infrastructure. So in plain terms, we give any business—could be a teenager who's selling a Figma template, or it could be any one of now more than half of the Fortune 100 that run on Stripe—the rails and intelligence to move money online and to grow faster.
So last year, companies processed $1.4 trillion through Stripe. And we'll talk about AI today. Every one of those charges becomes training data for the AI systems that we'll talk about. But that flywheel also means that we're no longer just the payments API. We optimize the entire payments lifecycle. Yes, the gory details are covered with Patio11, but it's the checkout UX, fraud prevention, bank routing, retries, even things like dispute paperwork so that businesses can really keep more of every hard-earned dollar and scale up with very small teams. So we think of the tools we're building as structural growth tailwinds, and we're already seeing it in the data. Businesses on Stripe grew 7x faster than the S&P 500 last year.
Nathan Labenz: Wow. Okay. A lot of good nuggets there. I have been a customer, actually, for what it's worth since maybe not the earliest early days, but pretty early days—like at least 10 years that I've been a Stripe customer with my company, Waymark. So we've seen a lot of the evolution from the customer side.
The biggest thing that has caught my attention in terms of what Stripe is doing with AI is the payments foundation model. And I'd love to just spend a good chunk of time really going into details on that because one of the things that I have been fascinated with and kind of trying to see around the corner and better understand is to what degree are we going to get a form of superintelligence via AIs that become natively capable of understanding potentially a huge range of different modalities.
People are familiar now with image generation. Of course, we had text-to-image generation models. Now those have kind of come together in this really tightly coupled, deeply integrated way with native multimodality and other recent innovations in that space. And so I have this theory that one thing that people really underappreciate is the degree to which training on these other modalities of data is just going to create superhuman capability in these domains that are sort of familiar to us, but also in many ways very alien.
So maybe for starters, what can you tell us about the fundamentals of the payments foundation model? Like, what does the data look like? Obviously, it's transaction data, but give us more detail on that. What is transaction data when you really get into the weeds of it?
Emily Sands: Yeah. And it's a good point. There's been a ton of coverage of the large-scale traditional LLMs and a lot less coverage of domain-specific foundation models, of which the payments foundation model is one. For us, it's been really a step function change in the speed and quality with which we can deliver all of those optimization solutions I talked about in auth, in fraud, in disputes.
At its core, it's a transformer model that turns every payment—so the tens of billions of transactions that run through Stripe—into a compact vector. So it's like giving each transaction its own latitude and longitude. And then once you have that map, you can use it for all sorts of downstream tasks—to figure out what's fraud, to figure out how to authenticate, to figure out what's a valid versus invalid dispute—without having to train a new model from scratch every time.
And I think what makes it work, the reason you can build a domain-specific foundation model in the payments context, is Stripe's scale. So we process about 50,000 new transactions every minute. And at that density, payments start to look, in a lot of ways—not in all ways, but in a lot of ways—like language. So there's kind of a syntax to a payment. There's the card bins and the merchant codes and the amounts. And then there's sort of an analog to semantics—how a device or card gets reused over time.
And so in the same way that language transformers are learning embeddings and words with similar meanings clustered together, the premise of the payments foundation model is: what if every charge or sequence of charges had its own vector in a similar space?
So the inputs are, you're right, just the raw payment signals as they come in—the card details, the merchant categories, the IPs—but also those sequences. So what a given card or device or merchant bin or customer has been doing in the last few minutes or the last k transactions. And it's actually that history that turns out to be a huge unlock.
And then from those inputs, the model produces an output, which is just a reusable embedding—a dense vector for each payment or short sequence. And then we can layer lightweight classifiers on top for real-time detection. We also have a slower, higher latency variant that generates explanations through a text decoder. And I think we'll get to a stage where that can be real-time-ish as well, but we're not there just yet.
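The architecture Emily describes, a frozen encoder producing a reusable embedding with a lightweight classifier layered on top, can be sketched in miniature. Everything below is invented for illustration (the field names, the hash-based "encoder" standing in for the real transformer, the weights), but the shape of the pattern is the same: one forward pass to embed, one cheap head to score.

```python
import math

# Toy stand-in for the frozen payments encoder: maps raw payment
# fields to a dense unit-normalized embedding. The real encoder is
# a transformer; this hash projection is purely for illustration.
def embed_payment(payment: dict, dim: int = 8) -> list[float]:
    vec = [0.0] * dim
    for key, value in payment.items():
        h = hash((key, str(value)))
        vec[h % dim] += ((h >> 8) % 1000) / 1000.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

# Lightweight task-specific head: a logistic classifier over the
# frozen embedding. In practice the weights come from supervised
# training on labeled outcomes (fraud / not fraud, etc.).
def fraud_score(embedding: list[float], weights: list[float], bias: float) -> float:
    z = sum(w * x for w, x in zip(weights, embedding)) + bias
    return 1.0 / (1.0 + math.exp(-z))  # score in (0, 1)

payment = {"card_bin": "424242", "mcc": "5734", "amount_cents": 30, "ip": "203.0.113.7"}
emb = embed_payment(payment)
score = fraud_score(emb, weights=[0.5] * 8, bias=-1.0)
```

The key property is that `embed_payment` runs once per charge, while many different heads can reuse the same embedding for different tasks.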
Nathan Labenz: Cool. Okay. There's already a number of interesting things there. In terms of scale, the blog post that introduced the payments foundation model said tens of billions of transactions, and then it also indicated hundreds of subtle signals. Could you give a couple of examples of the long tail of these signals that illustrate just how much information the model is able to ultimately take in that might be hard for a person to represent? Because we can classically handle about seven items in working memory. So what are we missing with our feeble human working memories that the model is able to take in? And from there, I'm kind of interested in the overall scale of data—it sounds like it's getting into the trillions of tokens, which would be not at the high end of text foundation models, but not too far off, maybe like one order of magnitude less. So I wanted to sanity check my estimates with you on that.
Emily Sands: Yeah. Your math is legit. I'll answer the second question first and that first question second. Yes, your math is legit and the data is very different from the free-form text that you'd use to train a model to write like Shakespeare. Payments data is highly structured and dense, and so we actually built a custom tokenizer that compresses the numeric and categorical signals really efficiently. So yes, the dataset is big, but it's also packed with this purpose-built information that's incredibly rich for the set of tasks that we care about in our context.
You asked about what's hard for a human to eyeball. I think the thing that's hardest for a human to eyeball is looking across those dimensions, not within any one payment, but within any combinatorial sequence. If you think about it, what you need to look at in order to figure out if a fraud attack is happening or how to get a payment authenticated has very little to do with that particular transaction and everything to do with where that transaction sits vis-a-vis the transactions that have come around it.
So you're not looking at a single screen. You're looking at a clip of a movie. But there are a lot of different clips that include that screen that are relevant to look at. Like, you want to know what I was doing. You want to know what the merchant was doing. You want to know what my card was doing. You want to know what my IP was doing.
And so that's really where the model sings—making it really efficient not just to look at the individual payment. It's hard to do with the scale of 50,000 a minute, but a human could, I suppose, if you had enough humans. It's really about the sequences that make the problem intractable for humans, but also very hard for traditional ML approaches where you have to hand-engineer features to capture what's happening in each of a range of different sequences.
And so in our context, the foundation model pays off dramatically because it expands really three things. One is how much data we can learn from. We can learn from literally all of Stripe's history, not just a task-specific subset of history. It changes how richly we can learn. These dense embeddings capture very subtle interactions that manual feature lists, like counter-features, wouldn't capture. And then third, which is more about how we work internally, it changes how efficiently we can build. Once you have a shared embedding, then spinning up a new model becomes a weekend project, not a quarter project, and that means we can open the aperture for the types of ML-powered solutions we can build.
Nathan Labenz: Yeah. Cool. I really like the idea of multiple clips. And I take it that that just basically reflects the reality that there are obviously multiple parties to any transaction, and I'm kind of inferring that the pattern of behavior of each of those different parties is really where the strong signal is. It's not the—if you looked at this particular transaction in isolation, you might not get much. But when you combine recent history for all of the parties to a single transaction, the combination of those recent histories is really what tells you what you need to know. Do I have that right?
Emily Sands: Right. There's nothing about me using my card in Boston that tells you it's fraudulent. But if I just used my card on my device at my home IP, which is, by the way, in Palo Alto, not in Boston, and I tend to be buying things that are totally different from what you suddenly see someone doing in Boston, that's a red flag that that's actually fraudulent use of my card.
Or conversely, if you see someone rotating across a small number of cards to buy thousands of accounts from a given AI provider, maybe the card is truly theirs, but you're almost certainly going to see some sort of reseller refund abuse happening where they're trying to steal your compute.
And so it becomes more complicated when you add more entities like the merchant, where there can actually be internal collusion happening. And so you're exactly right—it's not about how any one entity acts in isolation. It's about: an entity is an individual or a card or a merchant, and that's like the node. And then the edges are the transactions. And it's like, how much sense do these edges make in relation to each other and in relation to the combination of edges that we've seen in the past?
Nathan Labenz: Does that go out—I can imagine that that web could extend easily farther, or you could imagine including the rendered judgment on previous transactions. So for example, if I am trying to buy something from you and the model is looking for the signal of fraud, you could also say, okay, well, all these transactions that you have recently done as a seller, maybe you just have the determination if they were fraud or not fraud, or you could even look at who were all those buyers, what's their—so how far out does this path through the graph have to go to get you the—what is the shape of the curve in terms of scale versus diminishing returns?
Emily Sands: Yeah. Totally. So I talked about the scale of the Stripe network—it's like this $1.4 trillion. But it's not just a big network. It's also a very dense network. So for example, 92% of cards that a merchant sees for the first time, Stripe has seen before on another merchant. So okay, well, in those cases, you don't have to do very many hops. Although you do want to validate that nothing's changed about the card or how the card's being used in the time since.
But fraud and conversion are kind of tail events in some sense too. If you can get 1% more conversion or 1 or 2% less fraud, that goes a long way. So you get really far from the dense network, but you also want to be able to traverse wide for more novel traffic that you see.
Nathan Labenz: Hey, we'll continue our interview in a moment after a word from our sponsors.
AI's impact on product development feels very piecemeal right now. AI coding assistants and agents, including a number of our past guests, provide incredible productivity boosts. But that's just one aspect of building products. What about all the coordination work like planning, customer feedback, and project management? There's nothing that really brings it all together.
Well, our sponsor of this episode, Linear, is doing just that. Linear started as an issue tracker for engineers, but has evolved into a platform that manages your entire product development lifecycle. And now they're taking it to the next level with AI capabilities that provide massive leverage. Linear's AI handles the coordination busy work—routing bugs, generating updates, grooming backlogs. You can even deploy agents within Linear to write code, debug, and draft PRs. Plus, with MCP, Linear connects to your favorite AI tools: Claude, Cursor, ChatGPT, and more.
So what does it all mean? Small teams can operate with the resources of much larger ones, and large teams can move as fast as startups. There's never been a more exciting time to build products, and Linear just has to be the platform to do it on. Nearly every AI company you've heard of is using Linear, so why aren't you? To find out more and get 6 months of Linear Business for free, head to linear.app/tcr. That's linear.app/tcr for 6 months free of Linear Business.
Build the future of multi-agent software with AGNTCY, a-g-n-t-c-y. Now an open source Linux Foundation project, AGNTCY is building the Internet of Agents—a collaboration layer where AI agents can discover, connect, and work across any framework. All the pieces engineers need to deploy multi-agent systems now belong to everyone who builds on AGNTCY, including robust identity and access management that ensures every agent is authenticated and trusted before interacting. AGNTCY also provides open, standardized tools for agent discovery, seamless protocols for agent-to-agent communication, and modular components for scalable workflows. Collaborate with developers from Cisco, Dell Technologies, Google Cloud, Oracle, Red Hat, and 75 more supporting companies to build next-gen AI infrastructure together. AGNTCY is dropping code, specs, and services, no strings attached. Visit agntcy.org to contribute.
Nathan Labenz: So architecturally, this sort of reminds me a little bit of some of the stuff that Meta has done with their joint embedding video models. I'm not sure if that is the right intuition for me to have, but it does seem like there's a clear difference here where you're not trying to predict the next token. It seems like it would be more of a dedicated—so it's not like an autoregressive type model. It seems like it would be more of a dedicated encoder, like a masked type situation where you could imagine doing a training setup where it's like mask out whatever randomly and have the model learn to fill in details.
Emily Sands: Yeah. Exactly. So our v1 did use a BERT-style masked modeling setup that you're talking about. And then we paired that with a second stage, which is explicit similarity fine-tuning. And so most of the heavy lifting there was: okay, let's curate the right sequences to learn from. Let's build the right encodings. Let's do post-training with that kind of similarity objective so that near neighbors in this payment space cluster together and the oddballs separate, and you can start to reason about those oddball clusters.
And again, the big unlock here was modeling short histories—what a card or device or merchant bin is doing over some number of minutes or last k transactions rather than the isolated payment.
Now in what we call v1.5, we're actually moving towards encoder-decoder setups and compressed memory sequences. So a few vectors together—that makes it actually easier to catch subtle abuse in real time because you're not averaging across noise. You're distilling the full story into these compact representations.
And so the mental model is like v1: masked modeling plus similarity training. V1.5 is compression first with a tight sequence embedding. And then we can put lightweight task-specific heads on top, which are important for the charge path use cases. If you think about the charge path, you've got to get the job done in tens of milliseconds at most. And so those lightweight task-specific heads are important for latency and speed.
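The v1.5 mental model, as I understand it from the conversation, can be sketched roughly: per-payment vectors for a short history are compressed into one sequence-level embedding, and a small task-specific head scores that in a single cheap pass. Mean pooling below is a stand-in for the real compressed-memory scheme, which is not public; all vectors and weights are invented.

```python
# Compress a short history of per-payment embeddings into one
# sequence-level vector. Mean pooling is only a placeholder for
# Stripe's actual compressed-memory representation.
def compress_sequence(payment_vectors: list[list[float]]) -> list[float]:
    dim = len(payment_vectors[0])
    return [sum(v[i] for v in payment_vectors) / len(payment_vectors) for i in range(dim)]

# Lightweight task-specific head on top of the sequence embedding:
# here just a dot product with (hypothetically learned) weights.
def sequence_score(seq_embedding: list[float], weights: list[float]) -> float:
    return sum(w * x for w, x in zip(weights, seq_embedding))

# Last k=3 payments on one card, each already embedded (toy 4-d vectors).
history = [[0.1, 0.9, 0.0, 0.2], [0.2, 0.8, 0.1, 0.1], [0.1, 0.9, 0.1, 0.2]]
seq = compress_sequence(history)
score = sequence_score(seq, weights=[1.0, -0.5, 2.0, 0.0])
```

Because the head is just a small dot product over a precomputed compact vector, it fits comfortably inside a tens-of-milliseconds charge-path budget.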
Nathan Labenz: Yeah. Can you say how big the model is? I mean, ten milliseconds doesn't allow it to be that big, I would assume. Although they don't have to do a lot of steps in the forward pass.
Emily Sands: I don't have that off the cuff, but the task-specific heads on top are small. And when you reason about it, all you have to be able to do actually is place the new charges and sequences as they come through in this dense embedding space as they come in, which is a much easier problem than obviously the upfront training.
Nathan Labenz: Yeah. And it's also just one forward pass of the model, as opposed to having to generate a whole sequence. So that one-pass nature of it definitely helps with the latency as well.
That's really interesting. It also reminds me of one of the first vision-language models that I studied deeply, which was the BLIP family of models. And I remember that they had really amazing success with a frozen language model and then also a frozen vision model and just trained a few million parameter connector between the two to sort of bridge from one latent space to the other. And of course, we've gone way past that now in vision-language, but this was an early 2023 thing. And you were able to get really quite good captions out of that setup even though neither of the foundation models that were used had anticipated that use case.
Emily Sands: Been a while since I looked at BLIP, but that's an interesting analog.
Nathan Labenz: But it sounds like you've kind of created a similar situation where people internally at Stripe can say, okay, I have a new use case idea for this. I can train something really small. You mentioned it can be like a weekend project instead of a multi-month project. And if I understand correctly, the idea is because you've got the foundation work done, you can train a few million connector classifier head, whatever you want to call it, very quickly. That step becomes a rapid iteration step.
Emily Sands: Yeah. Exactly. And actually it can be even simpler. Where most modelers start, actually, is by just taking the embeddings themselves, which are stored in Shepherd, our feature engineering platform, and literally just adding them as a feature to existing models, whatever those existing models are, and saying, is there added signal from these embeddings?
And I would say you get some false negatives there where obviously just shoving the raw embedding into some number of state-of-the-art models that have been iterated on over the last six years isn't going to produce uplift. But in other cases where it's a lower priority model that's only in its v1 state, you actually do get something straight out of the gate, and you can start to reason about—we've been talking much about the payments embeddings—how much signal do I get from understanding the payment better? How much signal do I get from understanding the customer better? How much signal do I get from understanding the merchant better for each of these use cases? And then that's also motivated which applications folks have leaned in harder on.
Nathan Labenz: Yeah. That's really interesting. So just to make sure I understand that correctly, you've got—obviously Stripe's been around for a number of years. There have been many types of problems that you've brought machine learning to over time. Typically with a more classical feature engineering type of approach.
Emily Sands: Yep. Hundreds of production models. Point solutions.
Nathan Labenz: And so now the foundation model embeddings can become just tacked on as additional features, rerun that training, and immediately backtest against your set, and then you're like, okay, cool, we just made this better kind of for free because we were able to get additional signal. That's really interesting.
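The pattern just recapped, widening an existing model's hand-engineered feature vector with the foundation-model embedding and retraining, is simple enough to sketch directly. Feature names and values below are invented; "Shepherd" is only referenced in the comment, not simulated.

```python
# An existing model's hand-engineered features for one charge
# (invented names, purely illustrative).
hand_engineered = {
    "amount_zscore": 1.7,
    "card_txn_count_1h": 3.0,
    "merchant_decline_rate": 0.02,
}

# The payments foundation model embedding for the same charge,
# as it might be fetched from a feature store (toy 8-d vector).
payment_embedding = [0.12, -0.40, 0.33, 0.05, -0.11, 0.27, 0.0, 0.19]

def build_feature_vector(features: dict, embedding: list[float]) -> list[float]:
    # Keep a stable feature order, then append embedding dimensions
    # as extra columns (emb_0, emb_1, ...). The downstream model is
    # retrained on the widened vector and backtested as usual.
    return [features[k] for k in sorted(features)] + list(embedding)

x = build_feature_vector(hand_engineered, payment_embedding)
```

The appeal is that nothing about the existing model's architecture changes: it just sees a wider input, and the backtest reveals whether the embedding carries incremental signal.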
Emily Sands: I mean, our standout application wasn't that. It was card testing where we literally—it was a whole new approach to card testing with the foundation model.
Card testing is when fraudsters try hundreds of tiny authorizations, iterating across stolen cards or literally just doing raw enumeration, trying a bunch of cards. And they bury those attempts inside floods of legitimate traffic. A big retailer can have hundreds of thousands of charges come through, and then there's a couple hundred or maybe a thousand peppered fraudster charges of 30 cents or 50 cents. Classic models couldn't really pick up those needles in the haystack.
And so our first application of the foundation model was just treat those sequences, again, like frames in a movie. And suddenly, these 200 nearly identical requests—same low entropy user agent rotating across proxies coming in about every 40 seconds—they light up as an island in the embedding space and they get blocked.
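The "island in the embedding space" intuition can be illustrated with a density-based clustering sketch on synthetic embeddings. DBSCAN here is a stand-in for whatever Stripe actually uses, which the episode doesn't specify, and all the shapes and parameters are invented:

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(1)

# Hypothetical embedding space: 5,000 legitimate charges spread widely,
# plus ~200 near-identical card-testing attempts forming a tight island.
legit = rng.normal(scale=5.0, size=(5000, 8))
attack = rng.normal(loc=20.0, scale=0.05, size=(200, 8))
points = np.vstack([legit, attack])

# Density-based clustering: the attack island is the only dense region,
# so its members get a cluster label while diffuse legit traffic is noise (-1).
labels = DBSCAN(eps=0.5, min_samples=20).fit_predict(points)
flagged = labels != -1  # members of any dense cluster get blocked
print("flagged:", flagged.sum(), "of which attacks:", flagged[5000:].sum())
```

The point of the sketch is the shape of the problem: individually, each 30-cent authorization looks unremarkable, but as a sequence their embeddings collapse into one dense cluster that stands out against the background traffic.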
And the impact of that one was huge. Our detection rate of card testing at large merchants went from 59%, which is not bad but not great, to 97% from that change. But then as we started reasoning about where else it could be useful, yes, just exposing the embeddings and letting them be added as features to these traditional single-task models was the next step. And that was never intended to be the final state, but it's a way to get signal on where there is incremental value or incremental signal from these embeddings that requires very little lift.

Nathan Labenz: Yeah. Fascinating. That's a very modular approach to AI deployment, and I can't recall hearing of any organization that has had a similarly modular structure. Maybe Meta comes to mind as another one that might have a user model that could then be bridged over to any other space or problem you might want to apply it to. But this is a fairly uncommon setup, I would say. Are there others in the space?
Emily Sands: Don't know exactly what happened or how it went or how it worked, but I know because I know the former leader, I know there was a Cortex org at Twitter that was basically doing horizontal models. Again, don't know exactly how it worked or the architecture.
We've been talking about it in the context of payments, but we actually did the same thing over the last year in the merchant space. We have what's called the merchant intelligence team, and it basically has this MI Serve. It can go out and find anything on the web about a merchant and generate embeddings and be used to answer questions. And those merchant embeddings are also features in downstream, for example, merchant risk models.
But it's a service where the model owner can ask Merchant Intelligence, the agent, to come up with a more custom embedding or more custom insights. So maybe you want to know what payment methods the merchant offers or whether they have anything that's counterfeit. And that's actually been another horizontal layer that's provided a ton of leverage for Stripe.
Because historically, you've got a lot of use cases. You want to know things about the merchant to understand supportability—whether they meet the requirements of the card networks and the issuers and the banks. You want to know whether or not they're fraudulent. You want to know whether they've had an account takeover. You want to know if they're creditworthy. You want to know whether or not we should give them Stripe Capital, like a loan, and you want to figure out whether you should be going to market with them. Like, all sorts of things you want to know about a merchant.
And historically, teams at Stripe were, when LLMs hit the scene, out building their own custom versions of this. But what we realized is there's actually just one service now that does that much more efficiently than everyone rolling their own.
Nathan Labenz: The "better" lesson strikes again. I've got a lot of different directions I want to go, but where does ground truth come from on some of these questions, and how long does that take? Because I sort of imagine, especially in a fraud detection situation—fraudsters, I always assume, are going to be some of the most clever people in the world, diabolically so. But nevertheless, you've got to respect the smarts of some of these folks.
So I assume that they are very savvy to real world events. You mentioned, I think in the conversation with Patrick, that somebody might have a flash sale and that sort of spike. Obviously, you don't want to turn them off when they're having a flash sale because that's a horrible experience and loss of business for the company running the flash sale. But at the same time, that's potentially a really good target for a card tester to come in and try to do whatever it is they want to do. And you're sort of, I imagine, in a kind of eternal arms race between fraud and fraud detection.
And then what little I know from my experience as a consumer and as a business owner is, to actually close the whole loop and get to the point where whether this was fraudulent or not has actually been set in stone—that's a long process. So how do you deal with that?
Emily Sands: Well, it's a long process if you even get to a definitive answer. But something like card testing—I said the first thing we did with the foundation model was deploy it for card testing. Actually, the first thing we did with the foundation model is deploy it internally for card testing, pass those labels to internal expert humans, have them go and validate the labels, then feed the validated labels into our traditional ML model for card testing. And suddenly, our traditional ML model—don't deploy the foundation model—our traditional ML model for card testing started doing way better because finally, it had a more comprehensive source of truth for the labels. So that was actually the first version, although I hadn't revealed that fun fact before.
Yes, attackers iterate and so do their models. And so our job is just to iterate faster. And we are, and I'll talk about some of the ways we get around the late-arriving labels or the missing labels altogether.
But just to give you a sense of how we're comparing to the fraudsters: industry-wide, ecommerce fraud is up. I think it's up like 15% year on year. But the dispute rate for the businesses that are running on Stripe is down 17% year on year. And that's because we, in a bunch of different ways, are just consistently shortening the loop between a new tactic showing up and our defenses adapting. And that sort of loop shortening is happening in production and in some cases in real time.
So an example that our users are getting a ton of value from, which we recently released, is dynamic risk thresholds. It's basically like, Radar's out there, it's got its threshold score, block stuff above the threshold. But then when an attack starts, Radar learns that an attack has started, and it tightens the defenses. So it throttles. And that allows revenue to flow freely when you're not under attack, but then we're much more aggressively blocking when an attack arises. Because again, an attack is almost never a single event. It's almost always a true cluster. And in that case, the model is learning the policy of how to act. Now it's not learning that policy online just yet, but it is learning the policy of how to act.
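A toy version of a dynamic risk threshold might look like this: watch a rolling window of scores, and tighten the block threshold when an elevated slice of traffic suggests an attack cluster. The window size, thresholds, and trigger share below are all made-up numbers, not Radar's:

```python
from collections import deque

def make_dynamic_blocker(base_threshold=0.9, tight_threshold=0.7,
                         window=100, trigger_share=0.4):
    """Sketch: if too many recent scores look elevated (a likely attack
    cluster), drop the block threshold until traffic looks normal again."""
    recent = deque(maxlen=window)

    def decide(score):
        recent.append(score)
        elevated = sum(s > tight_threshold for s in recent) / len(recent)
        threshold = tight_threshold if elevated > trigger_share else base_threshold
        return score >= threshold  # True means block

    return decide

decide = make_dynamic_blocker()
# Calm traffic: even a 0.8 score passes under the relaxed 0.9 threshold.
calm = [decide(s) for s in [0.1, 0.2, 0.8, 0.3]]
# Simulated attack: a flood of 0.8 scores trips the trigger and tightens
# the threshold, so the same 0.8 score now gets blocked.
attack = [decide(s) for s in [0.8] * 50]
print(calm, sum(attack))
```

This captures the trade-off Emily describes: revenue flows freely in peacetime, and the same borderline score is treated very differently once it arrives as part of a cluster.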
Another powerful tool—and in payments, it's easy to think I put in my credit card and then just an objective decision is made to block me or not. But that's not actually true. And so we've been leaning in harder on what we call soft blocks. Adaptive 3DS is an example here. It applies that 3DS authentication. So if you're in the US, most of the time you don't get 3DS. But we can...
Nathan Labenz: Can you tell me what that is? Because I don't feel like I know. You might have defined it in the Complex Systems episode, but I could use a refresher.
Emily Sands: Yeah. You just have a second—like a 2-factor auth sort of experience where you're verifying to the bank network or the credit card issuer that it is in fact you. And this is very common in Europe, very uncommon outside of Europe. And by the way, when it does happen, it often creates unnecessary friction.
And so part of what we do at Stripe is figure out when we need to authenticate and when we don't. But also, with Adaptive 3DS, we are pushing for authentication selectively in cases where we have a sense that the charge may not be good. And so instead of just having this binary decision of block, don't block, you have this other arm you can go down, which is hit them with 3DS.
And what ends up happening is the good guys get through the 3DS because they're excited to buy the thing and they're legit users, and the bad guys do not.
And so a lot of the AI companies are using this. AI companies being hit with fraud is extra painful because their marginal costs are high, unlike for SaaS companies who care a lot less. And so early adopters of Adaptive 3DS were like ElevenLabs and Character AI, and they're able to just dramatically cut down fraudulent disputes without any effect on conversion because 3DS isn't super heavyweight for the end user.
In fact, US checkout users—it's a little different in Europe because a lot of those folks are already on 3DS—but US checkout users saw a 30% average drop in fraud, and they just turn this on with a single click in the dashboard. And then it lets us basically learn the policy of who is worth 3DS'ing to balance conversion and fraud to maximize their profits.
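One way to think about the three-arm decision Emily describes (allow, block, or challenge with 3DS) is as an expected-value comparison per charge. The margins, fraud costs, and 3DS completion rates below are invented for illustration; the real learned policy would be far richer:

```python
def choose_action(p_fraud, margin=20.0, fraud_cost=80.0,
                  p_good_completes_3ds=0.95, p_bad_completes_3ds=0.05):
    """Sketch: compare expected profit of allowing, blocking, or
    challenging with 3DS. All numbers are illustrative assumptions."""
    ev_allow = (1 - p_fraud) * margin - p_fraud * fraud_cost
    ev_block = 0.0
    # Under 3DS, legit buyers mostly complete; fraudsters mostly abandon.
    ev_3ds = ((1 - p_fraud) * p_good_completes_3ds * margin
              - p_fraud * p_bad_completes_3ds * fraud_cost)
    return max([("allow", ev_allow), ("3ds", ev_3ds), ("block", ev_block)],
               key=lambda t: t[1])[0]

print(choose_action(0.01))   # allow: fraud risk too low to add friction
print(choose_action(0.30))   # 3ds: risky enough to challenge
```

The key property is the asymmetry: because good users mostly survive the challenge and bad ones mostly don't, the 3DS arm dominates a hard block across a wide middle band of risk.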
Nathan Labenz: Hey, we'll continue our interview in a moment after a word from our sponsors.
Today's episode is brought to you by Anthropic, makers of Claude. Claude is the AI for minds that don't stop at good enough. It's the collaborator that actually understands your entire workflow and thinks with you, not for you. Whether you're debugging code at midnight or strategizing your next business move, Claude extends your thinking to tackle the problems that matter.
Regular listeners know that Claude plays a critical role in the production of this podcast, saving me hours per week by writing the first draft of my intro essays. For every episode, I give Claude 50 previous intro essays plus the transcript of the current episode and ask it to draft a new intro essay following the pattern in my examples. Claude does a uniquely good job at writing in my style. No other model from any other company has come close. And while I do usually edit its output, I did recently read one essay exactly as Claude had drafted it, and as I suspected, nobody really seemed to mind.
When it comes to coding and agentic use cases, Claude frequently tops leaderboards and has consistently been the default model choice in both coding and email assistant products, including our past guests, Replit and Shortwave. And meanwhile, of course, Claude Code continues to take the world by storm.
Anthropic has delivered this elite level of performance while also pioneering safety techniques like constitutional alignment and investing heavily in mechanistic interpretability techniques like sparse autoencoders, both internally and as an investor in our past guest, Goodfire. By any measure, they are one of the few live players shaping the international AI landscape today.
Ready to tackle bigger problems? Sign up for Claude today and get 50% off Claude Pro, which includes access to Claude Code when you use my link, claude.ai/tcr. That's claude.ai/tcr right now for 50% off your first 3 months of Claude Pro. That includes access to all of the features mentioned in today's episode.
In business, they say you can have better, cheaper, or faster, but you only get to pick two. But what if you could have all three at the same time? That's exactly what Cohere, Thomson Reuters, and Specialized Bikes have since they upgraded to the next generation of the cloud, Oracle Cloud Infrastructure.
OCI is the blazing fast platform for your infrastructure, database, application development, and AI needs, where you can run any workload in a high availability, consistently high performance environment, and spend less than you would with other clouds. How is it faster? OCI's block storage gives you more operations per second. Cheaper? OCI costs up to 50% less for compute, 70% less for storage, and 80% less for networking. And better? In test after test, OCI customers report lower latency and higher bandwidth versus other clouds. This is the cloud built for AI and all of your biggest workloads.
Right now, with zero commitment, try OCI for free. Head to oracle.com/cognitive. That's oracle.com/cognitive.

Nathan Labenz: So one way I like to frame some of these conversations is just in terms of practical lessons that people can apply in their own AI pursuits. So one takeaway there is add middle ground outcomes to your classifiers so that they're not binary, but try to find that sort of middle space where something other than the model itself can step in to help resolve the most challenging cases. It's almost like Claude now can sometimes end the conversation if it needs fundamental information...
Emily Sands: If there's fundamental information that could help you, and you can get it from your users in a low cost way, don't constrain yourself to being a modeler. Be a product thinker and go figure out how to get that information. And the model's really good at deciding—that's not brute force. You don't require that additional information from everybody, but then let the model decide where it needs more information and where it doesn't.
Nathan Labenz: Yeah. On the adaptive threshold concept, this suggests a state—basically a world state—that is maybe being fed into the model. I assume it's not like the model itself is calculating that on the fly. This would be a more global variable sort of thing that the model would receive?
Emily Sands: No. So it's actually like, hey, this merchant is starting to see clusters of scores creep up. Isn't that interesting? And when we look at the subset of transactions that have those higher scores—maybe they're still below the block threshold, but they're looking elevated—is there anything about those that looks like it's something collusive or coming from a small number of attackers or rotating across IPs or coming from a geography that they haven't seen before?
And then once we get signal that it looks like there's a slice that's an attack, you can actually start to lower the threshold from that subset for what it takes to block.
Nathan Labenz: But is that—that's all happening with the same short histories that you described previously, or is there a longer—it just sounds like there's a longer history at some point coming in to inform that kind of decision. But maybe the short history could be enough.
Emily Sands: There's a longer history, but less at the individual transaction level. It's basically detecting anomalies in slices of traffic. So this geo, these bins, this cart size—something anomalous is happening here. That anomalous thing has kind of elevated risk scores. Hey, it reads a bit like an attack.
And actually, what's interesting—rules are really good in a lot of ways. And so maybe another general lesson is rules are good, but they're also blunt. So figure out where you can blend rules with models.
You'd asked earlier when disputes actually come in. Disputes are super lagged. They can take days. They can take months. I'm the cardholder. I have to see my bill, notice I didn't buy the thing, tell my bank. My bank has to go and file with the network. And so those labels for sure arrive late, but we don't wait. We use proxy signals, and those weak labels show up way earlier—all the way to real-time issuer feedback.
And it can be very—so real-time issuer feedback would be like a CVC mismatch. The CVC code, the little 3 or 4 digit credit card code, doesn't match, or the ZIP code doesn't match. It'd be easy to write a blunt rule that said if the CVC doesn't match or if the ZIP code doesn't match, block it. But you'd be blocking a bunch of good revenue, because who doesn't sometimes fat finger their CVC or their ZIP code in a hurry or on their phone or whatever.
And so we have these risk-based Radar rules, which are like: take the model score, combine it with the issuer's real-time responses, and make a decision based on that intersection. So if it's looking marginally risky and the CVC is wrong, for sure block. But if it's a pretty known good user and they fat fingered a thing, let them through.
And I think that blend of rules and models is—I think it's easy for modelers to put their nose up at rules, and it's easy for rule makers to put their nose up at models that aren't fully explainable. But in plenty of contexts, blending the two actually does far better.
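The risk-based rule pattern Emily describes, blending a model score with the issuer's real-time responses, might look something like the sketch below. The thresholds and the specific decision logic are illustrative assumptions, not Stripe's actual Radar rules:

```python
def radar_decision(risk_score, cvc_match, zip_match):
    """Sketch of a risk-based rule: combine the model score with the
    issuer's real-time CVC/ZIP responses rather than hard-blocking on
    either signal alone. Thresholds are invented for illustration."""
    if risk_score >= 0.90:
        return "block"                      # model alone is confident
    mismatches = (not cvc_match) + (not zip_match)
    if risk_score >= 0.50 and mismatches:
        return "block"                      # marginal risk + issuer mismatch
    if mismatches == 2 and risk_score >= 0.20:
        return "block"                      # both signals wrong, some risk
    return "allow"                          # known-good user who fat-fingered

# A trusted repeat customer who mistyped their CVC still gets through:
print(radar_decision(0.05, cvc_match=False, zip_match=True))  # allow
# The same mismatch on a marginally risky charge is blocked:
print(radar_decision(0.60, cvc_match=False, zip_match=True))  # block
```

The design choice here is exactly the blend being described: the rule stays legible and auditable, while the model score decides how strictly the rule is applied.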
Nathan Labenz: So you actually do let transactions go through with a wrong CVC?
Emily Sands: Nathan, like, I know you're good. You've bought from this person before. Maybe you even use the same credit card. You're coming from a legit IP. And I feel good about you in a lot of ways. And yeah, the issuer comes back and says, hey, there's a mismatch. And we say, hey, let it through.
And then by the way, once we let it through, we also have to get the issuer to let it through. And there, we actually have data sharing with the issuers where we pass them our risk scores so that they can also understand why we passed it through, and that motivates them to also pass it through when they see our signals. So it's kind of a two-step.
Nathan Labenz: Very interesting. Let's go back to how you are tightening the iteration loop. Again, I think this is something that basically everybody who's developing AI products could stand to get better at. So what have you guys found to be effective needle movers in shortening your cycle time?
Emily Sands: This isn't one for us where it's like there's some magical reinforcement learning that we need to be implementing online for every single use case. I think it's actually been quite context dependent. The things that matter are having enough labels and having good labels and having those labels fast enough.
And actually, you can get pretty creative about what the label is. We talked about some examples. We also talked about human-generated labels. But another thing that we've been leaning into is LLMs as a judge.
So especially for contexts where there actually is no source of truth. A simple example: we've been talking a bunch about fraudulent disputes, but there's a lot of suspicious payments that never result in a fraudulent dispute. Maybe the person starts a free trial and then they cancel, or they ask for a refund, or they just spin up a bot account but never even get to the checkout page. That type of friendly fraud is actually really costly to businesses, and almost half—I think 47% of businesses—say friendly fraud, which is a total misnomer because it's not friendly, hurts their business more than stolen card credentials or what most people think of as fraud.
And that cost of friendly fraud is particularly true for AI companies. Very different than SaaS. Again, they have inference costs. They have compute costs. Therefore, they have very high marginal costs. And so when someone is engaging in free trial abuse or reseller abuse or refund abuse, it's super expensive to their unit economics.
Anyway, so built on the foundation model, we now have these suspicious payments that we identify. So these are fraudulent-ish things, but not in the traditional "going to result in a fraudulent dispute" sense. And when we pass those over, for example, to the AI companies, we want to be able to describe to them why they're flagged as suspicious. So it has an enumerated email or it is cycling through a small number of IP addresses or whatever. Those explainers are generated by the foundation model.
But then the question, of course, is like, well, how do you know if they're right? And so we have this LLM-as-judge that sits on top that looks at every transaction-label combination and asks, given everything you know about this transaction and everything you know about the cluster to which it belongs, how do you feel about the quality of the label?
And what ends up happening is that there's a large share of labels that are good enough, trustworthy enough that we pass them over to AI company du jour to decision on. And there's some small number that are too noisy, and we're like, okay, we've got to go work a little bit to make that label stronger.
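An LLM-as-judge triage loop like the one described might be structured as follows, with a toy deterministic judge standing in for the real LLM call. The function names, threshold, and transaction fields are all assumptions:

```python
def triage_labels(labeled_txns, judge_fn, pass_threshold=0.8):
    """Sketch of LLM-as-judge triage: for each (transaction, label)
    pair, ask a judge how trustworthy the label is; ship confident
    labels downstream, queue the rest for strengthening.
    judge_fn stands in for a real LLM call."""
    shipped, needs_work = [], []
    for txn, label in labeled_txns:
        confidence = judge_fn(txn, label)
        (shipped if confidence >= pass_threshold else needs_work).append((txn, label))
    return shipped, needs_work

# Toy judge: trusts a suspicious-payment label only when the transaction
# actually exhibits the cited signal (e.g. an enumerated-email pattern).
def toy_judge(txn, label):
    return 0.95 if label in txn["signals"] else 0.4

txns = [
    ({"id": 1, "signals": {"enumerated_email"}}, "enumerated_email"),
    ({"id": 2, "signals": {"rotating_ip"}}, "enumerated_email"),
]
shipped, needs_work = triage_labels(txns, toy_judge)
print(len(shipped), len(needs_work))  # 1 1
```

In the real system the judge would see the full transaction plus its cluster context, but the control flow, ship the trustworthy share and rework the noisy remainder, is the part the sketch aims to show.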
But I call out that example because there's no source of truth. Like, I could—you and I could manually go through, I guess, but at the transaction level, we're not going to. And so it's been really helpful to have LLMs kind of as a judge where there's no clear north star.

Nathan Labenz: Yeah. That's really fascinating, but I'm still kind of confused about one thing—well, I'm probably confused about a lot of things. But the thing I'm focused on being confused about right now is: when I try to advise people on AI broadly or when I try to give people the lay of the land, one of the things I tell people is AIs are not very adversarially robust. They are really good these days at the happy path. If you dial in the performance and you control the inputs, you can in many, many cases get to superhuman performance on routine tasks.
However, if you don't control the inputs and you're exposing your AI system to the world, you do have to be mindful about the fact that these systems are not adversarially robust. People can usually find some weakness. And that's even been true—we did an episode once on superhuman Go playing AIs that were beaten by really simple attacks that no human would ever fall for, but which the AI, even though it was superhuman when playing Go in the normal way against other high quality Go players, it was just totally blind to this certain class of attack that was found through adversarial optimization.
And so it seems like you would be in this environment where you've got it on hard mode kind of everywhere because anybody can come test the system from kind of any position. You can't really deny people the ability to try a payment. So they can kind of gray box you. They can test from a bunch of different angles and try to see what's going to get through, what's not going to get through. And presumably, there's always some vulnerability that you're not aware of that they can systematically attack or try to find through these attacks.
And my guess would then be the only way to really deal with that is to just constantly be identifying and iterating. But that sounds still hard. Despite everything you've told me, it still sounds hard to be as responsive as you would need to be given especially that the actual ground truth is so lagging. So how do we not—maybe we do—how do we not just bleed a ton of money in one incident after another as attackers figure out that there's some gap and then just jam as much as they can to exploit it for a while until it's closed? How does that not end up being a huge problem?
Emily Sands: Yeah. A couple thoughts. One is we expose capabilities through products and APIs to our users, not through raw weights. So for sure, anybody can try a payment and test and see if they can do a workaround. But just to be clear, we're not actually exposing the model for them to have an attack surface against. So I think the products and the APIs actually better meet the user needs, and they also kind of narrow the potential attack surface. So just clarification one.
I think when you reason about it, what's the relevant alternative? What is a fraudster's job? A fraudster's job is find loopholes and exploit the system. That's what they make their money on. And so the relevant alternative isn't perfectly airtight. The relevant alternative is baseline approaches.
And actually, when you start to think about foundation models or LLMs, like the payments foundation model, the type of information it's using to make decisions is actually a lot more nuanced. Versus—you could think of an early transaction fraud model that's using last 7-day counters. And the fraudster figures out that as long as I'm 8 days out, I'm safe. I'm just going to do everything on day 8 and then hit them hard and then go 7 days back. So to some extent, I think traditional ML is easier to get around, whereas the foundation model is more comprehensive.
But the other thing that we certainly have long done and continue to do is a layered approach. So there's not just a single set of defenses. There's a set of model defenses. There's a set of rule-based defenses. There's the soft blocks I mentioned. There's the user's own defense set, which can also vary—all the way to how they treat you at sign up or how they block bots at sign up.
And so fortunately for us, unfortunately for the fraudsters, they're not fighting against one model. They're fighting against a whole system that is, I guess, until I said it, opaque to them.
Nathan Labenz: Yeah. Why don't you tell us about any other big use cases? You mentioned auth, fraud, disputes. There's some interesting Talk to Your Data product experiences in Stripe. What stands out to you as the most interesting applications? Not even necessarily from a "what moved the most money" perspective, but what would be most interesting to the AI engineers audience in terms of just interesting implementation details or surprises, quirky stuff that you've learned along the way?
Emily Sands: Yeah. We've talked a lot about transaction-level understanding and the path there—if you think about modality, it's mostly payments plus text. You have these structured payment signals with language using contrastive learning, and you align the two, and you've got the text decoder and whatever else.
But payments plus text is only the start. The system's actually designed so that new modalities are just considered tools that the router on top can invoke. So if you wanted to add another encoder—maybe for financial time series, which I'm very interested in, but I don't have anything yet that I could share, or for images—it doesn't require a whole rewriting of the system. It's just a modular expansion.
And I think the multimodality—I'm starting to see it really shine at the merchant level. So actually, yesterday, I was testing two lightweight agents. Neither of these is in production, so just full disclosure. But the team has them in shadow. One crawls merchant sites to assess fraud, and it's relentless, incredibly relentless. The other spots counterfeit products, and it does so literally orders of magnitude better than the trained human reviewers we have at Stripe doing the same thing.
So it'll find—there's a print shop and there's thousands of items in the print shop, and the agent will patiently zero in on the one Spider-Gwen sticker that was an example I was staring at yesterday with no sign of official licensing. Ruh-roh.
Then it also knows, oh, this other site, the Canada Goose that's marked as "with tags" is secondhand, and so it's actually fair game.
So I think that kind of multimodal roadmap is interesting. Not for multimodal in and of itself, not for the technology in and of itself, but for where it's going to unlock real value.
Nathan Labenz: Cool. On the Talk to Your Data thing in particular, that's something that I think a lot of people have tried to do for themselves or they've tried to use a product to do it. It strikes me that where most people have kind of gotten stuck there is like, well, I was able to get GPT whatever or Claude whatever to be pretty good, but it still made some mistakes. And I didn't really feel like I could confidently give somebody who wasn't a proper data analyst this tool and be confident that they would get good insights out of it.
So you guys have that problem at maybe the biggest scale in the world. How did you think about what is the right threshold of accuracy for a Talk to Your Data model? I assume you didn't achieve 100% accuracy on this sort of thing. But what was the threshold that you felt you had to get to, and what was needed to keep dialing in until you actually got over that threshold to where you could deploy?

Emily Sands: One of the reasons that this Talk to Your Data was interesting to us in Stripe's context is a lot of what a business wants to know is captured in Stripe data. Who's selling what, for how much, to whom, who's retaining and churning their subscriptions, etcetera. So that's thing one.
Then thing two is the data is actually very well structured because it has to be. It's generated from the transactions that are flowing through Stripe that are incredibly robust and well documented, and the schemas downstream of that make sense and are well documented as well.
A lot of this Talk to Your Data stuff, it's the garbage in garbage out problem where my tables aren't well labeled, my fields aren't well labeled, maybe the underlying data actually isn't deduped. And so you can't really tell if the issue was that text to SQL or if it was actually the underlying data was bad or the data structure was not understandable. So we kind of were able to leapfrog that, which is great.
But still, LLMs do make mistakes. And so our approach there is actually, if we have reasonable confidence, we'll provide it. But—I don't know if you've ever used it—we overlay on top a natural language explanation of what we're doing. So: we thought you wanted to know whatever. You asked, "how did Black Friday this year compare to Black Friday the last 2 years?" And then we'll say, okay, these are the dates we used for Black Friday. These are the timestamps we used. Because by the way, most things on Stripe happen in UTC, and many people aren't reasoning about their business only in UTC. We looked at Black Friday over the last 3 years. Here's how we computed the percentage growth.
You're like, that's boring. Doesn't everyone compute the percentage growth the same way? But that actually allows someone who's not a data analyst to build comfort in the output versus saying, either YOLO-ing it, just taking it and running with it, or throwing their hands up and saying, I can't trust anything because I don't know what's happening under the hood. Like, you just wrote a SQL query for me, but I have no idea how to interpret it.
So when I think about Talk to Your Data, it's like: is your data interesting to talk to? If yes, make sure it is well structured, well documented. And if it's not, invest in that before you invest in the natural language interface on top. And then just make sure the LLM is explaining what it's doing, which they're of course very good at doing now. And that allows you to open the aperture a bit in terms of less certain questions you're willing to answer because you know anyone can read the natural language and make a call on whether or not that was the right approach.
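The explanation overlay Emily describes, spelling out date and timezone assumptions next to the answer, could look like the sketch below. The Black Friday date rule (day after the fourth Thursday of November) is standard; everything else is illustrative:

```python
from datetime import date, timedelta

def black_friday(year: int) -> date:
    """Black Friday = the day after the fourth Thursday of November."""
    d = date(year, 11, 1)
    # weekday(): Monday=0 ... Thursday=3; step forward to the first Thursday.
    first_thursday = d + timedelta(days=(3 - d.weekday()) % 7)
    return first_thursday + timedelta(days=21 + 1)  # +3 weeks, +1 day

def explain_black_friday_query(latest_year: int, tz: str = "UTC"):
    """Sketch of the explanation overlay: alongside the computed answer,
    spell out the exact dates and timezone assumption so a non-analyst
    can sanity-check what the query actually did."""
    days = [black_friday(y) for y in range(latest_year - 2, latest_year + 1)]
    explanation = (
        f"We compared Black Friday across {days[0].year}-{days[-1].year}, "
        f"using these dates: {', '.join(d.isoformat() for d in days)}. "
        f"Timestamps were interpreted in {tz}."
    )
    return days, explanation

days, explanation = explain_black_friday_query(2024)
print(explanation)
```

The value is not the arithmetic, it's that surfacing the chosen dates and the UTC assumption lets a non-analyst catch exactly the kind of silent mismatch (their local timezone vs. UTC) that would otherwise erode trust.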
Nathan Labenz: For folks who want to do a double click on the process of getting the data into shape, the episode with the CEO of Illumix was really good on that. And just for what it's worth for you, they have built basically canonical structures of enterprises across a bunch of different categories, like a drug company, for example. They've kind of built out a vast representation of data that in their studied opinion represents the canonical drug company. And then when an actual drug company comes to them, they do this painstaking process of mapping all of their actual data with all of its idiosyncrasies onto the canonical version that they've kind of made work well, and that mapping becomes the kind of cleanup process that gets them the reliability that customers obviously ultimately want. Pretty interesting.
Emily Sands: Well, I was just going to say there's also kind of an interesting feedback loop here with users. So if you're a usage-based billing company, the types of metrics you want to know to reason about your business or to share with your investors are generally very similar to the types of questions that all the other usage-based billing AI startups also want to know.
And so that both means that we can make the subset of questions that matter in a given domain really great. But also—forget the natural language to SQL interface or Talk to Your Data—we can just push those commonly asked questions over onto the dashboard and even benchmark you. We have smart benchmarking now. Benchmark you on those metrics versus a peer group.
And by the way, that smart benchmarking is one of the applications of the merchant intelligence service, which is: figure out which websites are like this website in terms of good comps because they have similar user bases and are at a similar stage of their development.
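A minimal sketch of that peer-group idea: given embedding vectors for merchants (however they were produced), rank candidates by cosine similarity to the target and take the closest ones as comps. All merchant names and numbers here are invented; Stripe's actual merchant intelligence service is of course far richer than this.

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def peer_group(target, candidates, k=2):
    """Return the k candidate merchants whose embeddings are closest to the target."""
    ranked = sorted(candidates.items(),
                    key=lambda kv: cosine(target, kv[1]), reverse=True)
    return [name for name, _ in ranked[:k]]

# Toy embeddings; in practice these would come from a learned merchant model.
merchants = {
    "saas_a": [0.9, 0.1, 0.0],
    "saas_b": [0.8, 0.2, 0.1],
    "retail_c": [0.1, 0.9, 0.3],
}
print(peer_group([0.85, 0.15, 0.05], merchants))  # the two SaaS-like merchants
```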
Nathan Labenz: Yeah. Cool. There's a good pattern there as well for sure. I've been thinking about that in the context of agents lately. And there's kind of the choose your own adventure agent where you give it a bunch of tools, here's some MCPs, whatever, have at it. And then there's the—sometimes better described as a workflow—maybe with a couple forking decision points that people also call agents in many cases.
And I'm starting to see the emerging pattern be like, have that choose your own adventure agent sort of at the top level of user interaction, but then in terms of the things that it's choosing, make those actually pretty detailed workflows in a lot of cases where you know that as long as it makes the right choice at a high level, the process that's going to be kicked off is one that you've really deeply understood, dialed in for accuracy, confirmed for yourself is going to work reliably.
So I think that's another—you're kind of talking about a push model instead of a pull, but nevertheless, there's an isomorphism, I think, between those structures.
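That routing pattern can be sketched in a few lines: the top-level agent's only free decision is which workflow to invoke, and each workflow is a fixed, well-understood process. The workflow names are hypothetical, and the chooser is a stub standing in for an LLM call.

```python
# Each workflow is a dialed-in, deterministic process, not an open-ended agent.
WORKFLOWS = {
    "issue_refund": lambda req: f"refund issued for order {req['order_id']}",
    "monthly_report": lambda req: f"report generated for {req['account']}",
}

def route(request, choose=None):
    """Top-level 'choose your own adventure' agent: picks a workflow name.
    `choose` is any callable request -> name; stubbed so the sketch runs offline."""
    choose = choose or (lambda r: "issue_refund" if "refund" in r["text"]
                        else "monthly_report")
    name = choose(request)
    # The only free decision was *which* workflow; execution is locked down.
    return WORKFLOWS[name](request)

print(route({"text": "please refund me", "order_id": "ord_7", "account": "acct_1"}))
# refund issued for order ord_7
```

As long as the router makes the right high-level choice, everything downstream has been dialed in for reliability ahead of time.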
Explainability is obviously huge. One thing I was interested in asking is, are you doing any mechanistic interpretability? Are there sparse autoencoder type things now happening on the foundation model so that you can learn in a semantic way what new features the thing is learning?
Emily Sands: Yeah. Not literally. So we're not dissecting individual neurons in the way some research groups are. I think our focus is really on making the outputs self-explaining, in a way that we, and our users where it's user facing, can actually trust.
Mechanistic interpretability is really important when you're releasing the full open-ended model into the wild. In our case, we control both application and the environment, and so our priority is that really practical explainability that's tailored to the payments application.
So when the foundation model flags a transaction, it doesn't just say "high risk." It says, you know, gibberish email and enumerated name pattern and device concentration. And that's actually the layer of explanation that lets the fraud analyst or even another system, like a follow-on agent, act confidently.
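As an illustration of that reason-tag layer, here is a toy rule set that maps raw transaction signals to the kinds of tags Emily mentions. The thresholds and rules are invented for the sketch; in Stripe's actual system the signals come from the foundation model, not hand-written rules like these.

```python
import re

def explain_flags(txn):
    """Return human-readable reason tags for a flagged transaction (toy rules)."""
    reasons = []
    local = txn["email"].split("@")[0]
    # "Gibberish email": a long local part with no vowels reads as machine-generated.
    if len(local) >= 8 and not re.search(r"[aeiou]", local):
        reasons.append("gibberish_email")
    # "Enumerated name": trailing digits suggest name1, name2, ... account farming.
    if re.search(r"\d+$", txn["name"]):
        reasons.append("enumerated_name")
    # "Device concentration": many recent transactions from one device fingerprint.
    if txn["txns_from_device_last_hour"] >= 5:
        reasons.append("device_concentration")
    return reasons

txn = {"email": "xkqzrtwp@example.com", "name": "user42",
       "txns_from_device_last_hour": 7}
print(explain_flags(txn))
```

A fraud analyst, or a follow-on agent, can act on a list like that far more confidently than on a bare "high risk" score.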
And then, as we were talking about a little bit ago, in many of those cases there's actually no ground truth for that explainer, and that comes up more and more as we expand to new domains. Like, oh, we're actually detecting fraud further up your customer funnel, all the way back when someone's creating an account with you, well before they're entering credit card details.
And so that's where things like the LLM-as-judge framework are really helpful. It'll look at that and the cluster of similar events and the tag definitions and then output how confident is the LLM, basically judging how confident is it that the cluster really matches the label.
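A bare-bones version of that LLM-as-judge idea: build a prompt from the tag definition and the clustered events, then ask a model for a confidence score. The model call is stubbed with a lambda so the sketch runs offline; any real judge would be an API call in its place.

```python
def judge_cluster(events, tag_definition, llm=None):
    """LLM-as-judge sketch: score how confident a model is that a cluster of
    similar events really matches a tag definition (useful when there is no
    ground-truth label to check against)."""
    prompt = (
        f"Tag definition: {tag_definition}\n"
        "Events:\n" + "\n".join(f"- {e}" for e in events) + "\n"
        "On a 0-1 scale, how confident are you that every event matches the tag?"
    )
    # `llm` is any callable prompt -> float; stubbed here so the sketch runs offline.
    llm = llm or (lambda p: 0.9 if "refund" in p else 0.2)
    return llm(prompt)

score = judge_cluster(
    ["user requested refund twice in 10 minutes", "refund issued, then chargeback"],
    "refund-abuse pattern",
)
print(score)  # 0.9 with the offline stub
```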
So no, we are not peering in neuron by neuron, but we are focused on interpretability at the output level, and that's really valuable for us. And then, for example, you're in your dashboard and you're seeing a bunch of suspicious users, you want to know exactly why we flagged them as suspicious so you can decide how to action. And so in our setting, that's really what matters most.

Nathan Labenz: Gotcha. Okay. You mentioned usage-based billing. And this led me to a very practical question around how you would recommend people build on top of Stripe today.
Ten years ago or so when I first became a Stripe customer, it was already a respected company, but not such a foundational part of the economy as it's become. So we were like, well, we don't want to switch off of this one day, or who knows, whatever. We'll have kind of our own database of all the transactions and all that kind of stuff. And Stripe will of course have their view of it, but we'll maintain our view. That was a lot of work then.
With usage-based billing, it sounds like an even more challenging project now, especially for your proverbial couple of people that are doing a hackathon and want to kind of get something started.
Do you recommend that people—so the alternative I have in mind, which I wonder ultimately if you recommend is: could I just leave all of that to Stripe and just basically make nothing but API calls, trust Stripe to be real-time ground truth across the board and not even have a financial side to my database, but just purely do that through real-time API calls?
Emily Sands: Totally. Don't even have it. Stripe's APIs run at six nines of uptime. So they are safe to use as your system of record for very critical flows. Our usage-based billing APIs process 100,000 events per second, and they've got all the built-in monitoring and alerting and invoicing.
The alternative is also pretty painful. If you're going to build your own mirror of Stripe's data, it's pretty complex. You've got to sync across all the events. You've got to build your own monitoring systems. You've got to keep everything reconciled. And especially if we're talking about a startup, that's just a lot of work that doesn't create any kind of differentiated value.
A lot of these companies that are taking off have 5, 10, 20 employees. They shouldn't be spending an ounce of that limited capacity on this stuff.
And then on the flip side, if you treat Stripe as your source of truth, you get real-time signals that you can actually act on in the Stripe ecosystem—like the billing threshold has been exceeded or whatever—without having to have this whole parallel system.
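The "no mirror" pattern Emily is advocating reduces to this: query the processor at decision time instead of reconciling a local copy. The client interface below is hypothetical, a stand-in for a real SDK call, with a stub so the sketch is self-contained and runs offline.

```python
def can_provision(client, customer_id, limit_cents):
    """Gate a feature on live usage pulled straight from the billing API,
    rather than from a locally mirrored (and possibly stale) database."""
    usage = client.get_period_usage(customer_id)  # hypothetical client method
    return usage < limit_cents

class StubClient:
    """Offline stand-in for a real billing API client (e.g. an SDK wrapper)."""
    def get_period_usage(self, customer_id):
        return 42_00  # $42.00 of metered usage this billing period

client = StubClient()
print(can_provision(client, "cus_123", limit_cents=100_00))  # True: under the $100 cap
```

The trade-off Emily describes is visible even in this toy: there is no sync job, no reconciliation, and no second copy of the data to drift out of date.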
And we talked about the Sigma system earlier, but with products like Sigma and Stripe Data Pipeline, you can still run all your analytics and all your reporting without doing the job of building your own warehouse.
You might be wondering about the downsides. Historically, the biggest downside of not mirroring, of just leaning into Stripe, was: okay, but what about when I want to join Stripe data to my own business objects? And now you actually can. You can extend Stripe's objects with metadata. So a lot of users will attach their own order ID or shipment ID to an invoice. And for most companies, that closes most of the gap.
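That metadata join might look like this in practice: attach your own order ID when creating the invoice, then join processor objects back to local business objects by that key, with no mirrored billing database required. The object shapes below are simplified dict stand-ins for real API objects.

```python
def join_invoices_to_orders(invoices, orders_by_id):
    """Join processor invoices to local business objects via metadata.

    Stripe objects accept free-form `metadata`; attaching your own order ID at
    creation time lets you join later without mirroring the billing data."""
    joined = []
    for inv in invoices:
        order_id = inv.get("metadata", {}).get("order_id")
        joined.append({**inv, "order": orders_by_id.get(order_id)})
    return joined

# Simplified stand-ins: an invoice carrying our order ID, and our local order.
invoices = [{"id": "in_1", "amount_due": 5000, "metadata": {"order_id": "ord_9"}}]
orders = {"ord_9": {"id": "ord_9", "shipment": "shp_4"}}
print(join_invoices_to_orders(invoices, orders)[0]["order"]["shipment"])  # shp_4
```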
Now it's different if you're a large enterprise and you've got a whole bunch of other follow-on systems that are off Stripe and data sources that are off Stripe. But for most startups, it's both simpler and just a lot safer to let Stripe be the system of record.
Nathan Labenz: Yeah. Cool. I imagine some companies have gotten pretty big in revenue terms over the last however many months while still doing just that. I don't know if you would want to highlight any by name or if that's too secret. But when I see the curves from folks like Lovable, Bolt, Replit recently, obviously things like Cursor, one starts to wonder about the head counts that they have. I bet a lot of them are probably doing exactly that and just kind of trusting, which six nines gives you pretty good reason to trust.
Emily Sands: Yeah. And that's just for the system of record. Lovable is a great example. They hit $100 million in ARR in their first 8 months, and their stack is basically a case study in all-in on Stripe.
So they incorporated the business—before they monetized, they incorporated the business with Stripe Atlas. They, from the very beginning, used our optimized checkout suite. So the front-end, customer-facing service is our optimized checkout suite, which allowed them to localize payments in over 100 countries and get something like 150 payment methods out of the box.
They leaned on billing for subscriptions, so they didn't build their own billing system or have to contract with another third party. They leaned on Link, which is our one-click consumer checkout, for fast checkout. By the way, nerds love to buy from nerds. So the concentration of AI buyers on Link is very, very high.
They leaned on Radar for fraud prevention. They leaned on Sigma for analytics. And so really, Stripe took care of the financial plumbing so that Lovable is just really focused with that small team on product and growth, which of course they nailed.
But there's smaller ones too. Retell AI—have you used them? They build...
Nathan Labenz: Call agents.
Emily Sands: Yeah. For customer support. And so they launched last year. I think they have over $10 million in ARR in their first year.
We mentioned Link concentration. Link actually powers 38% of their payments. So 38% of their payments run through our consumer network where the individual has an identity and their payment methods are saved on file, and it's literally a one-click checkout for them.
They use us for smart retries. So when the transaction, usually a recurring bill, fails, we retry it at the optimal time, which allows them to recover about 60% of their failed charges.
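To make the retry idea concrete, here is a toy scheduler that picks a retry time from the decline reason. The delays are invented heuristics for the sketch; Stripe's actual smart retries choose timings with a model trained on network-wide outcomes, not fixed rules like these.

```python
from datetime import datetime, timedelta

# Hypothetical heuristics, invented for illustration only.
RETRY_DELAYS = {
    "insufficient_funds": timedelta(days=3),   # wait for a likely payday
    "do_not_honor": timedelta(hours=6),        # transient issuer declines often clear fast
    "expired_card": None,                      # retrying can't help; ask for a new card
}

def next_retry(failed_at, decline_code):
    """Return when to retry a failed charge, or None if a retry is pointless."""
    delay = RETRY_DELAYS.get(decline_code, timedelta(days=1))
    return failed_at + delay if delay is not None else None

t0 = datetime(2025, 1, 1, 12, 0)
print(next_retry(t0, "insufficient_funds"))  # 2025-01-04 12:00:00
print(next_retry(t0, "expired_card"))        # None
```

The interesting part is the None branch: knowing when not to retry is as valuable as timing the retries that can succeed.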
They use us for Stripe Tax, which keeps them compliant in 100 countries.
And so it's just a great example of how these AI companies—very lean teams, growing fast, going global—are really able to scale up as if a much bigger company were right behind them.
Nathan Labenz: We don't have too much more time, so just to hit on a couple last topics. I've heard you talk a couple times about the time you spend getting new clothes for your kids. That's mostly something, in all honesty, my wife does in our home.
Emily Sands: Lucky you.
Nathan Labenz: Yes. I would flatter myself that I do my share in other ways, but she's definitely better suited to pick out what will make the kids look cute.
What's interesting here? Folks who listen to this podcast will sort of know the basics, that Perplexity has a shopping thing and whatever, and we know what MCPs are. Are there any recent developments? Is this really happening? Or is it still, from what you've seen, kind of the "wouldn't it be cool if one day this were real" phase of agentic commerce?

Emily Sands: It's kind of both. It's definitely still early. There's still a ton we're sorting out about how this is actually going to work and how quickly it's going to take off. But we're seeing meaningful traction.
You mentioned Perplexity. You can discover and book hotels directly inside the app. But it's not just big guys like Perplexity. Hipcamp is a little site that uses agents with virtual cards to book campsites off platform. I'm from Montana. It's impossible to get into Yellowstone National Park. I hate using their website, although I love the park and value that they're not spending a ton investing in tech. But Hipcamp is actually solving that.
And then I think it's easy when people think about agentic commerce to think about commerce—buying kids clothes—consumer. But on the developer side, we're seeing the same trend. Developers now in Cursor can buy Vercel services right inside their editor. That's a brand new channel—really embedded commerce directly in the workflow, and Stripe powers those transactions too.
So we're not totally new to this. We launched our agent toolkit last November, and it still gets thousands of downloads each week. And I think just looking at the pace of adoption and looking at who's testing, agentic commerce will be a major channel far sooner than most people think.
Nathan Labenz: With that embedded stuff, like Vercel in Cursor, it seems like that's much more about the connective tissue of the user has an intent and it's a question of how it's going to get executed, as opposed to any sort of autonomous decision-making by the agent or any sort of meaningful delegated discretion to the agent. Have you seen anything that is really interesting in the "I'm going to actually trust you to go figure out what to buy and execute on it" at this stage, or is that still not really materialized?
Emily Sands: Yeah. So I can't name names, but this idea of a business in a box. I want to build this business, and I actually don't know what third party tools and services I need. I just want the business in a box and go spin up the business. And that's not just the payment provider or the front-end service or the bot protection or the HR system. But give me my whole business in a box, I think, could be an interesting direction.
Now getting that right for the whole world of businesses that might be created is hard. Getting that right for a pretty focused AI wave that's coming online isn't a crazy thing to think about.
So I agree with you that the option set in consumer is broader, and so there's more of a job to be done for the agent to select from that very broad option set. But SaaS procurement is also very inefficient. Maybe we underestimate how inefficient it is, and that's not just in the selection of vendors. That's also in the pricing and negotiation with vendors.
And so I don't think it'll be tomorrow, but I think there will be a there there.
Nathan Labenz: Cool. I'll keep watching out for that.
Last question. Just about the future of platforms, the future of scale, the future of market power. I think back often to the Anthropic deck from two years ago where the claim was made: we believe that the companies that train the best models in 2025, 2026 may have such an advantage that nobody will be able to catch them from there. Why? Because presumably the models will help train their successor with all these data filtering and synthetic and constitutional AI and whatever. And once you've got Claude 4 contributing to the training of Claude 5, anybody who doesn't have Claude 4 and is still sourcing everything through Scale AI or whatever is just at a massive disadvantage.
It seems like that basically applies to Stripe as well. Is there any hope for anybody to ever compete with Stripe given the 1.3% of global GDP flowing through the system and the massive data advantage that already exists? Or are we now sort of in a future where we just need to rely on the Collison brothers to continue to be good actors? It seems like this position is almost unassailable.
Emily Sands: I think financial services is a big broad space, and there are a lot of services that one can provide in that space. In the context of data, the $1.3 trillion a year is a lot. And volumes are growing like 38% year over year. That's a massive growing dataset.
The real advantage, I think, isn't the raw size, though. To your comment earlier, it's more the compounding loop. And in our context, that loop is: the more data we process, the better our models get. The better our models get, the more value we deliver to businesses. Incentives are super aligned. The more value we deliver to businesses, the more the businesses grow, which means the more transactions they run through Stripe, and that loop compounds year after year.
And that's why we talked earlier about why it's hard to make horizontal bets, but it's also why we can make horizontal bets. It's not just because we have scale; it's because we're in a position to harness that scale to create even better products, which then feeds the loop.
And we're pushing this further. You may have heard at Sessions last year, we announced a big push for modularity. And so now products like Radar, our fraud prevention product, or our billing product, or that optimized checkout suite for your users are available multi-processor. So they don't just work on Stripe transactions. They work on transactions or on billing plans or on checkouts that are happening outside of Stripe too.
And that actually gives us a window into an even bigger data network and kind of further reinforces that loop.
So I think there's a lot to be done in the financial infrastructure space, and I think there will be plenty of players playing important roles there. But I think we are quite differentiated in the intelligence that we can serve to users, and it's just really fun to see how that intelligence in turn helps them grow more profitably.

Nathan Labenz: On the other end of that, do you ever think about trying to compete at the foundation model level? This is something that obviously not many companies are really able to do. But given the depth of ML experience and the unique dataset that does exist and just the reputation of the company, I sort of expect that if there was a special fundraising round to raise $10 billion to go train a Stripe 1 to try to compete with Claude 5 and GPT whatever, the money would be there. Do you ever think about going that hard? Or how do you think about calibrating how ambitious to be with the AI investments?
Emily Sands: Yeah. I mean, Stripe has always leaned into new technology waves. Back when we were founded, it was the platforms and marketplaces wave that got us a lot of the way here. Today, it's the AI wave, and our mission is to build the economic infrastructure for AI.
That shows up today in four big bets. First is being the best partner for AI companies. So just helping them monetize effectively and scale globally and manage billing and manage tax and manage fraud. Two thirds of the Forbes AI 50 already run on Stripe, and we're very focused on co-building, whether it's usage-based billing or whatever the next wave is, co-building with them and being the best partner.
The second is enabling agentic commerce. We only talked about it briefly, but agents are going to be buying on your behalf, and we want that to work really well for the whole ecosystem. Yes, for the consumer. Yes, for the seller. And yes, for the platform or commerce facilitator.
The third place we're really focused, in the world of economic infrastructure for AI, is making Stripe native inside the AI-enabled tools that developers already use, whether that's Vercel or Replit or Cursor or Mistral's Le Chat. Payments should show up right where the work is happening. So that's thing three.
And then fourth is what we talked about today, which is deploying our foundation model across the network to improve fraud detection, yes, to boost authorization rates, yes, but also expanding the intelligence layer that we provide to every user.
So those are the four big investments. I'm not going to say that there could never be a fifth, but today, we're really hyper-focused on the economic infrastructure for AI, not being an AI model shop directly.
Nathan Labenz: Gotcha. Cool. This has been excellent. I really appreciate the time and the depth. Anything we didn't touch on that you would want to leave people with or just any concluding thoughts?
Emily Sands: Nope. Super fun. Thanks so much for having me.
Nathan Labenz: Emily Sands, head of data and AI at Stripe. Thank you for being part of the Cognitive Revolution.
Emily Sands: Thanks so much.
Nathan Labenz: If you're finding value in the show, we'd appreciate it if you'd take a moment to share it with friends, post online, write a review on Apple Podcasts or Spotify, or just leave us a comment on YouTube. Of course, we always welcome your feedback, guest and topic suggestions, and sponsorship inquiries, either via our website, cognitiverevolution.ai, or by DMing me on your favorite social network.
The Cognitive Revolution is part of the Turpentine Network, a network of podcasts where experts talk technology, business, economics, geopolitics, culture, and more, which is now a part of a16z. We're produced by AI Podcasting. If you're looking for podcast production help for everything from the moment you stop recording to the moment your audience starts listening, check them out and see my endorsement at aipodcast.ing.
And finally, I encourage you to take a moment to check out our new and improved show notes, which were created automatically by Notion's AI Meeting Notes. AI Meeting Notes captures every detail and breaks down complex concepts so no idea gets lost. And because AI Meeting Notes lives right in Notion, everything you capture—whether that's meetings, podcasts, interviews, or conversations—lives exactly where you plan, build, and get things done. No switching, no slowdown. Check out Notion's AI Meeting Notes if you want perfect notes that write themselves. And head to the link in our show notes to try Notion's AI Meeting Notes free for 30 days.