Mathematical Superintelligence: Harmonic's Vlad & Tudor on IMO Gold & Theories of Everything
Vlad Tenev and Tudor Achim discuss Aristotle, Harmonic's AI that solves Olympiad-level math with formally verified Lean proofs, detailing its architecture and how verifiable reasoning could improve safety and practice in software, mathematics, and future superintelligent systems.
Watch Episode Here
Listen to Episode Here
Show Notes
Vlad Tenev and Tudor Achim from Harmonic explain how they built Aristotle, an AI system that reaches International Mathematical Olympiad gold-medal performance using formally verified Lean proofs. They unpack the architecture behind mathematical superintelligence, including Monte Carlo Tree Search, lemma guessing, and specialized geometry modules. The conversation explores how verifiable reasoning could harden mission-critical software, reshape mathematical practice, and lead to trustworthy superintelligent systems by 2030.
Use the Granola Recipe Nathan relies on to identify blind spots across conversations, AI research, and decisions: https://bit.ly/granolablindspot
Sponsors:
Claude:
Claude is the AI collaborator that understands your entire workflow, from drafting and research to coding and complex problem-solving. Start tackling bigger problems with Claude and unlock Claude Pro’s full capabilities at https://claude.ai/tcr
Framer:
Framer is an enterprise-grade website builder that lets business teams design, launch, and optimize their .com with AI-powered wireframing, real-time collaboration, and built-in analytics. Start building for free and get 30% off a Framer Pro annual plan at https://framer.com/cognitive
Blitzy:
Blitzy is the autonomous code generation platform that ingests millions of lines of code to accelerate enterprise software development by up to 5x with premium, spec-driven output. Schedule a strategy session with their AI solutions consultants at https://blitzy.com
Tasklet:
Tasklet is an AI agent that automates your work 24/7; just describe what you want in plain English and it gets the job done. Try it for free and use code COGREV for 50% off your first month at https://tasklet.ai
CHAPTERS:
(00:00) About the Episode
(04:58) Math as reasoning (Part 1)
(15:22) Sponsors: Claude | Framer
(18:51) Math as reasoning (Part 2)
(18:51) Inside the Lean language
(27:51) Lean intuition and MathLib (Part 1)
(34:08) Sponsors: Blitzy | Tasklet
(37:08) Lean intuition and MathLib (Part 2)
(38:47) Inside Aristotle's architecture
(48:33) Scope, boundaries, and applications
(54:37) Training, taste, and interpretability
(01:08:18) Formal math and software
(01:16:50) Limits, entropy, and roadmap
(01:25:24) 2030 vision and safety
(01:33:38) Outro
PRODUCED BY:
SOCIAL LINKS:
Website: https://www.cognitiverevolution.ai
Twitter (Podcast): https://x.com/cogrev_podcast
Twitter (Nathan): https://x.com/labenz
LinkedIn: https://linkedin.com/in/nathanlabenz/
Youtube: https://youtube.com/@CognitiveRevolutionPodcast
Spotify: https://open.spotify.com/show/6yHyok3M3BjqzR0VB5MSyk
Full Transcript
(0:00) Nathan Labenz: Hello, and welcome back to the Cognitive Revolution. The presenting sponsor of today's episode is Granola. Regular listeners have heard me describe the "Blind Spot Finder" Recipe that I'm using to look back at my recent calls and help me identify angles and issues I might be neglecting. I love that concept, but it's also worth highlighting how Granola can help raise your team's level of execution by supporting follow-through on a day-to-day basis. This morning, for example, I had two very practical calls in which I committed to a number of things. In the past, there's a good chance I'd have forgotten at least a couple of the things I said I'd do. But with Granola, I can easily run a "TODO Finder" Recipe and get a comprehensive list of everything I owe my teammates. This is the sort of bread-and-butter use case that has driven Granola's growth and inspired investment from execution-obsessed CEOs, including past guests Guillermo Rauch of Vercel and Amjad Masad of Replit. See the link in our show notes to try my Blind Spot Finder Recipe and explore all the ways that Granola can make your raw meeting notes awesome. Now, today, my guests are Vlad Tenev and Tudor Achim, co-founders of Harmonic, an AI research lab dedicated to building Mathematical Superintelligence, and also the creators of Aristotle, an AI system that achieved gold medal level performance at the 2025 International Mathematical Olympiad. While OpenAI and Google DeepMind achieved similar performance by scaling reasoning in chain of thought, Harmonic stands out for their commitment to formally verifiable methods. Because it generates candidate proofs in Lean — a programming language that serves as a proof-checking assistant by using a trusted kernel to confirm that every single step of reasoning follows from explicit premises and accepted logical rules — Aristotle's work can be automatically validated, and its performance is in principle limited only by the scale of compute available for reinforcement learning. In an effort to better ground my own intuitions for mathematical superintelligence, we begin with a metaphysical discussion about the nature of math, what it is that mathematicians do, the assumptions that underpin a Lean verification, and how Lean is already revolutionizing the math world by eliminating the need for traditional peer review. From there, we turn to the Aristotle architecture that delivered IMO gold performance. It consists of a large transformer model that uses a Monte Carlo tree search strategy reminiscent of systems like AlphaGo to discover valid paths from point A to point B in mathematical reasoning space; a "lemma guessing" module that helps manage context and keep things on track by generating candidate waypoints between a given starting point and a potentially distant end goal; and a specialized geometry module modeled on DeepMind's AlphaGeometry. We also discuss the Aristotle API's "informal" mode, which attempts to auto-formalize whatever the user asks it to prove, and discuss what its responses to my admittedly silly requests — that it prove propositions like "All is Love" and "Epstein didn't kill himself" — imply about the boundary between statements that could in principle be mathematically proved, and those which are sufficiently factual or philosophical in nature so as to fall outside the scope of the system. 
In the final section, we discuss the role of entropy and the importance of taste to Harmonic's future plans; how the community is using Aristotle, sometimes on a standalone basis and sometimes in conjunction with other frontier models, to solve previously unsolved problems; how we might use systems like Aristotle and Lean to harden mission-critical infrastructure and improve complex systems across society; how Harmonic's emphasis on verifiable outputs could create a superintelligence we can trust even in the absence of mechanistic understanding; and what mathematical superintelligence might look like in 2030. On this last point, I have to say — with so many grandiose AI promises flying around these days, from "a country of geniuses in a data center" to "a century of progress in five years" to "curing all diseases in our natural lifetimes" — it is rare that I am genuinely taken aback by a company's vision for the future. And yet, as you'll hear, Tudor did manage to leave me at least momentarily speechless when he described a future of theoretical abundance, in which all physical phenomena we observe have multiple competing coherent explanations that can only be separated by increasingly exotic experiments. If you're like me, you'll find this episode a useful opportunity to improve your intuition for the nature of math, an instructive preview of what's to come as RL continues to scale across the industry, and an inspiring challenge to keep thinking bigger and bolder about the nature and impact of superintelligence. With that, I hope you enjoy my conversation with Vlad Tenev and Tudor Achim, co-founders of Harmonic. Vlad Tenev and Tudor Achim, co-founders of Harmonic, makers of Aristotle, and winners with an asterisk of the IMO Gold in 2025. Welcome to the Cognitive Revolution.
(5:10) Vlad Tenev: Thanks for having us. Greetings and salutations.
(5:14) Nathan Labenz: Thank you. So this is going to be, I think, a fascinating conversation. It's probably going to be more metaphysical than most of our episodes, but there's also a lot of practicality, because what you guys are doing certainly has aspirations to go beyond the pursuit of mathematical superintelligence. Maybe just for starters, how do you understand what math is? That was something I was really wrestling with in preparing for this — it's obviously very metaphysical. To make it a little more practical, what would you say are the core cognitive skills that people who are good at math really develop and excel at? And how do those skills hold up when we look at the performance of the frontier large language models that all of our listeners are familiar with today?
(5:59) Tudor Achim: Yeah. Well, look, first, thanks for having us — it's really great to be here. When you ask what is math, what is it useful for, what are the core cognitive skills, it gets to one of the core theses of our company, which is that mathematics is reasoning. A lot of people think of mathematics as this really esoteric thing — maybe group theory, stuff you've seen in movies like Good Will Hunting. But mathematics, at its core, is the process by which humans understand the world by breaking their understanding down into small sequences of logical steps that other people can understand and verify for themselves. So when you're solving a physics problem, doing your taxes, or thinking about what happened at the beginning of the universe, ultimately you have to have an explanation that is self-consistent, that follows from other facts, and that your colleagues or other humans can check. And so when we talk about what it takes to be good at math, the question is: what does it take to be good at reasoning? That ability to break things down into steps. It turns out math is really useful for understanding the universe and building lots of engineering things, but ultimately it's just about reasoning.
(7:07) Nathan Labenz: I watched your podcast that you did with Sequoia maybe 16 months ago or so, and I recall Vlad's story of: basically, I thought that if I got good at math, then I'd probably be good at other things, and it sort of worked for me. That's one way, in a very practical sense, to unpack the idea that math is reasoning. It certainly seems to help people generalize to at least related domains and be really effective, for example, in entrepreneurship. But I'm not entirely clear still on whether you're making a more almost platonic claim. It seems like there's the very simple notion that I should teach my kid a lot of math because then they'll be smart generally — and again, that works for humans. But is there something that you see as a more fundamental law of the universe, some correspondence between what we are doing in math and what we are doing in these other domains? Because it doesn't seem like we have the same sort of verifiability in almost anything else. We do have it a little bit in computer science, but even in physics, we've got still very fundamental questions about whether the paradigm is even right, or what it would mean for it to be proven right. I don't think that stuff is at all agreed upon. Maybe you guys throw up your hands at this mystery too, or maybe you feel like you have an intuition for what the answer is.
(8:32) Vlad Tenev: Yeah, I can give you my perspective. I got into math through physics. When I first came to Stanford as an undergrad, I had read Brian Greene's The Elegant Universe, which was sort of like the first popular string theory book. And when I was a kid, one of the earliest memories, one of the first English books I read cover to cover, was A Brief History of Time by Hawking. So I've always been interested in the big questions, right? What happened before the Big Bang? How did the laws of physics come about? Is there just one law, one particle, one force that eventually, as the universe cooled and expanded, splintered into all the different forces we have today — gravity, electromagnetism, strong and weak force? Because back in the day, that was not obvious. We thought electricity was separate from magnetism, and it was probably one of the greatest achievements of science to figure out that these two are actually two sides of the same coin. And then the big question is: what's going on with gravity? Is it the same? And in the middle of all this, we found out that the weak force and the electromagnetic force were also splintered off of one electroweak force. So it kind of feels like there was just one thing at the beginning, and we have to understand what that thing is. And what I found when I became a physics major at Stanford and started asking all these questions was that they'd eventually send me over to the math department. They'd say: in order to understand string theory, you have to understand all of these other things. If you want to understand general relativity, you've got to get into differential geometry. And that's how I became a pure math major and ended up doing a PhD. The impetus was actually trying to understand the real world through physics. And if you think about the usefulness of physics — all of the big inventions that humanity has, the ones that really push us forward, like flight, rocketry, computers, transistors, GPS — they're physics things. So the real reason to do math is that math is interesting and beautiful, there's an art aspect to it, but it helps you understand physics, physics helps you understand engineering, and then you can create things that have huge value.
(11:31) Tudor Achim: You were asking how math works in other fields where things are not as precise. I think math shows up just a little more subtly than people think. There's this physicist Eugene Wigner who wrote a famous essay called "The Unreasonable Effectiveness of Mathematics in the Natural Sciences," which was commenting on a really interesting phenomenon. Vlad mentioned differential geometry and general relativity. It turns out that when Einstein was creating that theory, he relied on thought experiments from the nineteenth century around how to think about certain manifolds and their properties. And that was actually the key tool used to go beyond special relativity and develop general relativity. Those thought experiments in the nineteenth century were almost preposterous — it made no sense to think about them, because how could you possibly apply these concepts to the real three-dimensional world? And then it turns out they're very useful for understanding the four-dimensional world when you include time and curvature. There are myriad examples like this. If you consider number theory, for a long time it was seen as an incredibly esoteric branch of math with no practical implications. But people pushed on that theory for a long time, and then it turns out it's the key tool you need to create a secure digital economy. So now essentially all of human civilization has a digital economy based on this branch of math. I think it's almost the wrong question to ask: well, there's a lot of math out there, how is it useful? The point is you just do the math, and then eventually some of it — not all of it — will be more useful than you possibly could have imagined. The investment in math isn't just to build a really smart system; it's to create a lot of new math that we can then figure out ways to apply later.
(13:25) Vlad Tenev: One interesting thing the conversation reminded me of — when you first asked, okay, what is math? What does it look like? I think one of the reasons we got excited about applying AI to this domain is that there are lots of different things that mathematicians do. Some of them are very creative, almost like artists, and maybe they're not super prolific, but they come up with something new once every five to ten years, and that can be just an amazing accomplishment in the field — like a Grigori Perelman, for example. Others are just machines: they can read more papers and comprehend more papers per unit time than other people, and what they're doing is basically synthesizing all the knowledge, figuring out all the tricks, and applying those tricks quickly to new domains. We're very excited about the prospect of AI accelerating the former — we think that'll happen — but the latter is something that AI is already really, really good at today, and was good at to some degree when we got the idea for Harmonic. You look at GPT-4, which had just come out when we started, and it excelled at pulling information, doing these needle-in-a-haystack type things: can you quickly go through all the literature and pull things that might be relevant? And I would say, even if you can be an amazing mathematician in that category, a lot of the work could be accelerated if you just knew all the math that was being done and could pick out the relevant things to an unsolved problem at hand. So the problem itself lends itself really well to what AI is already good at.
(15:18) Nathan Labenz: Hey, we'll continue our interview in a moment after a word from our sponsors. One of the best pieces of advice I can give to anyone who wants to stay on top of AI capabilities is to develop your own personal private benchmarks — challenging but familiar tasks that allow you to quickly evaluate new models. For me, drafting the intro essays for this podcast has long been such a test. I give models a PDF containing 50 intro essays that I previously wrote, plus a transcript of the current episode and a simple prompt. And Claude has held the number one spot on my personal leaderboard for 99% of the days over the last couple of years, saving me countless hours. But as you've probably heard, Claude is the AI for minds that don't stop at good enough. It's the collaborator that actually understands your entire workflow and thinks with you. Whether you're debugging code at midnight or strategizing your next business move, Claude extends your thinking to tackle the problems that matter. And with Claude Code, I'm now taking writing support to a whole new level. Claude has coded up its own tools to export, store, and index the last five years of my digital history from the podcast and from sources including Gmail, Slack, and iMessage. The result is that I can now ask Claude to draft just about anything for me. For the recent live show, I gave it 20 names of possible guests and asked it to conduct research and write outlines of questions. Based on those, I asked it to draft a dozen personalized email invitations. And to promote the show, I asked it to draft a thread in my style featuring prominent tweets from the six guests that booked a slot. I do rewrite Claude's drafts — not because they're bad, but because it's important to me to be able to fully stand behind everything I publish. But still, this process, which took just a couple of prompts once I had the initial setup complete, easily saved me a full day's worth of tedious information-gathering work and allowed me to focus on understanding our guests' recent contributions and preparing for a meaningful conversation. Truly amazing stuff. Are you ready to tackle bigger problems? Get started with Claude today at claude.ai/tcr. That's claude.ai/tcr. And check out Claude Pro, which includes access to all of the features mentioned in today's episode. Once more, that's claude.ai/tcr. AI agents may be revolutionizing software development, but most product teams are still nowhere near clearing their backlogs. Until that changes, designers and marketers need a way to move at the pace of the market without waiting for engineers. That's where Framer comes in. Framer is an enterprise-grade website builder that works like your team's favorite design tool, giving business teams full ownership of your .com. With Framer's AI wireframer and AI Workshop features, anyone can create page scaffolding and custom components without code in seconds. And with real-time collaboration, a robust CMS with everything you need for SEO, built-in analytics and A/B testing, 99.99% uptime guarantees, and the ability to publish changes with a single click, it's no wonder that speed-, design-, and data-obsessed companies like Perplexity, Miro, and Mixpanel run their websites on Framer. Learn how you can get more from your .com from a Framer specialist, or get started building for free today at framer.com/cognitive and get 30% off a Framer Pro annual plan. That's framer.com/cognitive for 30% off. Rules and restrictions may apply. Okay. That's quite helpful. 
I think coming into this, I had focused my own mind on sort of two modes of math. One being the kind of Einstein-like eureka moment — obviously a high-level example — of having some insight that this highly abstract and seemingly very esoteric formalism can actually unlock major understanding. That's amazing. And then there's also this sort of grind-it-out mode: I've got this thing I want to prove, and I'm going to stumble my way through the space of possible logical moves until I finally chart a path there. And then you're adding a third layer, which is problem selection in the first place, which I guess is pretty related to the Einstein thing but certainly distinct in some ways. Let's take a minute before we get into the Aristotle system — how it works, how you've trained it, all that — to talk about Lean. Lean is basically a programming language that does this very bit-by-bit logical maneuvering, right? You have certain assumptions coming in, you take various steps, and the goal is to get to a certain outcome. Give us a more intuitive understanding of what Lean is — I'd be keen to understand it on a practical level too. How many symbols are there? How many axioms are we starting with? How many rules can we apply? How big is the space that we're manipulating our way through?
(20:40) Tudor Achim: So Lean, in my view, is the best programming language ever created. In Lean, you can write any program you would write in Python or C or C++, but you can also express essentially any logical concept. If we're okay getting into a bit of the details: it is a dependently typed programming language, which means that at compile time, you can express very complicated properties of the program that you can check before ever running it. On one end of the spectrum, you have something like JavaScript where you can check basically nothing; at the other end, you have Lean. But the really cool thing is — you asked about axioms. When Aristotle produces any output, it's produced as annotated Lean code. There's programming-language Lean: we write theorems, we write programs, we prove things, and there are a lot of comments explaining to the reader what it's doing. When we talk about proving things, you end up relying on three axioms, in addition to the basic concept of the calculus of constructions, which is what the programming language is based on. Two of them are extremely technical: one is propositional extensionality, and one is something about quotient soundness. But the third one is the axiom of choice. To give a sense of what an axiom means — the axiom of choice is not saying anything controversial. It's saying: if you have a non-empty set, it's possible to choose an element from it. And from these three extremely basic axioms, it turns out you can build all of mathematics, all of computer science, all of mathematical modeling, physics, economics, stats, biology. It's all based on this core set of axioms. The goal of a system that outputs Lean is to find interesting statements and programs and then prove things that depend on these axioms — and that's really where the difficulty lies. Sometimes you have to make big logical leaps; sometimes you have to grind through a lot of math; but both are essential. You can't really skip one of those steps. But Lean itself is just incredible. You can express so many ideas in it, prove so many things, and use it as a programming language too. It's really up there for me in programming languages.
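To make the axiom talk concrete, here is a minimal Lean 4 sketch, assuming Mathlib is imported. The lemma `mul_self_nonneg` is an existing Mathlib result, and `#print axioms` asks Lean which axioms a finished proof ultimately depends on; for a typical classical proof the answer is the three axioms Tudor lists, which Lean names propext, Quot.sound, and Classical.choice (the exact list can vary from proof to proof).

```lean
import Mathlib

-- A small theorem: the square of any real number is non-negative.
-- `mul_self_nonneg` is a Mathlib lemma; the kernel checks that applying it
-- really does close the goal.
theorem square_nonneg (x : ℝ) : 0 ≤ x * x :=
  mul_self_nonneg x

-- Ask Lean which axioms this proof ultimately rests on. Expected output is
-- along the lines of:
-- 'square_nonneg' depends on axioms: [propext, Classical.choice, Quot.sound]
#print axioms square_nonneg
```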
(22:59) Vlad Tenev: I started playing with Lean when Tudor and I started making a plan for this business, and we had a pretty early decision about whether we wanted to go formal or informal. One thing that struck me about it is that as a former mathematician, I barely used the computer when I was doing math. I was in my PhD in the late 2000s, and the only time you'd really be using a computer doing math was when you wanted to type up your homework or your research paper. All the thinking would happen on a chalkboard or a whiteboard. All the collaboration would happen in person at conferences or on a chalkboard in one's office. For a while, it felt like mathematics would always be this pure thing, untouched by technology. But what Lean has done is transform mathematics from chalkboard and couch to now it's in VS Code, you can do it in Cursor, you're putting your math on GitHub where you can run large collaboration projects. Even when you subtract out AI, Lean by itself changed how people do mathematics. Now you're seeing extremely prolific, famous mathematicians running large projects where they're collaborating with dozens of people around the world, trying to formalize research or formalize the proof of Fermat's Last Theorem. More and more people are adopting Lean as an accelerant. It's changing how mathematics is being done — it accelerates collaboration, accelerates progress, and removes this notion of peer review. If you're a mathematician and you want to prove something, a big part is getting someone to read it and actually spend the time to tell you if it's correct. With Fermat's Last Theorem, what happened was a collection of people got together and when they all agreed the proof was complete, it was sort of ordained as proven. Lean makes that unnecessary. If the proof checks — with the caveat that there's no bug in the Lean kernel or how you've set up the statement — you obviate the need for manual human verification. The implications are pretty interesting: you have all these potential citizen mathematicians who can now, with AI, solve unsolved problems and don't need to get anyone at a PhD program or a leading institution interested in their problem in order to validate it. They just have to have the Lean certificate, and the proof is correct. If you think about journals — journals in math exist for this. The prestige of the review board tells you whether you should read something or trust it. The notion of trust is really changed fundamentally with tools like Lean.
(26:30) Tudor Achim: Yeah, and I think the open source software community really solved this problem a long time ago. If you go on GitHub, one can simply open a pull request on some repository, and if it passes the tests and the author of the repository agrees with your style, it gets merged. So now you've contributed. That element of trust is not so present — you can just run the tests. And when you talk about impact and prestige, you can look at the number of stars a repository has. If it's very popular, it gets forked a lot, it gets a lot of stars. You've disintermediated any gatekeeper. It's totally open source, there's no more trust required, and there's a measure of impact. Math is going to start going the same way. Previously, mathematicians relied on their social networks to figure out who tends to do the right thing, who tends to not make mistakes. But with Lean, you can have a big math project where anybody can come and contribute a proof, and if Lean accepts it, then it's right. If a lot of other mathematicians start to depend on that result, we're going to notice a lot of forks, a lot of dependency graphs, a lot of stars. You start to measure prestige that way. It would be very interesting if Lean is the one tool that allows you to go from cathedral-style development — very closed networks — to a more open, bazaar-style development where it's a kind of Wild West, but Lean is the computational certificate that everything is correct.
(27:51) Nathan Labenz: I wish I had a more intuitive sense for what exactly is going on with Lean. This is going to be hard, I think. But in doing my research, one thing that stands out is that the kernel is really small. In terms of what you need to trust, it's a pretty small amount of core code that has been thoroughly vetted many times by many people. I would still love to have a better sense, because when you mention the three axioms, for example, it's a little weird for people outside the field: there are two that are kind of bizarre and technical, and then there's one that's like, if you have a non-empty set, you can choose an element from it. And that seems like common sense — but why was that ever controversial? Is there a way to describe the space of legal moves in math or in Lean? I don't usually like analogies — I often try to set this up as an analogy-free zone. But because I and a lot of others are going to struggle with the very literal understanding, maybe this is a time for an exception. Is there a chess analogy or something where you could say, here are the pieces and here are the legal moves, to give people a better sense of what it actually means to move through these spaces?
(29:22) Tudor Achim: I think the chess example is perfect. A theorem in Lean is something like: given this starting configuration of a chessboard, it is possible to get to this configuration. A proof of this theorem would be listing the sequence of moves. And what the kernel is doing in Lean is saying, for every single move that you claim is valid, it's checking: does this rule exist in my rule book? So the theorem says you can get from A to B. The sequence of moves is, here's the sequence, and the kernel is just saying: this step is right, this step is right, this step is right. And now I've confirmed that I've ended up in the target state. Lean is doing that, but of course the individual steps are different — they're mathematical steps and they depend on one or more of these three axioms. The three axioms, although they're technical, are very short. If you write them down as mathematical statements, each of them is under a tweet in length. The axiom of choice definition in Lean is maybe 10 characters, and the other ones are maybe 100. They're not very complicated — they're just a little bit annoying to write in math. And then people say: if we assume these axioms are true, and they're common sense, just a bit more complicated, and we've checked every single step against those axioms, then we say the whole proof is correct.
(30:57) Nathan Labenz: Could you give a few examples of the pieces and the moves? Obviously we can't come anywhere close to being exhaustive, but what are the primitives in terms of the--
(31:10) Tudor Achim: I'll give a mathematical but simpler example of a primitive. Let's consider first-order logic, where a basic deduction rule is: if A is true, and A implies B, then B is true. So say the theorem is: given A, and given that if A then B and if B then C, then C is true. The proof says: A is true, and I have if A then B, which means B is true. Then I have the step that B is true, and I know that if B then C, so I can conclude that C is true. Now this is first-order logic, so it's not quite the same as what we're talking about in Lean -- you can do more advanced types of logical statements there -- but ultimately that's what's happening. I think the next step beyond that is just getting to Lean and the calculus of constructions and these axioms.
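That chain of implications is small enough to write out directly; here is a sketch of it as a Lean 4 theorem, where the kernel checks each application of an implication just as the rulebook analogy suggests.

```lean
-- The chained-implication example above as a Lean 4 theorem.
-- A, B, C are arbitrary propositions; `hab` and `hbc` are the implications,
-- `ha` says A holds, and the proof applies them in sequence to reach C.
theorem chain (A B C : Prop) (ha : A) (hab : A → B) (hbc : B → C) : C :=
  hbc (hab ha)
```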
(32:08) Vlad Tenev: There is one other thing -- when I was learning Lean, I found there are also people exploring the use of Lean to teach math. And I think now it's sort of practical at the high school level, but you could see a world where it extends to middle school and maybe even younger if someone's precocious enough. I think mathematics education will go from the chalkboard to the computer lab. So there's this thing called the Natural Number Game where you learn Lean by deducing properties of multiplication and addition, basically. So, for example, the commutative law, which is basically that a plus b equals b plus a, or the distributive law, where a times the quantity b plus c equals a times b plus a times c. So you can sort of discover and prove these fairly basic facts just using the core axioms and the Lean language. If anyone just wants to know, "what is this Lean thing? Why is it useful? But I'm not a research mathematician," dip your feet into it -- I would recommend that. And it's been extended to harder things too. I think there's the real analysis game now, which is for learning real analysis, which is very proof-based and basically the foundation of calculus. You can start with basic facts about what's a sequence, what's a real number, how many of these numbers are there, how big are the sets, and then you can kind of keep proving more and more complex things.
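For a flavor of the kind of fact the Natural Number Game has you establish, here is a sketch written directly in Lean 4 rather than in the game's interface; `Nat.add_comm` and the `omega` tactic are standard parts of core Lean, whereas in the game you would build the result up yourself by induction from the successor axioms.

```lean
-- Commutativity of addition on the natural numbers, the kind of basic law
-- the Natural Number Game has you prove from scratch.
theorem my_add_comm (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b

-- Outside the game, a decision procedure for linear arithmetic can also close it.
example (a b : Nat) : a + b = b + a := by
  omega
```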
(33:56) Nathan Labenz: That's a great tip. I'm definitely going to bookmark the real analysis game and see if I can get my soon-to-be seven-year-old into it. Hey, we'll continue our interview in a moment after a word from our sponsors. Want to accelerate software development by 500%? Meet Blitzy, the only autonomous code generation platform with infinite code context. Purpose built for large, complex, enterprise scale code bases. While other AI coding tools provide snippets of code and struggle with context, Blitzy ingests millions of lines of code and orchestrates thousands of agents that reason for hours to map every line level dependency. With a complete contextual understanding of your code base, Blitzy is ready to be deployed at the beginning of every sprint, creating a bespoke agent plan and then autonomously generating enterprise grade premium quality code grounded in a deep understanding of your existing code base, services, and standards. Blitzy's orchestration layer of cooperative agents thinks for hours to days, autonomously planning, building, improving and validating code. It executes spec and test driven development done at the speed of compute. The platform completes more than 80% of the work autonomously, typically weeks to months of work, while providing a clear action plan for the remaining human development. Used for both large scale feature additions and modernization work, Blitzy is the secret weapon for Fortune 500 companies globally. Unlocking 5x engineering velocity and delivering months of engineering work in a matter of days. You can hear directly about Blitzy from other Fortune 500 CTOs on the Modern CTO or CIO Classified podcasts or meet directly with the Blitzy team by visiting blitzy.com. That's blitzy.com. Schedule a meeting with their AI solutions consultants to discuss enabling an AI native SDLC in your organization today. The worst thing about automation is how often it breaks. You build a structured workflow, carefully map every field from step to step, and it works in testing. But when real data hits or something unexpected happens, the whole thing fails. What started as a time saver is now a fire you have to put out. Tasklet is different. It's an AI agent that runs 24/7. Just describe what you want in plain English -- send a daily briefing, triage support emails, or update your CRM -- and Tasklet figures out how to make it happen. Tasklet connects to more than 3,000 business tools out of the box, plus any API or MCP server. It can even use a computer to handle anything that can't be done programmatically. Unlike ChatGPT, Tasklet actually does the work for you. And unlike traditional automation software, it just works. No flowchart, no tedious setup, no knowledge silos where only one person understands how it works. Listen to my full interview with Tasklet founder and CEO, Andrew Lee. Try Tasklet for free at tasklet.ai, and use code cogrev to get 50% off your first month of any paid plan. That's code cogrev at tasklet.ai.
(37:09) Vlad Tenev: And we haven't really talked about MathLib, but the Lean kernel is quite small. There's an open source project called MathLib, which you can think of as the largest digital repository of mathematical knowledge. A lot of the famous theorems and results can be found in MathLib, and those give you almost like additional complex moves or algorithms to prove your thing. You can apply a theorem and it's almost like applying a function from a library, and that can help you get to the goal. So MathLib you can think of as an amalgam of the knowledge that's written in English, Russian, German, and all the textbooks out there -- consolidated in Lean and available open source. You can just browse it on GitHub by category: there's algebra, real analysis, geometry, statistics. So part of doing this is actually building this great repository that people are then building on and proving novel results on top of.
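As a tiny illustration of "applying a theorem like applying a function from a library," here is a sketch assuming Mathlib is imported: `add_comm` is a general Mathlib lemma about commutative additive structures, and `ring` is a Mathlib tactic built on top of the library's algebra.

```lean
import Mathlib

-- Using a Mathlib result the way you would call a library function: the reals
-- are an instance of the general structure `add_comm` is stated for.
example (a b : ℝ) : a + b = b + a := add_comm a b

-- Mathlib also ships tactics built on the library; `ring` closes goals that
-- follow from the commutative ring axioms.
example (a b : ℝ) : (a + b) ^ 2 = a ^ 2 + 2 * a * b + b ^ 2 := by ring
```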
(38:21) Tudor Achim: Yeah, I think people can understand what it is better if they just think of it like every math textbook in the world merged into one in a self-consistent way. So eventually all of mathematical knowledge will be in this one repository, and if you hit build on your computer, you'll be able to check it all from the foundations. If you have any question about any math concept, you just search for it, click go to definition, and you can jump around. It's really going to be the new foundation for math in the future.
(38:48) Vlad Tenev: It's pretty exciting. I think mathematics is certainly going to change fundamentally -- how it's done, how fast it moves. And I think to a large degree it already has, and AI is just going to accelerate it. The great thing about our timing is Harmonic really started when both of these things matured to a level of capability where you could start doing interesting stuff. Lean basically went from being essentially beta software -- not appropriate for real mission critical use cases, which was Lean 3 -- to Lean 4, and that was about the same month we launched the company. Also GPT-4, where you were starting to actually see glimmers of it being really, really good at synthesizing information and the starting points of reasoning, came out at around the same time. I think both of these matured to a level where you can start putting them together and doing really cool things. And I think we were just the first to see that, and that's how we came up with this concept of mathematical superintelligence -- which really means the combination of formal verification and formal tools with artificial intelligence.
(40:07) Nathan Labenz: Funny story -- as I was using Aristotle a little bit to try to wrap my head around all of this, I don't have the sophistication to pose any really interesting problems. So one challenge that I gave it was to prove that 2 plus 2 equals 4, and then I had to laugh when it came back just citing something from MathLib that was like, "this is already proved in MathLib" -- the theorem is literally called the 2 plus 2 equals 4 theorem. So I was like, it's done. And I was like, yeah, that's not exactly what I was looking for, but I guess I got what I deserved for asking such a basic question.
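For reference, the statement in question is about as small as Lean proofs get; a one-line sketch like the following is accepted by the kernel through pure computation, no Mathlib required.

```lean
-- 2 + 2 and 4 compute to the same natural number, so reflexivity closes the goal.
example : 2 + 2 = 4 := rfl
```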
(40:41) Vlad Tenev: Were you using the web interface or the terminal UI?
(40:45) Nathan Labenz: I started by having Claude Code install the terminal and then was using that a little bit. And then somehow it tipped me off to the fact that there was a web interface, so I moved over to the web interface.
(40:57) Vlad Tenev: Yeah, that came out last week and it's probably a little bit more appropriate for those types of questions. I think we wanted to roll it out on terminal first because I think it makes it a little bit more clear what the tool is great at. I mean, lots of things can answer 2 plus 2 equals 4.
(41:15) Nathan Labenz: Even I can answer that. Even the calculator.
(41:18) Vlad Tenev: Yeah. And for a while we were talking about how do we describe what Aristotle is. It's kind of like an amazing calculator where you can imagine you could just talk to your calculator. So it has the reliability -- you know, if your calculator gives you an answer, it's correct, but it's not very expressive. At the same time, something like ChatGPT or Claude is very expressive, but sometimes you have to double-check its work because it doesn't have the verification. The intent is really to put those together. And it turns out that the first things that people really want to be sure about and to verify are the more complicated things. So I think the complicated things are where you really start to have aha moments when you're using it.
(42:12) Nathan Labenz: Yeah. Let's get into Aristotle, and I appreciate the time spent in remedial education. I think it's beneficial not just for me but hopefully everybody will now be able to grok what we're about to get into much better with the foundation that we've laid. So Aristotle has three core parts. I'll just sketch them, and then you can give me the double click on them. First, there is this Monte Carlo tree search type thing -- I kind of think of that as an AlphaGo-like structure where we are systematically exploring the space of moves. I guess that's where I got the chess analogy, making this equivalence between Aristotle and AlphaGo. So it's kind of, well, maybe I can make this move, and then there's this learned scoring function that's like, okay, does that move seem promising? Does this branch of all possible moves seem promising? Do I seem like I'm getting closer to my goal? And with that, you can kind of grind things out, run deep tree search. The second part, in some ways, jumped out as even more interesting, and I really want to dig into the metaphysics of it a bit -- because this is the lemma-based informal reasoning system. I take that to be sort of saying, okay, if I have some really big mountain to climb, and it's maybe so big that it becomes impractical to just grind my way through all these small localized steps, it's sort of guessing what are the base camps I would want to get to along the way -- the really good waypoints such that if I can get there, I know I've made it somewhere. But that's really interesting because it seems a little bit more like a language model where it's kind of guessing and not so formal. I mean, it says in the technical report that it is an informal reasoning system. And then there's a third part, which we maybe don't have time to go as deep on, which is specifically dedicated to geometry. In the technical report you describe that as being like AlphaGeometry, which I think DeepMind developed. So correct any misconceptions I have there and give me the double click on what more I should understand about how this thing works.
(44:34) Tudor Achim: Sure. I think you covered the components pretty accurately. One thing I have to say is that we revamp our systems pretty often here. So I think Aristotle now looks quite different than Aristotle for the IMO -- a lot of things have been consolidated and improved. I think you made this point about the Monte Carlo tree search being more of a grinder. I wouldn't quite characterize it that way. The Monte Carlo tree search is actually doing a lot of inference on its own about high-level steps. The lemmas we're talking about are much closer to "solve a challenging math problem" than they are to "prove that 2 squared equals 4." So there's a lot of reasoning that goes into them. In some sense it's grinding once you get low enough in the search tree because you're just closing out cases or easy subproblems, but it's really solving harder problems on its own. When we combine it with the informal reasoning system, you can almost think of it as a form of context management. Ultimately, you need to end up with a Lean proof, and that's going to involve big steps and small steps. And it's helpful when you're focusing on the smaller steps to not have to remember the entire context of the bigger steps. It turns out the informal reasoning system itself actually makes enormous quantities of mistakes. One should not think of it as a really smart human laying out the steps to base camp -- it's more like a system that can propose lots of things that are wrong and don't have to be formalizable or even correct, and you kind of try to assemble things from that. So you can think of both of them as doing the same thing, just at slightly different scales and complementing each other. And they're actually all LLMs. As we described in the tech report, the tree search itself is driven by language models: part of the language model is proposing steps, part of it is scoring steps, but they work in concert to solve the lemmas and then eventually the full problems. And as you mentioned, AlphaGeometry is a slightly different system -- we're exploring high-level steps and then trying to use an algorithm to grind through the rest of it. I would say AlphaGeometry and the deductive reasoning system is really a grinder -- it's trying every possible conclusion of a geometry diagram. There's not too much pattern recognition intelligence going on there.
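To make that division of labor concrete, here is a deliberately toy sketch in Python of the general shape Tudor describes: a proposer suggests candidate next steps for a proof state, a scorer ranks the resulting states, a best-first search expands the most promising one, and only an independent checker (standing in for the Lean kernel) gets to declare success. This is not Harmonic's code; the proposer, scorer, and checker are stand-ins so the sketch runs end to end.

```python
import heapq

def propose_steps(state: str) -> list[str]:
    # Stand-in for an LLM proposing candidate tactics or lemmas for a proof state.
    return [state + "a", state + "b"]

def score(state: str, goal: str) -> int:
    # Stand-in for a learned value model: higher means "looks closer to a proof".
    return sum(1 for x, y in zip(state, goal) if x == y)

def kernel_accepts(state: str, goal: str) -> bool:
    # Stand-in for the Lean kernel confirming the goal is fully proved.
    return state == goal

def best_first_search(start: str, goal: str, budget: int = 1000) -> str | None:
    # Expand the highest-scoring state first; stop only when the checker accepts.
    frontier = [(-score(start, goal), start)]
    for _ in range(budget):
        if not frontier:
            break
        _, state = heapq.heappop(frontier)
        if kernel_accepts(state, goal):
            return state
        for nxt in propose_steps(state):
            if len(nxt) <= len(goal):
                heapq.heappush(frontier, (-score(nxt, goal), nxt))
    return None

print(best_first_search("", "abba"))  # finds "abba" via proposer-guided search
```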
(47:14) Vlad Tenev: Yeah, and that's because geometry is more constrained. You have points -- if you have three points, there's only so many angles involved. Obviously if you go to like 10 or 15 points, things blow up pretty quickly, but it also then becomes hard for humans to solve. And I think that's why geometry was among the first class of competition problems to fall to AI and automation. I think there are also a couple of other components that might seem simple but are nontrivial that Aristotle does and are independently improving. One is auto-formalization -- taking input that you provide in natural language and faithfully translating it into Lean in the best possible way. Relative to our competitors, at least, I'm not aware of anything that's as good at that as we are. And also theory building -- sometimes in the way of solving something, you have to create new theories and new structures that might not exist in MathLib. And Aristotle has the capability of actually building that on the fly and incorporating it into the proving process.
(48:42) Nathan Labenz: Another funny anecdote -- you're referring to what I discovered is informal mode, right, where I can provide -- I think real users would not do this, but you can provide anything, any natural language input, something that the system will then try to prove. I asked it to prove "all is love," and it came back and said, "this is a philosophical statement and outside the scope of the Lean-4 kernel's ability to prove." I also asked it to prove "Epstein did not kill himself," and it came back and said, "this is a statement about current events, and again it's sort of outside Lean-4's ability to prove." But yeah, I think this does kind of get back to this metaphysical question that I find so perplexing -- that translation from the messy real world of human affairs and intuitions to the formal definitions of what it is we would actually want to prove. I found it very interesting that you had such a thing at all. Do you have a sense for -- I also do want to get into a little bit more detail on technically how you created the models and all that. But on my spectrum from "2 plus 2 equals 4" to "all is love," how do you think about the intuition for where the boundary is? Because in listening to your previous interview with the Sequoia folks, it seemed like you had the sense that eventually, as the system gets capable enough, more and more things that are of interest to everyday people will start to become the sorts of things it can do. So how do you think about that boundary and how it expands over time?
(50:29) Tudor Achim: I think the ultimate boundary of a system like Aristotle is in reasoning through any problem where people can also agree on what it means to be a valid sequence of reasoning steps. So right now you have math -- that's one obvious one. When we talk about mathematics being the same as reasoning, that chess example you gave is a perfect one. You can express the logic of a chess game and then check it, and then reason about it. I think one area that's really going to touch a lot of people's lives is that you can use the same reasoning approaches to think about software. When people write software, they write unit tests and integration tests -- it's having the computer run the program and check the output against what they expect. But that's what they do after they've written the code. It turns out that when engineers are writing code, they're thinking logically: if I have this range in my input, as I go through this for loop and these if statements, it implies certain things about the output. And that itself is logical and mathematical reasoning. So we're starting to see API users reason about programs in the same way they can reason about math. People are writing cryptography implementations and then checking: is there any possibility that two inputs might give me the same output, which would violate a certain principle of the crypto algorithm? Or they might be implementing a controller for an autopilot and asking, is there any sequence of inputs for which I'll have an unstable dead zone? So I think the same kind of reasoning that was good for math will then go to software and help take us to a bug-free software future. Now, Vlad and I disagree a little bit -- it's not clear to me if we'll be writing history essays or something. Maybe there is a way to value them objectively, but I think the boundary is really in anything that's quantitative and logical in nature.
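To ground the software example, here is a toy sketch in Lean 4 of proving a property of code rather than testing it; the function `clampToByte` and its bound are made up for illustration, and the point is that the theorem covers every possible input rather than just the cases a test suite happens to run.

```lean
-- A tiny function plus a machine-checked guarantee about all of its outputs.
def clampToByte (n : Nat) : Nat :=
  if n ≤ 255 then n else 255

-- The kernel-checked claim: no input can ever produce an out-of-range output.
theorem clampToByte_le (n : Nat) : clampToByte n ≤ 255 := by
  unfold clampToByte
  split <;> omega
```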
(52:27) Vlad Tenev: I think in the first version of Aristotle, it would actually formalize and build a theory for your "all is love" example and give you a correct proof that it's probably true. And I think it surprised us -- people were asking all sorts of questions. We had people asking biology questions, medical questions, of course economics and financial math, and Tudor mentioned computer science. So it's actually surprised us how broad a set of things it can successfully create a theory around and formalize. The constraints we put in place were just that when you're building a product, you want to make sure you deliver value. At this point, I don't think we provide the most value if you want to write a history essay. So we're trying to nudge people toward discovering what Aristotle is really, really good at as quickly and simply as possible. And I think over time you should expect that surface area to increase. We start formalizing more things, and I don't think it's inconceivable that at some point it pulls current events and news from the internet, puts out the axioms, and can sort of fact-check and make conclusions based on real world events. Not our focus right now, but it's not a crazy thought. I mean, I ask questions sometimes -- I'm interested in astronomy, and I wanted to know when's the next total solar eclipse I can see from within 50 miles of Palo Alto. And the models usually struggle with this type of stuff because nobody's asked that identical question out on the internet, so they can't pull it -- you actually have to do some math. So you can imagine there's a spectrum, and there are questions like this that a model that can reason from first principles is going to be way better at.
(54:38) Nathan Labenz: Okay. Let's talk about just how you created this thing a little bit and how your experience and lessons learned relate to some of the live questions more broadly in the AI space. I think you can take on faith that folks listening to this show will be familiar with things like reinforcement learning from verifiable rewards and how the ability to generate synthetic data feeds into a system like that -- and that's, I'm sure, part of what you're doing. But what more can you tell us? Would it make sense to start training something like this from some off-the-shelf pretrained model, or does that messiness that LLMs start with corrupt or pollute the purity of the mathematical reasoning too much? Can you tell us anything about size of models -- parameters, tokens, whatever? I'm also interested in whether there's any role for taste in this process. Obviously mathematicians are very interested in correct proofs, but they're also interested in eureka moments and the sense of elegance. There's a sense of beauty that matters a great deal to many people alongside the correctness. And then I also noticed there's test-time training that's part of this, which is a huge trend I'm watching in general. So you can take any of those pitches -- what do you think are the most interesting next levels of depth that people can use to inform their own AI worldview?
(56:28) Tudor Achim: Well, first, I have to say that if your audience knows about reinforcement learning from verifiable rewards, you've got a great audience.
(56:33) Vlad Tenev: Synthetic data.
(56:34) Tudor Achim: Yeah, that's not --
(56:38) Nathan Labenz: I think that is a --
(56:39) Vlad Tenev: Safe assumption. Nobody was talking about that stuff -- it was like science fiction almost. But it's cool to --
(56:46) Tudor Achim: See it entering the popular consciousness. I want to address the taste question, because that actually strikes at a key thing that companies can decide on. So we got gold performance at the IMO with a very powerful system, and it was obvious we had to give it to people. And there are two ways you can do it. One way is you say, well, we're going to keep this in-house. We'll recruit some great mathematicians to come work in secret on problems, and as they make progress, we say, "Aristotle's now done X and Y and Z." That's one way of expressing taste in the research map. The other way -- which we ultimately decided to do, and we think it's been great for the community -- is we said, "We're not going to be the ones to decide what's important in math. We're going to make Aristotle accessible to everyone." So we opened up the API, the web interface, and there's a lot of great features coming. In this scenario, taste is expressed by the community through the revealed preference of what they submit to the API. We don't choose what kind of math they do. We're not saying, hey, Navier-Stokes is more important than P vs NP. It's the mathematicians with API credits who say, "we care about X or some other thing." And that's why we've seen so much interest in computer science, crypto, and certain branches of number theory. For a while, there were people doing a lot of interesting conjectures in graph theory on the platform. I think that's actually the right way for companies to engage with the community -- you open the system and let the people decide where they want to allocate those compute resources. That's an important decision, and we've come down on one side of it, but I think it's the right long-term approach.
(58:23) Vlad Tenev: I think there's a philosophical question in there too, which is: are we headed for a future where the AI labs themselves generate all the discoveries? Will the cure for cancer or diabetes look like a giant AI lab with a two-gigawatt data center just churning on the problem, and then it comes out and they capture all the value? Or does it look more like millions of people empowered with these tools, working independently and collaborating -- and in that world, they'll get the credit and the value will largely accrue to them? I think we believe the second world is more interesting and more likely. The first one is rather dystopian and less likely. And I think we noticed that because when we rolled out Aristotle, we had one view of what people would use it for. But then we started getting all of these Erdős problem results and things like that. We're not going to run on all the Erdős problems ourselves. We're not going to do computational learning theory formalizations in-house. So the amount of cool things being done with it just explodes if you make it generally available. It's not only right from a business strategy standpoint, but I think the world we're building assuming this path is a better world that I would like to live in.
(1:00:05) Nathan Labenz: So that speaks to taste in terms of problem selection. But I was also just thinking in terms of, as you're training the model, you've got the correctness signal. Maybe one heuristic for elegance would be just brevity -- one kind of way of trying to send an elegance-like signal through a deterministic mechanism. But I would be very interested to know if there is a panel of mathematicians reviewing solutions for elegance to try to make sure that this thing is not just a pure grinder long-term, but really has a more eureka flavor to it.
(1:00:45) Vlad Tenev: Well, if brevity is the definition of elegance, then our 2 plus 2 equals 4 proof probably takes the cake, right?
(1:00:53) Nathan Labenz: Yeah. Can't get any shorter than that.
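For readers who haven't seen Lean, here is a minimal sketch of what such a one-line formal statement looks like in Lean 4. This is purely illustrative, not Harmonic's actual proof file; the point is that the trusted kernel certifies the equality by computation.

```lean
-- Illustrative only, not Harmonic's file: Lean's kernel checks this
-- equality by reducing both sides to the same numeral.
example : 2 + 2 = 4 := rfl
```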
(1:00:55) Tudor Achim: I would feel bad for any mathematician whose job it was to compare AI proofs. That's certainly not the job I'd want. So we have never --
(1:01:03) Nathan Labenz: It's a big business these days across all domains, right? Many billions spent on expert validation of AI outputs.
(1:01:12) Tudor Achim: Yeah. We have done essentially zero of that in the two years we've been around. I think the metric we optimize for is the net present value of future proofs -- or computational cost of future proofs. And that guards very naturally against certain phenomena. When you're solving easy problems early on in reinforcement learning, you absolutely can solve them with grinding -- you can say, let me just do brute force. But you know that if you do that, it's going to cause issues later because you haven't learned how to do more complicated things. In contrast, if you're given two proofs that are not grinding, but one is drastically longer and more inefficient than the other, you prefer the more efficient one. So there's a tension there because you can get more efficient by grinding, but that messes you up in the future. It's a balance that our AI researchers strike based on their intuitions about what will be helpful long-term. But we have never had panels of mathematicians do A/B testing on proofs or anything like that. Really, you want to give your system as few priors as possible and just run reinforcement learning at scale. There's a famous essay called The Bitter Lesson -- which I'm sure your viewers are familiar with -- and we really believe in that at Harmonic. To get to your question about how we started: yeah, sometimes we'll start from pretrained models. Ultimately, you want to do whatever optimizes that net present value of future cost of proofs. So pretrained models are great for that. I think at some point you might ask, well, is that going to bias you too much toward how humans do math? So you want to mix in reasoning systems that are not trained from human knowledge -- they have more entropy and more complementary knowledge. That kind of thing we always play with, but it hasn't really been the limiting factor so far. I think pretrained models are a great starting point.
(1:03:07) Nathan Labenz: Cool. So Goodfire just announced today that they raised a bunch of money at a unicorn valuation. I was a very small-scale early supporter of theirs. It has me thinking -- and this also kind of connects to Vlad's comment that the system can sort of invent new theory. One big thing that people have said AIs can't do, which is always a dangerous position to take, is that they can't come up with new abstractions. Sure, they can learn from what we've done and what we've encoded into language, but will they ever come up with their own abstractions? I think that's an increasingly hard position to defend. But what's so interesting with Goodfire is they're now starting to look at model internals and unlock new kinds of understanding based on what the model has learned -- the famous one they just put out is new markers of Alzheimer's that people didn't know about, which the model was able to figure out, and they were able to figure out what the model had learned by looking internally. I'm wondering -- have you guys done any interpretability work on your models? Do you think there is a different kind of latent space that you're tapping into? And do you see hybrids as part of the future? Because one thing I could imagine happening is starting to stitch together a mathematical superintelligence with a more fuzzy, associative, understand-the-world superintelligence, perhaps later in post-training, to try to get the best of both worlds.
(1:05:11) Vlad Tenev: One of the things that I'm very excited about is eventually Aristotle powering a spacecraft, much like HAL 9000, but a benevolent one that doesn't go crazy. I think eventually you'll see it expanding into more real-world things. I don't know if you're as excited about that.
(1:05:14) Tudor Achim: A safe HAL 9000 sounds great.
(1:05:15) Vlad Tenev: A safe HAL 9000, I think, would be very valuable.
(1:05:18) Tudor Achim: To your question on interpretability, I think interpretability is often used as a proxy for trustworthiness. A lot of the reason people explore interpretability technology is that they can make sure the system does the right thing and aligns with the user's intent. When it comes to trustworthiness, we made the explicit decision at the very beginning of the company to focus on Lean. Outputting our reasoning in a formally verified way is the most interpretable possible output. The computer can check it. If the human wants to understand how the proof works, they just keep hitting go to definition. It's almost like navigating through a codebase. There's no more interpretable way to output math than in Lean. That's the maximal version. So the question becomes, okay, how interpretable is the model? In the context of the bitter lesson, we just focus on letting the system do whatever it can to optimize for computationally cheap proofs of more and more complex things, with the caveat that it has to output in a way that's verifiable. Down the road, we're very curious — how does it do math? How is it so smart? And we'll look into that. But for us, we've solved the trustworthiness question upfront by focusing on formally verified output.
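To make the "go to definition" point concrete, here is a small, purely illustrative Lean 4 example (not Aristotle output). Every name appearing in the proof, such as Nat.add_comm, is a definition the reader can jump to, so understanding the proof reduces to navigating definitions the way one navigates a codebase.

```lean
-- Illustrative sketch, not Aristotle output. Each lemma name below is a
-- clickable definition in an editor with Lean support.
example (a b c : Nat) : a + (b + c) = (c + b) + a := by
  rw [Nat.add_comm b c]        -- rewrite b + c to c + b using a named lemma
  exact Nat.add_comm a (c + b) -- close the goal with commutativity of addition
```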
(1:06:39) Nathan Labenz: Yeah. That's quite interesting. I have this one kind of mental model — mathematicians are famous for visualizing things. My visualization of what is happening in a large model is sort of like shrink-wrapping reality. You've wrapped in plastic all of internet data, or whatever domain it is that you're trying to learn at scale, and you're just sucking all the air out of it and gradually shrinking down to whatever, hopefully, is the true structure. It strikes me that in math in particular, that structure might be amazingly simple, and there might be really interesting things to learn by running that process and then cracking it open to see what's inside. I would expect it to be maybe a lot more interpretable internally than something that has had to learn all of internet data and can recite Wikipedia and all that sort of stuff.
(1:07:41) Tudor Achim: I actually think what these models are doing is interesting because they're smashing together all of the techniques that all mathematicians have done before. While I haven't seen the spark of superintelligence yet — some breakthrough eureka idea that's incomprehensible — I'd say that if you push it on learning how the models do things, you can ask it to solve more and more complex problems and just see how it pulled together three subfields of math in a way that no human has done before. I think that'll be a lot more interpretable and comprehensible than trying to dig through the weights. I might be wrong, but that's probably where I'd start to interpret how it does things.
(1:08:19) Nathan Labenz: So does that mean we can kind of look at different levels of difficulty of problem? We've got the Erdős problems. There's definitely a phenomenon happening right now where people are using either Aristotle by itself, or, as I've increasingly seen, GPT-5.2 Pro to generate a proof in token space and then bring it over to Aristotle for formalization. Then there's, of course, the IMO. If I understand correctly, it was just three groups — you guys, OpenAI, and DeepMind — that got gold-level performance. I think everybody missed the same one question, which is really interesting to me. I'd love your thoughts on why it was so consistent. And then of course you've got these extreme problems where you would need a Move 37-like moment to solve them. So maybe sketch out where we are on this curve of problem difficulty — are we just going to ride a smooth exponential, METR task-length style, all the way up to Millennium Prize problems? Or are there going to be breakpoints where you might need a new architecture, a new insight, or a new learning method to get from one range of problem difficulties to something qualitatively different?
(1:09:41) Vlad Tenev: On the IMO, the three labs that announced gold medal performance — us, DeepMind, and OpenAI — all missed question six. And I think that wasn't super surprising to us because question six is probably 5x harder even for humans. It's just a more complex question with lots of steps, and it requires a type of spatial reasoning that right now is more difficult to encode formally. We were running our system on it quite a bit and felt like we saw signs of life. So it's definitely not inconceivable that before too long, question six is going to fall and be gobbled up just like the other questions. Even one year before, questions three and five would have probably been well beyond reach for most of the models. So I think it does appear to be more or less a smooth exponential.
(1:10:49) Tudor Achim: Yeah, I agree with that. I want to highlight that there are two aspects of this. I think we're continuing to see a smooth exponential in terms of AI capabilities in math. What I think is a little more interesting — and was less predictable before — is that there's now definitively been a phase transition to formal math. Years ago, if you had asked someone whether you could automatically formalize a number theory paper in Lean, Rocq, or Isabelle, you'd have been laughed out of any room of mathematicians. And today, we are seeing people upload the full text of a math paper and run Aristotle a few times. We're thinking of adding a "run again" button to just keep going and keep going, and then you get a formal version of it. I think that phase transition has essentially come and gone now because of Aristotle. So in the next couple of years, as AI keeps improving, the fact that we can now formalize the AI's arguments obviates the need for humans to just be the verifiers — sitting there checking if some output is correct — and moves them to being the tastemakers, the ones deciding what problems to work on and whether they're happy with the techniques used. That is the interesting transition that's happened. Smooth exponential capabilities, but we've gone from zero to one on verification.
(1:12:09) Vlad Tenev: I think that's such a great point because there was some debate about this at the beginning. In a way, if you look at DeepMind, they started with formal with AlphaProof, which was the silver-medal-winning model back in 2024 — a great result at that time — and that was a formal model. Then they went back to informal for Gemini this year. I'm sure they ran AlphaProof; maybe it just didn't do as well. OpenAI, obviously, went informal. But if you think about it — let's say we go to a world five years from now where the autonomous math being done by AIs increases, and instead of five to ten page proofs, you're starting to produce 5,000-page proofs. You should assume that, as these models can autonomously reason more and get more efficient, they'll produce longer and longer output per unit time, and that'll be a proxy for complexity. Who's going to review that? Nobody's reading a 5,000-page math proof. So I think it's becoming even more clear that the future is formal, because someone has to validate and check it, and we want to make sure the time to validate doesn't actually grow linearly with the complexity of the proof.
(1:13:36) Tudor Achim: Yeah, and that was really the founding thought experiment of Harmonic. We asked ourselves in 2023: these models can do high school math poorly, but they could do elementary school math poorly a year before that. So what happens in ten years if we ask it to prove the Riemann hypothesis? Any model will make an attempt and give you 100,000 pages of output, which you might as well throw in the trash for two reasons. First, there's probably a mistake somewhere. Second, you can't process it — there's just nothing to do with it, you can't wrap your head around what's going on in that proof. So there were two hypotheses, both of which have been proven out: first, outputting math formally makes it digestible for humans and provides a high level of certainty and trust; and second, it'll lead to more efficient ways to do reinforcement learning for math, which is what we saw proved out. If you compare the resourcing we've had compared to the big labs, we're punching well above our weight at the IMO. In our view, the debate on formal versus informal is settled — clearly it's going to be formal. One can debate what's the most efficient way to train a model, and there are some aspects of informal that are helpful, but I don't think we're ever going back to a world where it's just informal from here on out.
(1:14:52) Vlad Tenev: I think the interesting question is to extend this to software, because the same things that hold for math actually hold for software. Let's say AIs are getting to the point where they can autonomously work and create a software project over a period of a week or multiple weeks. You know, the Cursor team ran this and generated a Chromium-compatible browser — something like one and a half million lines of code. Incredible. So who's going to read that code and find all the security vulnerabilities and bugs? And will the code generated by AIs in the future even be in Python and Java anymore? Why would it be? Those are just languages optimized for human readability. Whether it's humans reading and trusting something, or another AI that the model is collaborating with checking it, the answer is the same: you want to make the cost of verification as low as possible. That makes us believe the future of software is formal as well, and more and more software will be written in formally verifiable languages.
(1:16:11) Tudor Achim: Yeah, and Lean is our favorite language. It would be amazing if everyone could write in Lean. I think that as AI writes more and more code, it will be easier for people to accept that. But we'll see.
(1:16:20) Vlad Tenev: And it'll start with mission-critical stuff where bugs are much more serious and costly. There's a bunch of domains that are already doing formal verification for software, but doing it in a very artisanal way — hiring Lean, Rocq, or Isabelle experts and painstakingly formalizing things. So I think you'll start to see it accelerating the work of those people first, and then it'll just diffuse and you'll see formal vibe coding before too long.
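As a rough illustration of what formally verified software means in practice, here is a hypothetical toy function in Lean 4 (not drawn from any real codebase), shipped together with a machine-checked guarantee about its behavior.

```lean
-- Hypothetical toy example: a function plus a spec the kernel verifies.
def clampToByte (n : Nat) : Nat :=
  if n ≤ 255 then n else 255

-- The guarantee: the output never exceeds 255, for every possible input.
theorem clampToByte_le (n : Nat) : clampToByte n ≤ 255 := by
  unfold clampToByte
  split
  · assumption              -- case n ≤ 255: the result is n itself
  · exact Nat.le_refl 255   -- case n > 255: the result is the constant 255
```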
(1:16:52) Nathan Labenz: Yeah, I love the term "vibe proving," by the way. I think that vision is an incredibly compelling one, and it's also one I'm still kind of wrapping my head around. For listeners who haven't already heard it, I did an episode with Kathleen Fisher, who was at RAND — I think she's now moved to ARIA in the UK to lead their whole operation — and Byron Cook, who's a legend of the formal methods field at AWS. They're right there with you, envisioning this world of basically totally verified, bug-free software, starting with mission-critical stuff but potentially extending to everything over time. The one nagging question I have is: if we are training an AI to be superhuman at formal reasoning within the formal reasoning system that we have, how do we get new abstractions from that? How do we get an Einstein-like moment? It seems that at some point, we all sort of thought the world was just naturally three-dimensional and obviously intuitive. Then it came to light that that was an adaptive understanding of the world that served us well as primates — it allowed us to survive — but at the end of the day, we now know it's a lossy approximation of true physics. So I'm wondering: do we have any room for doubt that the math we have now, as sophisticated as it has become, might also at some point prove to be not quite the right paradigm? And is there any way, if you're training in this purely formal way, to punch your way out of the box as Einstein did?
(1:18:52) Tudor Achim: So he broke the fourth wall conceptually, but the key thing to remember is that he was able to describe his theory rigorously and formally in the framework of differential geometry. The point I was making earlier about math being reasoning is the point I'll appeal to now: no matter what complicated theory somebody might come up with to explain how the universe works in the future, if it's going to be based on a series of logical deductions that can be explained to someone else and checked independently, that is itself a logic that can be encoded with Lean or other languages like it and then verified. The axioms that Lean is based on are so minimal — expressing just the most basic possible common sense about how reasoning should happen, like "one thing might follow from another" or "if two things look the same, they are the same." That's the level of axiom we're talking about. So I really don't think there's any conflict here. One should just think about formal reasoning as an especially detailed version of informal reasoning that a computer can check automatically.
(1:20:04) Nathan Labenz: There's no—
(1:20:05) Tudor Achim: There's no limitation to it. Sometimes it might be a little more verbose than you'd want, so you write tactics and things to cut down on that, but there's really no fundamental tension between the two.
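A toy illustration of the axiom level being described, as an informal gloss rather than a precise account of Lean's underlying type theory: a deduction step is just applying an implication, and "two things that look the same are the same" is reflexivity.

```lean
-- "One thing might follow from another": modus ponens is function application.
example (P Q : Prop) (h : P → Q) (hp : P) : Q := h hp

-- "If two things look the same, they are the same": reflexivity of equality.
example (n : Nat) : n = n := rfl
```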
(1:20:15) Vlad Tenev: And you might also be thinking about Gödel's incompleteness — the fact that in any axiomatic system, there are statements that are true and unprovable, and also statements that are undecidable and independent. So there's a bunch of edge cases here, but I think it doesn't prevent us from making a lot of progress and proving the lion's share of useful things. There could be things that are unprovable but true that are very useful to know as well. But no way to know unless you explore the frontiers.
(1:20:57) Nathan Labenz: Do you think there's always going to be a role for entropy of some sort in these systems?
(1:21:06) Tudor Achim: Yeah, I think hallucinations are a key part of a reasoning system. Hallucinations are what allow a model to explore something that has never been encoded by a human before. So when we run Aristotle, whether it was at the IMO or now, it makes a lot of mistakes. It tries a lot of paths that don't work, but that exploration is the very thing that lets you get to the right answer after enough attempts. Entropy is crucial. I think this whole notion of seeking fundamentally hallucination-free LLMs doesn't really make much sense. Of course, you want to pair them with a system like Aristotle that can verify things end to end. But I think entropy and hallucinations are a key part of the training process for models like this.
(1:21:52) Nathan Labenz: You've got to be able to pose false statements in order to prove that they're false.
(1:21:58) Tudor Achim: Learn like humans. You try a lot of things.
(1:21:59) Vlad Tenev: True for humans. Some of the most creative humans are the ones that hallucinate the most.
(1:22:04) Nathan Labenz: So what's the latest progress on the path to superintelligence? I think this is true of all good frontier AI companies, whether at the application layer or the model layer or any hybrid of those — you're updating your systems frequently. It sounds like there's a kind of convergence going on between the tree search part and the informal lemma guesser you described in the technical report. What can you tell us about what the trends are right now?
(1:22:39) Tudor Achim: Just to review the progress: we started in 2023, and then in 2025, gold performance at the IMO, we topped out the Varina benchmark at the end of the year with our public API, and users started solving Erdős problems that had been unsolved for thirty or forty years. So there's a very clear trend in capabilities, and I think the phase transition I mentioned has also happened. What's next for Harmonic and for the field at large involves a couple of things. We can expect Mathlib to grow — think of it like the Wikipedia for math that's computationally certified. As Aristotle makes it possible to auto-formalize a lot of math, users will start contributing a lot of pull requests to Mathlib, which makes it possible to solve more and more problems on top of that base. When we look at how mathematicians are using our API, people are starting to work on more important unsolved conjectures. You can think about conjectures on a spectrum: there's a conjecture that's technically been open but nobody really cares about, so people aren't trying all the time. But now you might have conjectures where a mathematician might take a shot once or twice a year, and maybe a hundred mathematicians would try. And then eventually you've got the Millennium Prize problems, where any mathematician would be happy to spend years on it if they might be able to solve it. So what you can expect from Aristotle and other systems is that more and more problems get picked off, it becomes easier to use, and it extends to software — we have users using it to check safety-critical software, whether in Lean or other languages. Overall, if I had to pick just one trend, it's really that formal reasoning goes more and more mainstream. As more stuff is produced with AI, you'll see complementarily more formal reasoning to verify all of it.
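To make the Mathlib point concrete: once a statement is formalized, its proof can draw on the library's certified results by name. A minimal, purely illustrative example, assuming a standard Mathlib import:

```lean
import Mathlib

-- Illustrative: Nat.Prime.two_le is an existing Mathlib lemma; building on
-- the library means a one-line proof instead of re-deriving basic facts.
example (p : ℕ) (hp : Nat.Prime p) : 2 ≤ p := hp.two_le
```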
(1:24:32) Vlad Tenev: And on the product side, we've gotten a lot of feedback from the folks using it. Whenever you've got customers using a technology like this, they're very passionate, so they're constantly raising issues — asking us to improve the ergonomics, make it so people don't have to hop between so many different tools, and just solve their problem as simply as possible at the lowest possible cost. You should see that continue to improve. There have been updates to the system pretty much on a daily basis. Maybe you've seen some of them as you've been experimenting yourself, but that's going to continue and you should expect it to get exponentially more useful over time.
(1:25:25) Nathan Labenz: So maybe a good place to close is the vision for what that looks like as you succeed. Obviously one thing is solving Millennium Prize problems, but I'd love a more intuitive understanding than that. One dichotomy that comes to mind is this very formal reasoning-based paradigm versus what I think of as intuitive physics. Models seem very good at developing intuitive physics in any number of spaces — folding a protein with a model is not done in a formal way, it's just whatever mess of heuristics they learn, and they can do a protein fold orders of magnitude faster than we could via a physics-based simulation approach. When we think about no limit to math and what mathematical superintelligence looks like, I also think about what Eliezer once famously wrote — at least famously to me — that a real superintelligence in his mind could look at one still image and deduce all of physics from just the information contained in that image. That also connects to test-time training. What is your vision of how this thing evolves? Is it an ever-bigger tower of formal statements? Is there some role for new kinds of intuition, new abstractions that emerge that aren't so strictly defined but potentially useful? What is this thing doing in 2030 once all the Millennium Prize problems are solved?
(1:27:12) Tudor Achim: I think that by 2030, we will have theoretical explanations for everything, basically. If you look at the history of science, there are leaps of intellect and leaps of data. The microscope comes along, and all of a sudden you build a lot more theories of biology. The electron microscope comes along, and you can build more theories of chemistry. I think right now there's really been a shortage of people able to reason logically at the highest level. When you think about unifying general relativity and quantum mechanics, it's just a very hard thing to do. What you'll see is that anything that can be posed mathematically — which is what underlies all of science — we're just going to get theories for everything that are self-consistent and make sense. Then we're going to go back into a regime where we're data-limited. We're going to have maybe five theories that unify QM and GR, and we're going to have to run very high-energy experiments to figure out which one is right. We'll have to wait a while to build those colliders. But at the very least, we're not going to be bottlenecked anymore on wondering whether we can explain something. We'll have a system that can explain anything correctly. It really will be a renaissance of science — you just remove the intellectual bottleneck and everything flows.
(1:28:31) Nathan Labenz: So do I understand that correctly? Basically, you're envisioning multiple grand unified theories that all explain all the data that we have, and then it becomes a problem for the collider experiments to figure out which one of these is, in fact, right.
(1:28:48) Tudor Achim: Yeah. AI is not omniscient. Whether it's our model or others, they'll be able to reason about anything they can ground in their own logical deduction rules. But ultimately, there are aspects of the universe you just have to run the experiment and find out—
(1:29:04) Nathan Labenz: How it really works. Yeah. Wow.
(1:29:06) Tudor Achim: Just to be clear, I think there's a lot of utility before you get there. If I have to analyze asymptotically where we end up, that's my point.
(1:29:13) Nathan Labenz: Well, we've heard about centuries of scientific progress collapsed into five years. That sounds more like a few thousand years of scientific progress that's left — and then you just have—
(1:29:26) Tudor Achim: To get more data. But you'll have a superintelligent system that can help you.
(1:29:30) Nathan Labenz: Wow. Okay. That's about as grand a vision as I've heard anywhere. Do you guys worry about the safety of these systems? We haven't talked about that at all in this context, but I've done many explorations of different safety concerns. Eliezer, when he described the AI he was envisioning when he said it could understand all of physics from a single image, also thought it was going to be super dangerous because it would be so powerful. How do you guys think about that aspect of all this? We're talking about a lot of stuff in the next five years.
(1:30:02) Vlad Tenev: Right now we're not so worried about it because the outputs of our system are constrained. I think the first dangers will probably look a lot like cybersecurity incidents, because you have models making API calls and running autonomously and interacting with other systems. That both creates API-level cybersecurity holes and the mechanisms to exploit them. For our model, since the interface to the outside world is tightly constrained — it's not just going to fire off a request to your Gmail account or the iMessage APIs — we're a bit further away from that. But you can imagine we're going to have to start taking it much more seriously when we do connect the model to the outside world and the interfaces are not just Lean files being output.
(1:31:12) Nathan Labenz: Yeah, I do think constrained action space is certainly one of my favorite paradigms for keeping things under control.
(1:31:18) Vlad Tenev: The whole multibot thing has been fascinating to watch. I think we're entering a strange new world for sure. And the benefit is we're probably not at the danger frontier, so we'll have the opportunity to learn from others' mistakes — and hopefully they don't screw up too badly in the process.
(1:31:45) Nathan Labenz: Yeah. Okay. Fascinating stuff. This has been a fascinating conversation, guys. I really think the approach is interesting, and the vision for how far things could go by 2030 is both inspiring and, for me, a little bit scary. Anything else you want to leave people with before we break?
(1:32:08) Vlad Tenev: For me — and you can kind of see this in the values we put on our website — we believe in a future where humans are going to be at the center of all this progress. We'll definitely accelerate it, but humans should be in charge and calling the shots. That's also why we care so much about putting this into people's hands and letting them use it, rather than being a lab that runs things secretly and makes big proclamations. Humans need to be at the center of everything and still calling the shots. That's what we believe in, and that's the future we're helping bring to life.
(1:32:53) Tudor Achim: Yeah, and just to add to that — for me, when I started using Aristotle, it was very different to have an experience where the output is always correct. If people haven't experienced that before, they should just try it out. It's free to sign up for.
(1:33:06) Nathan Labenz: Cool. Well, I'm sure there'll be plenty of ways to monetize mathematical superintelligence when the time comes.
(1:33:11) Vlad Tenev: We might do ads, you know?
(1:33:13) Nathan Labenz: Yeah, I can't wait for that. Alright.
(1:33:16) Vlad Tenev: I'd love to bring those Anthropic ads to life.
(1:33:20) Nathan Labenz: Fascinating stuff, guys. I really look forward to watching your progress. Thanks for both the remedial education and a grand vision today. It's really extraordinary. What a time to be alive. Vlad Tenev and Tudor Achim, co-founders of Harmonic — thank you both for being part of the Cognitive Revolution.
(1:33:37) Tudor Achim: Thanks for having us.
(1:33:37) Vlad Tenev: Pleasure to be with you.
(1:33:39) Nathan Labenz: If you're finding value in the show, we'd appreciate it if you'd take a moment to share it with friends, post online, write a review on Apple Podcasts or Spotify, or leave us a comment on YouTube. We always welcome your feedback, guest and topic suggestions, and sponsorship inquiries, either via our website, cognitiverevolution.ai, or by DMing me on your favorite social network. The Cognitive Revolution is part of the Turpentine Network — a network of podcasts, now part of a16z, where experts talk technology, business, economics, geopolitics, culture, and more. We're produced by AI Podcasting. If you're looking for podcast production help for everything from the moment you stop recording to the moment your audience starts listening, check them out and see my endorsement at aipodcast.ing. And thank you to everyone who listens for being part of the Cognitive Revolution.