Intelligence on the Edge: Liquid AI's Ramin Hasani on the Search for Device-Native Foundation Models

Show Notes

If the dominant story of modern AI is "scale is all you need," this episode is the loyal opposition's most technically grounded rebuttal. Ramin Hasani, co-founder and CEO of Liquid AI, joins Nathan for ninety minutes on what it actually takes to pack the maximum amount of intelligence into the smallest possible package — and why the answer to that question keeps turning out to be different from the answer to "how do you build the most capable model in the world." It's a conversation that runs from the 300-neuron nervous system of a worm to a Mercedes-Benz dashboard, and it lands on a surprisingly humble recipe for efficient architectures hiding underneath all the bio-inspired machinery.

The origin story starts at MIT CSAIL, where Hasani, his co-founder Mathias Lechner, and CSAIL director Daniela Rus spent a decade asking how few neurons it takes to do something genuinely useful in the real world. Their muse was C. elegans, the only animal whose entire nervous system was fully mapped when the work began around 2015 — a worm that produces remarkably dexterous, out-of-distribution-robust behavior from roughly 300 graded (non-spiking, and therefore differentiable) neurons. By writing down the differential equations that govern how two such neurons exchange information and making them trainable with backpropagation, the team got eye-popping results: 12 neurons to parallel-park a car, 19 to drive one, 30 to fly a drone — and, in work hosted at MIT by Boeing for the US Air Force, a handful of neurons that could fly a jet. They called these liquid time-constant networks — "liquid" because their dynamics stay flexible even after training.
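
To make the "liquid" idea concrete, here is a minimal sketch of a single liquid time-constant neuron. It assumes the state equation from the published LTC formulation, dx/dt = -(1/tau + f) x + f A, where f is a sigmoid of the input and the current state. The explicit Euler step, the parameter values, and the names (`ltc_step`, `simulate`, `w_in`, `w_rec`, `bias`) are illustrative assumptions, not Liquid AI's implementation. What it shows is that the effective time constant, tau / (1 + tau f), moves with the input, so the dynamics stay input-dependent even once the weights are frozen.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def ltc_step(x, inp, dt, tau=1.0, A=1.0, w_in=2.0, w_rec=0.5, bias=-1.0):
    """One explicit-Euler step of dx/dt = -(1/tau + f) * x + f * A.

    All parameter values are made up for illustration.
    Returns the new state and the effective time constant at this step.
    """
    f = sigmoid(w_in * inp + w_rec * x + bias)  # input- and state-dependent gate
    dxdt = -(1.0 / tau + f) * x + f * A
    tau_eff = tau / (1.0 + tau * f)             # shrinks as the gate opens
    return x + dt * dxdt, tau_eff


def simulate(inputs, dt=0.01, x0=0.0):
    """Roll the neuron forward over a list of scalar inputs."""
    x, states, taus = x0, [], []
    for inp in inputs:
        x, tau_eff = ltc_step(x, inp, dt)
        states.append(x)
        taus.append(tau_eff)
    return states, taus


if __name__ == "__main__":
    # Weak input for 500 steps, then strong input for 500 steps.
    inputs = [0.0] * 500 + [3.0] * 500
    states, taus = simulate(inputs)
    print(f"effective tau under weak input:   {taus[499]:.3f}")
    print(f"effective tau under strong input: {taus[-1]:.3f}")
    print(f"final state: {states[-1]:.3f}")
```

With these made-up parameters the neuron responds faster (smaller effective time constant) under the strong input than under the weak one, and the state stays between 0 and A.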

The catch was scale. Each "neuron" is a differential equation with internal feedback and nested nonlinearities, and rolling those out with numerical solvers pushes computational complexity toward cubic. The breakthrough was solving the whole system in closed form — a result, published in Nature Machine Intelligence in November 2022, for a family of equations that had lacked a known closed-form solution since the Hodgkin-Huxley lineage of neuronal modeling (which Hasani traces back to Louis Lapicque's 1907 membrane work). With no solver in the loop, liquid networks could in principle scale to billions of neurons. A February 2023 Quanta Magazine profile of Hasani and Lechner did the rest: the inbox filled with term sheets, and Liquid AI was off.
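
The solver-versus-closed-form trade-off can be illustrated on a much simpler equation than the one the paper solves. The sketch below uses a linear leaky integrator, dx/dt = -x/tau + I with constant input, whose exact solution is textbook material. It is an assumption chosen for illustration and is not the CfC result; the function names are hypothetical. It shows the general pattern Hasani describes: a numerical solver buys accuracy with more steps, while a closed form is a single evaluation.

```python
import math


def euler_leaky(x0, current, tau, t_end, n_steps):
    """Integrate dx/dt = -x/tau + current with n_steps explicit-Euler steps."""
    h = t_end / n_steps
    x = x0
    for _ in range(n_steps):
        x += h * (-x / tau + current)
    return x


def closed_form_leaky(x0, current, tau, t):
    """Exact solution for constant input: steady state plus decaying transient."""
    steady = current * tau
    return steady + (x0 - steady) * math.exp(-t / tau)


if __name__ == "__main__":
    x0, current, tau, t_end = 0.0, 2.0, 0.5, 1.0
    exact = closed_form_leaky(x0, current, tau, t_end)
    for n in (10, 100, 1000):
        approx = euler_leaky(x0, current, tau, t_end, n)
        print(f"{n:5d} solver steps: error = {abs(approx - exact):.6f}")
    print("closed form: one evaluation, no step-size error")
```

The error shrinks as the step count grows, but every extra step is extra compute in both the forward and the backward pass, which is the cost the closed-form approach removes.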

The richest technical thread is about nonlinearity, parallelization, and scaling laws. The reason the field's "alternative" architectures — including Mamba and other state space models — are all linear dynamical systems is that you can't cleanly tensorize a nonlinear recurrence to run it in parallel. Hasani's sharpest claim: "scaling laws define architecture." The bigger the model, the more you want it unstructured — pure matrix multiplication, no hand-tuned bias — which is exactly why attention works so well at trillions of parameters. Architectural bias (extra gating, recurrence, convolutions) only earns its keep at smaller, more specialized scales, where it can match the model to the dynamics of a specific dataset.
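
Why linearity matters for parallel hardware can be sketched in a few lines. For a linear recurrence h_t = a_t * h_{t-1} + b_t, two consecutive steps collapse into one step of the same form, so the whole sequence can be computed with a prefix scan in roughly log2(T) rounds instead of T sequential steps. The code below is a plain-Python illustration under that assumption; the function names are hypothetical and this is not how any particular SSM library implements its scan.

```python
def combine(left, right):
    """Compose two steps h -> a*h + b, applying `left` first, then `right`."""
    a1, b1 = left
    a2, b2 = right
    return (a1 * a2, a2 * b1 + b2)


def sequential_scan(a, b):
    """Reference loop: h_t = a_t * h_{t-1} + b_t, starting from h = 0."""
    h, out = 0.0, []
    for a_t, b_t in zip(a, b):
        h = a_t * h + b_t
        out.append(h)
    return out


def doubling_scan(a, b):
    """Same result in about log2(T) rounds (Hillis-Steele style prefix scan).

    Within a round every position is independent of the others, which is
    what parallel hardware would exploit; here the rounds are only simulated.
    """
    elems = list(zip(a, b))
    offset = 1
    while offset < len(elems):
        elems = [
            combine(elems[i - offset], elems[i]) if i >= offset else elems[i]
            for i in range(len(elems))
        ]
        offset *= 2
    return [b_t for _, b_t in elems]


if __name__ == "__main__":
    a = [0.9, 0.5, 0.8, 1.1, 0.7, 0.3, 1.0]
    b = [1.0, 2.0, -1.0, 0.5, 0.0, 3.0, -2.0]
    print(sequential_scan(a, b))
    print(doubling_scan(a, b))
```

If each step were wrapped in a nonlinearity, for example h_t = tanh(a_t * h_{t-1} + b_t), two steps would no longer reduce to a single (a, b) pair, so `combine` has no simple equivalent. That is the obstacle the paragraph above refers to.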

That insight is what let Liquid go "architecture-neutral." Rather than letting what Hasani calls "the Avengers of the architectures" hand-tune designs based on personal intuition — a practice he thinks is "genuinely broken in all the foundation model labs" — Liquid built Automated Foundation Model Design (AFMD), a meta-learning search that puts real target hardware in the loop, optimizes against ~100 real downstream benchmarks (not perplexity), and ruthlessly strips out human bias in the spirit of the Bitter Lesson. The punchline is almost anticlimactic: across CPUs from AMD, Qualcomm, Intel, and ARM, the winning recipe simplified down to a double-gated convolution — short convolutions plus a gate — making up 70-80% of the network, with a reduced share of attention layers (~30%) preserved where the O(N²) richness is genuinely needed. If you were to update "Attention Is All You Need," Nathan suggests, it'd read: attention is still something you need at scale, but a gate on a very simple convolution does most of the rest. And the gate is the heart of it — input-dependence is the thing Liquid introduced to SSMs in Liquid-S4 roughly a year and a half before Mamba popularized the idea; crucially, the magic shows up in the backward pass, where the model learns dynamics, not just weights.
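
For readers who want a feel for what "a gate on a very simple convolution" could look like, here is a single-channel sketch. These notes describe the winning operator only as short convolutions plus a gate, so the exact structure below (two multiplicative, input-dependent gates around a short causal convolution) is an assumption, and the names `gated_short_conv`, `w_b`, `w_c`, `w_v`, and `kernel` are hypothetical. Real models would use learned projection matrices across many channels; scalars are used here only to keep the mechanics visible.

```python
def gated_short_conv(x, w_b=1.0, w_c=1.0, w_v=1.0, kernel=(0.5, 0.3, 0.2)):
    """Single-channel sketch: gate, short causal convolution, gate again.

    kernel[0] weights the current step, kernel[k] the step k positions back.
    """
    b = [w_b * x_t for x_t in x]  # first input-dependent gate
    c = [w_c * x_t for x_t in x]  # second input-dependent gate
    v = [w_v * x_t for x_t in x]  # value path
    u = [b_t * v_t for b_t, v_t in zip(b, v)]
    z = []
    for t in range(len(u)):
        acc = 0.0
        for k, w_k in enumerate(kernel):
            if t - k >= 0:        # causal: never look at future steps
                acc += w_k * u[t - k]
        z.append(acc)
    return [c_t * z_t for c_t, z_t in zip(c, z)]


if __name__ == "__main__":
    print(gated_short_conv([1.0, 2.0, 3.0, 4.0]))
```

The cost is proportional to sequence length times kernel length, rather than the sequence length squared that attention pays, which is consistent with the notes' point that attention is kept only for the minority of layers where its richness is needed.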

The flip side is where bias wins: genomics, where vocabularies are tiny but sequences run to billions of tokens and attention's quadratic cost is a non-starter; video, where diffusion earns a place; audio, where recurrent nets remain "very, very powerful," especially in low-data regimes; and physics-informed networks for simulation and digital twins. This is the bridge to commercialization. Liquid's open-weight LFM family now sees over a million downloads a week on Hugging Face, ranking fifth in the US behind Google, Meta, Microsoft, and NVIDIA. Its real ambition is the trillion-dollar substrate outside the data center: the ~$500B annual smartphone market, plus laptops, wearables, and cars. The LFM2/LFM2.5 models already run in production at Shopify (improving click-through on recommendation and search — Nathan flags the Latent Space episode with CTO Mikhail Parakhin), and a new Mercedes-Benz deal will put a 600MB Liquid model behind the voice in the car.

On hardware, Hasani argues the CUDA moat is effectively gone — agents can now automate kernel-level optimization — so chip makers need to climb the stack to the "intelligence layer," the way NVIDIA has with Nemotron. Whoever ships an efficient, pre-loaded, self-improving intelligence layer on top of their silicon wins, even with slightly weaker raw specs, because agentic applications all need an intelligence base to build on. For the practical local-AI build Nathan wants — running his five-year personal "deep context" database through a model on his own machine before anything touches the cloud — Hasani's prescription is an on-device LFM2 mixture-of-experts acting as a router/orchestrator, with small PII filters scrubbing data before any cloud call. Off-the-shelf 24B models aren't good enough yet, but targeted fine-tuning (possibly on synthetic data the model generates itself) can recover frontier-comparable performance for "tens to low thousands of dollars," and Liquid plans to ship platforms that automate exactly that. His advice to Nathan: you won't have to DIY it — just wait a few months.

The closer is the most philosophical stretch. Asked how far intelligence can be miniaturized, Hasani is candid that today's algorithms won't reach the brain's intelligence-per-watt — partly because evolution did pre-training we can't shortcut. He frames in-context learning as an emergent "mushy version of gradient descent" — a single learned algorithm (roughly least squares) that fell out of next-token prediction. Humans, by contrast, run a diverse portfolio: reinforcement learning, mental simulation, Bayesian inference. Getting to truly efficient intelligence, he argues, will require discovering the right set of "confounding algorithms" to seed at the start of training, not forcing RL onto trajectories after the fact — echoing themes from the prior CR episode on Ali Behrouz's "Nested Learning: The Illusion of Deep Learning Architectures." The note he ends on is pure techno-optimism: this is the best time in history to be curious, agents have made frontier research broadly accessible, and — as he puts it — even while talking to Nathan, he's "already training something in the background."

Topics covered

(02:00) Liquid AI origin — spun out of MIT CSAIL; "maximize intelligence in the smallest format"
(04:00) Why C. elegans: a 300-neuron worm with graded, differentiable, trainable neurons
(06:00) Liquid time-constant networks — 12/19/30 neurons to park/drive/fly; jets with the Air Force / Boeing
(09:00) The scalability wall — differential-equation neurons, cubic complexity, numerical solvers
(10:00) The closed-form (CfC) breakthrough; the Nature paper and Quanta profile
(14:00) Counting neurons vs. parameters (×7), and why nonlinearity blocks parallelization
(18:00) "Scaling laws define architecture" — bigger means more unstructured
(24:00) Continual learning vs. adaptivity — the "second axis" of dynamics; rain on the windshield
(30:00) AFMD and going architecture-neutral; the "Avengers of architecture"; the Bitter Lesson
(34:00) What won — the double-gated convolution; LFM2; ~70-80% gated convs, ~30% attention
(38:00) Gating as input-dependence; Liquid-S4 before Mamba; magic in the backward pass
(44:00) Where bias wins — DNA/genomics, video diffusion, audio RNNs, physics-informed nets
(52:00) The $500B smartphone market; the trillion-dollar substrate outside data centers; Shopify, Mercedes-Benz
(60:00) Kernels, NPUs, the dead CUDA moat; hardware makers should own the "intelligence layer" (Nemotron)
(68:00) Vertical integration — will the model come welded to the hardware?
(72:00) A practical local-AI build — LFM2 as orchestrator/router, PII filtering, fine-tuning to frontier-comparable
(80:00) Limits of miniaturizing intelligence — emergent in-context "gradient descent," and why more learning algorithms are needed
(85:00) Closing — techno-optimism, curiosity over fear

Resources

Liquid AI — efficient foundation models at every scale
Ramin Hasani — co-founder & CEO (MIT CSAIL profile)
Mathias Lechner — co-founder; liquid neural networks collaborator
Daniela Rus — MIT CSAIL director and Liquid co-founder
MIT CSAIL
Liquid Time-constant Networks (LTC)
Closed-form Continuous-time Neural Networks (CfC) — Nature Machine Intelligence, 2022
Liquid Structural State-Space Models (Liquid-S4)
Mamba: Linear-Time Sequence Modeling with Selective State Spaces (code)
Quanta Magazine profile (Feb 2023)
LFM2 / LFM2.5 (all models on Hugging Face)
LFM2.5-8B-A1B — on-device mixture-of-experts for local agents
Shopify (Latent Space ep. w/ CTO Mikhail Parakhin)
Mercedes-Benz
NVIDIA Nemotron
Ali Behrouz — "Nested Learning: The Illusion of Deep Learning Architectures"
The Bitter Lesson (Rich Sutton)
Hodgkin-Huxley model
C. elegans nervous system
Louis Lapicque's 1907 membrane-potential model — origin of the integrate-and-fire lineage

Quotes worth pulling

"Our objective function at MIT has always been maximizing the amount of intelligence we can pack into the smallest format of algorithms."
"With 12 neurons you could parallel-park a car. With 19 neurons you could drive a car. With 30 neurons you could fly and navigate a drone."
"I believe scaling laws define architecture… the larger the neural network you make — approaching infinite size — the more you want it to become less and less structured."
"There are a bunch of people I call the Avengers of the architectures… I think this is genuinely broken in all the foundation model labs. You have to find a systematic way to discover what the true architecture is."
"When our current AI systems do in-context learning, they learned a vague representation of one algorithm — essentially least squares. A mushy version of gradient descent in context."
"We don't have enough energy to host all of this at scale in data centers. We have to work smart. We cannot just use the fanciest model for everything."
"You can't imagine, even right now while we're talking, I'm already training something in the background."

Sponsors:

Mercury:

Command is Mercury’s new conversational interface, giving you natural-language access to your finances and helping you take actions within your existing permissions and approval policies. Visit https://mercury.com to learn more and apply online in minutes.

Claude:

Claude by Anthropic is an AI collaborator that understands your workflow and helps you tackle research, writing, coding, and organization with deep context. Get started with Claude and explore Claude Pro at https://claude.ai/tcr

CHAPTERS:

(00:00) About the Episode

(03:53) Special Sponsor

(05:41) Liquid AI origins

(22:35) Neurons versus parameters (Part 1)

(22:40) Sponsor: Claude

(24:32) Neurons versus parameters (Part 2)

(30:51) Scaling liquid networks

(40:09) Automated model design

(52:04) Gating and input dependence

(01:01:16) Architecture bias spectrum

(01:09:17) Device foundation models

(01:18:16) Hardware intelligence layer

(01:30:01) Local agent setup

(01:36:20) Miniaturizing intelligence limits

(01:40:40) Curiosity driven AI future

(01:43:45) Episode Outro

(01:46:46) Outro

PRODUCED BY:

https://aipodcast.ing

SOCIAL LINKS:

Website: https://www.cognitiverevolution.ai

Twitter (Podcast): https://x.com/cogrev_podcast

Twitter (Nathan): https://x.com/labenz

LinkedIn: https://linkedin.com/in/nathanlabenz/

Youtube: https://youtube.com/@CognitiveRevolutionPodcast

Apple: https://podcasts.apple.com/de/podcast/the-cognitive-revolution-ai-builders-researchers-and/id1669813431

Spotify: https://open.spotify.com/show/6yHyok3M3BjqzR0VB5MSyk

Transcript

This transcript is automatically generated; we strive for accuracy, but errors in wording or speaker identification may occur. Please verify key details when needed.

Main Episode

[00:00] Nathan Labenz: Ramin Hasani, co-founder and CEO at Liquid AI, welcome to the Cognitive Revolution.

[00:06] Ramin Hasani: Thanks for having me.

[00:08] Nathan Labenz: I'm excited for this conversation. I've been following Liquid AI from afar for a number of years and have been fascinated by some of the architectures that you guys have developed, going back to your time at MIT together. I'm also fascinated by the trajectory that I perceive the company to have taken over the last couple of years as it's become a more customer-focused, commercial entity. So maybe for starters, how would you tell the broad story of Liquid AI leading up to what you're doing today, and what is the company's mission today?

[00:43] Ramin Hasani: Yeah, absolutely. So Liquid AI, three and a half years ago, we spun out of MIT CSAIL, building on a technology that we had actually been working on for about a decade before we started the company. Our objective function at MIT has always been maximizing the amount of intelligence that we can pack into a smaller format of algorithms. So efficiency has been the cornerstone of our research. And we have been working specifically on robotics and systems that were coming into the real world. The idea of liquid neural networks was kind of discovered in my PhD thesis, and together with my co-founders, the four co-founders that I have, we've been researching how we can build machine learning solutions that can go on robots and don't have millions or billions of parameters, so that we can actually host them directly on, let's say, CPUs, NPUs, or smaller GPUs that are mounted on top of physical systems, while delivering the reliability of much larger instances of artificial intelligence systems. Essentially, what we try to do is build alternative algorithms, to get creative in the algorithmic space, to see how we can build machine learning systems that can generalize beyond the data that they have seen. Because when you go into the real world, distribution shift becomes a real thing. Imagine you deploy a robot in an open world, let's say an autonomous car, a flying drone, a fixed-wing vehicle. All of these systems are going to get out of distribution very, very rapidly. So you have to actually build systems that are really comfortable with being out of distribution.
You know, humans are extremely good at it, and natural learning systems, animals, are the same way, right? They actually follow a really nice trajectory out of distribution as well. That means the learned concepts can generalize to data that you have not seen before, right? So our idea was: can we build systems that have that kind of property, better out-of-distribution generalization? This has been not just our goal; I think the entire artificial intelligence field has been working towards this, especially in robotics and closed-loop environments, where you have an agent acting in an environment. You want to have that kind of property, so naturally the place that we started looking into was brains. We started looking into animal brains, and one thing that I really liked, I wanted to look at it from a first-principles approach: how neurons exchange information with each other. We started looking into the brains of worms, small animals. And why worms, why this specific worm, C. elegans? The reason is that in 2015, when I started this type of research as part of my PhD with my co-founder Mathias Lechner, this was the only animal for which we knew the entire nervous system as a whole, and this animal exhibits a massive amount of sensory-reactive behavior, amazing levels of control, with 300 cells in its nervous system. And that was what was fascinating for us, because this is much smaller than any neural network that performed control at that time on, let's say, autonomous systems, and this worm can do dexterous movements better than the best robotic systems that we actually had in the world.
So we thought, okay, let's start understanding how neurons exchange information in the brain of the worm. From there on, let's start building nervous systems, more and more complex neural circuits, so that we can get to the stage where we can, let's say, build the next animal, basically follow the path of evolution of nervous systems, and see how we can rise and evolve as part of this. So there are equations that describe the neuronal dynamics of, let's say, two neurons. In the brain of C. elegans, because the worm is very small, the neurons do not spike.

[04:46] Ramin Hasani: They're graded neurons. They behave in an electrotonic way, you know? So they're very similar to the artificial neural networks that we have, because they're also differentiable, right? Because you don't have spikes in the activity of neurons, they're differentiable. That's why it was even more attractive for us, because we could apply learning theory to these types of neural networks. Once you construct, let's say, two neurons, four neurons, eight neurons, let's say 100 neurons next to each other, and you start training them, you can apply backpropagation as a kind of differentiable programming on top of a system that is built on these inspirations that we got from nature. The differential equations that were there were also very well-behaved. You can make them as complicated as you want, and when they get more complicated, they mimic the biology much more. And of course, from a computational footprint perspective, they become more complex to scale, but you can actually tune how much complexity you want to encode inside a neural architecture. In artificial neural networks, we try to abstract away those complex differential equations that describe the neural dynamics between two cells, and we just represent them with a sigmoidal function, gated sigmoidal functions, and now we have matrix multiplications capturing the impact of the inputs that are coming into a system, with a transformer-based architecture and attention mechanism and all of those things. But these are all simplified computations for us to be able to scale these machine learning solutions.
For us, from a neuroscience perspective, it was very interesting to explore whether, if we actually start bringing back that differential-equation-based computation, a somewhat more elaborate form of computation, into the behavior of every single neuron, and mimic how two neurons exchange information with each other, maybe we can unlock something greater than what we have seen from an artificial neural networks perspective. So, I mean, early on, the results were fascinating. We saw that with 12 neurons, with actually 12 neurons, you could autonomously parallel park a car, a small car. With 19 neurons, you could drive a car. With 30 neurons, you could fly autonomously, navigating a drone. And you can get sensory information and process information with these slightly more complex and elaborate versions of neural dynamics, which we called liquid neural networks, liquid for adaptability. I called them liquid time-constant neural networks; that was the LTC paper that we put out. And we coined the name liquid for the fact that the dynamics of these systems stay flexible even after training. So the models would be able to react to new types of inputs that they receive, and even learn during backpropagation how to be adaptive towards the inputs that they're receiving. So it actually encoded more degrees of freedom, more flexibility in the learning dynamics of a neural network, compared to artificial neural networks, compared to other systems that you've seen.
That form of dynamics actually allowed us to scale this technology not just in robotics, but also to apply it to predictive AI in financial services, in the medical domain, and in many, many different places, like, let's say, audio modeling and multimodal video understanding. We applied this technology over and over, and we saw promise that this technology is actually very, very amazing. Every neuron is much more complex than in a normal neural network, but at the same time, you don't need that many points of computation in order to get to the results that you want. We also saw that the out-of-distribution performance of these models is extremely good, and they were suited for robotics applications. So, I mean, we had interactions with the United States Air Force. At that time, Boeing was actually hosting my postdoc at MIT. And there we showed that you can actually fly even jets with these types of neural networks, with a handful of these neurons, which was a very interesting topic: to see how far you can push out-of-distribution generalization with a very, very small system, a system that is powered by these liquid neural networks as opposed to artificial neural networks.

[08:48] Ramin Hasani: Then, as I told you, each node in a liquid neural network is very complex. It's a differential equation that you have to solve, and you have to solve a set of differential equations; the more neurons you have, the more complex a forward pass and a backward pass through the network is. Therefore, we needed to think about efficiency. These systems are highly non-linear. We tried not to sacrifice non-linearity, because adding non-linearity to learning systems allows you to build more expressive systems. This is a proof that I actually have in my PhD thesis, where I showed that non-linearity directly contributes to that, especially at the smaller sizes of models. But then the question for us is, how can we scale these systems? You have nonlinear systems, but it is extremely difficult to scale nonlinear systems. The reason behind the scalability problem is that the computational complexity of the models hits even cubic complexity. It's not even quadratic; you're talking about cubic complexity, which is too much. And usually, when you have a set of differential equations, you can roll them out with a numerical solver. You run the numerical solver and you can compute, step by step, the outputs that are desired. And the more steps you use in the computation, the more accurate the results you get at the end, right? But then the problem becomes scalability again. You cannot really get to infinite precision with these numerical solvers. One idea that came to us was: what if we just solved the whole system in closed form? Literally, you have a differential equation system with these liquid neural networks, each equation representing the neural dynamics of, let's say, two cells that are exchanging information with each other at a certain level of abstraction, and we said, okay, let's take this system and try to solve it in closed form.
Turns out a closed-form solution for this type of equation hadn't existed since 1907. In 1907, there was a scientist called Louis Lapicque who modeled the membrane potential, how to mathematically model the membrane potential in cells. And that format of equation became a foundation of channel modeling: how information propagates through ion channels inside a cell, and then how neurotransmitters get propagated to the next one. So in 1907, this was Louis Lapicque's membrane potential equation, which is an open differential equation. And then later, scientists called Hodgkin and Huxley started working on really grounding this type of differential equation in biology and adding a little bit more complexity to the model. They developed a neuroscience model of one neuron, how a neuron actually reacts to, let's say, a membrane potential. 1953, they started. 1963, they won a Nobel Prize for the fact that they developed a better and more accurate representation of neuronal dynamics, right? And from there on, in every textbook that you read, these formats of equations do not have a known closed-form solution. And liquid neural networks were also part of that family of equations. Really, for the first time, we actually solved that. And this was 2022. Around 2022, we solved the liquid neural network interaction of neurons with each other in closed form for the first time.

[12:50] Ramin Hasani: This became a Nature Machine Intelligence paper published in November of 2022. This is called closed-form continuous-time systems; this is liquid neural networks in closed form. And the closed form has massive implications. Why? Because now I don't need to use any numerical solver to run a liquid neural network. I can now have not only hundreds of neurons, but billions of neurons next to each other, and I can actually scale these computations, still keeping the non-linearity as part of this. Then, in January, actually the beginning of February of 2023, an article came out in Quanta Magazine, a profile about me and my co-founder Mathias, about all the implications of finally having a closed-form solution for neural dynamics, and how important this can be both for machine learning and for brain science as a whole. And then my inbox was full of VCs. Silicon Valley started to talk; everybody wanted to throw a term sheet at us, to really get started and scale this technology, because this was fundamentally different from a transformer-based, attention-based architecture. This was grounded in biology. Around the same time, alternative models were coming out, like state space models. You've seen faster iterations of state space models, and convolutional neural networks that turned out to be scalable, all of them in a linear form, because for alternative models to be scalable, you have to linearize them, and then you can scale them. And we have something here that is another category: it adds some operators that we learned from biology and from physics as we grew the research. And we got to the point where we can now become very competitive in building models for solving more and more general-purpose tasks.
And when I say more and more general-purpose tasks, I'm talking about modeling signals beyond predictive performance: modeling signals for language, for audio, and for vision in a way that humans understand, right? That's the complex format that humans understand. So the mission of Liquid AI became building efficient general-purpose AI systems at every scale. And we coined the word efficient in our mission, and we care about it so much, because I think, from a foundation model company's perspective, we are probably the most efficient foundation model company on the planet. I'll tell you why in a minute. But the reason we set the mission on efficient machine learning is how many considerations we had to make in order to build a closed-form solution, and then to take this closed-form solution and scale it to, let's say, a system that can model language, a system that can model audio, a system that can model video and vision in general, so that we could, in a computationally tractable way, scale our neural architectures to the point where we are today. I'll talk about the core technology of the company later on, but today the technology became liquid foundation models. Liquid foundation models, or LFMs, are pretty popular right now. We're ranked #5 in the US by number of downloads, from the popularity perspective. We have over 1,000,000 downloads per week on Hugging Face. That's a lot of downloads of these small models that we're building. They're very popular. The top ones from the US side are Google, Meta, Microsoft, and Nvidia, and the fifth one is us, from the US side. So, I mean, we got ourselves there by consuming about 1,000 GPUs, having 1,000 GPUs in-house.
So, from a foundation model perspective, getting to that level of popularity and releasing more than 50 models, instantiations of models that people are using in enterprises, that's kind of the place where we brought Liquid, building on top of inspirations that we got from nature and the path that I've been portraying for you.

[18:51] Nathan Labenz: Brilliant. So many different directions I want to go and follow up there. Maybe for starters, one thing that jumps out at me on a very fundamental level is that when you describe the liquid neural networks, you describe them in terms of the number of neurons, whereas we're used to hearing about the number of parameters, and something tells me there's a kind of paradigmatic difference there that underlies the difference in description. So yeah, help me develop my intuition a little bit more. I guess it probably connects to non-linearity. And this is so fundamental, right? Because there's so much incredible progress that has been made, but on things like really robust out-of-domain generalization and adversarial robustness, we still have a lot of work to do in the mainline paradigm. So yeah, unpack that a little bit: why are we counting neurons versus parameters, and what does that tell us?

[19:49] Ramin Hasani: That's a great question. So early on, we actually started talking about the number of neurons as the unit of computation in our research, because that was very interesting. Every neuron is a process that can be modeled by a mathematical equation. Now, the more sigmoidal gates or functions you have, the more parameters get added to that single cell. For a liquid cell, in terms of parameters, you can say multiply by seven, and that will often be the number of parameters in that differential equation. But that cell itself, because of the equations that it has, computes more than just a forward pass. It has internal feedback. There are three degrees of feedback mechanisms inside. All those parameters are one-to-one analogous to the parameters of any artificial neural network system that you would build. So as a rule of thumb, when I say number of neurons, you multiply it by seven and you get the number of parameters of the system. But this is for the early versions of liquid neural networks. As we started standardizing parallelization schemes for the types of mathematical functions that we want to explore, and let's say the inspirations that we're getting, we just converged to the number of parameters. Of course, today when we are talking about a liquid foundation model that has, let's say, 1 billion parameters, we literally mean 1 billion parameters in the same sense as any other GPT that is getting built out there. Does that make sense?

[21:26] Nathan Labenz: Yeah, but let's zoom in maybe a little bit more on that neuron. And I guess I'm wondering too, what is the fundamental limit, or what are the first bottlenecks that you hit, when you try to scale the original liquid approach? The sigmoid-type functions that we have today are, I think, carefully selected to be easily run on GPUs. I'm guessing that these multiple degrees of freedom inside a single neuron maybe present challenges in terms of executing that process on available hardware. We can get these amazing results from tens of neurons. So the obvious bitter-pilled question would be: what if we take that exact paradigm and go to millions, billions? But you must be hitting some bottlenecks along the way. What are they?

[22:23] Ramin Hasani: The main challenge is turning sequential computation into parallel computation. When you go from a single neuron's dynamics to multiple neurons' dynamics, the weight parameters of your system, instead of being scalars or vectors, become matrices and tensors. So now you're talking about matrix multiplication. You want to turn the scalar computations into tensor computations, right? This is how you parallelize sequential computations. The problem is that if you have non-linear relationships in an equation, you cannot trivially create a one-to-one map between a vectorized computation and a tensorized computation. So those non-linear attributes of liquid neural networks, and of any other non-linear recurrent neural network... I mean, recurrence itself adds some degree of complexity to the math of the whole equation, let alone if the recurrence has a non-linearity on it. It becomes a lot more complex to disentangle vectors, or to write them as tensors with the typical linear algebra methods that we know. And you have to be able to turn these computations into tensors to be able to compute in parallel, instead of computing them one at a time. You see, so that's kind of the mathematical issue, and it's related to any mathematical operation that you want to tensorize. If it is non-linear, you're going to have the troubles that we're talking about. That's why the state-space models that came out are all built around linear dynamics. We're talking about linear dynamical systems, right? And the reason those linear dynamical systems are important is the pure fact that we do not have proper ways to scale non-linear systems. You can apply a non-linearity to a linear dynamical system. For example, you can add a gate, or let's say a sigmoidal function, after you compute the dynamical system in a linear way.
You compute all the matrices and everything in parallel, and then you can do a point-wise application of a non-linear operation on top of the entire tensor. That's what you can do. But what if the relationships between the parameters themselves are governed by some non-linearity? Then you cannot really use typical linear algebra. You always have to approximate that non-linear system with a linear system, and then you would be able to parallelize it, right? Does that make sense? So that's the fundamental bottleneck.
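To make the bottleneck described above concrete, here is a minimal Python sketch, not Liquid AI's code. It assumes a scalar recurrence h_t = a_t * h_{t-1} + b_t, and the function names (`sequential_linear`, `combine`, `scan_linear`, `sequential_nonlinear`) are invented for illustration. The linear case can be rewritten as an associative combination of affine maps, which is what a parallel scan exploits. The tanh version has no such compact combine rule, so it is shown only as a step-by-step loop.

```python
import math

def sequential_linear(a, b, h0=0.0):
    """h_t = a_t * h_{t-1} + b_t, computed one step at a time."""
    h, out = h0, []
    for a_t, b_t in zip(a, b):
        h = a_t * h + b_t
        out.append(h)
    return out

def combine(left, right):
    """Compose two affine maps h -> a*h + b, with `left` applied first.
    The result is again an affine map, and the operation is associative."""
    a1, b1 = left
    a2, b2 = right
    return (a2 * a1, a2 * b1 + b2)

def scan_linear(a, b, h0=0.0):
    """Same recurrence via an inclusive scan over `combine`.
    Each of the roughly log2(T) rounds updates every position independently,
    so on parallel hardware a round could run as a single step."""
    elems = list(zip(a, b))
    n = len(elems)
    shift = 1
    while shift < n:
        elems = [
            elems[i] if i < shift else combine(elems[i - shift], elems[i])
            for i in range(n)
        ]
        shift *= 2
    # elems[t] now holds the affine map that takes h0 directly to h_t.
    return [a_t * h0 + b_t for a_t, b_t in elems]

def sequential_nonlinear(a, b, h0=0.0):
    """h_t = tanh(a_t * h_{t-1} + b_t). Composing two such steps does not
    collapse into another step of the same form, so no `combine` is given."""
    h, out = h0, []
    for a_t, b_t in zip(a, b):
        h = math.tanh(a_t * h + b_t)
        out.append(h)
    return out
```

Calling `scan_linear(a, b)` and `sequential_linear(a, b)` on the same inputs should agree up to floating-point rounding. Applying a point-wise non-linearity to the scan's output afterwards, as described above, keeps the parallel structure intact.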

[25:09] Nathan Labenz: So how much does the closed-form solution address that? And with available computing resources, how far has the original liquid network paradigm been able to scale to date?

[25:30] Ramin Hasani: Fantastic question. One of the properties of liquid neural networks is the fact that they have multiple feedback mechanisms. It's not just one form of feedback. They have three layers of feedback between two cells, in the synaptic dynamics itself, because we wanted to mimic how it is done in brains, right? So there are multiple degrees of feedback, and then basically non-linearities nested on top of each other. That nested non-linearity also adds degrees of complexity. With the closed-form version of liquid neural networks, from a dynamics point of view, instead of 100 neurons you can now have hundreds of thousands of neurons, maybe 1 million to 10 million neurons, but because of the non-linearity you still have to compute the models sequentially. Because of these nested non-linear relationships that happen in the circuit, even the closed-form solution is limited in scalability, in the sense that you cannot parallelize it very effectively. But people are also working on these topics a lot: not just parallelizing, but speeding up sequential computations. So we're talking about sequential scan. Scan is an operator that you can run so that, instead of doing, let's say, matrix multiplication in cubic or quadratic time, you can do it in sub-quadratic time, right? So you can reduce the complexity of computations even on non-linear operators without parallelizing them, without linearizing them. There is ongoing research on this, and we are doing some of that research ourselves, to find ways to run those non-linear systems sequentially, because clearly there is an advantage to having those forms of non-linearity when you do function approximation. Scaling laws also become very important here. I believe scaling laws define architecture. How does that work?
When you're talking about transformers being this revolutionary thing, we're talking about maximum scale. The reason the transformer architecture and the attention mechanism are so brilliant is the fact that the architecture is unstructured. There's no...

[28:05] Ramin Hasani: When I'm talking about nested non-linearities and all this crap that we have in liquid neural networks, you don't have that in transformers, right? You basically have just matrix multiplication as the core functional thing, and it is unstructured. You can literally multiply any matrix into any matrix of any size. So the whole idea here is this: the larger you make neural networks, toward infinite size, the less structured you want them to become. And we've seen the success of transformers at trillions of parameters. Now we are talking about tens of trillions of parameters. I mean, for the next generation of models that are going to come, we're talking about trillions. And you can do that with a transformer architecture. As soon as you start adding a little bit of bias in that architecture at scale, things become completely messed up. So right now, liquid neural networks, these alternative architectures that we are talking about, have a scale to hit. There is a regime of parameters where you can just do better. Let's say up to 100 billion parameters, up to a trillion parameters. That's kind of the range we are operating in right now. In this range, the smaller the model architecture is, the more you want to specialize it for a certain application, and these architectures are mathematically biased to solve a certain type of task better than other types of architectures. So, I would say, biases on algorithms, the more biases you put on... and by biases, what I mean is this:
adding a lot more non-linearity, adding multiple gating levels on top of a neural network, adding recurrence, multiple different types of recurrence, adding, let's say, convolutions. I mean, convolution itself, if you just keep plain convolutional neural networks, is also pretty unstructured, because you can apply convolutions on any size of neural network. So the question becomes how much bias, and what kind of problems you want to solve, so that the bias in the neural architecture becomes a function of, I would say, the scale of the neural network as well as the use cases you most want to solve. You see, the spectrum would be: the larger the size you go to, the more unstructured the mathematical operators. Use pure matrix multiplication, use pure convolutions, use pure, let's say, sequential scan. Pure operators, you see. And then forget about gates, forget about adding fancy control, forget gates and all those things. At scale, this is something that we have seen, because we have scaled neural networks and we have a really good understanding of that architectural variation at different scales.

[30:40] Nathan Labenz: So just for calibration, when you talk about the liquid neural networks that have gone up to hundreds of thousands or pushing a million neurons, what kind of hardware does that run on? Is that like a CPU supercomputer?

[30:53] Ramin Hasani: You can do that on a CPU, even on simple CPUs or simple GPUs, because one of these liquid neural networks would probably fit in, I don't know, 1 to 25 megabytes of file size. You can literally put them there, and a CPU would be enough, a Raspberry Pi would be enough, to perform the computations. We have shown that on a lot of predictive, specialized applications of AI, these systems can be very, very powerful, and these closed-form variants can be as powerful as the ordinary differential equation version of the neural networks. Speech synthesis is one of the applications. Also, as I said, predictive sequence modeling: imagine you have complex, multivariate sequences coming in and you want to perform some sort of prediction on top of them. These neural networks are actually pretty good at that. And then, if you're talking about out-of-distribution generalization, to the extent that you are within a bounded period, you kind of have something like open-ended continual learning. But liquid neural networks are not continual learning systems. They are more adaptive forms of computation, because of the many gatings, many feedbacks, and many, let's say, input-dependent parameters that they have in their mathematical equations.

[32:15] Nathan Labenz: What is the definition of continual learning that you're using there that they don't satisfy?

[32:20] Ramin Hasani: So continual learning would be a system that is continuously receiving new data and re-tunes itself. So it also changes the parameters of the system. With liquid neural networks, the dynamics of the system are input-dependent. That means that when new data comes in, the dynamics of the system respond to those inputs, but the parameters of the system, like in any other neural network, are fixed. It's just that because every neuron or node in a liquid neural network is a differential equation, you get different dynamics. So dynamics is another dimension that you add to the number of parameters. It allows you to compress more information, more knowledge. That's why it works with smaller instances of the models. There's no free lunch and there is no magic here with liquid neural networks. It's just that the axis of dynamics is something that we added to the neural networks. For what? For being more adaptable. So I'll give you a very tangible example. Imagine you're driving and all of a sudden it starts raining. Depending on the type of rain, the autonomous driving system that is controlling the car has to react to those driving scenarios. If you have a traditional neural network that hasn't seen that kind of environment, it might get biased, because the rain that hits the camera is noise, a certain type of noise, or a certain type of perturbation of the incoming input. But the reality hasn't changed. The confounding variables of the whole environment haven't changed, right? So liquid neural networks, because they react to the input completely differently, they absorb the input and apply low-pass filtering on top of it, are much more adaptable to those kinds of scenarios.
That doesn't mean that they change the parameters of the system to be more adaptable, because there are two axes here. One is changing the parameters of the system; the other is changing the dynamics of the system. I would attribute continual learning to the case where you also continuously change the parameters of the system. In liquid neural networks, we don't do that.
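The two axes described above, fixed parameters versus input-dependent dynamics, can be illustrated with a single-neuron sketch. It assumes the liquid time-constant form dx/dt = -(1/tau + f(x, u)) * x + f(x, u) * A with a sigmoidal f, integrated with a plain Euler step. The parameter names and values in `PARAMS` are made up for illustration and are not taken from the conversation or from any trained model.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Fixed "trained" parameters (hypothetical values). Nothing below modifies them.
PARAMS = {"tau": 1.0, "A": 1.0, "w_in": 2.0, "w_rec": 0.5, "bias": -1.0}

def gate(x, u, p=PARAMS):
    """Input- and state-dependent non-linearity f(x, u)."""
    return sigmoid(p["w_in"] * u + p["w_rec"] * x + p["bias"])

def effective_time_constant(x, u, p=PARAMS):
    """tau_sys = tau / (1 + tau * f(x, u)): the neuron's response speed
    depends on the current input even though the parameters are fixed."""
    return p["tau"] / (1.0 + p["tau"] * gate(x, u, p))

def ltc_step(x, u, dt=0.01, p=PARAMS):
    """One explicit Euler step of dx/dt = -(1/tau + f) * x + f * A."""
    f = gate(x, u, p)
    dx = -(1.0 / p["tau"] + f) * x + f * p["A"]
    return x + dt * dx

def simulate(inputs, x0=0.0, dt=0.01):
    """Roll the neuron forward over a sequence of scalar inputs."""
    x, trace = x0, []
    for u in inputs:
        x = ltc_step(x, u, dt)
        trace.append(x)
    return trace
```

For example, `effective_time_constant(0.0, 3.0)` comes out smaller than `effective_time_constant(0.0, -3.0)`: a strong input makes the neuron respond faster, while `PARAMS` is the same dictionary before and after. In this sketch that is adaptation through dynamics, not continual learning.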

[34:27] Nathan Labenz: Gotcha. Okay, helpful. So let's fast forward to the present. You guys are now in the market working with customers. One notable conversation I heard with one of your customers was with the CTO of Shopify on the Latent Space podcast, who had some very good things to say about you and your technology. I was struck by the fact that you've gone neutral in a sense. What jumped out at me in terms of your approach is that you've developed an architecture search process where the promise to customers is not "hey, we developed this one paradigm." The old calculus teacher used to say, when all you have is a hammer, everything looks like a nail. So you're explicitly promising people that, even though we came from the lineage of this particular kind of network, we're not just going to blindly apply it to your problem. Instead, we've created this higher abstraction, or more meta process, for searching through architecture space to find the thing that's going to work best for you. And the two notable details of that are: one, you found that proxy metrics don't work that well, just measuring perplexity or whatever, and that you needed to go farther and test models on the actual downstream tasks they're going to be asked to perform. And second, hardware in the loop: actual target hardware in the loop, to test the architecture subject to the very real and physical constraints of the robot, the sensor, the phone, whatever it's going to be running on. I understand that the LFM model came out of that, but maybe before we even get to the LFM, could you sketch out the range of different types of problems that we're putting into this architecture search? And then, as that implies, of course, what kind of hardware are we targeting? And then what kinds of architectures are winning for different kinds of problems under different kinds of constraints?

[36:40] Ramin Hasani: Absolutely. So there's a system that we developed in-house. We call it automated foundation model design, AFMD. It's a meta-learning system that puts the hardware in the loop and then tries out many different operators with an evolution strategy. The evolution strategy is optimizing for a couple of things: memory consumption on that device, latency, and speed, with no sacrifice on quality. When we talk about quality, perplexity is not the measure. It's the downstream applications that we care about. It's also not just public benchmarks. We're talking about 100 different benchmarks. So from a meta-learning perspective, the problem space becomes very, very complex. Now, why did we take this approach to designing an architecture? Because we wanted to remove all the human biases early on as we build architectures. One of the things that we realized culturally at companies, and this is what I can tell you, even at the largest foundation model labs in the US right now, Anthropic and OpenAI, is that there are a bunch of people coming from the science side. I call them the Avengers of the architectures, or the Avengers of post-training, or let's say pre-training. There is usually a very small set of people calling the shots: "Oh, you know what? You're going to tweak this portion of this architecture so that it performs better." Why? "Because in my personal experience, it started working better." If you really think about it, this is something that is broken in all the foundation model labs. You cannot say that somebody has a fix for this, but now the recursive self-improvement process is fixing it, because people are finally realizing that you've got to give it to the algorithms.
It's the bitter lesson. You've got to hand it to a systematic way of finding out what the true architecture is for the problems that you want to solve. You can build a general-purpose computer. The insights that I shared with you in the form of scaling laws of neural networks come out of our massive exploration of the space of architectures. You know, the fact that in the smaller category of models, some biases in the architecture help. In the larger instances of the models, you don't need to bias the systems. You can go with pure convolutions, pure transformers, and pure, let's say, recurrences that are very simplified. You don't need to add any form of specialized treatment like gating there: gated delta nets, and then there are the Mambas and the Jambas, there are a lot of different variations of these architectures coming out. What you want to do, though, is be completely unbiased. Even we ourselves, on day one when we started Liquid, had one of the inventors of SSMs as our founding scientist, Jimmy Smith. He invented S5; he basically brought the parallel scan idea to the SSM world. Then we had Stefano Massaroli and Michael Poli, who designed the Hyena Hierarchy, which is a convolutional path. All of these are linear systems, within the realm of linear systems. And then we came out of this non-linear, hyper control theory kind of enabled foundation models, which is liquid neural networks. A lot of big heads in the room, and somebody has to call the shots. Again, the same problem as in all the labs, you know, you've got to figure it out. We thought, okay, let's fundamentally solve this problem. Let's really do it systematically, from first principles.
Let's put in all the operators of interest, whatever we think could be an operator that can give rise to a general-purpose computer. Let's say a liquid neural network on its own is a general-purpose computer. You can actually scale it; theoretically, you can get there. So you put all those equations, all those things, in a unified theory, with linear input-varying operators. The whole space of operators can be encapsulated inside these linear input-varying, or let's say in general, liquid input-varying operators, because input dependence is something that we have been talking about for the last 10 years.
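The search loop described above (try many operators, score each candidate on memory, latency, and quality, keep the best, mutate, repeat) can be sketched as a toy evolutionary search. This is not the AFMD system. The operator table, its cost and quality numbers, the budgets, and the penalty weight are all invented, and the `measure` function is a lookup that stands in for what the conversation describes as real measurements on target hardware and real downstream benchmarks.

```python
import random

# Hypothetical per-layer numbers: (latency_ms, memory_mb, quality score).
# A real system would measure these on the target device and on real tasks.
OPERATORS = {
    "attention":  (4.0, 8.0, 1.00),
    "gated_conv": (1.0, 1.0, 0.80),
    "recurrence": (1.5, 1.5, 0.85),
}
NUM_LAYERS = 8
LATENCY_BUDGET_MS = 16.0
MEMORY_BUDGET_MB = 24.0

def measure(genome):
    """Stand-in for hardware-in-the-loop profiling plus downstream evaluation."""
    latency = sum(OPERATORS[op][0] for op in genome)
    memory = sum(OPERATORS[op][1] for op in genome)
    quality = sum(OPERATORS[op][2] for op in genome) / len(genome)
    return latency, memory, quality

def fitness(genome):
    """Quality minus a penalty for exceeding the device budgets."""
    latency, memory, quality = measure(genome)
    over = max(0.0, latency - LATENCY_BUDGET_MS) + max(0.0, memory - MEMORY_BUDGET_MB)
    return quality - 0.1 * over

def mutate(genome, rng):
    """Swap the operator in one randomly chosen layer."""
    child = list(genome)
    child[rng.randrange(len(child))] = rng.choice(list(OPERATORS))
    return child

def evolve(generations=50, population=20, seed=0):
    """Keep the top quarter each generation and refill with mutated copies."""
    rng = random.Random(seed)
    names = list(OPERATORS)
    pop = [[rng.choice(names) for _ in range(NUM_LAYERS)] for _ in range(population)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: population // 4]
        children = [mutate(rng.choice(parents), rng) for _ in range(population - len(parents))]
        pop = parents + children
    return max(pop, key=fitness)
```

Calling `evolve()` returns a list of eight operator names. With these made-up numbers the budget penalty dominates, so the search tends to settle on hybrids that use the expensive operator sparingly. That outcome is a property of the toy cost table, not evidence about real hardware.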

[41:31] Ramin Hasani: That's extremely important. And we see right now that it's also extremely important to have in transformers. That input dependence comes naturally with the attention architecture as well, but in a very, very unstructured way. It's not as structured as what is done in RNNs, in liquid neural networks, in SSMs. In all of those, the gated variables are very structured, and you don't see that in transformers. Now, we have various variants of convolution operators, various variants of recurrent operators, and then the attention operators themselves: there's grouped-query attention, there was the original transformer, there are many different things. Then, on top of all these variations of dynamical systems, and also the various types of convolution that I mentioned, what we brought were the liquid computational blocks, for example the double-gated convolutions, so gating, a certain type of bias, applied onto different dynamical systems or different operators. We added them, and the space becomes something like 50 to 100 different operators. And then you want to build hybrid models so that you can reduce... What's the goal here? The goal is maximizing efficiency of computation without loss of accuracy. That's the goal that we set. The objective function of the search is also this. So let's run it. Let's have all of our compute thrown at this problem and see how the system is going to design this. And we started doing scaling laws on this. We went from neural networks as small as, let's say, 10 million parameters to running the scaling laws up to 72 billion parameter models in these hybrid structures.
So the early days of Liquid, 2023 and 2024, were always about proving out, on a certain type of processor, what the most efficient type of architecture is that can come out. And it turned out that when we put all of our biases away, all the gating stuff that we were putting around operators got simplified. In Mamba-style architectures, you have a bunch of gating mechanisms. In gated delta nets, you have a bunch of operators in there. In linear attention, there are gated variants. Everybody is tweaking a certain parameter in the network by hand, because they observed something in their own experiments. It turns out all of this has to go away if you want to get to the most efficient form of architecture. And it was the double-gated convolution that came out of this massive search space, out of AFMD, this original system that we designed. This was one of the candidate architectures that came out, and it turns out to be very good on the general-purpose computers that I call CPUs. CPUs have a special kind of structure, right? On all the CPUs coming out of AMD, Qualcomm, let's say Intel, everyone that is building a processor, Arm, all the Arm processors, we tried to test the different operators and find the generic structure, the computational graph, that gets you to the most simplified design, with no hand-tuned features added anywhere. It's just literally coming out of the systematic tests that we've done. And this became the de facto architecture, the LFM2 structure that we announced. But while exploring these things, we observed many different candidates also popping up. There are many of them.
And then we tried to figure out, for example, for an NPU, a neural processing unit that sits inside an AI PC powered by AMD or Qualcomm or Intel, which of these variants would be a better neural architecture, one that gives those hardware providers and silicon providers a boost in the efficiencies that they are unlocking, or let's say the speed of computation, the latency, and the memory footprint that they get, while having a computational graph that doesn't sacrifice any form of quality. So this was the whole thing: removing all the human bias with a systematic approach to architectures, even our own biases. And the only thing that remained in LFM2 for CPU computation is this double gating. As I told you, we had nested computation in liquid neural networks originally. This nested form of computation turned out to be a very interesting part of these very simplified neural networks that we built. And we have unstructured 1D convolutions as the layers of choice: 70 to 80% of our networks are made up of these double-gated convolutions. They're extremely simplified and they replace attention. They reduce the computational complexity by a lot. They reduce the memory footprint by a lot. They maximize the speed of computation by a lot, and at scale as well. And at the same time, on quality, I mean, you have seen that some of these models are extremely competitive with the transformer-based alternatives. I think you asked a bunch of other follow-up questions as well, but I'll pause here for any follow-ups.
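As a rough illustration of what a double-gated 1D convolution block could look like, here is a pure-Python sketch. The conversation only names the operator, so the specific layout is an assumption: an input projection that yields two gates and a value, a short causal depthwise convolution over the gated value, a second gate on the way out, and an output projection. The function and argument names (`double_gated_conv`, `W_in`, `kernel`, `W_out`) are hypothetical.

```python
def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def double_gated_conv(xs, W_in, kernel, W_out):
    """Assumed layout, for illustration only.

    xs:     list of T input vectors, each of size D
    W_in:   (3*D) x D projection producing gate B, gate C, and value V
    kernel: D x K depthwise filter taps, applied causally (past steps only)
    W_out:  D x D output projection
    """
    D = len(xs[0])
    K = len(kernel[0])
    gated_values, out_gates = [], []
    for x in xs:
        proj = matvec(W_in, x)
        B, C, V = proj[:D], proj[D:2 * D], proj[2 * D:]
        gated_values.append([b * v for b, v in zip(B, V)])  # first gate
        out_gates.append(C)
    ys = []
    for t in range(len(xs)):
        # Short causal convolution: step t only sees steps t, t-1, ..., t-K+1.
        conv = [
            sum(kernel[d][k] * gated_values[t - k][d] for k in range(K) if t - k >= 0)
            for d in range(D)
        ]
        gated_out = [c * z for c, z in zip(out_gates[t], conv)]  # second gate
        ys.append(matvec(W_out, gated_out))
    return ys
```

Both gates are computed from the current input, so the block is input-dependent, while the only mixing across time is a fixed short filter. Cost grows linearly with sequence length here, in contrast to the quadratic pairwise comparison in standard attention.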

[46:22] Nathan Labenz: Yeah, let's get into use cases in a minute. I'm definitely interested in that. And I also want to talk about the future of hardware and what your work implies for it. But I think it would probably be helpful for a lot of people, including myself... Although this is something I've studied in some depth, I would still like to grok it better than I do: this concept of gating. I'll give you my rough-and-ready understanding and then you can improve it and deepen it. I especially noticed this with Mamba, when I went down that rabbit hole and became very excited about it. It seems that the trick that is played over and over again with these gating mechanisms is that we want the transformation that is done on the data to be input-dependent. So it's not enough to learn a transformation. We want to learn a transformation, but then also have some gate, and typically the gate is relatively low-dimensional. Sometimes it could just be a scalar that's applied to that transformation. It could obviously be more complicated than that. But it's some relatively simple mechanism that says, for this learned transformation, here's how we're going to modify it given the input currently under consideration. And that seems to be great. It seems to unlock a tremendous amount. Help me understand anything you think I'm missing there, or help me deepen my intuition for why that is such a powerful and recurring theme in all these different architectures.

[47:55] Ramin Hasani: That's a liquid structure. That's what I mean by this input-dependent thing that came into Mamba. So before Mamba came out, actually not right before, but a year and a half before, we released a paper called Liquid S4. If you just read the abstract of that paper, we were, for the first time, introducing this idea of input-dependent SSMs. The idea there was: let's take the fundamental building block that we found has a lot to do with representation learning, with capacity of learning, this input-dependence element, and bring that form of gating to neural architectures, to SSMs, and then to convolutions as well, with the other systems that we are designing today, the liquid foundation models. That form of gating adds a lot. You should just think about it; naturally it makes a lot of sense. It's not only about the forward pass of the neural network being adaptive, because when you're input-dependent, from a learning theory perspective, all the magic happens in the backward pass. If you think about computing the gradients backwards, that input-dependent operator is going to show up there itself. So when you're learning from the data that you're seeing, you're also learning some sort of dynamics, input-dependent dynamics. So that second axis that I was talking about, it's not just the number of parameters of the neural network, it's also the dynamics that the neural network learns. You have some representation of those dynamics being learned in the backward pass. Now the complexity of that gate is going to make a huge difference.
And it also turns out that the gate is extremely important in language modeling, especially when you have sequence models like RNNs: classical control models, let's say continuous-time RNNs; discretized RNNs like LSTMs; and then liquid neural networks, and even the not very non-linear versions, the linear versions, as in SSMs. When you add this type of gating, you can improve the language capabilities as well. It seems that for discrete sequences, this gating also helps a lot. It's not just for continuous-time sequences but also discrete sequences, and I would consider language to be a discrete sequence of information, right? So you would have that sequential dynamics that is added by this gating. So it is a very right frame of mind for thinking about that feature. And for us, we think that this input dependence is one of the fundamental discoveries that we have made, one that we also learned from biology. It happens there, and it comes out of physics. If you just put the math together, you're going to realize, oh, what is special about this dynamical system is the fact that it has this non-linear input dependence. That's the structure. And by biases, I mean the complexity of that gating. Even the existence of that gating is a bias that you're adding to your system. Now, is it needed or not at scale? That's a big question mark. You're going to see, with neural networks at 100 trillion parameters, whether matrix multiplication is enough to really perform general-purpose computation and get to the AGIs of the world or not.

[51:09] Ramin Hasani: That's a big question mark. You also need to put this discussion about architecture, and the obsession with architecture, in perspective. Another perspective you have to put this in is the learning theories themselves. Right now, we're talking about different learning schemes that are coming. For example, you can train a model with next-token prediction. You can train it in a world-modeling context. You can train it sequentially, over a long-term horizon and long-term history. There are so many ways you can construct the objective function of the learning algorithm itself, and that also contributes a lot to the learning process. As for architecture, I'll tell you that we found the purpose of architecture, and the most important application of architectural research, has been in efficiency, right? Really getting to efficient forms of computation without loss of quality. That's an extremely important thing, because we're talking about a resource allocation problem right now. As the models become bigger and the demand for AI goes up exponentially, you want more efficient versions of the systems in order to run them at scale. Otherwise, how can we provide access to AI for all? And we are seeing early instances of this, where the larger labs are wiping the compute off the planet. Why? Because they have to train these models and then host them for all of us, right? So efficiency becomes a fundamental property of the model architectures that we are doing research on.
I would say, if you want to get to architectures that enable the next generation of neural networks, sorry, the next generation of intelligent systems, in a way that they can become like the human brain, which performs computation with 20 watts of power and is already an AGI, that's a massive exploration that goes beyond architecture. It goes into memory research, it goes into learning algorithms, it goes into data, it goes into prior research, looking at it from a statistical learning perspective. That becomes kind of a... And then learning theory itself, the limitations of learning theory itself, also get imposed on today's learning systems, because the definition of learning theory itself is actually broken at scale. With what we have right now, everything is IID. We are building averaging machines. You've seen the quality of writing and the sycophancy that comes out of AI systems because of the fact that we haven't gotten... Maybe to some extent multi-agent systems have been becoming the solution, and test-time scaling became kind of a solution for the caveats that you're seeing from just learning systems with autoregressive modeling as the pre-training method. But I think there will be innovations needed, not just on architecture, but on the whole thing as a whole. You have data, you have models, and you have learning algorithms. All together, they can design that ultimate holy grail, basically, in this space.

[54:23] Nathan Labenz: Yeah, that's very interesting. So basically your perspective is that as we get into the recursive self-improvement era, the real advances will come more from new learning paradigms and new objectives, and then those advances will be made efficient through architectural optimization. But the architecture comes after the paradigm question of what exactly we are learning and how we are giving a signal?

[54:55] Ramin Hasani: It is one component of it. That's what I would say. Looking at it as a data problem, as data representation, one axis would be architecture and the other axis would be the learning algorithms themselves, right? And then we get into that recursive self-improvement, which is the fancy new name for the continual learning research that has been happening for probably more than four decades now. That's how I would characterize the whole space. So you cannot just look at architecture in isolation as the fundamental thing that changes everything, right?

[55:35] Nathan Labenz: Yeah. You remind me a little bit of Ali Behrouz's "the illusion of architectures." If we have time, maybe we can touch on nested learning. Okay, well, let's stay focused on your work for the moment at least. So LFM2 is the result of this architecture search. And the surprisingly simple thing that comes back is some reduced but still critical number of attention layers. And then the other layers, and we've seen this with SSM-attention hybrids as well, but I think the surprising revelation from this search process is that the non-attention layers can actually be extremely simple as long as they have some gating. So you've got the gate, and then you've got just a really simple convolution that, I think I read, only considers a very short span of tokens, right? Is it just the short convolution that comes back? Yeah. So if we were going to update the headline "Attention Is All You Need" from however many years ago now, we would maybe say attention is something that you really do still need, at least at a certain scale, but you also need gating on your other layers. But you don't actually need anything super crazy, fancy, or sophisticated, the state space model and all that kind of stuff. It turns out that keeping the gate is the part that really drives the most value, and then you can have a really simple mechanism behind the gate. And then you can have 70% of that and 30% attention, and subject obviously to resource constraints, that ended up being the winning formula. Am I getting anything wrong there?
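
The recipe summarized here (a gate wrapped around a very short causal convolution, interleaved with a minority of attention layers) can be sketched roughly as follows. This is an illustration under assumptions, not the LFM2 implementation: the scalar gates are passed in directly rather than computed from the input by learned projections, and `layer_pattern` simply spreads attention layers evenly using the 70/30 split mentioned in the conversation.

```python
def gated_short_conv(xs, gate_in, gate_out, kernel):
    """Gate the input, apply a short causal convolution, gate the output.

    y_t = gate_out[t] * sum_k kernel[k] * (gate_in[t-k] * xs[t-k])
    Positions before the start of the sequence count as zero.
    """
    gated = [b * x for b, x in zip(gate_in, xs)]
    out = []
    for t in range(len(gated)):
        acc = 0.0
        for k, w in enumerate(kernel):
            if t - k >= 0:  # causal: only current and past tokens
                acc += w * gated[t - k]
        out.append(gate_out[t] * acc)
    return out


def layer_pattern(n_layers, attn_fraction=0.3):
    """Hypothetical layout: spread attention layers evenly through the stack."""
    n_attn = round(n_layers * attn_fraction)
    attn_slots = {round((i + 1) * n_layers / n_attn) - 1 for i in range(n_attn)}
    return ["attention" if i in attn_slots else "gated_conv"
            for i in range(n_layers)]


tokens = [1.0, 2.0, 3.0, 4.0]
open_gates = [1.0] * len(tokens)
mixed = gated_short_conv(tokens, open_gates, open_gates, kernel=[0.5, 0.5])
stack = layer_pattern(10)
```

Each output position only looks back `len(kernel)` tokens, so the cost per token stays constant no matter how long the sequence is; the attention layers in the stack are what provide access to the full context.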

[57:26] Ramin Hasani: No, I mean, you're touching on the right things. Basically, it's about the regime you're operating in and the goal of your system. Are you trying to build the most powerful version of the AI system? Then you need the most unbiased version of an algorithm. Now, attention is an extremely rich, unbiased format of algorithm. Even if the computational complexity of attention is N to the power two, maybe we really need N to the power two to get to that kind of level, you know? And maybe we need more complex architectures. We've always tried to reduce the complexity of architectures for the sheer reason that we are resource constrained. As humanity as a whole, we are resource constrained right now. So I would say the discoveries that we have right now just show that there is a gradient on architecture that you can follow as you scale models. The gradient is that for smaller models and specialized models, you can put in as many biases as you like, like these gating mechanisms that you're bringing in. And you can play around with as many operators of interest in your computational graph. It is going to work, and it is going to give you some sort of a boost. If you're really maximizing for, let's say, linear time complexity, you want to implement linear attention systems, just the fastest kind, if speed is so important that you're willing to sacrifice a little bit of quality, you can bring in linearity and the whole system could be linear, you know? You don't even need some of those hybrids. But hybrids just boost that accuracy because, as I said, that O(N²) computational complexity, at a certain level, at a certain scale, is needed for us to really get to the performance that you want.
The larger the network becomes, the more unstructured you can make it. That's the learning from that whole algorithmic approach that we took to designing architectures.

[59:28] Nathan Labenz: Yeah, very interesting. Okay, let's look at the other end of the spectrum then. As you work with customers, what are some interesting examples where, given resource constraints and the narrowness of the domain of interest, other kinds of bias are actually winning in the architecture search process?

[59:50] Ramin Hasani: Great question. So, for example, if you go to biology and you want to model sequential data, you're talking about DNA data. DNA data, from a vocabulary perspective, is very limited, right? It's not like language, which has a maximal kind of vocabulary. In DNA the language is very simplified, but the lengths of the sequences you have to process are, let's say for a human being, or maybe for a bacterium, I heard, somewhere between 1 and 100 billion elements in a sequence. For that kind of long context, when you do not have that large a vocabulary, you don't need attention. So on this kind of data you can run pure convolutions, pure SSMs, pure liquid neural networks, recurrent neural networks and parallelized versions of these recurrences, and linear attention. For extremely long context, you cannot do it any other way. The reason is that the context has become so large that the quadratic cost of attention just kicks in. So on biological data, you would want to have some sort of structure there. Then there are places like video modeling, where you might want to have different architectures and even different learning algorithms. You have seen the success of diffusion there, for example, right? Diffusion is also, I would say, a format of prior that you put on a certain architecture, but there's still debate over whether diffusion is a learning algorithm or actually part of the architecture. You can make that connection and you can make that distinction as well.
So on video and scene understanding, elements of diffusion would probably be needed. Some people still believe that with autoregressive modeling you could get there anyway, but we'll see if that stays true. Then when you're talking about audio signals, if you're talking about audio alone and language is not part of it, just pure voice to voice, let's say noisy to clear signal, signal to signal, these are places where recurrent neural networks are still very, very powerful. They're extremely powerful, especially in the low-data regime. I would say models that have a lot more biases in them, in this smaller regime when you do not have that much data, this is the place where you can bring a lot of value. Recurrent neural networks and biases in architecture can help you in the low-data regime to really fill that gap with the feedback mechanisms that they have. The more complex the architecture, the more you would be able to handle, and the closer the architecture is to the dynamics of the data sets that you're trying to model, the better the learning system you're building. In fact, in physics modeling there was a class of models called physics-informed neural networks. These are, again, another architecture that we're talking about here. They would be really good for physical simulations. So this is the other side of the spectrum that I'm talking about: the properties of the data, the context lengths, and all the other considerations that I mentioned would change the architecture by a lot. And then again, if you want to scale this to the largest regime, make it unbiased. Again, a transformer at scale would be able to beat this.
But at a smaller scale, transformers would not be able to beat any of the other formats of systems that we described.
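
To make the "quadratic cost just kicks in" point concrete, here is a back-of-the-envelope count of token interactions, assuming full self-attention compares every token with every other token while a short convolution looks at a fixed window. The function names and the window width of 3 are illustrative, and the count deliberately ignores hidden size, heads, and constant factors.

```python
def pairwise_attention_ops(n_tokens: int) -> int:
    """Full self-attention: every token interacts with every token."""
    return n_tokens * n_tokens


def short_conv_ops(n_tokens: int, kernel_width: int = 3) -> int:
    """Short convolution: every token interacts with a fixed window."""
    return n_tokens * kernel_width


for n in (1_000, 1_000_000, 1_000_000_000):
    ratio = pairwise_attention_ops(n) // short_conv_ops(n)
    print(f"{n:>13,} tokens: attention needs about {ratio:,}x more interactions")
```

The ratio grows in proportion to the sequence length itself, which is why sequences with billions of elements push toward convolutional, recurrent, or linear-attention operators.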

[1:03:35] Nathan Labenz: And how about the original, differential-equation-inspired liquid neural networks? It's clear that the worm is running on a very small number of watts. In practice, do you encounter things that are so resource constrained that you have to go to these extremely specific or extremely biased architectures to actually deploy today?

[1:04:05] Ramin Hasani: Yeah, 100%. Think about places where you have to bring latency down, where computation has to happen in microseconds. You cannot really afford larger computational complexity. You want the simplest type of system that actually handles that. And there you can have adaptive systems, you can have many different formats of systems. That is one place. The other is simulation as a whole, as I mentioned, in physics. You have physical data coming in, and you want to build a digital twin of, let's say, a chemical reaction that happens at a factory, right? In those kinds of places, you would go towards extremely biased and maybe even just differential-equation-based models like liquid neural networks. Today they're getting applied to many, many different applications. I see that because the original repository from years ago is still open source, and people are still building predictive machine learning models on sequential data or physical data or sensor data. It makes sense because the systems are continuous time, so it makes sense to have a continuous-time system to apply to modeling that kind of behavior. And today, with these larger instances, the Claudes of the world, you could also direct your agents to try out a bunch of options in an auto-research kind of format. You can just direct them to say, hey, go pick the neural networks that would be the best fit for this type of data set.
And then here is the space of possibilities you want to explore, based on the biases that you have. Machine learning people are now mostly orchestrating these automated agentic pipelines that we are building. So that's what's happening. In terms of Liquid ourselves, we are trying to contribute to open source, and we are open-sourcing some of the instances that are coming out of our search spaces, let's say LFM2. For LFM3, again, the objective is that the next generation should always beat the previous generation on the criteria that we care about, without sacrificing quality. That's extremely important for us. And then we decide, for example, how large a neural network we want to make based on these neural networks that we have, and that becomes the open source version of our models. Then we also work very closely with silicon companies. We work with the AMDs of the world, the Qualcomms of the world. What we do with them is try to understand the silicon roadmap that they have, the hardware roadmap. And based on that roadmap, we try to inform what they have to do even in the next generation of their ASICs, because we have this understanding of the algorithmic aspects of intelligence and the variety of things that they could be supporting. In order to dramatically reduce the cost or satisfy the constraints of the use cases that they want to enable, these are the considerations that you have to take into account. We can even go to the point of building specific foundation model graphs for them. Those are the kinds of projects that we do with semiconductor companies.
On the commercial side of things, we also take models and apply them in many different places. The category that Liquid AI enables, we call device foundation models. All the processors that are outside of data centers, we try to apply our technology to those kinds of places. That is outside of data centers. Inside of data centers, we apply it to constrained use cases. Use cases like low latency, ultra-low-latency applications of AI. Extremely long sequences where you want a very, very small memory footprint for your AI system. You want to have a cost-efficient implementation of these foundation models that you have.

[1:07:19] Ramin Hasani: We contribute to memory, speed, and latency without sacrificing quality and cost. These are the elements that define use cases where Liquid can come in. In terms of access to our technology, there is any new generation of our architecture that comes out of our efforts in, let's say, architecture search, or new modalities that we check. For example, with Shopify, we're exploring a range of applications across recommendations, search, product catalog, understanding multimodal systems very, very well. That's something that we do. Our models are in production with Shopify right now, and they're improving click-through rates and the internal criteria that Shopify actually cares about. Outside of data centers, our models go inside cars. Some of the use cases that we enable are basically in-car intelligence. For example, we power that. Recently, we signed a contract, actually a historic one, with Mercedes-Benz, where our models are going to be powering the audio and also the visual elements inside the car. So whenever you want to talk to your car, the new voice would come out of a liquid foundation model. And we power that with a model that gives you the quality of the best models that you have seen so far, let's say the audio models that you've seen all around, but at the same time it is 600 megabytes. It can actually fit inside the smallest processor that is inside that car. That changes the game, because you're enabling local AI at scale in places that matter: in cars, and let's say on mobile phones, because you can do that. You're talking about billions of devices in the world, and mobile devices can be extremely useful. And that market itself, by the way, just the mobile business, is a $500 billion market.
This is absolutely insane. And it is as big as the data center market. So you can imagine there's a power here to be made in the efficient AI market and the constrained intelligence market. That's something that Liquid AI is going after. The laptop market, wearables, robotics, manufacturing, IoT systems. Anywhere that we have a processor inside a system, Liquid can bring in intelligence on top of that. So the aspirational goal that I have is that we really build an intelligence layer on top of the diverse formats of hardware out there and the processors that are available in the world. That is something that I would want to have.

[1:10:34] Nathan Labenz: I'm glad you said that stat. I had that noted to make a point of at some point. It's worth repeating. The global annual smartphone market is about $500 billion. For all of the hundreds of billions that are going into the data center build-out, and it will get bigger than the smartphone market, it's only now reaching and passing the scale of the annual smartphone market. So there's a lot of dark compute out there, from an intelligence perspective, that is not being used anywhere close to its maximum. And that does not include the laptop market. So there's a trillion dollars' worth of compute, generally speaking, going out into the world on an annual basis, sitting on people's desks and in their pockets. And there's a massive substrate there to take advantage of.

[1:11:20] Ramin Hasani: And we need to do it. We need to take advantage of it, because we don't have enough energy to host it all. I feel like we are realizing right now how difficult it is to really power this format of intelligence at scale. We really have to work smart. For the simplest or more constrained use cases, say a one-shot predictive task or a data extraction task, you don't want to call the fanciest type of intelligence, a Mythos-level model, to perform data extraction for you. You can have a variety of different intelligence systems that are doing many, many different things, and then for the most sophisticated problems in the world, you can go to the most sophisticated AI systems that exist in the cloud, right? So it's inevitable that the processor world outside of data centers should get enabled, as we speak right now. And I think there is some talk coming out now. There's a lag in Silicon Valley, where investors are just now saying, oh, okay, that makes sense. That lag is getting addressed right now. And I think we are in exciting times for efficient AI in general.
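
One way to picture this division of labor is a small router that keeps simple, bounded tasks on a local model and escalates everything else to the cloud. The sketch below is purely illustrative: the task kinds, the `LOCAL_KINDS` set, and the 8,192-token limit are hypothetical values chosen for the example, not anything described by Liquid AI.

```python
from dataclasses import dataclass


@dataclass
class Task:
    kind: str           # e.g. "extraction", "classification", "open_ended"
    prompt_tokens: int  # size of the input the model has to read


# Hypothetical set of task types a small on-device model handles well.
LOCAL_KINDS = {"extraction", "classification"}


def route(task: Task, local_context_limit: int = 8192) -> str:
    """Return "local" for simple tasks that fit on-device, else "cloud"."""
    if task.kind in LOCAL_KINDS and task.prompt_tokens <= local_context_limit:
        return "local"
    return "cloud"


invoice = Task(kind="extraction", prompt_tokens=1200)
proof = Task(kind="open_ended", prompt_tokens=1200)
huge_log = Task(kind="extraction", prompt_tokens=50_000)
decisions = [route(t) for t in (invoice, proof, huge_log)]
```

A real system would likely route on richer signals (confidence of the local model, privacy requirements, latency budget), but the shape of the decision is the same.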

[1:12:34] Nathan Labenz: Okay, so one challenge, obviously, with this other trillion dollars' worth of compute that's going out into the world is that it's super heterogeneous, right? There are many, many different devices, different chips, et cetera. So how much of the work that you're doing is about trying to get closer to optimal use of that compute? Obviously, we talked about scans earlier, and transformers and data centers. I'm sure there are still more optimizations to be made there, but a ton has gone into making sure that you can get as close to max throughput as possible, right? When you then go do an architecture search and you're targeting some random cell phone or sensor in a factory somewhere, how much do you have to work on things like kernels and scans as a foundation to even be able to do that?

[1:13:32] Ramin Hasani: Yeah, well, I mean, that's one of the... You see, there are different layers of abstraction. And we're also testing out the AIs to see how good they can get on the kernel design side of things. That's for the things they have seen in open source: the kernels that are available in open source, the GPU structures that they have seen out there. But not the NPUs, the most hidden type of NPUs whose IP is not disclosed. Those require, let's say, another architecture search, because we don't have that knowledge. But I would say that kernel design is at a level that we can start automating, and you can definitely have loops where you juice out the best post-hoc optimization. Whatever we talked about so far has been the optimization that we do before the design of a foundation model, all sorts of considerations before we start pre-training a foundation model. In the approach that Liquid takes, before getting into pre-training the model, we run this massive architecture search, maybe with the hardware in the loop. But then right after you design the architecture, there's all of this: quantization-aware training, changing the bits of the system, going into the kernel level and trying to juice even more out of the system. That post-hoc stuff is also something that you can automatically orchestrate with a kernel engineer that performs those optimizations, and we do that as a post-hoc optimization step. That being said, we have the capability in-house to define operators at a kernel level as well. So, for example, take matrix multiplication at the kernel level.
There are probably 100 ways that you can structure matrix multiplication: with caching, without caching, how you distribute between CPU workload and GPU workload, how you do the entire forward pass. All of these optimizations could also, in principle, become part of that massive search space. And these are the questions that some of the silicon partners are asking us: how can we, even before designing a foundation model, figure out the computation graph first, while having the guarantee? The key is that when the training is finished, because that's a costly process, with multiple millions of dollars going into the full training of a neural network even if it is small, you want to already have all sorts of inference optimizations, before the post-hoc quantization-aware optimizations, right? But yeah, in principle, we can definitely launch something like that as part of our search.
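
As a rough illustration of what "changing the bits of the system" involves, here is a minimal sketch of symmetric per-tensor int8 quantization applied after training. This is the generic textbook scheme, assumed for illustration only: it is post-training quantization rather than quantization-aware training, and it is not a description of Liquid AI's pipeline.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float weights from the int8 codes."""
    return [v * scale for v in q]


weights = [0.5, -1.0, 0.25, 0.0, 0.8]
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)
worst_error = max(abs(w - r) for w, r in zip(weights, restored))
```

Each weight now needs 8 bits instead of 32, at the price of a rounding error of at most about half a quantization step. Quantization-aware training goes further by simulating this rounding during training so the model learns to tolerate it.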

[1:16:18] Nathan Labenz: So what do you think hardware makers should be doing differently? What are you telling them to prioritize, to help us realize this more efficient, distributed AI future.

[1:16:32] Ramin Hasani: I feel like with every iteration of the software technology, the level of abstraction that the hardware providers should be building for is coming up. For example, we've been talking about CUDA moats for a long time now. You know, there is no CUDA moat anymore. Why? Because if you look at the AMDs, there are agents at the kernel level, and at that level they can now automate that stack. The stack that foundation model companies, sorry, hardware companies should start working on is the intelligence layer, like what Nvidia is doing with their Nemotron project. You can see how this project is actually quite successful, and probably a couple of billion dollars went into the design of these Nemotron projects. And if you really look at what it does for Nvidia, it is basically building an intelligence layer on top of Nvidia's compute, so that as an enterprise you have the lowest barrier to entry to buy Nvidia solutions. So the sales of the solution are always around those things. I'm just talking about the enterprise business of these hardware providers. If they want to sell more hardware, they're going to move up their stack from those kernel-level optimizations and get into the intelligence layer. Now, does that mean they have to become a foundation model company? To some extent, yes. They have to be able to train that intelligence layer themselves. And Nvidia is a success example there, in how building the Nemotron project is paying dividends for them. If you look at the other hardware companies, they have not done this yet. They have started by strictly optimizing for the open source architectures that are out there. That's also extremely valuable. You have to do that. That's a given.
But again, if you want to create differentiated value and you want to be successful, I would say you've got to be able to bring this stack up to the intelligence layer. And that intelligence layer should naturally fit on top of your hardware. If your hardware has limitations, let's say relative to your competition, software can always help to give you that juice. For example, if you're talking about maximum token speed on an AI PC, between an Intel computer, an AMD computer, and, let's say, a Qualcomm computer, they might each have their own trade-offs.

[1:19:17] Ramin Hasani: The company that is going to have an edge is the company that actually owns that efficient intelligence layer on top of its hardware, so that it can say: at the end of the day, you're going to run tokens on top of this system. It doesn't matter what the bandwidth of my CPU is, whether it's a little bit lower or higher than yours, or whether the amount of memory that I have in there is lower or higher. With the intelligence layer and software optimization techniques, I can get myself to the place where I'm the winner here. And, for example, all of the laptops or PC solutions that I'm building as a hardware provider naturally come with an intelligence layer on top. Nvidia is also entering this game heavily as a competitor in the CPU space now, as you're seeing, with devices. Google entered with the Android ecosystem that they have. They're replacing Chromebooks with Android laptops, right? This is what they're putting out there. I think it's called Aluminium OS or something like that. I don't actually know exactly what it is. And then Meta is going to come out with some sort of devices themselves as well. So all the processor builders, I have a feeling, need to get a little bit closer to that intelligence layer and take software and efficiency considerations into account when they are planning ahead, thinking about where things are going and why that intelligence layer is important. Because right now, agentic AI is something that everybody wants. All the applications you want to do on silicon, you want to do on a base of intelligence that has already been provided to you. You know, you want to build harnesses.
Harnesses are basically the applications that you can build for solving a certain problem, in the traditional sense of an application, right? And that market can get enabled on top of chips if you have a very nice stack that enables intelligence on top of what you have. The other thing that I would say is that if you're going outside of the data center, there's a massive diversity of silicon, but the resources are constrained no matter what. You don't have that much power, so you've got to be very careful about what sort of applications are the most important uses of your hardware medium and platform. And then with those use cases, if you already have an intelligence layer, you could identify when it is time to sunset a set of devices. For example, I strongly believe glasses could be a format of competition for hardware devices later on. I think glasses would be a very, very interesting kind of medium. And I think somebody's going to get it right eventually, and it might actually be a replacement for, let's say, laptops, right? But we need to really figure out what it is. If you really want to define a market and define where the field is going from the hardware perspective, I would say you've got to up your software game by a lot, really get hands-on, and invest a little bit more energy and capital into the R&D side of things on the foundation model side. Because that's the new base of software, inevitably.

[1:22:03] Nathan Labenz: So does that imply a future where we have just a lot of vertical integration and a lot of coupling? The model will come with the hardware that I buy, and it may not be so swappable in the future because the model is heavily optimized for the hardware and vice versa, such that these things are not so modular in the future as they are today.

[1:22:25] Ramin Hasani: Well, why do you want to change the model? It brings you to that question of choice. You're given a default. Now, if you want to switch it, by all means, but why do you want to switch it? Suppose the intelligence layer that is in there is not fixed. It's an adaptable system, let's say a self-improving system, with a call to your data sets and so on, and you have a platform that enables, let's say, full fine-tuning of that system. And there is not just one model that you can load into the system. There's not just one cloud model or one on-device model that you're going to use. You've got to be able to orchestrate between many different instantiations of this model to be able to build applications. We're talking about a model class that would sit on top of, let's say, the hardware on a laptop. So I would say you need to have that choice, but if the default is already giving you that efficiency and it's ready to go, that's something that I think Nvidia is trying to propose in enterprises with the Nvidia NIM projects. When they go out there and sell these things, it makes a lot of sense, because you already have a multimodal model that is loaded on top of, let's say, the PC that you bought. Why should I switch? They have already done all sorts of optimizations for me, and it is running extremely fast. Why do I need to change it? And the model itself is tunable. I can use their Megatron kind of framework to tune the model. Now, maybe you don't want to do that, and you want to just choose another model to host. As I said, that has to be a given.
Your hardware should already be optimized for the entire open source ecosystem and the models that are available, but at the same time, having that default intelligence layer would give you an advantage, for yourself and for your customers.

[1:24:19] Nathan Labenz: Okay, maybe in the last few minutes, how about a practical application of this? You have a blog post on local co-work: no cloud, no waiting, tool-calling agents on consumer hardware with LFM2-24B-A2B. So that's, I think people know, a mixture of experts, 2 billion active out of 24 billion total parameters. Let's say I want to make that a part of my life. I am interested in how you would coach me on setting this up. And, by the way, sometimes I try to make the transcript of the podcast something that I can feed to my agent. So you can think of this as partly coaching me, partly coaching my agent. For reference, today I've got this deep context database that is the last five years of my digital output. This podcast will be recorded, obviously, and transcribed. It'll go into this database. So everything you said, everything I said will be in there and searchable. And it's got all my e-mail and Slack messages and everything. Okay, cool. So now I've got Claude on my desktop that can call tools locally to get data back, but then it sends all the results to the cloud to decide which of those results are actually the right ones to be looking at. And so far, I've been okay with that. The benefits are certainly worth whatever risk I'm taking, I feel. But I would maybe love to run that data through a local model first so that I don't have to send all my data to the cloud every time. And notably, what it gets filtered through, I'm probably still going to end up burning through a foundation model. So it's not going to entirely skip the cloud. But also I'd like to save some tokens too, because I'm going to use Fable whenever I get it back for whatever it's most appropriately used for. So how do I get really good performance on my local computer with this model? Do I need to be doing fine-tuning? Do I need a distillation strategy?
How do I actually take the base model that you have and recover as much of, let's say, Opus or even Fable performance in terms of searching through and understanding my data as I possibly can? And how much is that? What should I expect in terms of how far I'll be able to push that process? How close to parity with frontier models can I get? So tell me everything I need to know, and I'll go have my agent do it.

[1:26:43] Ramin Hasani: Yeah, well, I mean, that's a great question. Obviously, the local coworker, as it stands today, is just to open minds, like, hey, you know what, this is the type of application you can enable. The class of what this format of local agents is going to enable is basically a local computer. You have an orchestrator. If something is so complicated, it should be able to send it to the cloud and fetch the answers for you. If it's not sensitive, we have, let's say, smaller models that are PII models. It should be able to use the PII models to really filter out all the personally identifiable information and send it to the cloud for you. You don't even need to see these models. These models should be in the background. The model that you have to tune is that orchestrator, which would be able to route between many different services, or even smaller specialized models that are doing stuff, and also some of the cloud models that are out there, right? That router is the computer, that's the local computer. That's like a definition of it. When you open your laptop, it should be just that, you literally have just that, and then you start working with all the services you want to have. It's just the same way that you communicate with your, let's say, assistant; it should be in the form of an assistant that does all sorts of those jobs. It takes a while for people to not get weirded out by the user interface, just being that and not seeing all the file formats and what you have to do, but you've got to get used to it. Like now Claude Code is even that for the IDE. Is the 24B off the shelf going to be there on the quality that does all of that stuff for you? No, it's not. None of the local models today are there. With the local models today, you've got to fine-tune them. You've got to get them to be specialized for the stuff that you want to do. With the proper explanations for your Claude agent, you basically give it these things.
But is Claude able to go out there and actually build this model for you, let's say, do the fine-tuning and get it to that place? No, because today, even Fable-level kind of models would not be able to. First of all, you wouldn't have access to that, because Anthropic is the one that actually has access to the auto-tune, automated, models-generating-models kind of platform. There are companies, and also ourselves, building platforms that enable you to do fine-tuning of this whole thing. That would cost you between tens of dollars and, let's say, low thousands of dollars to actually get you to cloud quality, a production-quality kind of model with all the checks. It's not gonna cost you tens of thousands of dollars; it's going to be between tens of dollars and low thousands of dollars. And that's kind of the scale that we are thinking about, because again, efficiency here matters a lot, because we want this fine-tuning to happen on the compute that you already have. Or if you don't have it, you would actually put it in a secure and proper kind of data center that you would host from the providers, basically. And you would be able to get to that quality of the models. Hopefully, as we go forward in the next few months, we are going to announce some platforms that you can hook directly into your channel. You don't need to do anything, you just say, "Hey, you know what, go call this platform now for fine-tuning this thing," and this would give you a production-grade foundation model that you can deploy now for yourself. It can either fine-tune on that data or, depending on the use case, depending on what you want to do, it doesn't even have to see that data.
It can even synthetically generate data on its own and actually train the model to be like just the perfect reliable tool caller and understands when is the shortcomings of it and it can actually go away and deliver it to some other places. Yeah, so I would say you're extremely close to get to that point.
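
A minimal sketch of the orchestrator pattern described in this answer: simple requests stay on the device, complicated ones go to the cloud, and anything leaving the device is first passed through a PII filter. Everything here is a hypothetical illustration, not Liquid AI's implementation. The `route` and `redact_pii` names, the `complexity` score, and the threshold are assumptions, and two regular expressions stand in for what the conversation describes as small dedicated PII models.

```python
import re
from dataclasses import dataclass

# Hypothetical stand-in for a small on-device PII model: two regexes.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact_pii(text: str) -> str:
    """Replace e-mail addresses and phone-like numbers with placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)


@dataclass
class Route:
    target: str   # "local" or "cloud"
    payload: str  # text that would actually be sent to the target


def route(request: str, complexity: float, threshold: float = 0.5) -> Route:
    """Keep simple requests on-device; redact before anything goes to the cloud.

    `complexity` is an assumed score in [0, 1] that a tuned orchestrator
    model would produce; here the caller supplies it directly.
    """
    if complexity < threshold:
        return Route("local", request)
    return Route("cloud", redact_pii(request))
```

In this sketch the routing decision is a single threshold. In the setup described in the conversation, that decision is the part a fine-tuned orchestrator model would make.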

[1:30:38] Nathan Labenz: I'll wait for your platform to be ready. I don't have to DIY is the message. Yes, this has been brilliant. Maybe one last question, and then I'll just kind of give you the floor to close however you'd like. How far do you think this goes in terms of kind of miniaturization of intelligence, if you will? One intuition would be that the biological world that we see is maybe on some sort of Pareto frontier already. And so the watts that go into our brain are maybe getting us close to the maximum that we could get for that amount of power, but maybe you have a different intuition. What would you expect in terms of upper limits of intelligence that I could have on my phone or on my laptop as we really get to the kind of physical limits of the technology?

[1:31:31] Ramin Hasani: You know, when you think about intelligence, the way I look at it right now is that Transformer-based networks, and also our type of, let's say, the current landscape of architectures that are available, they gave us, with scale, in-context learning capability. The thing that actually emerged from next token, you see, the word is important, emerged from next-token prediction. Intelligence for me is an emergent property. If you want to miniaturize intelligence, bring it to the physical world, I don't believe that with the current set of algorithms you would be able to get close to the, let's say, intelligence per watt that the human brain is actually providing, right? You're not gonna get there. Why? Because I believe we also have to consider the amount of energy that went into the design of humans over the whole of biological evolution. It's actually a very, very long kind of process. And about the pre-training process, people say, "Oh, you're going to see the entire world. Humans do not need to see the entire internet to be able to reason about something." No, but humans have gone through years of evolution. So I would actually attribute it a lot to the evolutionary kind of aspect of things. But again, what I would say is that human intelligence came with multiple mechanisms of in-context learning. When our current AI systems do in-context learning, they have learned a vague representation of one algorithm, which is least squares. It's basically least squares. So what they figured out is basically gradient descent in a mushy way. For some use cases, with some examples, you would be able to make the system understand and give you kind of the next example. These are kind of beautiful properties. This is what I would call an emergent property of the systems. You set the algorithm to be next-token prediction.
You got a vague version of gradient descent as an in-context learning capability. For humans, you don't have only emergent gradient descent. You can learn by examples, but you can also do reinforcement learning. You can run simulations in your head. You can do all sorts of algorithms. You can do Bayesian statistics in your brain. You see what I'm saying? So you have a diverse set of algorithms that emerged from the way that humans actually got designed and got intelligence. For me, I believe if you start optimizing for those, if you try to force your way into the system to become a reinforcement learning, not learner, but learning from trajectories, that's not the right way to actually get to the emergent property of intelligence. I believe that for the format of intelligence that is going to be miniaturized, you know, like being in the smallest amount of time, the maximum amount of intelligence, like the human brain, we need to come up with these identifiers, these confounding kind of algorithms, of which next-token prediction is one. What else do we have to do at the beginning of the design of these systems so that reinforcement learning, like curiosity-driven intelligence, drive with a limited amount of energy from that final system that actually comes out of it? So I would say that emergent property is something that we need; there's a lot of research that has to go in that direction, I would say. And multiple kinds of variations of ways of learning, that would enable the next generation of artificial intelligence, I think.
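
As a toy illustration of the "gradient descent on least squares" picture in this answer: the sketch below treats a handful of in-context examples as a tiny regression problem, fits one weight by gradient descent on the squared error, and uses it to predict the next example. This is only the algorithm being alluded to, written out explicitly. It is not how a Transformer or a Liquid model is implemented, and the function name, learning rate, and data are hypothetical.

```python
def in_context_predict(examples, query, lr=0.1, steps=100):
    """Fit y ~ w * x to the in-context examples by gradient descent, then predict."""
    w = 0.0
    n = len(examples)
    for _ in range(steps):
        # Gradient of the mean squared error (1/n) * sum((w*x - y)^2) w.r.t. w.
        grad = sum(2 * (w * x - y) * x for x, y in examples) / n
        w -= lr * grad
    return w * query


# Hypothetical "prompt": three (x, y) examples that follow y = 3x, then a query.
examples = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0)]
prediction = in_context_predict(examples, 4.0)
```

With these examples the fitted weight converges to 3, so the prediction for the query 4.0 comes out very close to 12. The contrast drawn in the conversation is that humans have many such learning mechanisms available, not just this one.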

[1:34:59] Nathan Labenz: That could be a great note to end on, but let me just give you one more opportunity. Anything else that I didn't touch on or anything else you just would want to leave people with before we break?

[1:35:09] Ramin Hasani: No, I think we covered a good lot of information. We went into some things that I don't usually talk about. Now, in my current role as the CEO of the company, I get to talk a lot more about the business opportunities and the market aspects of the intelligence and stuff, but still, at the core, we are scientists and we are really pushing the boundaries. If you're actively thinking about it day-to-day, you can't imagine, even right now, as we're talking, I'm already in the back end, actually training some stuff. And I really don't want to get away from that kind of mentality, because I think everybody is now able to build, and I think the times couldn't be better. The agents are pretty amazing at giving you this opportunity of really doing frontier research if you're curious about something. It's just that we need to find people that are a lot more curious, and improve the curiosity in people as well, instead of just fear of everything. So I'm a techno-optimist, and probably you've observed that. I usually talk about technology in a very, very positive light. One thing that I always say is that we need to really get to the level where people are as curious as the scientists were from day one. What gets us into science? The purpose of science has always been satisfying our own curiosity, and also understanding the world around us, right? So that's kind of the purpose of science. And I feel that with the current superpowers that are given to us with these AI systems, everybody could actually start really contributing to understanding the world around us, to generate more value and also just satisfy our own curiosity. But it also requires a certain degree of rebasing our biases towards, like, oh, what is work, what is not, and what is automation?
What is an army of agents working for me? What does that definition even look like? And then it's changing culture and changing the way of thinking about this thing, not just for individuals, but also for the enterprises that we are dealing with as Liquid AI. And yeah, so I think that change is something that I'm looking forward to, and I think it's something that's going to happen. The pace of it is extremely fast, but I think we've got to do it one way or another.

[1:37:30] Nathan Labenz: I love that. I consider myself extremely fortunate to live, to a really remarkable degree, a curiosity-driven life these days. And that's one candidate for a positive vision for the future that I think should inspire a lot of people, especially because, as you note, there's an opportunity to move into that phase already with the AI systems that we have. This has been excellent. I really enjoyed it. Thank you so much for the time. Ramin Hasani, co-founder and CEO of Liquid AI, thank you for being part of The Cognitive Revolution.

[1:38:00] Ramin Hasani: Thank you.

Outro

[1:41:04] If you're finding value in the show, we'd appreciate it if you'd take a moment to share it with friends, post online, write a review on Apple Podcasts or Spotify, or just leave us a comment on YouTube. Of course, we always welcome your feedback, guest and topic suggestions, and sponsorship inquiries, either via our website, cognitiverevolution.ai, or by DMing me on your favorite social network. The Cognitive Revolution is part of the Turpentine Network, a network of podcasts, which is now part of a16z, where experts talk technology, business, economics, geopolitics, culture, and more. We're produced by AI Podcasting. If you're looking for podcast production help for everything from the moment you stop recording to the moment your audience starts listening, check them out and see my endorsement at aipodcast.ing. And thank you to everyone who listens for being part of The Cognitive Revolution.