Distributed Training, Decentralized AI: Prime Intellect's Master Plan to Make AI Too Cheap to Meter
Vincent Weisser and Johannes Hagemann, founders of Prime Intellect, join a conversation on the Cognitive Revolution to delve into distributed training, decentralized AI, and their vision for a future where compute and intelligence are widely accessible. They discuss the technical challenges and advantages of distributed training, emphasizing how such systems can democratize AI technology and create a more equitable future. The founders also describe their broader goal of creating a public utility for compute and intelligence and touch on their collaborative work in biosafety and scientific research to illustrate the practical applications of their vision for decentralized AI.
SPONSORS:
Oracle Cloud Infrastructure (OCI): Oracle's next-generation cloud platform delivers blazing-fast AI and ML performance with 50% less for compute and 80% less for outbound networking compared to other cloud providers. OCI powers industry leaders like Vodafone and Thomson Reuters with secure infrastructure and application development capabilities. New U.S. customers can get their cloud bill cut in half by switching to OCI before March 31, 2024 at https://oracle.com/cognitive
NetSuite: Over 41,000 businesses trust NetSuite by Oracle, the #1 cloud ERP, to future-proof their operations. With a unified platform for accounting, financial management, inventory, and HR, NetSuite provides real-time insights and forecasting to help you make quick, informed decisions. Whether you're earning millions or hundreds of millions, NetSuite empowers you to tackle challenges and seize opportunities. Download the free CFO's guide to AI and machine learning at https://netsuite.com/cognitive
Shopify: Shopify is revolutionizing online selling with its market-leading checkout system and robust API ecosystem. Its exclusive library of cutting-edge AI apps empowers e-commerce businesses to thrive in a competitive market. Cognitive Revolution listeners can try Shopify for just $1 per month at https://shopify.com/cognitive
CHAPTERS:
(00:00) Teaser
(01:02) About the Episode
(05:43) Welcome to the Cognitive Revolution
(05:55) Exploring Decentralized AI
(06:46) A Positive Vision for the Future
(08:19) The Risks and Rewards of AI
(08:56) Superintelligence and Its Implications
(13:22) The Future of Work in an AI-Driven World
(17:09) The Role of Billionaires in an AI Future (Part 1)
(20:41) Sponsors: Oracle Cloud Infrastructure (OCI) | NetSuite
(23:21) The Role of Billionaires in an AI Future (Part 2)
(30:20) The Compute Market Landscape (Part 1)
(35:10) Sponsors: Shopify
(36:30) The Compute Market Landscape (Part 2)
(47:49) Decentralized Compute Fabrics
(51:25) Regulatory Challenges in Europe and the US
(53:28) Policy Regrets and the EU AI Act
(54:30) The Impact of Overregulation on AI
(57:00) Frontier AI Labs and Safety Plans
(01:00:02) Open Source vs. Closed Models
(01:06:19) Scientific Progress with AI
(01:14:56) Distributed Training in AI
(01:35:29) Challenges in Model Interpretability
(01:36:06) Scaling Paradigms and Decentralized Training
(01:38:30) R1 Style Training and Efficiency
(01:40:06) Supervised Fine-Tuning and Reinforcement Learning
(01:43:24) Swarm Parallelism and Distributed Training
(01:45:19) Future of Compute and Infrastructure
(02:01:02) NVIDIA's Market Dominance and Competition
(02:05:22) Decentralized Training and Open Source Collaboration
(02:09:58) Governance and Incentives in Decentralized AI
(02:14:19) Conclusion and Call for Collaboration
(02:15:54) Outro
SOCIAL LINKS:
Website: https://www.cognitiverevolutio...
Twitter (Podcast): https://x.com/cogrev_podcast
Twitter (Nathan): https://x.com/labenz
LinkedIn: https://linkedin.com/in/nathan...
Youtube: https://youtube.com/@Cognitive...
Apple: https://podcasts.apple.com/de/...
Spotify: https://open.spotify.com/show/...
PRODUCED BY:
https://aipodcast.ing
Full Transcript
Vincent Weisser: 0:00 Execution is cheap. Ideas are worth everything. Right? In a world where you can just, like, it almost inverts the current reality. And I think it would just lead to, like, billions of startups. We don't buy, like, hundreds of millions or billions in compute. So in that sense, like, we're not a hotel. We're more like Airbnb, or, like, we're more a marketplace sitting on top even of other marketplaces.
Nathan Labenz: 0:19 Why does this matter? Multiple reasons. Right? It's like, in the limit, you know, it could create a sort of truly decentralized AI infrastructure that nobody can control.
Johannes Hagemann: 0:29 We obviously have a lot coming up in terms of, like, improving that algorithm, right? I think what we've done so far is just, what we've realized is those pseudo gradients actually get sent after those hundreds of steps. So it's not the actual gradient of the model. It's the difference between the beginning state of the weights and the end state of the weights after all those inner step updates.
Vincent Weisser: 0:48 In the intelligence age, almost, you want to own a piece of a superintelligent system that is able to generate value, where you actually have, like, access, through your ownership in it, to the compute, to the intelligence.
Nathan Labenz: 1:02 Hello, and welcome back to The Cognitive Revolution. Today, I'm excited to share my conversation with Vincent Weisser and Johannes Hagemann, founders of Prime Intellect, whose mission is to make intelligence too cheap to meter by building foundational technology to support decentralized, collectively owned AI. Vincent and Johannes stand out for offering a positive vision of a future in which a wide range of AIs empower everyone simultaneously, amplifying each individual's abilities and improving societal resilience, while all actors implicitly check and balance one another's power. At the same time, they've articulated an ambitious master plan and shipped a number of notable milestone projects in pursuit of this goal. Part 1 of their plan is to build an international market for compute. And as of this writing, you can rent an H200 for $1.49 an hour via their website, primeintellect.ai. Part 2 is to build software frameworks for distributed training. And in late November, they released INTELLECT-1, proving that distributed training can scale up to at least the 10 billion parameter level. Part 3 is to train high impact science models, and the METAGENE-1 model, developed in collaboration with the Nucleic Acid Observatory and others, and designed to be useful for pandemic detection but architecturally incapable of generating new pathogens, is one of the best examples of a defense-favoring AI project that I've seen anywhere. Part 4 is to launch a decentralized protocol for collective ownership of AI models and to collaboratively build towards aligned AGI that benefits all of humanity. While that still remains in front of them to do, given their track record to date, I would not bet against them making a meaningful contribution. We spent much of the first half of this conversation unpacking their vision. To be honest, I'm still not sure how realistic it is to expect that we can maintain a stable societal equilibrium with AI changing everything everywhere all at once. But then again, to be real, this is happening very fast, and I don't think anybody has articulated a credible big picture plan so far. If that's true and we're mostly just going to keep developing this technology as fast as possible and hope that the resulting AIs end up being mostly harmless by default, I do find a lot to like in their vision for a more decentralized and hopefully resilient balance of power, as opposed to a world dominated by a few major AI players. In the second half of our conversation, we get into the technical details, including both the fundamental challenges and recent progress in distributed training. Johannes walks us through the three main parallelization strategies used in model training, data, pipeline, and tensor parallelism, and discusses strategies like DeepMind's DiLoCo, which reduces communication overhead by allowing training nodes to process hundreds of steps before needing to aggregate gradients and sync model states. That they've managed to use this and a number of other optimizations to train a 10 billion parameter model across a globally distributed network of compute resources is impressive. And the latest from Google, called Streaming DiLoCo, which was released just after we recorded, suggests that they have not yet hit any fundamental limits.
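To make the DiLoCo idea concrete, here is a minimal, illustrative Python/PyTorch-style sketch of one communication round: each worker runs hundreds of local optimizer steps, then the pseudo-gradient (the difference between the weights at the start and the end of those inner steps, as Johannes describes later) is averaged across workers and applied by an outer optimizer. The function names and optimizer choices here are a simplified sketch for illustration, not Prime Intellect's actual training code.

import torch
import torch.distributed as dist

def diloco_round(model, inner_opt, outer_opt, data_iter, inner_steps=500):
    # Snapshot the weights at the start of the communication round.
    start_params = [p.detach().clone() for p in model.parameters()]

    # Inner loop: hundreds of ordinary local steps, with no communication at all.
    for _ in range(inner_steps):
        x, y = next(data_iter)
        loss = torch.nn.functional.cross_entropy(model(x), y)
        inner_opt.zero_grad()
        loss.backward()
        inner_opt.step()

    # Pseudo-gradient: starting weights minus ending weights, one tensor per parameter.
    for p, p0 in zip(model.parameters(), start_params):
        p.grad = p0 - p.detach()                       # reuse .grad to feed the outer optimizer
        dist.all_reduce(p.grad, op=dist.ReduceOp.AVG)  # average pseudo-gradients across workers
        p.data.copy_(p0)                               # reset to the shared starting point

    # Outer step: apply the averaged pseudo-gradient.
    outer_opt.step()

In the published DiLoCo setup the inner optimizer is AdamW and the outer optimizer is SGD with Nesterov momentum; the key point is that workers communicate once every few hundred steps rather than every step.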
It might still be very difficult to aggregate enough compute to train foundation models from scratch in a distributed fashion, but the recent shift toward reinforcement learning, which is far more inference heavy and thus friendlier to distributed training approaches, strongly suggests that any number of networked groups can probably muster enough compute to train whatever models they might like to reinforce into existence. Or as Anthropic's Jack Clark put it in a recent edition of Import AI, we might soon live in, quote, "a world of models trained continuously in the invisible global compute sea." That world would almost certainly be simultaneously weird and beautiful and scary. But barring extremely draconian measures, the likes of which are well outside the Overton window today, it seems like the sort of thing that technologists can and will create and that governments will have a very difficult time preventing or controlling. Perhaps in the end, we can only hope that d/acc, the strategy of accelerating the differential development of decentralized defenses, as exemplified by the METAGENE-1 model, will ultimately win out. As always, if you're finding value in the show, we'd appreciate it if you'd share it with friends, write a review on Apple Podcasts or Spotify, or leave us a comment on YouTube. We always welcome your feedback and suggestions via our website, cognitiverevolution.ai, or by DM'ing me on your favorite social network. For now, I hope you enjoy this discussion about a positive vision for, and the technical underpinnings of, decentralized AI development with Vincent Weisser and Johannes Hagemann of Prime Intellect. Vincent Weisser and Johannes Hagemann, founders of Prime Intellect. Welcome to the Cognitive Revolution.
Vincent Weisser: 5:52 Thanks for having us.
Johannes Hagemann: 5:53 Thanks for having us.
Nathan Labenz: 5:55 Yeah. I'm excited for this conversation. You know, for a while, I have been really wondering about distributed training and if it's gonna work, what the trade offs are gonna be, and just generally the concept of decentralized AI, which I think has been quite fascinating. And when I recently mentioned on an episode that I'm looking for the right people to talk to about this, your names came to my attention. So I appreciate you being willing to do this and definitely look forward to getting into some of the nitty gritty details of all the work that you have done to realize this decentralized AI vision. And just before even getting into that though, one of my common refrains is that a positive vision for the future is currently the scarcest resource. And I'm just always amazed by how little of that we hear, including from people leading the frontier AI companies; we get very vague visions of what's going to be good about the future. One thing that has struck me about the two of you as I've explored your work in preparation for this is that I think you do have a positive vision for the future. So I'd love to start off by giving you an opportunity to articulate that positive vision for us a little bit.
Vincent Weisser: 7:02 So I can kick it off to frame it also in, kind of like, starting with what we plan to build with Prime Intellect and then, more broadly, like, how I see the positive vision of the future with AI. So basically, really, like, our goal is to make intelligence and compute too cheap to meter, and I think we're entering that era right now. And really, I think the key goal for us, which would be a key part of making it utopian, is to make it widely accessible and open to everyone. And I think really, like, the goal is, almost, if we zoom forward, right, a few years or decades ahead, the biggest risk would be if we have the super powerful intelligence and compute, but it's not accessible to everyone, but rather to the select few and, kind of like, the big states, the big tech giants and individuals. So I think that's really how I see the most positive future. And I think what this really enables is, like, I think it empowers, like, every human and creator to leverage the intelligence, right, and, like, be able to do more with less, right? It's like, basically have agents at their disposal, like, create more advanced science, like, have better access to medicine, to education, right, to all of these things. So I think it actually, really, like, has the potential to usher us into kind of an era of, like, widely distributed abundance, if done right, but if done wrong, it also has the risk that it actually, kind of like, creates, like, a much harder to access intelligence age, let's say. Right? And I think that's the most important piece: that, basically, large parts of society are, like, able to participate in this, like, upgrade.
Johannes Hagemann: 8:38 Yeah. We're pretty much aligned on that. Right? I think it needs to be widely distributed in a sense. Right? The most dangerous outcome is if there's only one superintelligence, right; like, the chance of that is low, obviously, but the possibility rises as we go forward. Right? So that's how we basically think about the future of open source AGI as well as ASI.
Nathan Labenz: 8:58 It's interesting that you mentioned AGI and ASI in that answer. I think, you know, so many AI worldviews or expectations really diverge on how powerful people expect the AIs to become and on what timeline. How would you compare and contrast your own expectations to what we've heard recently from the Altmans and Darios, who seem to be increasingly confident that we are gonna see superhuman AI in the next few years?
Vincent Weisser: 9:29 Yeah. I think, basically, most of the views put forward are, for me, very plausible, maybe, like, even since the last 15 years, like, since reading, like, The Singularity Is Near and Bostrom, like, when it came out. And I think, basically, we're still on that path. Right? I think it's actually surprising how well those predictions held up. I think the most common views of the big lab leaders, like, from Demis and Dario to Sam, to folks like Leopold, I think, like, I would put basically high probability mass on their scenarios, and I think they also base it off, like, obviously, the research progress that we've seen over the last, like, decade. I think the biggest question is, like, how general it will be in, like, the broader sense, right? And I think a lot of people I talk to, also at the labs, like, have high confidence, right? We'll get to superhuman coding, superhuman math, maybe superhuman software organizations that run autonomously, right? But that doesn't mean we get superhuman humanoids, like, tomorrow. Right? Like, basically, I think a lot of these things will probably look like autonomous cars or something, where it's easy to make radical progress, but it's hard to get to, like, 100% accuracy and robustness. Right? I think, basically, an autonomous car that isn't 100% robust, but, like, 95% robust, right, like, is really useless. And I think we'll see the same with superintelligence, where, like, the deployment of it, right, like, will happen if it has extremely high robustness in its application. So I think it's not enough to have a superhuman investment agent; like, it can't screw up 1% of the time and lose you all your money. And I think, like, that's how we'll probably see progress shape up: that, like, actually, I think it will move much faster in some areas, like we've also seen, right, like, with the new, like, inference time compute scaling paradigm. And I think we'll see it, like, move slightly slower than, like, some of the biggest optimists maybe, like, make it out to be in trickier, like, harder to simulate areas. Right? I think a really useful framework is, like, even, like, folks like Stephen Wolfram on, kind of like, how computationally reducible, basically, different areas are, right? It's more difficult to solve all the diseases than to write a bunch of code autonomously. So I think we'll run into the limits of what you can compute. Right? And I think that's probably one of the things, like, the only maybe bottleneck I see in the next 10 years is, like, that you can't simulate everything. You can't compute everything, but a surprisingly large amount you can, and you can approximate maybe reality with, like, pretty good simulations, including for things like biology. So, yeah, I'm quite optimistic; like, I think most people are not optimistic enough, especially in the face of, like, all the progress, in the face of, like, the infrastructure build out. So I think we're on track for, like, superintelligence in, like, probably the next decade.
Nathan Labenz: 12:14 So can you maybe put a little bit more color on the future that you envision? Like, I hear you on the accessibility part. I definitely think people are underestimating what's coming, but I'm more unsure as to, like, is it going to be good by default, bad by default? What are the hinge points? Obviously a lot of questions there. It's funny you mentioned, like, Kurzweil too. I've increasingly been using the phrase Kurzweil's revenge recently. Because in retrospect, I look back and basically say, there was really no other way this was gonna go. You know, it seems like the fact that so many different algorithms seem to work and the fact that we have the compute at scale and the data at scale, like, there was really no trajectory where we were going to have these compute and data resources and not have somebody figure out, you know, a workable algorithm. So I think in that very fundamental sense of, like, you know, if you have the physical and informational inputs, like, somebody will figure out how to unlock the lock, and we will get, you know, reasonably and potentially, like, very powerful AIs. That seems to have been, like, very well borne out. Do you have a vision for daily life? Like, am I working a job in your future, you know, imagined successful scenario? And is there, like, a stable equilibrium? I really struggle to envision a stable equilibrium. I think the decentralized AI, you know, line of thought in general sort of tries to get there, but I would love to have a more concrete, you know, detailed, richer vision of that than I do, to be honest.
Vincent Weisser: 13:49 Yeah. Yeah. I think, like, it really depends on different dimensions and axes and also on the timeframe. Right? And, like, personal preferences. I think in many ways, people that are, like, financially independent, right, they could basically retire, but they could also still work on a startup or, like, do other things, and I think that's playing out, right? It's like, every billionaire in society is probably working, like, harder, not, like, less hard, on, like, their philanthropic efforts, right, like, their entrepreneurial efforts and other efforts. And I think we'll see basically more people, like, behaving, like, as if they're, like, financially independent. So, like, I think, in that sense, there's already a glimpse of, like, I think, how abundance manifests, right? It's like, billionaires are by definition living in abundance. But I think, like, for a lot of people, obviously, they have different priorities in life. So a good recent book I read on this was, like, from Nick Bostrom, like, kind of like, life and meaning in a solved world. And I think, like, some of those points, I think, make a lot of sense, where it's like, there's so many other aspects to life that people, like, care about, be it having children or, like, doing art, like, creating things, right? Like, understanding the world, right? Like, reading, like, all of these things, that I think people will be able to spend more time on, right? Like, even if one looks at other, almost like, jumps in human progress, right, like, the industrial age or now the knowledge age, basically, in those shifts, I think people were able to basically, like, get more freedoms and ultimately have an easier time reaching financial independence, as one example, but also more broadly, almost, like, do whatever they want to do. Most people I know could stop working; they don't. Right? And I don't see them completely stop working once we have AGI. And I think, like, what I think is interesting is, like, even seeing this now with a lot of people working on this technology, I think to your point, there's the camp of the people that are, like, too optimistic or too pessimistic, and they kind of, like, give up. So they, like, resigned from their AGI jobs because they're like, oh, time to enjoy the last 2 or 3 years before there's superintelligence. But also on the other side, I don't know, it's, like, kind of like, pessimists who are like, oh, it's over at the end of the year. And I think that's, like, a dangerous and kind of like unlikely scenario to play out, where it's like, I think the nature of jobs, I think, will change, but I think another good line of argument that I've heard, which I agree with, is, like, there will probably be, like, a return to very, like, human labor, and people might still go to a restaurant and they might still go to a theater. They might not drive an Uber; like, that drives itself, right? So basically, I think there will just be, like, a different sort of labor, right? It's like, that will become even bigger. Like, basically, everything where you and I would, like, even post-AGI, pay for a human, right? Because it's just more joyful to have a human around than, like, a steel robot, you know? Like, if you wanted your kids to be taken care of, maybe you don't want your Optimus robot to do it, but, like, a human you trust and can understand. But I think that's how it will play out.
Like, I think in many ways, it won't actually be absurdly different from today. It's just that I think there will be way less, like, knowledge work that isn't exceptional, right? Like, it will be much more curation, and even, like, I think all of the knowledge workers will just, like, still work. They'll just work in very amplified ways on different functions, and they'll have an army of people working for them for free, basically, which are, to an extent, the AIs. Right?
Nathan Labenz: 17:12 The one point on that I wanna push on a little bit more is, like, the point about billionaires working today. Because I feel like there's some contradiction, maybe you can resolve it, but in the sense that, like, why are they working today? I think in part it's because they feel like they can make a meaningful contribution. Right? They want to make a positive difference. Does your vision sort of imply that for the people in that position, there is not an AI that they can hand that responsibility off to, to do a better job? Because you might imagine, you know, a Bill Gates or a Dustin Moskovitz or whatever might say, you know, my goal is to eliminate these diseases. I could sit here and run this foundation, but there is a superintelligence that could probably make even better, you know, grant evaluation decisions or whatever. And are you imagining that they don't have that, or that they just choose not to use it for some reason? That part is always a little weird to me, because it feels like, if you have a superintelligence, what are even the billionaires doing?
Vincent Weisser: 18:14 Yeah. Totally. I think there was a good interview between Sam Altman and Bill Gates exactly on this, and you could see, almost, like, Bill Gates also grappling with it, being like, oh, you get so much meaning out of being so good at, like, fighting malaria, and, like, if AI were to just solve it for him, he would lose some meaning because he's so good at it, which I think is the wrong worldview, right? It's like, hey, if you really care about it, be happy if you can, like, hand it over to your employee, to your agent or whatever, to do it better for you. I think the best, almost like, founders or, like, philanthropists or whatever, they build a team that is way better than they are on every dimension. Like, basically, maybe they have broad skills, but ultimately, ideally, they hire people that are, like, super intelligent compared to them on something very specific. I hope Elon has someone who's way more intelligent than he is on propulsion, on autonomous cars, right, and AI systems, right, like, across all those efforts. And I think that's more how it'll play out, where it's like, basically, execution is cheap. Ideas are worth everything, right? In a world where you can just, like, it almost inverts the current reality. And I think it would just lead to, like, billions of startups, right? And billions of movies, or much more, right? It's like, basically just an explosion of content, companies, autonomous creations, that ultimately still, like, a human is either, like, seeking or creating. I'm kind of extrapolating from, like, where we are today, right, and, like, how I see and try to adapt to what I see as, like, plausible in a year, in 5 years, in 10 years. Like, I think there's still a lot of things we'll want to do, like, we'll want to work on. And I think, to your point, like, the ultimate, like, even this, almost like, Maslow pyramid of needs, right, it's like, there's things like impact, status, right, and, like, meaning and fulfillment that are, like, timeless. And even if you have superintelligence, you might still get, like, meaning out of specific things, right, that you, like, seek. And I see it almost more like humanity will just have, like, a gigantic workforce of, like, AIs and agents that, like, do a lot of the work, but, like, it will basically, hopefully, stay on top of it and basically manage that and direct it towards the aims, like, humanity cares about, right? It's like, it's a bit like, in some ways, our biggest corporations or states are already, like, superintelligences with a lot of agents in them, and they're directed, like, through democracy, through, like, shareholder capitalism or whatever, like, into specific directions. Right? And I think that, actually, the biggest risk, right, is that we'll just have, like, nation states and, like, for profit corporations have the superintelligence, instead of, like, everyone, to do, like, everything they want to do.
Nathan Labenz: 20:44
Hey. We'll continue our interview in a moment after a word from our sponsors. In business, they say you can have better, cheaper, or faster, but you only get to pick two. But what if you could have all three at the same time? That's exactly what Cohere, Thomson Reuters, and Specialized Bikes have since they upgraded to the next generation of the cloud, Oracle Cloud Infrastructure. OCI is the blazing fast platform for your infrastructure, database, application development, and AI needs, where you can run any workload in a high availability, consistently high performance environment, and spend less than you would with other clouds. How is it faster? OCI's block storage gives you more operations per second. Cheaper? OCI costs up to 50% less for compute, 70% less for storage, and 80% less for networking. And better? In test after test, OCI customers report lower latency and higher bandwidth versus other clouds. This is the cloud built for AI and all of your biggest workloads. Right now, with zero commitment, try OCI for free. Head to oracle.com/cognitive. That's oracle.com/cognitive.
Nathan Labenz: 21:57
It is an interesting time for business. Tariff and trade policies are dynamic, supply chains squeezed, and cash flow tighter than ever. If your business can't adapt in real time, you are in a world of hurt. You need total visibility, from global shipments to tariff impacts to real time cash flow, and that's NetSuite by Oracle, your AI powered business management suite trusted by over 42,000 businesses. NetSuite is the number one cloud ERP for many reasons. It brings accounting, financial management, inventory, and HR all together into one suite. That gives you one source of truth, giving you the visibility and control you need to make quick decisions. And with real time forecasting, you're peering into the future with actionable data. Plus, with AI embedded throughout, you can automate a lot of those everyday tasks, letting your teams stay strategic. NetSuite helps you know what's stuck, what it's costing you, and how to pivot fast. Because in the AI era, there is nothing more important than speed of execution. It's one system, giving you full control and the ability to tame the chaos. That is NetSuite by Oracle. If your revenues are at least in the 7 figures, download the free ebook, Navigating Global Trade: 3 Insights for Leaders, at netsuite.com/cognitive. That's netsuite.com/cognitive.
Nathan Labenz: 23:21 When you said "hopefully," the one interjection I was gonna ask is, do you maintain a p(doom)? Do you entertain that conversation at all?
Vincent Weisser: 23:30 I think it's not extremely useful as a concept. I think, basically, a lot of people are way too confident in, like, weird sci fi scenarios and kind of, like, putting probabilities on non-probabilities. But then if you ask them to, like, unfold them, like, they can't really. It's like, I've seen too many people, you know, like, they are like, oh yeah, my p(doom) is, like, 20%, and they're like, why? Oh, because this person also has it, and it's like, basically, it, like, usually falls apart. So I think it's, I think, almost like a mind virus; like, I think it's a really dangerous concept, actually. Like, I wouldn't put strong confidence in anyone putting out random numbers, and I think it's really about, like, probability mass, right? Probability is, like, between 0 and 100%, and it's like, the universe will go on for billions of years. Like, will something at some point kill it? Sure. Like, when, how, and what kind of, like, the order of events is, I think it's, like, so hard to know that I think it's almost not worth pondering. And I think one of the problems, like, the utilitarian philosophy brought to the space is to have very high confidence in impossible to know future scenarios and events, and then basing, like, policy and all the actions on that. So I think what is more important, and I think the ironic thing is, like, that in this worldview, I think, like, one needs to realize that ultimately, like, on shorter time frames, the probabilities are stacked in favor of good outcomes, and then even the actions, like, for example, on AI policy, like, sometimes have the exact opposite consequence, where basically it's like, someone even with a low p(doom) is, like, oh yeah, that's why we need to regulate it heavily in Europe. Okay. Goodbye, AI in Europe. Like, or we need to regulate it heavily in the US, and suddenly, like, things move outside of this. So I think there's basically, yeah, like, a cascade of, like, very bad, almost like, epistemics and reasoning that follows from people going too deep on p(doom)s.
Nathan Labenz: 25:32 I've certainly seen some of that. At the same time, I do sort of worry that things could get out of control. I answer that question when I'm asked for my own p(doom). I usually say 5 to 95%, or like 10 to 90%. And yeah, what I mean by that is basically just, nobody has said anything to me that seems compelling enough that I shouldn't worry about things getting totally out of control, nor that, you know, it's so likely to happen that I should, you know, check out and spend the rest of my life at the beach or whatever. But I do worry about the, you know, the possibility of things getting out of control. Do you think that worry is misplaced, or do you share it to some degree? I don't think quantifying it is super useful, but I do think, like, considering it is still useful.
Vincent Weisser: 26:20 Like, I see it almost like the industrial revolution: it created chaos, like, it created havoc, created maybe even civil war in, like, some ways, and, like, had second order and third order consequences. Was it net good? Sure. And, like, I think, like, the Internet. Right? It's like, it created, like, some, like, complexities, political consequences, and everything else. And I think still, like, ultimately, I think, like, AI's, like, benefits, like, far outweigh the risks. And I think also some of the paths to avoid the risks become risks themselves. Right? It's like, basically, kind of like, in a world where, like, everyone is worried about, like, the existential risk and then, like, calls for, like, one world government, I think probably the one world government is, like, the bigger risk than, actually, kind of like, the, like, sci fi scenarios that people are worried about. So I think all these things are very complex systems, right? And ultimately, I think it's hard to know what the specific risk will be, right? And I think the biggest risk is actually people losing their autonomy, like, putting life on autopilot with AGI, losing their freedoms because they handed them over heavily to the state and to overregulation. So I think really that's my biggest worry: basically that even the most well intentioned people actually create almost, like, the worst outcome or scenario unknowingly and unintentionally, and then also, in some ways, being misused by people that seek power. Right? I think, like, a lot of the people that, like, were fighting for alignment, safety, and policy have, I think, been co-opted by big labs and big tech that have their own goals and their own intentions, and you can see this really well, right? Like, with the big tech giants and the big tech AGI labs; like, the stories I know about almost all of the AGI labs are, like, extremely worrying, of, like, things happening behind the scenes and how they power seek, even the people that this whole community kind of reveres. So I think that's the thing that I'm really worried about. It's basically, like, seeing some more details, I think, like, the biggest risk, I think, is basically, like, full centralization of superintelligence by a few nation states and big tech giants, and basically completely disenfranchising large parts of humanity.
Nathan Labenz: 28:36 I don't know. Like, it seems like it's all in play in a way that it's never before been in my life. But I do hear you very much on the idea that a crazy imbalance of power, or just, like, you know, one superintelligence in one organization or one company or, you know, even potentially one individual's hands, does seem like a really problematic situation. You know, in biology, right? Like, one lesson I've learned over and over again in life is anything that is sufficiently concentrated is dangerous. You know, like, all the drugs become dangerous when you take them from, like, the leaf that it grew in on the tree, that was fine, but then when you, you know, purified it down to 100% and started huffing that, like, that's when you really got into problems. Right? And that seems like it kind of keeps coming up over and over again. So I am quite sympathetic to the idea that if we can maintain a sort of buffered solution sort of balance of power, where everybody is sort of checking everybody else, much like things work today, right? Nobody has, like, the ability to dominate the world unilaterally. If we can maintain that as we, like, bring AI online across all aspects of life, I think that does sound really appealing. Of course, I don't really know, like, what that also could bring. It seems, you know, anything is possible there. But I do find a lot to like in that vision of the sort of, you know, everybody rising simultaneously, everybody's interests, you know, kind of keeping one another in check, no single actor, you know, being able to make a mistake that kind of throws everybody else into a huge problem. Let's talk about the company and your kind of master plan for getting there. I love a good master plan. And again, you've got one. So one thing I wasn't able to tell from my research is, like, what kind of a company is it? Is it, like, just a normal corporation or something else? So maybe, you know, give us a little bit of the foundational backstory and then take us through your master plan.
Vincent Weisser: 30:39 So, basically, like, almost, like, our broader goal really, right, is, like, to make compute and intelligence, kind of like, too cheap to meter, and, like, really how we want to do that is, like, in multiple stages, where, basically, in the first one, we started with aggregating global compute and creating an efficient compute market, but also the developer interfaces for it, like an API, command line interface, and basically ways for people to just, like, no matter what kind of compute they're looking for, like, be it one H100 or a thousand H100s, make a request and then find, like, where they can get the cheapest compute they're looking for. So that's basically how we started and what we launched 2 or 3 months into the company creation. So basically that really was also the biggest thing we initially did, and it gained a lot of traction. And then really building on top of this, and kind of like in conjunction, is building basically all the decentralized training and other approaches, like distributed synthetic data generation and others, which really can then, like, leverage this global, like, compute fabric, right? It's like, you can envision, if there's extremely cheap compute in one place or another and it dynamically shifts, it's extremely beneficial if you can train in a decentralized and fault tolerant way, because then you can basically just save massive amounts but also really leverage the global compute, especially when there's idle compute available somewhere in the world. So basically, it ultimately reduces the cost of intelligence, reduces the cost of compute, because you can make more efficient use of this resource. So that's kind of like step number two, and we also made a lot of progress there and can go into more detail. And connected with it, which is very, like, closely connected, the goal is really to train, kind of like, leading open models collaboratively, where anyone in the world could contribute their idle compute to this network and also get rewarded for it, basically make the highest possible return on their compute because it gets fully utilized. And then if it's, for example, idle, if it's not utilized, it can contribute to frontier models, let's say to continuous improvement of, like, the R1 model from DeepSeek, which is one thing we're looking into. And that's almost the third pillar, which really connects, like, more broadly to this, like, peer to peer compute and intelligence, where the goal is anyone in the world could contribute compute, and anyone in the world can use that compute, and it's kind of like an extremely efficient market for compute. But on the other side, like, I think on a more abstract level, the same can be said about, like, intelligence itself. Basically, be it, like, any AI API. Like, it also becomes sort of a market, right? It's like, there's different people hosting Llama models, DeepSeek models, the diffusion models, and they are also competing. Ultimately, their cost is purely compute, right? So basically, you can lower the cost of those intelligence endpoints, if you like, like, of agents, of Llama, of other models, by having the most efficient compute market. So basically, like, an efficient peer to peer compute market also enables, like, efficient peer to peer intelligence sitting on top of it.
And really, I think the goal is to create, like, a system that is almost more like a protocol, like parts of the internet, that, like, can basically be maintained, like, almost like a public good, and people can just, like, use it trustlessly and permissionlessly. And I think the closest structure is something like Ethereum, like, for, almost more like, smart contracts, where, like, really the goal, right, is to create a technology, make it openly accessible, have a, like, foundation-like structure to support it, and then over time, like, it really can be owned by anyone. Anyone can use it, like, for the cheapest cost possible, and I think that's the broader vision. And to your question how we would structure this, kind of like, it's a Delaware C corp, but really the setup is more akin to Ethereum, in the sense that we're creating a foundation and basically giving out grants for people to develop this in the open, right, like, through, like, an open source approach, to basically fully, like, make this accessible to anyone. And anyone could fork the system, improve the system, contribute back to the system, participate in it. And that's really the broader design principle, almost more creating, like, a public utility to some extent. And also in conjunction with some of those systems like Ethereum, where anyone can also create their own agents and commercialize them, and the agent can make money for you, right? So, I think that's also, like, going back to this utopian vision, like, how we see this play out: that, like, anyone in the world can, like, contribute to it, own a piece of it, and it basically just, like, generates, like, revenues, like, for the end users that participate in that system.
Nathan Labenz: 35:11
Hey. We'll continue our interview in a moment after a word from our sponsors. Being an entrepreneur, I can say from personal experience, can be an intimidating and, at times, lonely experience. There are so many jobs to be done and often nobody to turn to when things go wrong. That's just one of many reasons that founders absolutely must choose their technology platforms carefully. Pick the right one and the technology can play important roles for you. Pick the wrong one and you might find yourself fighting fires alone. In the ecommerce space, of course, there's never been a better platform than Shopify. Shopify is the commerce platform behind millions of businesses around the world and 10% of all ecommerce in the United States, from household names like Mattel and Gymshark to brands just getting started. With hundreds of ready to use templates, Shopify helps you build a beautiful online store to match your brand style, just as if you had your own design studio. With helpful AI tools that write product descriptions, page headlines, and even enhance your product photography, it's like you have your own content team. And with the ability to easily create email and social media campaigns, you can reach your customers wherever they're scrolling or strolling, just as if you had a full marketing department behind you. Best yet, Shopify is your commerce expert, with world class expertise in everything from managing inventory to international shipping to processing returns and beyond. If you're ready to sell, you're ready for Shopify. Turn your big business idea into cha-ching with Shopify on your side. Sign up for your $1 per month trial and start selling today at shopify.com/cognitive. Visit shopify.com/cognitive. Once more, that's shopify.com/cognitive.
Nathan Labenz: 37:13 A lot of points I wanna follow up on there. Maybe let's start with the compute market, and, you know, we'll work our way forward in time through the master plan. For starters, how would you describe the compute market today? I think we've had a sort of much hyped cycle of people raising large equity rounds and just plowing all of those proceeds directly into GPUs. In some cases, like NVIDIA even just taking equity positions in exchange for GPUs, it seems like. That seems like it's over. Are those the kinds of companies that are now, like, saying, jeez, we maybe, like, overbought a little bit and, you know, we maybe can't use the cluster we rushed out to buy as much as we thought we could, and now they're contributing to your marketplace? Or, like, yeah. I mean, I have a million questions about the compute marketplace, but tell me what you think is interesting that's going on.
Vincent Weisser: 38:07 I think if you zoom out, GPUs were a tiny, tiny market, right, like, pre-ChatGPT, and they're growing at an exponential rate. And I think there's no slowdown in sight. And you see this, obviously, playing out in the sense that, like, obviously this week was kind of like enormous, like, a 100 billion plus in, like, compute commitments from individual AI companies and labs, and I actually think the trend right now is rather that there's much more demand for compute, including a much longer tail of companies and startups seeking that compute. But I think to your point of, like, kind of like, how this market works and is structured, right, like, how I would describe it: basically, more than a year ago, right, like, we were in a period where, like, the supply of H100s was extremely constrained. So basically, the big guys that paid, like, extra got more of them and, kind of like, got them with priority, right, like, the OpenAIs of the world. And it also shows you, kind of like, the problem. Right? It's like, those big tech giants got priority because they paid more and have, like, gigantic long orders and clusters. But I think what we've seen shift is basically, like, there is obviously now more supply. Basically, the production started to catch up a bit, and obviously also a new generation, for example, concretely from NVIDIA, is rolling out. So I think, basically, a lot of those small startups, on the point of, like, raising capital and deploying it, bought long contracts. Right? They bought, like, one, two, or three year contracts, they paid hundreds of millions, and they had to; basically, it's like, if you wanted to train a model and you needed a thousand GPUs, you had to buy a two year contract to get it. It wasn't possible to rent them on demand. And fast forward, like, now, like, on platforms like ours, you can rent a thousand H100s, like, on demand. And that wasn't possible even 3 to 6 months ago, really. And I think it showcases that basically the supply, the majority of which was kind of like in long term contracts, right, which from a capital perspective are more derisked and almost more like structured financial products, right? It's like, CoreWeave buying, like, tens of billions in compute, giving them to OpenAI, taking out loans against the GPUs. Like, that model, I think, is evolving, and basically more and more of that compute is actually moving to the on demand supply. So basically, like, really, I think also what powered all our growth and positioning is that, like, a year or two ago, it was impossible to find an H100, and what we offer was very useful in that sense; that's also why we started it. Every AI startup and developer we talked to had a very hard time finding H100s, but we were able to find them, going directly to data centers, talking to some billionaires that bought, like, tens of thousands of H100s and were looking to, like, sell them on demand. But fast forward, like, now there's even more data centers, there's even more clouds that are appearing, and really a market that's insanely fragmented. So I think to the outsider, it's not so obvious that basically there are hundreds of clouds. There's thousands of data centers. Right? And there will be more of them, and, like, there will also be more chips than, like, NVIDIA chips. Right?
So really, you need to aggregate a market that is, like, by design, so fragmented. Like, basically, NVIDIA keeps it fragmented because they don't want to have, like, an overly powerful buyer. Right? Like, if OpenAI or Microsoft were the only buyer, they wouldn't be in a good spot. But then on the other side, their biggest clients also compete with them. Right? It's like, Google, Amazon, like, all the other big buyers, they also produce their own chips. Right? They compete head on with NVIDIA's chips. So, basically, they need to be careful, and they really gave a lot of allocations to folks like Lambda Labs, CoreWeave, basically, like, in some ways smaller clouds than the really big tech clouds, because they're more aligned in many ways. So basically, that's why supply is kind of like distributed across America, across the world, in hundreds of data centers, and really, as an end user, you kind of want to discover, like, hey, I need 200 of them. Where can I find them right now? Like, it's not so easy and clear, because, obviously, the demand and the supply is shifting in real time, like, with all of these platforms. So that's kind of like the broader backdrop. And I think the other piece is, like, there were obviously some of those, like, very big and well known cases of, like, startups that raised a lot and then started pre-training models, right? And I think the thing is, that's continuing to be the case, right? It's like, there's, like, more AGI labs today than a year ago, right? I can count, like, four AGI labs now that weren't around, like, one or two years ago. There will also be, like, a lot of applications that need more and more compute, right? And especially some new use cases like video, like, the more heavy reasoning or coding agents, and in general, like, agents coming onto the scene this year, they'll need much more compute. So I don't think there will be a slowdown in compute. I think, on the other hand, it will become, like, the biggest chunk of GDP, probably, almost, in this ramp up to AGI or ASI, and probably just, like, power all the different intelligence use cases.
Nathan Labenz: 42:55 So would it be fair to say that you are playing a role that's like an aggregator of people that are also selling direct? Kind of like, I don't know, to pick one, like a Kayak type of aggregator, where I could go buy directly from the airlines or I could go to Kayak and they give me this sort of menu of all the different airlines, as opposed to, like, an eBay, where it's like, hey, I've got 3 H100s here, you know, can I plug them into your marketplace? It sounds like it's more like the former. Are those good, you know, frameworks from which to think about it?
Vincent Weisser: 43:30 Yeah. I think, like, ultimately, it's not so different from the biggest marketplaces in the world, like Amazon or something, right, where it's like, anyone in the world can, like, basically sell compute on our platform. And also, like, anyone that is selling compute, we're in touch with. Like, from the hyperscalers to individual data centers to clouds, to billionaires that have compute, to startups that have too much compute, to startups that want to buy compute and put it back onto the platform when they don't need it, right, like, to get cost savings out of it. So I think the best analogy is just, like, a lot of the platforms, right, are kind of like little isolated islands, and we try to create the ocean that connects all the islands. It's like, basically, you have some compute in one cloud, you have some in another, you have some in one big tech cloud, some more in another, but basically, it's very, very hard to orchestrate all of it. So, like, literally a lot of the people we talk to, like, even startups with two people, were building their own APIs to, like, query the availability of all the GPUs; then they have, like, different APIs, and, like, that's basically all the work that we're, like, simplifying for the end user, so that you can just, like, orchestrate this kind of, like, global compute fabric. And I think the other piece is that, like, a lot of them are almost more like a PE firm; like, they're basically structured financial products. They borrow a billion to buy a billion in compute and take out loans against the compute. Right? It's a very high CapEx business, and I think what's very different is we don't buy hundreds of millions or billions in compute. In that sense, we're not a hotel; we're more like Airbnb, in that sense, right? We are more a marketplace sitting on top even of other marketplaces. So basically, some of the marketplaces we talked to that are trying to do something similar, they're not going for the full broad spectrum. Right? For example, they don't work with the hyperscalers or with, like, some of the larger clouds. They just work with, like, individual data centers. So basically, we try to just, like, aggregate all of them, right, also, with time, and onboard all of them onto one market.
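As a rough illustration of the aggregation work Vincent is describing, the sketch below polls a set of provider endpoints for current GPU offers, normalizes them, and picks the cheapest offer that satisfies a request. The provider names, URLs, and response fields are hypothetical placeholders for illustration, not Prime Intellect's actual API.

import requests

# Hypothetical provider endpoints; real aggregators juggle hundreds of clouds and data centers.
PROVIDERS = {
    "cloud_a": "https://cloud-a.example/v1/gpu-offers",
    "cloud_b": "https://cloud-b.example/api/availability",
}

def fetch_offers():
    offers = []
    for name, url in PROVIDERS.items():
        try:
            for o in requests.get(url, timeout=5).json():
                offers.append({
                    "provider": name,
                    "gpu": o["gpu_type"],                     # e.g. "H100"
                    "count": o["available_count"],
                    "price_per_gpu_hour": o["usd_per_gpu_hour"],
                })
        except requests.RequestException:
            continue  # one provider being down shouldn't break the whole query
    return offers

def cheapest(offers, gpu="H100", count=8):
    # Filter to offers that satisfy the request, then pick the lowest hourly price.
    matches = [o for o in offers if o["gpu"] == gpu and o["count"] >= count]
    return min(matches, key=lambda o: o["price_per_gpu_hour"], default=None)

print(cheapest(fetch_offers()))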
Nathan Labenz: 45:27 What's the smallest unit of compute that would make practical economic sense for somebody to contribute to the marketplace today?
Vincent Weisser: 45:37 Yeah. So basically, we have every chip that is in demand. It's kind of like a market where, like, the majority of the demand right now falls on, like, H100s, like, GH200s, like, kind of like, A100s still to some extent, or 3090s. But, like, there's obviously, over time, I think, reason to believe that, like, for example, for synthetic data generation, you could even use your, like, MacBook or something. That's not our primary focus right now, but, like, there's compute everywhere, right? And it's not just in the data centers and H100s, even though that's our primary focus. Like, over time, I think the goal is basically that you can literally leverage all the compute out there. Obviously, some of it is much more useful for the different AI workloads. But also, like, now, for example, that we're shifting to the synthetic data paradigm, suddenly some new compute starts to become useful, right? Like, now that NVIDIA is shipping their own, almost like, home GPUs, right, those become potentially interesting over the next few years, like, especially if, like, a lot of people get them. So I think, like, it's all of the compute, but, like, the compute that's in highest demand, like, will be more successful on the market, right? So it's, like, very market driven in that sense.
Johannes Hagemann: 46:44 Exactly. For the users on the platform, obviously, all kinds of compute resources would be useful, right, for their development purposes and stuff like this. For the more distributed training component of it, right, contributing resources there, it's very much more focused on the high capacity GPUs. Right? Like an H100 or A100 node, a full node, or even full clusters. Right? Just because of memory requirements and other components needed to actually do a distributed training across those. So there the focus is more on higher end GPUs. There are still use cases for, like, your 4090s, the smaller clusters and stuff like this, right, for large scale synthetic data generation and other things. For training, you need quite a bit of memory, and that's why it's more focused on A100s and H100s, also for, like, our distributed training runs.
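A rough back-of-envelope calculation helps explain the memory point: with Adam-style mixed-precision training, each parameter typically costs on the order of 16 bytes of training state (fp16 weights and gradients plus fp32 master weights and optimizer moments), before counting activations. The numbers below are approximate and assume no sharding or offloading tricks like ZeRO/FSDP.

# ~2 bytes fp16 weights + 2 bytes fp16 grads + 4 bytes fp32 master weights
# + 8 bytes Adam moments = ~16 bytes of training state per parameter (activations excluded).
BYTES_PER_PARAM_TRAINING = 16

for params_billion in (1, 10, 70):
    gigabytes = params_billion * 1e9 * BYTES_PER_PARAM_TRAINING / 1e9
    print(f"{params_billion:>3}B params -> ~{gigabytes:,.0f} GB of training state")

# A 10B model already needs roughly 160 GB of state, i.e. a multi-GPU A100/H100 node,
# while a single 24 GB RTX 4090 cannot hold it without aggressive sharding or offload.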
Nathan Labenz: 47:28 Part of why I'm interested in sort of the economics, or the sort of practicality, of contributing to the market is it seems like in the long term, we may be headed for a fight about, like, who can compute, under what controls, under what circumstances, with what oversight, whatever. The sort of trillion dollar data center is one idea, where it's like, it's all super aggregated, it's all super centralized. Obviously that also becomes, from, like, a geostrategic standpoint, hard to defend, but it is, like, probably relatively easy for a government to control. You can sort of fence the perimeter and know who comes in and out. On the other end, there's not really anything like this for AI yet, but there is Bitcoin and Ethereum. And these are things that truly are not something people can shut down; it would be very difficult at this point for even, probably, I mean, tell me if you think I'm wrong about this, but it would seem like even the United States government, even if it really wanted to, would have a really hard time shutting down those networks. So I'm kind of wondering, like, where are we on that spectrum today with these sort of global decentralized compute fabrics? Because if you have to be at a significant scale and it's basically a data center, it seems like we're obviously not at the extreme of the trillion dollar data center, but a lot closer to that, where it, like, could be controlled or governed or whatever. Whereas if you really could get down to, I got a Mac mini on my desk that's idle and I'll rent that to you for 5¢ an hour, potentially not even for the money, but because I want to contribute to a, you know, global network that nobody can shut down, then we sort of are in a very different regime. So where are we today, and where do you think we're going?
Johannes Hagemann: 49:13 To maybe touch on the first point you mentioned, the trillion dollar cluster: I don't think it's going to be one single cluster. Even the Stargate announcement will be distributed in a way, right? All the big techs are looking into how to train across their distributed clusters as well. They obviously have a more favorable environment than we do: they have cables between their data centers, 40 gigs of interconnect at least, up to 100 gigs, which we obviously don't have right now in a globally distributed setting. But even there, I would argue it's not that easy to control, because you don't just have one single location that you have to prevent somebody from accessing. It's going to be distributed across the globe. Even if you wanted to build something like a single gigantic cluster, because of energy requirements you can't build a 10 gigawatt cluster right now; we're going to be limited to multiple 500 megawatt clusters or something like this over the next couple of years.
Vincent Weisser: 50:01 Yeah. And what I would add is, with this distributed training run, we trained across Europe and Asia. I think both things can be true at the same time. Compute is already extremely dispersed: it sits in Europe, it sits mainly in America, it sits in Asia, including China, and there's a ton of compute in India. There's a ton of compute in other, almost freer nations, be it Singapore or Malaysia, and a lot of the compute is shifting to the places where it is the most free, where you can generate revenue and are not encumbered by stringent regulation. So places like Malaysia and Singapore are seeing a boom in compute. That's one piece. The other is that we'll see these big data centers, but there's a gigantic long tail of compute. There are a lot of data centers that have thousands or tens of thousands of GPUs; most don't actually have 100,000 or a million, right? Those you can count on one hand, the clusters with 50,000 to 200,000 GPUs, like the xAI, OpenAI, and Microsoft clusters, but there's so much more compute beyond them. And the other part, to your point, is all the lower power compute sitting in people's homes. There are a ton of people with 4090s and 3090s across the world, right, in every part of the world.
Vincent Weisser: 51:13 So I think it's not only extremely naive to try to regulate compute, it's also extremely dangerous. David Deutsch said in a recent interview that it's like regulating code or math: on the one side extremely dangerous, but also extremely naive, in the sense that it has the exact opposite effect. It leads compute to shift to nations that are even harder to regulate and observe. It shifts to the margins and to the actors that don't obey regulation, right? To an extent you could see this even with Bitcoin and Ethereum, and I think the exact same thing will play out. A lot of mining went to the cheapest regions in the world. A lot of it was in China, then China banned it, and a lot of it moved to the US, and a lot of the associated business moved to the US, but energy was cheap in other regions too, so it also moved to Iceland or parts of Africa. I think that's how I see this playing out as well. Europe may be the best example. We're both from Europe, and we both left Europe, and one of the reasons is that Europe is co-opted by regulators who ultimately don't have the best interests of the citizens, of the future, and of the economy of Europe in mind; they overregulate. Europe barely even had an AI community, right? Now it's dead. It was not really breathing and alive before, but it's nonexistent now. And I think it's the best example of zooming in on a region that completely forfeited its participation in the new intelligence age because of a few co-opted regulators. And I know the people who pushed that regulation. They had the best intentions. They didn't plan to destroy the future of the whole continent, but they did. That's a cautionary lesson, and fortunately I'm very excited for America, because ultimately America isn't repeating that mistake. It was on track to repeat it under the last administration, and fortunately we're on a brighter path toward an optimistic future. Otherwise the US would have said goodbye to its future and its stake in AGI, and that would have been extremely dangerous, pushing things even harder toward Asia than they already are right now with progress like DeepSeek. So I think it's important to think through the second and third order consequences, and what the goal ultimately is. There have been a lot of policy people I know, including some of the most influential ones, who really regret some of the policy they argued for. I know some of the people who indirectly funded some of the biggest efforts and are not big fans anymore, and they've publicly stated it or will continue to publicly state it. You would be surprised how many of the people who actually enabled big policy regret having done so and have stopped. I think that will continue to be the case, and it will be more public in the future, because now the Overton window has shifted.
Nathan Labenz: 54:10 What policies do you think people are regretting?
Vincent Weisser: 54:13 I think the EU AI Act will go down in history as one of the worst ideas, not only for Europe, but also for having essentially zero impact. It didn't really improve anything for anyone, literally nothing, but it made things worse for small AI startups. Obviously it's a general problem Europe has: it's very proud to regulate, and very proud to overregulate industries that are tiny to nonexistent. So the EU AI Act is a very concrete example. I think even the California bill is an example; even the people who argued for it said it should have been done at the US level instead of just at the level of California. Those two stand out for me as the two most well known policy initiatives. And I know some of the key policy people at the big AGI labs have since left and have regretted it. Some of the by far biggest funders of policy, if you talk to them, regret that they made those donations, and I think there will be more of that, and it will be more public in the future. So there's a lot where people are realizing the outcomes of policy. Funding superalignment, great, there should be more of that. But funding policy quite quickly goes wrong: it quite quickly gets co-opted, quite quickly has unintended consequences, and it can go horribly wrong. I think that has, to an extent, been the case with a lot of the policy work. On the other side, the majority of proposals are about compute governance, right? People like Jack Clark also commented on our work and said, hey, I think those folks need to update; compute governance in the face of decentralized training makes zero sense, especially how it's done right now. So that's the other piece: policy proposals were very slow and lacking, but also very naive, and ultimately not in the spirit of freedom, not in the spirit of democracy, not in the spirit of free markets. And I think that's very problematic. David Deutsch actually did a really good interview on this, where he basically said that if you're worried about seemingly unsolvable problems, the only path forward is to solve them, and AI can help solve those problems. AI, as we showed with one of our first model releases, can help make the world safer with regards to biosafety. So the philosophy underlying this more broadly is, I think, well encapsulated by the d/acc idea: accelerating defensive technologies and differential progress toward democracy and decentralization. That is probably the path to the brightest future in this direction. It doesn't include governments overregulating; it's rather about making sure that the good actors are supported while the bad actors are stifled. Policy and regulation often have the exact opposite effect: the best, most well intentioned actors will follow it, and the bad actors don't care. And I think that's the reality, to an extent, of how it's already playing out with compute governance.
Nathan Labenz: 57:20 I feel you on the counterproductive nature of many regulations, for sure. I'm personally worried about that right now with all of our policies directed toward China. No matter how many arguments I hear to the contrary, I can't shake the feeling that the actions we're taking now are going to make the situation worse. You also mentioned superalignment; that was one other thought I wanted to pick up on. What do you think is reasonable for society to expect of the frontier labs? I mean, I was a supporter; I went back and forth on SB 1047. Initially I thought, yeah, it seems okay. There were many drafts, of course, and there was one moment in time where it seemed like this panel or committee or commission, whatever it was officially called, was going to be kind of overempowered. And I was like, man, I don't know, it doesn't sound that great to have these five people, appointed by whom exactly, the governor, getting to make all these decisions. So then I was out on it when that was the version being considered. Then they cut that from the final version, and I would summarize the final version as basically: if you are doing frontier AI work, you have to have a safety plan, you have to publish that safety plan, and you at least have to be open to a certain amount of scrutiny. That, in the end, seemed pretty reasonable as these things go. I'm very mindful of the risk of unintended or even outright opposite consequences from what policymakers intend, but that seemed like a pretty light touch, not a super crazy sort of thing. Even if it's not law, you could frame it as law, policy, regulation, whatever, or you could just frame it normatively: in your opinion, what does it mean for a frontier AI developer to be acting responsibly in today's world? Because it does seem like they're taking potentially disruptive enough steps that they should be proceeding with caution, trying really hard to act responsibly. But what does that mean in your mind, and what should we expect of them or demand of them?
Vincent Weisser: 59:30 Yeah. I think a key aspect is this differentially accelerating defense, right? Ultimately, you and I can't defend against an AGI if we don't have one ourselves, whether to defend against cyber, bio, or other risks. Going back to the frontier labs, I think a lot of the way it's currently being done is very reasonable: they do red teaming, they have their own goals to publish something that is safe and that can't easily be misused. Ultimately, you can create a technology, and the burden is to an extent on the user. A lot of things can be misused, be it the internet, cars, or anything else, and ultimately most people don't; most people use them lawfully and carefully. I think we should assume the same with AGI: the vast majority of humanity will use it for the best purposes, and it's a very tiny minority that will misuse it. That's really where the focus should be: stopping misuse of general-purpose technologies. But it also depends, I think, on the flavor of AGI research. There are very reasonable steps being taken by all the labs on safety and alignment and all of these things. And the benefit with open source is that you actually have more oversight, more transparency, just by design. People are able to jailbreak the models and create solutions for the jailbreaks; they are able to very quickly make those systems much more robust. So if I think about a system that I want to hand my healthcare to, the education of my kids, key parts of societal functions, I want to make sure we can see the code, verify it, run tests against it, and test it better. A black box, as asserted by the AGI labs, doesn't instill a lot of confidence in me. So I think there are fair demands for people to get more insight into those frontier models than just an API endpoint, which limits your access to the capabilities those models actually have. And I know you, I think, also red teamed some of the models in the past, and I'm curious about your perspective: there's a lot one can find out even using those closed black boxes, but it doesn't mean you can look into the black box or understand what's really happening. And I think that's a problem. I think we would be in a safer situation if all the frontier AGI models were completely open source, similar to the internet: a lot of the building blocks of the internet are extremely open, similar for blockchains, where all of the code is open and they're extremely safe.
Ultimately, they are safer for being fully open and having this almost adversarial environment where people try to break them and, in most cases, can't. I think that would be the safest path for this technology as well. And in the face of this, given that they are black boxes, I think closed models should actually face more stringent requirements than open source models. Because you can't inspect them, I think they should open their code to a set of people who look deeply into it and run tests that go beyond just hitting an API endpoint. So the jury is really out on whether closed models get the same rigorous testing and transparency that open source models have by nature. But yeah, I'm curious how you think about it, also from having red teamed them.
Nathan Labenz: 1:03:16 I'm of mixed mind on this. My background has always been pretty libertarian politics, always skeptical of regulation, generally pro capitalism and free markets, within a pretty wide range of outcomes. I do wonder how well the open source label really applies to models, because models are not that much less of a black box. They can be somewhat less of a black box when you have the weights; I'm a very small angel investor in Goodfire, for example, and they're doing interpretability on Llama 3 70B, or 3.1 or 3.3 or whatever it is exactly. So there is some degree to which that black box is starting to get cracked open, but we're still nowhere close to being able to say that Llama 3 point whatever is fully under control, or won't do bad things, or anything along those lines. I find myself always looking for some way to get the best of both worlds, to square the circle. I do think more than API access, and generally more access for safety and academic researchers at the big labs, should be a pretty significant priority. The access I had was purely API, very limited, very little information. That's gotten a little better, but I would like to see it get better still. But then I also worry, and I think the R1 moment is a good example of this, that we don't know exactly what the trajectory of these technologies is going to be. So there is something risky about open sourcing frontier stuff, because you can't really take it back. If you open source a Llama 70B and then somebody comes along and does some gain of function research on it, as happened with R1: in the R1 paper, they found they were able to take these midsize open source models to dramatically higher levels of reasoning capability, in a way that, if you'd asked the people who made those models, hey, what do you think is the best anybody will ever be able to do if they fine tune this on these math benchmarks, they'd have said, oh, probably not that good. And now all of a sudden, whoa, there's a huge unlock, and society as a whole just has to accept that that's now the norm. So I do wish there was some way to enable the access, enable the research, enable the understanding, but also give us some way to take certain things back if we make mistakes. So far, I think we're mostly just protected by the fact that the AIs that are out there are not that powerful, and we haven't had anything we really needed to take back. But it certainly seems realistic to me that in the not too distant future, you could have a sequence of events where something a bit more powerful gets open sourced, and it seemed fine at the time, but then there's another unlock of post training or whatever that takes it to a whole other level. And then it's like, geez, this now becomes a real problem, and yet there's not really anything that can be done about it. I think you guys actually have one really good example, and maybe I'll just jump to that now, because you mentioned the d/acc philosophy.
And honestly, one of the best examples of an open source AI project that I think fits that description is the recent work that you put out. I was really interested to see, too, that it was in partnership with SecureBio. So do you want to tell us about the work you've done in biology recently that illustrates how we can maybe get the best of both worlds, at least in some domains?
Vincent Weisser: 1:07:00 Yes. So for context, we supported the training of a metagenomic foundation model for early pandemic detection in wastewater, which can also be extended more broadly. And I think two or three things are interesting about it. On one side, it's how little compute it took: I think it was something like $20,000 to $30,000 of compute to create a state of the art frontier model for detecting pandemics earlier. And we'll continue to train it in a distributed fashion to improve it; there's much more data we can train it on. But the other interesting point is that there are a lot of defense-favoring open source models like this that could be built, and a lot of objectively high impact, positive models that haven't been built yet. The best examples are scientific foundation models: if you look at something like AlphaFold, there's very little misuse risk and gigantic upside for humanity, so even from an almost utilitarian perspective it's very heavily skewed toward positive outcomes. To explain how it works: the data is mainly a lot of wastewater sequencing data, used to detect what's in the water, and from that you can detect outbreaks of pandemics early, which was actually how COVID was detected in one of its earliest instantiations, through wastewater monitoring. You can even monitor the growth and spread of those risks. So I think that's a good example where, if we put those things in place, we're in a much safer world in general, irrespective of AGI. COVID didn't happen because of intelligent systems; at this stage it quite plausibly happened through misuse or mistakes by nation state actors, by the US and China and their involvement in gain of function research. And ultimately society lacked the antibodies and the defense mechanisms to contain it. So that's really how we get to a utopian, safe AGI future: look very concretely at the risks, biosafety being one of them, and then just solve them, so that even if there's a next COVID or pandemic, which surely there will be, it's not actually a problem, because we have the vaccines, the early detection, and all the other defenses to keep it from spiraling into a catastrophe. For the model itself, it was a collaboration with UCL, the Nucleic Acid Observatory, and the SecureBio team. They did the research work, and we helped them on one side with compute, but also on the training side, making sure a good model came out of it. And it's a good example that there are a lot more projects like this where, with even a few million dollars of compute, you could really transform scientific progress and human progress more broadly. And that goes back to the name INTELLECT, right? It's solving intelligence to solve everything else.
And the most important thing to solve, really, is science: solving all the diseases, solving natural risks, but also things like climate and energy, and a bunch of other pieces that will create an enormous amount of human flourishing and progress. So it's very key to our broader goal to help solve science. And with a lot of the connections we've made in that world, I think a key piece of this is working toward autonomous AI research and autonomous scientists that can safely advance different areas of science. That's bottlenecked almost entirely by the amount of compute you can feed it to do reasoning, solve those problems, and simulate them. So I think that's a world we're already starting to move into, a world in which AI will massively accelerate science and scientific progress more broadly.
Nathan Labenz: 1:10:54 Yeah. I've been tracking that general phenomenon in a lot of different domains, and like everything else, it's really ramping up. What I found really interesting about this particular model, and I'll just read a couple of highlights off your blog post about METAGENE-1, and then I'd be interested to hear what other ideas you think have a similar shape in terms of their defense-favoring properties. This one I thought was really thoughtfully designed, because of course we've heard from so many worriers, myself included from time to time: jeez, do we really want an expert virologist in everybody's pocket? That seems like maybe not a great technology to make freely available in a way we can't take back. This is almost the exact inverse of that. What jumps out to me as the key architectural decision is that the model only has a 512 token context length. That's obviously a lot shorter than any genome, so it can't be used to generate full genomes. There's been a lot of talk, and I'm not sure how much research exactly, although I did just see one paper come out that I want to circle back to, about creating AIs that are myopic in general, which is to say we don't want these things necessarily thinking through super big long term plans, we don't want them doing huge things we don't understand, we want them to do really well at the thing we put in front of them. And this is that, right? With 512 tokens you can only do so much, but that is enough to detect anomalies. So you've created something where it seems overwhelmingly likely that you could put it in everybody's pocket, and really the only thing they could do with it would be to monitor their local wastewater for anomalies. I love projects like this that are the unilateral provision of a global public good; that's one of my favorite phrases. And this really fits the bill: for the people who need it, it's incredible; for the general balance of attack and defense, it seems extremely useful; and yet it's really hard to imagine how anybody could abuse it, and that is baked into the artifact itself. Now, when you imagine a general purpose scientist, I'm not so sure that trait is always there. What other projects do you have in mind that have that set of properties?
Vincent Weisser: 1:13:24 Going to this point, I think that's really the philosophy: figuring out what could be the biggest contribution, something that has a disproportionate impact relative to the amount of compute it takes. There are other things you could do with the same H100 hours that have far less impact, so it's really about figuring out the lowest hanging fruit for human progress, for this differential progress. I've had chats with different people in biosafety about the work we did, and they're very excited to open up almost like a distributed pathogen detection and monitoring effort, where you could at some point have millions of places analyzing those samples, which already happens today, but obviously not at a granular, individual home level, more at a city level, or at airports and things like this. More broadly, we basically want to make sure that we build the key pieces and the key research needed to scale progress radically. It's almost like figuring out how to create systems with a human in the loop that do science autonomously, or AI research autonomously, in the same fashion. I think that's also the path to things like superalignment. At the core of this philosophy is that, realistically, the only way to solve superintelligence alignment, and the only reasonable path to having an autonomous scientific system solve for global pandemics, is to have the right guardrails and the right humans in the loop, plus the benefits of open source and a lot of participants to drive those systems in the best direction. So we have multiple things planned in this direction, including other scientific foundation models that generally advance progress, for example a virtual cell foundation model is one idea, or things like a safe autonomous scientist where you also have those mechanisms in place for humans to interact with AI scientists. I think those are the most promising paths to radical, positive human progress. And I think one needs to do this as iterative deployment, where one builds out in small steps, then improves it, tests it better, scales it, and then makes it accessible to everyone, to have the biggest potential impact.
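[Editor's note: a toy sketch of the kind of short-context anomaly flagging discussed above. This is an assumption about how such a detector could be wired up, not a description of METAGENE-1's actual pipeline; the `negative_log_likelihood` scoring call and the threshold are hypothetical.]

```python
import math

# Toy illustration only: a 512-token-context metagenomic model can score
# individual sequencing reads, and reads it finds very surprising (high
# perplexity relative to normal background) get flagged for human review.
# No full genomes are ever generated or even representable in context.

MAX_CONTEXT = 512  # reads are scored independently within a short window

def read_perplexity(model, read_tokens):
    chunk = read_tokens[:MAX_CONTEXT]
    nll = model.negative_log_likelihood(chunk)   # hypothetical scoring call
    return math.exp(nll / max(len(chunk), 1))

def flag_anomalies(model, reads, threshold=20.0):
    # Returns the subset of reads whose perplexity exceeds the threshold.
    return [r for r in reads if read_perplexity(model, r) > threshold]
```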
Nathan Labenz: 1:15:39 Ready to switch gears to distributed training? We've covered a lot of ground, but now might be a good time to dive into the technical details. So again, just to remind myself and everybody, why does this matter? Multiple reasons, right? In the limit, it could create a truly decentralized AI infrastructure that nobody can control. It's also just a challenge to build clusters beyond a certain scale. There's also this spare compute notion, and there are major implications for compute governance. Lots of reasons to care about distributed training. You guys have an excellent blog post that is nine months old, but I still found it very well worth reading today, and honestly that's a rare accomplishment in AI, to have something still worth reading nine months later. So I commend you for that. I'd love to walk through it and talk through the details, and then we can go beyond the time of the blog post too, because you've actively contributed to the frontier since then. Let's start maybe with what makes distributed training hard. Why was it ever a question whether it would work?
Johannes Hagemann: 1:16:52 Yeah. This is mainly about the bandwidth requirements you have for your usual distributed training. Even distributed training inside a single cluster is quite hard to do: you have to make sure that whatever parallelization strategy you use to spread your workload across all the GPUs in one data center works efficiently, and the different parallelization strategies have different memory and bandwidth requirements. One of the ones that's also used in our DiLoCo-based approach, for example, is normal data parallel training, where you do the forward pass and the backward pass of your whole model, and then you only send the gradients at the end, in a very efficient all-reduce manner, across all the nodes in the network. There are other techniques with higher bandwidth requirements. One is tensor model parallelism, where you split the weights of your model across different GPUs, and then you have to communicate between GPUs for basically every single layer of your transformer model. Another technique we mention in the blog post is pipeline parallelism, where you have different stages, each holding a different block of layers of your transformer, so you only have to send the last activation hidden state on to the next pipeline stage; that has lower bandwidth requirements than tensor parallelism, but still a lot more than data parallelism. Those are the three main techniques; there are a couple of other things people have figured out over the last few years, but those are the main ones. Most people going for distributed training across non-collocated clusters are in the data parallel regime, which in a normal setting is still way too communication intensive to actually do with interconnect slower than 100 gigs for large models. So even in this regime, you have to find techniques that reduce communication between the nodes. One of the techniques we've been working on heavily, and building on top of, is DiLoCo, distributed low-communication training, which is basically a technique where you train on those different islands of devices in a data parallel fashion, but instead of syncing every step of your training, you only sync every couple hundred steps: you send sort of pseudo gradients after those inner steps, and then there's an outer optimization loop to merge the results from all those different nodes. And it basically works about as efficiently as training in a centralized cluster, with some limitations on how many nodes you can scale to and where it works well. At the beginning of training, it doesn't work as well, but in later stages it's pretty much as efficient as your usual data parallel training.
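[Editor's note: a minimal sketch of the DiLoCo-style inner/outer loop described here, assuming PyTorch and a Hugging-Face-style model that returns `.loss`. Hyperparameters, the AdamW inner optimizer, and the plain-SGD outer update are illustrative; this is not Prime Intellect's actual code, and the outer Nesterov momentum from the DiLoCo paper is omitted for brevity.]

```python
import torch

# Sketch of one DiLoCo round: many local steps with no communication,
# then a single exchange of "pseudo gradients" and an outer update.

def diloco_round(model, data_iter, inner_steps=100, inner_lr=4e-4, outer_lr=0.7):
    # Snapshot the globally synced parameters at the start of the round.
    global_params = [p.detach().clone() for p in model.parameters()]

    inner_opt = torch.optim.AdamW(model.parameters(), lr=inner_lr)
    for _ in range(inner_steps):                 # local steps, no communication
        batch = next(data_iter)
        loss = model(**batch).loss               # assumes HF-style model output
        loss.backward()
        inner_opt.step()
        inner_opt.zero_grad()

    # Pseudo gradient = synced weights minus locally updated weights.
    pseudo_grads = [g - p.detach() for g, p in zip(global_params, model.parameters())]

    # One collective per round instead of one per step (placeholder for the
    # real all-reduce; averages pseudo gradients across all workers).
    for pg in pseudo_grads:
        torch.distributed.all_reduce(pg, op=torch.distributed.ReduceOp.AVG)

    # Outer update: treat the averaged pseudo gradient like a gradient and
    # step the synced weights toward the averaged local results.
    with torch.no_grad():
        for p, g, pg in zip(model.parameters(), global_params, pseudo_grads):
            p.copy_(g - outer_lr * pg)           # momentum omitted for brevity
    return model
```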
Nathan Labenz: 1:19:51 Okay. Let me take that back from the top and make sure I understand it; you can then correct me and elaborate on the things I'm getting wrong or oversimplifying, because I do think this is really important for people to understand. In terms of the fundamental challenge of a naive setup: data parallelism is the first kind of parallelism, and it just means that if I'm going to train a giant model, on, for example, 15 trillion tokens as some famous open source models have done, I'm not going to do one token at a time. The whole game, the reason I have these big clusters, is that I need to parallelize this out; I can't possibly have one model do a forward pass, take the gradient update, and just keep using that one copy of the model. You'd never get there. So data parallelism is basically: you make a bunch of copies of the model, you run the forward passes and take the backward pass gradients on all these independent copies, each of which is processing its own data, and then the aggregation step is the part where a lot of stuff has to fly around. I think it's underappreciated that the gradient is basically the same footprint as the model itself, right? Because if I've got 100 billion parameters sitting on a server somewhere and I take a backward pass, the naive approach is that I have one adjustment for each of those 100 billion parameters, so 100 billion adjustments that I want to make to the model. And now, if I'm 10,000 GPUs wide, each having processed its own data and found its own gradient in order to learn from the particular data it just crunched, I've got 10,000 times 100 billion numbers that I need to bring into some aggregate form to get the cumulative gradient for this training step. Then I can apply that to all the copies of the model and take the next step. And I'm not sure how much we need to add on top of that in terms of tensor parallelism or pipeline parallelism to help people develop a decent intuition. Maybe there's more you would add, but what I just described feels to me like the fundamental issue; tensor and pipeline parallelism also matter because, if your model is too big for a single GPU, you've got to split the model itself, and that adds even more complexity and overhead. But even if you had giant GPUs that could fit all the parameters onto one, you would still have this data parallelism problem where you have to run a bunch of copies and figure out some way to aggregate the changes that each instance of the model wants to make, so the overall model can keep getting better. Let me pause there. Is there anything more, in terms of fundamental intuition, that somebody who's not going to do this themselves but wants to understand generally what's going on needs to know?
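[Editor's note: to make the footprint argument concrete, here is a rough back-of-the-envelope calculation with made-up numbers (100B parameters, fp32 gradients, a 1 Gbit/s internet link); none of these figures come from the episode.]

```python
# Rough, illustrative numbers: what a naive per-step gradient sync would move
# for a 100B-parameter model trained with data parallelism over the internet.

params = 100e9                 # 100B parameters
bytes_per_grad = 4             # fp32 gradient per parameter
grad_size_gb = params * bytes_per_grad / 1e9
print(f"one gradient copy ~ {grad_size_gb:.0f} GB")        # ~400 GB

# A ring all-reduce moves roughly 2x the gradient size per worker per step,
# largely independent of how many workers participate.
per_step_traffic_gb = 2 * grad_size_gb
link_gbps = 1                  # ~1 Gbit/s link between sites (assumed)
seconds = per_step_traffic_gb * 8 / link_gbps
print(f"per training step over a 1 Gbit/s link ~ {seconds / 3600:.1f} hours")
```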
Johannes Hagemann: 1:23:09 No, I think that's great. It probably makes sense, as part of this conversation, to stay in the data parallel regime, because the others are even harder to do in a distributed fashion; there are techniques for those too, obviously, but for this conversation it makes sense to stay in the data parallel regime and understand that case. But yes, as you said, all those models have large memory requirements too. You can't train a 100 billion parameter model on a 4090, because it takes too much memory; even the parameters alone are too big. And that's especially true for the training use case, which people don't always realize: you have even larger memory requirements than for inference. In training, you need your model parameters, you have your gradients, which are the same size as your parameters, and then you also have a huge optimizer state needed to update your model parameters. Those things take a lot of memory, which is one of the reasons why, for the 10 billion parameter run we did with INTELLECT-1, the biggest thing we were able to fit on a node at the time was around a 10 billion parameter model in terms of training memory requirements, and then we did data parallel training across those nodes.
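[Editor's note: a rough sketch of the memory arithmetic behind this point, assuming a 10B-parameter model, bf16 weights and gradients, and an Adam optimizer with fp32 master weights; the exact breakdown depends on the implementation, and activations and buffers are ignored.]

```python
# Rough training-memory arithmetic for a 10B-parameter model (illustrative).

params = 10e9
bytes_bf16, bytes_fp32 = 2, 4

weights_gb   = params * bytes_bf16 / 1e9            # bf16 weights
grads_gb     = params * bytes_bf16 / 1e9            # gradients, same shape as weights
optimizer_gb = params * (bytes_fp32 * 3) / 1e9      # fp32 master weights + Adam m and v

total_gb = weights_gb + grads_gb + optimizer_gb
print(f"~{total_gb:.0f} GB before activations")     # ~160 GB, i.e. full-node territory
```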
Nathan Labenz: 1:24:18 Basically, the momentum term helps stabilize training overall, right? You could take the gradients without the momentum term, but then you have convergence problems because everything's thrashing around, and the momentum term keeps your individual updates from going too far in unusual directions. But again, you have one of those momentum values to keep track of for each parameter, so essentially you have a multiple of however many parameters you have, a multiple of that number of numbers, that you need to keep track of during the training process and then figure out how to aggregate across workers. So when you go to something like DiLoCo, which is the distributed training scheme that DeepMind put out and that you've elaborated on since then: is it as simple as the insight that you can just run more local steps, keep track of things locally, and you simply don't need to aggregate every step and it still works? Or is there more to it than that?
Johannes Hagemann: 1:25:23 For sure, it's a very empirical result, just showing that it works by doing this. It's also not the first work in this direction, obviously; the whole federated learning literature has been doing a lot of things along these lines, and as a general algorithm it's more of a local SGD type of approach, where you do those local steps and then have an outer step. It was just shown to work very well by Google DeepMind; I think in their paper they went up to around a 400 million parameter model, which in retrospect is not that large for language models these days. We scaled it up to the billion parameter size, where it was still working, and then to the 10 billion parameter scale with the INTELLECT-1 run. Unfortunately, we didn't have a baseline run to compare to, because at that scale the costs are a bit too large for a small startup to run baseline ablations on those kinds of models. But as a general intuition: you can do those local steps and don't have to sync as often. It works a little less well at the beginning of training, which is one of the things the DiLoCo paper showed and we showed later on as well. At the beginning, the gradients all have to move in a very clear, specific direction, and the merging doesn't really help much. But in the later, flatter regime of the loss, it works really well.
Nathan Labenz: 1:26:44 Okay, help me understand that a little better, because I almost had the opposite intuition at first. When you're learning very quickly, it would seem like maybe you need to share those learnings sooner, but it seems like you're saying the opposite. I don't know, I'm confused. Help me explain that a little better.
Johannes Hagemann: 1:27:01 At the beginning, there's not much extra signal from syncing, because all the workers are moving in the same direction anyway. Later on in training, there's a lot of signal coming from all those different data parallel ranks, and the technique just works better in those later stages. It's also very much empirical, in the sense that we've seen DiLoCo converge a little slower at the beginning and then pretty much catch up to baseline experiments that just do standard data parallel training.
Nathan Labenz: 1:27:27 Yeah. Okay. That's interesting. So there's not much of a trade off?
Johannes Hagemann: 1:27:32 There are certainly still things to solve. There are three major components we've been working on since INTELLECT-1, for the INTELLECT-2 run. For one part, the communication requirements are still too large for training, say, a 100 billion parameter model; with this technique we would need even less communication, and we've been working on different techniques there to quantize the pseudo gradients, along with other corrections. So that's one limitation. Another part is just memory requirements. If you take larger nodes of H100s, or multiple nodes, it would fit, but to actually make it possible for everybody to join, the memory requirements should at least fit on an A100 node or an H100 node. So there's still a limit on how far you can scale this, as well as on the number of workers. DiLoCo currently still has diminishing returns in how many workers you can scale to: when you exchange the pseudo gradients after, say, 500 inner steps, the signal gets lost a little if you have too many workers. What we've empirically tested with the INTELLECT-1 run scales well to something like 16 workers, for example. But to make a truly more distributed run happen, we have to scale it to hundreds and thousands of workers.
Nathan Labenz: 1:28:51 Yeah, that's interesting. So the intuition there is simply that each of these workers is kind of overlearning in whatever random directions it happens to be overlearning in, and if you have too many of them, the whole thing becomes noisy: because those overlearnings are happening in various directions, they're kind of canceling each other out, and the signal gets lost.
Johannes Hagemann: 1:29:19 Yeah. It's not that it stops working entirely; we've scaled to more workers in our testing, it's just not as compute efficient anymore as standard data parallel training, and that's what you want to aim for. You can take some trade offs, because in a global setting the flops are maybe cheaper: we can run spot instances, we can have people contribute their compute resources, and so on. So the flops are cheaper but the bandwidth is more expensive, and we still want to be on par, in a sense, with how efficient it is to train in a centralized cluster.
Nathan Labenz: 1:29:52 Yeah, gotcha. So you mentioned quantizing gradients. I assume that's a similar, maybe even the same, idea as saying you could round a lot of gradients down to zero, so anything below a certain size you just don't have to send across the wire at all. What other tools are in your toolbox there?
Johannes Hagemann: 1:30:18 Yeah. We obviously have a lot coming up in terms of improving the whole algorithm; we're submitting a paper on it soon. What we've done so far builds on the realization that what's actually sent after those hundreds of steps is a pseudo gradient, not the actual gradient of the model: it's the difference between the weights at the beginning and the weights at the end of all those inner step updates. And what we've realized is that those pseudo gradients are actually pretty easy to quantize. So for the INTELLECT-1 run, we haven't been sending them in 32 bit precision; we've been sending them in 8 bit precision, which gives us another 4x reduction in communication requirements. And for INTELLECT-1, it was enough to sync every 100 steps, so we had a total communication reduction of around 400x, which was enough to still train very efficiently across the whole globe, with the interconnect we had, for a 10 billion parameter model. If you do more inner steps and find more quantization techniques and other things on top of it, we could probably scale that to even larger models.
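[Editor's note: a toy sketch of the kind of 8-bit quantization of pseudo gradients described here, assuming PyTorch and simple per-tensor absmax scaling. The actual INTELLECT-1 implementation uses more careful schemes; this is only to show where the 4x wire-size saving comes from, which combined with syncing every ~100 inner steps gives roughly the 400x total reduction mentioned above.]

```python
import torch

# Toy int8 quantization of a pseudo gradient (illustrative only).

def quantize_int8(t: torch.Tensor):
    scale = t.abs().max() / 127.0                       # per-tensor absmax scale
    q = torch.clamp((t / scale).round(), -127, 127).to(torch.int8)
    return q, scale

def dequantize_int8(q: torch.Tensor, scale: torch.Tensor):
    return q.to(torch.float32) * scale

pseudo_grad = torch.randn(1_000_000)                    # stand-in for one flattened tensor
q, s = quantize_int8(pseudo_grad)
recovered = dequantize_int8(q, s)

print("bytes fp32:", pseudo_grad.numel() * 4, "bytes int8:", q.numel())  # 4x smaller on the wire
print("max abs error:", (pseudo_grad - recovered).abs().max().item())
```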
Nathan Labenz: 1:31:27 Yeah, that's interesting. The other thing that really stood out to me was the DiPaCo scheme, which I think is interesting for multiple reasons. I had previously been studying mixture of experts in general, and was interested to learn that, just like so many other things, the relationship between the experts is generally very opaque. You should not model the experts as actually being domain or subject matter experts in a recognizable area of human pursuit; on the contrary, just like everything else, why did that token get routed to that expert or set of experts? We don't know. But this DiPaCo approach starts to segment data by longer sequences, so you don't have to juggle things on a token by token level. That seems to me potentially interesting from an efficiency standpoint, but also from an interpretability standpoint, and potentially for some of these gradient sync questions. I wonder, if you were to say, what if we sent all of our science data over to this subcluster and all of our literature over to that subcluster, would that have a chance of letting these things do more locally before they need to aggregate? And potentially, would it also produce resulting architectures where there actually is a separation of concerns, more like a traditional software project and less like the giant spaghetti black box we're accustomed to? How much promise do you think there is in that sort of, let's say, semantic data segmentation idea?
Johannes Hagemann: 1:33:13 Yeah, good question. That's what a lot of people get wrong about mixture of experts: they think there's some semantic routing in there, that it routes to specific experts for specific domains, which generally just doesn't seem to be the case. It just seems to be a more efficient way to use compute to get a lower loss. In the usual mixture of experts setup, which is behind a lot of the models, even the newer DeepSeek models, you route at the token level: for every token, you route to different experts in the MLP. The DiPaCo approach allows you to actually distribute the whole mixture of experts, which isn't really possible if you route to a different expert for every token, so you can't do it in such a granular way; you can only do it at the sequence level. For every sequence, you route to a different expert, which, as you said, theoretically would allow routing to actual domain experts in a way. Unfortunately, I'm not too bullish on the whole approach of routing by domain. I think a mixture of experts routed that way just doesn't work as well as what we've been seeing with normal token-level mixtures of experts. But I'd love to see somebody replicate this in an open way, and to see more research happening on whether routing at more of a domain level works as well. My intuition, though, is that I'm not too bullish on it.
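[Editor's note: a schematic contrast between the two routing granularities discussed here, assuming PyTorch. Real MoE routers use learned gating with load balancing and top-k selection; the `TinyRouter` class, its sizes, and the mean-pooled sequence gate are illustrative, not the DiPaCo or DeepSeek implementations.]

```python
import torch
import torch.nn as nn

# Schematic only: token-level routing (standard MoE, experts must be close
# together) versus sequence-level routing (DiPaCo-style, a whole sequence
# can be shipped to one expert/worker over a slow link).

class TinyRouter(nn.Module):
    def __init__(self, hidden: int, n_experts: int):
        super().__init__()
        self.gate = nn.Linear(hidden, n_experts)

    def token_level(self, x):                  # x: [batch, seq, hidden]
        # One expert choice per token -> communication at every layer.
        return self.gate(x).argmax(dim=-1)     # [batch, seq]

    def sequence_level(self, x):
        # One expert choice per sequence (here from the mean-pooled hidden
        # state) -> the whole sequence can be routed to a remote worker.
        return self.gate(x.mean(dim=1)).argmax(dim=-1)   # [batch]

router = TinyRouter(hidden=64, n_experts=4)
x = torch.randn(2, 16, 64)
print(router.token_level(x).shape, router.sequence_level(x).shape)
```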
Nathan Labenz: 1:34:45 And is that because you just feel like it's too clever and clever things don't work, or what motivates that? Because it would really be, in some ways, a huge win if you could get a semantic segmentation between experts, for multiple reasons. Just on the DeepSeek V3 / R1 architecture, if I recall correctly, it's 671 billion parameters, of which I think 37 billion are active at any given point in time, so something like 6% of the overall parameters in the model are used in any one forward pass. If you could do a sufficient level of semantic segmentation across experts, you could have the sort of virologist-removed package. That's another way I try to think about squaring the circle: if you could localize certain highly sensitive sets of information or knowledge in the model, and then distribute a version that just doesn't have the virology knowledge as the base open source case, that might be a great way to say, look, everybody has everything you really need, you just don't get the virologist, and you probably shouldn't complain too much about that if you're a normal user. So what can you tell me about why you think that's ultimately not going to work?
Johannes Hagemann: 1:36:09 I think it could work in a sense; you can definitely do it. I'm just not sure it works as well as just letting the model learn. Adding the inductive bias of, hey, we want to route these inputs to these specific domains, is probably not as efficient as letting the mixture of experts learn on its own where it wants to route things. And it's been shown empirically that the mixture of experts does not learn to route by domain, but rather in a completely different way that is not as interpretable for us, which is unfortunate for interpretability research, obviously. It presumably would have learned to route by domain if that had produced a better model. So that's my view on it.
Vincent Weisser: 1:36:48 Yeah, to add to that, I think there are almost two interesting things here. On one side, one can build architectures that are slightly better set up for a distributed setting. On the other side, we're obviously entering a new scaling paradigm of inference time compute, which scales much better: you get much more reasoning for much less compute, and it's actually perfectly set up for decentralized training, almost by coincidence. So looking at the present and the future, the R1 scaling paradigm is, on one side, nearly perfect for a decentralized setting; on the other side, it has very little communication requirement, because a lot of the scaling comes through synthetic data and inference time compute. And then, to your interpretability point, there are the RL reasoning chains: it's good if you can read them, it's good if you can see them, which you can't for o1 or o3, both for interpretability and safety and even just as an end user trying to understand how the model is reasoning. So in some ways it's actually quite an interpretable paradigm, and I think that's also what the OpenAI folks would argue: it shows you how it's reasoning, and you can align it, edit out specific reasoning, or add things into the reasoning, like safety checks or general checks. On the other side, though, to be realistic, it's still its own way of reasoning: it suddenly starts reasoning in Chinese, it drops random bits that you can't really understand. The scaling laws don't scale reasoning in exactly the same way as human reasoning; these models take different paths through this multidimensional space of math that is kind of foreign to us. There are probably still patterns and structure in those systems, they just don't map anthropomorphically one-to-one onto human reasoning and language. So yeah, that's the paradigm we're in, and the paradigm we're planning to scale is almost more this inference time compute paradigm, in a decentralized fashion.
Nathan Labenz: 1:39:08 Yeah, a couple of interesting points there. Very practically, on why R1 style RL training is more favorable: I don't know if you have better numbers on this, because I don't think it was in the R1 paper, but basically you're doing a lot more forward passes relative to the number of backward passes. Instead of having to do a backward pass for every document that gets processed through the model, as in pre-training, where you're adjusting on every single token (I know you guys understand this; this is more for the audience), here it's: did you get it right, did you get it wrong? Take the best of n samples, and maybe also the worst, and you only have to do the backward pass and look at the gradient for the extremely good or extremely bad generations. That dramatically reduces the overhead in terms of the bandwidth that's required. Anything else that needs to be added there? Is there any ratio? The R1 paper, I don't think, said how many generations they're doing or what their best-of strategy is, but you may know from other sources what the norm is there.
Johannes Hagemann: 1:40:30 Yeah. DeepSeek is usually pretty transparent about what they're doing, but for the R1 paper there were definitely a couple of missing details on how they actually trained that thing: what the infrastructure looked like, how long they trained for, what the actual communication requirements were. But in general, there are two components needed for the DeepSeek R1 model. One thing they showed is that supervised fine tuning on a bunch of reasoning chains, generated for tasks with verifiers using a different model, helps quite a bit in getting a better R1 at the end from the reinforcement learning layered on top later. And both of those components are favorable for us. First, the supervised fine tuning on large scale synthetic datasets is very favorable for a distributed training setting: you can generate those reasoning chains with the larger models on H100 or H200 nodes, and you can even use the smaller, distilled models for synthetic data generation, and then you just have more supervised fine tuning data to train on, which improves the performance of the model. Then there's the second stage, where you actually do the GRPO reinforcement learning on top of it. They don't give the details there, but generally, looking at what other people have been doing with reinforcement learning on top of language models, you have a bunch of rollout phases: you generate different rollouts for, say, 256 questions, only do an update step after you've done all those rollouts, then do a backward pass afterwards, accepting or rejecting a bunch of them. So you have minutes of forward passes for every backward pass, or it could even be hours if you have a lot of rollout phases. It's still very early on there, obviously, but the whole paradigm definitely looks much more favorable for a distributed training setup.
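[Editor's note: a schematic of the rollout-heavy loop described above, showing why forward passes dominate. The `policy`, `verifier`, and `weighted_logprob_loss` names are hypothetical stand-ins, GRPO details such as advantage normalization and KL penalties are omitted, and the rollout counts are illustrative rather than DeepSeek's actual numbers.]

```python
# Schematic of an R1-style RL step: many forward-only generations per
# question, scored by a verifier, and only occasional backward passes.
# Low gradient traffic makes this friendly to distributed/decentralized setups.

def rl_step(policy, questions, verifier, rollouts_per_question=16):
    trajectories = []
    for q in questions:                               # e.g. 256 questions per step
        for _ in range(rollouts_per_question):        # forward passes only, no grads kept
            answer = policy.generate(q)               # hypothetical generation call
            reward = verifier(q, answer)              # e.g. exact-match check on math answers
            trajectories.append((q, answer, reward))

    # One (or a few) backward passes over the collected batch, e.g. reinforcing
    # answers with above-average reward and down-weighting the rest.
    loss = policy.weighted_logprob_loss(trajectories) # hypothetical helper
    loss.backward()
    policy.optimizer_step()
```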
Nathan Labenz: 1:42:43 Yeah. It also probably is worth mentioning that there's not gonna be a single ratio of forward passes to backward passes, because you're also sort of
Johannes Hagemann: 1:42:53 Yeah.
Nathan Labenz: 1:42:53 Interacting there with, like, curriculum-learning-type things, or sort of: what problems are you trying, and what is your success rate on them? Because you need some signal. Right? So if you're doing problems that you only get right one in a thousand times, then you need, like, a thousand tries to get one that's right, which gives you the reinforcement signal to proceed. So that's part of why the sampling strategy is important, and they have some interesting details on that. And I think in both that and the Kimi paper, there's a sort of algorithm that's trying to make sure the difficulty of the problem is well calibrated, so that you're doing hard enough problems that you're still meaningfully learning, but problems that are in range enough that you don't have to try a thousand times before you get it right once and can learn from that. So I do think that's important to keep in mind too. So what else? I mean, there were a number of other papers that you run through in the blog post beyond what we've talked about so far. What else do you think is worthy of highlighting, or, you know, what from your work since that blog post nine months ago would you say is what people most need to know about?
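As a toy illustration of that difficulty calibration, and not the actual R1 or Kimi algorithm: one simple approach is to filter problems by an estimated solve rate, so rollouts are neither trivially easy nor almost always wasted. The `solve_rate` function and the band below are hypothetical.

```python
# Difficulty-calibrated problem selection (illustrative sketch).
def select_problems(problems, solve_rate, low=0.1, high=0.9):
    """Keep problems hard enough to teach something, but not so hard that
    nearly every rollout fails and yields no reward signal."""
    return [p for p in problems if low <= solve_rate(p) <= high]

# Example: a 1-in-1000 solve rate would need ~1000 rollouts per useful positive
# sample, so such problems are deferred until the model improves.
problems = ["easy_sum", "olympiad_geometry", "two_step_algebra"]
rates = {"easy_sum": 0.98, "olympiad_geometry": 0.001, "two_step_algebra": 0.4}
print(select_problems(problems, rates.get))   # -> ['two_step_algebra']
```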
Johannes Hagemann: 1:44:05 Yeah. Good question. If I think back to the blog post, one of the other things mentioned in there is, for example, this swarm parallelism idea. It's a paper by somebody we've been collaborating with a lot, Max Ryabinin, who's currently at Together AI. He's done really tremendous work in the whole distributed training space over the last six years and has been doing his PhD in it. And one of the techniques he came up with is called swarm parallelism, where you not only have your data parallelism across the world, basically, but you can also do pipeline parallelism across the world. At least for small sequence lengths, it's actually not that communication intensive. Right? Because you only have to sync the last activation state of your pipeline stage. So you only have to sync something which, for a transformer model, but also for all the different hybrid variants that are out there right now, is always of the size sequence length times batch size times hidden dimension, which, at least for small sequence lengths, or also small hidden size and batch size, is not that big. So that's something you can also potentially do over the internet, but then you get into other things to be cautious of. Right? In a data center, for example, you don't really have to think about latency; within a single data center it's not an issue. But across the world, latency is a huge issue. In the data-parallel regime you don't really have to think about it, because you do the update steps so rarely that if you have, like, 100 milliseconds of latency, it doesn't add much overhead. But for pipeline parallelism, you have to do that communication quite often, with very small tensors you have to send. So that's one of the things to be cautious of there. But otherwise, it's also a great technique to actually scale out to larger model sizes, which is one of the limitations we have right now in terms of how large of a model you can actually fit on one node.
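For a rough sense of the numbers, the activation tensor crossing a pipeline-stage boundary is sequence length times micro-batch times hidden dimension; the configurations below are illustrative assumptions, not any specific model's settings.

```python
# Back-of-the-envelope size of the tensor exchanged at a pipeline-stage boundary.
def boundary_message_mb(seq_len, micro_batch, hidden_dim, bytes_per_elem=2):
    # 16-bit activations assumed
    return seq_len * micro_batch * hidden_dim * bytes_per_elem / 1e6

# Short sequences: small enough to ship over the internet per micro-batch.
print(boundary_message_mb(seq_len=1_024, micro_batch=1, hidden_dim=4_096))   # ~8 MB
print(boundary_message_mb(seq_len=8_192, micro_batch=1, hidden_dim=4_096))   # ~67 MB
```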
Nathan Labenz: 1:46:01 So does that swarm-type thing start to open up the possibility of truly end-user, retail compute being contributed to training projects in the future? Because my laptop can't hold the full DeepSeek V3 or R1, but it might be able to hold one of the experts, or one of the layers. How realistic is that, or how far from it do you think we are right now?
Johannes Hagemann: 1:46:31 Yeah. It definitely would be a possibility for smaller models. Right? You still can't train a large model across it; you'd have too many pipeline stages, right, as well as the other bottlenecks you run into with latency and so on in a home-GPU-type setting. But theoretically, it should allow for it. The one thing that makes me a bit bearish on doing this in a fully distributed fashion is how the current research paradigm has been moving. One of the things happening is that we keep going to ever larger sequence lengths, because we need those for the R1-style reasoning components. Right? And whenever we increase our sequence lengths, it doesn't make a difference for data-parallel training: there you only have to send the gradients, which don't grow with the sequence length. But in pipeline parallelism, your activations grow with the sequence length. And if we go to million-token sequence lengths to train something like an o1-type model, your activations are going to be way too large to actually send. So I think there are better techniques. One of the things we've rather focused on is offloading a lot of the optimizer states, for example, on a single node; there are techniques out there, with ZeRO-Offload and a lot of more modern approaches, to make that efficient and train, like, a 100-billion-parameter model on a single node of H100s or A100s.
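Extending the estimate above, here is a rough, illustrative comparison of what each scheme has to send: data-parallel syncs scale with parameter count and are unchanged by sequence length, while pipeline-parallel activation transfers grow linearly with it. The sizes are assumptions for illustration only.

```python
# What grows with sequence length: gradients (data-parallel) vs activations (pipeline-parallel).
def grad_payload_gb(n_params, bytes_per_elem=2):
    return n_params * bytes_per_elem / 1e9

def activation_payload_gb(seq_len, micro_batch, hidden_dim, bytes_per_elem=2):
    return seq_len * micro_batch * hidden_dim * bytes_per_elem / 1e9

print(grad_payload_gb(10e9))                        # ~20 GB, the same at any sequence length
print(activation_payload_gb(8_192, 1, 8_192))       # ~0.13 GB per boundary, per micro-batch
print(activation_payload_gb(1_000_000, 1, 8_192))   # ~16 GB per boundary, per micro-batch
```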
Nathan Labenz: 1:47:47 Can you describe offloading a little more? It's essentially moving the momentum terms off of the high-bandwidth memory to other storage?
Johannes Hagemann: 1:48:00 Yeah, to the storage on your node, basically. Right? To CPU RAM. That's how offloading works, in a sense. And you certainly want to offload the optimizer states, since they take the most amount of memory. Right? Like we touched on earlier, in mixed-precision training with AdamW, you have your parameters and you have your gradients, which are mostly stored in 16-bit precision. Right? But for your optimizer, you have to store a copy of your parameters in full precision, you have to store a copy of your gradients again in full precision, you have the momentum, as you just mentioned, and you have the variance, also in full precision. So it takes up way more memory than your parameters and gradients. So if you're able to offload that efficiently while still keeping really good training efficiency, that's probably the way to go. And then you can fit some of the largest models, basically, on a single node. Obviously, you still won't be able to train a DeepSeek R1 with 650 billion parameters on it, but you can at least go into the 100-billion-parameter regime.
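As a back-of-the-envelope illustration of why the optimizer state is the natural thing to offload, following the memory breakdown described above; the numbers, including the 8xH100 node size, are illustrative assumptions rather than any particular setup.

```python
# Rough per-parameter memory in mixed-precision AdamW training,
# and what offloading the optimizer state to CPU RAM buys you.
def gpu_memory_gb(n_params, offload_optimizer=False):
    weights_grads = n_params * (2 + 2)          # bf16 params + bf16 grads stay on GPU
    optimizer     = n_params * (4 + 4 + 4 + 4)  # fp32 param copy, fp32 grads, momentum, variance
    on_gpu = weights_grads + (0 if offload_optimizer else optimizer)
    return on_gpu / 1e9

n = 100e9                                        # a 100-billion-parameter model
print(gpu_memory_gb(n, offload_optimizer=False)) # ~2000 GB: far beyond one 8xH100 node (~640 GB HBM)
print(gpu_memory_gb(n, offload_optimizer=True))  # ~400 GB: plausible on one node, before activations
```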
Nathan Labenz: 1:49:05 One thing I've been thinking about a decent amount: I'm sure you've seen this blog post from the researcher who, you know, was at Google and then left to go start Reka and is now back at Google. I thought the blog post about buying CPUs, or buying GPUs, I should say, in the wild was a really illuminating one. And basically the long and the short of it was, he was saying, I had no idea how spoiled I was at Google, because all the infrastructure there just works. And then you get out into the real world and you're trying to buy from this cluster or that cluster. And this was like a year ago, so, you know, you maybe didn't exist yet, and it was also just a more frothy time in general for trying to buy compute, and people hadn't worked out the problems nearly as much, but it sounded like a real jungle. At what point, if ever, do you expect us to be in a similar place with GPUs as we are with CPUs? By which I mean, the hyperscalers, years ago now, basically figured out how to abstract away the differences in underlying hardware and create consistent containers, so that there are enough layers of abstraction that they don't really have to worry exactly what physical device this is running on. They just know the virtual environment that it works on, and that can be consistent, and that's good enough. And on top of that, they have the fault tolerance where, if one computer craps out, or somebody spills a drink on it, it doesn't matter; they just route around those problems, and that's great. Right? That is why we have all the tremendous uptime that we have, and so on and so forth. It strikes me that we're not there with GPUs. Is it just because software takes time to mature, and we're not there yet but it will get there, or is there something more fundamental that I may not appreciate? Because, in contrast, I've heard these stories of, like, one GPU goes down and it stops your training run. When I hear that, I'm like, wow, that's really weird. That's a dramatic departure from how large-scale web products were supported ten years ago, right? Facebook in 2013 or 2014, 2015 was, I think, already starting to get pretty far down that path. So what do you make of that, and what should I expect in the future?
Johannes Hagemann: 1:51:30 Yep. Yes, you've mentioned it: fault tolerance is a huge part there. Right? And it's definitely on the software as well as the hardware stack right now that it's not perfect yet in the whole GPU regime. If you do a training run across, like, 100,000 GPUs in a single cluster, for example the, I don't know, xAI Grok cluster or something like this, there will be a node that fails at least every couple of hours. Right? Maybe more often. And right now, a lot of the current training frameworks are not actually built for fault tolerance. So the training run is just gonna crash if one of those thousands of nodes crashes, and you have to continue your training from the last checkpoint, for example. A lot of the big labs have it figured out; the ones that don't are racing to figure it out now, because otherwise they won't be able to train their models across the largest clusters. And in our case, it's even more extreme, in a sense. We definitely need fault tolerance, and we need even more than that: we need onboarding of different GPU nodes during the training. And that's what we've done with our training framework called Prime, which we've also open-sourced, right, where we have this fault tolerance. All the DiLoCo runs, all the data-parallel runs we do have fault tolerance in our training: whenever somebody drops off, the training doesn't stop; you can still continue training. And that's why all the big labs are interested in solutions there, in a sense. Right? How do you actually make this fault tolerance happen? And over the long term, I think it's gonna be fixed, obviously, on the software side as well as on the hardware side. We're gonna move more toward compute actually being a commodity, which is definitely not the case right now. You mentioned the Reka post, where they basically showed there was, like, a 100x difference between different GPU providers in how reliable their GPUs were. There's probably always gonna be a bit of a better provider, but it definitely has to turn more into a commodity, where you can just run on the compute, fault tolerance is handled, and the hardware is gonna improve as well.
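As a minimal sketch of the basic fault-tolerance pattern being described, checkpoint periodically and resume after a failure with whatever nodes are still alive; this is not the Prime framework itself, and the filenames and toy model here are hypothetical.

```python
# Minimal checkpoint-and-resume loop (illustrative, single-process toy).
import os
import torch

CKPT = "latest.pt"

def save_checkpoint(step, model, optimizer):
    torch.save({"step": step,
                "model": model.state_dict(),
                "opt": optimizer.state_dict()}, CKPT)

def load_checkpoint(model, optimizer):
    if not os.path.exists(CKPT):
        return 0
    state = torch.load(CKPT)
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["opt"])
    return state["step"]

model = torch.nn.Linear(8, 8)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
step = load_checkpoint(model, optimizer)

while step < 1_000:
    try:
        x = torch.randn(32, 8)
        loss = model(x).pow(2).mean()
        optimizer.zero_grad(); loss.backward(); optimizer.step()
        step += 1
        if step % 100 == 0:
            save_checkpoint(step, model, optimizer)
    except RuntimeError:
        # e.g. a communication failure from a dead peer: fall back to the
        # last checkpoint and keep training with the surviving nodes.
        step = load_checkpoint(model, optimizer)
```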
Nathan Labenz: 1:53:37 When you do something like the 10-billion-parameter model that you have trained in a distributed way, how much of your time is going toward managing that mess and making sure that it works with heterogeneous hardware and things coming online and offline? And do you have, like, an SLA-type agreement with your providers that they're not just gonna yank things out from under you? Or are you giving them the freedom to say, no, well, these are preemptible nodes, we'll take what we can get as we can get it? Say a little more there. And how much of your time, and how much of the return, comes from that kind of improvement versus, like, the DiLoCo-style, more theoretical information-management kind of improvements?
Johannes Hagemann: 1:54:22 Yeah. I think for the initial INTELLECT-1 run, a lot of the improvements were more on the hardcore engineering type of work to actually get that done, right: getting fault tolerance into the framework, as well as covering all the other edge cases, having things like live recovery, where somebody can drop off but also join again, or just join in the middle of the training run, which is definitely not supported in any of the other training frameworks right now. So our focus was definitely more on the engineering side. And to be fully honest, at the beginning there were still a lot of things going wrong, where we basically had to fix stuff on the fly; a lot of the edge cases we hadn't thought about, right, which can actually go wrong in such a globally distributed setting, had to be fixed on the fly. The training stopped for a couple of hours, and then we continued with all the nodes later on. So that's basically how we've been handling it for INTELLECT-1. And I think by the end of the training run we had a pretty stable solution, in the sense that a lot of those components were fixed. I'm sure once we start INTELLECT-2, there are still gonna be some issues where training runs have to be restarted because of edge cases; you can never be 100% sure. But, yeah, that's how we've been handling it.
Nathan Labenz: 1:55:33 That narrative strikes me as a decent metaphor for the AI phenomenon as a whole. There's a lot of edge cases we haven't anticipated, and we may have some crashes, and hopefully we'll recover, and hopefully we'll get to a stable equilibrium at some point. I guess let me try to summarize where we are on distributed training. There is a mess of hardware management, which it sounds like you've come quite a long way toward taming, but maybe not fully tamed yet. There's also just the fundamentals of information management and the challenges of bandwidth. It seems like the low-hanging fruit has been picked, where it was like, actually, you could just do this in a less bandwidth-intensive way and it still works. And now we're getting into things where there are probably more tangible trade-offs, where, yeah, you could do this, but then it's not gonna converge as fast, or we're not necessarily competitive if we make certain trade-offs. In terms of scale, you've got the 10 billion. It seems like you can see a path to the 100 billion, but a path to a trillion maybe is not yet clear. Is that a decent...
Vincent Weisser: 1:56:41 I think there's still a path to almost, like, AGI, and even, like, this paradigm almost goes beyond purely just maxing out parameter count. I think even R1 shows us, right, that scaling that paradigm might get us to AGI, and it might not require trillions of active parameters. That's, for me, a quite plausible path, where there are other pieces of the puzzle that are quite relevant to solve. And I think we're largely on track to solve decentralized training. It will always have some trade-offs, but those come with huge benefits too, right? Because you can train across the whole world, you're not reliant on drawing absurd amounts of energy in one place and having an absurdly large cluster in one place, which you can see play out in real time, right, with all the big labs. Which is also why they're all extremely interested; a bunch of them reach out to us, because it's the key thing that they are trying to solve: this distributed training. So in that sense, it's actually something that even the big labs, and everyone basically, is tackling. It's not just us as, like, the fringe decentralized AI, decentralized training ecosystem; it's OpenAI, it's Gemini, it's all of them, right? And I think that's something which people haven't fully appreciated yet about distributed training. They obviously, as Johannes also mentioned earlier, have a slightly different paradigm, right? In some cases they have very fast interconnect between their clusters. But it is surprising: even if you look at OpenAI, it's not like their data centers are next to each other. Right? The plan is that they're actually spread throughout the US, kind of like the data center build-out that they had planned.
Nathan Labenz: 1:58:20 You need decentralized training to win over congress to fund the massive build out because they all need
Johannes Hagemann: 1:58:25 a little bit in
Nathan Labenz: 1:58:26 in each of their respective states and districts.
Vincent Weisser: 1:58:29 Yes. And I think, on the other side, this also unlocks just, like, global communication. Right? Having fiber under the oceans, connecting the continents, having Starlink: all these things open up, theoretically, even compute being in the middle of the ocean, if you suddenly can communicate. And we had people doing exactly this reach out to us, because the cost of energy is so much lower if you have, like, wave energy in the ocean. So suddenly basically a different, almost new infrastructure paradigm opens up, where you don't need the compute necessarily in, like, the different hubs; you can have it in extremely rural areas as well. And I think that future is playing out, and the question is probably how fast that infra build-out will continue, right? It's at trillions of dollars now, and I think it's probably gonna get to the tens of trillions. Microsoft alone is spending $80 billion a year on data center build-out, right? OpenAI obviously now, like, $100 billion plus. And I think there are a lot of others that are underreported, right? The governments have gigantic infrastructure build-outs that are very underreported, including the US government. So I think there's much more compute already being built than people even know about or than gets reported, and there's way more being built that it would be hard to even fully have an overview of. DeepSeek is the best example, right? Just yesterday the Scale founder was saying that supposedly they have 50,000-plus H100s, even though they underreport the number because of export restrictions. Right? So there are all these things in place where it's, like, good luck getting the actual number of what the defense department is right now building up on the GPU side. I know it's in the six figures, hundreds of thousands plus; it's probably in the seven figures already. Gigantic clusters that are very unknown. And I think that's playing out for China; it's playing out for a lot of different places, where basically there will be specific spots where there will also be an explosion just in compute, which we'll tap into. Right? A good example was TikTok, which was one of the biggest providers of compute capacity to a lot of these different platforms and ecosystems, because they bought so many GPUs, I think over 600,000, and at times they didn't need them, right? So they gave them to the market, basically. And I think those things will continue to play out as basically every corporation and every nation-state starts ramping up their data center build-outs.
Nathan Labenz: 2:00:56 It's gonna be quite something if seasteading becomes a thing because people wanna compute in the open ocean for, like, wave energy. You know, hey, maybe we're headed to the cyberpunk future.
Vincent Weisser: 2:01:06 Yeah. I think, more, the energy there is already so much cheaper, like an order of magnitude, that it would kind of make sense even from that angle, probably, to some extent. So I think compute will, to an extent, be everywhere where it's allowed and has geographic advantages. Right? Like, where it can tap into cheap energy, where it's easy to maintain.
Nathan Labenz: 2:01:27 Yeah. With Starlink, you could even have a decent... it sounds like you could have a little floating compute. You talk about islands of compute; you could really be talking true, genuine islands of compute in the middle of the Pacific. That would be quite something to contemplate. What does this mean for NVIDIA? I mean, this is not a stock show by any means, but I looked at NVIDIA, which, of course, has blown up: as of today, a $3.6 trillion market cap. And then, just for one comp, I looked at AMD, which is actually down a third over the last year or so and is at a $200 billion market cap right now. So you've got an 18x ratio of NVIDIA to AMD. I wonder what you guys think of NVIDIA's prospects for continued dominance of this market. A sort of naive read of everything that we've talked about would be that the abstraction is coming, and if abstraction comes, that would seem to be almost synonymous with not the kind of moats or profit margins that NVIDIA enjoys today.
Vincent Weisser: 2:02:30 I think, like, two or three things can be true at the same time. Right? One is that this market is exploding. It was basically nonexistent five years ago, and now it's in the trillions, and it'll probably go to tens of trillions; I think it's hard to fathom how gigantic this market will be, in the sense that everyone in the stack is poised to be one of the most valuable corporations in history, be it NVIDIA, ASML, TSMC, AMD, and everyone else. But on the other side, there's the kind of monopoly that NVIDIA holds right now: they make, like, over 100% of the profits in the AI industry right now, because everyone else is burning through money while they stack money. I think their margins, and them basically having the majority of AI profits flow to them, will probably shift slightly away over time, right? Their margins might compress while their revenue might 10 to 100x. So they'll probably still be a bigger business, especially in the optimistic scenarios. But the thing is, AMD hasn't been the strongest competition, even though they're obviously the clear number two; they've famously fumbled on a lot of different fronts, including on software, and also haven't really built up the ecosystem that NVIDIA built up. My other read of it, almost, is that at some point NVIDIA has such big free cash flows that they can reinvest them into the next generation of chips and ultimately extend the lead. To be honest, I think the strongest competition, which is also fumbling it in a different dimension, is Google with TPUs, which obviously they also keep very close to the chest internally. Obviously, the other big techs, the Amazons and Apples, won't start adopting Google TPUs, right? But they're also doing their own efforts: Amazon has their own chips, Apple has their own chips. So I wouldn't actually underrate the chip efforts from the big tech giants, more in terms of the fact that they can force adoption and everything else. Basically, AMD will also go from close to 0% market share to slightly higher, but it probably won't overtake NVIDIA anytime soon. And there are other, more specialized chip providers, right? There are specialized ASICs just made for transformers, like the Groqs and Etcheds of the world. They're also basically at close to 0% market share right now, but they'll probably gain some market share, similar to how a lot of the other chip providers will. So over time, fast-forwarding, I think they will probably, as a class of companies, be worth more: the margins will be compressed, but the revenues will be way larger. And I think there's basically more competition, to your point, on the software stack, which we plan to contribute to, right, to help commoditize those efforts, in a sense where the margins, ultimately, in a perfectly efficient economy, get compressed to a kind of cost of capital with some risk premium. Right? And I think that's hopefully where we're headed in the long run, even with superintelligence.
So there's actually an opportunity over time, even with superintelligence, for their margins to be compressed by cheap alternatives. So I think the high-level read of it is: they were obviously able to stack those moats, right, from hardware to software to the interconnect to vendor relations and all of these other things, which is not easy to break, at least right now, given what I would say is pretty weak competition from AMD, and even weaker from Intel. Right? They're not even in the race at all.
Nathan Labenz: 2:05:58 Yeah. Elon might have something to say about that, I guess, from what I hear recently, but time will tell. Are you familiar with Emad's Intelligent Internet project? I did an episode with him not too long ago. Of all the other things I've seen in the world, I would say you guys vibe-match most with him. Are you in touch, or are there other things there?
Vincent Weisser: 2:06:19 We're quite close with him. Honestly, I can't fully wrap my head around yet what exactly he's planning to do, but I think it's also evolving and hasn't fully launched yet. Maybe some differences, and some similarities: we're focusing a lot on peer-to-peer compute and intelligence as foundational pieces of the puzzle, so I think that's a difference. As far as I know, we put a big focus on decentralized training, right, on fixing it and making it work, while he seems more interested in slightly different paths, kind of orthogonal to it. So basically, I think it shares maybe a broader vision, but the way we get there is via slightly different paths. But then also, compared to not just him but a lot of the efforts out there, we really try to ship in small increments, as quickly as possible, different pieces of the puzzle, and open them up to the open-source community. I think he'll probably do the same soon, looking at his work before with Stability. He definitely has a track record of building relevant open-source models, right, which isn't the case for others maybe entering decentralized training or decentralized AI. So I do think he'll be able to create some cool open-source AI models and create an interesting community, which he obviously did quite well with Stability in many ways. I think that's probably a superpower that he'll continue to play into. And we're quite collaborative with him and in touch, but it's the same as with a bunch of other efforts in the space, where we're just trying to figure out how they can support each other: the different use cases where people can build agents, they can build their own models on top of us, or Emad's network needs compute and can draw from our network. And I think that's kind of the future that I'm envisioning, similar to what we've seen with something like Ethereum, where the big power, and why it ultimately won in many dimensions, I think, is that you have this co-ownership, right, of the whole network and protocol. It's fully open source. You have thousands of people collaborating, with strong incentives to make it successful, instead of just five or ten team members and a few shareholders. It's basically a paradigm where ultimately it's almost the open and decentralized community competing against the closed-source labs, and that union of players is growing, including some of the foundational open-source pieces like PyTorch and the Llama community that we are very open with and in close touch with. So I think all of those things can stack on each other and build on each other. He can leverage our progress on decentralized training, and the same for everyone else, and the other way around. I think that's really the strong tailwind this space has: we can build on each other's work, and that's how the most progress will happen in a very short period of time, and how open source has now, like, officially caught up with closed source. Since R1, it closed the gap, right?
Most closed-source companies, like the xAIs and Anthropics, are now officially behind open source, and I think that's an update that shifts people's perception. A lot of people didn't think it was possible even a year or two years ago. And now it's kind of definitively proven that, like, OpenAI's lead melted toward zero, right? That one- or two-year lead is now gone.
Nathan Labenz: 2:09:47 Certainly down to months, not years, I would still say, I think. I mean, it is tricky, because I have had really good experiences with DeepSeek. It's only been a few days, and a lot of people seem to be very enthused about it. I still think if you could only give me one, I would take o1 over R1. And certainly we know o3 is coming too, so there is something there. I would also bet that internally at DeepMind and Anthropic there is something at least on the R1 level, even if we haven't necessarily seen it. We have seen Gemini Flash Thinking, and Dario's recent comments about what he's seen internally at Anthropic certainly suggest they've got something cooking that's moving in that direction. In terms of the long-term vision, or sort of the structure of your company, in a way kind of going back to the beginning: you guys have all these compute partners that contribute to the marketplace, but they also can contribute to these decentralized training runs that you're doing. Are they just contributing in kind, and what is the governance model? You know, I'm a little bit fuzzy on what's in it for them. What is the sort of incentive design that you've done to bring people together and create sustainability on that level?
Vincent Weisser: 2:10:55 I think there are different phases to it, right? Up till now, everyone contributed in kind and was ultimately more like friends and sponsors: basically folks like Hugging Face or SemiAnalysis, but also just a bunch of people on the internet, right, who liked the project and joined it. Basically, again drawing the analogy to something like Ethereum, the goal of this is really to create a public utility that anyone who contributes to it can have a piece in and can direct. We're basically in the process of finalizing the details and working this out, but it will have a nonprofit foundation structure that governs, basically, the provisioning of this public utility. Right? And I think it's in many ways very counter to just a pure startup corporation, just in structure, right? It's basically set up to provision a utility in a very efficient way, but also in a very open and decentralized way that is permissionless. To your point, Vitalik can't stop Ethereum, right? Even if he wanted to, and that's kind of by design. And ultimately, they're not the only one contributing to it; there are hundreds of teams contributing to that infrastructure, everyone using it, everyone making pull requests to it. And I think that's basically what we are setting up for, and in that structure we as a company are almost one of the contributors to that public infrastructure and utility. But I think it's very, very different from a traditional corporation in that sense.
Nathan Labenz: 2:12:35 So do you envision, like, a token or a currency of some sort, where I contribute compute and then I get some sort of claim on future governance that I could potentially even resell?
Vincent Weisser: 2:12:48 The broader goal, right, is to create an alternative system overall, in that sense, like Ethereum, that is fully permissionless, fully tokenized, and everything. So yeah, it's in that direction, even though we're wary of committing too much to a concrete timeline and everything. Basically, it's a broader question and discussion about how economies, and even currencies, reshape in this superintelligence age. And I do think the default outcome is almost a continuation of fiat dying, to an extent, or at least being heavily inflated with UBI-like measures, right, which can potentially be funded by the windfall from superintelligence, but that still means that ultimately people want to flee into less-inflating, harder assets. And I do think a currency that is backed by compute is way harder than a fiat currency that's backed by nothing, and also than a cryptocurrency like Bitcoin, which, if you really look at it, is not that hard of an asset and currency, right? You don't get much utility out of it. So that's kind of how we're thinking about it. In the intelligence age, you almost want to own a piece of a superintelligent system that is able to generate value, where through your ownership in it you actually have access to the compute, to the intelligence. How exactly that can be built, and how we are approaching it, we'll share more about soon, but ultimately I think it basically needs to happen in stages, in a way that really guarantees that anyone can get involved and have a piece of it, but that is also truly permissionless, right, in the spirit almost of the early cypherpunk internet, and not of the late-stage, platform-monopoly big techs. And I think that's really the broader spirit: we lost the internet to big tech platform monopolies, and I want to make sure we don't lose superintelligence to them again. I think that's kind of the default outcome if we don't veer off onto an alternative path.
Nathan Labenz: 2:15:02 I think you just brought us to a perfect ending point with that comment. Maybe one little extra bonus would be: what are you guys looking for? That could be hiring on your team; I know you've got a jobs page with a number of listings. It could be compute contributors or anything else. What sort of bat signal do you wanna put out into the world?
Vincent Weisser: 2:15:23 Yeah, appreciate it. I think all of the ones you mentioned: we're actively hiring across a lot of different roles, a lot on the AI research side, on the general developer side, but even on things like marketing, design, and every other front. So anyone that's interested, feel free to reach out. Similarly, like with the model collaboration we did, which we didn't touch on, we're quite actively collaborating with dozens or hundreds of different open-source AI researchers, leading scientific institutions, and leading AI labs at universities. So we're basically focused on supporting the highest-impact initiatives with compute, but also with hands-on support to actually build and scale their models. So that's another area. And in general, I'm always happy to connect with anyone interested in going deep on this and collaborating with us. There's a lot of surface area for collaboration, as you also mentioned, for people that either have compute and want to contribute, or want to contribute to the models or the agents that they're excited about. That's really what we're planning to enable.
Nathan Labenz: 2:16:28 This has been excellent. Vincent Weisser and Johannes Hagemann, founders of Prime Intellect, thank you both for being part of the Cognitive Revolution. It is both energizing and enlightening to hear why people listen and learn what they value about the show. So please don't hesitate to reach out via email at tcr@turpentine.co, or you can DM me on the social media platform of your choice.