Data, data, everywhere - enough for AGI?
In this podcast, Nathan and Nick Gannon dive deep into the data requirements for achieving Artificial General Intelligence. They explore the current paradigms, the role of data in approximating intelligence, and the scaling trends for GPT models.
Watch Episode Here
Read Episode Description
In this podcast, Nathan and Nick Gannon dive deep into the data requirements for achieving Artificial General Intelligence. They explore the current paradigms, the role of data in approximating intelligence, and the scaling trends for GPT models. The discussion covers various datasets, from email and Twitter to YouTube and genomic data, as they analyze the feasibility of reaching the target of 100 trillion high-quality tokens. While the bull case suggests an abundance of data, the bear case highlights the limits on high-quality data, prompting a fascinating exploration of what makes data good for AI and the potential for AI to generate its own data.
Sponsors:
Omneky is an omnichannel creative generation platform that lets you launch hundreds of thousands of ad iterations that actually work, customized across all platforms, with a click of a button. Omneky combines generative AI and real-time advertising data. Mention "Cog Rev" for 10% off https://www.omneky.com/
The Brave Search API can be used to assemble a data set to train your AI models and help with retrieval augmentation at the time of inference, all while remaining affordable with developer-first pricing. Integrating the Brave Search API into your workflow translates to more ethical data sourcing and more human-representative data sets. Try the Brave Search API for free for up to 2000 queries per month at https://bit.ly/BraveTCR
Head to Squad to access global engineering without the headache and at a fraction of the cost: head to https://choosesquad.com/ and mention “Turpentine” to skip the waitlist.
Plumb is a no-code AI app builder designed for product teams who care about quality and speed. What is taking you weeks to hand-code today can be done confidently in hours. Check out https://bit.ly/PlumbTCR for early access.
TIMESTAMPS:
(00:00) Introduction
(05:04) Scaling Hypothesis of Intelligence
(07:32) Is There Enough High Quality Data?
(10:19) Algorithms Impacting Data Requirements
(17:42) Sponsor: Omneky
(18:04) Estimating High Quality Token Requirements
(24:07) Astronomy and YouTube Data Scale
(29:42) Genomics Data
(38:14) Sponsors: Brave / Plumb / Squad
(41:16) Code Datasets and Synthetic Data
(45:48) The Bear Case: Quality and Usability of Data
(50:54) Investment Trends and Compute Efficiency
(54:19) Training Run
(57:21) Synthetic Data Generation and Self-Play
Full Transcript
Nick Gannon: (0:00) Oftentimes, people's conceptions of AI progress seem to be derived more from aggregating the sentiments of the crowd than from any core, ground-up framework. This is something I often do as well, but we want to avoid reducing AI as a concept to an index that we're sort of long or short on, bearish or bullish, overpriced or underpriced. Because doing so makes our models of the AI space grounded in other people's opinions of AI rather than in any facts of the case. It's becoming awfully clear to me that these models are truly approximating their datasets to an incredible degree. What this manifests as is: trained on the same dataset for long enough, pretty much every model with enough weights and training time converges to the same point. Improvements in data quality and improvements in algorithmic architectures can be viewed as reducing the scale requirements to reach this human-level performance in generality across a large range of tasks.
Nathan Labenz: (1:01) Hello, and welcome to the Cognitive Revolution, where we interview visionary researchers, entrepreneurs, and builders working on the frontier of artificial intelligence. Each week, we'll explore their revolutionary ideas, and together, we'll build a picture of how AI technology will transform work, life, and society in the coming years. I'm Nathan Labenz, joined by my cohost, Erik Torenberg. Hello, and welcome back to the Cognitive Revolution. Today, we're diving deep into one of the most critical questions in AI: if we think that we can build some form of AGI by simply scaling up the current paradigm, is there enough high-quality data in the world to get us there? Joining me on this journey is Nick Gannon, master's-degree data scientist by day, AI scout by night. He conducted extensive research to gather the numerous numbers that we'll be discussing over the next hour. We begin by extrapolating the trends set from GPT-2 to GPT-4 to set a budget for a hypothetical GPT-5. Then we consider the total data volume generated across domains like email, social media, YouTube, genomics, and astronomy, attempting to determine just how much of humanity's raw data output would need to be high quality to achieve this ultimate goal. We also work backward from the scale of compute that we might expect to have in the future, asking how much data we'd need to use it all effectively. As you'll hear, this episode is full of interesting numbers and useful comparisons. Our goal is to help you anchor your AI worldview to realistic ranges for the key AI inputs of data and compute, enabling you to better contextualize the growing volume of new research, datasets, and models that you'll have no choice but to process with increasing speed going forward. While following the many order-of-magnitude calculations that we work through will likely require more focus than our typical episode, I personally believe the extra effort is worth it. In a world where the White House has set 10 to the 26 FLOPs as the threshold for reporting training runs, I think anyone seriously tracking AI progress should actively build intuition around key reference numbers. As always, if you value this work, we appreciate it when you share it with friends. And for this experimental episode in particular, we especially request your feedback, either via our website, by DMing me on your favorite social network, or, if you loved it, via a review on Apple Podcasts or Spotify. I don't see anyone else doing quite this kind of work, but is that for good reason, or would you like to see us invest more in these high-level guides? In any case, your input will shape our future decisions. Now, without further ado, here's my discussion about the scale requirements for AI training data with Nick Gannon. Nick Gannon, welcome to the Cognitive Revolution.
Nick Gannon: (3:41) Thank you.
Nathan Labenz: (3:42) So we are here to talk about data. And I've been really intrigued by some of the analysis that you've brought to bear on this question of what data exists, what's out there, are we going to run out of it, what are the different modalities? So this is a scouting report episode that really just tries to get our arms around the scope, the scale, and the nature of data to the best of our ability. And I really appreciate all the work that you've put in to trying to answer these questions. I'm excited to learn a lot from your analysis.
Nick Gannon: (4:18) Yeah. Absolutely. Yeah. Thanks for letting me share. So diving in here, the general premise is: what are the data requirements to brute-force your way to systems that are as generally intelligent as humans across essentially all cognitive tasks? Oftentimes, people's conceptions of AI progress seem to be derived more from aggregating the sentiments of the crowd than from any core, ground-up framework. This is something I often do as well, but we want to avoid reducing AI as a concept to an index that we're sort of long or short on, bearish and bullish, overpriced, underpriced. Because doing so makes our models of the AI space grounded in other people's opinions of AI rather than in any facts of the case. So instead of this AI perspective of crowd sentiment aggregation, it makes more sense to live within an explanatory paradigm to explain the current state of affairs. In AI, the current paradigm that sits at the core of the scaling labs, being OpenAI, DeepMind, and Anthropic, is the scaling hypothesis of intelligence. It can be put pretty well by outlining two fairly simple premises for the scaling hypothesis that sort of underlies the AI strategy of these three firms. Premise one being really just this "if brain bigger, then brain smarter" premise. And the second one being that there is functional parity between biological and artificial neurons. And this is like a substrate independence of intelligence: there's nothing necessarily unique about fleshy meat wires or anything along these lines. It doesn't need to be carbon-based; it can be silicon. If premise one and premise two hold, then human-level AI systems are not only tractable, but a very doable engineering problem. It is examining data through this reference frame of scale that gives it the most grounding in how we should go about approaching this, that allows us to take AGI, like, extremely literally, even more so than taking it seriously, where we start planning for the world that we are walking into, where we've got systems that are as generally intelligent as humans across essentially all cognitive labor. James Betker has a really good quote on the role of data to be played here. And James Betker, for context, has an awesome blog. He works at OpenAI and essentially builds distributed clusters and large systems at scale. He's got a quote that says: it's becoming awfully clear to me that these models are truly approximating their datasets to an incredible degree. What this manifests as is: trained on the same dataset for long enough, pretty much every model with enough weights and training time converges to the same point. And this is a very extreme interpretation of the scaling hypothesis, where improvements in data quality and improvements in algorithmic architectures can be viewed as reducing the scale requirements to reach this human-level performance in generality across a large range of tasks.
Nathan Labenz: (7:30) Yeah. Just to zoom out for a second. The core question right now is, can we just keep scaling? One of the key questions seems to be, is there enough high-quality data? The leaders at the labs seem to be pretty clear on their belief that there will be enough data, or that they can figure it out one way or another, create it if they have to. But across the board, nobody has shown much concern, of the Sam Altman, Greg Brockman, Ilya, Demis, Shane Legg, Dario set. Nobody there that I'm aware of has shown much concern about lack of data being a fundamental barrier. On the contrary, pretty much all of them I can recall quotes for where they're like, yeah, we'll be fine, we'll be able to get over that. So that's the paradigm. How far does it hold? They seem to think it's going to hold. Now we can actually go out and look at how much data there really is. What is the data we actually have?
Nick Gannon: (8:27) Yeah, absolutely. And so going back to GPT-3: GPT-3 was trained on roughly 1 trillion tokens, or 600 gigabytes of text. In a lot of these numbers, our ultimate goal is to be in the correct order of magnitude. Honestly, for our confidence interval, we would like it to be smaller than an order of magnitude ultimately. And along these lines, the GPT-4 training set is estimated to be somewhere around 10 trillion words, or 10 trillion tokens. And this 10 trillion data point is something that we're gonna anchor against going forward. And it comes from many places. Probably the most influential is SemiAnalysis's research, where he estimates GPT-4 to be a 16-expert MoE at 111 billion parameters per expert, 1.8 trillion parameters total, with an additional vision encoder with cross-attention, trained on another 2 trillion image tokens. So given these GPT-3 and GPT-4 estimates, current scaling trends suggest that GPT-5's dataset is gonna be somewhere in the 100-trillion-token, or we can think 100-trillion-word, territory. And if we look at Chinchilla scaling laws, if we're going up 9, 10x on dataset size, that would be akin to about a 2-orders-of-magnitude increase in the amount of compute that we're gonna be using, which does fall in line with the compute trends that we're seeing as well. And it is also worth noting that Chinchilla optimality is no longer in fashion, and everybody essentially trains significantly past Chinchilla optimality, with really high token-per-parameter counts, to minimize inference latency and inference cost. To add a couple more caveats to these GPT-5 training set estimates, it's worth diving into some of the algorithms that are likely to be central features. So the pretraining context window size can have a large impact on the data-to-compute ratio. And it seems likely that GPT-5 will be trained with something that looks like ring attention, a mechanism to allow context window sizes to get to this sort of 1 to 100 million token context window with perfect needle-in-a-haystack recall. It's speculated that this is likely what Gemini 1.5 is using and also what Claude Opus is using to get that really long retrieval performance. And to illustrate the computational picture with ring attention, because it's not approximate, you still do have to pay this quadratic compute: if you have a 1-terabyte model with roughly 1 trillion parameters, and you pre-train it on a fixed dataset size at a 4k token context window, and then you pre-train it on a 1 million token context window, it would actually be 5.6 times more compute for the million-token context window. The intuition for why this would be the case is that attention's FLOP contribution is being offset by the dramatic increase in tokens per batch. So we can think we have this 4k-squared value in the numerator, and then you've got this divided by 4k. And then we're changing that to 1 million squared in the numerator, along with a bunch of other variables, number of layers, things along these lines. But you're dividing by 1 million, so it takes the bite out of that 256x. And intuitively, if you're batching at 1 million tokens, you just have way fewer batches as you're going about this pretraining process. And so we can look at this sort of variable as being a downward force on the data requirements that we would need to train something like GPT-5.
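For anyone who wants to check that 5.6x figure, here is a rough back-of-envelope sketch. It uses the standard approximations of about 6N training FLOPs per token for the dense matmuls and about 12·L·d·s per token for attention (forward plus backward); the hidden size and layer count below are illustrative assumptions, not a known GPT-5 architecture, so the exact multiple shifts with those choices.

```python
# Back-of-envelope: extra compute for a 1M-token context vs. a 4k context,
# for a ~1-trillion-parameter dense model trained on a fixed number of tokens.
# Assumptions (illustrative, not a real architecture): ~6*N FLOPs/token for
# the dense matmuls and ~12*L*d*s FLOPs/token for attention, fwd + bwd.

N = 1e12                      # total parameters
d = 40_000                    # assumed hidden dimension
L = round(N / (12 * d ** 2))  # layers implied by N ~= 12 * L * d^2

def flops_per_token(seq_len: int) -> float:
    param_flops = 6 * N                 # dense matmul compute, fwd + bwd
    attn_flops = 12 * L * d * seq_len   # QK^T and attention*V, fwd + bwd
    return param_flops + attn_flops

ratio = flops_per_token(1_000_000) / flops_per_token(4_096)
print(f"implied layers: {L}")
print(f"compute multiple, 1M vs 4k context: {ratio:.1f}x")
# Comes out around 5x with these assumptions, the same ballpark as the ~5.6x
# above, and far below the ~250x growth of the attention term alone, because
# attention is a small slice of total FLOPs at a 4k context.
```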
You've actually spoken to this in your Mamba work, where some of these algorithmic developments, it's pretty astonishing that anybody is publishing them at all. This feels like a secret where, if somebody went to DeepMind and asked for $10 million, they could likely be given $10 million just for DeepMind to have the option to have a couple months before everybody else figures out an algorithmic secret. So it's definitely shocking that any of this is publicly disclosed, and bless the hearts of all of the academics of the world for doing that. A further caveat to these GPT-5 data requirement estimates: the scaling laws for MoE are different than Chinchilla optimality. With MoEs, we're dealing with highly sparse decoder-only transformer models, and they're sort of the elected successor of their dense counterparts because they're vastly more compute efficient across just a really wide range of compute budgets. And if we look at the MoE of GPT-4 on a forward pass, it's likely that of the 1.8 trillion parameters, maybe only 270 billion of those parameters are actually being activated, allowing us to reap a lot of these benefits of scale without paying this really high latency cost. But on the other side of that, there is no free lunch. And so while the MoEs still follow scaling laws, meaning cross-entropy log loss and log perplexity decrease on a range of test sets linearly as a function of log data and log parameters and log compute, the slope is a little less slopey.
Nathan Labenz: (14:09) Okay. A quick recap of that, because there's a lot there. Starting off with the amount of training data that is going to be potentially thrown into a GPT-5: 1 trillion was the number for GPT-3. That is 10 to the 12th. So 10 to the 12th tokens for GPT-3, 10 to the 13th or 10 trillion tokens for GPT-4, and you're estimating that would go up another 10x with GPT-5 to 100 trillion tokens of training data, or 10 to the 14th tokens. And to get an intuition for why it would grow by a factor of 10x, it seems like that's based off the budgetary observation that the compute budget is going up 100x per generation. So let's imagine that they 100x the budget. We have estimates of FLOPs and we also have estimates of dollars. GPT-3 trained from scratch for under a half million dollars was a Mosaic claim from a little while ago. Estimates of FLOPs on GPT-4 are somewhere in the 10 to the 24, 10 to the 25 range. We don't know exactly. And that is a mid-tens-of-millions-of-dollars compute budget. So if we're then imagining 100x-ing that, we're getting into a billion to a couple billion dollar run size. So that basically tracks. Certainly, NVIDIA's revenue seems to be on the order of magnitude suggesting that people are doing that. Meta's announced purchases would allow them to go in that direction pretty soon if they wanted to. So, okay: billion-dollar training run, 100x compute, 10x the data size. Why does that 100x compute translate to 10x the data size? The Chinchilla scaling laws say that for a given compute budget, you're going to expand both the parameter count and the token count. And when you expand both, the compute required is the multiplication of that. So if you make the parameters 10x bigger and the data 10x bigger, then your compute budget gets 100x bigger if you expand both of them in parallel. It's also really good to note that the shift to overtraining seemed to be really associated with Llama, because they trained Llama 2 7B up to 2 trillion tokens, which was like, oh wow, you're taking the parameters way down but still doing a ton of tokens, because the models can learn past the point of optimality, and especially if you wanna make something that people can run on their laptops or whatever, which obviously was a big part of that whole program. Then you take this 100x compute budget and maybe you say, instead of going 10x and 10x on either side, maybe I'll do 20 and 5. That does lead you to something pretty dense. And then finally, with the ring attention, the much bigger context window requires much more compute to compute all the attention. But because you are running the data through in such bigger batches, you are not making as many updates to the weights. Okay. Cool.
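As a sanity check on that recap, here is the same arithmetic using the common C ≈ 6·N·D approximation for training compute. The parameter and token counts are the episode's estimates, not confirmed specs.

```python
# Training compute approximation: C ~= 6 * N * D
# (N = parameters, D = training tokens). Figures are the episode's estimates.

def train_flops(params: float, tokens: float) -> float:
    return 6 * params * tokens

# GPT-4-scale estimate: ~10T tokens; ~1.8T total MoE parameters, but maybe
# only ~270B active per token, which is what matters for FLOPs.
gpt4_active = train_flops(2.7e11, 1e13)
print(f"GPT-4-ish training FLOPs: ~{gpt4_active:.1e}")  # roughly the 1e25 scale mentioned

# Scale both parameters and tokens 10x and compute goes up ~100x,
# the generation-over-generation budget jump being discussed.
scale_up = train_flops(2.7e12, 1e14) / gpt4_active
print(f"10x params and 10x tokens: {scale_up:.0f}x more compute")
```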
Nathan Labenz: (17:37) Hey. We'll continue our interview in a moment after a word from our sponsors.
Nathan Labenz: (17:55) So, basically, with all that said, we have to see our way to 100 trillion high-quality tokens?
Nick Gannon: (17:55) Yeah. Absolutely. So diving into how much data we currently have at our disposal: in the year 2020, estimates were that we created about 64 zettabytes of data, which, if we converted everything to text, would be somewhere on the order of 10 to the 22 words, which is an almost meaninglessly large number. And really, almost all of this is just copies and copies of copies. You can think of a Netflix video or a YouTube video as being pre-stored segments or chunks of video that are streamed from some CDN node that happens to be closest to you.
Nathan Labenz: (18:35) So the 20 20 headline super high level number is zettabytes 10 to the 22. That is all data. That's like global storage. So our 100,000,000,000,000 words by comparison is 10 to the 14. So we're talking 1 in 10 to the 8, aka 1 in 100,000,000 parts of this raw data would have to be high quality text for us to see our way to the theoretical GPT-five trading data scale.
Nick Gannon: (19:15) This data just includes everything. It clocks in at about 13,000 times more words than all of the humans have ever said collectively, through the history of the roughly 100 billion humans that have lived thus far. It's the continual monitoring logs of edge computers such as electric toothbrushes. It's the 333 billion emails that are sent each day, the vast majority of which are the same emails being sent to all of us.
Nathan Labenz: (19:44) So 1 in 100,000,000 parts of this raw data would have to be high quality text. I would say that feels safe. As much as there's a lot of log data and a lot of garbage out there, it does feel like if only 1 in 100,000,000 parts of total data has to be good, Surely that much is good to put them in scientific notation as well. So there's 10 to the 11 emails sent per day. Assume 300 tokens per email, you're already up to 10 to the 14. Assume 300 days of the year, you're up to between 10 to the 16 and 10 to the 17. We need to get to 10 to the 14 of high quality. So if 1 in 1,000 emails is high enough quality, then email traffic would contain enough data to do the job. Obviously, you've got major access questions on something like email. But, yeah, again, it seems like you can start to wrap your head around why people would not be too worried because there's just a ton flying around.
Nick Gannon: (20:57) So we have orders and orders of magnitude more data than we actually need. And the current trends seem to suggest that the amount of data being generated by the world doubles roughly every 3 years, meaning about half of all the data ever generated has been generated in the last 3 years. And it seems that this is going to continue, if not accelerate. Let's get a little bit more of a grasp on what this looks like with a little bit more granularity. There's a great research paper called "Big Data: Astronomical or Genomical?" that dives into a couple of domains, including astronomy, genomics, YouTube, Twitter. And it also touches on other things like particle physics data accumulation; apparently the Large Hadron Collider creates quite a bit of data. So Twitter can serve as a great microcosm of sort of mediocre-quality text data on the Internet. And estimates suggest that about 33 terabytes of text are being generated per year. And this is about 1 billion tweets a day, most of which is probably executed by bots spamming people. And this is enough to train a model about double to triple the size of GPT-4.
Nathan Labenz: (22:11) Okay. The Twitter one was 1 billion tweets a day. Say 100 tokens per tweet, we're at 10 to the 11; 300 days, 3 times 10 to the 13 would be the total volume of Twitter for a year. Yeah. So that's 3 times GPT-4 and would be about a third of the way toward this hypothetical GPT-5. So wow, it's interesting. The relative scale of email versus Twitter is shocking. One one-thousandth of email over a year would be enough for GPT-5, but you need 3 times Twitter to get to GPT-5. That also goes to show, I mean, OpenAI has a data acquisition team that they've been fairly public about, where they're just looking for all kinds of data and trying to partner with, whether it's governments that have their own datasets of their particular language, or really, they're open to a lot of things. It's not necessarily easy for them to get to the scale that they want to get to. All this stuff is locked up in other people's systems. And so the question is, who knows where it is? How to tap into it? How much do they need to be compensated to potentially share it? But 1 in 100 million parts definitely suggests that we'll get there, to get to GPT-5. Okay, cool. In terms of our target, we've been conceiving of this largely in text. And it's also interesting to think what other modalities might be bolted on there.
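Here is that email and Twitter arithmetic as a quick script, using the round numbers quoted in the conversation (333 billion emails and about a billion tweets per day, 300 tokens per email, 100 per tweet, 300 days per year):

```python
# Email and Twitter back-of-envelopes, using the round numbers quoted above.

TARGET = 1e14  # hypothetical GPT-5 token target

# Email: ~333 billion emails/day, ~300 tokens each, ~300 days for round numbers.
email_tokens_per_year = 333e9 * 300 * 300
print(f"email tokens/year: ~{email_tokens_per_year:.0e}")        # ~3e16
print(f"fraction that would need to be high quality: 1 in {email_tokens_per_year / TARGET:.0f}")

# Twitter: ~1 billion tweets/day, ~100 tokens each, ~300 days.
twitter_tokens_per_year = 1e9 * 100 * 300
print(f"Twitter tokens/year: ~{twitter_tokens_per_year:.0e}")    # ~3e13
print(f"years of Twitter to hit the target: ~{TARGET / twitter_tokens_per_year:.0f}")
# Roughly: one good email out of every few hundred to a thousand covers the
# target, while you need about three years of all of Twitter.
```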
Nick Gannon: (23:36) A lot of this over indexes on text, but that's largely just because it's easier to think about all of these estimates in a single modality, even though this is very much a story of many modalities.
Nathan Labenz: (23:47) Okay. Cool. So let's get back to the story.
Nick Gannon: (23:50) Yeah. So we have astronomy and YouTube data, which are roughly on the same scale of 1 to 2 exabytes. And so exabytes, if we run the terabyte scale for dealing with Twitter, we're 33 terabytes. We skipped head of scale, and we went straight to exascale. So we skipped essentially 4 orders of magnitude in there. And for YouTube, this essentially means we've got 500000000 hours of uploaded content every year. And remarkably, we have about that equivalent in images of space. Hubble and James Webb are certainly pulling their weight here. So if you did just wanna go straight YouTube, it does seem like that's a Honeywell that will just keep giving and giving for a good long while. Then genomics data, you have about 40 exabytes of genomics data being stored every year, and almost none of it is actually being utilized.
Nathan Labenz: (24:52) I'll trust folks are familiar up through giga, which is 1000000000. So a gigabyte is 1000000000 bytes. A terabyte, tera for 12, a tera is 10 to the 12, a peta is 10 to the 15, and an EXA is 10 to the 18. So the scale of both YouTube and, as it turns out, astronomy at 10 to the 18, that is 4 orders of magnitude bigger than our 10 to the 14 target. 10 to the 14 target corresponded to 1000000000 dollars in compute. So if you actually said, would it take to scale out to all this raw astronomy data? Talk about your universal function approximators. Here's all the stars. Figure out what's going on. Now we can start to be maybe a little fast and loose on which scaling laws we're going to use. If we're going up an order of magnitude of data, we're also presumably going up an order of magnitude in params. Therefore, we're going up 2 orders of magnitude in compute. You'd be talking about taking the budget up 10 to the 8, which would be 10 to the $8,000,000,000. So you're talking a 100 quadrillion dollars, which is 1000 years of global GDP at present. So that's probably a bit out of reach. We're going to need some filtering techniques to do that. But it is also really interesting just from a YouTube standpoint. Those are bytes, and you can definitely get a pretty good discount from pixels to tokens. Obviously, it depends on how well the model works and the nature of your images that you're looking at. But I've been doing a bunch of stuff recently with GPT-4V and the new Claude multimodal, which, by way, the new Claude Haiku is insane. It's so much cheaper and faster. And for my use cases, it largely seems to work roughly as well. If I was looking for a tumor in an X-ray or whatever, I don't think I would take it to Claude Haiku. This, is this an appropriate image to be used in a certain context or whatever? It's like totally handling that stuff fine. But I guess I know GPT-4V best. Their a low res image is billed at 85 tokens, and a low res image is maybe 2 56 x 2 56. So you're looking at, call it, 50,000 pixels, each of which is basically a byte of information. That means you are going down 1000 x pixel to token compression. So now let's do that again on the YouTube side. YouTube at 10 to the 18 bytes. If we were to try to compress that to tokens at that thousand to 1 ratio, we'd be at 10 to the 15 tokens, which would be 10 x the tokens of our 10 to the 14 GPT-five target. And that's the annual upload to YouTube. So that would be only if you wanted to do all those YouTube tokens following a Chinchilla scaling law, you would be 100 x ing past our hypothetical GPT-five budget. So then we'd be talking hundreds of billions of dollars of compute, which it's funny, that's like on the balance sheet for the world's biggest tech companies. That's like the size of cash and cash equivalents. Google, Microsoft, Apple, Meta balance sheet. It's crazy to think that it is in scope for those guys. Obviously, lot of engineering is going go into they're going need to buy a lot of chips, but they could just raw crunch all of YouTube if it really came to that. Okay. Cool. So we got 10 to the 14 target. We got 10 to the 18 from astronomy. We got 10 to the 18 from YouTube. Apply a token deflator when we get 10 to the 15 tokens from YouTube. All of YouTube on a chinchilla scaling law basis would be estimated at 100,000,000,000 to train on that scale. But again, another way to think about that is 10% of the tokens of YouTube would have to be good in some sense to get that same level. Okay. 
Useful data points to have in mind. I'll be referring back to these often. Let's just maybe start over at the top of the genomics section.
Nick Gannon: (29:25) Yeah. So moving on from astronomy and YouTube data, we move up an order of magnitude for genomics data. And the rate of growth in the genomics data space certainly outpaces a lot of these other spaces, where we're seeing about a doubling in the amount of genomics data generated every 2 years. And so right now, we're generating about 40 exabytes of genomics data a year, which definitely gets us to this point where we're at a million x the amount of data required to train GPT-4.
Nathan Labenz: (30:01) So that's interesting. For genomics specifically, coming in at 40 exabytes, 4 times 10 to the 19, it is kind of crazy that that's that much bigger than YouTube. The size of the human genome is about 3 gigabases; we each have about 3 billion bases. So in the raw data, we each have 3 gigabytes. So to get to exabytes from gigabytes, you basically gotta be doing a billion sequences a year. I don't think we're sequencing a billion humans a year, but you think about all of the other things that are constantly being sequenced, and it seems not crazy. So just as another anchor on this, I looked up the size of the dataset that is behind the recently released Evo model. They're open-sourcing a 300 billion token training dataset, which consists of 2.7 million publicly available prokaryotic and phage genomes. This is kind of crazy, because they don't even get here to the organisms with nuclei in their cells. This is all prokaryotic and phage, and that gets you to 300 billion tokens. So genomics projects hit 10 exabyte scale, 10 to the 19. That's a billion human sequences. So that's, like, the high end of what genomics data looks like. And all healthcare data, including, like, imaging and whatever, is even a lot bigger than that. That gets up to 10 to the 21 just raw bytes of data. Presumably, a lot of that is imagery. The low end, from the Evo paper: 300 billion tokens, all prokaryotic and phage, used to train that 7 billion parameter model, which is ostensibly developing some sort of life or cell model, which is something I really am eager to understand better. And if you're listening at this point and you're somebody who knows a lot about that, then definitely ping me, because I wanna do one of these for that as well. I've been actively pinging people and DMing to try to find the right scouts for the intersection of AI and biology. But again, 300 billion tokens curated just from prokaryotic organisms. 2 trillion was the amount of image data added to GPT-4 to get to that vision understanding. So it seems like, again, a pretty clear path to the scale of data that would be needed to certainly add a DNA modality. And this is where this stuff starts to get really trippy, when you start to think about adding modalities on a native, on-par basis that humans just have no intuition for. For all the tools and understanding we've developed with DNA, we cannot natively speak DNA. Nobody can do that. And there's now both a proof point to believe that models are starting to do that, even at 300 billion tokens, and then a path to easily scaling that 10x, which is the same size of image dataset that got us to GPT-4V. So you think, it seems pretty realistic that you could just add some sort of native sense for DNA into a language model. I imagine the curriculum learning aspect of that would be, like, quite important. A big part of how OpenAI has been so successful with their multimodal stuff in general has been recaptioning. Like, there's this process of refinement where they are creating better and better captions for the images, and much more tightly aligning the vision and language spaces that way. And I imagine there would be definite need for tricks there as well on the DNA side. To really make use of that, you would presumably need to annotate it in all sorts of ways. You'd want proteomic data or regulatory RNA data as well.
But if you started to interleave all of those together, especially with some sense of health outcomes (I started to think about putting that stuff interleaved with scan results), it sure seems like you could get to an integrated system that speaks biology as well as language and image. And I've not even been talking at the level of a theoretical GPT-5. This is the level of vision that already exists in GPT-4, which was the 2 trillion tokens.
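A quick sketch of the genome-scale arithmetic from a couple of turns back, using roughly 3 billion bases per genome, about a byte per base, and the 40-exabytes-per-year figure quoted above; the Evo and GPT-4 token counts are also the ones cited in the conversation.

```python
# Genomics back-of-envelope using the figures quoted above.

bytes_per_genome = 3e9            # ~3 billion bases, ~1 byte per base raw
genomics_bytes_per_year = 40e18   # ~40 exabytes stored per year

genome_equivalents = genomics_bytes_per_year / bytes_per_genome
print(f"human-genome equivalents per year: ~{genome_equivalents:.0e}")  # ~1e10

# Evo's curated corpus vs. the image corpus said to be behind GPT-4's vision:
evo_tokens = 3e11          # ~300B tokens from ~2.7M prokaryotic/phage genomes
gpt4_image_tokens = 2e12   # ~2T image tokens
print(f"Evo corpus as a share of that image corpus: {evo_tokens / gpt4_image_tokens:.0%}")  # ~15%
```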
Nick Gannon: (34:33) So moving on to areas outside of genomics and astronomy. This planet is really dripping in data, to a large degree, where there's 600 GPT-4s' worth of pretraining data in the World Data Center for Climate. The way that they actually go about accruing weather data is definitely very cool. They're scraping all four corners of the planet. And DeepMind certainly has a cool paper where they're predicting hundreds of weather variables 10 days in advance at 0.25-degree resolution in under a minute. It's just a lot of infrastructure from the entire world: satellite imaging data, and sensors on aircraft, in ships, in buoys, in hydrometeorological radars and weather stations, and radio nodes on weather balloons, and just the whole nine yards. They're just taking it from everywhere. That's wind, temperature, humidity, pressure, just a lot of different ways of attacking this.
Nathan Labenz: (35:33) Does it say in that paper how big that dataset is? Claude 3 opus. Tell me. What is the size of the dataset in this paper? Okay. Apparently, it does not say. Claude can't find it, but 37 years of weather data.
Nick Gannon: (35:55) Moving on, there's another several hundred GPT-4s' worth of training data in the US Census Bureau. The Apple Vision Pro has just a remarkable mechanism to accrue data. There's blockchain data. Really, the whole financial sector has a remarkable amount of data. In the financial sector, one of Geoffrey Hinton's lost proteges, Peter Brown, who was actually Hinton's first advisee, has allegedly been accruing a really large number of H100s, and he currently runs Jim Simons' hedge fund, Renaissance Technologies. Renaissance Technologies has had their premier fund return about 37% a year, every year, for decades. And there's definitely an interesting element where we are starting to have people, to a remarkable degree, using AI to essentially win the financial markets, and they appear to be increasingly using larger amounts of compute as they're doing it. And then moving on to the size of the Internet: the number of pages indexed by Google at any one point in time oscillates roughly around 50 billion pages. And estimates for the unindexed size of the Internet range anywhere from 25x the indexed size to about 2,000x the indexed size. And when examining what the unindexed Internet looks like: try to find an old tweet, and you'll find it's just not indexed on the Internet, and it's not gonna pop up. And this is the sort of stuff that's just lost in the ether, where it is accessible in theory, but it's just not searchable. And it's also worth noting the massive error bars here. So we're 25x to 2,000x larger than the indexed Internet. And it's sort of like asking, how big is the unobservable universe? Well, it's at least a little bit bigger than the observable universe, and it could be a lot bigger, but it's really tough to calculate.
Nathan Labenz: (37:58) Hey. We'll continue our interview in a moment after a word from our sponsors.
Nathan Labenz: (38:09) The other thing that we didn't cover is code, and that's probably one that we should at least look at for a second. Just pulling up the BigCode dataset from Hugging Face: it's 6 terabytes raw, 3 terabytes, they say, near fully deduped. So 3 times 10 to the 12, this code dataset, is 10 times bigger than the genomic dataset used in Evo, just hanging out on Hugging Face, deduped. And that would be 30% of GPT-4, 3% of GPT-5. That's an area where you can really generate an unbelievable amount of stuff.
Nick Gannon: (38:41) And then even in the synthetic data space, we're seeing a remarkable growth rate in the synthetically generated data from GPT and Claude and Gemini, where, doing a quick market sizing, we could estimate ChatGPT to have maybe 25 million daily active users, maybe 15 queries per day per user, which is potentially an overestimate, but I'd probably query GPT 40 to 60 times a day on the average weekday. And if we estimate about 1,000 tokens per response, we again get to somewhere in the teens to 45 trillion generated words by ChatGPT, which again is enough to train a couple of GPT-4s. And so you were speaking about this earlier: the degree to which nobody seems to be particularly concerned about the amount of data that we have. It comes up, and then there's a sly smile that goes across the face of the AI researcher, and they say, don't worry about it, we've got it figured out. And it might be synthetic data. It could be these huge wells of data elsewhere. But it definitely does seem like, at first glance, there is just a lot at our disposal.
Nathan Labenz: (39:53) Okay. Great. Well, let's do a quick recap of the bull case. I find the bull case pretty compelling. The bull case is: GPT-5 would need 10x more data than GPT-4, which would correspond to a 100x bigger training budget, which is like a billion-dollar or a couple-billion-dollar training run. And 10 to the 14, or 100 trillion tokens, is the target. Basically, that's 1 in 100 million parts of all the data that we're creating. That would be about 1 in 1,000 of all the email volume. It's roughly on the same scale as all of Twitter for a year. And other modalities seem to also be, like, pretty accessible. YouTube, tokenized even with a 1,000-to-1 pixel-to-token compression ratio, is 10 times that target with just 1 year's worth of data uploaded. So 10% of YouTube would have to be good for that to hit some sort of parity. Astronomical data, it's just a kind of, you know, ridiculous amount. Genomic data, another kind of ridiculous amount. And we already see that not-huge amounts are starting to work: 300 billion tokens uncompressed in the Evo paper, as compared to 10 to the 12, which was the scale of image tokens in GPT-4. So with the benchmark of 2 trillion tokens to add a modality, it seems like all the usual-suspect modalities would have plenty of data available. And then synthetic data was the last thing that we mentioned, where basically ChatGPT is generating more data than it was trained on. It's already generating more than it was trained on, and it's approaching generating as much as GPT-5 would need to be trained on. Obviously, quality there becomes the question, right? The bull case is, just the scale is so big that surely these small fractions of these wholes are good enough to work. That's honestly most compellingly obvious on these other modalities where we just can't interpret them natively very well. Any sort of being able to see DNA, so to speak, would be potentially game-changing. But the bear case presumably has to start with quality. Okay, sure, there's all that stuff. But how confident are you really that the fraction that you need is actually gonna be of the quality that you need to get there? So let's tackle that. Time for the bear case.
Nick Gannon: (42:34) Yeah. Absolutely. So a research paper from Epoch AI called "Will We Run Out of Data?" does outline a little bit more of the taking-us-back-to-reality view: how much of this data is usable, at least insofar as it's high-quality text data? And Epoch AI suggests that we're actually going to run out of high-quality text data between 2024 and 2026, and they claim the amount of high-quality data to be roughly on the order of 10 trillion words. So the amount of high-quality text data is roughly akin to the amount of data used to train GPT-4. They define this as books, news articles, scientific papers, Wikipedia, filtered web content. Just to sort of sniff-test this number, the amount of words of text in the Library of Congress is about 10 trillion words, or, you know, 10 terabytes of data. So if getting into the Library of Congress's storage is some sort of litmus test for "are you a high-quality text token?", it does seem fairly reasonable. On the one hand, we've got plenty of data. And on the other hand, most of it seems to not necessarily be of high quality. And certainly here we are over-indexing on the idea of text tokens, where building human-level systems could very much, even if not more so, be a function of vision or other modalities. But it also makes the estimates a little bit more consumable to reduce it down to a single modality. Yes. If we were to brute-force our way to the largest training run we could feasibly do, and this is a little bit of the Carl Shulman perspective, the max scale of a training run that we could get to is something on the order of $1 trillion, or 1% of gross world product. And investments like this, they're precedented, but they're very uncommon. The primary anchor point would be the US during the Cold War: the Apollo program was spending about 2.2% of GDP, meaning it's not completely out of the discussion to be spending 1% of gross world product on just a very large training run. But if we moved up an order of magnitude, it would fall out of feasibility. And furthermore, Apple's revenue is closing in on $400 billion. So there could be a degree to which we say, like, the numbers are sort of there for this to be just the final attempt at brute-forcing it. If there was no major algorithmic development, we could just throw, like, 30 aircraft carriers of money at it, or 10 International Space Stations of money, and see if we can't build this thing that's as generally intelligent as the human brain. More than likely, if you were gonna go about doing something like this, it would probably be on the defense department's dollar. Their current budget is about $800 billion. If that was siphoned into chip fab production over the course of half a decade to a whole decade, it does seem very feasible. And certainly there have been plenty of headlines recently about Sam Altman allegedly speaking to investors in the Middle East about $7 trillion. I didn't see any quote that was like, Sam said, "I would like $7 trillion." It was a lot more rumor mill. But there is evidence of Sam Altman, even as far back as 2015, having discussions with the Secretary of Defense about the role of AI and the role of compute in a national security context. And it does seem that if you were gonna do a trillion-dollar training run, and you were gonna have a cluster that was gonna be able to run this large pretraining, it would likely have to be in The United States.
And really, it does seem like the only 2 parties that could feasibly execute a training run at this scale would be Xi Jinping and Joe Biden. So if Sam Altman were to get this $7 trillion and was able to do a trillion-dollar training run, and maybe it's Google, maybe it's another player, looking into the investment side: when would it happen, and how much data would be needed for a training run of this scale? So GPT-3 was roughly on the order of about $5 million, and most estimates for GPT-4's pretraining are somewhere between $50 and $100 million. Like, with all the bells and whistles, I'm sure it's far more than that; there's plenty of R&D, plenty of side experiments going on, but this is just, like, straight GPU cost for the singular pretraining run. Following these investment trends out, where you're moving up an order of magnitude every 2 years or so, we see this sort of trillion-dollar training run coming somewhere in the early 2030s, if we're following this investment trend starting in 2012 and going through to 2024 thus far. And so the intuitions for how much data we would need would be anchored in 3 different but core trends. One is this investment trend, where the cost of the most expensive training run is moving up about an order of magnitude every 2 years. And then we also have Moore's law, the compute trend, where we're seeing this doubling of transistor density every 2 and a half years. And in some sense Moore's law is dying, but in the less parochial sense it's very much still alive, where we're interpreting the spirit of Moore's law to be something closer to the sort of Kurzweilian law of accelerating compute, where we don't really care about transistor density; we care about the amount of operations per second that we can do. And then third, there's algorithmic developments, where since 2012 we're seeing a doubling of effective compute about every 9 months. And what we mean by compute efficiency or effective compute is essentially the amount of floating point operations, or the amount of operations per second, required to get a 90% on test set x, y, and z. That is coming down, cutting in half every 9 months. And this is just a testament to better strategies and all of the algorithmic developments. We have more dollars going to compute, we have more compute per dollar, and we have more effective compute per compute. So in this sense, we can expect our trillion-dollar training run to be somewhere between 2029 and 2033, requiring about 85 petabytes of data, which is a lot of data. But if we're gonna take the non-tokenized version of YouTube, it's not even a single year of YouTube. And just to give us a couple of anchoring points here, most estimates put the processing of the human brain at about 11 million bits per second, implying that the human brain processes about 2 to 3 quadrillion bytes over a 70-year time frame. And in this sense, the data processed by the human brain is about 200 to 300x the scale of GPT-4's training data. And what we would need for this trillion-dollar training run would be about 8,500x the scale of GPT-4's training data. And so from the scale perspective, this human-level performance on cognitive tasks, it does seem quite reasonable, if it's truly a story of data, that it would be as we're approaching the amount of data that the human brain processes that we would start to get performance levels akin to that of the human brain on cognitive labor tasks. And yeah.
So we've got 85 petabytes that we need to come up with for our trillion-dollar training run, and that would be about 0.00013% of the data that was generated or replicated in the year 2020. And then, bearishly, we've got this Epoch estimate that suggests we would need about 8,500x more high-quality tokens of text data than exist to build this full corpus, if we were gonna do it exclusively in text.
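Sanity-checking those trillion-dollar-run figures with the numbers quoted above: 64 zettabytes in 2020, roughly 10 trillion tokens (call it 10 terabytes) for GPT-4, and the 11-million-bits-per-second estimate for the brain.

```python
# Sanity checks on the trillion-dollar-run figures quoted above.

bytes_2020 = 64e21        # ~64 zettabytes created or replicated in 2020
run_data_bytes = 85e15    # ~85 petabytes for the hypothetical run
gpt4_data_bytes = 10e12   # ~10T tokens, treated as roughly 10 TB of text

print(f"share of 2020 data: {run_data_bytes / bytes_2020:.5%}")                          # ~0.00013%
print(f"multiple of GPT-4's training data: ~{run_data_bytes / gpt4_data_bytes:,.0f}x")   # ~8,500x

# Human-brain comparison: ~11 million bits/second over ~70 years.
seconds_70y = 70 * 365 * 24 * 3600
brain_bytes = 11e6 / 8 * seconds_70y
print(f"bytes processed by a brain over 70 years: ~{brain_bytes:.0e}")         # ~3e15, a few petabytes
print(f"that is ~{brain_bytes / gpt4_data_bytes:.0f}x GPT-4's training data")  # ~300x
```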
Nathan Labenz: (51:01) You're coming at it from a top-down and a bottom-up. The top-down idea is: the largest conceivable training run would be a trillion-dollar training run. That would be 1,000 times the budget of the $1 billion that we've estimated for GPT-5. You can split the compute budget across increases in your parameters and increases in your data size. If we have a 1,000x compute budget, maybe we could do 10x parameters and 100x data size. So from the 10 to the 14 tokens that we had for GPT-5, you get to 10 to the 16 target tokens. Now, what do we actually have? Well, we know we have 10 to the 13, because that's what GPT-4 was trained on and that's what the Library of Congress contains. We've pretty clearly established we can probably get to 10 to the 14. 10 to the 16, though, does seem like that could be kind of a lot. But then there's also the estimate of 10 to the 17th; with my back of the envelope, I got to 10 to the 16th. We're within an order of magnitude, but that definitely matters when it's that last order of magnitude. How do we get that one more order of magnitude out of there?
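Nathan's top-down split, written out. The 10x/100x split is his back-of-envelope; the square-root split is what a strictly Chinchilla-optimal allocation would look like, added here only for comparison.

```python
# Top-down split of a 1,000x compute budget, starting from the ~1e14-token,
# ~$1B GPT-5-scale run estimated earlier.

gpt5_tokens = 1e14
budget_multiple = 1_000   # $1T vs. $1B, holding price per FLOP fixed

# Nathan's split: 10x parameters, 100x data.
print(f"tokens at 10x params / 100x data: ~{gpt5_tokens * 100:.0e}")          # ~1e16

# For comparison, a strictly Chinchilla-optimal split scales parameters and
# tokens together, roughly sqrt(1,000) ~= 32x each.
print(f"tokens at a Chinchilla-optimal split: ~{gpt5_tokens * budget_multiple ** 0.5:.0e}")  # ~3e15

# The ~1e17 figure in the discussion additionally assumes compute keeps
# getting cheaper per dollar and per effective FLOP into the early 2030s.
```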
Nick Gannon: (52:19) Yeah. So, essentially, we're saying, okay, we're spending a trillion dollars, we're getting more compute per dollar, and then we're also getting more compute per compute. So the effective compute is increasing.
Nathan Labenz: (52:32) Okay. So that last order of magnitude basically comes from compute also getting cheaper. Certainly a prospect. Now, it's also important to keep in mind, how did we pick the starting point here? We basically picked the biggest training run that seems economically somewhat plausible and then said, assuming that compute continues to get cheaper and we're maximizing the value of that budget, then to really do that, you would need something like 10 to the 17 tokens, which is 10,000 times GPT-4 and 1,000 times our target for GPT-5. So yeah, if it does take a trillion dollars' worth of compute, it does seem like we may struggle to get to that quantity of data at that quality. Then you'd have to either be synthesizing a lot of it, which may work. Claude 3 certainly has changed my thinking about how viable this refinement process really can be. I still am uncomfortable with the notion of just having AI self-critique to the singularity, but the quality of Claude 3 definitely suggests that they've got that top spinning pretty tightly. It's not totally crazy to think that you could generate data on the scale that we're talking about here. Definitely in code, there is the opportunity for this kind of self-play: create, modify, actually run it and validate on the fly, in ways that you could really generate an unbelievable amount of code.
Nick Gannon: (54:07) Yeah. Totally. I do champion code as the best source of synthetic datasets. Like AlphaGo and AlphaZero and AlphaGeometry, the whole Alpha series: the mechanisms that they've created to determine how to do Monte Carlo tree search at inference time, and how to do synthetic data generation and then self-play, and how those work together to create the only datasets that we have that are actually capable of generating superhuman performance in systems. And if the goal was to create systems with superhuman performance, we have to throw most of the human-generated data in the trash. We start over with just AlphaGeometry-like systems, but we scale horizontally, where you've got an AlphaGeometry and an AlphaFold for every game that could be played. So really, anything that you could put an Elo score on, like chess, could be constructed into an algorithmic architecture that's a value network plus Monte Carlo tree search, focusing on self-play and then updating, and then self-play and then updating. And then if you would like to do reasoning, the closest thing that looks like reasoning right now is Monte Carlo tree search at inference time, which is Noam Brown's work. If we let the model think for a while, can it essentially play as a neural network that has 10,000x more parameters? That's really a way to hack your scale by just thinking more about the problem. Those are a lot of the elements that are gonna be zeroed in on.
Nathan Labenz: (55:40) Yeah. That self-play stuff, man. My kind of final thought on that was just: can GPT-5 create a coding problem that even GPT-5 can't solve? The answer is probably yes, but also in a way that kind of leads you to superhuman coding in a pretty likely way, right? It's definitely easier to specify the problem than it is to come up with the solution. So could GPT-5 create demanding coding problems that it itself could then try to solve, and then train on the ones where it's working, and whatever? That formula seems like it is almost sure to work.
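As a concrete illustration of that generate-then-verify idea, here is a minimal toy sketch of the loop: propose a problem with tests, attempt a solution, and keep only the pairs that actually pass when executed. The generate and solve functions are dummy stand-ins for calls to a code model (no real API is assumed), and the exec-based checking is unsandboxed, so this is a toy, not anyone's production pipeline.

```python
# Toy "generate problems, attempt solutions, keep what verifies" loop.
# The dummy_* functions stand in for model calls; only the execution-based
# filtering step is real. Do not exec untrusted code outside a sandbox.

from typing import Callable, List, Tuple

Problem = Tuple[str, str]  # (description, test_code); tests assert correctness

def dummy_generate_problem() -> Problem:
    """Stand-in for a model proposing a problem plus unit tests."""
    description = "Write a function add(a, b) that returns a + b."
    tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0"
    return description, tests

def dummy_solve(description: str) -> str:
    """Stand-in for a model attempting a solution."""
    return "def add(a, b):\n    return a + b"

def passes_tests(solution_code: str, test_code: str) -> bool:
    """Run the candidate solution against the tests in a scratch namespace."""
    namespace: dict = {}
    try:
        exec(solution_code, namespace)
        exec(test_code, namespace)
        return True
    except Exception:
        return False

def self_play_round(generate: Callable[[], Problem],
                    solve: Callable[[str], str],
                    n: int = 10) -> List[Tuple[str, str]]:
    """Collect (problem, verified solution) pairs suitable for later training."""
    keep = []
    for _ in range(n):
        description, tests = generate()
        candidate = solve(description)
        if passes_tests(candidate, tests):
            keep.append((description, candidate))
    return keep

if __name__ == "__main__":
    data = self_play_round(dummy_generate_problem, dummy_solve)
    print(f"kept {len(data)} verified problem/solution pairs")
```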
Nick Gannon: (56:15) Yeah. Recursive self-improvement just seems like a tipping point, where at some point RSI starts working. And it was clear it wasn't even close to working with GPT-3, and with GPT-4 you see the glimpses, but ultimately, I don't think it would work.
Nathan Labenz: (56:30) I've seen a number of papers where there has been this change between 3.5 and 4: a lot of times with 3.5, the self-critique and attempted self-improvement loop makes it worse. And then with GPT-4, it'll get better, but it'll plateau after 3 to 5 rounds. That's been the thing in program improvement and also, like, chemical reaction improvement: it levels off after 3 to 5 cycles. But notably, that's not involving any retraining. So it seems that's when the loop really could start to be closed in a true takeoff sort of way. All right. Well, it's a pretty nice journey that we're taking people on here, because from the bull case, there's so much data, only this small amount should work, we've got all these different modalities, these things should be able to do all these different things in an integrated way. But on the other hand, wait, we don't have an obvious path to 1,000x-ing that, if that's what it were to take. And so then the question becomes, can we actually create a curriculum that gets the right grokking going on early enough in the cycle that you get the superhuman capabilities that you either hope for or fear? This has been cool. I'll walk around now doing these sorts of numbers more. And I think more people would do really well to be able to do that kind of quick sanity-checking on an order-of-magnitude basis. So hopefully, we'll help people do that. For now, Nick Gannon, thank you for being part of the Cognitive Revolution.
Nick Gannon: (58:04) Yeah. Thank you.
Nathan Labenz: (58:05) It is both energizing and enlightening to hear why people listen and learn what they value about the show. So please don't hesitate to reach out via email at tcr@turpentine.co, or you can DM me on the social media platform of your choice.