The ARC Prize: Efficiency, Intuition, and AGI, with Mike Knoop, co-founder of Zapier

Nathan interviews Mike Knoop, co-founder of Zapier and co-creator of the ARC Prize, about the $1 million competition for more efficient AI architectures. They discuss the ARC AGI benchmark, its implications for general intelligence, and the potential impact on AI safety. Nathan reflects on the challenges of intuitive problem-solving in AI and considers hybrid approaches to AGI development.

Apply to join over 400 founders and execs in the Turpentine Network: https://hmplogxqz0y.typeform.c...

RECOMMENDED PODCAST:
Patrick McKenzie (@patio11) talks to experts who understand the complicated but not unknowable systems we rely on. You might be surprised at how quickly Patrick and his guests can put you in the top 1% of understanding for stock trading, tech hiring, and more.
Spotify: https://open.spotify.com/show/...
Apple: https://podcasts.apple.com/us/...

SPONSORS:
Oracle Cloud Infrastructure (OCI) is a single platform for your infrastructure, database, application development, and AI needs. OCI has four to eight times the bandwidth of other clouds; offers one consistent price, and nobody does data better than Oracle. If you want to do more and spend less, take a free test drive of OCI at https://oracle.com/cognitive

The Brave search API can be used to assemble a data set to train your AI models and help with retrieval augmentation at the time of inference. All while remaining affordable with developer first pricing, integrating the Brave search API into your workflow translates to more ethical data sourcing and more human representative data sets. Try the Brave search API for free for up to 2000 queries per month at https://bit.ly/BraveTCR

Omneky is an omnichannel creative generation platform that lets you launch hundreds of thousands of ad iterations that actually work customized across all platforms, with a click of a button. Omneky combines generative AI and real-time advertising data. Mention "Cog Rev" for 10% off https://www.omneky.com/

Head to Squad to access global engineering without the headache and at a fraction of the cost: head to https://choosesquad.com/ and mention “Turpentine” to skip the waitlist.

CHAPTERS:
(00:00:00) About the Show
(00:06:06) The ARC Benchmark
(00:09:34) Other Benchmarks
(00:10:58) Definition of AGI
(00:14:38) The rules of the contest
(00:18:16) ARC test set (Part 1)
(00:18:23) Sponsors: Oracle | Brave
(00:20:31) ARC test set (Part 2)
(00:22:50) Stair-stepping benchmarks
(00:26:17) ARC Prize
(00:28:34) The rules of the ARC Prize
(00:31:12) Compute costs (Part 1)
(00:34:47) Sponsors: Omneky | Squad
(00:36:34) Compute costs (Part 2)
(00:36:40) Compute Limit
(00:41:00) Public Leaderboard
(00:42:58) The current AI ecosystem
(00:47:23) The four steps of solving a puzzle
(00:51:20) Intuition
(00:54:32) Human Intelligence
(00:56:06) Current Frontier Language Models
(00:57:44) Program Synthesis
(01:04:10) Is the model learning or memorizing?
(01:09:51) Improving the ARC dataset
(01:11:34) Step 3: Guessing the Rule
(01:12:51) Dealing with Ambiguity
(01:15:02) Exploring Solutions
(01:17:02) Non-backpropagation evolutionary architecture search
(01:19:49) Expectations for an AGI world
(01:24:11) Reliability and out of domain generalization
(01:28:35) What a person would do
(01:29:51) What is the right generalization
(01:35:32) The ARC AGI Challenge
(01:37:01) Postscript
(01:38:07) DSPy
(01:39:55) State space models
(01:43:28) Hybrid models
(01:48:32) FunSearch
(01:50:41) Kolmogorov-Arnold Networks
(01:54:18) Grokking
(01:55:42) Outro


Full Transcript

Nathan Labenz: (0:00) Hello and welcome to the Cognitive Revolution, where we interview visionary researchers, entrepreneurs, and builders working on the frontier of artificial intelligence. Each week, we'll explore their revolutionary ideas, and together we'll build a picture of how AI technology will transform work, life, and society in the coming years. I'm Nathan Labenz, joined by my cohost, Erik Torenberg.

Hello and welcome back to the Cognitive Revolution. Today, my guest is Mike Knoop, cofounder of Zapier and cocreator of the ARC Prize, the recently announced $1,000,000 public competition that's meant to motivate research into more sample-efficient and generalizable AI architectures, and which includes a $500,000 grand prize for systems that can solve the ARC AGI benchmark at a human level under strict compute and time constraints. If you've been living under an AI rock, ARC stands for Abstraction and Reasoning Corpus. It's a benchmark created by Francois Chollet in 2019 as a test of general intelligence. The test presents input and output pairs of two-dimensional grids, which demonstrate a specific transformation, plus a final input grid to be solved. The solver must first infer the rule being used to make the transformations, which is different in every puzzle, and then apply the rule to transform the final input into the correct output. Importantly, the test set is kept private to prevent systems from simply memorizing solutions, but you can see samples and solve a few for yourself at arcprize.org.

There's been a ton of discussion surrounding the ARC benchmark since the prize was launched, and I can specifically recommend recent Machine Learning Street Talk episodes as a great source of information on the techniques that currently top the leaderboard. Having listened to those and read lots more besides, I have to say that I'm still not quite sure what to make of the whole debate. On the one hand, I have to respectfully disagree with Mike, Francois, and anyone else who says that progress toward AGI has stalled. As someone who has used large language models intensively for the last 3 years for all sorts of practical projects, I feel like the progress in reasoning and problem solving, while certainly incomplete, is ultimately unmistakable. At the same time, the degree to which even the very best multimodal models like GPT-4o and Claude 3.5 Sonnet still struggle with ARC puzzles does seem important, and I would agree that any AGI worthy of the title would need to be able to do a better job on ARC-type problems. When I solve these puzzles for myself, a subconscious, deeper-than-language sort of intuition seems to be doing most of the work. I stare at them for a bit. Suddenly, I have a sort of eureka moment where I know what the rule is, and then things become relatively easy for me from there. Current AI systems are definitely not nearly as good as humans when it comes to such intuitive insights. And this really does matter, not only for ARC puzzles, but for the possibility of, for example, an AI scientist, which would need to come up with novel hypotheses that are sufficiently insight-driven as to be worthy of testing in the real world. To date, we've seen precious few sparks of that kind of insight coming from language models. And while that might emerge at higher scale, I certainly can't guarantee that it will. And in any case, a new technique that solves ARC within the rules of the contest would definitely constitute a notable step on the path to AGI.
Interestingly, at one point Mike suggested that such stringent efficiency requirements imposed by nature might have given rise to intelligence in the first place, as organisms that were able to make good decisions based on very limited local evidence would naturally have the best chance of survival. That framing does make me wonder, though, if a breakthrough architecture that solves ARC might prove unwieldy from an AI safety perspective. Before language models stole the spotlight, AI safety theorists anticipated small but highly capable systems and worried that while they might solve problems effectively, they wouldn't understand human values well enough to know when to stop. This is the origin of the paperclip maximizer thought experiment. If we imagine now a new system that can solve ARC puzzles with just 1¢ worth of compute, I would have to guess that it would not have room for the sort of understanding of values and ethics that we see from the likes of Claude today. And so I think one can reasonably worry about what might happen if such an architecture ever gets to the point where it can pursue open-ended goals.

After wrapping up the conversation with Mike, I stayed on by myself for a brief postscript in which I offer a run-through of a number of recent research results that I would draw inspiration from if I were to try to tackle ARC myself. On reflection, I tentatively hope that a hybrid system combining a language-model-like component that does understand human values with more algorithmic search and reasoning modules turns out to be the winning approach. Such an ensemble of different subsystems would be consistent with how humans are structured, and my feeling right now is that nesting powerful problem solvers within more holistic systems might be a promising way to improve practical utility and reliability while also keeping general-purpose systems under control. That would not conform to the contest rules and wouldn't be eligible for a prize, and some might object that it's still just brute forcing the solution. But if it's fast and cheap enough to compete economically with humans, I don't think that distinction will matter much in practice. AI systems are often quite alien, and for the purpose of transforming daily life, a functional general intelligence does not need to satisfy our intuitions or preferences about how intelligence ought to work.

Of course, I do not claim to have all the answers, and I expect to continue to be surprised by AI developments. In the meantime, I can confidently say that it is an awesome accomplishment to have created a benchmark that's remained unsolved for more than 5 years, and all of us should really applaud successful entrepreneurs like Mike for putting their money where their mouth is to try to encourage high-impact research. As always, we appreciate it when listeners share the show. We welcome your feedback, and I look forward to reading your resume if you're looking for a new role as an AI adviser or engineer. You can find the link to submit on our website, cognitiverevolution.ai. Now, without further ado, here's my thought-provoking discussion on the ARC AGI benchmark, the ARC Prize, and the future of artificial general intelligence with Mike Knoop. Mike Knoop, cofounder of Zapier and cocreator of the ARC AGI Prize, welcome to the Cognitive Revolution.

Mike Knoop: (6:12) Thank you for having me, Nathan. Excited to dig in today.

Nathan Labenz: (6:14) Yeah. Me too. You guys have done something that is not easy to do in today's world, and that is capture the attention of AI discourse and get it focused on a topic. With so many things going on, that is, in and of itself, quite a feat. So what I'm hoping to do today is just kind of dig into a little bit of the background of the benchmark, how you got interested in it, get into the rules of the contest, and then hopefully kinda spend the majority of our time brainstorming some possible

Mike Knoop: (6:45) How are we gonna solve this? That's why we're all here, isn't it? Right? We wanna beat the benchmark.

Nathan Labenz: (6:49) Yeah. And you've graciously put up a million bucks for people who either do it or, you know, come close in various ways. So we can get into that as well. I guess, first of all, you know, another not-easy thing to do is create a benchmark that stands the test of really any significant time in AI, given how quickly things are moving. When did you become aware of, and interested in, I imagine with increasing obsession, the ARC challenge?

Mike Knoop: (7:17) Yeah. The sort of CliffsNotes here: I cofounded Zapier. I was running all of our product engineering up until about midway through 2022, and I gave it all up to go back and just be an AI researcher at Zapier that year. The chain of thought paper that came out that January was one that really shook me off track of what I was doing. And I got really deeply curious about, like, are we on track for AGI or not? It felt like it was really important to know for Zapier's business and our customers, as well as just, like, as a human. I wanted to know. I think I first got exposed to Francois' research all the way back during COVID. He did, like, a pod with Lex Fridman, and I think that's where I first heard him talk about his paper On the Measure of Intelligence and the ARC benchmark. I thought it was a curiosity. It kind of resonated with some long-term AI ideas that I'd been thinking about since college, but there were other sorts of things I was working on with Zapier at the time. And as I got more into AI research as an individual contributor and an engineer and starting to build the stuff at Zapier, I got really into AI benchmarks. Turns out benchmarks are really, really important for defining and guiding the quality of the systems that you're building. And globally, one interesting thing about benchmarks that I found when I was digging in there was, like, all these AI benchmarks we have are basically saturating up to human-level performance. And in fact, it's been happening faster and faster as time has gone on; specifically, as these large training runs have scaled up, they're sort of meeting human performance benchmarks faster. And at some point, I went back and looked at ARC, the Abstraction and Reasoning Corpus that my co-creator of ARC Prize, Francois, created. And I saw the opposite. It was actually slowing down on progress. It had been decelerating actually over the last 4 years since it got introduced in 2019. And that's what led me to go, Oh, okay. That's not what I expected. Dug in way more. And what I found, or believe I found, is I think that's not an accident. That is a very special, important benchmark relative to all the other AI benchmarks. At this point, we're 4 weeks into the competition now with a bigger prize pool, and no one has come forward yet and said, Hey, I think I've got a better AGI benchmark that exists, or even an AGI benchmark that exists. I think ARC is perhaps still the only AGI benchmark that exists in the entire world that was singularly designed to measure AGI and separate it from AI. I think the fact that that's true and the fact that we haven't beaten it yet, in fact, progress has been slowing down, is what led me to get involved, pitch Francois, and try to blow up the prize pool and try and grow awareness of the benchmark.

Nathan Labenz: (9:34) Are there any other things that are like this that are maybe not so AGI flavored? And by the way, I have a very similar story, albeit with a, you know, less well known company. I had a, you know, kind of background interest in AI for a long time, and it was 2022 also where I was like, you know what? I just kinda wanna focus on AI all the time now, and I was very fortunate that I had a teammate who could take over running the business. So kind of parallel timing on that. One other thing that I've seen that I wouldn't say is, like, so relevant perhaps, but it is an interesting data point, is aesthetic evaluation of images, which was something at Waymark that we really valued more than most. We found that there was, like, a particular dataset that had been created, and you go and kinda look on the leaderboards on, like, what is it, open papers or whatever, and it was not going very far. And we were really struggling to figure out how do we do aesthetic evaluation of images. Interestingly for us, that has largely been solved with the latest large language models that are now, you know, obviously multimodal, and we can just ask, like, does this look like a good image to use for this particular business? And we get pretty good answers. I guess anything else that you see out there that's even kind of in the same ballpark as this, where it's

Mike Knoop: (10:51) Well, slowing progress, or, like, benchmarks that exist that have not been beaten yet? ARC is obviously not the only, yeah, benchmark that exists that has a low score. SWE-bench, for example, is another one that I'm aware of that has a very, very low state-of-the-art score. I think what's unique about ARC is that it was built off of a definition first. Right? So as I got into this are-we-on-track-for-AGI-or-not curiosity question, the first thing I found was that there's kind of 2 big schools of thought that I see existing today in public discourse around AI on how to define what AGI is. The first school of thought is actually, I'll know it when I see it, it's undefinable. We shouldn't really even bother trying. We're already just on track and don't have to worry about it. And then the other school of thought got kinda popularized by OpenAI. It's this definition that AGI is a system that can outperform humans at the majority of economically useful work. And this is actually literally written into the terms of the agreement between OpenAI and Microsoft, by the way. Like, they're not kidding. Fact. Yeah. If they achieve that, like, Microsoft doesn't get any more, like, IP ownership of future systems that they develop. So it's actually kinda interesting. I think Khoso might get first credit for defining that one, but certainly due to the success of OpenAI, I think this has become quite an accepted definition of what AGI is. And I don't think it's a bad goal to shoot for. I actually think it's quite a good goal. I think narrow AI that we already have today can accomplish that goal. And I think that's why it's actually not a very good definition, because it doesn't drive us towards what we really need and want, which is the generality, the AGI aspect of this. So to jump to the conclusion here, I think the right definition of AGI is one that Chollet defined back in 2019 in this paper, which is: AGI is a system that can efficiently acquire skill and apply it. And the skill acquisition efficiency is the hallmark of what general intelligence is. Maybe here's a quick qualitative thought experiment to help make it concrete. So we've had narrow AI systems for many years now, 5 plus years, that can outperform humans at games like poker, Go, chess, even self-driving cars now in some cases, other more complex games with language like Diplomacy. And you might look at that history and say, Okay, well that's generalizing. We're beating more and more of these games. But the reality is that the way that we're getting these systems to outperform humans at those individual games is the researchers and engineers have to start over from zero. They have to completely invent, okay, new search algorithms, go acquire new sources of training data, go reach new levels of scale in order to solve it, bring new ideas into the fold. Yet, I could sit down with you, Nate. I could teach you a new card game and get you up to human-level proficiency pretty quick. I could teach you a new board game and get you up to human-level proficiency pretty quickly, in just probably a few hours, merely just by exposing you to new experience, right, new training. And so I think that is the hallmark. This shows that narrow skill alone is not a hallmark of intelligence. That can simply be achieved through memorization. Instead, it's your ability as a human to rapidly and efficiently acquire new skill and apply it to things that you've never done before.
That's what the hallmark of generality is, and that's the facet of AGI that has been stalling out on progress over the last 3 or 4 years. We've gotten very, very good at building narrow AI systems and targeting them towards specific tasks and outperforming humans on those. No one in the world knows how to build a system that can, with the same level of human efficiency, go acquire new skill in a domain or a task that it's never been exposed to before and solve that. And that insight is what ARC is built on top of. So that conceptual framework and grounding is how ARC got created. It's not just, like, a random benchmark that happened to be good or happened to be hard, but actually had a conceptual underpinning. And the fact that we have empirical evidence now 4 years in, that it's unbeaten, progress has been slowing down, there's a really strong conceptual framework underlying it, is what leads me to believe that, like, we really truly still have a lot of progress to make towards the generality aspect of these AI systems today.

Nathan Labenz: (14:38) There's some fascinating philosophical questions there that I wanna maybe come back to in a minute. Before we do that, and maybe to sort of motivate it as well, let's just talk about, like, the rules of the contest, because I think the concrete rules maybe inform, like, certain aspects of how different people are thinking about, you know, what would constitute an AGI. So I've just noted these down. I'll say some to you. You can comment on them or expand on them or correct me if I get

Mike Knoop: (15:03) anything wrong. Maybe we should, like, talk about what if people haven't seen what these the tasks actually look like, that actually might be, like Yeah. Sure.

Nathan Labenz: (15:09) I honestly assume that most people, if you're listening to the Cognitive Revolution, have seen the released puzzles.

Mike Knoop: (15:14) Yeah. Sure. You can play with, like, 6 examples that are on the homepage at arcprize.org and get a feel for them. The idea is that they look like IQ puzzles. And how I like to think about the puzzles is they're sort of like a minimal reproduction of what general intelligence is. The key insight of how all the ARC tasks were designed is every single task is novel. It should be something that you have never seen before. It should be something that an AI system that you've built to try and beat the task has never seen before. That's one of the reasons why ARC has a public dataset that folks use for training and designing their systems. Then there's a private dataset, which is what the competition is run against. Very few humans have ever seen the private dataset, and that increases the level of confidence that if someone shows up and says, Hey, I've got a system that can beat ARC at the 85% mark, which is where we set the grand prize, that that's a good solution, that that's actually something that has generalized to solve these tasks that it never got exposed to. It never got to train on the answer. It never got to train on code where other humans were writing code that tried to solve similar-looking puzzles. If ARC continues to endure, I think it'll be because of that fact that every single task is novel in the task set. And you don't need world knowledge, you don't need language necessarily to solve these puzzles. They're instead grounded on about 7 or 8 core knowledge priors, things like objectness, goal-directedness, symmetry, rotation, masking, things that kids develop very, very early on in childhood development. They acquire these core skills. And that's all you need in order to beat the benchmark or beat an individual task. I think that's one of the reasons why ARC tends to be very straightforward for humans, but evidence shows, like, no AI system can.

Nathan Labenz: (16:52) Yeah. It's really fascinating. I guess I had 2 questions there. One is: the way that they're posed, from what I saw when I looked at downloading the dataset, is just as 2D arrays of numbers with 0 through 9 in each place. Right? Yep. Then the images that are presented on the website, those are, like, not part of the core dataset but are just sort of a visual presentation. Is that true?

Mike Knoop: (17:15) Yeah. That's right. Many people have, like, built viewers to view the different puzzles. The data format is just, like, a JSON file. It defines all your task demonstrations. So you get an input example and an output example, and you get usually several of those per puzzle. And then you get a test input: a JSON array of, like, a matrix of 0 to 9. Your job is to write a program that can generate the right output for that test input, matching the pattern that's shown in the input demonstrations.
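To make that format concrete, here is a minimal Python sketch of loading and checking a single ARC task. The key names (train, test, input, output) follow the public dataset's JSON layout; the file name is hypothetical, and the trivial solve function is only a placeholder for whatever approach a contestant actually builds.

```python
import json

def load_task(path):
    """Load one ARC task: 'train' holds demonstration pairs, 'test' holds the held-out input(s)."""
    with open(path) as f:
        task = json.load(f)
    return task["train"], task["test"]

def solve(train_pairs, test_input):
    """Placeholder solver: a real entry must infer the transformation from the
    demonstration pairs and apply it; here we just echo the input unchanged."""
    return test_input

train_pairs, test_pairs = load_task("sample_task.json")  # hypothetical file name
for pair in test_pairs:
    prediction = solve(train_pairs, pair["input"])
    # Grids are plain 2D lists of ints 0-9, so exact equality is the scoring check.
    print("correct" if prediction == pair.get("output") else "incorrect")
```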

Nathan Labenz: (17:40) That's even assuming a little bit of the approach. Right? Because your test is really just to provide the output. Right?

Mike Knoop: (17:46) Yeah. Ultimately, you just give the output. The only interesting thing is the numbers are arbitrary, or I should say the colors are arbitrary. So this is maybe one thing I've seen some people get stumped on: like, oh, maybe the colors, like, mean something in some of the puzzles. They do not. There's no special nature assigned to any of the colors that are on it. You could swap out a different palette, and all the puzzles would be just as solvable as they are with any other color palette. I think 0 representing black might be a special case in some puzzles, but generally, like, humans should still be able to deal with that fact and still solve them. So the color choice of the palette is completely arbitrary. And instead, the numbers are more symbols to represent different things in each square.

Nathan Labenz: (18:23) Hey. We'll continue our interview in a moment
Mike Knoop: (18:25) after a word from our sponsors.

Nathan Labenz: (18:28) Was the test dataset originally created 5 years ago and kept private this whole time, or was this sort of an expansion of the dataset now to create this private test set?

Mike Knoop: (18:40) The direct answer to your question is that the 2024 contest is using the exact same private dataset that Francois created back in 2019, and that was the same dataset that got used in 2020, 2022, and 2023.

Nathan Labenz: (18:52) Cool. That is definitely a lot of foresight to keep that private. I mean, I think it'll become a big trend in general. We're starting to see that with, like, Scale's recent SEAL benchmark Yeah. Where they're not publishing the test questions. So that's definitely a headwind to, well.

Mike Knoop: (19:06) There's, like, 2 really interesting things here that might be worth talking about. One would be, kinda, like, the landscape of benchmarks that exist. I'm seeing, like, a handful of different categories of benchmarks that exist that could be interesting to describe. The other thing I'll say on ARC real quick is that it is not a perfect benchmark. Like, we know that there are deficiencies in how it got created. I don't think anyone, Francois certainly included, envisioned that it would be enduring quite as long as it has. And he did as best as he could with the resources he had available to him 5 years ago. For example, during the 2020 competition, after that one wrapped up, some folks did some analysis to look at what percentage of all of the puzzles, if you were to kind of, like, ensemble all the solutions, could be solvable by brute force. The state of the art in 2020 was only about 20%, I think, at the end of that first contest. But if you ensembled all the solutions together, just merely using brute force, and in fact AI models were not even really a technique that got used in the first competition, you could solve upwards of 40%, 50% of the puzzles purely through brute force search. So we know that the benchmark is not perfect. The hope is that there's enough remaining novelty in the rest of the set that it actually is illustrative and points the right way towards AGI. And so far that's been very true over the last 3 or 4 years. It's kind of interesting to see that in some respects, like, the state of the art on the private dataset today on Kaggle is still only 39%. So it's interesting to see that these more AI model based approaches are kind of, in some respects, just catching up to what brute force search could possibly do. But we know in order to get to 85%, we're going to need something much further beyond brute force, just because of this combinatorial explosion problem that starts happening with the minimum description length that's necessary to solve a lot of the different puzzles. So we know the benchmark's not perfect despite the fact that it's been there for 4 years. And one of the things we want to do is continue to make improvements in the benchmark. Now, we won't do that during the contest period, but this is something we're hoping to do during the downtime between the annual contests: to try and improve ARC to make it the best, strongest, like, measure of AGI that we can. And I think, like, one maybe interesting thing I've heard friends wanna talk about is, like, we really wanna stair-step our benchmarks alongside our system strength. So as we start to discover more and more general AI systems, we're going to want to increase the sort of novelty and the generality that exists in our benchmarks. You kind of need to, like, stair-step them up alongside each other. We're actually using the generality of the systems we've made to design better and stronger benchmarks. Right now, all of the ARC tests are generated by hand. Francois designed all of the initial test set. Now there's some crowdsourcing that's gone into more of the latest datasets. But every single one of them is generated still by a human, and it's hand-verified still by a human to make sure that they're solvable. But I would expect as we make more progress towards AGI, in order to make stronger benchmarks, we're going to actually have to use some of our weak AGI-like systems in order to design better benchmarks. So I expect that's kind of going to be the future of how the ARC challenge evolves as we go forward.
For the time being, with the regime we're in, we're going to have to just keep having humans design the puzzles and adding more novelty, looking at the puzzles that have the most degree of novelty and trying to make sure that more and more of the dataset is representative of the types of puzzles that can't be just solved via brute force search. And then as time goes on, as we get weak AGI systems starting to emerge, using those systems directly to make the challenge itself stronger and the best, you know, measure of AGI that we can.
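To make the brute-force baseline Mike describes concrete, here is a minimal sketch of that kind of enumeration. The tiny set of primitive transformations is purely illustrative, not the DSLs actually used in the 2020 competition, and real solutions compose much richer primitives, which is where the combinatorial explosion he mentions kicks in.

```python
from itertools import product

# A toy DSL of grid transformations (illustrative only; real DSLs are far richer).
def rotate(grid):    return [list(row) for row in zip(*grid[::-1])]   # 90 degrees clockwise
def flip_h(grid):    return [row[::-1] for row in grid]               # mirror left-right
def transpose(grid): return [list(row) for row in zip(*grid)]

PRIMITIVES = [rotate, flip_h, transpose]

def brute_force(train_pairs, max_depth=3):
    """Enumerate every composition of primitives up to max_depth and return the
    first program that maps each demonstration input to its output."""
    for depth in range(1, max_depth + 1):
        for program in product(PRIMITIVES, repeat=depth):
            def apply(grid, prog=program):
                for step in prog:
                    grid = step(grid)
                return grid
            if all(apply(p["input"]) == p["output"] for p in train_pairs):
                return apply
    return None  # nothing this short fits; the search space grows exponentially with depth
```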

Nathan Labenz: (22:25) Crazy times ahead. No doubt. When we get the proto AGIs developing the AGI benchmarks, that's when we'll know we're really in some sort of takeoff situation.

Mike Knoop: (22:33) There's this interesting world of, like, benchmarks kind of emerging. Right? Maybe I'd call out, like, 3 big categories. You've got private benchmarks. These are things that, like, Zapier, for example, has in house. Right? We're building private datasets of our own usage data in order to make sure that our products using AI are actually really good for our customers. These are narrow. It's not a general-form benchmark, so we're not gonna publish it, but it's very useful for us for benchmarking the quality of our systems. Then you've got this category that's on LMSYS, right? The sort of Elo scores where they're pitting multiple AI systems against each other and having a human evaluator say which one's better. And I think this maybe gets to your idea that you were talking about a few minutes ago of using human aesthetic criteria to decide the quality of the system, where you're using the generality of the human to make a decision about which system is sort of better or not. And then the last category is, I think, we're going to see more stuff like ARC. The SEAL benchmark from Scale, they're doing it in the spirit of something very similar, where they're only allowing the biggest models one shot at the benchmark. In order to try and eliminate the risk of overfitting, they're trying to do things where the tasks are sort of sight unseen before they get exposed to it, to try and be a proxy for the novelty of the task and be a better measure of generality. And so I think we're going to see more and more benchmarks taking that form, where the tasks themselves are novel and they're kept private, or at least attempted to be kept private, in order to increase confidence in the reported benchmark and reduce sort of overfitting and the contamination that can otherwise happen.

Nathan Labenz: (23:56) Yeah. That seems very smart. I mean, I don't know how else, with so many people training so many models and so many datasets flying around and people, you know, training on GPT-4 outputs and all that kind of stuff, it's like there does need to be some sanity brought to this. And I think the private test set definitely was a remarkably forward-thinking approach.

Mike Knoop: (24:14) And so we launched ARC Prize 4 weeks ago, and one of the things we launched alongside it is a new public leaderboard. The private competition is where the big grand prizes are, and then we have a public leaderboard that folks can use that allows usage of, like, Internet access and frontier models. One of the first people who really put a lot of effort in on the public dataset is this guy, Ryan Greenblatt, who worked on using GPT-4o to sample thousands and thousands of Python programs, being fed in the images and then an ASCII representation of the puzzle, and then searching over the programs that are generated in order to find ones that match the sort of pattern of the test demonstrations and applying it. And this has been, like, a really fascinating approach that is, I guess, working. His solution is at the top of the public leaderboard, where he's getting like 42%. And there is a risk with his solution: you could maybe claim or make the argument that, like, oh, because GPT-4o has trained on the, you know, ARC dataset that's directly on GitHub, that, you know, it somehow has some sort of advantage, because the public dataset, you know, is getting contaminated. I actually don't buy that story. I think the more likely situation of what's happening here, and the reason why models like GPT-4o are able to get such high performance with this, like, program sampling approach, is that for many years now, people have been trying to beat ARC public puzzles by writing code to do it. They've been creating DSLs. They've been writing code to solve each individual puzzle. And then they're putting that code on the internet. And so that code is what's getting into the training data for these large language models. And I think when you're using an LLM to sample solutions to potential ARC puzzles, it's getting access to programs that other humans have written in the past. That's my best belief about maybe where there might be some degree of overfitting or contamination leaking in. But there is still something important about the approaches that folks like Ryan are coming up with, because they do have very good agreement from a score perspective with the new semi-private dataset that we made. And the fact that he's only sampling maybe 2,000 reasoning traces per puzzle, whereas if you were to go back to 2020 and look at, like, how many reasoning traces they had to generate for the brute force, it was, like, you know, tens of thousands, hundreds of thousands. So there is something cool about how using these language models as kind of a perception engine to direct and guide a program search seems to be working. There is something that is special and really novel and interesting there.
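As a rough illustration of the general shape of that sample-and-filter approach (this is not Ryan's actual code), the loop below samples candidate programs from a language model, keeps only those that reproduce every demonstration pair, and applies a survivor to the test input. The ask_llm_for_program function is a hypothetical stand-in for whatever prompting setup a given implementation uses, for example images plus a text rendering of the grids.

```python
def sample_and_filter(train_pairs, test_input, ask_llm_for_program, n_samples=2000):
    """Sample candidate Python programs from an LLM and keep only those that
    reproduce every demonstration pair; then apply one survivor to the test input.
    This mirrors the overall shape of the approach described above, not any
    particular implementation."""
    survivors = []
    for _ in range(n_samples):
        source = ask_llm_for_program(train_pairs)  # hypothetical: returns source defining transform(grid)
        namespace = {}
        try:
            exec(source, namespace)
            transform = namespace["transform"]
            if all(transform(p["input"]) == p["output"] for p in train_pairs):
                survivors.append(transform)
        except Exception:
            continue  # most sampled programs fail to run or fail the demonstration check
    return survivors[0](test_input) if survivors else None
```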

Nathan Labenz: (26:31) Yeah. Absolutely. That's kind of where my head goes immediately, as somebody who, you know, has, again, had this sort of long-standing general curiosity about AI, but has really gotten into it full force the last few years. Like, naturally, the language model wave of technology is kind of where my head immediately goes. Let's do the rules for a second, and then we can kind of contrast, like, what those rules would require against what somebody like Ryan has managed to make happen, and then maybe come up with some other possible techniques or approaches as well. So there are 100 private puzzles. These are the ones that nobody has seen outside of, presumably, the organizing humans.

Mike Knoop: (27:08) An important thing to note about the 100 private puzzles is they have all been hand-verified. We've had a couple humans, who are adults, take every single one to verify that they're, like, error-free. They're solvable by humans. So I think that's just one important note on the private data side. Even though very few people have seen them, they're all solvable by humans.

Nathan Labenz: (27:24) Cool. So to win the $500,000 grand prize, you have to submit to Kaggle, where the contest is being hosted, a solution that can get 85% right by November 10. You have to do this in a way where you're willing to open source the solution that you create. You can use pretrained models and licensed software if you have an appropriate license. Right? Yep. But no Internet, which means no APIs to today's leading frontier models. Then I think maybe one of the more interesting things that I was kind of trying to analyze is the actual amount of compute time that you have. So you get a P100 for 12 hours of runtime.

Mike Knoop: (28:15) That's right.

Nathan Labenz: (28:15) When I looked online, because I didn't know the specs of a P100 off the top of my head, it's going for about $300 on Amazon today, and the wattage is 250 watts. So just kind of thinking, okay, that's 720 P100-minutes. That's, like, 3 kilowatt-hours total. So kinda amortizing cost and trying to back into it, it basically comes out to something like $1 worth of compute. Would you say that's a reasonable estimate?

Mike Knoop: (28:49) Yeah. If you go from, like, first principles and you were, like, owning your hardware. I think if you go to spot pricing or, like, on-demand pricing, when I was looking, it's, like, 20 or 30 dollars probably for the 12 hours that you get. Either way, this is, like, donated compute from Kaggle. Right? We can talk more about why the limit exists; I think there actually are some important things behind that. But yeah, I think that's about how much it would be, like, if you were just gonna go try to put that on the open market.

Nathan Labenz: (29:10) I was figuring if I buy one and, you know, plan to use it over 2 years and then just factor in the cost of electricity, I get to about a dollar. So another way to look at that is you have 7.2 minutes per problem for each of the 100 problems. That would also be, like, roughly 1¢ of compute per problem with my analysis; you know, maybe a higher price point if you're buying in the open cloud market, whatever.
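Spelling out that back-of-the-envelope arithmetic in a small sketch (all figures below are the rough assumptions Nathan just described, not official contest numbers; the electricity rate is an assumed value):

```python
# Rough cost estimate for 12 hours on a used P100, per the assumptions above.
gpu_price_usd    = 300              # assumed used purchase price
gpu_lifetime_hrs = 2 * 365 * 24     # amortized over ~2 years of continuous use
power_kw         = 0.25             # 250 W
electricity_rate = 0.12             # assumed $/kWh

runtime_hrs = 12
hardware    = gpu_price_usd * runtime_hrs / gpu_lifetime_hrs   # ~$0.21
energy      = power_kw * runtime_hrs * electricity_rate        # ~$0.36 for the ~3 kWh
total       = hardware + energy                                # same ballpark as the ~$1 estimate above
per_puzzle  = total / 100                                      # on the order of a cent per puzzle
print(f"total ~${total:.2f}, per puzzle ~${per_puzzle:.4f}")
```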

Mike Knoop: (29:34) Yep.

Nathan Labenz: (29:35) I'm not sure if you would have better clarity than I do on what models would fit and be able to run there and, like, what kind of tokens per second you would get. One way I tried to triangulate that was just looking at prices from, like, Fireworks AI on what their inference costs for Llama 3 8B are. Those are roughly 20¢ per million tokens. So if I said, okay, well, what does 1¢ buy me? That would be roughly 50,000 tokens out of their commercial infrastructure. I would think a Llama-like 8B model will fit, right, because, if I understand correctly, it's 16 gigabytes of RAM that the P100 has. I think that's right. I have

Mike Knoop: (30:18) go look it up. I just googled it. That's right.

Nathan Labenz: (30:20) Sure. Unless I got something wrong in my process.

Mike Knoop: (30:22) I know it's on we listed all the technical details on ARC Prize website, but I'm pretty sure that's right.

Nathan Labenz: (30:27) So you should be able to get an 8B model in there, but, like, not too much bigger than that. And do you have any idea what sort of token rates you'd be able to generate?

Mike Knoop: (30:38) Off a Llama 8B?

Nathan Labenz: (30:39) Yeah. If you had, you know, or something in that general spirit on the

Mike Knoop: (30:43) Hey, I don't know that I've seen anyone publish it. So this is actually one of the interesting things we're trying to do with ARC Prize: over the last 4 years, it has not been a requirement to share your code or approach. So the information that we have from past contestants is pretty limited. Just to kinda narrow in on the rules specifically, one of the things we're gonna require in order to claim any of the prize money, whether it's the progress prize that we set aside for this year or the grand prize if someone achieves the 85%: in order to claim that, you have to commit to publishing your solution into the public domain, open sourcing your code effectively under, like, a public domain license. And our sort of anticipation is that it's gonna take multiple years to beat ARC. That this is not gonna get solved this year. It's gonna take multiple years. And then we're hoping to use the annual contest window as a stair-step function to sort of rebaseline the community at the end of every year with a bunch of new open source code and approaches that folks can look at to see how other people are doing it, what approaches they're using. So the one that I do know about, well, I don't know about the Llama 3 8B on a P100; somebody would have to go try it. I do know the state of the art with Jack Cole, who's getting 39% on the private leaderboard, is using, I think, a Salesforce, like, CodeGen T5 model or something like that. And they're doing something really interesting where they're doing test-time fine-tuning. And that T5 model is a 220 million parameter model. It's a very small model. I think they've experimented with some bigger ones, but they're getting really good state-of-the-art performance off of a relatively very small model. And I think they had to do a very small model like that in order to work with this test-time fine-tuning, where they do some pre-training as well, and that maybe got a couple percentage points. But the main insight of what they're doing is at runtime, they're, like, generating new puzzles in response to the puzzles that they're shown at test time, and then doing fine-tuning on their model and then directly prompting it, which I think is quite cool. It leads me to believe that we're probably likely to solve ARC with a 7B model and 10,000 lines of code. That would be my best guess today if you kind of forced me to make a bet on, like, how much compute and how big of a system we need in order to sort of get to the 85% mark.
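For readers wanting a picture of what test-time fine-tuning looks like in outline, here is a schematic sketch of the idea as Mike describes it; every function name here is a hypothetical placeholder, not Jack Cole's actual code: build augmented variants of the one task's demonstration pairs, run a short fine-tuning pass on the small model, then prompt the adapted model on the real test input.

```python
def solve_with_test_time_fine_tuning(model, task, augment, fine_tune, prompt):
    """Schematic only; augment, fine_tune, and prompt are hypothetical helpers.
    1. Build a synthetic training set from this one task's demonstration pairs,
       e.g. via color permutations, rotations, and reflections.
    2. Run a short, cheap fine-tuning pass on the small model using that set.
    3. Prompt the adapted model with the real test input."""
    synthetic_examples = []
    for pair in task["train"]:
        synthetic_examples.extend(augment(pair))

    adapted_model = fine_tune(model, synthetic_examples, steps=100)  # brief pass within the runtime budget

    return prompt(adapted_model, task["train"], task["test"][0]["input"])
```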

Nathan Labenz: (32:44) Hey. We'll continue our interview in a moment after a word from our sponsors. So that was one of my other questions. You are able to, like, in that 12 hours of runtime, basically, anything goes. You can even fine-tune on the private dataset. I suppose that's what they're doing.

Mike Knoop: (32:59) Yeah, anything you want. Right? The no-Internet rule is the main, like, limitation, right, which notably doesn't allow you to use frontier pretrained models or outsource your compute to an API provider. Maybe there are, like, 2 conceptual things to talk about here on the compute and no-internet limits, because these have both been, like, pretty loud points of feedback, critique, comments that we've seen over the last month or so since we launched ARC Prize. And let me address the compute one first, since we've been spending more time on it. The compute is an important benchmark for efficiency. If we come back to Francois' definition of AGI, a system that can efficiently acquire new skill: if efficiency was not part of the definition, this would mean that general intelligence was merely a brute force search. And yet, we know that's not the case. Francois formalizes this in his paper, which you can read, from 2019. Or, maybe using a quicker thought experiment: if you go take the ARC puzzles yourself and just look at the ones that are on the homepage and try to introspect, how did I figure out the answer to that? You are not sitting there and thinking through thousands of possible transformation steps or programs in your head and trying to apply them. Instead, you use your perception ability, your experience, in order to shrink down the set of all possible permutations or transformations to what it might likely be. And you're usually only doing deterministic kind of evaluation in your head for maybe 3 to 5 potential, like, solutions there. So the idea of the compute limit is really to force researchers to reckon with the fact that efficiency is part of the definition. We're gonna keep increasing compute over time. The honest answer is we don't know how much compute is needed yet in order to solve ARC. We already, I think, 2 to 3x'd the amount of compute that you get for this contest period over what folks got last year. I think last year you got an even weaker GPU and only 3 to 4 hours of runtime. And so there's somewhat of a practical constraint in the fact that Kaggle is donating compute for the contest. We're gonna work with what they are able to provide. That will keep going up over time as these sort of compute-flops-per-dollar scaling laws keep holding. But you should expect that that compute limit is going to keep going up as time goes on here. This is perhaps an interesting contrast, by the way, to the public leaderboard. The public leaderboard trades a compute limit for a dollar limit. I know we haven't talked about this yet, but the public leaderboard is a separate leaderboard. It's not part of ARC Prize, and you can't win prize money on it, because it measures the public dataset, which has the answers that are out there. So it could be liable to contamination or overfitting. However, we built this separate leaderboard because we had our own deep curiosity of how frontier models could do. You're allowed to use the internet on the public leaderboard. And on the public leaderboard, we swapped the runtime limits and the compute limits for a dollar limit. You can use up to $10,000 for online commercial API calls on the public leaderboard. And so the private leaderboard has the Kaggle sort of pretty hard-constrained compute limits, and the public leaderboard allows a fairly uncapped amount of compute, more of a dollar limit as opposed to a hard bit of hardware that you're required to use. Somewhere in between we'll figure out what the truth is.
And maybe I'll make the meta comment that it's pretty interesting today that the state of the art on both the public and private leaderboards are in pretty good agreement with each other. They're only, like, a couple percentage points apart.

Nathan Labenz: (36:17) Yeah, definitely. I noted that as well. And so that difference... I guess I don't actually know exactly what Ryan ended up spending with his technique, but the

Mike Knoop: (36:26) cap of thousand dollars. Both are maxing out runtime and cost, which is maybe another interesting meta commentary of, like, why we're gonna commit to contributing keep upping the compute as time goes on. Because, like, right now, like, all the solutions are sort of maxing out their run times.

Nathan Labenz: (36:42) Yeah. So there the difference between a little fuzzy because of different things are obviously priced different ways, but we're looking at something like my estimate was 1¢ per puzzle with the private official contest specs versus a $100 per puzzle on the public leaderboard. So there, you're basically talking a 10,000 x difference.

Mike Knoop: (37:06) Yeah. Again, it's, yeah, based on, you know, on demand prices versus acquire your own. Yeah. It's a significant amount of additional compute capacity you get on the public leaderboard.

Nathan Labenz: (37:14) Cool. Okay. Interesting. Fascinating that they're right together.

Mike Knoop: (37:17) I kind of expect the reason 1 of the other reasons we published public leaderboard, that's not just, like, curiosity of how do frontier models do. It's quite easier to get started with, you know, API based models. And so folks that are maybe curious about the competition but just wanna, like, get started with playing around in a notebook, you know, try 1 of the existing off the shelf solutions, not have to go spin up and figure out how do private fine tuning or whatever. Robbing the public leaderboard can be kind of a source for folks to get started just a lot faster and be more of an accessible entry point to the contest. Similar to the private leaderboard, all of the public leaderboard high scores have code attached. So you can go to an ARC Prize at org/leaderboard and see Ryan's code. You can actually go look at it and copy it right in your own notebook if you were so willing or you wanna evolve or tweak on top of it. Quite expensive to run. I would recommend sampling at your sort of test set if you wanna attempt some new ideas there. OpenAI can be bright like an entry point. My expectation is that public leaderboard scores will roll down to the private leaderboard through kind of waterline effects where once someone shows that something is possible with existing levels of compute and algorithmic efficiency, that will encourage a very strongly motivated search to figure out, okay. How do we accomplish that same thing then within the runtime performance limits of the competition?

Nathan Labenz: (38:25) Yeah. The four-minute-mile effect for these things is very strong.

Mike Knoop: (38:29) You see this all over the place. Right? The four-minute mile; video game speedrunning is another area where you see this. In fact, the 2020 Kaggle competition for ARC had a very similar waterline effect, where for the first, like, month or 2 there was kind of very meager progress, then one person sets a new waterline 3 or 4 percentage points up, and then everyone hits it almost within just a few days. Existence proofs are really, really powerful, I think, and motivating.

Nathan Labenz: (38:51) Yeah. I think GPT-4 kind of did a version of that, just for the, you know, the language models and chatbots and all the product integrations, everything. It definitely sort of reframed for a lot of people what could happen.

Mike Knoop: (39:04) Yeah. Especially the fact there was commercial value attached to it too. Like, not only is it technologically possible, but OpenAI showed that there was a market there, yeah, for it too.

Nathan Labenz: (39:12) And will pay for this.

Mike Knoop: (39:12) One of the things that has led to us actually launching ARC Prize, in a meta way, is the dynamic of the current ecosystem around large language models. Because there is some small amount of economic utility with them, you know, it has sort of caused a reaction amongst AI research, and all the companies and labs doing research, to over-rotate and fixate on LLMs as the only way that things are gonna work. Right? In 2023, there was $20,000,000,000 of investment that went into language model startups. And by my rough estimate and count on my own, maybe, like, a couple hundred million into AGI startups, like, working on new ideas, new architectures, new learning algorithms, things like that. Literally a 20-to-1 ratio difference here in investment. The commercial market has also led to a lot of closing up of frontier sharing. OpenAI didn't share any frontier details on how GPT-4 worked in their paper. Gemini followed suit and didn't share any of their technical details on how the longer context, the million-token context window stuff, works for theirs. Unfortunately, I think we're just gonna see more and more of that, because there is a known market value now for what these frontier innovations mean. And yet this is in complete contrast to why you and I are even talking to each other, right? If you look at the history of the transformer or GPT-2, that got started all the way back in 2014 probably, maybe even earlier than that, arguably, when Ilya did the sequence-to-sequence paper at Google and published that openly. That got picked up and built on by Bahdanau, who made the attention mechanism at, I think, Jacobs University. That got picked up and brought back into Google with Vaswani and Shazeer, who did the Attention Is All You Need paper, right? 2017. Got picked up by Alec Radford and Ilya, now at OpenAI, who realized, oh, that's the key unlock, and that allowed GPT-2 to get built, then 3, 4, and so on. So the reason we find ourselves in the environment we do today, with all the cool AI progress we've had, is because of open progress, open sharing, and open science. And that's just not the world that exists in 2024 right now. One of the goals of ARC Prize is to play a small role in trying to counterbalance some of those things and bring awareness to the issue, and also motivate people to work on new ideas again and find a way to get those ideas put into the public domain to reaccelerate some of the open progress.

Nathan Labenz: (41:22) As a technical question on the rules, if somebody submits a solution, are they committing at that point to open source, or do they get to decide later if they wanna open source to claim a prize or whatever downstream?

Mike Knoop: (41:36) The eligibility checks happen at the end of the contest. So there's a handful of eligibility things. Like, there are certain countries you can't be in, because Kaggle can't pay out money to contestants in certain countries. But, yeah, basically, there's an eligibility check for the top 5 scores for this 2024 period. We'll go through and pick off the first 5 that are eligible, and those will be the ones. And part of that eligibility check is the commitment to publish and put the code into the public domain. They'd share it with us, and we'll probably act as a clearinghouse to publish it once the prize money is released.

Nathan Labenz: (42:06) Gotcha. I was just wondering, because of your comments on open science, if there's any risk of somebody scoring high and then being like, you know what? I think I'm gonna go use this to fundraise for a startup or, you know, go try to auction myself off to a big tech company or something.

Mike Knoop: (42:23) This comes back to the waterline effect. Right? Once you know something is possible, it's very motivating to others. Like, we're gonna keep running ARC Prize until somebody creates a solution and puts it in the public domain. So if someone just shows up and says, hey, I figured it out, I got the answer, but I'm, like, not gonna share it, then, like, we're gonna keep running the prize. Our goal is to motivate people to, like, figure out how to do this and figure out how to distribute it openly. And so, you know, I think there is a chance that actually happens. In fact, I've seen a few fun comments on Twitter where founders were like, hey, if you've got a solution to ARC Prize, don't share the code. Come talk to me, and I'll give you, like, a $10,000,000 starting salary. Which, like, I actually appreciate. I think it's kinda cool. I think it shows that people are starting to realize the value of new ideas. You're starting to see, I think, an emergent idea that's, like, countering the sort of dominant narrative of scale is all you need. And I think it's because there's a lot of evidence behind it. And I think that sentiment is what's gonna drive more and more researchers to go explore new ideas again, which is what we want in the end. Like, you know, our goal is to help accelerate open progress towards AGI with ARC Prize. And however we get there, I think there's a lot of paths it can go, but I think fundamentally, the only way it's gonna happen at this point in time is we have to convince and motivate and inspire more would-be AI researchers or existing AI researchers to work on new ideas again. So

Nathan Labenz: (43:38) depending on your time, what I thought I might do next is just take a minute on the kind of introspection point and talk through how I feel like I'm doing it, and then start to go into some of these possible strategies or new ideas.

Mike Knoop: (43:50) Let's do it. Let's keep going. I'm I'm kinda I'm having fun.

Nathan Labenz: (43:53) Okay. Cool. Me too. So when I try to think about what I'm doing when I solve one of these puzzles, I basically broke it down into 4 steps, and I wonder if you would say the same for yourself, or different, or, you know, compare and contrast. But, you know, for me, it kind of starts with... and I actually have an interesting alternative definition. Right? It's interesting to me anyway. An alternative definition of intelligence that

Mike Knoop: (44:18) What's your

Nathan Labenz: (44:18) I think actually is it it fits very nicely with Zapier in a way because I sort of think about it in the context of workflows or, like, broader structure, broader scaffolding. Obviously, Zapier being almost synonymous with a no code way to create these sorts of things. The definition that I come down to is intelligence is the ability to succeed on a task where we don't have an explicit algorithm to do it. So I think of, like, for example, recognizing digits. Right? I went and asked Claude 3.5 Sonnet to write me a program in explicit code to recognize digits, it got to, like, 14%, you know, out of which is just slightly over random.

Mike Knoop: (44:59) And

Nathan Labenz: (45:01) to my knowledge, like, there's not really still, to this day, any good algorithm for solving MNIST that doesn't involve learning. And yet, we can identify the digits, obviously, super easily and, you know, so can a trained model. So that's, like, a more minimalist definition in the sense that

Mike Knoop: (45:17) I think there's something real there or, like, true there. Right? Like, this idea that intelligence is what you use when you don't know the answer. Or intelligence is, like, what you use when you don't know what to do. Right? That's this general definition. I think the sort of Francois definition of efficiently acquiring skill has some, like, formal definition and math behind it that he goes through in the 2019 On the Measure of Intelligence paper. But I think, spiritually and conceptually, I think there's agreement between those 2 definitions that Yeah.

Nathan Labenz: (45:48) There's definitely something in common. I think his is more demanding in the sense that I am willing to count something as intelligence even if it is heavily trained or has, like, seen plenty of these tasks before. I'm not putting so much emphasis on efficiency. I'm just emphasizing the ability to succeed in the absence of an algorithm, you know, that could tell you how to do it.

Mike Knoop: (46:11) The thing that definition leaves open is brute force. Right? If that were true and only true, that would mean that general intelligence could be an algorithm which merely does a brute force search over all possible algorithms that exist. And we know that's not actually quite how human level general intelligence works. And in fact, efficiency is quite an important aspect of early intelligence emergence too, right? In fact, efficiency may be the gradient by which intelligence emerged. Early organisms had to navigate their environment and they wanted to navigate to food. They wanted to navigate to avoid danger, right, predators. They wanted to reproduce. And there is risk to taking steps, taking action in their environment. If these early proto organisms are just sort of randomly moving around or using a brute force search to navigate their environment, it's not going to be very good. It's gonna run into a lot of dangers and things that it doesn't want. It's gonna waste a lot of energy at minimum trying to navigate that environment. And so I suspect that efficiency might be 1 of the gradients evolution used in order to get intelligence to emerge in the first place, so that those early organisms are making smarter local decisions about how to navigate towards food, navigate away from predators, and so on. So my hunch is that efficiency is actually a really, really fundamental aspect of what general intelligence is and how it emerged in the first place.

Nathan Labenz: (47:35) Yeah. That's interesting. It definitely makes sense. I mean, these definitions get so nuanced, and the intuitions are definitely worth unpacking, I think. The rules do not have any limit on how much resource you can put into training before you submit. Right? You can submit your model with arbitrary training. It just has to actually run within the 12 hours.

Mike Knoop: (47:58) That's right. You can do as much unlimited pretraining as you want. That goes for both the public and private leaderboard.

Nathan Labenz: (48:03) Yeah. So, I mean, that's interesting because I was gonna ask, is there anything in nature that suggests that this is possible? And I think humans sort of do, in the sense that we can solve 1 in 7.2, for example. You would then look back and say, well, we've had all of evolutionary history to get to where we can do that. But then you basically allow for the same thing. Like, you can train as intensively as you want. It's inference time efficiency that we're concerned with here.

Mike Knoop: (48:33) I do think that's something a lot of researchers, maybe myself included, by the way, miss when first thinking about, like, efficiency and where the energy in the system lies. You know, it's not correct to just think about starting at birth for a human. Right? Certainly, there's a lot of training data that you sort of get exposed to as a child as you develop. But the fact that humans seem so predisposed to acquiring these core knowledge skills extremely early on, you know, a year old, for example, with motor skills, seems to suggest there is something in the brain architecture that predisposes us to acquiring that skill in some way. And that is definitely an outsourced search. Right? That architectural discovery is something that we've had tens of thousands of generations and a lot of energy go into. You almost have to count up the energy that's gone into all the lifetimes of humans and that sort of ancestral tree in order to make a more accurate assessment of the efficiency there. I get pretty inspired by this. I do think existence proofs, again, are really important. We have 1 really, really strong existence proof of general intelligence that exists in all of our heads. We don't know what the architecture exactly is, yet it exists. And so this kind of, in my mind, motivates a search. It's possible to create. It's possible to build. It's possible to figure out what it is. We've not done maybe the best job yet to date of searching the space of what those architectures possibly look like. We're maybe a decade into deep learning now. And what? There's, like, 7, 8, 9 maybe mainline architectures, from CNNs, RNNs, transformers, xLSTMs, RWKVs, all the state space stuff. This suggests that the search space of possible architectures is actually very rich, and suggests that there's lots more to find. If I came back 20 years from now and said, hey, by the way, we ran out of new architectures in 2024, I think you'd probably be very surprised that that was the case. And each of these architectures has slightly different properties, at least around the characteristics of efficiency and inference speed and things like that. So far, we have not yet figured out the architecture that allows the generality to emerge quite as strongly yet. So there's still something missing. Maybe the space of architectures, in terms of how we define architecture, is still kind of limited or naive in some way. But there was 1 unsupervised search that led to general intelligence, and that is evolution. I think that's at least worth knowing about and realizing from, like, a possibility standpoint.

Nathan Labenz: (50:56) Yeah. It certainly also suggests that there might be some creative hacks that could be quite fruitful avenues to explore, and this is sort of, you know, inspired by my own introspection. So tell me if you do this in the same way. My 4 steps are, first, and I'll kind of also score these on, like, how much intelligence or sort of inexplicable intelligence I feel in each 1. First, there's the visual spatial kind of, I wouldn't even say it's, like, reasoning so much as more just, like, detection of salient features. There's something where it's like, oh, that cross, you know, seems to be an important thing. It's, like, jumping out at me off the screen somehow.

Mike Knoop: (51:34) Yep.

Nathan Labenz: (51:34) That part feels to me, like, a little bit hard to turn into an algorithm. Like, I could define an algorithm for, you know, identify all the lines or identify all the contiguous things or identify anything that has, like, a hole in the middle or whatever. Right? I could program all that out. But there is something when I'm actually doing the puzzles where it's like, I see a cross there. I see a cross there. That seems like I'm onto something. Right? So it's

Mike Knoop: (51:55) That's that objectness core prior part that, you know, you gained very early on as

Nathan Labenz: (51:59) a kid. Ident yeah. And it's funny. I was looking also in in preparing for this at kids' toys and just noting how many of them are sort of, like, there's a star thing, and you have to put it into the star hole. Right? And then there's, like, the rectangle, and you put it into the rectangle hole. So that is something that people at a young age are obviously, like, not entirely born with,

Mike Knoop: (52:17) even though they... I would call what you just described, like, perception. And perception is something deep learning is really good at. We have lots of examples and lots of knowledge about how to build really effective perception networks. So I would say that first step you just described is not the hard part, or at least it's the part that has lots of ideas attached to it today currently.

Nathan Labenz: (52:34) Yeah. Seems like there's a lot of, like, CNN type approaches, or I was even gonna suggest maybe a state space angle that could work on that. But, interestingly, it isn't so easy for the frontier language models. Like, Ryan's report on GPT-4o is

Mike Knoop: (52:50) that it kinda sucks at that. Mhmm.

Nathan Labenz: (52:51) I've been messing around with Claude 3.5 Sonnet, and it's, like, well, it feels a little bit better than GPT-4o, but it doesn't... Yeah. Certainly not great. You know? It has this sort of, like, blurred vision somehow that is really interesting to experiment with. But, yeah, I would agree that that's probably not the hard part. The next part for me feels like where the magic happens, and it speaks directly to your point about efficiency as opposed to search, and it's the part where I'm like, this feels like the hard part. For me, I call it guessing the rule. It's like, given this sort of meditation for a second on, like, what seems important here, I really don't know where this comes from. But most of the time, I can pretty quickly just, like, intuit the rule. And introspect as I have tried to, I really don't have any explanation for what is happening between a somewhat conscious noticing of these, like, apparently salient features and the translation from that to... it kinda starts with an intuition, and then I can verbalize it. But I feel like there's this intuition that kind of just bubbles up out of nowhere that's like, oh, you gotta fill in all the holes with yellow or whatever. Right? And I have no idea where that's coming from. That seems to me to be the most core function here, not only because of these puzzles, but also when I think about AI for science, which for me has been a growing obsession of late. I think, like, what would be AGI for science, or what is really gonna move the needle in science? I think there's a couple different candidates for that. But the big 1 would be if you could have a model that could generate hypotheses that are high enough likelihood to actually be true that it's worth the investment to actually do the wet lab work and validate those hypotheses. We're starting to see that actually a little bit. Yep. Like

Mike Knoop: (54:39) Yeah. That strikes me as how we're using AI systems right now for, like, protein synthesis stuff.

Nathan Labenz: (54:45) Yeah. Exactly. That's a leading area in that regard, and a lot more, I think, to come in that domain as well.

Mike Knoop: (54:50) Guessing the rule. That feels right. Like, the current best approaches today, or at least on the language model side, are brute forcing it. Right? They're trying to generate lots and lots of rules, thousands of them, and then they're using an outer loop of deterministic code to check all the rules that got generated against the examples in the set and apply the ones that work to the test set and use that. In contrast, I think how humans seem to do it is we're using some of that first thing you talked about, your perception ability. We're using some sort of perception ability to shortcut the rules that we think about. And those perceptions are grounded on all those core knowledge priors that you've, you know, been gaining experience on your entire life, all the way from a really early age. And so there's something there that hasn't gotten figured out yet of, like, how do you use a deep learning or perception network to really effectively steer and guide a, like, program synthesis style engine where you're generating these potential programs. And then you can still use, you know, deterministic code to check them. But something like that feels like the most direct near term way to solve ARC. I will maybe point out, though, just 1 interesting observation: no 1 sat down and designed a program synthesis engine into the human brain. That was an emergent characteristic of something else: the architecture and the scale of the brain allowed an emergent program synthesis engine to sort of exist. So there's kind of, like, 2 competing schools of thought of how do you sort of figure out this program synthesis guided deep learning system. And 1 would say, well, go search for new architectures, go search at new levels of scale, and try to figure out, can you design a system where the program synthesis engine emerges organically from that? And the other school of thought is like, it's faster and more direct to try to have a good insight about the problem here, and don't try to evolve or discover a system that has that emergent property. Instead, just build it directly, because we may feel like we have a good enough intuition of the things that are necessary in that deep learning guided perception network to just jump straight to the answer. There are trade offs between the 2, right? The sort of evolve and discover method is very inefficient. It's going to take a lot of energy and scale in order to go try a bunch of different architectures and different learning algorithms to figure that out. But it has potentially more likelihood of global convergence on actually discovering 1. And by the way, we know 1 exists because it's in all of our heads, right? So it's like maybe it does a better job of searching the global search space there. Whereas attacking the problem directly, going and building and designing a program synthesis engine that's guided by deep learning, trades off: it's gonna be much more efficient, but maybe it's less likely to work, since there isn't as strong an existence proof there. And that might be where that approach could run into limitations. If it doesn't bear fruit, it'd be like, oh, there's something missing still that we haven't figured out quite yet.
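A minimal sketch of the brute-force rule-generation loop described above, assuming ARC's JSON task format (train demonstration pairs plus a test input). The candidate generator here is a hypothetical stand-in for an LLM or an enumerator, not anyone's actual solution:

```python
# Sketch: generate many candidate programs, keep only those that reproduce every
# demonstration pair, then apply a survivor to the test input.
# `generate_candidate_programs` is hypothetical; grids are lists of lists of ints.

from typing import Callable, List, Optional

Grid = List[List[int]]
Program = Callable[[Grid], Grid]

def solves_all_demos(program: Program, demos: List[dict]) -> bool:
    """A candidate survives only if it maps every demo input to its output exactly."""
    try:
        return all(program(d["input"]) == d["output"] for d in demos)
    except Exception:
        return False  # crashing candidates are simply discarded

def brute_force_solve(task: dict, generate_candidate_programs) -> Optional[Grid]:
    demos = task["train"]
    test_input = task["test"][0]["input"]
    for program in generate_candidate_programs(task, n=1000):
        if solves_all_demos(program, demos):
            return program(test_input)
    return None  # no candidate explained the demonstrations
```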

Nathan Labenz: (57:37) Yeah. I wonder, I mean, the people at the frontier model development companies these days, and I've had a couple conversations on this topic, I imagine that they would say this will all kind of come out with scale. Right? And the argument there would be basically that the way these are formulated is essentially a classic few shot prompt. The big unlock of GPT-3, and the title of the paper was Language Models are Few-Shot Learners, was, my god, you know, we can sort of just give a couple of examples of any arbitrary task within some pretty wide bounds.

Mike Knoop: (58:13) And the thing can not

Nathan Labenz: (58:14) only infer the task, but then go on and do the task. So is there something about this that you think doesn't

Mike Knoop: (58:20) Well, I think learning there is probably not the right word to use. You know, there's this effect that you see happen with language models as the parameter counts go up, where you need to put fewer and fewer examples into the prompt in order to steer the output tokens with a higher degree of accuracy and consistency. Smaller models, 7Bs, 30Bs, you might need to do, for example, 5 or 10 really well crafted examples in your prompt in order to get it to do a good job on whatever your test example is. Whereas with GPT-4o or Claude Sonnet, you might only need 1 example. Heck, in many cases you don't need any, right? But if you want to steer it a little bit, you might just need 1. I think the intuition I have on why this is the case is that you're sort of trading off where in the system you're putting your energy. With a bigger, larger model, you're sort of putting more of your examples of the programs or the ideas of the training directly into the pre training phase. And so you don't need as many in the prompt in order to steer the model to the ravine of activation weights that are close semantically to it in sort of the manifold space of all the programs it has. Whereas with the smaller models, they might only have 1 example of the type of task you're prompting it on in their training data. And so you actually have to overload with more examples in the prompt in order to get the weight activations to kind of discover where it's at in the pre training and have it work. So I think that's kind of the trade off. In both cases, though, at least with language models as they're currently designed today, we're still in a regime of direct inference, direct prompting off of memorized training data. And I don't mean memorization in the sense of, like, oh, they're just spitting out tokens; they're sort of able to make connections in slightly more abstract ways. But fundamentally, it's still limited by the programs it's been shown in its training data. It's not able to generalize to novel programs or tasks that had no bearing anywhere at all in its training data. And I think that's the big thing that's missing. That's just, like, the key idea about the current architectures of how we have language models designed that I think will limit their impact on the ARC benchmark. I think we'll run out. I suspect with the current architectures, they maybe get to 50 or 60%. I don't think that'll get us all the way to the grand prize, though.

Nathan Labenz: (1:00:30) That's interesting. I think I would take the other side of that on just pure scaling, and I'll just throw out a couple of kind of findings that inspire me to believe that. 1 is I think about this paper all the time still. It was from 2022, and the way I kind of bookmark it in my head is gradient descent in the weights. I've got a few of these since then. But this 1 stood out to me also because it was a US China collaboration, which I always think is of extra note in the great power context too. So what they did was they designed an implementation of gradient descent that they encoded into matrix form. And then they went out and looked for that same general pattern of weights in models that have been pretrained in the normal way, and they found it. And they were like, oh, look at this. We are able to demonstrate now that the models are implementing a sort of gradient descent at runtime based on the few shot examples. That to me does feel like learning. I mean, that doesn't feel like memorization intuitively, but that does feel like a more sort of meaningful

Mike Knoop: (1:01:39) Let me maybe clarify. I don't mean to say that, like, language models don't do any generalization. They clearly do. That's why they're so special and cool. That's why they have economic utility: because they are more general than any system we had ever come up with prior to 2017 with the discovery of the transformer. We have not made more progress towards generalization, though, since then. All of the current language models still use this same underlying generalization architecture that we've had all the way back from the very beginning here. There was a paper I saw, I think it was a month ago, I think it was called On the Paradox of Learning to Reason. And what the researchers did was they used a BERT transformer and they created a synthetic deductive reasoning dataset. So things like if x, then y. If y, then z. Given x is true, is z true? And so they came up with thousands of these little synthetic toy deductive reasoning rules. Some of them required just 1 leap of deduction of reasoning. Some of them required maybe 5 or 6 steps where you had to plug things into each other. And they sampled from their data set. Let's say they made 1,000 of these. They took 500 of them and sampled from a forward direction, kind of given the rules and looking at the output. And they sampled the other 500 from the backwards direction, I think. Yeah, given the output, what's the rule? And what they found was when they trained this BERT transformer using just simple back propagation, off the shelf stuff, on the 500 from, let's say, the first dataset and tested on the 500 from the first dataset, it got 100% accuracy. Same thing on the second dataset. Trained on 500, tested on that 500, 100% accuracy. But when they trained it on 500 from 1 distribution and they tested on the second distribution, the accuracy fell off very predictably: basically, as the number of reasoning rules and hops you had to do went up, the accuracy went down to below 50%. So that's itself already kind of interesting, but then the researchers did an interesting thing. They tried to say, well, could we hand code the weights of a BERT transformer model in order to get 100% accuracy no matter how you sample from the source synthetic data distribution? And they were able to achieve this. I forget exactly how; the code's in the paper if you want to go look at it. They abused their human knowledge of the BERT transformer architecture, where it does broadcasting, and they used that. They stacked a bunch of network layers together and used that broadcasting to mimic the deductive reasoning steps that you have to do. But they were able to build a system that used the same off the shelf BERT transformer and got 100% accuracy on this deductive reasoning. And so what's the takeaway here? I think 1 interesting thing it suggests is that perhaps the transformer architecture is capable of really high degree, exacting accuracy generalization, but our training algorithm is wrong or insufficient. The simple back propagation training algorithm could never have discovered the set of weights that the researchers hand coded. And the researchers are using the general intelligence in their brain. That's where the general intelligence existed in that latter example. Right? They're, like, using their human brain to figure out how to abuse the architecture. That's where sort of the novelty came in.
But I think it says something kind of interesting about, like, you know, maybe current architectures are sort of sufficient, and it's more the learning algorithm that we need to fix.
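For concreteness, here is a rough sketch, not the paper's actual code, of how a synthetic deductive reasoning dataset of the kind described above could be generated, with chained if/then rules and a varying number of hops:

```python
# Sketch: build chained if/then rules (X -> ... -> Z) with 1 to 6 hops, and label
# each example by whether the queried symbol actually follows from the premise.

import random
import string

def make_example(num_hops: int, provable: bool = True) -> dict:
    symbols = random.sample(string.ascii_uppercase, num_hops + 2)
    chain = symbols[: num_hops + 1]                       # the implication chain
    rules = [f"If {a} then {b}." for a, b in zip(chain, chain[1:])]
    random.shuffle(rules)                                 # rule order shouldn't matter
    # Negative examples ask about a symbol that is not reachable from the premise.
    query = chain[-1] if provable else symbols[-1]
    return {
        "context": " ".join(rules) + f" {chain[0]} is true.",
        "question": f"Is {query} true?",
        "label": provable,
        "hops": num_hops,
    }

dataset = [make_example(random.randint(1, 6), random.choice([True, False]))
           for _ in range(1000)]
```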

Nathan Labenz: (1:04:46) Yeah. I do think that is quite interesting. I mean, there's so many different meanings of generalization too. Right? Because you've got so many different kinds of data and all the... you know, if you just trained a transformer at some scale just on these tasks, presumably it would work. I think that's been kind of the stated expectation. That's, like, a little bit outside the rules. I think the idea was, like, I think Francois said, if somebody just generates a ton of puzzles and trains on them, then they'll probably be able to solve it.

Mike Knoop: (1:05:11) And there is 1 known deficiency of the benchmark. We wanna make the benchmark better. We talked about this. Right? We know the first version is insufficient in many ways. And this is 1 of the ways: given unlimited pre training, you could sit there and try to generate every possible ARC puzzle that could ever exist and train a model on it, then prompt it at test time and rely on the memorization, the fact that it's just been exposed to every possible ARC puzzle that ever existed or could exist, and do it. You could also do brute force search via a data center. You could, like, plug in and search every possible program. There are ways to beat it today. But hopefully, the existing rules around efficiency, with, like, the compute and run time limits and no Internet, help increase the confidence that, like, a score or a result is a true result. The current benchmark's not perfect, and we are gonna try and work on improving it with v2 and so on in the future.

Nathan Labenz: (1:06:01) So what just as a digression before we get into ideas to solve the current 1, what does that look like? Is that, like, changing the format, or is it just being more creative and off the beaten path in the test set?

Mike Knoop: (1:06:13) The near term stuff, and this won't change for this contest period, like, that's part of the contest. The thing that I think we want is to have ARC and ARC Prize be a very strong guiding light towards what AGI actually is, and a measure and a benchmark for it. Any benchmark that could simply be beaten through brute force is not a very good benchmark. So that's the thing we're trying to solve. And 1 approach we're thinking about and considering is to look at basically which of the tasks are least susceptible to brute force search, empirically, just based on the last 4 years and this contest period as well. And try to understand, about the puzzles that are least susceptible to brute force search but still very straightforward and easy for humans to solve, what are the features and facets of those? And use that as inspiration to go generate a much larger dataset. 1 of the other challenges or deficiencies of the ARC dataset is it's kind of small. There's only 100 private examples in the test set. That does lead to some statistical issues on the margins. It'd be much better if it was at least double the size. But yeah, we want to grow, increase the size of the dataset, and we're gonna try and add new examples and new tasks into the dataset by looking at, like, okay, well, what are the features of the ones that have the highest degree of novelty in them? Right? And are the least susceptible to brute force search?

Nathan Labenz: (1:07:36) Cool. Interesting. So just to round up my heuristics, or my breakdown. We had the perception portion, which I think we're agreed there are approaches out there that can do this. There's the guessing the rule, which for me is where most of the magic happens and also is, like, the thing that could really unlock things in AI for science and so many other interesting areas. Yep. Step 3 is, like, basically writing the program. This 1 I'm sort of back and forth on because I'm like, I don't have an algorithm that tells me how to translate a natural language specification into code. But, nevertheless, it still feels more algorithmic somehow.

Mike Knoop: (1:08:18) So, check the rule. Right? Like, in your head, you have a thesis of what the rule is, and you try it. You, like, test against your demonstrations, and you're like, okay, that worked. Now I'm gonna use the interface to actually design the output based on that rule.

Nathan Labenz: (1:08:31) And then the final thing is just kind of trying to figure out what I did wrong when it didn't work. Like, was it that my rule wasn't quite specified in the right way, or that I maybe made a mistake in implementation? From reading Ryan's post on his use of GPT-4o, it sounds like the programming is pretty good, but not awesome, and certainly, like, has mistakes. And then he has this kind of revision step, which sounds like it helps quite a bit, but probably still has a lot of the same core challenges where it's like Yep. The perception isn't great and coming up with the guesses, you gotta come up with a lot of guesses, so on and so forth.

Mike Knoop: (1:09:06) I think this is all a pretty good happy path. You know, I think there's maybe a few other things to think about that humans do while solving the puzzles that you didn't mention, that may or may not be important, but they certainly feel true for AGI. So 1 interesting thing is that, accidentally, the public dataset has had some mistakes in it, where there was, like, a pixel that was in the wrong spot or it got inverted. Maybe 1 of the task demonstrations of 3 or 4 got switched or something. And humans are able to work around this, actually. They are able to spot that there is actually an inconsistency in the rules and still are able to solve the problems. Usually, the way that they do this is through ambiguity resolution, where it's possible that there's a mistake or an error. There's actually a very small percentage of the puzzles that have intentional ambiguity in them as well, where you actually need the fact that you get 2 shots at it in order to solve the puzzle. Humans are able to deal with this. They're able to not just figure out, oh, here's the rule, and go directly to the solution. They're able to hold multiple possible rules in their head at the same time and have some degree of confidence against them. And, hey, I realize the fact that I have insufficient information in order to know what the answer is, and I actually have to go make contact with reality and test. And then I'm gonna use that feedback from reality in order to update. That's a new bit of information that I get to use in deciding what to do next. So that kind of ability to deal with ambiguity, I think, is a sort of important part. It probably gets closest into the guessing the rule part, but certainly is part of all of the, like, guess the rule, check the rule, and the iteration. Yeah. Okay.
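One way to make the "hold several candidate rules with some confidence" idea concrete is to score each candidate by how many demonstrations it reproduces, so a single mistaken demo doesn't disqualify the right rule, and then spend ARC's 2 attempts on the 2 best-scoring, distinct outputs. A rough sketch, with the candidate programs left abstract and hypothetical:

```python
# Sketch: rank candidate programs by the fraction of demos they reproduce and
# use the two ARC attempts on the two highest-scoring, distinct outputs.

def score(program, demos):
    hits = 0
    for d in demos:
        try:
            if program(d["input"]) == d["output"]:
                hits += 1
        except Exception:
            pass  # a crashing candidate just scores zero on that demo
    return hits / max(len(demos), 1)

def two_attempts(candidates, task):
    demos, test_input = task["train"], task["test"][0]["input"]
    ranked = sorted(candidates, key=lambda p: score(p, demos), reverse=True)
    attempts = []
    for program in ranked:
        try:
            out = program(test_input)
        except Exception:
            continue
        if out not in attempts:
            attempts.append(out)
        if len(attempts) == 2:   # ARC allows two guesses per test input
            break
    return attempts
```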

Nathan Labenz: (1:10:33) Cool. So that part about ambiguity is really interesting.

Mike Knoop: (1:10:37) We've debated, like, should we just have more, like, intentional mistakes in it just to, like, be really novel? And, you know, I think we've always come down on the side of, no, we should remove mistakes. We wanna have a very good, pure benchmark. But maybe there's room for, like, Yeah. I mean, every time benchmark

Nathan Labenz: (1:10:53) robustness is a real frontier as well. So

Mike Knoop: (1:10:55) Yeah. What I was looking at was, like, maybe this could inspire somebody who's thinking about, if you wanna build, like, an ARC-like benchmark, this would be an interesting thing to explore. It's, like, design lots of novel tasks where there's intentional mistakes in them, and the job is to deal with ambiguity and identify the mistake and still solve the task despite it. It'd be pretty interesting.

Nathan Labenz: (1:11:12) So speaking of things that we've explored, I imagine you must have tried some stuff like this. Like, you probably did a, you know, a version of, like, the Ryan approach with the frontier model. I mean, correct me if I'm wrong, but I would guess that you put some real effort into at least a couple different approaches before you decided that you weren't close and it was worth spinning up a contest like this and funding it and all that sort of stuff. Yeah. What did you try, and, you know, what, if anything, came close or felt exciting?

Mike Knoop: (1:11:39) For my personal approach, I kind of did 2 things. I tried all of the, like, open, off the shelf solutions that I could find from past years' contests. There's a few of them that are on the public leaderboard now that you can actually, like, go run on the private test set. And I just tried all of those. I tried plugging in the frontier models against them. Yeah, I didn't really do a lot of iteration on top of them. I was just trying to probe the benchmark and see what's state of the art today. It was around 20%, actually, without much improvement or iteration at that point with code from past years. And then I kind of outsourced my perception to the contest, because this is not the first running of the contest. Francois and this other firm called Lab42 have run a smaller version of the contest in past years. There were 300 teams, in fact, that competed in the previous version of it in 2023. So I kind of outsourced it: okay, the sort of known solutions aren't working, and I have pretty good insight about what those things are doing. It matches up with my intuition of why language models are sort of insufficient, from my Zapier research experience. And the fact that no 1 else, in a private sense, has been able to come up with something that's all that much better. It then motivated me to say, alright, I'm going to go down to first principles here and just start thinking through how do you solve this thing. My personal bet, and I don't have a solution here, I'm actually still working on it, it's 1 of my side projects, is that there might be a way to get an emergent program synthesis engine from a non backpropagation based search approach. Certainly a more indirect approach to beating ARC, but I think it's 1 that potentially scales better up to AGI if we can get it. And ARC is, in that world, 1 of the feedback mechanisms that system can use in order to do its evolutionary search. And ARC is a tool then, as opposed to the goal. So that's kind of how I was thinking about it. I don't have a result there yet. Still very early on that, but that's kind of my personal viewpoint, and I'm kinda working on it on the side, outside of Zapier and ARC Prize.

Nathan Labenz: (1:13:24) So when you say non-backpropagation evolutionary, that just means, like, changing the program, scoring the program, keeping the ones that score higher and

Mike Knoop: (1:13:34) Trying to do a new architecture. So, like, think about how do you do architecture search. And I should also say, like, neural architecture search is not a new idea. It's been around in the field for years and years and years. In fact, quite possibly the single largest compute suck prior to the transformer being invented was neural architecture search. And really, it's never amounted to anything. Nothing has ever really been discovered from it. And so I tried to understand why is this the case, because architecture search has this feature that it's a symbolic search. It's not using back propagation. So it's more possible to discover a global solution and converge to a global solution. And what I found was, in academia, because most research labs didn't have access to a ton of large scale compute, they would take shortcuts in order to get a result, even a small result, out of their NAS paper. And it almost all amounted to just hyperparameter tuning, where they would change the number of layers in the CNN or tweak the learning rate. If you can do a good job of defining all your hyperparameters, I think NAS actually does quite an effective job. What no 1 has quite figured out yet is how do you use architecture search and do a much more true, universal, relaxed search? And what could the form of that look like? In my head, it kind of looks like a dense, fully connected network where you're pruning weights. Some of the recent results this year on ternary weight systems or binary weight systems, I actually think are pretty cool, because they get you closer to an architecture search than a weight optimization function with gradient descent. When you're in a regime where you're just deciding, do any 2 nodes in my network connect to each other or not, and you define that with a 0 or 1, you are sort of literally defining an architecture there. And so that's kind of the thing I get excited about, trying to answer or figure out, do we have sufficient compute? Or this would be the bet: that over the last 4 or 5 years, the advent of language models and the scale that has come with them has forced us down the flops per dollar scaling curve sufficiently that we now have enough compute, available at a cheap enough dollar rate, that you can actually do a true, global, relaxed architecture search to discover the primitives of AGI, when that wouldn't have been something you could have done, like, 3 or 4 years ago. Certainly a longer bet than, like, the direct solutions to beating ARC, but 1 that I think is just personally, like, interesting, that I'm curious about.
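A toy sketch of the evolutionary, non-backpropagation flavor of search described here: an "architecture" is just a binary mask over which node pairs in a dense network connect, and the search mutates masks and keeps the fitter ones. The fitness function below is a placeholder, and all sizes and rates are illustrative, not a recipe:

```python
# Sketch: evolve binary connectivity masks without gradients. In a real system,
# fitness() would build the network the mask defines and evaluate it on tasks
# (e.g. ARC); here it just rewards sparsity as a stand-in.

import numpy as np

rng = np.random.default_rng(0)
N_NODES = 32

def random_mask() -> np.ndarray:
    return (rng.random((N_NODES, N_NODES)) < 0.1).astype(np.int8)

def mutate(mask: np.ndarray, rate: float = 0.01) -> np.ndarray:
    flips = rng.random(mask.shape) < rate
    return np.where(flips, 1 - mask, mask)

def fitness(mask: np.ndarray) -> float:
    return -float(mask.sum())   # placeholder objective: fewer connections is "better"

population = [random_mask() for _ in range(50)]
for generation in range(100):
    scored = sorted(population, key=fitness, reverse=True)
    parents = scored[: len(scored) // 4]                     # keep the top quarter
    children = [mutate(parents[rng.integers(len(parents))])
                for _ in range(len(scored) - len(parents))]
    population = parents + children
```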

Nathan Labenz: (1:15:50) Cool. Yeah. That is really interesting. And, again, if that starts to work, like, the takeoff scenario potential there is really striking to me, because you now have, like, automated architecture finding; like, the world's your oyster. Right? So, before we get into additional strategies, what is your expectation for an AGI world if, for example, something like that does work? And now it's like, we can sort of brute force or, you know, evolve our way into... like, I mean, people have obviously very differing intuitions about this, where some people are like, that's a singularity that we can't see past.

Mike Knoop: (1:16:26) It could be good

Nathan Labenz: (1:16:27) or bad. Other people are like, well, the complexity of the world is so high still that

Mike Knoop: (1:16:31) I feel like the only thing that really works in AI is to, like, be empirical. That's, like, 1 of the key insights of the bitter lesson, right, that the only things that have sort of worked are, like, you know, these search and learning techniques. And this is much to the chagrin of researchers who really hoped that there'd be a key insight that they could apply to the system. You know, it turns out, like, it's better just to look at the results. Like, what systems are actually working, and keep doing more of what's working. And so I kinda feel like the near term of AGI, let's say we discover a solution to ARC, the first thing it's gonna be used for is gonna be very boring. It's gonna be used to go solve all of the exacting accuracy issues with the application layer. This is something that I know from Zapier, from deploying AI for the last year and a half: the biggest problem with the application layer in AI right now for users is user trust. And I don't mean that from, like, a data privacy standpoint or a training data standpoint. Surely, those, like, concerns do exist, but that's not what I mean in this case. What I mean is that people are trying to use AI in fully automated deployment scenarios using Zapier, and the consistency and accuracy and reliability is not high enough for them to deploy it with no oversight. And so instead, what they're sort of forced to do is either change the environment such that there is a human that will always oversee the system, which means now the system is still very limited by a human, or they just don't do it because the risk is too high. These are really boring risk scenarios too. This is actually an example that we actually had: we had an AI system on Zapier that was automatically deciding what channel to send a message to in Slack. And we had some channels on our Slack that were shared with partners, and it chose 1 of the partner channels and sent some user content to it, something that probably shouldn't have been shared outside the walls of Zapier. And as soon as we saw it, we're like, oh, wow, that's not good. Kind of a very embarrassing thing that we should fix. And what we ended up having to do is add deterministic overrides and controls on top of the AI system to say, hey, it's fine if it generates, like, the body of the message, but we never want it to just, like, guess what channel to send it to. We wanna clamp the behavior of, like, which channels it's allowed to send to and add these deterministic overrides on top of it. So there's all this, like, weird investment and effort that goes in because the underlying models themselves are just not fundamentally reliable in an exacting way. And if you solve ARC, what you will have discovered, or what you will have created, is a computer program that can generalize from a relatively arbitrary set of core knowledge priors and, with exacting accuracy, solve tasks that the system has never been trained on or exposed to at all. And so that, like, exacting accuracy on perfect generalization is gonna be the toolkit, probably a developer toolkit to start with, but a toolkit that people can use and bring into the application layer in order to get these systems to higher degrees of accuracy and reliability, so the user trust is higher, so that they can deploy them in more scenarios. There's kind of, like, concentric rings of use cases is how to think about this.
And right now, the innermost ring of use cases that people are actually using them for are just ones where the trust doesn't matter because the risk is so low. Think personal productivity use cases. Having an AI system summarize my inbox and email me is not very high risk if it goes wrong. Yet having it do something with customers or in production, those have much more catastrophic risks. The willingness to experiment and deploy those is pretty much nonexistent today. And I think that's directly what we're going to see. So I think the near term future for AGI development is actually really, really boring. Surprisingly boring, maybe. And I think it's going to give us a lot of opportunities to update based on what the systems can do once the technology exists. Then I think you just got to make decisions on policy from there. I'm not a person in the camp of... actually, I think it's a very poor way to make policy decisions, trying to think about hypothetical futures. And instead, you just got to look at history and say, what has happened? What can happen based on what we know today? And kind of work your way forward from there. So that's kind of my overall viewpoint, at least today, on how the advent of AGI is going to look. I think it's going to be a lot more boring and more practical than most people think and expect.
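The Slack incident above is a good illustration of the deterministic override pattern: let the model draft the message body, but clamp the channel choice against an allowlist before anything goes out. A minimal sketch; the function, channel names, and fallback behavior here are hypothetical, not Zapier's actual implementation:

```python
# Sketch: the model may propose a channel and a body, but the channel is checked
# against a fixed allowlist, and anything outside it is routed to a human instead.

ALLOWED_CHANNELS = {"#internal-ops", "#ai-experiments"}   # hypothetical channel names
FALLBACK_CHANNEL = "#human-review"                        # route anything suspicious here

def send_with_override(model_output: dict, post_message) -> None:
    channel = model_output.get("channel", "")
    body = model_output.get("body", "")
    if channel not in ALLOWED_CHANNELS:
        # Never trust the model's channel guess outside the allowlist.
        post_message(FALLBACK_CHANNEL,
                     f"[needs review] model proposed {channel!r}:\n{body}")
    else:
        post_message(channel, body)
```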

Nathan Labenz: (1:20:30) So it sounds like you see a sort of fundamental, I don't know if it's sameness or what exactly, but a connection between reliability and, like, out of domain generalization, that I think might be kind of counterintuitive or, like, not something everybody would agree on. Right? Like, typically, when you think of making things work, you think of, like, you know, dialing in the performance, more examples. Fine tuning on chain of thought is, like, my go to trick for any task that I really wanna dial in performance on. And I feel like I'm normally narrowing the scope of what the model can do when I really wanna make it work for a given task. Mhmm. And I'm, like, happy to accept that trade off, especially if I'm putting together a Zap and it's like, okay, I want this thing to be more reliable within this broader context. I don't have an intuition for why this sort of ability to solve a never seen before problem would be the thing that would unlock that reliability.

Mike Knoop: (1:21:31) It's because the core knowledge priors are arbitrary. That's why. So I think probably the nearest term solution to ARC takes the form of something that looks like a deep learning guided DSL generator plus a deep learning guided program synthesis engine. I think you put those 2 things together and you've got a really, really strong general form solution to ARC. And it's 1 that does not necessarily depend on the core knowledge priors that all the original tasks were built on. So think about it with these AI bots that we launched earlier this year. The way it works is users come to us and they have a text prompt. They give the bot access to auth keys for all their apps, or as many apps as they want. And then they give the bot rules to follow, in terms of telling it what they want the bot to do. Say, like, hey, I want you to look at my inbox every hour. And if I got an email from someone who's really important, I want you to summarize it and send it to me via Slack, or something like that. So the sort of failure points of it right now come from the user's desire to get that generality by expressing it through natural language. Right? And as the system breaks the function calling down step by step, I think the key problem you're running into is that the only time it's accurate enough to go all the way through the full chain is when there are enough examples in the training data it's seen to have high enough accuracy at each step in the process, so that the total accuracy is high enough that the user's willing to accept the trade offs around risk and trust in order to deploy it. And so, playing it out into the future, let's say you've got this hypothetical toolkit where you've got a program that can generalize from a set of arbitrary core knowledge priors to solve, with exacting accuracy, problems it's never seen before. The first natural thing you would do is give that toolkit to programmers or line of business users and say, hey, now in order to write a system that does something useful for you, you don't have to think about building the deterministic code, right? You don't have to think about the wrapper code. You don't have to think about the algorithm. You don't have to think about a series of steps. You don't have to think about a DSL. All of those things are kind of some of the hard parts about learning to program, right? It's like learning the syntax, learning the language, learning how to express it. None of that matters. Instead, you could program entirely through natural language just by giving the system a couple examples of what you want it to do. And it would be able to then generalize from those examples to deal with any of the sort of novelty or runtime situations that cause the existing systems we have today to sort of fail or break. Right? So, the example I mentioned before: the sort of summarize your inbox, send it to you in Slack setup. Let's say your former boss emails you. Okay, so how's that system gonna interpret that? If you didn't write a hard coded rule in the classic AI or classic Zapier land of old, it might do the wrong thing, might fall over, or at worst, do something that's catastrophically bad. And so that's kind of what you want out of these systems.
Systems that have much more ability to learn and generalize off of the set of core knowledge priors they're trained on: you want the sort of end user experience, I think, just to be like, I'm gonna give you a very small set of examples, and you're gonna do it with exacting generalization even when the input to the system is varying over time. And that's what we really don't have today. That's where the consistency really, really starts falling off: when you get these, like, small variations in the input that change the output pretty significantly.

Nathan Labenz: (1:24:47) There's a lot to untangle there.

Mike Knoop: (1:24:48) I probably do need to wrap up at this point.

Nathan Labenz: (1:24:50) Okay. Well, maybe I'll just stay on by myself for a few more minutes and talk through a couple things, in case people want a couple pointers. But I wish I had a better way to try to resolve some of these differences in intuition. Because a lot of times, I think, like, in an alternative scenario, if I had a person do this, how would they do? And, like, what would limit them, you know, in that scenario where, for example, my boss emails me, or my old boss or whatever. Right? And it's like, I think in a lot of those cases, it's not that the person... like, the person, in my experience, definitely still could come back with some failures or things that were not what I would have wanted them to do.

Mike Knoop: (1:25:29) I mean, by the way, first of all, what you probably want in that scenario is for the system to ask you. Yeah. Right? You'd probably want, like, hey, you didn't give me enough details here. I need more insight in order to answer this, and I'm gonna come ask you. This is something we've tried to build into our AI bots: the ability for them to self recognize when the input data has varied so divergently from existing past successes that it should get, like, raised up to the human to, like, intervene. But, like, the decision making system for even the decision of intervening is not great. So you still have the same, like, flaws and failure modes, and that leads to the same amount of, like, user mistrust at the end of it.

Nathan Labenz: (1:26:05) Yeah. I think you're absolutely right on those questions. I guess where I'm feeling this sort of inner confusion is around, like, what that person needs is not to be smarter. It's more context. Right? And so then you could say, okay, yeah, well, the sort of generalization that we want is, like, for it to ask. That's seemingly a pretty different sort of generalization from ARC, but maybe it could be the same. Like, you would have sort of a third... you know, you'd give, like, 2 examples of what to do, and then the third 1 would be, like, if confused, ask, and the hope would be that it would Right.

Mike Knoop: (1:26:40) Ask. Here's another way to think about it. So when we first started experimenting with, like, chain of thought stuff a couple years ago with GPT-3, 3.5, I think the common sort of accepted wisdom of chain of thought right now is like, hey, each step might get 80% accuracy. And so if you chain 3, 4, 5 steps together, you have to multiply all your accuracies together, and, oh, the overall accuracy of the system is just so garbage that, like, no 1 can use it for anything. And that was our first experience too. It was my first experience when I first started playing with, like, reasoning with language models and chain of thought stuff. It's like, oh, okay, well, this stuff's slow, super inaccurate. No one's ever gonna use it. Then we started this paradigm of these AI bots where, instead of trying to do a meta LLM prompt, where we tried to solve every user's use case through a single prompt that we prepended to everything, what we found we had to do in order to get accuracy high enough, at least at an individual user level, was give the end user full control over the prompt. We had to give them full control over steering their prompt for their LLM call and allow them to guide and change and update and tweak the prompt based on what they were observing it actually doing. So, hey, maybe you write your prompt initially. It's like, yeah, let's go back to the email. Every hour, wake up, look at my inbox, summarize emails from my boss, and send them to me in Slack. Maybe what you find is that, just writing that out for the first time, it's actually not doing it reliably. Maybe it's summarizing lots of emails that aren't from your boss. Then you realize, like, oh shoot, I never actually told it who my boss was. So I gotta go back to the prompt and update it to add a rule and say something like, my boss is this email address. And then you run it for a couple of days, and you realize your boss sent you something and it didn't do it. You have to go look and debug it. And you're like, okay, why didn't that work? And it's like, oh wait, my boss sent it from his personal email address instead of his work email address. So then you go back to your prompt and you have to say, or the name is this. You get into this feedback loop and developer loop with the system. Giving the end user direct control over the prompt and allowing them to make contact with reality, update and edit their prompt, and steer the local prompt has been the only way that we've been able to get our AI bots up to a high enough accuracy and reliability that people are willing to do anything, trust them with any use cases. And people are paying for these things now. There are people that have steered their prompts sufficiently that the chain of thought is good enough that it's valuable for them, even when the general form of chain of thought is still kind of garbage. Maybe that's what I'm getting at when I'm trying to talk about this form of generality of the systems we want: where it's like, our approach of just programming a super meta prompt that does lots of things, that doesn't work. That still doesn't have high enough reliability, especially when you're chaining them together. But if you give the user full control over the prompt and allow them to tweak and edit and steer it locally for their specific use case, specific task, you can get the reliability high enough there. And so what we really want, though, is the more general form solution, right?
Because we don't want users to have to sit there and constantly make all these same stupid realizations and tweaks that every other user's had to deal with. That's the only thing that kind of works. The sort of, I think, more magical AGI feature is when the system is smart enough, intelligent enough, to handle a lot of that itself. We'd probably train that more core LLM system on some core knowledge priors about what we see all of our users doing, and it would be able to then exactly generalize to more domains and more tasks instead of forcing end users to take on the effort and work themselves to solve that.

Nathan Labenz: (1:30:00) Well, I hope to have another episode with Wade, your cofounder, in the next couple months. We can get into that in a lot more detail then. I guess I always kind of assumed that that system, that, like, AGI future, will, like, read my whole inbox as opposed to being, like, a generalization on core knowledge. Like, you could imagine, and maybe it's convergence,

Mike Knoop: (1:30:21) but you could imagine sort of... I just described the weak form. I'm just noting, like, for the first person who solves ARC, that's what you'll be able to use the solution for. Probably worth saying, I do not think that a solution to ARC itself is, like, the real world AGI that we envision and sort of care about. Right? It's not gonna magically sort of make robots perfect. We're gonna need to scale up the solution, the insights from solving ARC, I think. At minimum, though, a solution is this more general form of being able to do exacting generalization from arbitrary core knowledge priors. That is going to be a really important, useful developer toolkit and no code toolkit for things like Zapier, where we're trying to allow non developers to use the software as well. But I think it's significantly more important than that; I would liken it more to, like, the discovery of a transformer style system, where it's gonna be the core genesis of a new tech branch that folks are gonna have to build on top of and figure out what are its limits, what are its abilities, how do we deploy this stuff into systems, how do we do it efficiently, how fast can it be. We're gonna have all the same set of questions, I think, once we have that solution, like we have had in the past with, like, transformer systems. Cool.

Nathan Labenz: (1:31:23) Well, you wanna break now and leave me to monologue my way through some pointers to papers? Yep.

Mike Knoop: (1:31:28) That's fun. I'm exhausted, so I'm gonna go get some lunch.

Nathan Labenz: (1:31:31) Alrighty. Cool. Well, Mike Knoop, founder of Zapier and creator of the ARC AGI Prize, thank you for being part of the Cognitive Revolution.

Mike Knoop: (1:31:39) It was fun. Thanks, David.

Nathan Labenz: (1:31:41) Thank you. Keep up the good work. Looking forward to seeing what comes out of it.

Mike Knoop: (1:31:47) Either way, it's gonna be interesting.

Nathan Labenz: (1:31:48) Yeah. No doubt. No doubt.

Mike Knoop: (1:31:50) If somebody beats ARC in 6 months, that's gonna be super fascinating. If someone doesn't, I think that's also gonna be super fascinating. I hope that somebody beats it personally, but, you know, we'll see.

Nathan Labenz: (1:31:59) Yeah. I feel like people can get there if the budget is high. But with the budget that you have, I think it'll be a real challenge, and it definitely would be a major breakthrough if somebody were to do it on that limited compute.

Mike Knoop: (1:32:13) Somebody should put it on the public leaderboard then. A hundred x, a thousand x more compute? Show it.

Nathan Labenz: (1:32:19) I imagine for a lot of the people that would like to make their name in doing it, the cost is probably a barrier. I don't know what, if anything, we could do about that. But the

Mike Knoop: (1:32:28) Sampling is the biggest way. Yeah. You just gotta try with 10 puzzles instead of doing all 500. Like, the $10,000 limit is for all 500, which is the 400 private plus the 100 verification set. So if you're only testing on 10 puzzles, or even, like, 3 puzzles that you randomly sample out of the set just to test and experiment with, you're in the order of only, like, 10, 20 bucks for, you know, a run. Still, like, not 0, but, like, again, I think most of the interesting things here are more on the algorithmic side than the perception side. So you can use cheaper models, open source models, to prove the ideas out and then see if they scale up. You can build confidence before you have to go do big deployment runs.
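A quick sketch of that sampling advice, with the cost per task as a knob you would set from your own measurements; the dollar figures in the conversation are ballpark, and the data directory path is just an example:

```python
# Sketch: randomly sample a handful of public ARC task files and print a rough
# cost estimate before committing to a full run. Cost per task is user-supplied.

import json
import random
from pathlib import Path

def sample_tasks(task_dir, k=10, cost_per_task_usd=20.0, seed=0):
    """Pick k random task files and print a rough cost estimate for evaluating them."""
    paths = sorted(Path(task_dir).glob("*.json"))
    random.seed(seed)
    chosen = random.sample(paths, k)
    print(f"Evaluating {k} of {len(paths)} tasks, rough cost ~${k * cost_per_task_usd:.0f}")
    return [json.loads(p.read_text()) for p in chosen]

# e.g. tasks = sample_tasks("ARC-AGI/data/evaluation", k=10)
```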

Nathan Labenz: (1:33:06) Cool. Okay. Those are good tips. I'm gonna hang on here and just run down the rest of this doc over 15 minutes or something. Bye for now. Alright. That was cool. So a little postscript on this episode, just to run down some of the things that we were planning to talk about but never made it to, because we kept digressing into different philosophical ideas instead of these concrete proposals. I think we can sort of intuit what his feelings would be on these proposals just based on all the things that we did cover. And I apologize, by the way, if the sound on this isn't perfect, because there is a tree being cut down in my neighborhood and ground up into mulch at the same time.

But okay. Here are some things that I think we should at least have in mind as we think about how we might wanna solve ARC. I don't really have time to do the programming on this right now, but I do have some of these ideas where I'm like, oh, that would be pretty interesting to go try. I'd be happy to kind of work with a team, or, you know, I'm not even gonna tell my ideas right now, you can just go run with them. But if you have something you wanna discuss, definitely feel free to ping me. I'm intrigued enough by this that I would be interested in seeing if I can contribute to a project for sure. So I'll put that on the list along with the other things that you're invited to DM me about.

So, possible solutions. Okay. First of all, DSPy. This just came up a little bit in my episode with Sander Schulhoff on the Prompt Report. We don't know if it's pronounced D-S-pie or possibly "dispy," but this is sort of a framework for optimizing natural language programs, I think is how they phrase it. And notably, it beat Sander on a head-to-head challenge where DSPy was trying to optimize a prompt and Sander was trying to optimize a prompt. DSPy ended up winning, and it sounds like it was by a not insignificant margin, based on the conversation that I had with him. So DSPy is getting rave reviews across the community for optimizing natural language programs. Maybe the first thing I would do is try using DSPy on Ryan Greenblatt from Redwood Research's solution and just see how much juice DSPy manages to get out of whatever black magic it does. It's an interesting thing because I've been trying to get the creator of DSPy on for an episode and haven't got a response yet. But why does this work is an interesting question. Why should language models be better at optimizing natural language programs than people are? This is a strange result and does sort of suggest some form of generalization, which I would not expect to be purely memorization out of the dataset. But, yeah, it gets rave reviews. It seems to work really well on a lot of things. It beat Sander Schulhoff on a head-to-head man-versus-machine prompting challenge. So apply DSPy to probably any natural language program that you cook up for ARC, and you'll have a good chance, I would say, of getting some pretty good results from that.
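
(A minimal sketch of what a DSPy setup for the rule-guessing step might look like, purely as an illustration: the class and optimizer names follow DSPy's 2.x class-based Signature API, which varies across versions, and the training example and metric below are placeholder stand-ins, not anything from the episode.)

```python
import dspy
from dspy.teleprompt import BootstrapFewShot

# Hypothetical configuration; the model name and LM wrapper vary by DSPy version.
dspy.settings.configure(lm=dspy.OpenAI(model="gpt-4o-mini", max_tokens=512))

class GuessRule(dspy.Signature):
    """Given ARC demonstration pairs rendered as text, describe the transformation rule."""
    demonstrations = dspy.InputField(desc="input/output grids for one task, as text")
    rule = dspy.OutputField(desc="concise natural-language description of the rule")

guesser = dspy.ChainOfThought(GuessRule)

# Placeholder trainset: a few tasks where the rule has been written out by hand.
trainset = [
    dspy.Example(
        demonstrations="in: [[1,0],[0,1]] -> out: [[0,1],[1,0]]",
        rule="swap the two colors everywhere in the grid",
    ).with_inputs("demonstrations"),
]

# Placeholder metric. In practice you would score whether a program generated
# from the guessed rule actually reproduces the training outputs.
def rule_metric(gold, pred, trace=None):
    return len(set(gold.rule.lower().split()) & set(pred.rule.lower().split())) >= 3

compiled_guesser = BootstrapFewShot(metric=rule_metric).compile(guesser, trainset=trainset)
pred = compiled_guesser(demonstrations="in: [[2,2],[0,0]] -> out: [[0,0],[2,2]]")
print(pred.rule)
```
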
Another direction, based on a previous episode, is potentially using state space models for images with multi-way scans. And here, I think you could do some pretty interesting things with tokenization as well. In the Mambapalooza episode, we ran down a bunch of different techniques for adapting state space models, and the Mamba architecture in particular, to the image domain. Images, of course, are not sequential. It's a 2D object, and the ARC challenges are also a 2D object. So you have to present them to the state space model in sequential form. How do you do that? There have been a bunch of different ways that people have experimented with, but a lot of times they involve turning the image into a patched sequence multiple different ways, like from one corner to the other corner and in reverse. There are 4-way scans, there are 6-way scans, all these different kinds of scans. Each one then becomes a sequence that gets fed in. You can have multiple states internally that handle these different representations, and then they all kind of merge their predictions back into the next token. I feel like that is interesting in that it's a different angle on this problem. Also, the state space models are significantly more efficient. If you're trying to do buildup of state or some sort of long prompt, you can potentially get away a lot more cheaply with certain state space hacks. But something about that multi-way scan feels pretty interesting to me. I also would imagine doing multiple different forms of tokenization, where you might not just look at the individual squares within the grid but at multi-resolution views as well. You could look at one square and its surrounding squares as a single patch. A lot of times in the convolutional context, you'll have one view that essentially zooms out to the maximum possible extent. But, yeah, something with multi-way scans in state space models feels like it could be a pretty interesting approach.
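
(To make the multi-scan idea concrete, here is a small NumPy sketch that serializes a 2D ARC-style grid into several 1-D sequences, one per scan direction, plus a coarse neighborhood-patch view. This is only the serialization step you would feed into a sequence model, not a state space model itself, and the CRC-based patch token is a made-up placeholder for whatever tokenizer you actually use.)

```python
import zlib
import numpy as np

def multi_scan_sequences(grid: np.ndarray) -> list[np.ndarray]:
    """Serialize a 2-D grid into 1-D sequences, one per scan direction,
    in the spirit of the multi-way scans used to feed images to Mamba-style models."""
    return [
        grid.flatten(),          # row-major: top-left to bottom-right
        grid.flatten()[::-1],    # reversed row-major
        grid.T.flatten(),        # column-major
        grid.T.flatten()[::-1],  # reversed column-major
    ]

def neighborhood_tokens(grid: np.ndarray) -> np.ndarray:
    """A coarser multi-resolution view: each cell plus its 3x3 neighborhood
    becomes one token (hashed to an integer purely for illustration)."""
    padded = np.pad(grid, 1, constant_values=-1)
    h, w = grid.shape
    tokens = np.empty((h, w), dtype=np.int64)
    for i in range(h):
        for j in range(w):
            tokens[i, j] = zlib.crc32(padded[i:i + 3, j:j + 3].tobytes())
    return tokens.flatten()

if __name__ == "__main__":
    g = np.arange(9).reshape(3, 3)
    for seq in multi_scan_sequences(g):
        print(seq)
    print(neighborhood_tokens(g))
```
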
Another thing I'd consider is really heavily fine-tuning small models for each of the different tasks. Going back to the breakdown of the four steps, perception and identification of salient features, then guessing the rule, then writing the program, and then figuring out why you're wrong if you're wrong and iterating from there, it seems to me like different specialized models might be the way to go. And notably, in the existence proof of human AGI that we talked about, it is a bunch of different modules. We have a visual cortex that has something like a convolutional network going on. We have an episode with the creator of HippoRAG in process; the hippocampal indexing theory of human memory says that these associations are stored in the hippocampus, so when something comes to attention and it shares a feature with something else, that's mediated through this sort of abstracted connection by features through the hippocampus. I don't know that we'll have anything like that in the ARC solution, but just breaking the problem down and having different specialized models for each step seems pretty promising. I would maybe think about something like a Microsoft Phi for rule guessing. I think their latest and biggest ones might not fit, so you might have to keep it to a small Phi to actually work within the resource constraints. But fine-tuning something on guessing the rule seems like a distinct challenge as opposed to writing the actual code. So to have those be the same model, I'm not sure that's appropriate. And certainly, we see this kind of specialization within the human brain.

A couple other projects I think about a lot, and would suggest looking at, come out of Google DeepMind. One is called AlphaGeometry. Here there's a mix, and this, I think, is very much in the spirit of what Mike and Francois are expecting to be the solve. It's a mix between a language model and a symbolic deduction engine, where the goal is to solve International Math Olympiad geometry problems, and they do it at basically a gold medal level. So not superhuman, but the very top end of human performance. And they do this with a sort of two-part cycle where the symbolic deduction engine generates a ton of possible moves, and their goal is to prove something. They make all these moves in symbolic geometry space and see if they can get to the endpoint. If they can't, they ask the language model to give another idea. So the language model in that context is serving as a sort of intuition driver, but then that intuition is hammered on by the symbolic deduction engine. They make an analogy to thinking fast and slow, System 1 and System 2, and there's been a lot of System 1, System 2 discourse around the ARC challenge. I think that shape of a solution definitely could be quite interesting. It does also start to feel like cheating, in the sense that they trained the system on a lot of geometry moves. They synthesized a huge number of sample data points, 100,000,000 synthetic examples of moves that the symbolic engine could make. So that's kind of out of the spirit of the ARC Prize. If you had to do that much, you'd be within the rules perhaps, but a little bit outside the spirit. And as much as they've said somebody probably could generate all the puzzles, train on all the puzzles, and effectively brute force their way to a learned solution, they wouldn't really consider that to be what they're looking for. But, you know, it might work. So, anyway, a two-part solution there, where you've got one language model that surfaces intuition for something that's much more symbolic and deterministic, thinking fast and slow, System 1, System 2, whatever.
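
(A toy version of that propose-and-verify loop, transplanted from geometry onto ARC-style grids: a brute-force "System 2" searcher over compositions of grid transformations, with a stub standing in where AlphaGeometry calls a language model for fresh suggestions. The primitives and the stub are placeholders of my own, not anything from the paper.)

```python
import itertools
import numpy as np

# A tiny library of deterministic "symbolic moves" over grids.
PRIMITIVES = {
    "identity":  lambda g: g,
    "rot90":     np.rot90,
    "flip_h":    np.fliplr,
    "flip_v":    np.flipud,
    "transpose": lambda g: g.T,
}

def symbolic_search(pairs, allowed, max_depth=3):
    """'System 2': brute-force compositions of the allowed primitives and return
    the first program that maps every demonstration input to its output."""
    for depth in range(1, max_depth + 1):
        for program in itertools.product(allowed, repeat=depth):
            def run(g, prog=program):
                for name in prog:
                    g = PRIMITIVES[name](g)
                return g
            if all(np.array_equal(run(x), y) for x, y in pairs):
                return program
    return None

def intuition(attempt):
    """'System 1': in AlphaGeometry this is a language model suggesting a new
    construction when the search stalls. Here it just widens the move set."""
    suggestions = [["rot90", "flip_h"],
                   ["rot90", "flip_h", "flip_v"],
                   list(PRIMITIVES)]
    return suggestions[min(attempt, len(suggestions) - 1)]

def solve(pairs, max_attempts=3):
    for attempt in range(max_attempts):
        program = symbolic_search(pairs, intuition(attempt))
        if program is not None:
            return program
    return None

if __name__ == "__main__":
    x = np.array([[1, 2], [3, 4]])
    demo_pairs = [(x, np.flipud(x.T))]   # i.e., transpose then flip vertically
    print(solve(demo_pairs))
```
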
You could also integrate these things. There's another really interesting paper recently called Transformers Meet Neural Algorithmic Reasoners, and this one brings those two parts of the hybrid together with a cross-attention mechanism. I believe this one is also out of DeepMind. First of all, a neural algorithmic reasoner is a specialized graph model that can learn to do certain classical algorithms. It can learn to sort lists, for example, and it can do this in a way that generalizes well beyond its training data, at least in terms of the size of the input. You may train it on a certain list size and then test it on another list size, and you find, oh, it can actually generalize and perform this algorithm. In a sense, you might say it has grokked what the algorithm is, at least to the degree that it can do this outside of the distribution it was trained on. Now, in Transformers Meet Neural Algorithmic Reasoners, the problem has to be specified both in natural language and in a symbolic graph form; otherwise it won't work. So that's a constraint you'd have to figure out: how do you map the ARC problems onto a form that would actually work for a neural algorithmic reasoner? But then what's interesting is you give the text part to the transformer part of this thing, and you give the graph form to the graph network, and then there's this cross attention downstream through which the transformer is able to get more information from how the neural algorithmic reasoner is processing the data. This starts to feel more like the human brain, where you have these different modules that do quite different things, but then they communicate, and their communication is not super clean, but at least can be used to inform what the other module is gonna do. That seems to me, again, very much in line with what Mike was saying he expects, and maybe hopes, somebody might come up with to solve ARC.

Another one is called FunSearch. I think I've talked about this on a couple different episodes, but this is an evolutionary method powered by language models, where they take some really far-out Byzantine math problems that had been unsolved, or where the state of the art had not advanced in a long time, and they essentially generate programs to try to solve them. They run the programs, they get a score for how well each program did, and then they keep a database of all the programs so they can look at the top-scoring programs they have and cycle through them. They use an interesting technique to make sure that they're not just generating the same thing over and over again; they kind of have to push the language model to do something different. You can go read about their technique in more detail to see how they're doing that, but they do have a problem of getting stuck close to a local minimum and needing to push away from it. So, given a couple of the best examples at any given time, generate a new example, and gradually evolve your way toward a high score. That literally solved, or achieved a new state of the art on, some notorious math problems. So that's FunSearch.

The three papers, again, just to give them as names, or I guess I'll do the whole list: DSPy for optimization of natural language programs; state space models, the Mamba architecture with multi-way scans; fine-tuning different modules, small models specialized in the different parts of the task; AlphaGeometry, the DeepMind paper that has the kind of tick-tock relationship between a symbolic engine and a language model; the one that brings them closer together with cross attention, called Transformers Meet Neural Algorithmic Reasoners; and FunSearch, the evolutionary one where they keep generating programs and then use the best programs generated to date to try to inspire new and better programs.
I guess maybe the final one I'll mention is the concept of a KAN. This is a recent new architecture out of the Tegmark group; I've been hoping to get a full episode together on this but haven't done it yet. It stands for Kolmogorov-Arnold Networks, and it is definitely worth looking at just for general inspiration. They're replacing the multilayer perceptron with a kind of inversion, almost a funhouse-mirror image of the multilayer perceptron. In the multilayer perceptron, the learned weights are the edges between nodes, and what you're learning is how strong the signal should be from one node to another node. But then you have the activation function at all of these nodes, and typically that activation function is the same. Whatever it is, it's the same function being applied to the values that are aggregated at each node. The difference with the KAN architecture is that they are actually making the edges learnable functions. So you have, in a sense, a higher degree of freedom on the edge: it's not just a single number, it's a function of the input. And then on the actual nodes themselves, instead of running an activation function there, they're just summing the values that come from each of the edges. So instead of learning a fixed weight and applying a fixed activation function, you are learning a function on each edge and just summing at the nodes. If that's not clear, definitely check out the diagram in the Kolmogorov-Arnold Networks paper. This is early research. It's an alternative to the multilayer perceptron, but it feels conceptually profound, definitely an eye opener that, wow, you can do something this way. It doesn't seem like it's general enough yet, or has been scaled up enough yet, to solve a problem like ARC straight away; it would have to be the sort of thing that inspires you to come up with a new twist on it. But I do think there is something quite interesting about learning functions as opposed to just learning numbers, because the big upside of learning these functions is that they become much more composable and also more interpretable, which is a notable highlight. They target this architecture toward AI for science, where you have natural laws that are, in many cases, relatively simple formulas, and they hope they can discover natural laws by learning from data what composable function the data reflects. And so, I don't know, I do think there's something quite interesting there about whether these functions could be transformations of some sort on the ARC data. Maybe instead of a single-variable function like they have, it could be a matrix transformation that gets learned. Being able to dynamically compose transformations is a big part of what is ultimately involved in solving a lot of these ARC puzzles, because that's essentially what the program you're writing is doing: it's dynamically composing multiple different transformations into a single little algorithm. So if you could create a network that could learn a bunch of those different transformations and then dynamically compose them at runtime, it seems like you might have something quite interesting there.
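
(A minimal illustration of that edges-as-functions idea: a toy Kolmogorov-Arnold-style layer where every edge carries its own small learnable 1-D function, here a sum of Gaussian bumps rather than the splines the paper uses, and each output node just sums its incoming edges. Forward pass only; training would update the per-edge coefficients.)

```python
import numpy as np

class KANEdge:
    """One edge: instead of a single scalar weight, a learnable 1-D function,
    represented here as a weighted sum of fixed Gaussian basis bumps."""
    def __init__(self, n_basis=8, seed=0):
        rng = np.random.default_rng(seed)
        self.centers = np.linspace(-1.0, 1.0, n_basis)  # fixed basis locations
        self.coeffs = rng.normal(0.0, 0.1, n_basis)     # learnable coefficients

    def __call__(self, x: float) -> float:
        basis = np.exp(-((x - self.centers) ** 2) / 0.1)
        return float(basis @ self.coeffs)

class KANLayer:
    """n_in inputs -> n_out outputs. Each (input, output) pair has its own edge
    function; each output node simply sums its incoming edge values, with no
    fixed activation, which inverts the usual MLP layout."""
    def __init__(self, n_in, n_out):
        self.n_out = n_out
        self.edges = [[KANEdge(seed=i * n_out + j) for j in range(n_out)]
                      for i in range(n_in)]

    def __call__(self, x: np.ndarray) -> np.ndarray:
        return np.array([sum(self.edges[i][j](x[i]) for i in range(len(x)))
                         for j in range(self.n_out)])

if __name__ == "__main__":
    layer = KANLayer(n_in=4, n_out=2)
    print(layer(np.array([0.1, -0.3, 0.5, 0.9])))
```
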
And then finally, I would say, especially because he noted that the current private leaderboard entries are doing test-time fine-tuning, don't underestimate grokking for this. There's something quite interesting there. You might be able to do something really small, and, I don't know, would this satisfy the rules? First of all, would it work? But also, would it satisfy the rules? Would it be in the spirit of a search for AGI? You could certainly debate that. But just taking a really simple network and trying to get it to grok a certain transformation: we've seen that grokking, over the course of something like a million training steps, can learn modular addition. So could it grok one of these patterns in a reasonable time frame? Maybe. It's unclear, but it seems like at least the sort of thing that would be worth exploring.

If nothing else, all those papers are certainly inspiration for me, things that I keep coming back to and very much worthy of study in their own right. But, hopefully, if you're interested in taking a swing at the ARC challenge, some of those will be useful pointers, useful sources of inspiration for you. And if you do find any inspiration in that, or you just wanna chat with me or bounce some ideas off one another, please ping me, and if there is anything I can do to be helpful, I certainly will. With that, thanks for listening to this little PS, and thank you all for being part of the Cognitive Revolution. It is both energizing and enlightening to hear why people listen and learn what they value about the show. So please don't hesitate to reach out via email at tcr@turpentine.co, or you can DM me on the social media platform of your choice.
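
(One last sketch, as a footnote to the grokking suggestion above: the classic modular-addition grokking setup, which is a cheap way to get a feel for the phenomenon before trying anything ARC-shaped. This assumes PyTorch; the architecture and hyperparameters are illustrative rather than tuned, and whether and when it actually groks depends heavily on the weight decay, the train/test split, and how long you let it run.)

```python
import torch
import torch.nn as nn

# Toy grokking setup: learn (a + b) mod P from half of all pairs, train far past
# the point where training loss is near zero, and watch held-out accuracy.
P = 97
pairs = torch.cartesian_prod(torch.arange(P), torch.arange(P))   # all (a, b)
labels = (pairs[:, 0] + pairs[:, 1]) % P
perm = torch.randperm(len(pairs))
train_idx, test_idx = perm[: len(pairs) // 2], perm[len(pairs) // 2:]

model = nn.Sequential(
    nn.Embedding(P, 64),          # embeds each of the two operands
    nn.Flatten(),                 # (batch, 2, 64) -> (batch, 128)
    nn.Linear(128, 256), nn.ReLU(),
    nn.Linear(256, P),
)
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1.0)
loss_fn = nn.CrossEntropyLoss()

for step in range(100_000):       # grokking typically needs very long runs
    idx = train_idx[torch.randint(len(train_idx), (512,))]
    loss = loss_fn(model(pairs[idx]), labels[idx])
    opt.zero_grad()
    loss.backward()
    opt.step()
    if step % 5_000 == 0:
        with torch.no_grad():
            acc = (model(pairs[test_idx]).argmax(-1) == labels[test_idx]).float().mean()
        print(f"step {step:6d}  train loss {loss.item():.3f}  held-out acc {acc:.3f}")
```
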
