Imbue CTO Josh Albrecht on Creating AI Agents for Reasoning, Reliability, and Robustness

Show Notes

In this episode, Nathan chats with Josh Albrecht, CTO of Imbue. They discuss how to create agents for reasoning, reliability, and robustness. If you need an ecommerce platform, check out our sponsor Shopify: https://shopify.com/cognitive for a $1/month trial period.

RECOMMENDED PODCAST:
Every week investor and writer of the popular newsletter The Diff, Byrne Hobart, and co-host Erik Torenberg discuss today’s major inflection points in technology, business, and markets – and help listeners build a diversified portfolio of trends and ideas for the future.

Subscribe to “The Riff” with Byrne Hobart and Erik Torenberg: https://www.youtube.com/@TheRiffPodcast

SPONSORS:
Shopify is the global commerce platform that helps you sell at every stage of your business. Shopify powers 10% of ALL eCommerce in the US. And Shopify's the global force behind Allbirds, Rothy's, and Brooklinen, and 1,000,000s of other entrepreneurs across 175 countries. From their all-in-one e-commerce platform, to their in-person POS system – wherever and whatever you're selling, Shopify's got you covered. With free Shopify Magic, sell more with less effort by whipping up captivating content that converts – from blog posts to product descriptions using AI. Sign up for $1/month trial period: https://shopify.com/cognitive

With the onset of AI, it’s time to upgrade to the next generation of the cloud: Oracle Cloud Infrastructure. OCI is a single platform for your infrastructure, database, application development, and AI needs. Train ML models on the cloud’s highest performing NVIDIA GPU clusters.
Do more and spend less like Uber, 8x8, and Databricks Mosaic. Take a FREE test drive of OCI at https://oracle.com/cognitive

NetSuite has 25 years of providing financial software for all your business needs. More than 36,000 businesses have already upgraded to NetSuite by Oracle, gaining visibility and control over their financials, inventory, HR, eCommerce, and more. If you're looking for an ERP platform ✅ head to NetSuite: http://netsuite.com/cognitive and download your own customized KPI checklist.

Omneky is an omnichannel creative generation platform that lets you launch hundreds of thousands of ad iterations that actually work, customized across all platforms, with a click of a button. Omneky combines generative AI and real-time advertising data. Mention "Cog Rev" for 10% off.

X/SOCIAL
@labenz (Nathan)
@joshalbrecht (Josh)
@eriktorenberg (Erik)
@CogRev_Podcast

TIMESTAMPS:
(00:00:00) – Episode Preview
(00:07:14) – What does it mean to be a research company?
(00:10:25) – How is the reasoning landscape these days and how might it evolve?
(00:11:03) – Data quality is highly important
(00:21:15) – What’s the difference between good features and a good world model?
(00:27:31) – The impact of new modalities on reasoning
(00:29:15) – How much can reasoning and knowledge be separated?
(00:45:13) – Imbue demo and are they building their own LLMs or using others?
(00:49:37) – Does Imbue have a deal with Nvidia?
(00:57:48) – Carbs framework
(01:12:57) – Imbue’s involvement with policy and AI safety
(01:16:23) – Takeaways from AI Safety Summit and Biden’s Order

Music licenses:
UOFBEQSHYSFUVKHK
ED1H4GFE4CZDFJM7
TPVHNLERNDL4KRIA



Full Transcript


Josh Albrecht: (0:00) And so over the next year or two, I think we are going to start to see systems that are a lot better at doing longer term things. The longer term actions need to be right. If you're right 80% of the time and you do 10 things in a row, you're actually fairly likely to fail. Right? And so you have to get that up to 90, 99, 99.9 if you really want to be taking much longer sequences of actions. But I do think that we're going to start to see a lot of that performance happening over the next year or two. And that's going to be a pretty interesting, weird world where these things are actually working kind of like we would expect as people. We're not using reinforcement learning to learn this. Right? That's not how you get a PhD. You don't try getting a PhD 10,000 times, and then you finally get it and say, oh, I guess I should do more of that to get my PhD. That's not at all how we do almost everything. Right? We're mostly planning. We're mostly thinking and anticipating and using this kind of logical reasoning stuff. And so that's why we've shifted our focus towards those types of tasks, towards coding tasks, reasoning tasks, tasks in your browser, desktop, where the planning piece is there. There's a lot of complexity in the real world. Right? You think about Stripe or something, it's like, how can there be so many people working at Stripe? All you're doing is paying for a thing online. How hard can that be? Turns out really hard. Turns out there's a lot of details to that kind of stuff. It turns out everything is like that. And so if we have a system that can more automatically break these things down and actually start solving these problems and putting them back together properly again, I think it's going to look broad strokes kind of similar, but in a sense, it'll be quite different because this can happen dynamically. This can change over time. You might be able to come back to the system and say, we're using this language model here, but it's doing something stupid. We're just doing addition. Let's just call Wolfram Alpha. Let's just use a calculator. Okay, great. Now it's a lot faster. And so once this is more dynamic, it's going to be more evolving. It's going to be able to optimize and continually improve in a way that's much harder for a self-driving car system that's been made by whole huge teams of people.

Nathan Labenz: (1:48) Hello, and welcome to the Cognitive Revolution, where we interview visionary researchers, entrepreneurs, and builders working on the frontier of artificial intelligence. Each week, we'll explore their revolutionary ideas and together we'll build a picture of how AI technology will transform work, life, and society in the coming years. I'm Nathan Labenz, joined by my co-host Erik Torenberg. Hello and welcome back to the Cognitive Revolution. Today, my guest is Josh Albrecht, founder and CTO of Imbue, a research company dedicated to building practical AI agents that can accomplish larger goals and safely work for us in the real world, which, despite being pre-product, recently raised $200 million from investors, including NVIDIA, a significant part of which will go toward a cluster of 10,000 H100 GPUs. Imbue is a fascinating company that I honestly struggled to make sense of at first. I did my usual prep. I read through their research papers and their public writing, and I also listened to a couple of recent interviews. But still, I came away without a coherent sense of the company as a whole. In this conversation, we cover the company's diverse outputs, which range from virtual world simulators for reinforcement learning to a cost-aware hyperparameter optimizer to theoretical research papers. And there are a lot of great nuggets in here, including a few moments where Josh challenges some of my assumptions. But it was only on listening back to this conversation and really trying to zoom out from the details that I feel like I began to understand the Imbue thesis from an investor perspective. And to be clear, this may not be quite how Imbue sees themselves. But after taking it all in and chewing on it for a while, I understand Imbue as one of a small but growing class of companies, which could prove extremely important depending on how a few key questions in AI end up being answered. One possible path for AI development, and arguably still the most likely, is that raw compute scale will be required to push the state of the art forward and that today's leading model developers, OpenAI, DeepMind, Anthropic, and Meta, will continue to lead the market. But another possibility is that scale won't be all we need, that gains from ever more pretraining become uneconomical, and that the way to make agents robust, reliable, and restrained enough to be trusted with meaningful work will be unlocked by a combination of painstaking fine-tuning and carefully engineered complementary systems, including better retrieval, supervision, and novel user interfaces. It's in this latter scenario that Imbue seems to me most likely to pay off for investors in a major way because it might take a series of eureka moments to make things work. And such progress does often come from small but highly aligned teams like Imbue's, led by dynamic visionary founders like Josh and his partner, Kanjun, who create highly intentional and opinionated culture, follow their own unique research intuitions wherever they may take them, and intensively dogfood their own products, even in the early days when they don't really yet work. In the end, I came away from this conversation not only super impressed with Josh, but convinced that Imbue has all the right ingredients to achieve something truly unique and special. And so they will definitely be a company that I'll continue to watch.
As always, if you're enjoying the show, we appreciate your reviews on Apple Podcasts or Spotify, and we love to see folks sharing the show with their friends online. Now I hope you enjoy this conversation with Josh Albrecht of Imbue. Let's get into it. Josh Albrecht, founder and CTO of Imbue. Welcome to the Cognitive Revolution.

Josh Albrecht: (5:28) Excellent.

Nathan Labenz: (5:29) So super excited to talk to you. Your company, some may know it by its former name of Generally Intelligent, now Imbue, fresh off a major $200 million fundraise, is focused on developing AI agents that are capable of reasoning and can help us accomplish larger tasks in the world. So that's no small undertaking. For starters, I'd love to just hear a little bit about Imbue as you've been developing it and as it exists today. And then I want to get into, how do we make this reasoning stuff work?

Josh Albrecht: (6:06) Yeah. I mean, we started Imbue a few years ago now, basically seeing all the self-supervised learning stuff actually working and realizing, we've been working with traditional machine learning for a long time. I did my master's in research way back before deep learning was even a thing, back when support vector machines were a really cool thing. I've always been watching the field thinking, okay, is now the time to get back into more of the research side and work on the larger, more interesting problems in AI? And eventually, I think in 2019, it got to the point where it's like, okay, I can really see how we can move away from these supervised systems that require just huge teams of human labelers. It's really just people putting the right answer in there and moving to systems that are really able to learn very interesting patterns completely on their own. And I can see this happening for vision and audio and text and all these different modalities, and I can really see how we can start to put these things together to make much bigger, much more capable systems. And so that's when we decided to take a step back from what we were currently doing, figure out how to make a research company that can actually tackle these much larger problems.

Nathan Labenz: (7:08) That term research company, very interesting unto itself. What does it mean to be a research company to you guys? And notably, as far as I can tell, you have published research, but no products yet. Right? So how are you thinking of your evolution from research to, presumably, a product offering company at some point?

Josh Albrecht: (7:29) Yeah. Yeah. So that's why we've been pretty intentional since the very beginning about calling it a research company. So not a research lab. We don't only do research. We're not a purely academic, not a purely nonprofit, not only about science. And we're not only a company. We're not just a startup. We're not just trying to make a product. It is kind of a mix of the two things. And part of the reason for that is that the stuff that we're trying to do, we're trying to make these agents that can actually reason and think and be intelligent, make computers that can really do what we want them to do. That is an open research problem. So there is a lot of stuff to figure out. But as we've gone further and further, we've gotten closer and closer to being able to make things that are just actually really useful today. We can make products today, us and other startups and other people building on top of existing APIs, et cetera. We are starting to see a lot of stuff that's really useful. And for us, with this latest raise, we are actually developing things that will become our actual product. And then, hopefully not too distant future, probably sometime next year, we'll have more to say about that, and we are working on that. But it is still a research company because there is still a connection between the kind of questions that we need to ask and the product that we're trying to build. There's a lot of open questions about how do you want to interact with these AI agent systems. If you have something that goes off and takes actions on your behalf for days doing this really complicated thing, spending hundreds or thousands of dollars on your behalf, you really want to know what it's doing. And you don't want to see it in a big text blob. You don't want to have just a chat interface with this thing. Right? There might be other types of modalities, other ways you want to interact. So there's UI questions. There's also tons and tons of questions on how do you make it not annoying to interact with? You don't want to have to give it all these instructions to do this thing. How do you get it to generalize, but in a way that's making sure it doesn't drift too far away from what you want it to do? So there's tons and tons of open questions, and really, for us, we're thinking about the product. And right now, we're really focused on building tools for ourselves, so that we can experience the pain of using these things and make them better and better and make sure this stuff is actually working until we get to a place where it feels really good as a user of these kind of systems. Using these APIs and language models to build bigger systems is sort of an exercise in frustration. Right? You're trying to make new prompts. You're trying to put these things together. They don't quite work. They go off the rails. They sometimes work. You change it. It seems like it works better, but it's expensive to run the evaluation. It's just kind of annoying to interact with. So we want to make tools that make it much easier, much more pleasant for not just developers, but eventually other less technical users to build their own things with AI as well.

Nathan Labenz: (9:50) Maybe you could help me survey the reasoning landscape a little bit better. I mean, it seems like today, at least of everything I've tried, GPT-4 is pretty clearly the best reasoning language model that's out there. Maybe you could speculate on how they've managed to get as far as they have, what techniques you guys are finding to be most successful and promising, and how good is this reasoning going to get over the next, say, year, beyond which my crystal ball is totally dark anyway.

Josh Albrecht: (10:25) Yeah. Exactly. Yeah. That's a good question. Certainly don't know everything that OpenAI is doing. I think one thing that we do know is that they are definitely using a lot of human data. So in machine learning, I think a lot of people, we see these language models. They look so awesome. They look so impressive. Oh, maybe this totally new thing is happening. It's still garbage in, garbage out. The data quality still matters a huge amount. And so there's a bunch of things you can do on the data quality. I'm sure they have a bunch of teams that are fixing up the Common Crawl and the public web stuff that they have, probably some books and other higher quality data and code and all this other stuff. Right? And so I'm sure they're doing a good job with that. They also have, I'm pretty sure, a very large team of human labelers that work on making back-and-forth dialogues specifically for some of these use cases like code. And so I'm pretty sure they have a very large team of people where it's like a person on one end asking something about code, a person on the other end answering it and writing out a nice thing, and they're generating, I think, a lot of data like this. And so that kind of data, I think, makes it look like these systems are performing really well because you're asking a question that's kind of in-distribution. It's seeing someone ask, how do I invert a matrix using NumPy or whatever. Right? That I think is probably getting them pretty far. Also, if you'll notice, when you ask GPT-4 a question, it kind of starts out with a sort of, let's think step by step, rolling things out. And these kinds of techniques are baked into the training data a little bit now for them as well to help guide the system to a better answer. Part of that just happens naturally as a side effect of RLHF. Part of that is probably intentional on their side, of choosing how to launch it in this trajectory that makes it more likely to be right. I'm sure they have a bunch of other small hacks and tricks like that as well, maybe deciding which piece of the model to use. I think there were some questions before about, is this different experts? Maybe you route to a particular expert for certain queries. There's probably other things that go into it, but I think probably the biggest ones are the data quality, and using these techniques to help make the language model make a more reasonable response. So those are really good, and they're a great place to start. I think in the academic literature, people have found even other things that are interesting besides just chain of thought. You can also have graph of thought or tree of thought or these other types of techniques. And you can sample, you can do kind of consistency approaches where you do a whole bunch of things and you check, okay, how frequently does it get the same answer? Maybe that's more likely to be right. So there's lots and lots of ideas that people have had, and I think those are the kinds of things that we're pretty interested to explore as well as other ones about how can you spend more compute at inference time? How can you do a lot more work to get to a slightly better answer? So for us thinking about agents, we're usually thinking about things that operate on a longer time scale. As a chat application, you really don't, as a user, want to be sitting there waiting for it to get a 20% better answer by taking a really long time. But if the system is working for you overnight, you don't care how long it takes.
Right? And so those kinds of techniques, I think, can allow us to push the frontier of these reasoning systems to end up getting things that are much more likely to be accurate and much more calibrated. And so over the next year or two, I think we are going to start to see systems that are a lot better at doing longer term things. The longer term actions need to be right. If you're right 80% of the time and you do 10 things in a row, you're actually fairly likely to fail. Right? And so you have to get that up to 90, 99, 99.9 if you really want to be taking much longer sequences of actions. But I do think that we're going to start to see a lot of that performance happening over the next year or two. And that's going to be a pretty interesting, weird world where these things are actually working kind of like we would expect as people. Hey, we'll continue our interview in a moment after a word from our sponsors.
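
A quick sketch of the reliability math Josh walks through here: if each step of a task succeeds independently with probability p, a 10-step sequence succeeds with probability p to the 10th, which is why per-step accuracy has to climb toward 99%+ before long action sequences become trustworthy. (Minimal illustrative Python, not anything from Imbue's codebase.)

```python
# Compounding per-step reliability over a 10-step task, as discussed above.
for p in (0.80, 0.90, 0.99, 0.999):
    print(f"per-step accuracy {p:.1%} -> 10-step success {p**10:.1%}")
# 80.0% -> 10.7%, 90.0% -> 34.9%, 99.0% -> 90.4%, 99.9% -> 99.0%
```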

Nathan Labenz: (14:03) Let me try to put a little taxonomy on some of this, all these different approaches to improving reasoning. A number of the things you said there, from the RLHF to the supervision, seem to be late-stage training techniques. And then I think I also heard echoes of one of my favorite podcasts that you guys did actually with Noam Brown of the Cicero paper, which is the AI playing Diplomacy at human level-ish. I hear echoes of that when you talk about using more compute at runtime. That was a huge takeaway from that discussion for me. One thing that maybe didn't come up as much is curriculum learning, pre-training type stuff. Is that because you don't think that's as big of a deal? There's been a ton of people who've gone and said, let me rip a bunch of stuff from GPT-4, and then I'll train my open source model to mimic the GPT-4 reasoning. And then often they'll declare victory on some benchmark or something. But my general sense of that wave is that it's kind of past now and for good reason, because mostly nobody really got anywhere close to GPT-4, even if they managed to hit a similar score on a particular benchmark. So it does seem like that finishing stuff is maybe not enough. Spending more compute at runtime definitely seems like a potential huge unlock. I gather that maybe that's part of what Gemini is supposed to be doing. Do you think that finishing stuff is more powerful than I'm giving it credit for? Then I'd love to hear your thoughts on the pre-training curriculum learning side as well.

Josh Albrecht: (15:48) There's three things. One other technique on the finishing stuff and compute at inference: there's also things that are not just compute at inference, but asking a different question or trying to use the tools in different ways. I think those are the places where we're more excited about getting a lot more gains. If you take a question and you really break it down into tons and tons of little pieces and then answer those questions and build those back up, you can get much, much better generations. And this is what we see with people that are making tools for writing novels or something like this. They don't just ask, write me a novel. They have a whole structure to it. And so you can think about how to break these things down. Maybe you can even start to break things down automatically. That kind of, how do we use these tools in different ways, I think, is much more high leverage than some of the other approaches. The pre-training one, I think there are interesting things that can be done in terms of curriculum pre-training. Actually, one of the companies that we invested in recently has a researcher whose focus is on this in particular, and they have some papers that show you can actually take out a decent fraction of the training data and not really hurt performance at all. And the effect of that is that you can train faster. But these effects are more on the scale of 2x in terms of how much compute you have to put in to get there. It's not really helping you at the end of the day make a much better model because if you train... you just need more data, but there isn't any more data. So alright, fine. So the curriculum things, I think we're really getting pretty good features from the way we're pre-training these models right now. And then the trick is, how do we make these later stages, the fine-tuning, the other later data, the supervised fine-tuning, the RLHF? How do we do this other stuff to push it in the right direction? And there, I think we can go a lot further. There's some interesting stuff with grokking on addition or modular addition showing that if you really train it long enough and hard enough on these particular examples, it actually gets 100% accuracy on modular addition. And that's actually pretty interesting, especially from a reasoning perspective. If we could get something that could reason in a much more robust way, that's actually quite interesting. So for us, we're interested in that kind of generated data, making a lot of specific fine-tuning data for these types of things that you really care about. I think you can push really hard on that part of it as well.

Nathan Labenz: (18:00) Yeah, interesting. Boy, there's so many connections just between your one statement there and basically the whole field and definitely a number of episodes that we've done as well. We did an episode with the founders of Elicit, who have done some really good work on breaking these problems down into their constituent parts. Boy, yeah, so many. We just did an episode with Alex Watson, who's the founder of Gretel, which is the synthetic data company, which is a whole interesting thing that I enjoyed going on a deep dive down the rabbit hole on. And then maybe the thing that jumps out to me most about what you were just saying is you think we're getting pretty good features from the way that we're pre-training. Is that equivalent to saying you think that they're learning a robust world model? So what's the difference between good features and a good world model?

Josh Albrecht: (18:53) I am really interested in people working on systems that can learn better world models, but I do think that that will take a little bit more of a larger change in terms of how these systems actually work. Right now, transformers and large language models are language models. They're just statistical models of, what is the probability that this word comes next? And that is a very powerful tool that we can use in different ways, but ultimately, look at the kinds of features that they learn. I think the work by Chris Olah at Anthropic, they had this very interesting recent paper on mechanistic interpretability and monosemanticity, and that is really interesting. That stuff is saying, look, you can see the features here. The features are like, does this look like a Base64 string? Does this look like Hebrew? Does this look like whatever? Does this look like the word "the" in a mathematical context? The word "the" in a physics context? The word "the" in a social sciences context? Okay, that's kind of a weird feature. Why would I ever want that? But I think what they're doing is these transformers are taking all these weird features of language, of statistical co-occurrence of words, and using those to do the tasks that we actually care about. And it turns out that most of the tasks that we care about, you actually can solve in this really weird way without knowing anything about the world. And that's just a property of the types of questions that we normally ask and the property of the data and the training data and the distribution. You can ask really weird questions. Some people showed that you can take print and len in Python and reassign them to each other. So len is print and print is len now. And then the language model is just absolutely garbage at tasks afterwards. Whereas a person will realize, this is weird. You shouldn't do that, but fine. I guess I can figure out what you're doing. The language model doesn't have the same ability to tear apart the symbolic nature of these programs. Right? Because it's just looking at the co-occurrences of words. It's never seen print used where it's supposed to be using len before. And so what we need if we want to get more robust systems is to change the underlying thing to get to a better world model. So the stuff that works on multimodality, those kinds of things can help incentivize it. There's a bunch of other ways, the monosemanticity thing. You can imagine using the monosemanticity work to do the sparse autoencoder thing they did and then say, yes, we actually want features that correlate to the real world. Let's use this as a way of making it more generalizable, more robust. So I think we'll start to see those kinds of things as we go further and further into the future. People will figure out how to make these more robust and more anchored to the real world and thus more useful, but that's going to be a little bit of a harder, longer thing. I don't think that's going to happen necessarily over the next year.
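
The print/len swap Josh mentions is easy to reproduce; here is a minimal version of that experiment in plain Python (the exact prompts used in the study he's referencing aren't specified, so this just illustrates the idea).

```python
# Rebind Python's built-ins to each other, as Josh describes. A human reader
# can follow the renaming; a model that has only ever seen these names in
# their usual statistical contexts tends to fall apart on code like this.
print, len = len, print

xs = [1, 2, 3]
n = print(xs)   # despite appearances, this calls len -> 3
len(n)          # and this calls print -> outputs 3
```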

Nathan Labenz: (21:21) Just as an aside, for anyone who hasn't heard it and wants a 20 minute monologue on that paper, I did one in the Research Roundup episode not too long ago. You don't need it, but some in the audience may not be familiar with that yet. I do think that's phenomenally interesting work and hence the long monologue on it. So does that put you in a position of disagreement with, or would you reframe, some of the things that we hear from Ilya from OpenAI, who says, when we train the language model just to predict the next token, what it actually learns is a whole representation of the world that generated that next token? It sounds like you don't really think about it that way.

Josh Albrecht: (22:09) It does generate a representation, but I think my point is there are many different representations, and some representations are better than others. Some of them correspond with reality, and most of them don't in transformers. So I think that's the point. There are lots and lots of representations. It learns a representation. That might not be the best representation for your particular type of task or the particular way in which you want to generalize. Physics is all about us finding simpler and simpler representations, right? Mass and velocity. These are really useful concepts. You can distill a lot of information down to just these small number of quantities and be very predictive. You're not getting quite the same thing in transformers, and we can see this provably in small toy cases. If you look at addition, for example, transformers do not learn general purpose addition. At best, they learn what is called modular addition, addition in a fixed size set of things, and that's because of the type of programs that transformers can learn. What that means is they cannot generalize provably to longer sequences in the same way that you and I can generalize to longer sequences of addition. However, Hadi, one of our friends, recently had a really cool paper about this showing that, actually, you can change the structure of the information coming in by putting these prefix markers so that the transformer can learn a different type of program, a simpler, better type of program for addition, and that does generalize. And so the original thing that's being learned here is just this bad way of thinking about it where it's doing it in this really weird trigonometric space and adding these things together, which is not how you're supposed to do addition. It's fine for modular addition, but not for regular addition. But this other one lets you learn the actual algorithm for addition, which is simpler than the modular one. And so that is saying, look, if you structure the data correctly or if you structure your network correctly, you can get better representations. I think this addition case is a good example. It's nice and simple and easy to see, but I think we see the same thing across all different tasks. Probably most different tasks are like this as well. Especially, this is why when you ask GPT-4, draw me a unicorn, or is it safe to put a baby in a dishwasher? It doesn't really have the right concepts to quite answer these correctly, but we will get there as we start to add more data to it, either from multimodality, putting these markers in there, putting our own biases in there, etc.
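
The prefix-marker idea can be illustrated with a small data-formatting function. The exact marker scheme in the paper Josh refers to may well differ; this is a hypothetical sketch of the general trick, which is to expose digit positions explicitly so the model can learn the general carrying algorithm rather than a modular shortcut.

```python
def with_position_markers(a: int, b: int) -> str:
    """Format an addition example with explicit digit-position markers
    (illustrative only; not the actual format from the paper Josh mentions)."""
    def tag(n: int) -> str:
        digits = str(n)
        # label each digit with its place, counted from the least-significant end
        return " ".join(f"p{len(digits) - 1 - i}:{d}" for i, d in enumerate(digits))
    return f"{tag(a)} + {tag(b)} = {tag(a + b)}"

print(with_position_markers(345, 78))
# p2:3 p1:4 p0:5 + p1:7 p0:8 = p2:4 p1:2 p0:3
```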

Nathan Labenz: (24:22) Yeah. How do you see the multimodality? I guess if I had to summarize my own sense of this, I do feel like there is some very meaningful reasoning going on in these systems. I often say AI is alien intelligence because as with that Grokking result, it's like, boy, that's a weird way to do it. It certainly doesn't seem to be anything like what I'm doing, but it is getting the right answer. And yet, in some cases, it can be superhuman, in other cases, it can be amazingly stupid at things that would seemingly be surprisingly obvious. But with the addition of the image understanding, it does seem to have taken a notable step up where now it seems increasingly hard to argue that there's not some pretty significant world model in there. Right? I mean, because these random snapshot scenes that it's able to handle so well, clearly they were not in the training data, right? I just took that picture. So how do you think about the difference that the addition of, say, vision, but new modalities in general is making?

Josh Albrecht: (25:38) Yeah. I think it's related to what I was just saying. The modalities act a lot, I think, like these prefix markers from Hadi's paper. The prefix markers, in addition, are making it easier for the transformer to learn a simpler representation of the world. Similarly with pictures, if you see a picture of a cat and then you read all these stories about cats, it's like, oh, okay. You kind of have to end up with representations that are a little bit more similar. And so I think that's how I think about it, is that it's helping force these towards more general, more robust things. In particular, images are interesting, but I think probably the most interesting version of this are videos, especially causal interaction videos, first person videos, these types of things, with good descriptions of what's going on. I think once you're really predicting what is happening in a video, like a summary of a whole long video in the world, it's pretty hard to fit that data plus this other text and plus these all these other videos. You end up, it constrains the representation so much that I think it'll probably generalize. We'll have much, much better representations, I think, as we add all these other modalities in.

Nathan Labenz: (26:42) How much do you think you can separate reasoning and knowledge? You might imagine, and there have been attempts like this where people train a language model on just a bunch of pure math or a bunch of logical sort of deductions that could even be just programmatically generated. And then the hope would be you have some sort of reasoning model that comes out of that that might not know literally anything about the world. In that case, it could be true to say you've just been trained on pure logic. You know nothing about the world. How separable do you think those things ultimately are going to be?

Josh Albrecht: (27:20) It's an interesting question as well. It's difficult. What do we mean by reasoning? What do we mean by knowledge? These things are very, very connected, and there's different types of knowledge. There's the knowledge of how many people live in Tunisia, and there's the knowledge of how to swing a golf club. And these are also two totally separate types of knowledge. I think we're going to start to get better terms and tools for breaking these things down and being a little more specific about what we mean in the next few years, which is very exciting. As you said, you can definitely make things that learn logic puzzle type things very well. One of the internal experiments we did as a sort of toy experiment, we were looking at these things, I think they're called Einstein puzzles. I don't know why they're called that. But just those things from third grade where it's like, Billy and Sally and Jenny each have one object. The objects are red and blue and green. One of them is a blah. This person has this one. This one has one that isn't like that one. And then who has the red object? Those kinds of things where you have to do this logical inference to figure it out. We found that even ridiculously small transformers are able to do really well on this. Right? And so they don't really know anything about the world. You can ask them all sorts of other questions, and they're trash at everything else. But they're really good at this type of reasoning. Right? This, quote unquote, reasoning. That's just one type of reasoning too. Right? It's a very, very specific small set. I would say that reasoning is probably a set of heuristics or skills or ways of transforming information. So knowledge is pretty related to that because if you have no information, it's kind of hard to transform it. Yeah. I think that's kind of how I would separate the two, but I don't think we have the perfect terms for this yet, and we're still figuring them out. Hey. We'll continue our interview in a moment after a word from our sponsors.
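
Puzzles in this style are easy to generate programmatically, which is part of what makes them attractive training data for a small model. A toy generator in the spirit of the internal experiment Josh describes might look like the sketch below (names, clue format, and difficulty are illustrative, not Imbue's actual setup).

```python
import random

def make_puzzle(rng: random.Random):
    """Generate one tiny Einstein-style puzzle and its answer."""
    names = ["Billy", "Sally", "Jenny"]
    colors = ["red", "blue", "green"]
    rng.shuffle(colors)
    assignment = dict(zip(names, colors))        # who holds which color
    holder = next(n for n, c in assignment.items() if c == "red")
    clues = [f"{n} does not have the red object." for n in names if n != holder]
    prompt = " ".join(clues) + " Who has the red object?"
    return prompt, holder

rng = random.Random(0)
print(*make_puzzle(rng), sep="  ->  ")
```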

Nathan Labenz: (28:54) Yeah. Micro reasoning skills, I think, is definitely an interesting paradigm. In the end, you may have just a huge number of very specific heuristics that get deployed. Although that does still raise the question of insight. Right? Or, I sometimes call these eureka moments. And, I think this is maybe the biggest question, even perhaps bigger than can we get reliable agents, would be can we get to a point where models can figure things out that experts don't know, and really come up with things that are not, I was about to say not in the training data, but that may not even be the thing. Right? Some of the recent stuff I've been looking at too with, oh hey, here's the technique to have a 10 million token context window or whatever. When I think about that and I think about 10 million tokens fully attending to each other, it's like, there's a lot in that training data perhaps that people have not, in fact, identified or learned everything that they could from. So I wonder if you have any thoughts on that. What do you think is the outlook for, let's say, eureka moments, truly novel insights out of generalist systems?

Josh Albrecht: (30:06) I don't think that it's going to be a very discrete or a very different thing. I think that as we make them much better at reasoning, as we make them better at acting in the world, we're already going to start to see these things. And in fact, you can already sort of, if you squint at it, kind of make this claim for existing things like some of the patterns that it finds in proteins or some of these other things that people don't necessarily have the same intuition for are already kind of interesting, certainly more interesting than I would have thought of. So maybe there's someone in the world who has a much better intuition about these things than me, but I think pretty soon we're going to start to find ones that are actually novel in a pretty interesting way. I don't think that that'll be too separate of a thing.

Nathan Labenz: (30:43) So let's talk about some of the projects that you guys have done and published out of Imbue. Maybe for starters, because you were talking about the multimodality and the learning from video, for example, you created a training environment, as I understand it, for agents that is kind of a Garden of Eden sort of thing where I went and looked at some of the videos and some of the simulations. Basically, it's meant to be fast, which is one thing, and kind of a substitute perhaps for Roblox where you have a big environment, but it's slow to deal with. Here, you've got this kind of quick, computationally efficient, natural environment. And the idea is that agents can explore it and try to figure out how to do stuff. And they're doing this in a totally wordless, textless, knowledge-free way, with nothing other than the environment that they're in. So I'd love to hear a little bit about that project and whether or not any agents have actually accomplished much now that it's out there for people to train on.

Josh Albrecht: (31:47) So, that project we started on actually almost two years ago now, so at least a year and a half ago. And the main purpose of that, prior to that, we've done a lot of work on self-supervised learning and made some very cool systems that can learn things completely autonomously, much like a child, even if you just leave a person alone, they'll figure out what are objects, etc. So we had this, but you can have a self-supervised system learning all sorts of things, but what should you learn? And so we are experimenting more with reinforcement learning and figuring out what matters to learn. You don't want to spend all your time looking at the pattern on the wall or something like that. You want to find food and stay alive and all these other things. So we have as people biases about what information is and is not important. So as we started exploring that and exploring reinforcement learning, we realized that one of the big things holding the field back was a lack of benchmarks where these other more complicated abilities like curiosity or novelty or exploration would actually matter. In Atari, you just kind of get penalized for that stuff. It's not usually very helpful except on maybe a small handful of games and then there's kind of like, are you overfitting to this, etc. So we wanted to make something that kind of had a much bigger set of things that you could do, a much more challenging, much more open world, and that's why we made it procedurally generated and much larger and everything. We also spent a lot of time, because reinforcement learning is much less efficient than most learning. So we spent a lot of time making sure it's very efficient. And I think it ended up being on the order of a hundred times faster than Minecraft, etc. And what that allowed us to do is to play around with agents on a wide variety of tasks of a varying range of difficulties and see what do our current reinforcement learning systems, what are they really good at doing? And what we found is that they're actually fairly good at learning these kinds of behaviors. Like, it learns to open the door, for example, even with complicated locks and bars and all these kinds of things, but it mostly does it by bouncing next to the door and jiggling things until it finally succeeds. If you watch it, you would definitely struggle to say that it understood how to open a door. It does find a strategy for opening a door, but then as you keep letting it go, you'll find a more and more efficient strategy for opening the door. But this is not at all how we open doors. We turn the handle and pull it open because we know how the mechanism works after we've learned it. So we published that work last year at NeurIPS, almost a year ago. We had submitted it a year and a half ago. And after we submitted it, we started working on actually adding multiple agents and adding language and adding other things to this environment. And as we started doing that, we started realizing like, wow, there's all these different tasks you could do. You could put Atari in here and have it play Atari. You could put a web browser in here and have it work on a web browser. You could do this other stuff. And as you start giving them plans or higher level actions and these reasoning things, then it's much, much easier to accomplish these harder tasks that for PPO or these simple reinforcement learning things are just way too hard to do. 
I think most of the work that we do, most of the tasks that we succeed at, we're not using reinforcement learning to learn this. Right? That's not how you get a PhD. You don't try getting a PhD 10,000 times, and then you finally get it and say, oh, I guess I should do more of that to get my PhD. That's not at all how we do almost everything. Right? We're mostly planning. We're mostly thinking and anticipating and using this logical reasoning stuff. And so that's why we've shifted our focus towards those types of tasks, towards the coding tasks, reasoning tasks, tasks in your browser, desktop, where the planning piece is there. And we still have all the reinforcement learning stuff. We still know where those things are good. Those are great for figuring out what button to click or those kinds of lower level behavior type things or guiding the system, but they're kind of two complementary techniques.

Nathan Labenz: (35:14) Yeah, especially, I guess, things that are kind of wordless. Speaking of eureka moments, there was just this really interesting result I'm sure you saw about using GPT-4 to write reward models that were then used to drive the reinforcement learning and even to teach a robot hand to twirl a pencil. And it's like, yeah, that's a hard one to communicate, right? That is one of those things that you kind of have to just learn by stumbling at it a bunch until you learn it. So it's funny, I'm always a little reluctant to take too much inspiration from humans into my understanding of AIs because I'm just worried about smuggling confusion in with analogies. But then it is also just often pretty compelling that there are some insights there to be gleaned, no doubt. So got to be careful with that stuff. Coding agents is kind of the big shift in focus. And I understand that this was, if I understand correctly, kind of a driver of the recent fundraise. In a recent interview, it was indicated, without too much detail, that a demo was the key moment that got people compelled enough by what you guys are building to write a check of however many figures $200 million is. So tell me about the agenda now on the coding agent side.

Josh Albrecht: (36:34) Yeah. Actually, you just mentioned Eureka, which is a paper by Jim Fan's group, and also another previous one of his was Voyager. And both of these are, in the case of Eureka, it's writing a reward model or adjusting the reward model. A dense reward model is what it's really doing. That paper already, you have to know if you can succeed or not. It has a sparse reward model, and we're densifying it here. And in the Voyager paper, they're writing little skills using this library of tools, an API for Minecraft. How do you put together these different skills to do slightly higher level tasks? In both of those cases, you're making a much better reinforcement learning agent by writing code effectively. And so one of the really interesting things for us, one of the reasons we're so focused on code is that it's actually just really useful. These are two concrete examples of things that Jim Fan has published, but there's other types of things that you can do as well. There's other types of ways that you can use code to make better reinforcement learning agents. And so I think that actually writing code is going to become a really important part of how we even develop and run these agents in the first place. It's not just going to be like, oh, we're making coding agents so that developers can write better code for their web apps or whatever. Now I think that most of the new code that's going to be written is actually going to be more like in the inner loop of these agents. Right? How do they help optimize their own prompts? How do they help select which few shot examples to do? How do they learn from those experiences by automatically pulling out different pieces or automatically breaking down these problems? There's a lot of really interesting stuff that you can do once your agent can start to write some code.
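
One concrete instance of the "code in the agent's inner loop" idea Josh describes, choosing which few-shot examples go into a prompt, could look like the sketch below. The embedding source and data shapes are assumptions for illustration, not a description of Imbue's system.

```python
import math

def pick_few_shot(query_embedding, example_bank, k=3):
    """Rank stored (embedding, example_text) pairs by cosine similarity to the
    current query and return the top-k example texts for the prompt."""
    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm_u = math.sqrt(sum(a * a for a in u))
        norm_v = math.sqrt(sum(b * b for b in v))
        return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

    ranked = sorted(example_bank, key=lambda pair: cosine(query_embedding, pair[0]), reverse=True)
    return [text for _, text in ranked[:k]]
```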

Nathan Labenz: (38:03) I mean, I see the same pattern in a lot of different domains where you have kind of the highest level reasoning system, whether it's a self-driving car planner or an autonomous drone navigation system, or even just some of the frameworks that folks are using. Even to break my rule and say like an analogy, I feel like kind of even myself, there's this difference in frequency where you have the high level thing that runs a little bit slow, but gives some sort of structured commands down to lower level systems that then execute those and report results back and raise errors as needed. Is that basically the kind of framework and structure that you are expecting to work?

Josh Albrecht: (38:54) It's part of it. I think in the case of self-driving cars or most of the systems that we have today, the breakdown has been done by a person. We kind of think about it, like, I guess we need some vision. I guess we have lanes to follow. I guess we have this. I guess we have that. We've done the breakdown part. I think what's interesting for me for the future systems is that they'll be able to break these things down more autonomously, and that's actually kind of part of the demo that led to the fundraise and everything. How do you break these things down in a way that's not just a person coming and putting in all of their knowledge about how to solve the problem? So, yes, I think overall, you do want to have different levels of abstraction. You want to have something at the highest level that knows the goal and just orchestrates some lower level things and that's broken down. The reason for this is these real tasks, when you're trying to do not just demos, they span many levels of abstraction. There's a lot of complexity in the real world. Right? You think about Stripe or something. It's like, how can there be so many people working in Stripe? All you're doing is paying for a thing online. How hard can that be? Turns out really hard. Turns out there's a lot of details to that kind of stuff. Right? It turns out everything is like that. And so if we have a system that can more automatically break these things down and actually start solving these problems and putting them back together properly again, I think it's going to look, broad strokes kind of similar, but in a sense it'll be quite different because this can happen dynamically. This can change over time. You might be able to come back to the system and say, you know what? We have this, we're using this language model here, but it's doing something stupid. We're just doing addition. Let's just call WolframAlpha. Let's just use a calculator. Okay, great. Now it's a lot faster. And so once this is more dynamic, it's going to be more evolving. It's going to be able to optimize and continually improve in a way that's much, much harder for a self-driving car system that's been made by whole huge teams of people.
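
The "just call a calculator" step Josh describes is essentially a routing decision inside the agent. A minimal sketch, where the `llm` argument is a hypothetical callable standing in for whatever model the system would otherwise use:

```python
import re

def answer_subtask(subtask: str, llm) -> str:
    """Send pure arithmetic to a cheap, exact evaluator; everything else
    falls through to the (hypothetical) language model callable."""
    expression = subtask.strip()
    if re.fullmatch(r"[0-9\s\.\+\-\*/\(\)]+", expression):
        # deterministic, fast, and correct -- no model call needed
        return str(eval(expression, {"__builtins__": {}}, {}))
    return llm(subtask)
```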

Nathan Labenz: (40:32) So I don't know how much you can tell us about, at this point, what that demo looked like, but I certainly would love to hear more of the details. And then I'm also kind of wondering what language models were you using in the core of this thing? Like, one way to do it would be to go use GPT-4 and have it serve part of the thing. I don't know if you have trained your own large scale models so far, although I know that with the raise and the 10,000 H100s that are a part of that deal, that's certainly in the future. So where are you guys right now in terms of building your own core language models versus using others? And, yeah, to the degree you can, what are the capabilities that you're starting to see unlocked?

Josh Albrecht: (41:18) We had our own cluster even back then. It was not on the same scale. We'd already trained actually hundreds or maybe thousands of language models by that time as part of CARBS actually, our hyperparameter optimizer. So we did have some language models of our own. There's also open source ones that we can fine tune. And there are ones like OpenAI and Claude. So the demo actually used a mix of all this stuff in different ways, you can update them and see like, okay, how well does this one work? How well does that one work? What happens if you fine tune it, etc? It wasn't always strictly better to use GPT-4. In some cases, you do want to be able to run this really quickly or run a whole bunch of these in parallel. There's definitely rate limits and things like that. So there are different pieces of this. So even back then, it was still useful to use a mix. Since then, we've been able to train much larger ones and that's the stuff that we're working on now. We finally have our compute coming online and can actually start to train much larger ones. So that's pretty interesting. That's a lot of the work that we're doing over the next year or two is improving those. In terms of what the demo actually was, it was really us making a few different processes actually for automatically breaking down specific types of questions on specific academic datasets and showing them like, look, if you break these down in these automated ways that are actually not really that complicated, then you can get much, much better answers when you then take those answers and reassemble them back into the final answer. And I think people since have actually published some of these kinds of things that we were doing as various different types of academic papers. So it's been nice to see like, oh, okay, that wasn't just a one-off thing. That seems like a technique that actually generally applies, more applicable, or is a more generally applicable type of a thing. For example, yeah, some of the tree of thought type stuff was very similar to some of the things that we were working on. It's like, okay. If you break it down in this different way, you have different uncertainties. How do you explore those? How do you bring those back? You can also think about, yeah, not all of them have made their way to academic papers quite yet, but some of them have. So it's been nice to see that. But, yeah, at a high level, it's basically just that. I think one of the datasets that we focused on was ANLI, the adversarial natural language inference dataset out of Facebook where the goal is to say, okay, here's this context, here's this hypothesis. Is the hypothesis true or false or you can't tell? And this is a pretty useful thing from the perspective of reasoning because you really want to know, is this right or wrong? And that's a very, very helpful thing. And it's called adversarial natural language inference because it's about people trying to trick the language models. So this has a bunch of really good examples of things where it's like, yep, the language model gets it wrong. Okay, why is that? Oh, it's actually because when you break this apart, you start to see, oh, I see. It got confused about, this song was written by X and Y. And then you just ask if it was written by just X and it just says yes because the word is just there and it's fine. But once you break it apart and you're like, okay, who are all the authors of this, etc, then the answer becomes sort of obvious. 
So once you break it apart, and this is a similar thing to chain of thought. Chain of thought, what it's doing is it's kind of spelling out the information for the language model so it's easier for the attention to force the right tokens to happen. But you don't have to do that just with chain of thought. You can also do that by breaking down. You can generate questions to ask and then answer those questions. You can look at the uncertainty. You can explore these things in different ways. So we had just a whole suite of different things for doing that, for breaking them down, and showed much, much better performance on this task. Now who cares about ANLI in particular? We didn't care about publishing this particular thing and wrapping it up and everything, but the point was, look, you can really solve these problems by breaking them down this way. We should go make this actually useful.
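
A rough sketch of the decompose-then-verify pattern Josh describes on ANLI-style questions. The `llm` argument is a hypothetical completion callable and the prompts are illustrative; Imbue's actual pipeline is not public.

```python
def verify_hypothesis(context: str, hypothesis: str, llm) -> str:
    """Break a hypothesis into atomic questions, answer each against the
    context, then reassemble the sub-answers into a final verdict."""
    sub_questions = llm(
        f"Context: {context}\nHypothesis: {hypothesis}\n"
        "List the factual questions that must all be true for the hypothesis "
        "to hold, one per line."
    )
    verdicts = []
    for question in filter(None, map(str.strip, sub_questions.splitlines())):
        reply = llm(
            f"Context: {context}\nQuestion: {question}\n"
            "Answer with exactly one word: yes, no, or unknown."
        )
        verdicts.append(reply.strip().lower())
    if any(v.startswith("no") for v in verdicts):
        return "contradiction"
    if verdicts and all(v.startswith("yes") for v in verdicts):
        return "entailment"
    return "neutral"
```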

Nathan Labenz: (44:45) Yeah. So again, it's kind of an echo of the more compute at runtime playing a big role there. On the compute, so this is a topic of so much speculation. So NVIDIA is one of the investors in the round, right? And I was just doing a little back of the envelope math, and I'm missing something in this analysis. You can help me figure out where I'm going wrong. $200 million raise. I just looked online at the going price of a H100. And if I were to buy 10,000 of them, by the retail math, that seems to come out to $300 million all on its own. So either you're getting a deal somewhere or there's some sort of financing that's implied in this. Can you give us a little more kind of unpacking of the nature of your access to all this compute?

Josh Albrecht: (45:39) Yeah. So we're working closely with a new compute provider, Voltage Park, and they are setting up this much larger cluster. They're the ones who actually bought it. We originally were thinking, oh, should we buy them? Should we not? Because we bought our own before, but this is a much larger amount. It's like, oh, should we buy it? We're going to be locked into this for so long. So they actually ended up buying this huge amount of compute, and we are getting the compute through them. We're not going to have the compute for the entire lifetime of this thing, but we will have the compute. They can get a much larger cluster. We can use that much larger cluster without spending all the money that we raised. So that's kind of how it works. We're planning on spending it over a shorter amount of time than the entire lifetime of the chip.

Nathan Labenz: (46:16) That dynamic in and of itself is definitely super interesting. When you think about this cluster, 10,000 H100s, I've been fascinated lately by just trying to do a little bit of what I've called time to GPT-4 math. Obviously, nobody knows exactly what the total FLOPs into GPT-4 were. We have a sense that it's maybe something around 10 to the 25, whatever. And then if I just kind of start doing the math of, okay, well, an H100 can do roughly 10 to the 15 FLOPs. And how does that kind of trickle down to how long does it take me in, say, days with a cluster of 10,000 H100s to get to GPT-4 FLOPs scale? Depending on the assumptions or whatever, it's like 3 to 7 days. And that's a startlingly short time.

Josh Albrecht: (47:19) 3 to 7 days for GPT-4? That's a little bit off from my own estimate, but how big are you assuming GPT-4 is?

Nathan Labenz: (47:26) I'm doing 10 to the 25 for GPT-4.

Josh Albrecht: (47:30) FLOPs or parameters?

Nathan Labenz: (47:32) Total FLOPs in. So basically assuming it's one order of magnitude less than the 10 to the 26 recently declared reporting threshold. And then I'm putting in 4 times 10 to the 15 FLOPs per device, which is what I understand the spec to be if you're doing 8-bit quantization in your training.

Josh Albrecht: (47:57) Yeah. You can't quite do that.

Nathan Labenz: (48:00) What would you estimate time to GPT-4 to be, and how do you back into that?

Josh Albrecht: (48:05) I mean, the best I have is from some other source online saying that it probably took about 30,000 A100s for about 3 to 5 months, something like that. Now, we don't know: is that all the training, or is that just the last run? I don't know, but it's roughly the right order of magnitude. It also checks out if you figure out how many parameters it was. I think there was something saying, oh, maybe it's a mixture of experts with 8 experts of 200 billion parameters each, which, again, makes sense. Back of the envelope, that seems roughly where you want to be. They're probably doing roughly compute-optimal training, maybe a little bit of overtraining or something like that. Mosaic actually had a good blog post, well, not recently anymore, a while ago, looking at how long it takes to train networks of various sizes. I think they estimated that on their Mosaic cluster, a 30 billion parameter model trained Chinchilla-optimal needed 600 billion tokens, and so it took about a month to train. If you take that and scale it up to, let's say, a 4,000 H100 cluster, then I think you can train maybe a 200 billion parameter model in 45 days. So it's going to take a while if you're training a 1.6 trillion parameter model, but our strategy is not necessarily to train the absolute largest model possible. If you look at the GPT-4 technical report and you look at those perplexity curves, you see them go down like this, and then you look and it's like, okay, this point is here, this other point is here with one one-hundredth as much compute, and you've only lost a tiny bit in performance. And so our perspective is, look, if we're going to spend a billion dollars on this one training run, I'd rather spend $10 million on the training run and then get my performance increase somewhere else. Can I make it better in any other way? A 3 or 4 or 5% increase is going to make a huge difference there. So I think that's how we're thinking about it. We'll train very large models, but the goal will be to push on these other things to try and make the performance better instead of just trying to make it big. That also makes it easier from a data perspective: you don't need as much data to train those much smaller models.
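As a rough cross-check of those numbers, the standard approximation is training FLOPs of about 6 times parameters times tokens, with roughly 20 tokens per parameter for Chinchilla-optimal training. The per-GPU throughput and utilization below are assumptions, not Imbue's actual figures.

```python
# Sanity-checking the "200B-parameter model in ~45 days on 4,000 H100s" figure,
# using the standard ~6*N*D estimate of training FLOPs and ~20 tokens/parameter
# for Chinchilla-optimal training. The 35% MFU is an assumed utilization.

N_PARAMS = 200e9
TOKENS   = 20 * N_PARAMS          # ~4T tokens, Chinchilla-optimal
FLOPS    = 6 * N_PARAMS * TOKENS  # ~4.8e24 FLOPs

N_GPUS  = 4_000
PER_GPU = 1e15                    # ~1 PFLOP/s BF16 ballpark per H100
MFU     = 0.35                    # assumed utilization

days = FLOPS / (N_GPUS * PER_GPU * MFU) / 86_400
print(f"{days:.0f} days")         # ~40 days, consistent with the ~45-day estimate
```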

Nathan Labenz: (50:17) Yeah. That's definitely really interesting. So it seems like you're not conceiving of this compute cluster as a giant laser that you're going to point at a single target, at least not very often, if ever.

Josh Albrecht: (50:31) That's not our strategy. It's just a different bet. Anthropic and OpenAI are the ones exploring the super high scale models. It seems like from Llama and from these other models, you can get really good performance from smaller models. When we fine-tune Llama, we see better-than-GPT-4 performance on a lot of tasks. So do we really need to spend that much? Maybe our next cluster will be super gigantic. But for now, I think we can get really far with models that are still very large. 200 billion parameters is the size of one of those experts, right? So it's still quite big. In fact, in a sense, that's really what you're getting in the inference pass anyway if you're only routing to certain experts. So we think we'll probably get relatively close. I'd much rather spend 45 days training something than a year.

Nathan Labenz: (51:11) Yeah. So what is it about that? You reacted when I mentioned 8-bit quantization. My read of the literature recently has been that, certainly at runtime, you've seen people going all the way down to 3-bit. But even in training, it seems like the trend is toward fewer and fewer bits. But you think 8 is one step too far? You train at 16?

Josh Albrecht: (51:35) Yeah. I mean, I think for now, you might want to train at 16. You're welcome to train at 8 for smaller models. I think as they get bigger, you have larger risks of them diverging, and there are other sorts of problems that happen. And you might not get quite the performance that you were expecting as you're doing the FP8 stuff. So there are still a few more kinks to be worked out, I think, and NVIDIA is working on those. We're also looking at those, but I think, just be careful. All the literature and the demos and everything look great, but if you're doing this at home and you're going to spend a few million on something, it's not quite that easy.

Nathan Labenz: (52:08) Is there sort of a progression that might also make sense there? Obviously, in the early going, the curve is steep, right? It seems like you could get away with 8-bit in the early going. And then maybe at some point, when the curve gets flatter and you're starting to decrease your learning rate on a schedule, maybe you say, hey, now we'll flip into a few-extra-bits mode.

Josh Albrecht: (52:33) The curve's steep for most of the training. When you look at the curves, for the first 20% or so, you're getting this big improvement, and then for the last 80 to 90%, it's just this slow, slow grind. So you're not really saving yourself too much, and now you have the extra complexity of flipping from one to the other. So, yeah, it's possible. And I do think it's possible to do with FP8. I think you just need to be careful. It's easy to have accumulation and precision errors that end up biting you.
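As a concrete illustration of the "train at 16 for now" default, here is a minimal PyTorch training step using BF16 autocast with FP32 master weights. This is a generic sketch, not Imbue's training stack; FP8 training would typically involve additional tooling (for example NVIDIA's Transformer Engine) plus the extra care discussed above.

```python
# A minimal mixed-precision training step in the "16-bit for now" spirit:
# BF16 autocast for the forward/backward math, FP32 master weights.
# Generic PyTorch sketch (requires a CUDA GPU), not Imbue's stack.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024)).cuda()
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)  # optimizer state stays FP32
loss_fn = nn.MSELoss()

def train_step(x, y):
    opt.zero_grad(set_to_none=True)
    # Matmuls run in BF16; parameters and optimizer state remain FP32, which
    # avoids FP16 loss-scaling machinery and sidesteps the accumulation and
    # precision pitfalls that become more delicate at FP8.
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        loss = loss_fn(model(x), y)
    loss.backward()
    opt.step()
    return loss.item()

x = torch.randn(32, 1024, device="cuda")
y = torch.randn(32, 1024, device="cuda")
print(train_step(x, y))
```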

Nathan Labenz: (52:59) Well, that's the kind of expertise that we're looking to unlock little nuggets of here, so that's cool. You also mentioned this CARBS framework that you guys have developed, Cost-Aware Pareto-Region Bayesian Search. So this is a natural segue to that work of figuring out, well, what should the learning rate be? And do these kinds of things work and pay off or not? I haven't had a chance to observe this work in depth, but I've seen some very compelling, basically GIF-level graphics of it, which I'm always a big fan of, that show the systematic exploration of the Pareto curve and just gradually, incrementally, bit by bit pushing it out. Tell us a little bit about that work, and I'll follow up on the parts that jump out to me most.

Josh Albrecht: (53:50) Yeah. CARBS is actually super exciting and super handy when we're doing our own research. It came about as a result of just the way you normally do machine learning research when you start out. It's like, okay, you train your network, it kind of works. You wonder if you need to change the learning rate, so you try a few different ones. You're kind of always poking at this manually. And then at some point, you're like, oh, maybe I should do a grid search or some other optimization. But a grid search is really inefficient, so then you're like, ugh, maybe I could find a smarter way of picking which things to try. Okay, is there stuff in the literature about this? Yeah. If you have something that kind of works, you can do a sort of local search with Bayesian optimization. That's one half of it, Bayesian optimization, and the other half is natural evolutionary strategies. Really, CARBS is just gluing these two together and making it so that the program is calculating which thing to poke at and try next. And this is nice because it can do a much better job than us in these very, very high-dimensional spaces. When you're considering 10 or 12 different parameters, it's really hard as a person to know where the optimal place to put a point is to learn the most, especially in a cost-aware sense. So this tool really emerged from our own use and just not wanting to have to mess with it. Now we have it at a point where you can kind of just hit go, go to sleep, and by the time you wake up, it's run tons and tons of language models of different sizes, and you can see this nice curve. You can see how not just the data and compute change with scale, but how all the parameters change with scale. And you can say, okay, I see, it seems like it's adding more and more heads, it's changing the key-value size in this way. Oh, it looks like if I wanted to make this much, much bigger, I don't want to be here. You can pick these points a little bit better than you otherwise would be able to and start to see really interesting new types of scaling laws in these other parameters that are different and a little more subtle as well.

Nathan Labenz: (55:34) Yeah. So it's the automated discovery of scaling laws. And when I think of hyperparameters, typically I think of those as being outside of the definition of the model, but it sounded like you're also looking at different widths, different numbers of heads. So basically different model structures too.

Josh Albrecht: (55:56) Yeah. You can extend this to network architecture search if you really wanted to. But the nice thing about CARBS is that it's robust to arbitrary parameters. So the things we can tune include things that are otherwise kind of difficult to tune, because, you know, maybe as you make your batch size bigger and bigger, eventually you run out of memory and it crashes. Well, CARBS can account for this automatically and say, okay, this is the region where it's crashing, so I'm going to automatically adjust to stay out of that region because I'm getting bad performance over there. And the thing is, in practice, yeah, batch size is easy if you're just scaling things up, of course, don't make it too big, but some of these things start to interact in different ways, right? As you have more and more parameters and larger and larger batch sizes, you run out of memory at different places, and it's hard to code up exactly where that threshold is. So it's a little bit easier to just let it go and figure out which of these networks actually work. I think it'd be interesting to extend it to network architecture search, but I don't know if the network architectures actually make that much of a difference. The transformer architectures are pretty good. So I'm not sure if that's the first thing that we'll do, but it is something we're interested in doing eventually.
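To give a flavor of the approach, here is a toy cost-aware search loop: a Gaussian-process surrogate fit to past (configuration, loss, cost) observations, candidates proposed around the best configuration found so far, and crashed runs recorded as bad observations rather than halting the search. This is emphatically not Imbue's CARBS implementation (which also uses natural evolutionary strategies and models the full cost-performance Pareto front); everything here, including `train_and_eval`, is a made-up illustration.

```python
# Toy cost-aware hyperparameter search in the spirit described above: a
# Gaussian-process surrogate over past observations, candidates sampled around
# the best config found so far, and crashed runs recorded as bad observations
# instead of stopping the search. NOT Imbue's CARBS, just an illustrative
# sketch; `train_and_eval` is a stand-in for a real training run.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

rng = np.random.default_rng(0)

def train_and_eval(params):
    """Hypothetical objective: returns (loss, cost) or raises on OOM."""
    lr, log_batch = params
    if log_batch > 9.5:                       # pretend very large batches run out of memory
        raise MemoryError("OOM")
    loss = (np.log10(lr) + 3.0) ** 2 + 0.1 * log_batch + rng.normal(0, 0.05)
    cost = 0.01 * 2 ** log_batch              # pretend cost grows with batch size
    return loss, cost

X, losses, costs = [], [], []                 # observation history
best = np.array([1e-3, 7.0])                  # initial guess: (lr, log2 batch size)

for step in range(30):
    # Propose candidates by perturbing the best-known config (crude local search).
    cands = best * np.exp(rng.normal(0.0, 0.3, size=(64, 2)))
    if len(X) >= 5:
        gp_loss = GaussianProcessRegressor().fit(np.log(np.array(X)), np.array(losses))
        gp_cost = GaussianProcessRegressor().fit(np.log(np.array(X)), np.log(np.array(costs)))
        mu, sigma = gp_loss.predict(np.log(cands), return_std=True)
        # Cost-aware acquisition: optimistic improvement per unit of predicted cost.
        improvement = np.maximum(min(losses) - (mu - sigma), 0.0)
        score = improvement / np.exp(gp_cost.predict(np.log(cands)))
        cand = cands[int(np.argmax(score))]
    else:
        cand = cands[0]

    try:
        loss, cost = train_and_eval(cand)
    except MemoryError:                       # crash: record a bad observation, keep going
        loss, cost = max(losses, default=10.0) + 5.0, 0.01
    X.append(cand); losses.append(loss); costs.append(cost)
    if loss == min(losses):                   # new best (or tie)
        best = cand

print("best config found:", best, "with loss", min(losses))
```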

Nathan Labenz: (56:59) Yeah. Another thing that occurs to me: if I understand correctly, this is all fully explicit code, right? You guys have coded every line of this and know exactly what it does. But it does strike me, when you said this thing can work better in these high-dimensional spaces than humans can, that set something off for me: you know what else can sometimes work better in high-dimensional spaces? A model, right? So is there a way in which this is potentially creating the training data, or mapping out the space, such that you might eventually end up with a model that makes these sorts of predictions? Does that seem like a realistic next evolution of this?

Josh Albrecht: (57:41) Yeah. Right now, if you look at the paper, we mean model in a few different senses. One model: it's using a Gaussian process, a very simple model of how these parameters change. That's not really the right model to use; it's not quite the right prior for some of these things. And if you did know the actual scaling law, you could put the true model of this thing in there, and then it would know this much, much better and be much more data efficient. So that's one type of model that would be really interesting to put in there. Another type of model is to learn which type of model you're supposed to put in there in the first place, which I think is what you're getting at. If you use a transformer to model these spaces and guess, okay, with these hyperparameters, what will my performance be? Actually, Google had some pretty interesting research on this, maybe about a year ago or something like that. And, yeah, it did seem to work for their types of workloads. You need a lot of data for those types of systems to work, so I'm not 100% sure if that'll be the first thing that we do, but it is an interesting method of exploration. And then the final model-like way of integrating a model would be this: what these curves always look like is a line that comes up and then tapers off. You can only tune your parameters so much, right? You're going to hit saturating regions. And at that point, what you do as a person is say, okay, great, I'm going to stop doing this and explore these new axes. Well, what are those new axes? Those new axes are new configuration values that you have to implement in your network. Okay, well, you might be able to write some of those. Or maybe you could automatically pull in ideas from the open source literature, or maybe you could generate variants of a function that you have that's kind of slow or that seems important. And you could use a transformer or a coding agent or something like this to generate those variants that you want to explore. I think that would be another interesting way of using a model in this loop. That would be fun to explore, but that one I think is also a little difficult. Maybe we'll wait a little while until these coding agents are a little more robust before we do that, but it would be cool.

Nathan Labenz: (59:38) Yeah. You guys have a very eclectic portfolio of things that you've put out, and I guess that's a reflection of what I understand to be a unique, curiosity-driven culture at the company. So comment on that all you want. But the next one I wanted to touch on is this stepwise self-supervised learning paper that came out a few months ago. This is one I've really stared at a bit and tried to develop my intuition for, and I feel like I'm making progress, but I still don't have quite as much of an intuition for what's going on there as I would like. So just to set it up very briefly: you're using an old-school dataset, CIFAR-10, ten different classes of images, and teaching a relatively small network to identify which type of image this is. And then you do something that, again, I find fascinating but am not quite grokking myself, which is to take a purely analytical, mathematical approach to figuring out what you expect to happen and then confirming that that is in fact what happens. And in both cases, what seems to happen is that one feature comes online at a time, and it comes online very suddenly, or relatively suddenly. You can see these sudden drops in the loss curve, which seem to correspond to this sudden grokking, I guess, maybe, I don't know if you use that term or not, of these new features. And that's about as far as I've got. I'd love to understand a little bit better the intuition for the math on the analytical side, for why we should expect this in the first place, and then how you understand this. Is it akin to grokking or is it something different? All I know is I'm fascinated by it, but I don't quite get it yet.

Josh Albrecht: (1:01:36) This was a very interesting research project. It actually started with an intern, and it did, as you said, start in this very curiosity-driven way. I think he was just training some network, or maybe he was watching one of us train some networks, and thought, that just looks really weird, why does that loss have those notches in it? That's not what we'd expect. Oh, yeah, I think that's what it was: he was doing a really simple toy version. He was trying to get to an analytical version of this because his background is more on the theory and math side. And as he trained it, he noticed this very strange stepwise behavior as he took it away from the really complicated, fully maxed-out thing that gets state-of-the-art and back to a slightly simpler form, and he saw this pattern and thought, why is that actually happening? And this project evolved out of us investigating it and then looking at the theory side: why is that happening? One of the things that we're interested in as a company is this kind of deep learning theory, but we mean deep learning theory not in the sense of pointless bounds on parity problems, but rather the practical questions of how these things actually work. So very practical applications of theory. And this was an example of that, us trying to figure out what's actually happening in these self-supervised image learning systems. So here, it's not exactly the same as grokking. I encourage people to check out the blog post and take a look, because it is very visual, and it's very interesting to see the loss go down in these steps. What's really happening is that it's learning one direction of the data at a time. So if we take a step back: deep learning is actually different from a lot of classical machine learning in a very particular way. We don't know exactly what that particular way is as a field yet, and so the thing that is different we call feature learning. Feature learning is everything that isn't kernel learning. We know a lot about kernel learning. Kernel methods like support vector machines, and almost all of the earlier machine learning methods, are kind of like kernel methods. Kernel methods measure similarities or distances between data points in different ways, and if you have a classifier, you can use this to make a separating hyperplane to say, this is the one class, this is the other class, and here's the border in feature space. Support vector machines are a great example of this. They make this border, and you try to make it as far away from each class as possible; you're balancing your false positives and false negatives; there's an optimal place to put it. Nice and easy. We know a lot about kernel methods and how they work. Kernel methods, however, take the kernel as fixed. You have to decide how you're going to measure the similarity between two data points before you start doing your learning. Deep learning, on the other hand, does not do that. It does not have a kernel that is fixed. Instead, the kernel changes over time, and that's the feature learning part of it. Now, it turns out that you can take what we do for deep learning and turn it into a kind of kernel problem, and that's the neural tangent kernel.
So in the past few years, people have developed this thing called the neural tangent kernel, which says: look, in the infinite-width limit, a network is actually a kernel learning system, and the kernel isn't evolving at all. And so we can make this approximation and just pretend it's a kernel method. Now we know all these things about it; you can even calculate the kernel; you can do all this interesting stuff. But those approximations, those infinitely wide networks, are actually not as good as the finitely wide ones, and that's because this feature learning doesn't happen. So really what this paper is saying is that what's happening in self-supervised learning is mostly kernel learning. It's mostly a kind of kernel PCA, an unsupervised question of: what are the main eigendirections of the data and of this kernel? There is some other stuff happening, and these other networks do perform a little bit better, but if we simplify them in this way, into the strict kernel regime, then you get to see this stepwise loss, which is exactly what we'd predict from the theory of having this kernel PCA type approach. The paper goes through the math of the linear versions of this, and then we'd expect the nonlinear versions to generalize from the linear one because of the neural tangent kernel and how similar the neural tangent kernels are to the deep learning systems. And so we see, okay, look, it seems like most of what's happening in these systems is learning these features. And when we look at what those features are, the primary directions in this kind of data-kernel space, they end up being really stupid things like color or brightness or contrast, at least when we were doing our own embeddings on a simplified version of these systems. So a sort of hand-wavy way to think about this is: what it learns first is, okay, I'm trying to separate images, I just need to tell them apart, so I'm first just going to check by color. If it's red, I guess it's this image; if it's blue, it's that other image. Okay, great. Now I've got a little more capacity, so I guess I can do not just color but also contrast. Then not just color and contrast, but brightness. Then, okay, I've run out of global things to do, so now I'll start looking at stuff in this region of the image, or at these types of patterns, or whatever. So it's pulling out pieces of the data to pay attention to and to use to separate things, and those are the features. They're not the features that we would use for an image, right? We would use things like what is actually in the image, the semantics of it, but it doesn't really matter. You can still use these weird features to do a pretty good job at the types of tasks that we're trying to do. In the self-supervised image setting, the task is to tell apart two different images and to tell whether two transforms of the same image are the same.
So if you take one image and make a black-and-white version of it, you want to say that the black-and-white version and the color version are the same image, and that the color version and some other image are not the same as each other. That's what's happening in self-supervised image learning. So here, all the paper is saying is that the types of features it's learning are these, I don't know, I think kind of dumb ones. They do evolve in the real deep learning versions, and they probably do end up becoming better features over time, but it was interesting to see how this actually happens. Another work of Jamie's, which was not really presented here but is very connected to this, is that kernel learning has this kind of capacity. Jamie found these conserved quantities, like you would have in physics, like velocity or momentum or mass, where it's like: this is just how much learnability you have. You have X amount of learnability, and you can spend it on these different directions in different ways, but you can only spend so much of it. You can balance between learning these different eigendirections, but you want to spend it maximally on the most important one, then a little less on the next one, and a little less on the one after that. That's what we see here as well: the first one gives you the biggest impact, and each subsequent one gives a lower and lower impact. So those two works are connected as well.
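For intuition about what a stepwise loss curve looks like, here is a toy experiment in a much simpler setting than the paper's (which is CIFAR-10 self-supervised learning): a two-layer linear network with tiny initialization learning a target map whose singular values are well separated. Gradient descent picks up roughly one mode at a time, so the loss plateaus and then drops in steps, which is the classic linear-network analogue of the eigendirection-at-a-time picture described above.

```python
# Toy stepwise-learning demo: a two-layer *linear* network with tiny init learns
# the singular modes of its target roughly one at a time, so the loss falls in
# steps rather than smoothly. A linear-network analogue of the story above, not
# the paper's CIFAR-10 self-supervised setup.
import numpy as np

rng = np.random.default_rng(0)
d, h, n = 20, 20, 2000

# Target linear map with well-separated singular values (modes of very
# different importance), applied to roughly whitened Gaussian inputs.
U, _ = np.linalg.qr(rng.normal(size=(d, d)))
V, _ = np.linalg.qr(rng.normal(size=(d, d)))
S = np.zeros(d); S[:3] = [8.0, 3.0, 1.0]
A = U @ np.diag(S) @ V.T
X = rng.normal(size=(n, d))
Y = X @ A.T

# Two-layer linear model; the tiny initialization is what makes the steps sharp.
W1 = 1e-4 * rng.normal(size=(h, d))
W2 = 1e-4 * rng.normal(size=(d, h))
lr = 2e-3

for step in range(8001):
    E = X @ (W2 @ W1).T - Y          # residuals, shape (n, d)
    G = E.T @ X / n                  # grad of 0.5*mean||residual||^2 w.r.t. W2 @ W1
    gW1, gW2 = W2.T @ G, G @ W1.T    # chain rule through the product W2 @ W1
    W1 -= lr * gW1
    W2 -= lr * gW2
    if step % 500 == 0:
        print(f"step {step:5d}  loss {0.5 * np.mean(np.sum(E**2, axis=1)):8.4f}")

# The loss sits on a plateau, drops as the strongest mode (singular value 8) is
# picked up, plateaus again, then drops again as the weaker modes (3, then 1)
# are learned.
```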

Nathan Labenz: (1:08:07) I think probably most people will be like me and feel like we need to study this a little bit further still to really get it.

Josh Albrecht: (1:08:11) Yeah, I think you'll have to dig into the blog posts and papers, and feel free to reach out. We're happy to chat about it more, for sure. It is a little bit complicated, and it's not necessarily the normal approach that people take to deep learning.

Nathan Labenz: (1:08:27) Very, very interesting, though. Anything I can do to get a better sense for, as you said, the very practical question of what these things are learning and how, I'm always willing to roll up my sleeves on. We only have so much time, and that definitely is one I want to keep coming back to and ponder more. But just to round out a little bit more of what you guys are doing: you're also engaging in policy. I understand you were at the AI Safety Summit recently in the UK and have been doing some prototype work to show governments how they might start to use the latest AIs in, obviously, all the information processing that goes on in the bureaucracy. Give us a rundown of what you're doing on the safety and policy engagement side.

Josh Albrecht: (1:09:13) A lot of people are very interested in AI. A lot of people are concerned about AI. A lot of people worry about potential effects now and in the slightly further future as well. And it's a thing that we think about as well, building these systems. We want to make sure that these things are actually having a positive impact on the world. So our approach to safety and to policy and to regulation is a very practical engineering one. And we've been very excited to see the stuff that people were putting out at the Safety Summit in the UK and in the executive order. A lot of people are thinking very reasonable thoughts about how we measure these systems and what kind of impact they are going to have, and it's nice to see governments taking this stuff seriously and thinking about it. What we want to do is be helpful. So one of the things we did was look at the request for comments on AI by the Department of Commerce, by the NTIA. We looked at all the submissions, and these submissions could come from anyone: organizations, nonprofits, agencies, and a bunch of individuals. There were thousands of comments by individuals. And we wanted to ask, okay, can we actually use AI to understand these comments? Can we make a sort of positive use-case example? Yes, there are lots of things to be worried about; you can see our previous work on the kinds of failures models make on ethics scenarios and how we can break them in different ways. But here we wanted to show the opposite side: if you're really careful about this, you can do a good job of using these models properly. So what we did is we broke down the problem, and we used language models to ask questions about every single one of those comments, in a way that would have been really annoying for a person who would have had to go read every single one of them. And then we also asked people the same questions on a smaller subset and correlated the human responses on that subset with the language model's, to check: is the language model getting these wrong very frequently? Is it being biased in some weird way, et cetera? So we really dug into the details. And we found that if you're careful about the questions you ask, you ask them in the right way, and you check, does this make sense, am I asking the right kind of question, do people's answers correlate with this, then you can have some pretty cool tools for analyzing much larger sets of data. So that's what we did in this particular case. Some of the things that we found: there were a lot of people, obviously, who were pessimistic about this. There were actually a lot of artists who talked about this. Copyright infringement was a big theme. Personal economic impact was another big one. People worried about AI not having the right values, or about privacy or misinformation. There was definitely some interest among people in regulating this. We can only say so much in terms of conclusions from these comments because they're very self-selected; this doesn't represent the entire population. But it was cool at least to be able to analyze it and say, hey, at least for the people who responded, what were they trying to say?
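The validation step described here (hand-label a random subsample, then check how well the model's answers line up before trusting them at scale) is straightforward to reproduce. Below is a minimal sketch, with hypothetical data standing in for the real comments and labels; the question text and label values are made up for illustration.

```python
# Minimal sketch of the validation step described above: have the language model
# answer a fixed question about every comment, hand-label a random subsample,
# and check agreement before trusting the model's labels at scale. The labels
# and the example question here are hypothetical placeholders.
import random
from sklearn.metrics import cohen_kappa_score

# Pretend these came from (a) running the LLM over all comments and
# (b) a human annotator labeling a 200-comment subsample with the same question,
# e.g. "Does this comment express support for AI regulation? yes/no/unclear".
model_labels = {i: random.choice(["yes", "no", "unclear"]) for i in range(5000)}
sample_ids = random.sample(sorted(model_labels), k=200)
human_labels = {i: random.choice(["yes", "no", "unclear"]) for i in sample_ids}

y_model = [model_labels[i] for i in sample_ids]
y_human = [human_labels[i] for i in sample_ids]

agreement = sum(m == h for m, h in zip(y_model, y_human)) / len(sample_ids)
kappa = cohen_kappa_score(y_human, y_model)   # agreement corrected for chance
print(f"raw agreement: {agreement:.2%}   Cohen's kappa: {kappa:.2f}")
# A low kappa means the question or prompt needs rework before the model's
# labels on the full corpus can be trusted.
```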

Nathan Labenz: (1:11:51) It seems like at least somebody has heard that message. We got the Biden administration telling departments that they need to start to figure out how they're going to incorporate AI into their work. And honestly, I thought that was pretty cool to see. And I am a big believer that getting hands-on with the technology is for almost every practical purpose, a good step for almost anyone to take. So that was cool. What were your takeaways from the AI Safety Summit and your participation there?

Josh Albrecht: (1:12:22) So I wasn't there actually. My co-founder, Kanjun Qiu, was there, and she did have a whole tweet thread on some of her takeaways, but I won't summarize them here because they're not necessarily my thoughts. But I would encourage people who are interested to go check those out. I think my takeaways from that and the executive order were just that a lot of this was a lot more reasonable and measured than I would necessarily have expected. There's a lot of calls for figuring out how to measure things, a lot of calls for investigation and things like this, a lot of focusing, I think, rightly on mitigations and on very specific problems, looking at cybersecurity or biological harm or something like this. Yeah, we should just do a good job on those things anyway. And there are already agencies that try and do that. So we should just have them spend some time thinking about how do we make this stuff better anyway? Is there anything we want to change given these new models? That seems like the right way to approach it instead of trying to make this big blanket new agency or radically change everything. There's a lot of smart people thinking really hard about this stuff already. We just need to do the detailed work of actually making things better, not overhaul everything.

Nathan Labenz: (1:13:31) Yeah. It seems like it would definitely be a little premature for that. At the same time, this does feel like it introduces a sort of tail risk that is probably not in anybody's established jurisdiction. I don't know if you would agree with that, but...

Josh Albrecht: (1:13:46) That's what the UK AI Safety Summit was focusing on, was more of these longer-term risks. And there does seem to be actually pretty significant international cooperation on that because nobody wants to go extinct. So even China is like, yeah, this seems great. Let's make sure everything is nice and safe from these other long tail risk things. I think there's a lot of willingness to collaborate on those types of things, which was very encouraging to see.

Nathan Labenz: (1:14:09) Yeah, no doubt. I mean, the fact that there's any agreement or same-page sort of stuff with China at this point is a ray of hope as far as I'm concerned. So just a couple minutes left. I mean, what a wide-ranging portfolio you guys have from having created these environments to experiment with reinforcement learning in, to moving to coding agents, to developing frameworks to break things down, to this optimization package, to the fundamental more interpretability-style research, and even the policy engagement all going at once. If I counted correctly on the website, there were only 26 faces. So that's a lot to be going on at a small company. Obviously, a lot more resources coming in now. I assume a significant part of that will be to growing the team. So how do you kind of tie all this together? What's the pitch to new people? Is it that they can add a seventh thing to the portfolio because you guys are just so open to that kind of thing? Or do you tie it into some single vision that you guys will be going toward? What kinds of people are you looking for?

Josh Albrecht: (1:15:20) Looking at it from the outside, it may seem like it's a little bit scattered, with all these different things going on, but actually everything really does tie together. The way we set goals as a company for the quarter is we just sit down and type in a big Google Doc together, and we write out: what do we want to do this quarter? What did we do last quarter? What went well? What didn't go well? What are we worried about? What are we excited about? And then we synthesize this stuff together and figure out, okay, these are the small number of things that we're actually going to do. So we do a relatively small number of things, and they are pretty directed. If we look at CARBS, for example, CARBS is a piece of infrastructure that just accelerates us massively across all the types of experiments that we run. And so this has been super worthwhile; it has more than paid for itself since we made it a long, long time ago. Similarly, for the NTIA analysis, doing that analysis was us prototyping our own models and our own internal agents for question-answering-type tasks. And the reason we're doing that is that this quarter, our goal is to make tools that we ourselves seriously use. So this is one example of those tools. We also have tools for fixing errors, writing unit tests, writing other functions, writing recruiting emails and scheduling. So there are all sorts of different things that we're doing, but we're not doing an infinite number of things. We're doing four or five carefully chosen applications, making sure those work, and asking how that feeds back into the model, the infrastructure, the data that we need, et cetera. So it is all working together. Yes, we do have a relatively small number of people. We have 27 now; as of, I think, Tuesday, we had a recruiter join us, so we're slightly larger. We have a small number of people, but they're all very competent. The reason we're able to do so many of these things is just that we have a very independent, autonomous, talented team, and we want to keep that. And we are going to grow that, probably maybe double year over year. We very specifically are not going to grow as fast as OpenAI or Anthropic or other companies have over the past year or two, and the reason for that is that we want to make sure that, from a cultural perspective, we preserve the things that we're really excited about. We would much rather have a company that is 200, 300, 400 people where everyone is super highly leveraged and independent and amazing and working at some huge magnification factor than have a super huge team. And I think the thing that is hopefully going to enable that is, as we actually make these AI agents work, then great, we can act at a much higher level. Instead of having a giant team of recruiting coordinators, we'll probably just have one who's using a whole bunch of these different agents. I think that's what we want for everyone, and that's already what we're starting to see. We'll probably end up spending more on compute than we will on people from our most recent fundraise, and I expect that to grow in the future. We'll end up having very highly leveraged individuals. So I think that's how we think about it: can we actually make the tools to make this happen so that we can stay small and tight-knit?
And there's a lot of benefits that we get in terms of communication and culture, et cetera, from that. We just need to make these tools actually work so that we don't need to hire 10,000 people.

Nathan Labenz: (1:18:13) Thank you for this rundown of all your recent projects. It is a fascinating collection, and I will certainly be digging in a little bit more on some of the research and eagerly awaiting what it is that you guys put out next. For now, Josh Albrecht, CTO of Imbue, thank you for being part of the Cognitive Revolution.

Josh Albrecht: (1:18:33) Thank you very much.

Nathan Labenz: (1:18:34) It is both energizing and enlightening to hear why people listen and learn what they value about the show. So please don't hesitate to reach out via email at tcr@turpentine.co, or you can DM me on the social media platform of your choice.
