Embryology of AI: How Training Data Shapes AI Development w/ Timaeus' Jesse Hoogland & Daniel Murfet

Jesse Hoogland and Daniel Murfet, founders of Timaeus, introduce their mathematically rigorous approach to AI safety through "developmental interpretability" based on Singular Learning Theory.


Jesse Hoogland and Daniel Murfet, founders of Timaeus, introduce their mathematically rigorous approach to AI safety through "developmental interpretability" based on Singular Learning Theory. They explain how neural network loss landscapes are actually complex, jagged surfaces full of "singularities" where models can change internally without affecting external behavior—potentially masking dangerous misalignment. Using their Local Learning Coefficient measure, they've demonstrated the ability to identify critical phase changes during training in models up to 7 billion parameters, offering a complementary approach to mechanistic interpretability. This work aims to move beyond trial-and-error neural network training toward a more principled engineering discipline that could catch safety issues during training rather than after deployment.

Sponsors:
Oracle Cloud Infrastructure: Oracle Cloud Infrastructure (OCI) is the next-generation cloud that delivers better performance, faster speeds, and significantly lower costs, including up to 50% less for compute, 70% for storage, and 80% for networking. Run any workload, from infrastructure to AI, in a high-availability environment and try OCI for free with zero commitment at https://oracle.com/cognitive

The AGNTCY: The AGNTCY is an open-source collective dedicated to building the Internet of Agents, enabling AI agents to communicate and collaborate seamlessly across frameworks. Join a community of engineers focused on high-quality multi-agent software and support the initiative at https://agntcy.org/?utmcampaig...

NetSuite by Oracle: NetSuite by Oracle is the AI-powered business management suite trusted by over 41,000 businesses, offering a unified platform for accounting, financial management, inventory, and HR. Gain total visibility and control to make quick decisions and automate everyday tasks—download the free ebook, Navigating Global Trade: Three Insights for Leaders, at https://netsuite.com/cognitive


PRODUCED BY:
https://aipodcast.ing

CHAPTERS:
(00:00) Teaser
(04:44) About the Episode
(09:28) Introduction and Background
(11:01) Timaeus Origins and Philosophy
(14:18) Mathematical Foundations and SLT
(17:11) Developmental Interpretability Approach (Part 1)
(20:53) Sponsors: Oracle Cloud Infrastructure | The AGNTCY
(22:53) Developmental Interpretability Approach (Part 2)
(24:08) Proto-Paradigm and SAEs
(29:21) Generalization Theory Deep Dive
(34:59) Central Dogma Framework (Part 1)
(36:57) Sponsor: NetSuite by Oracle
(38:21) Central Dogma Framework (Part 2)
(39:19) Loss Landscape Geometry
(45:25) Degeneracies and Singularities
(52:09) Structure and Generalization
(01:00:20) Essential Dynamics Research
(01:05:04) Grokking vs Typical Learning
(01:12:03) Double Descent Discussion
(01:14:39) Interpretability and Alignment Applications
(01:22:01) Reward Hacking and Overgeneralization
(01:30:03) Future Training Vision
(01:36:20) Scaling and Compute Requirements
(01:38:19) Future Research Directions
(01:41:27) Outro

SOCIAL LINKS:
Website: https://www.cognitiverevolutio...
Twitter (Podcast): https://x.com/cogrev_podcast
Twitter (Nathan): https://x.com/labenz
LinkedIn: https://linkedin.com/in/nathan...
Youtube: https://youtube.com/@Cognitive...
Apple: https://podcasts.apple.com/de/...
Spotify: https://open.spotify.com/show/...


TRANSCRIPT

Introduction

Hello, and welcome back to the Cognitive Revolution!

Today I'm excited to share my conversation with Jesse Hoogland and Daniel Murfet, founders of Timaeus, an AI Safety & Alignment Research Nonprofit that's pursuing an ambitious, mathematically rigorous, and fascinating approach to understanding the development and function of neural networks.

Named after one of Plato's dialogues, Timaeus' work is based on Singular Learning Theory, or SLT, which applies algebraic geometry to statistical learning theory.

The core premise of SLT is that training data determines the geometry of the loss landscape, which in turn determines which algorithms models learn in training and how their behavior will ultimately generalize once training is done and they're put into actual use.  

The driving insight of SLT is that the super-high-dimensional loss landscapes in which modern neural networks are optimized are not actually well represented by the smooth-bottomed, valley-shaped surfaces we often see depicted in figures.  On the contrary, Daniel, who recently left a tenured professorship in algebraic geometry to pursue this work at Timaeus full-time, calls these representations “maximally misleading”, and explains that in reality, loss landscapes are highly complex, jagged surfaces full of "singularities" – also known as "degeneracies" – directions in weight space along which a model can move without changing its external behavior or loss score, but which can nevertheless involve a change to the model's internal circuitry, such that the model might behave very differently in novel situations.

This, of course, has profound implications for big-picture AI safety questions.  To frame it in terms that would be familiar to Eliezer Yudkowsky readers from 15+ years ago: the difference between a model that acts nice & friendly because it is fundamentally aligned to human values, and a model that acts the same way because it's learned how to please humans while actually pursuing its own goals, could be the difference between a Superintelligence-powered Utopia and human extinction – and yet, today, we don't have reliable ways to tell the difference.  

Anthropic- and Goodfire-style mechanistic interpretability has of course made great progress toward identifying the concepts that trained NNs represent and these days also offers some visibility into the circuits they use, but there’s a long way to go, and I definitely believe that there’s plenty of opportunity for complementary approaches to strengthen our overall understanding.  

The Timaeus approach, which they call "developmental interpretability", aims to understand how neural networks evolve through their training process, using a measure called the Local Learning Coefficient to help identify what might otherwise be invisible internal phase changes that could affect downstream model behavior.  

This line of work, like all approaches to understanding neural networks, is still pretty early in its own developmental history, but critically, the Timaeus team has shown that it can scale beyond toy models.  Their latest work applies their techniques to 7 billion parameter models and is able to identify critical phase change moments corresponding to the appearance of important functional circuits, and so we might say that it’s roughly at the “Toward Monosemanticity” stage, and hope that with more engineering and compute, it will continue to scale to frontier models.  

If successful, this could perhaps prevent an episode like the one that happened in Claude 4 training, where a certain safety dataset related to harmful system prompts was mistakenly left out of the data mix, causing the model to generalize in such a way that it followed rather than refused harmful system prompts.  Anthropic caught that issue with behavior testing and patched it, but the hope for developmental interpretability is that such things might be caught by instrumentation during the training process, before they seriously affect model behavior.  

And in the best case scenario, this could help us move beyond the trial and error phase of neural network training to a more principled, engineering-like discipline, where specific datasets are used at specific times, for specific purposes, leading to predictable and reliable results.

This is high-concept, mathematically sophisticated work, and this conversation is really just an introduction.  I did my best to take my time to develop my own intuitions for what’s going on inside a neural network during the training process, and to compare and contrast some of the phenomena Jesse and Daniel describe to other things, like Grokking, that we’ve previously covered, and to grapple with questions of how much generalization we want from neural networks, and how in some cases too much generalization could be harmful.  I found the conversation fascinating throughout, and while it will stretch your brain in a different way than most of our episodes, I expect that for some it will immediately be among your favorite episodes that we’ve done.  

Now, I hope you enjoy this introduction to Developmental Interpretability, a new approach to understanding neural networks, with pioneers Jesse Hoogland and Daniel Murfet, founders of Timaeus.


Main Episode

Nathan Labenz: Jesse Hoogland and Daniel Murfet, founders of AI safety and alignment research nonprofit Timaeus, welcome to the Cognitive Revolution.

Daniel Murfet: Hi, Nathan. Thanks for having us.

Nathan Labenz: I am excited for this conversation. I think I am going to learn a lot from it. One of the things I think is really interesting about what you are doing is it presents a bit of a contrast to what I see as the general default approach today in the safety and alignment space, which you might call prosaic safety and alignment methods or defense in depth. I am always on the lookout for something that feels like it, even if it is a long shot, could really work. By 'really work,' I mean give us enough depth of understanding and a reason to believe that this approach could actually work, even in the surprising future we may try to carry it into. When I say 'really work,' I mean so much so that I do not have to worry about it anymore, and I can just focus on all the applications I want to build with AI, without having to deal with 'what if it gets out of control' episodes. I think you have something that fits that bill. It is at times over my head, and I expect members of the audience will also have to stretch their brains to truly grasp it. That will be part of the fun, so I am excited to get into it. For starters, please give me a little background on how you came together as an organization. I was intrigued by the name, and I am also very intrigued by the fact that, Daniel, you recently left a tenured professorship to go all in on this work. Start with the backstory, and then we will get into the more technical details.

Daniel Murfet: Yes. The story starts two and a half thousand years ago with Plato. Timaeus is a dialogue by Plato in which he puts forth the first theory of everything. It is a proto-chemistry where the elements of earth, wind, and fire are associated with Platonic forms: the cube, the icosahedron, and the dodecahedron, among others. The theory is completely wrong, but it is the first instance of someone thinking, 'Mathematics really has a chance of understanding the natural world.' It is that spirit that drives us today. Maybe we can find ideas in mathematics and the sciences that help us understand this new phase of matter, these deep learning systems, to ensure they are safe. Also, in the Timaeus, the world is intelligent. The history of the universe is just a learning process, and physics is just a subset of learning theory.

Nathan Labenz: That is fascinating. I have never read it, I will confess. It is striking that someone had such a worldview 2,500 years ago. Usually, I skip the backstory in many podcast openings because it is often, 'We saw ChatGPT and we knew we had to build this app.' There is a lot of that. But I think this is the deepest into history we have gone for the intellectual roots of a line of work. So, how literally do you take the idea that the hierarchy you described, of physics being a subset of learning, might really be real? I know there are many differing opinions, much like in AI, where we have Turing Award winners saying we are all going to die and other Turing Award winners saying the first ones are crazy. Regarding the fundamental nature of the universe, there are probably even more disparate takes. Do you come from a secular tradition or perspective on whether it is a mathematical universe, or are we bracketing that for now?

Daniel Murfet: I was mostly trying to troll your physicist listeners with that. That seems a bit of a tangent. I do not think I want to take a position on the mathematical nature of reality. However, I mentioned that aspect of the Timaeus not completely in an idle way, because there are many deep connections between physics and learning theory. Those are some of the connections that inspire much of the work we do, particularly with statistical physics and so on. The study of learning machines is as deep as physics in many ways, and it is not a surprise that many physicists are starting to devote their professional careers to understanding aspects of these systems. That is perhaps the main connection.

Nathan Labenz: Better than having all the physicists go into hedge funds, I say. So, how about a bit more on what kind of math you were doing as a professor and how that positioned you for this approach to AI, because it does seem quite distinctive from almost everything else I have seen.

Nathan Labenz: There are a few other points in space that may be close to it, but not many.

Daniel Murfet: I am an algebraic geometer by training. Algebraic geometry is a part of pure mathematics. We mostly study solutions of polynomial equations, very high-dimensional geometric forms, and sophisticated techniques for understanding those forms. Some of it is related to the real world; most of it is not. Most of what I did involved some connections to string theory at various points, which I have studied a bit. Mostly, I was working on algebraic geometry and a little bit of mathematical logic. At some point, I was following the developments in deep learning. They were very interesting, but completely separate from my professional life until I encountered Sumio Watanabe's work. He is a Japanese mathematician, the inventor of singular learning theory. It seems hard to believe, but one of the core tenets of his work is that a very deep part of algebraic geometry is central to Bayesian statistics. That was curious to me. I was also aware that people were...

Daniel Murfet: ...not completely satisfied with the theoretical treatments of deep learning that existed at the time. It seemed tempting to me as an algebraic geometer to adopt this language and approach to try and understand these systems. That's how I got into singular learning theory. That was a few years before I was drawn into thinking about AI safety. It was through the fundamental theory of Bayesian statistics. All parts of mathematics are ultimately connected at some level, so it's not necessarily surprising that geometry comes into learning theory. There are other ways it enters, but that's one of them.

Nathan Labenz: Okay. In terms of motivating the whole narrative, you can tell me if there's a better way to frame it. As I've studied the subject, one of the things that has seemingly distinguished your approach from most others in understanding AI systems generally is the focus on the developmental process: how is it actually learning through the training process, as opposed to running it, getting a result, and then trying to make sense out of it? That's one possible motivational line. Another could be that it starts with this one weird thing in math that you couldn't get over. Maybe you can draw those threads together and tee up how you got to this current-

Jesse Hoogland: Yeah.

Nathan Labenz: ...body of work.

Jesse Hoogland: All the research we're doing is based on singular learning theory, SLT. If I had to reduce that into one sentence, the core idea of SLT is that the geometry of the loss landscape, the surface we walk down when defining an optimization process, contains important information for understanding neural networks and potentially steering them more reliably. This has applications for interpretability and for alignment. When we started looking into applications of this theory, which was at the time still very mathematical and theoretical, for empirically understanding neural networks two years ago, one of the first applications we came up with was this agenda we called developmental interpretability. Can you use SLT to better understand how neural networks develop over the course of training? The hope here is that development is an axis that can reduce the problem of interpretability to something more manageable. All interpretability techniques hope to reduce the problem of understanding a 100 billion or trillion-parameter model into fewer numbers, into something more manageable. One way to do this is to understand the changes that give rise to that model over the course of training. If there's a finite number of changes, and the number of changes is much smaller than the number of parameters, then you have a useful simplification for the problem. That's what motivated developmental interpretability or dev-interp when we started two years ago.

Nathan Labenz: When you say the number of changes is smaller than the number of parameters, does that mean the number of steps in training, where each step would be a change?

Jesse Hoogland: That's a good question because if you have to understand each individual training step, this isn't a reduction. There are still billions of training steps, many gradients to calculate. So the unit of change needs to be larger than just one training step. What's the right unit of change here? That's where SLT comes in because SLT says that the right unit of change is a phase transition, like a developmental stage in biology. It's an empirical question whether you can find these phase transitions and whether they exist at a frequency that's useful for simplifying the problem. But in the systems we've studied so far, that does seem to be the case; you can meaningfully reduce the problem by looking at these stages.

Nathan Labenz: Okay. That's definitely interesting. Would you make an analogy between a phase change and grokking?

Jesse Hoogland: Yes, grokking would be an example of this kind of change. Typically, we identify two primary kinds of changes. The first, which we call a type A transition, involves the model becoming more complex as it learns more information. You can imagine a human developing a richer mental model of the world that becomes more predictive and accurate, but each additional step requires a more sophisticated model. Then you have the other kind of learning behavior, like grokking, where you find a simpler explanation for the same data. You keep the same level of performance in explaining the data, but you gain a solution that's simpler and therefore generalizes better. The tension between these two things gives rise to this interesting learning process in neural networks.

Nathan Labenz: Cool. Not too long ago, I did an episode with two of the co-founders of Goodfire. Tom, who is the chief scientist there, gave a very short update on where we are in interpretability today. He said, "We've been pre-paradigmatic for a long time, but I'll now give us proto-paradigmatic status." He then went on to describe the proto-paradigm: there seems to be a fairly broad consensus that neural networks learn interpretable things, which we call features. They are packed super densely into parameter space using superposition, and that means the individual neurons are polysemantic. These features are almost orthogonal to each other, but not quite, and that's how they're packed in. The magnitude of their activations seems to correspond with the intensity with which they are relevant to a particular context, and these features connect up in circuits across the layers of a model. Do you guys sign onto that proto-paradigm, or is there anything that you would add or subtract from it to put forward your own proto-paradigm?

Daniel Murfet: I'll take that one. Firstly, just to add to the discussion on the previous question, we expect, and do in fact see, clear phase transitions in small systems, but nobody really expects to see hundreds of very clear step changes in the loss curve or other measurements in large models.

Nathan Labenz: Okay.

Daniel Murfet: So that's another question: if this is actually an axis of reduction, how do you find that structure inside a pretty homogeneous-looking global training process? You need some other means of looking more closely or tuning your detection instrument to various different kinds of frequencies to figure out what's going on. So there's additional information there. To answer your question, I'm a big fan of the SAEs work, particularly this recent work out of Anthropic and what Goodfire is doing. Actually, in that episode, I think Tom mentions parameter-based methods. I like the analogy he gave, which is to think of it as, if you have a graph, you need to understand both the nodes and the edges. It's very clear that ultimately the information in the model is in the parameters. There's a role for understanding what the model is doing by examining that structure. One way of thinking about that is examining the geometry of the loss landscape. I don't think of SAEs and our approach to interpretability as necessarily in a struggle. They're likely to inform each other. If I were to make one critique potentially of things like SAEs, many people have concerns or doubts about SAEs, and there are also many people who are very enthusiastic about them. I think I'm probably on the more enthusiastic end. But if I had to make a critique, it's that there isn't really a mathematical foundation for SAEs that connects them to things like generalization. At least not that I understand, and I don't think it exists yet. That's not to say it can't exist, and maybe we will, for example, do some work on that at some point. But if you understand what the model is doing by computing SAEs and you find circuits, it's very unclear, for example, if you were to do continual learning or reinforcement learning training on top of a base model you had interpreted, what behavior you end up with after you continue to train, and what its relationship is to the interpretability examination you did before. It isn't clear what the relation is between the structure you found in the SAEs and the data generating process or the training process, so you don't know how stable it is to that. I can imagine empirically it is quite stable. When I spoke, I expressed this concern to Chris Olah, and his response, if he doesn't mind me passing this on, is that if you understand circuits through SAEs and then you have some further modification of the model, such as fine-tuning, what's probably happening is that the fine-tuning process is recruiting the representations and circuits that existed before and using them for a different purpose. Indeed, Anthropic has some work looking at features before and after fine-tuning, and I would say it's consistent with that interpretation. So I think you can give an empirical answer to the question of how SAEs relate to the training process, the generating process, and generalization, just by empirically checking and seeing that it seems to work. I think that's reasonable. I think that isn't necessarily a path to a high level of assurance that we might hope to get. When people talk about generalization in connection with AI safety, it's the hope that you can ground your understanding of the model in something mathematical and a solid foundation that is not purely empirical. Our approach to interpretability, in contrast, starts more from the generalization end and works towards interpretability; that is the hope. And obviously we're also more about parameters rather than activations. 
There are likely to be relations between these two perspectives. I can speculate on those if you want.

Nathan Labenz: It strikes me that everybody has an intuitive sense of what generalization means, but you may want to offer a more formal or precise meaning for that. Generalization at the ChatGPT user level is, "I can take in anything," and it seems, as Ilya once said, "The amazing thing is I talk to the AI and I feel I am understood." That's perhaps the mundane utility, the short form story of generalization. Then there's the grokking generalization where you go from memorization to algorithmic, and I suspect you're going to generalize that to your sets of generalizations. How do you mean that?

Daniel Murfet: Perhaps you're understood because you're an interpolation of the training data, Nathan. When people talk about generalization, they often mean multiple, somewhat distinct things. Technically, generalization is usually a number, a generalization error, and it represents the model's ability to predict on samples from the data distribution not seen in training. That might be a held-out test set, as it often is in machine learning. Theoretically, we often suppose there's a generating process that both generated the training data and from which you can continue to draw samples. Generalization error would then be about the gap between the predictions you make based on the model learned from the training data, evaluated on these new samples from the same generating process. That is, in-distribution generalization. Usually people don't add this prefix, in-distribution. It's understood that you are testing the trained model on samples from the same generating process that produced the training data. If you're not, then we refer to that as out-of-distribution generalization, which is a very fraught, difficult thing to understand because those new samples could be absolutely anything, and why should you expect any particular good performance on that? The attitude towards generalization that existed pre-GPT-3 was... This is one of the big cognitive transformations in the last few years: people truly understanding that you can pre-train a large model on very diverse data and get something like generalization to a large range of tasks. Exactly what form of generalization that is, is not super clear. And the degree to which you're actually getting capabilities that work well outside the training distribution, in some sense, is a very controversial topic. Are LLMs really reasoning? If they're really reasoning, they can do arbitrary things that reasoning can do, which may be very far outside the training data. People debate these kinds of questions all the time. It's quite unclear. But the interest we have goes beyond thinking about generalization as a number, a measure of this ability to predict on new samples. You can ask for an understanding of generalization whose core mathematical type is not a number, but perhaps something more complex, like an algorithm. The algorithm is the thing that allows you to generalize. So it's the thing behind it. If you have low generalization error, perhaps the reason is that you have internalized some algorithmic aspect of the generating process. In some sense, the study of interpretability is like this: one way of phrasing it is you're just studying generalization, but you've changed the type signature of the object you're trying to produce. Rather than being interested in a number, you're interested in the thing behind the number. I would say that's what the aim of interpretability is, in some sense: to explain why you're able to have low generalization error.
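
For readers who want the standard bookkeeping behind the terms Daniel is using, here it is in symbols; the notation is ours, not from the episode.

```latex
% Standard definitions (notation assumed here, not from the episode):
% q = data-generating process, D_n = {(x_i, y_i)} drawn from q = the training set,
% \ell(w; x, y) = per-sample loss of the model with parameters w.
\[
  L_n(w) = \frac{1}{n}\sum_{i=1}^{n} \ell(w; x_i, y_i)
  \quad\text{(training / empirical loss)},
  \qquad
  L(w) = \mathbb{E}_{(x,y)\sim q}\big[\ell(w; x, y)\big]
  \quad\text{(in-distribution generalization error)}.
\]
% The gap G(w) = L(w) - L_n(w) is the generalization gap. "In-distribution" means the test
% samples come from the same q that produced the training set; out-of-distribution generalization
% replaces q at test time with some other distribution, which is why, as Daniel says, there is
% no general guarantee about it.
```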

Nathan Labenz: That's intuitive, certainly at the level of the original grokking paper. The first steps, it memorizes relatively quickly. It can get the samples it has seen right. It still performs terribly on all the examples, even from the same distribution, that it hasn't seen until... And I always like to remind people of this. Orders of magnitude more steps later, then you slowly... It looks like a steep rise on the log X-axis on that famous graph, but it's actually the last 90% of the training time when this grokking process slowly happens, and then you get an actual algorithm for, in that case, modular arithmetic, and now you're good to go. And I think there are some interesting things there to unpack. One that's really, I think, close to the heart of why all this matters is that it has been shown there are multiple different ways to grok that problem. And we know now that, I think this came out of Max Tegmark's group, if I remember correctly. There's the trigonometric way, and then there's the pizza slice algorithm. These both get the right answer, but they perform quite differently. Maybe that doesn't matter in that particular case, but obviously you can imagine learning very different algorithms to solve the same problems, especially as we go into a space colonization mode or trying to put these things into an online learning environment. That could lead to potentially very different downstream behavior, and so it makes a lot of intuitive sense that we would like to understand what is going on there. You can elaborate on that or clarify anything I'm getting wrong, but perhaps this is a good time to give what I have started to think of as the central dogma of your approach, in the same way that we have DNA eventually getting to proteins. You have this story of starting with data, and there's a set of relationships that takes us from data to model. I don't know if you like that central dogma label, but-

Jesse Hoogland: Yeah.

Daniel Murfet: I'm sure Jason's going to use this every week now on Mindfreaks. So thank you for that.

Daniel Murfet: It's a good phrase. I don't know if it's quoted yet.

Daniel Murfet: Yeah.

Nathan Labenz: But yeah. Give us the central dogma, and then we can...

Jesse Hoogland: Mm-hmm.

Nathan Labenz: When we get to the end, we'll still have this ambiguity, right? At least for now, what algorithms did the LLM learn?

Jesse Hoogland: Mm-hmm. I think the key question is, where does neural network behavior ultimately come from, right? All of our alignment techniques, what we're trying to do is change model behavior and make sure that they generalize robustly out of distribution. Many problems can be reduced to this understanding. Ultimately, behavior comes from training data, right? There are basically three inputs to every learning process: the neural network architecture, some choice of optimizer, and the training data. Of these, training data is the most important one. It's the one that tells one model to learn one algorithm and another model to learn a different algorithm. So, what we'd like to understand is how training data gives rise to the final behaviors models end up with. In particular, because this is what all of our current alignment techniques look like: RLHF, Constitutional AI, DPO, Deliberative alignment. All of these techniques are basically just modifications on the same deep learning processes with different data. So, in particular, data determines the geometry of the loss landscape. It's that geometry which in turn tells SGD how to move around, so it tells your optimization process how to move. That learning process picks out the final weights in those algorithms that a network ends up with, and it's the structures in those weights that determine how models generalize and thus whether they're aligned or not. So, this central dogma, as you called it, is something we call the S4 correspondence between structure and data, structure and geometry, structure and the learning process, and structure and the final weights of the model. This is what we see as key to understanding the mapping from data to final behaviors.

Nathan Labenz: So, what do we know in general about the loss landscape? We've all seen many one-dimensional loss curves of just loss dropping over time. You mentioned this earlier: especially in larger models, I've always understood it as there being lots of little grokking moments and phase transitions happening, but because they're all aggregated up, we can't see them at that scale. And then we've also seen many two-dimensional visualizations of loss landscape, and I honestly have no idea whether, when I look at something that looks like a well in a two-dimensional space, if that is leading-

Daniel Murfet: You can look at those pictures, yeah.

Nathan Labenz: Yeah. So, I don't put too much stock in my intuition for that because I'm just like, I don't know. Is two like 200 billion? I don't know. It seems quite different, or plausibly quite different. What do we really know about loss landscapes today?

Daniel Murfet: Yeah, I think... To go back to algebraic geometry briefly, in one, two, or three dimensions you can use pictures and intuition from pictures. But once you go to higher dimensions, your intuitions are not a reliable guide to how to think about these objects. So that's where you try to bake your intuitions from lower dimensions into mathematical forms which continue to work, right? And that's, in some sense, what geometry is. I think these pictures of... If you draw a two-dimensional surface picture of a loss landscape, the way you do that is by choosing two directions in a very high-dimensional space, and then plotting the loss as a function of the coordinate on that plane, right? One of the things you're, with probability one, going to do if you choose a random slice like that is not see, for example, the degeneracy that determines generalization according to singular learning theory. So, in some sense, those pictures are maximally misleading when it comes to generalization.
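
For concreteness, here is a minimal numpy sketch of how the two-dimensional loss-surface pictures Daniel is critiquing are typically produced: pick two random directions in weight space and evaluate the loss on the plane they span. The toy network, data, and grid here are placeholders, not anything from the episode or from Timaeus' code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: a tiny one-hidden-layer ReLU network on synthetic regression data (placeholder model).
X = rng.normal(size=(256, 4))
y = np.sin(X @ rng.normal(size=4))

D_IN, D_HIDDEN = 4, 8
DIM = D_IN * D_HIDDEN + D_HIDDEN + D_HIDDEN  # total number of parameters

def unpack(w):
    W1 = w[:D_IN * D_HIDDEN].reshape(D_IN, D_HIDDEN)
    b1 = w[D_IN * D_HIDDEN:D_IN * D_HIDDEN + D_HIDDEN]
    W2 = w[D_IN * D_HIDDEN + D_HIDDEN:]
    return W1, b1, W2

def loss(w):
    W1, b1, W2 = unpack(w)
    h = np.maximum(X @ W1 + b1, 0.0)   # ReLU hidden layer
    pred = h @ W2
    return np.mean((pred - y) ** 2)    # an empirical loss L_n, not the population loss

w_star = rng.normal(size=DIM) * 0.5    # pretend this is a trained point

# The usual visualization: choose two random unit directions and plot the loss on their plane.
d1 = rng.normal(size=DIM)
d1 /= np.linalg.norm(d1)
d2 = rng.normal(size=DIM)
d2 /= np.linalg.norm(d2)

alphas = np.linspace(-1.0, 1.0, 41)
surface = np.array([[loss(w_star + a * d1 + b * d2) for b in alphas] for a in alphas])
print(surface.shape)  # a (41, 41) grid of loss values: the familiar "valley" picture.
# As Daniel notes, a random slice like this will almost surely miss the degenerate
# directions that singular learning theory says actually matter for generalization.
```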

Nathan Labenz: Hmm.

Daniel Murfet: I'm not saying they're not useful, and they're certainly pretty, but in many cases, the actual relevant geometry looks more like a bunch of planes intersecting in some crazy way. That's what the level sets look like. If you take the usual picture of valleys in a loss landscape, the level sets, where you just look at the set of points at a given height above the floor, so a set of points with a given loss, will, if you think about them like the edges of the valleys and so on, have some sort of nice smooth shape, right? And then if you go to the very bottom of the valley, you'll eventually just get a single point. The level sets of the loss landscapes, and there are actually two things you could mean by loss landscape, which we'll have to disambiguate in a second. The level sets of those landscapes are not smooth shapes like that. They're very complex geometric forms with lots of intersecting lines, lots of high-dimensional things that are gnarled up in various ways, and those are the singularities that we're interested in.

Nathan Labenz: Hmm.

Daniel Murfet: To be clear, the loss landscape here is the population loss, which is a theoretical object. This is an important, subtle, and difficult point. You never actually have access to this object, the population loss. That would be if you could average over every sample from the true generating process, for example, images. You could somehow average the loss over every possible draw from that data distribution. Of course, you would never see that. If you are plotting an empirical loss landscape, you have some samples, then you compute the loss based on that, and you plot it. The actual geometry that dictates generalization mathematically is the geometry of the population loss. That influences the behavior of all these empirical losses based on samples. But you never actually have access to that.
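
In symbols, the distinction Daniel is drawing is roughly the following (a heuristic statement, in our notation, not the episode's):

```latex
% Empirical vs. population loss (heuristic; for each fixed parameter w):
\[
  L(w) = \mathbb{E}_{(x,y)\sim q}\big[\ell(w; x, y)\big],
  \qquad
  L_n(w) = L(w) + O_p\big(n^{-1/2}\big).
\]
% Every landscape we can actually plot is some L_n; Watanabe's theorems concern the geometry of L,
% or equivalently of K(w) = L(w) - \min_w L(w), which plays the role of a KL divergence.
```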

Jesse Hoogland: I want to offer a slightly higher-level description.

Daniel Murfet: Mm-hmm.

Jesse Hoogland: Usually when you ask people to imagine what the loss landscape looks like at the bottom, or you look at these slices, it will look like a basin. A roundish basin, like a parabola. What Dan calls degeneracy is the property that this is the wrong way to think about it. There are valleys and canyons, directions you can walk along that do not change the loss. In a sense, SLT tells us these are by far the most important directions. The number of valleys you have determines how well your model generalizes. If there are more valleys, then you can perturb your weights. You can change your implementation without changing the function you have implemented. That means your function is simpler.

Daniel Murfet: Simple functions can be implemented in more different ways. Then, through an Occam's razor argument, the simplest function generalizes best. You can make this statement very precise using SLT.
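
The precise statement Daniel is gesturing at is Watanabe's asymptotic expansion of the Bayesian free energy, stated informally here in our notation:

```latex
% Watanabe's free-energy asymptotics (informal statement; w_0 = an optimal parameter):
\[
  F_n \approx n\,L_n(w_0) + \lambda \log n .
\]
% \lambda is the learning coefficient. For regular (non-degenerate) models \lambda = d/2, recovering
% the BIC; degeneracy pushes \lambda below d/2. Among solutions fitting the data equally well, the
% one with smaller \lambda (more ways to perturb the weights without changing the function) has
% lower free energy and is preferred, and the expected Bayes generalization error falls like \lambda/n.
% This is the precise form of the Occam's razor argument.
```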

Nathan Labenz: I want to unpack this a little bit. First of all, it is a good reality check on all these visualizations that what you are doing when training a model is taking the gradient and taking a step in that gradient direction. You have no actual landscape around you. It is not like you are being guided broadly by this high-resolution object that you are finding your way down. You are literally one point in time, take a step, and hopefully that was a good step. But it could have even been a bad step. There are not a lot of guarantees there. Regarding the degeneracies, I have many questions. Probably some of them are fairly ignorant. One basic question is, how do I know this even exists? Because you mentioned the challenge that we do not have the actual full, let's say... I do not know if 'platonic' is too loaded of a term, this idealized...

Daniel Murfet: It is a fine term.

Nathan Labenz: ... landscape from the idealized full dataset.

Daniel Murfet: Mm-hmm.

Nathan Labenz: Since we do not have that,

Daniel Murfet: Mm-hmm.

Nathan Labenz: then I am not exactly sure how we establish that there is this direction we can move. Because it is not like we can say, "Here are five inputs. I change this parameter. Nothing changes on the output. We are done." How do we know this actually exists, or am I missing some caveat or constraint on how this is formulated? Good question. Please de-confuse me if I am capable of being de-confused.

Daniel Murfet: There are a few ways into that. One is to study very simple neural networks where you can figure out the population loss in closed form. That is where Watanabe started. You can take... This was decades ago, so he was studying tanh networks rather than modern neural networks. You can take one hidden layer tanh networks, for instance, for a very synthetic data distribution that you can understand. Then you can write down, do the integrals, and figure out what the population loss is. Then you can study these degeneracies and see that they are real. They do not have any interesting relation to the data because the data is not interesting. For interesting data, you cannot do that calculation. But in these very synthetic settings, you can show what the degeneracies are, do various calculations, and study the theorems that connect generalization to geometry explicitly in examples. There are numerous examples like that, including full neural networks, albeit simple ones. You can do things that are more related to current practice. One of the papers we wrote looked at the toy model of superposition by Anthropic. It is very interesting. There is an actual small autoencoder, and you can do the integrals there as well. You can write down and get the actual population loss. You can study its geometry. We did that. We computed these degeneracies, and you can see that the training process is governed by these degeneracies in the way the theory says. That is again not a super interesting data distribution, so the degeneracies... It is quite complicated what is going on ultimately. But that is a case where you can find out what the degeneracies are. That is the interesting form of degeneracy. There is a boring auto-degeneracy worth getting out of the way, which is scaling symmetries and so on. If you have a ReLU, you can scale up the input and down the output by any real number, a positive real number. That is a way of changing the weights that does not change the function computed. You have many degeneracies like that. Those are not uninteresting. For example, with the QK matrix, this degeneracy will have an impact on the kinds of functions that part of the network will tend to learn. It is non-trivial even though it is a bit boring on its face. But there are other kinds of degeneracy that emerge in a more overall fashion, which cannot be pinned down to individual obvious forms of degeneracy like that. You can do it in toy cases. You can see some examples, like these products and matrices, where it has to arise. But at some point, the bigger you make the model and the more interesting you make the data distribution, it stops being feasible to do theoretical calculations and becomes an empirical question. Perhaps I will come back to that in a moment. There is one more theoretical thing you can do: invent toy models of what you think are the key ingredients in a more interesting model. There are various papers that have looked at, for example, transformers doing in-context learning. You abstract out some of the annoying details in real transformers and come up with a closed-form formula, some function of the parameters that you think models well what transformers are trying to do. Saxe's Lab, Andrew Saxe, has made a career out of doing this very beautifully in many different settings. Not specifically for transformers, but across many different settings. But there is a recent paper that looks at in-context learning for transformers and comes up with a toy model, an explicit potential. 
Some function, some loss function, describes that. You can see the degeneracy in that function, for instance. That is another way of trying to get at it. But if we are talking about algorithms and models and trying to back those out of weights, so far I do not think there is a very clear toy model of that in that form. It will be interesting. I do not think it exists yet. So it becomes an empirical question. Can you go out and extract signals of that degeneracy from information you get from these empirical losses? The theory says you can do that, and that is a large part of what we do.
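
As a concrete illustration of the "boring" rescaling degeneracy Daniel mentioned above, here is a minimal numpy sketch (the toy network and numbers are ours): scaling one ReLU unit's incoming weights up and its outgoing weights down by the same positive factor leaves the computed function, and hence the loss, unchanged.

```python
import numpy as np

rng = np.random.default_rng(1)

# A one-hidden-layer ReLU network: f(x) = W2 @ relu(W1 @ x)  (toy sizes, placeholder weights).
W1 = rng.normal(size=(16, 4))
W2 = rng.normal(size=(1, 16))
x = rng.normal(size=(4, 100))

def f(W1, W2, x):
    return W2 @ np.maximum(W1 @ x, 0.0)

# Rescale hidden unit j: multiply its incoming weights by alpha > 0 and its outgoing
# weights by 1/alpha. ReLU is positively homogeneous, so relu(alpha * z) = alpha * relu(z),
# and the two factors cancel.
alpha, j = 3.7, 5
W1_scaled = W1.copy()
W1_scaled[j, :] *= alpha
W2_scaled = W2.copy()
W2_scaled[:, j] /= alpha

print(np.allclose(f(W1, W2, x), f(W1_scaled, W2_scaled, x)))  # True:
# an entire direction in weight space along which the function, and hence the loss, is constant.
```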

Daniel Murfet: Sure.

Jesse Hoogland: The link between degeneracy and generalization. One of the core theorems of SLT states that you can link a sensitivity analysis in weights. This is the question: if I perturb the weights a little bit, how much does the loss increase? You can link this sensitivity analysis to a sensitivity analysis in data. Right? If I change the data distribution, how much will my loss go up? And it's that link which gives us, first of all, a link between structure inside of the model, because structure is reflected inside of the geometry of the loss landscape, to generalization and potentially out-of-distribution generalization. Now, you can look at what happens if I apply a small local change to my data distribution, or upweight some distribution. For example, GitHub code data versus Wiki text. Which parts of the model become active? Which directions in the loss landscape respond most to this? And it's through this connection between structure, geometry, and data that we are developing these tools for interpretability.
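
The weight-side sensitivity Jesse describes is what the local learning coefficient estimates in practice. One published estimator from this line of work has roughly the following form; the exact tempering and localization details vary, so treat this as a sketch.

```latex
% One published form of the local learning coefficient (LLC) estimator (details vary):
\[
  \hat{\lambda}(w^*) = n\beta \Big( \mathbb{E}_{w \sim p_\beta(w \mid w^*)}\big[L_n(w)\big] - L_n(w^*) \Big),
\]
% where p_\beta(w | w^*) is a posterior localized near the trained weights w^*, sampled in practice
% with SGLD, and \beta is an inverse temperature (often taken proportional to 1/\log n). It measures
% how much the loss rises, on average, under small perturbations of the weights: lots of degeneracy
% means many directions in which it barely rises, hence a small \hat{\lambda}.
```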

Nathan Labenz: Okay, I need to dig into that a little bit more. Again, one ignorant question around the nature of these degeneracies. I certainly get what you're saying in terms of some of them being, in a sense, trivial. You can scale one thing up, another thing down, and get the same thing. That's obvious. As you get to these large-scale things, it feels almost accidental to me. It feels super contingent. But then you made the statement that they govern the training process, and I realized I don't really know what that means. What does it

Daniel Murfet: Yeah.

Nathan Labenz: mean for them to be governing the training process? And if they are in fact governing the training process, then that suggests that they're much less accidental and contingent. Or maybe there are different kinds of them, some of which are deeply built into the nature of the world and reflected from that. And others maybe are accidental and contingent on the way you order your batches or the exact mix of this

Daniel Murfet: Mm-hmm.

Nathan Labenz: versus that. So,

Daniel Murfet: Yeah.

Nathan Labenz: Yeah, again, consider that your prompt and tell me what I need to know.

Daniel Murfet: There's a bit of mathematical background I'll provide. Stochastic gradient descent is a specific optimization process, but much more generally, we're interested in gradient-following processes across many different areas. You have some function, and then you follow its gradient to maximize or minimize it. In cases where that gradient is the gradient of something that looks like a sum of squares, for example, x squared plus y squared plus z squared, you're just trying to find the bottom of a bowl. That process isn't interesting. There's just the bottom of the bowl, and you'll go there. That's convex optimization. You can still do interesting mathematics with that, but it's fundamentally not a complex process. But many processes in nature do not look like that. And that's why there's a field called nonlinear dynamics. More generally, if you're trying to follow a potential, imagine a surface covered in little arrows. They tell you which direction you'll go. That's the gradient of some fixed potential. The potential is just a general term for a function whose gradient you'll follow. You have this vector field, all these arrows pointing in various directions. It's a fact about dynamical processes that the places where the gradient vanishes organize the trajectories. In two dimensions, that might seem a little counterintuitive, but you could think about it like this. If instead I go back to one dimension, let's just take a curve. The places where the gradient vanishes are the maxima and the minima, or the saddle points. I taught calculus for many years, and a large part of the course is teaching people to classify the minima and the maxima because the other points in the curve are boring. If you want to understand some function, like a cost function, you want to know where it's maximized or minimized. What it's doing at some random point on the curve is completely irrelevant. You just want to know where the maximum and minimum are. That's an instantiation of the principle that if you're trying to understand a function, you are often looking for the places where interesting stuff happens, and those are maxima and minima in one dimension. In higher dimensions, they're singularities, so they're places where the gradient is zero in all directions, and you can never get there, because no trajectory that starts outside the singularity ever arrives, and one that starts at the singularity never leaves. But if you follow a random trajectory following a potential, and there's some singularity somewhere, it's going to do something around it. This will dominate the behavior, so the trajectories will tend to approach it and then escape in some particular direction. The singularity is organizing the set of global trajectories, and that's an informal statement, but there are formal statements that use ideas from topology. There are ways of making that more precise. That's a general principle, and it applies for learning with SGD just as much as it applies to any of these other systems in physics to which people would apply this principle. That's a general rationale for caring about critical points or singularities. They are almost synonymous. It's a reason for caring about those as the organizing principle for dynamical systems and learning in particular. Now, what it looks like for an actual learning process in deep learning to be governed by singularities, that's a very complex question in really interesting cases where a model is doing interesting stuff. 
But in simple cases, you can see that with the toy model superposition work I was describing earlier, the columns of the weight matrix can be visualized as vectors in two dimensions. There are two rows of this matrix and some number of columns, and you can plot those columns as vectors. Then you get shapes like a pentagon, a square, or a hexagon that connect the vertices of those vectors. Moving between the neighborhoods of these critical points literally corresponds to growing a leg or contracting a leg and rearranging the others in some particular way. So, the growth process of the network's structure is dictated by the movement between neighborhoods of these two different singularities. That's in a very simple case. In a more complex case, it's not so clear how to think about that. The singularities aren't isolated things where there's one singularity here and normal stuff in between. In a large model, there are singularities everywhere, and it's not well understood.
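
For reference, the toy model of superposition Daniel keeps returning to is small enough to write down; this is a sketch of the Anthropic setup from memory, with illustrative dimensions.

```latex
% Toy model of superposition (sketched from the Anthropic setup; dimensions and importances illustrative):
\[
  \hat{x} = \mathrm{ReLU}\big(W^{\top} W x + b\big), \qquad W \in \mathbb{R}^{2 \times m},
  \qquad
  L(W, b) = \mathbb{E}_{x}\Big[\sum_{i=1}^{m} I_i \,\big(x_i - \hat{x}_i\big)^2\Big],
\]
% where x is a sparse feature vector and I_i are feature importances. The columns of W are the
% two-dimensional vectors whose polygon arrangements (pentagon, square, and so on) Daniel describes.
```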

Nathan Labenz: I'm almost envisioning a fluid dynamics type of thing, where the point we are at in parameter space speeds up when it gets around the drain but then maybe spins out and goes slower as it-

Daniel Murfet: Yeah.

Nathan Labenz: ... has some wider orbit, and probably a lot is lost in that, but that's at least the visual that's coming to mind. Another-

Daniel Murfet: Yeah. There's this: you can visualize the trajectory in function space of a small language model, and there are some parts of the trajectory that actually look like it's orbiting around some particular mode of prediction. Yeah, that kind-

Nathan Labenz: Right.

Daniel Murfet: ... of image is-

Nathan Labenz: Yeah.

Daniel Murfet: ... is appropriate.

Nathan Labenz: So, for these singularities: in a simple cost function like you're describing, maximize it, minimize it, whatever, it's very clear what that point means and why we're trying to identify it in the first place. Especially in the case of a more complicated neural network, I'm tempted to ask: if I imagine a grokking phenomenon, are those singularities the place where the algorithm is perfectly grokked, such that there's never anything more to learn, at least with respect to some subset of problems, or is that reading too much into the nature of the-

Daniel Murfet: Well, think of the singularities as different ways of solving the problem, different kinds of solutions at a given level of loss. So if you're at a given point in the loss landscape, different kinds of singularities correspond to different ways of predicting on the data you've seen so far. Jesse, maybe you want to say a bit about the ED stuff? That's...

Jesse Hoogland: Yeah, I think it's an interesting one.

Daniel Murfet: Yeah.

Jesse Hoogland: Maybe I'll say one more thing about this. So coming back to generalization, one thing we're worried about is your model might learn two algorithms that look identical from the training data, but one of them generalizes in a way that you don't like, and one of them generalizes in a way that you do like. So how do you distinguish them? What Dan's just saying is that maybe if you look at the neighborhood in loss landscape, they actually have different geometries associated with them. So this is the sense in which reading this geometry gives you information about what kind of algorithm you actually end up with. What Dan was referring to as ED stands for Essential Dynamics. This is a paper we put out earlier this year where you can train a neural network, a transformer, to do in-context linear regression. So you can give it XY, XY, XY samples and ask it to predict the Ys from the Xs where each Y is generated from a simple linear transformation of the Xs plus some noise. What happens is the model, if you train it on many different samples of this, will learn to do regression over the course of context. However, you can vary the number of distinct tasks that the model is exposed to during training. If you expose it to many different tasks, it learns regression. If you only expose it to a few different tasks, it will memorize those tasks and-

Daniel Murfet: Mm-hmm.

Jesse Hoogland: ... it won't learn the regression solution. It will learn this memorization solution. Which solution the model ends up with varies as you vary the complexity or the diversity of the number of tasks. As you increase task diversity, you see a phase transition from the regime where the model memorizes to the regime where it learns this other solution. But the selection process between these is something you can actually try to understand with singular learning theory. Each solution has an associated performance and complexity. Memorization is always better at performance than this regression solution, which I call the generalization or meta-learning solution. The generalization solution, however, is simpler once there is a certain amount that you would otherwise have to memorize. What happens is the simplest solution roughly occupies more volume in parameter space and is therefore easier to find. You notice this phenomenon where the model will first learn the generalizing solution, the simple one, before moving to the memorization solution. We can classify exactly which kinds of dynamics, qualitatively, you should expect to see based on this trade-off between performance and complexity, where complexity is reflected in the degeneracy of the loss landscape.
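
A minimal sketch of the data-generating setup Jesse describes; the function name, shapes, and hyperparameters here are illustrative, not the ones used in the Essential Dynamics paper.

```python
import numpy as np

rng = np.random.default_rng(2)

def make_incontext_regression_batch(num_tasks, batch_size=32, context_len=16, d=4, noise=0.1):
    """Generate sequences of (x, y) pairs for in-context linear regression.

    num_tasks controls task diversity: a small pool encourages memorizing the tasks,
    a large pool pushes the model toward a general in-context regression solution.
    """
    task_pool = rng.normal(size=(num_tasks, d))          # fixed pool of regression tasks

    task_ids = rng.integers(0, num_tasks, size=batch_size)
    w = task_pool[task_ids]                              # one task per sequence, shape (batch, d)

    x = rng.normal(size=(batch_size, context_len, d))    # inputs
    y = np.einsum("bld,bd->bl", x, w)                    # y = <w, x> for the sequence's task
    y += noise * rng.normal(size=y.shape)                # plus observation noise
    # A transformer would be trained to predict each y_t from the preceding (x, y) pairs and x_t.
    return x, y

x, y = make_incontext_regression_batch(num_tasks=64)
print(x.shape, y.shape)  # (32, 16, 4) (32, 16)
```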

Daniel Murfet: So that's the opposite order to grokking.

Nathan Labenz: Yeah, I was noticing that for sure. Help me understand how it's happening in reverse in this case versus the grokking case.

Daniel Murfet: Well, grokking is actually a strange phenomenon.

Nathan Labenz: Yes.

Daniel Murfet: So it's not typical. There's been overfitting to the example of grokking. At least in our opinion, this is not the typical way in which memorization and generalization probably interact.

Nathan Labenz: Okay. Say more.

Jesse Hoogland: What's special about grokking is that the performance on the training set is more or less identical between the two solutions. You have one solution that is more complex, the memorization solution, and it's the thing you learn initially. So you end up with this transition where, at a fixed level of loss in the loss landscape, you end up in a broader basin, a basin that has more degeneracy.

Daniel Murfet: This is a phenomenon that can only happen if you get to, conceptually, the very end of training, so the very bottom of the loss landscape. Then there are some perfect solutions, some simpler and some more complex, and you prefer the simpler one if you can find it. Eventually you will find it. But the trade-off between memorization and generalization that occurs more frequently, without such artificial conditions, is more like this: initially you prefer the simple solution that's bad, that has high loss. You eventually trade that off for a solution that has lower loss and more complexity. You pay a complexity penalty in order to get the lower loss. That's the more typical relation between memorization and generalization.

Jesse Hoogland: Yes, and that's the case in this in-context linear regression example. So what SLT says is you should expect models to minimize an effective loss, or it's the loss with this emergent regularization term, this implicit bias that comes from degeneracy, that comes from the number of ways you can perturb weights without actually changing the function. So typically the model is going to prefer initially to learn the simple bad thing before moving on to the complex better thing.

Nathan Labenz: So help me understand one more time how the grokking case is different. In that case it takes a lot longer to learn-

Nathan Labenz: the, what I would think of as the simpler algorithm-

Nathan Labenz: which presumably, in terms of some formal complexity metric, does have a lower score for complexity than memorizing N examples, right?

Daniel Murfet: As Jesse referred to earlier, there are two typical changes in the trade-off between loss and complexity that are predicted by the core mathematical result of SLT. By trade-off, I mean the trade-off between the loss and the complexity as estimated by what we call the LLC, the local learning coefficient. So I'll say complexity. One way you can have a preferred trade-off is to pay the penalty of increasing the complexity in exchange for lowering the loss. That will decrease what's called the free energy, which determines which solution is more preferred from a Bayesian statistics point of view. So that's what we sometimes refer to as a type A transition. That's the typical learning. You learn more stuff, you get better. But there is another way to decrease the free energy, which is to decrease the complexity at a fixed loss. So this is what we refer to as a type B transition. It's hard to ascertain exactly the conditions under which this is the case, but grokking seems to be one of them. We've seen others. So those are situations where the model really isn't improving on the loss, but it simplifies the algorithm that it's using. We can see this in the linear regression setting in some sense as well. And of course there are many other examples of grokking-like behavior. The underlying principle is you should decrease the free energy, which is a sum of two terms, one to do with the loss and one to do with the learning coefficient. You're allowed to increase one of those terms. If you decrease the other one enough, that will decrease the free energy. That's normal learning. And then there are situations where you can have a similar decrease by decreasing the complexity term instead.
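
In symbols, the bookkeeping Daniel describes looks roughly like this (standard SLT reasoning, our notation): compare candidate solutions by their local free energy.

```latex
% Free-energy comparison behind type A vs. type B transitions (informal; notation as before):
\[
  F \approx n L + \lambda \log n .
\]
% Type A ("normal learning"): move to a solution with \Delta L < 0 and \Delta\lambda > 0, preferred
% when n\,|\Delta L| > \Delta\lambda \log n, i.e. the loss improvement pays the complexity penalty.
% Type B (grokking-like): \Delta L \approx 0 and \Delta\lambda < 0, i.e. the loss barely moves but
% the model finds a simpler, more degenerate implementation of the same function.
```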

Nathan Labenz: One thing that's jumping out at me: obviously people can do variations on this, but the vanilla setup is that you're minimizing the loss, and you used this term, implicit regularization. If I'm sitting looking at my single loss function and trying to drive it as low as I can, I understand what weight decay is, and that's one form of regularization. But if I understand you correctly, you're saying that even if, from an external perspective, I am just minimizing loss, there is some natural pull that is moving me around in this complexity space and prefers simpler solutions. I don't yet have an intuition for why, or how that's happening, if I'm not taking an active step in the learning process to enforce it.

Daniel Murfet: You can imagine asking the question, what happens if you randomly drop yourself somewhere in the loss landscape? More likely than not, you will end up somewhere with pretty high loss, but you will also end up somewhere that has a very broad basin. There are many different ways you can perturb that solution that will leave the loss basically the same. Generally, as you move down the loss landscape, it will be easier to find the solutions that occupy a greater volume, and the main thing that contributes to the volume is the degeneracy, the number of valleys. It is easier to find canyons that have more valleys than it is to find very narrow nooks and crannies. This is the sense in which you get this emergent implicit regularization from the structure of the loss landscape.
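
To make the volume argument concrete, here is a small, purely illustrative Python sketch (an editorial addition; the two toy losses below are invented for illustration and are not taken from Timaeus' experiments). It estimates, by Monte Carlo, how much volume the near-minimal region occupies for a non-degenerate "round bowl" minimum versus a degenerate minimum whose zero set is a pair of valleys.

```python
# Both toy losses have minimum value 0 at the origin, but the second is
# "degenerate": it is exactly 0 along both axes, so the near-zero-loss
# region is a pair of broad valleys rather than a small round bowl.
import numpy as np

rng = np.random.default_rng(0)
w = rng.uniform(-1.0, 1.0, size=(2_000_000, 2))   # random points in a box of the landscape

non_degenerate = w[:, 0] ** 2 + w[:, 1] ** 2       # round bowl: isolated minimum
degenerate = (w[:, 0] * w[:, 1]) ** 2              # valleys along both axes

for eps in (1e-2, 1e-3, 1e-4):
    frac_bowl = np.mean(non_degenerate < eps)
    frac_valley = np.mean(degenerate < eps)
    print(f"eps={eps:.0e}  "
          f"volume near bowl minimum: {frac_bowl:.5f}  "
          f"volume near degenerate minimum: {frac_valley:.5f}")

# As eps shrinks, the degenerate minimum keeps a far larger share of the
# volume, which is the sense in which noisy optimization is more likely
# to find, and stay near, the more degenerate solution.
```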

Nathan Labenz: I could do this all day, sitting here trying to visualize these things and asking you to help de-confuse me, and I think we have many listeners like that too.

Daniel Murfet: Yeah.

Nathan Labenz: Though I probably need to be somewhat disciplined about how many more of those kinds of questions I ask. This seems to connect very fundamentally to double descent, so maybe refine double descent for a second, and then I definitely want to take a moment and really zoom out. It is all fun and games to motivate the entire thing with modular arithmetic and toy problems, and I will probably put this in the intro: the canonical safety worry is that you cannot necessarily tell the difference between a superintelligence that has your best interest at heart and one that does not, and you may have a real hard time in many ways if you cannot make that distinction as you get to more and more powerful systems. So there is the double descent thing, but then let us zoom out and talk about the biggest-picture worries that we have. And then I also think maybe we should start to go toward what the path is for you guys from here with this research agenda. We have seen from Anthropic, and others of course, an unbelievable amount of progress since toy models of superposition two and a half years ago, in my view.

Daniel Murfet: Yeah.

Nathan Labenz: Obviously, a long way still to go, but if you had told me then that we would be here now, I would be like, "That is really amazing." It has struck me that one of the big things that has gone into that has just been a ton of compute. I do not have an intuition yet for whether that is a similar trajectory to the one you guys think you will follow, or whether it is going to be more of a eureka-moment-driven process, to recall the Greeks again.

Nathan Labenz: A lot there, but double descent, if you want to say anything about that. And then, if we zoom out from the toys to the real things we are really worried about, what are the biggest concerns, and how do we go from these toy-model understandings to starting to tackle that, hopefully on relevant timelines?

Daniel Murfet: Regarding double descent, there are various kinds of double descent. There is double descent with respect to training samples, and there is double descent with respect to model size. I think nobody has systematically attempted to study double descent with singular learning theory yet. We have not really tried, I think partly because it just does not look that mysterious when you have approached the problem from the mathematical perspective that we do. It is not mysterious that models with many parameters can generalize well.

Nathan Labenz: Is this, in terms of an intuition, another one of those dimensionality things where when I see a hill I am like, "Boy, it looks like a lot to climb over that hill," but when I am in 200 billion dimensional space, there is always a way to wind my way through it?

Daniel Murfet: That is an often-presented intuition for why optimization in high dimensions is tractable. I am not so sure it is really an explanation; I think it is closer to a folk story, maybe. It is not understood why merely having a large model with lots of parameters should be enough. You can, of course, find things with lots of parameters that are not neural networks and do not optimize well, or you can configure neural networks badly and they will not train very well. There remain many mysteries about why large neural networks can actually find well-performing, well-generalizing solutions. Singular learning theory does not resolve all those mysteries, but it does at least give you a mathematical framework in which you are not surprised that models with lots of parameters can fit well, predict well, and generalize well. So I do not really have a clear answer for you on double descent; maybe one of your listeners wants to think more about it. We have not put a lot of effort into it. There is a story: you can think about double descent with respect to training samples and back out a picture that looks somewhat like the double descent curve from the generalization curves associated with one of the transitions I was describing earlier. Potentially there is an explanation there, I do not know.

Jesse Hoogland: I'm just worried that at some points we've gone too deep into the technical details.

Jesse Hoogland: So if I can take a chance to re-clarify or restate some things I've said earlier.

Daniel Murfet: Yeah, sure.

Jesse Hoogland: Just to ride that.

Daniel Murfet: Always.

Jesse Hoogland: More. Big picture, we want to end up with a friendly AI. We don't currently understand enough about what's going on inside the model to make any guarantee of "this model is aligned" or "this model is not aligned." It can exhibit the same behavior in training and then generalize differently when deployed. For me, the basic premise of interpretability is understanding model internals to a sufficient degree that we can disambiguate those two cases and understand generalization. SLT tells us that we can look at the geometry of the loss landscape, at how structure is reflected in that geometry, to start getting at this question. We have this principled link to in-distribution generalization, and that's a starting point for developing a theory of out-of-distribution generalization and for understanding better what happens in the case of SGD. That's the interpretability side. What this looks like in practice is that you probe a bunch of points in the loss landscape by applying small perturbations to your model: hit it with a hammer and see how it responds to that small perturbation, and this tells you something about what's going on internally. That's the basis for starting to extract information about model internals from geometry.

We've been working on a series of projects and papers where we first study small language models, subject them to these kinds of perturbations, and then measure things like: how much does performance degrade under these perturbations? That tells you about the complexity of the solution the model has learned. How much does performance change if I apply structured perturbations to a specific part of the model, an attention head or another component? This, it turns out, can tell you something about that attention head being specialized to a particular kind of data: these attention heads are all induction heads, these ones are doing something else, like memorizing different n-grams and skip n-grams. You can start to actually tear apart the internals of the model using these kinds of probes. More recently, we've been looking at applications to something like circuit discovery. In a principled way, can you associate the components of the model with patterns in the data, and do this attribution from the structures the model ends up with to the data that activates those structures? This starts to get at the kind of understanding that could explain why the model exhibits the behavior it does, rooted in its internal structures, without necessarily having a full mechanistic understanding of what's going on. You can do the sensitivity analysis, and it can tell you that these components are involved, without needing to know exactly, causally, what the input-to-output map was. This is a more top-down approach to interpretability, informed by this perturbation analysis.

On the one side, we want to develop tools for interpretability. On the other side, at some point, we want to be able to steer the learning process so that we don't end up with a misaligned model in the first place. We have to take a preventative approach. Ultimately, that's going to look like intervening in the training data.
The hope is that, to the degree the mapping from training data, through the geometry of the loss landscape and the learning process, to the final weights is invertible to some extent, you can go backwards and come up with techniques for choosing training data that align your model more robustly. These are the kinds of applications of SLT that we're trying to develop for alignment. We're only just getting started here, but our real hope for this agenda in the long term is that it gives us better ways to make models more aligned.
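
As a rough editorial sketch of the "hit it with a hammer" probing Jesse describes, the function below perturbs a chosen subset of a PyTorch model's weights with small Gaussian noise and reports the average loss increase. This is not Timaeus' actual local learning coefficient estimator (which samples a tempered local posterior with SGLD); the loss_fn interface, the batch, and the parameter names in the usage comment are hypothetical.

```python
# Crude "perturb and measure" probe: flat/degenerate directions barely
# move the loss; components the model relies on heavily move it a lot.
import copy
import torch


def perturbation_sensitivity(model, loss_fn, batch, param_filter=lambda name: True,
                             sigma=1e-2, n_samples=32):
    """Average loss increase under Gaussian weight noise of scale `sigma`,
    applied only to parameters whose name passes `param_filter`.
    Assumes loss_fn(model, batch) returns a scalar tensor."""
    model.eval()
    with torch.no_grad():
        base_loss = loss_fn(model, batch).item()

    increases = []
    for _ in range(n_samples):
        noisy = copy.deepcopy(model)          # perturb a copy, leave the original intact
        with torch.no_grad():
            for name, p in noisy.named_parameters():
                if param_filter(name):
                    p.add_(sigma * torch.randn_like(p))
            increases.append(loss_fn(noisy, batch).item() - base_loss)
    return sum(increases) / len(increases)


# Hypothetical usage: compare the whole model against a single attention
# block's parameters (the parameter-name pattern below is made up).
# whole = perturbation_sensitivity(model, loss_fn, batch)
# head3 = perturbation_sensitivity(model, loss_fn, batch,
#                                  param_filter=lambda n: "blocks.2.attn" in n)
```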

Nathan Labenz: Are those bell-ringing exercises? I'm imagining as you try to scale this up to larger and larger systems, you would be running experiments: if we add some noise, how often do we see behavior that we didn't want? Or it has some connections to AI control type ideas too. How often does the thing take the bait? If we set up a situation where there's some bad thing that it could do, how often does that happen with different perturbations? And the less it happens, the more aligned we are. Is that the measurement that you would imagine taking?

Daniel Murfet: Yeah, I don't think I'm going to bite on that particular proposal. But suppose you can elicit something about the structures in the model which, at the end of training, you understand to have produced the concerning behavior. To go back to the developmental perspective again: if you can trace the origin of those structures and behaviors to some kind of pattern in the data, then you can try to intervene on that, using your understanding of that pattern, at some earlier point, to shape that structure differently. Obviously, for a complex behavior and very complex internal structure, that's an ambitious thing to attempt, and that's one of the reasons why some progress on interpretability is, from our point of view, essential to unpacking it. If you imagine some complex mechanism producing the undesired behavior, you're not just going to find the good samples out there, throw those into the training process at 12,000 steps, and say, "Be good, be good." I don't think that's sufficiently fine-grained. But if you have some understanding of how that complex behavior came together, how this mechanism was shaped over development and what factors went into that shaping, then it's not crazy to imagine that there's something analogous to morphogens. Morphogen is an umbrella term for factors in the development of biological organisms that shape when things happen or what things happen. They can be particular molecules or proteins, and they dictate the way that development works in biology. If you insert the right morphogens at the right time, you can change the outcome of development. This is one guiding analogy for why it makes sense to hypothesize that if you understand the developmental process well enough, you might be able to intervene at various points, strategically, to change the outcome. Indeed, that's how parts of synthetic biology work: you understand the development process, you understand the morphogens, and then you can program development in a way you prefer.

Nathan Labenz: One question on the nature of overgeneralization. Is reward hacking an example of overgeneralization, where the model has gone too far in understanding the true nature of the reward signal, beyond what we intended? Is that fair?

Daniel Murfet: It could be. There are a few things. Reward hacking has a technical definition under which the answer to your question is no. Technically, reward hacking means the system found a way to get more reward than the intended solution. That's purely about the reward, analogous to the loss. That's not really about complexity. That's just about getting more of the actual explicit target of the optimization process. So under that definition, the answer to your question is no. Reward hacking is not an instance of the trade-off we discussed earlier, which you're calling overgeneralization. However, it's possible that instances we currently call reward hacking are partly a trade-off favoring a simpler solution over the intended one.

Daniel Murfet: I don't claim to have evidence for that, but I would hypothesize it. I think you're referring to one hypothesis for how dangerous behavior could arise, and it differs from the usual story of dangerous behavior stemming from over-optimizing a reward target.

To briefly revisit Jesse's earlier experiments: there was a simpler solution, ridge regression. In theory, if you train long enough, you should always switch to memorizing the data, because the optimization pressure wants you to get the lowest loss possible, and the lowest loss possible is to memorize the data, if memorization is possible. Typically, when training a GPT model, you can't just memorize the data, so you must generalize. But if memorization is possible, it's the best approach given the training signal. However, in that experiment, we actually see that in many cases the model sticks to the generalizing solution and never switches to the memorizing one. If you imagine the loss landscape, at its very bottom is the optimal solution specified by the training data, which in this case can be mathematically proven to be memorizing the tasks. But there's a simpler solution with higher loss that doesn't memorize and instead generalizes, and training can get stuck there forever, even though it's not the true global minimum. I say forever; obviously we didn't train forever, but it seems it would stay there forever.

This suggests a good heuristic, I think, contrary to some of the usual stories about AI risk: you might not get what you ask for. You don't get what you try to specify in the training data. Instead, you get a simplification, a heuristic, an approximation, which is not what you intended or expected. That could be good or bad. In AI safety, people mostly have an intuition that the simpler thing will be better, right? They often think about it this way: imagine the model becomes a scheming agent, even though you didn't optimize for this. It has structure in its weights to do the task, and then it needs to tack on additional structure to take over the world. Since that would be more complex than the solution that just does the task, and we have a simplicity bias, that's great, because there's strong pressure to remove this additional, dangerous stuff. That may be a reasonable way of thinking about it, but one should not have much confidence in it; it's quite an imprecise story, not really grounded mathematically.

But to tell an opposite story, simplification could be quite dangerous. And if you'll indulge me, there's a really interesting historical example here: the Windrush scandal. The UK, I think in the 1940s, allowed citizens of the British Empire to settle in the UK without documentation, and many people came, primarily from the Caribbean. In the 1970s, the laws changed, insisting that people have documentation of their citizenship to access services. These people, who had come as part of this program, didn't have the relevant documentation. Some were deported, and many bad things happened. An investigation and scandal occurred in 2018, I think, looking into this. So this is a case where you have a policy, and obviously the state's intended policy is not to deport people it previously allowed into the country. But when that policy is actually implemented at the street level, it becomes a simplification of the original intent; it's a very brute judgment call by the street-level bureaucrat.
This phenomenon is referred to as street-level bureaucracy, which simplifies what the original policy tried to specify. This ends up having negative unintended consequences. So, I think a case can be made that in AI safety, we should also be concerned about what happens if the behavior we specify at great expense by collecting training data ends up being a simplification, in ways we don't anticipate, of the intended target behavior, leading to undesirable outcomes.
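
Returning to the in-context linear regression experiment mentioned a moment ago: as an editorial illustration of the two candidate solutions, the NumPy sketch below compares a generalizing ridge-regression predictor with a predictor that memorizes a finite set of training tasks. All dimensions, task counts, and noise levels here are invented for illustration; the real experiments train a transformer on this kind of task distribution.

```python
# Two solutions for in-context linear regression:
#  - "generalize": run ridge regression on the context examples,
#  - "memorize": pick whichever of a fixed set of stored training tasks
#    best fits the context.
# On tasks drawn from the training set the memorizer gets lower loss; on
# genuinely new tasks it can fail badly, while ridge still works.
import numpy as np

rng = np.random.default_rng(1)
d, k, n_ctx, lam = 8, 16, 32, 0.1          # dim, #train tasks, context length, ridge penalty
train_tasks = rng.normal(size=(k, d))      # the finite task set a memorizer could store


def make_context(w):
    X = rng.normal(size=(n_ctx, d))
    y = X @ w + 0.1 * rng.normal(size=n_ctx)
    return X, y


def ridge_predict(X, y, x_query):
    w_hat = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
    return x_query @ w_hat


def memorize_predict(X, y, x_query):
    # pick the stored training task that best explains the context
    errs = [np.mean((y - X @ w) ** 2) for w in train_tasks]
    return x_query @ train_tasks[int(np.argmin(errs))]


def eval_loss(task_sampler, n_trials=2000):
    ridge_err, memo_err = 0.0, 0.0
    for _ in range(n_trials):
        w = task_sampler()
        X, y = make_context(w)
        x_q = rng.normal(size=d)
        target = x_q @ w
        ridge_err += (ridge_predict(X, y, x_q) - target) ** 2
        memo_err += (memorize_predict(X, y, x_q) - target) ** 2
    return ridge_err / n_trials, memo_err / n_trials


def seen_task():                 # tasks from the finite training set
    return train_tasks[rng.integers(k)]


def new_task():                  # genuinely new tasks
    return rng.normal(size=d)


print("seen tasks (ridge, memorize):", eval_loss(seen_task))
print("new tasks  (ridge, memorize):", eval_loss(new_task))
```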

Nathan Labenz: Seeing a state and seeing a model may be more similar than different, certainly relative to what I had conceived of previously. I think I have two more questions. One is, I've heard Jesse tell the story in the past in other forums that model training in the future could look more like refining oil does today, where it's a very messy process to start, but you can figure out exactly what components need to be added when. You have a sense for when these phase transitions are happening, and you have a much more precise level of control over that process. I'd love to hear a better version of that story than what I just offered. My other question is, what do you make of other safety efforts? You've said that we probably can't just be good examples and hope for the best, or at least that doesn't seem adequate. But I'd be interested in your assessment of all the various, or at least some of the most prominent techniques out there today. What do you think has the best hope of working, or what do you think is doomed? So, your vision for the future of model training and the level of control we might acquire over that process. Maybe you could talk about how you scale and the compute, because I don't think we ever touched on that. And then this assessment of the landscape. There's a lot there, but I'll let you have at it.

Jesse Hoogland: I could start with the oil refining example. I think it's useful to compare current deep learning to alchemy. We have a huge cauldron, which is the architecture. We have a fire, which is the optimizer. And then we have the reagents we put in this cauldron and start to mix together; those reagents are the training data. Currently, we just throw the entire internet into this cauldron, start stirring, and hope that we haven't accidentally mixed bleach and ammonia. What we have in mind for the future of training looks more like industrial chemical manufacturing, where you know exactly which reagents you're mixing together, at what concentration, at what point in the process, and with what catalysts and other ingredients. This level of control is not accessible to us right now, but it's something we're already empirically starting to develop. We have splintered the post-training process into a bunch of different stages: a first stage of chain-of-thought RL, a second stage of chain-of-thought RL, a little bit of instruction fine-tuning somewhere, a first stage of Constitutional AI, a second stage of Constitutional AI just for the personality, then some more refusal training. Similarly, on the pre-training side, the process is being split into multiple stages, so we're developing an empirical understanding of how to better control the learning process. What we think is possible is that you could develop the scientific understanding to actually know when you should mix two samples in a batch together to get the desired behavior out the other end. This would give you more control over the entire process and over what you end up with.

Nathan Labenz: Just to interject one example that struck me recently from the Claude 4 system card. I'm sure you've seen this anecdote. It hasn't been very deeply explained, although Sam Bowman talked about it online. They reported having observed that the model was following harmful system prompts in a way that surprised them, and they were trying to figure out why. Sure enough, it turns out that they had omitted the harmful-system-prompt dataset, which had been developed specifically to teach the model what to do when given a harmful system prompt. So the simpler solution of always following the system prompt is what it ends up learning, for lack of that curb being included as it was meant to be. Interestingly, from my perspective, they did not go back and retrain. There's a lot I'm still unpacking, like what exactly I should be inferring about the world, on multiple dimensions, from the fact that they didn't just say, "Oh, we messed that up," revert, pick up where that dataset was supposed to be introduced, and continue on. Presumably that means it was happening somewhat earlier in the process than some very late finishing stage. Presumably it also means they have constraints at Anthropic in terms of either compute budget or timeline to launch for competitiveness reasons or whatever. There's a whole interesting analysis to be done there too. But that's at least one very recent and concrete example where, without necessarily a full theory to drive it, the lesson is starting to emerge that you had better make sure you get those datasets in the right place at the right time. I had a private conversation with somebody at Anthropic just to ask for clarification on this, and they basically said, "We're confident we got the value of that dataset in the end." So I'm like, okay, that's interesting. I sort of believe you, and you definitely know more than me. At the same time, how can you be confident? On what basis could you have any confidence about that, other than obviously poking at the model a lot? But that's the sort of confidence, I guess, that you hope to create in the process of continuing to do this work.

Daniel Murfet: It isn't an unreasonable expectation, given that our civilization is about to be transformed by this technology, for it to be more of an engineering science and less of a big mystery. It's not that nobody understands anything: the data mixtures selected and their order are empirically very finely tuned, with a lot of effort put into that, and selecting hyperparameters has some underlying theory. A lot of people and money are being poured into deep learning, so it's not completely incomprehensible. However, most people would agree it would be highly desirable for it to be much better understood and more like other aspects of engineering that we subject to safety engineering, which deep learning currently isn't. That seems hard to disagree with. It's not an easy thing to do, especially in a short period of time. To come back to your question about scaling, and our situation compared to where Anthropic was with SAEs a few years ago, that analogy seems quite apt to us. It's been a difficult path to get from fairly pure abstract theory to validating it in real systems. Now we're starting to do things like circuit discovery: we can find induction circuits and simple things like that in an unsupervised way. There is scaling still to do, and that's one of the main things that remains before we're finding complex internal structure. In terms of compute, that's a little difficult to say. The compute overhead of SAEs, for instance, where you have to train them, is very expensive. Our compute burdens will be in different places, but I imagine it's still very substantial to try to understand large models. We have been studying models up to seven billion parameters.

Nathan Labenz: We can definitely use more compute. If a listener has compute to offer, please.

Daniel Murfet: Yes, that's right.

Jesse Hoogland: Something to add here: if you have more compute, you can draw more samples and apply more perturbations, which gives you a finer signal about what's going on inside the model. We expect that, generally, we will be able to translate more compute into a finer understanding of structure.

Daniel Murfet: Regarding other approaches to alignment, in many respects, the things we're describing are useful foundations for many different things you might do. That's one way we think about it. For instance, the work we're doing around shaping the data distribution is potentially useful for some of the work the UK AI Security Institute wants to do around elicitation. We have various ideas about how our work can contribute. Some of it is just good, solid foundational science of understanding models, the training process, and how model behavior works. In the same way that interpretability is broadly useful, much of this is broadly useful. It seems unclear whether the current paradigm of shaping model behavior at the end of training is going to be a very robust solution to alignment as we move towards higher capability levels. This is an oft-expressed concern. Something that looks like baking in behavior and control of behavior earlier in the training process seems broadly likely to be adopted. Arguably, the deliberative alignment approach, especially if you scale up the amount of compute spent on reinforcement learning to be equal to pre-training, involves a lot of shaping of model behavior.

Nathan Labenz: that is taking place over a large part of the effective compute.

Jesse Hoogland: This idea is not just ours. There are papers like 'Pre-training from Human Feedback.' A lot of people realize that we need to bring alignment earlier in the learning process. I can end with a look to the next few months. Currently, a lot of our effort involves scaling techniques up to larger models and running many experiments with them. We've validated these circuit discovery techniques in small, three-million-parameter language models. Now, can we do this in seven-billion-parameter language models? We think this is quite likely to work by the end of the year. On the alignment front, we're starting our first experiments applying these techniques to try to steer the learning process. This is still early days. We'll probably have early signs of life by the end of the year in small language models.

Nathan Labenz: All right.

Jesse Hoogland: We're developing applications for things like elicitation and data attribution. There's a singular learning theory extension to influence functions that you can study, which gives us ideas about the influence of training samples on behavior. We are trying to use these applications to validate that SLT and the techniques we're developing offer something beyond existing techniques. That's where we're heading.

Nathan Labenz: Cool. I love it. This has been fascinating. I appreciate you indulging me in so many little side questions and attempts to develop my own intuition. I certainly think a proper developmental understanding of how these AIs are forming, and any theoretical basis at all for what they're going to do when they get into a truly out-of-distribution situation, is badly needed. And especially as we get closer to 2027 and beyond, it's really going to help me sleep well at night. So, to quote my dad, who often quotes Leslie Nielsen from Airplane, "Good luck. We're all counting on you." Anything we didn't touch on that you want to leave people with real quick?

Jesse Hoogland: If you want to learn more about singular learning theory and read about some of the papers we discussed, and some we didn't, you can go to our website. That's timaeus.co. T-I-M-A-E-U-S.co. There's also a Discord server link there. That's in particular the place to learn more. We have a weekly seminar and plenty of opportunity to get engaged with the research going on.

Nathan Labenz: Cool. I've typed it wrong more than once myself. It's T-I-M-A-E-U-S.co.

Jesse Hoogland: Yeah.

Nathan Labenz: And it's a brain stretching exercise, at least for me and probably for most of us. But I've definitely enjoyed stretching my brain in this way. So, I look forward to future progress. We can do a check back in maybe in a year's time or whatever, and see how much we've closed the gap on all these important questions. But for now, I will say Jesse Hoogland and Daniel Murfet, founders of Timaeus, thank you for being part of The Cognitive Revolution.

Jesse Hoogland: Thank you, Nathan. Thank you.
