In this groundbreaking episode of the Cognitive Revolution, we explore the intersection of AI and biology with expert Amelie Schreiber. Learn about the advances in drug design, protein network engineering, and the unfolding AI revolution in scientific discovery. Discover the implications for human health, longevity, and the future of biological research. Join us as we delve into an exciting conversation that may redefine our understanding of biology and medicine.
SPONSORS:
Oracle Cloud Infrastructure (OCI) is a single platform for your infrastructure, database, application development, and AI needs. OCI has four to eight times the bandwidth of other clouds, offers one consistent price, and nobody does data better than Oracle. If you want to do more and spend less, take a free test drive of OCI at https://oracle.com/cognitive
The Brave Search API can be used to assemble a data set to train your AI models and help with retrieval augmentation at the time of inference, all while remaining affordable with developer-first pricing. Integrating the Brave Search API into your workflow translates to more ethical data sourcing and more human-representative data sets. Try the Brave Search API for free for up to 2000 queries per month at https://bit.ly/BraveTCR
Head to Squad to access global engineering without the headache and at a fraction of the cost: head to https://choosesquad.com/ and mention “Turpentine” to skip the waitlist.
Omneky is an omnichannel creative generation platform that lets you launch hundreds of thousands of ad iterations that actually work, customized across all platforms, with a click of a button. Omneky combines generative AI and real-time advertising data. Mention "Cog Rev" for 10% off https://www.omneky.com/
CHAPTERS:
(00:00:00) Introduction
(00:04:53) Introduction to Amelie Schreiber and the Podcast
(00:08:59) Understanding Protein Interactions
(00:11:45) Traditional Methods vs. AI Approaches
(00:13:51) Molecular Dynamics and AI Models
(00:18:02) AlphaFold and Protein Structure Prediction
(00:18:43) Sponsors: Oracle | Brave
(00:20:51) Protein Dynamics and New AI Models
(00:32:36) Sponsors: Squad | Omneky
(00:34:22) Challenges in Protein Interaction Models
(00:44:44) Generalization and Data Splitting in AI Models
(00:48:43) Advanced AI Models for Protein Complexes
(00:52:25) Practical Applications of AI in Biochemistry
(01:01:53) Designing Protein Sequences with LigandMPNN
(01:05:19) Binder Design and Fold Conditioning
(01:08:48) Challenges and Bottlenecks in Drug Discovery
(01:16:09) Adoption and Accessibility of New Technologies
(01:21:04) Future Prospects and Ethical Considerations
(01:37:08) The Role of AI Agents in Biological Research
(01:40:18) Balancing Innovation and Safety in Biotechnology
Full Transcript
Nathan Labenz (0:00) Hello and welcome to the Cognitive Revolution, where we interview visionary researchers, entrepreneurs, and builders working on the frontier of artificial intelligence. Each week we'll explore their revolutionary ideas, and together we'll build a picture of how AI technology will transform work, life, and society in the coming years. I'm Nathan Labenz, joined by my cohost Erik Torenberg. Hello and welcome back to the Cognitive Revolution. Today I'm speaking with Amelie Schreiber, a computational biochemist and AI researcher working at the forefront of applying the latest AI models and research techniques to some of the most complex and impactful domains imaginable: drug design, protein network engineering, and broadly unlocking the secrets of biological systems. In all honesty, this episode has given me more future shock than perhaps any other episode we've done. Traditional analytical methods are not well suited to the crazy complexity of biological systems, and as a result, there's still much more that we don't understand about biology than we do. But as Amelie makes clear, modern AI architectures are perfectly suited to take advantage of biology's massive datasets, and thus a new wave of AI models is poised to dramatically accelerate the process of scientific discovery in biology, starting with how we understand protein and other cellular structures and the physical interactions between them, and likely soon zooming out to help us understand how our cells, tissues, and bodies function at higher levels. As you might expect, given the incredible complexity of biology, if you're not already well versed in the subject, the first hour will feature a number of new technical concepts. And while Amelie and I do our best to explain them all clearly, this episode, for me at least, did require multiple rewinds for full comprehension.
I think it might help to keep in mind, as you listen, the distinction between static structural analysis and dynamic conformational analysis, which allows for molecules to change shape as they interact with one another. Remember too that all the models we discuss today are really rather narrow in scope. Foundation models for biology, along the lines of the LLMs that many of us are most familiar with, are just now starting to be trained. One upshot of this is that we discuss different models for predicting shapes versus for predicting sequences. Obviously, in reality, these are two sides of the same coin, but for our purposes, they are often modeled separately. In the second hour, we discuss the implications of all this for human health, longevity, and biosecurity. I had heard, as you probably have, the story of how the first COVID vaccine was designed in just a few days after the right scientists received the required information, but I had never really considered what the world might look like if that pace of medical R&D were to become the norm. The impact would seem to be a near-certain revolution, not just in biology, but also in practical medicine. One particularly striking theme is the potential for AI to change the way that we do biological research on humans. The difficulty and danger of experimenting directly on humans has always been a major bottleneck, but the latest models are quickly approaching the point where we should be able to run meaningful digital experiments, and that could radically accelerate the pace of discovery, both for the utopian better and, at least with some probability, perhaps catastrophically, for the worse. And all this, by the way, was before AlphaFold 3, which dropped shortly after we recorded and which will certainly get its due attention in future episodes.
With the full range of possibility in mind, and noting just how small many of these biology models are relative to the latest language models, I definitely wanna give a shout out to the policy wonks at the White House who set a lower reporting threshold of 10 to the 23rd FLOPs for biological models, as compared to 10 to the 26th FLOPs for language models. That and their recent policy requiring DNA synthesis companies to screen orders against known dangerous sequences are looking extremely smart right now. By comparison, the now familiar question of whether today's chatbots are more helpful than Google for the purposes of making a bioweapon honestly already feels quite quaint to me. For what it's worth, because the vision of the future sketched out here was so far outside my previous understanding, I did take some time to double-check my high-level interpretation of Amelie's claims with some very credible people in the field. One of my very smartest friends, who is also deeply enmeshed in these issues, said simply: that is exactly what is going to happen. If you find this work valuable, I would appreciate it if you take a moment to share it with friends. This episode in particular took a lot more work than a typical CEO interview, but it will be very well worth it if I can help our high-value audience get up to speed on such a critical area. And as always, we invite your feedback and suggestions either via our website, cognitiverevolution.ai, or by DMing me on your favorite social network. Now I hope you enjoy this introductory deep dive into a genuinely awesome area of emerging research that may soon impact all of our lives. This is the AI revolution in biology with Amelie Schreiber. Amelie Schreiber, computational biochemist and AI researcher, welcome to the Cognitive Revolution.
Amelie Schreiber (4:59) Nice to be here.
Nathan Labenz (5:01) I'm excited for this conversation. I expect to learn a lot from it. This all started with a tweet that I saw you put out where you said, here are my top 10 AI tools for biology. And I read down the list, and I was like, I don't know what any of these are. So, yeah, I reached out pretty much immediately to say, hey, would you be interested in educating me in the form of a podcast? So I appreciate you taking the time. You wanna start off by just giving a little bit of context: who you are, what you're doing day to day, and what kinds of problems you're trying to solve?
Amelie Schreiber (5:30) Yeah. Sure. I'm an AI researcher, and I focus on computational biochemistry applications. I actually started out as a mathematician. My training in grad school was in mathematics, actually. Started off with pure mathematics and then transitioned into applied mathematics and data analysis. And then after grad school, I got into deep learning and started working on more AI-related things. For me, the biochemistry applications are one of the most compelling things that we could be working on right now. I think, other than, like, AGI, whatever that means, it's probably the most important problem we can be working on, because it has a huge impact on human health. The applications are really profound and have the potential to be very impactful for everyone. So I get really excited about the biochemistry applications of AI. I have some projects and things that I'm working on that are more related to environmental things and material science type applications, but primarily, I focus on the medical or biomedical aspect of things.
Nathan Labenz (6:28) Okay. So let me try to give an extremely high level understanding of a couple of the biggest problems that are being studied in biomedical sciences. As I was preparing for this, I was having a good conversation with Claude 3 about it. And I came away with the understanding that both at the level of an individual cell and then again at the level of the overall organism system, we have one really massive challenge, which is that we don't know how it works. We've got the DNA, which is the code; the RNA, which is both a messenger but also a machine; the proteins, which are all machines that fold up in weird ways and interact with each other in three-dimensional space. And then you've got the small molecules as well, which are, like, signaling, but also, if they fit right, you know, all these puzzle pieces fit together in super strange ways. There's this incredible network of interactions where things are disabling each other or promoting each other or interacting in all sorts of complicated ways. The nature of those interactions initially was entirely unclear. And now humanity at large has embarked on this grand project of trying to figure out how do cells work and how do our bodies work. And I'm gathering we're, like, maybe 5 to 10% of the way there. Most of the interactions remain unknown to us, but we've at least mapped out a decent chunk. So the first challenge, if you're trying to solve a disease, is to figure out what is the pathway, what is the interaction in this, like, super complicated thing that is going wrong. And then the second challenge is: if I know what is going wrong, can I do something to intervene to stop it? But then again, there could be a lot of other knock-on effects. In that way, there's an important commonality with large language models. These things are not super clean.
In language models, people are probably familiar with the notion of superposition, which is when an activation in a network can light up for multiple different reasons, and you certainly see all of these kinds of patterns of complexity in biology as well. But we know that there's a lot of stuff going on. There's a lot of interactions that are happening. We don't really know a lot about that, but we're gradually learning more all the time. And then if you do have a target, now it's okay, these are all three-dimensional spatial things, and so it's just extremely, extremely difficult.
Amelie Schreiber (8:52) It is. Yeah.
Nathan Labenz (8:53) How am I doing there in terms of just setting a a foundation to understand the challenges?
Amelie Schreiber (8:58) I think you've really hit the nail on the head. I work with a lot of different kinds of molecules. So small molecules, which are your drugs that you, like, take in, like, a pill form; these are smaller than proteins and less complicated. And I also work with proteins, and also a little bit with DNA and RNA. And understanding protein interaction networks is a very difficult and complex problem. Just to determine whether or not two proteins interact with each other is already a hard problem. And then modeling how they fit together when they do interact, and understanding the strength and how transient that interaction might be, is very difficult. So protein interactions especially, but also protein-DNA, protein-RNA, and protein-small molecule interactions are all very complicated things to model. There are some new ways to address this that have come out recently that I have a lot of hope for, that I think are very promising and very effective approaches. I use various kinds of AI tools to analyze these molecules, create new ones, modify them, engineer them in various ways, grafting them together and so on, performing complicated operations on these molecules so that they perform a particular kind of function. As an example, designing a new protein to bind to another protein so that you can cause some kind of cascade event in a protein interaction network, or so that you can block certain interactions between proteins. A really good example in cancer, one of your basic first examples of a mechanism of cancer, is PD-1 and PD-L1. These are two proteins. One of them is located on the cancer and one of them is located on your immune cells. What happens is these two proteins end up binding to each other, and essentially the cancer turns off your immune cell so that your immune cell doesn't attack the cancer. And that's not good. Right?
You don't want this kind of interaction between these two proteins, because you want your immune system to recognize the cancer and destroy it. Having this interaction turns off your immune system in a very specific way. So you can do things like design binders to this protein to block that interaction, and that can help you combat cancer and hopefully treat it. And that's just one basic example of something that you can do with these tools that I'm using and that I'm interested in. You can also design new drugs with them. You can design drugs that are specific to a particular binding pocket on a protein. You can design drugs that have very specific chemical properties. You can design proteins that have very specific chemical and functional properties. And this is pretty new stuff, but you can also do a lot of the same things with DNA and RNA molecules as well.
Nathan Labenz (11:46) Okay. This is a paradigm shift, right? Obviously, AI is creating all sorts of paradigm shifts, and its application to biology is one of them. But I think it might also be helpful for people to understand a little bit better the before state, when we didn't have any of these tools yet, which is not that long ago.
Amelie Schreiber (12:03) Not that long ago. Yeah.
Nathan Labenz (12:04) What was the sort of prevailing approach to figuring stuff out? You hear these stories of, oh, look, we found this frog in the rainforest that is immune to a certain disease. What's going on there? Let's see if we can't find something in this frog that could be a medicine or whatever. In a lot of cases, it's really investigations motivated by anecdotal, special observations. And then I know there's also just a lot of brute forcing, where it's like, we have no idea which proteins are gonna interact with which other ones, so let's just create this massive cross matrix and see if we can figure it out that way and look for hits, kind of massive assays just exploring the space. All of these things without any idea of what the puzzle pieces actually look like, which makes it obviously very difficult to figure out how they would fit together. What more would you tell people who wanna understand, okay, what was the "before" before all this stuff started to come online?
Amelie Schreiber (12:55) There's a lot of methods that come from wet lab work where people do this stuff in animals. Like, say you want to do directed evolution on a protein and try to find higher functioning variants, or variants that are more thermostable, that have higher expression or something like that. You can mutate these things in a lab. You can do, like, point mutations, or you can do two mutations at a time, or you can do multiple, but when you start adding in things higher than just single point mutations, you get this combinatorial complexity. Right? And so it gets really unwieldy. I feel like my experience with traditional methods is somewhat limited. I don't come from a wet lab background. I come from a very computational background. I haven't spent a lot of time working with more traditional methods that people have used historically. So computationally, traditionally, the way that this sort of thing was approached was through molecular dynamics simulations. You have what's called a potential, and this potential tells you how the dynamics evolve.
Nathan Labenz (14:04) And is that basically like an energy potential diagram? Help me pin down the ground truth.
Amelie Schreiber (14:09) Yeah. Observing the dynamics of a protein is really hard. The ground truth ultimately is the Boltzmann distribution, which is this theoretical physics idea that comes from statistical mechanics. It's a probability distribution. And you can think of it in terms of an energy landscape. The low parts, the valleys that you have in your energy landscape, are gonna be the metastable states of the protein. And then the peaks, like on top of the mountains, these are states that are very transient and that are unlikely to exist for very long. So when you're trying to model the Boltzmann distribution, you're trying to get out these low energy conformations from it, and then the transitions between those, and understanding how often you transition between this state and this other state and what that transition looks like. This is all part of the dynamics of the protein in the environment. And so I guess the answer to your question is: the ground truth is the Boltzmann distribution. Whether or not you can model that on a computer or observe that in a lab is another question. But the Boltzmann distribution of the protein is the ideal ground truth.
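For reference, the Boltzmann distribution Amelie describes is textbook statistical mechanics (not anything specific to her work): each conformation $x$ gets a probability weighted by its energy $E(x)$,

$$p(x) = \frac{e^{-E(x)/k_B T}}{Z}, \qquad Z = \int e^{-E(x')/k_B T}\, dx'$$

where $k_B$ is Boltzmann's constant, $T$ is the temperature, and $Z$ is the normalizing partition function. Low-energy conformations (the valleys) receive exponentially more probability mass than high-energy transition states (the peaks), which is why the metastable states dominate the ensemble.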
Nathan Labenz (15:20) So it's a probability distribution of what percentage of the time the protein, in some solution or whatever, is in (Mhmm) shape A versus shape B versus shape C. So in terms of just a little intuition for the shapes, I envision a Slinky, where if I just sit the Slinky down on the floor, it will come into a pretty tight coil. And that, you might say, is, like, its lowest energy state. Then I can stretch it out. And if I (Mhmm), you know, do the work and put the energy into it, then I can stretch it out. Now (That's right), if I let it go, it's gonna snap back to its low energy state. That reverb moment in between things is a transitional state. It's not gonna be in that state very long. It's on its way from one to another.
Amelie Schreiber (16:09) That's right. Yeah.
Nathan Labenz (16:09) And for any given protein, there could be multiple different low energy states that it might spend time in.
Amelie Schreiber (16:15) Yeah.
Nathan Labenz (16:16) And, of course, the individual proteins, they're in solution, right? So there's water molecules bouncing off them all the time. So the constant bumping into the environment creates opportunity for these things to occasionally flop from shape to shape. (Yep.) And there probably also is, like, a path issue, where you maybe can get from A to B and B to C, but not necessarily A to C without going through
Amelie Schreiber (16:40) That's right. Yeah.
Nathan Labenz (16:41) B, I imagine.
Amelie Schreiber (16:42) Yeah. Exactly.
Nathan Labenz (16:43) And so doing all of this is probably hard for a lot of reasons, but you highlighted the fact that the computation is really slow. Right?
Amelie Schreiber (16:52) Yeah. So molecular dynamics simulations come in a lot of different flavors, a lot of different complexities. Some of them model the proteins using just standard Newtonian physics, and then you can add in more complexity on top of that, things like quantum properties and other features of the protein, to make the molecular dynamics simulation more complex and more robust. But these simulations are really computationally intensive. They take a long time, they take a lot of GPUs, and they're just not very efficient. And on top of that, the length of time that you run your simulation for really, in a lot of cases, determines how accurate your distribution is and how accurate the conformations or trajectories that you get out of that are. So you might run your simulation and not run it for long enough, and you don't get all the different states that the protein might be in; you miss some of them. Right? And there are some new AI models that are trying to address this and trying to make headway into augmenting or even replacing molecular dynamics simulations in a lot of cases. So for example, AlphaFold2 came out a couple of years ago, and that was a big deal, right? But you just get a single static structure from AlphaFold2. So we give it the protein sequence, which is just a sequence of amino acids, which are represented by 20 letters, and it takes in this protein sequence and it provides you with a static structure for that protein that's a low energy conformation for that protein. It also tells you how confident the model is that that is the structure of the protein at that particular point in the protein. And then you can take the average and get an overall confidence for the whole protein.
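To make the confidence idea concrete: AlphaFold2 reports a per-residue score called pLDDT, between 0 and 100, and the overall confidence Amelie mentions is just the average of those. A minimal sketch; the scores below are made-up illustrative values, not real model output:

```python
def mean_plddt(per_residue_plddt):
    """Average per-residue pLDDT scores (0-100) into one
    overall confidence for the whole predicted structure."""
    if not per_residue_plddt:
        raise ValueError("empty pLDDT list")
    return sum(per_residue_plddt) / len(per_residue_plddt)

# Hypothetical per-residue confidences for a short peptide:
scores = [92.1, 88.4, 95.0, 70.5, 60.0]
print(round(mean_plddt(scores), 2))  # -> 81.2
```

In practice, regions with low per-residue scores (often flexible loops or disordered regions) drag the average down, which is itself a hint that the single static structure is not the whole story.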
Nathan Labenz (18:44) Hey. We'll continue our interview in a moment after a word from our sponsors. So let me just get into a little bit more how that happens. If I understand correctly, there were proteins that had been analyzed, mostly through X-ray crystallography, to the point where people were pretty confident that they had a structure. People put a lot of time in the lab: we gotta make enough of this stuff, get it to coalesce into some crystal form, then we can hit it with X-rays, then we can try to decipher how those get scattered, and then we can come up with a structure. And we hope that approximates the structure that it actually takes in the cell, but no guarantees, because protein crystals don't really occur in nature, right? That's a very odd thing in the first place. So that was, like, the closest approximation that we could get. I think that's why the initial models were limited to outputting a single predicted structure.
Amelie Schreiber (19:35) That's right. But again, that protein could exist in other conformations. It might have other states that it exists in when it's interacting with other proteins, or based on the environment that it's in, what temperature it is. Things like this can change the shape of the protein, right? They move around, they're very jiggly, and they do things. And so having a way to sample the Boltzmann distribution and get all of these different conformations out of it, and also understanding the transitions between these states, is a really difficult problem. People have been trying to address this. There's actually a model that just came out middle of last year that does a really good job of addressing this. It's called Distributional Graphormer. It's a generalization of AlphaFold 2. It's not just a single static structure anymore. It's actually a whole ensemble of structures, and also the transition pathways between those different metastable states. Distributional Graphormer does a pretty good job of doing it too. There's some room for improvement for sure, but it's a pretty solid model. Some people that I have a lot of respect for worked on this, and it's a diffusion model similar to, like, DALL-E, which most of your watchers are probably familiar with, but it works on proteins instead of images.
Nathan Labenz (20:51) And then there's another level of, like, insane complexity: they interact with each other. So this is, like, maybe in the presence of some other protein that constrains it in a certain way, or some other small molecule perhaps that fits into a pocket of it in a certain way. You can have these deformations. And then, subject to those constraints, they still find their natural low energy state. Returning to my Slinky visualization, I could also step on the middle of it, and then the parts on the side would presumably still look normal, but there'd be this, like, deformed part in the middle where I'm stepping on it. This is where AlphaFold (Yep) Multimer starts to
Amelie Schreiber (21:25) Multimer. Yeah. AlphaFold Multimer models the interactions between proteins. When it predicts the two proteins together, there's an interface region where the proteins are in contact with each other. And AlphaFold Multimer tells you the quality of the interfaces between the two proteins. And then you have these predicted aligned error outputs, and these tell you the quality of the prediction the model has provided you.
Nathan Labenz (21:48) So on the interface, you have a range: whether these things are like puzzle pieces that are perfectly fit to each other and have a really tight coupling, or jigsaw pieces that sort of fit but don't perfectly fit and don't really wanna stay together as much, all the way down to, obviously, they just don't fit at all and they just don't adhere to each other.
Amelie Schreiber (22:10) Yeah.
Nathan Labenz (22:10) Is there anything more we could do to develop our intuition there for this dynamic range of how adherent these things are to each other?
Amelie Schreiber (22:18) Yeah. So there's two ideas that you're addressing here. One of them is binding affinity, like how much affinity these two things have for each other, to bind to each other. And then there's also how transient the interactions between them are in terms of, like, dynamics and, like, the time spent next to each other. There's a recent work, I think it's out of MIT. They use the AlphaFold 2 architecture, and they train it as a flow matching model, which is a generalization of diffusion. And using this, they get these ensembles of conformations for a protein. And part of the information that they get out of this is how transient the interactions between residues are. Let's say you run your AlphaFold model and you produce 10,000 different conformations or states that the protein exists in. You can cluster those structurally and then look at which residues are close to each other in each of those clusters. And you can tell how transient the interaction between these two residues is, and how much time they spend together.
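The counting idea Amelie describes can be sketched in plain Python: given an ensemble of sampled conformations, check how often a pair of residues falls within some contact cutoff. Everything here is illustrative (mock coordinates, a hypothetical 8 Å cutoff), not the output of any real tool:

```python
import math

def contact_fraction(ensemble, cutoff=8.0):
    """Fraction of sampled conformations in which two residues are
    within `cutoff` angstroms of each other -- a crude proxy for how
    transient (low fraction) or persistent (high) their contact is."""
    in_contact = 0
    for res_a, res_b in ensemble:
        if math.dist(res_a, res_b) <= cutoff:
            in_contact += 1
    return in_contact / len(ensemble)

# Mock ensemble: (residue A xyz, residue B xyz) per conformation.
ensemble = [
    ((0.0, 0.0, 0.0), (5.0, 0.0, 0.0)),   # in contact
    ((0.0, 0.0, 0.0), (12.0, 0.0, 0.0)),  # apart
    ((0.0, 0.0, 0.0), (7.5, 0.0, 0.0)),   # in contact
    ((0.0, 0.0, 0.0), (20.0, 0.0, 0.0)),  # apart
]
print(contact_fraction(ensemble))  # -> 0.5
```

Real analyses work over all residue pairs and cluster whole structures first, but the principle is the same: the fraction of the ensemble in which a contact exists approximates the fraction of time the residues spend together.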
Nathan Labenz (23:26) One thing that caught my attention: I believe you had said that there were, like, as many as 10,000 possible conformations for a single thing.
Amelie Schreiber (23:36) Yes. When you run AlphaFlow, it generates conformations of a protein for you, and it was trained on molecular dynamics simulation data. So it does have some notion of dynamics, and it produces different conformations of the protein for you. And you can tell it how many of these to produce, right? And the more that it produces, the more likely you are to get a nice global picture of all the different conformations that might exist for that protein. So we've actually tested AlphaFlow on proteins called fold-switching proteins. The most popular example in the literature and in the community right now is KaiB. And KaiB is this fold-switching protein that, like, 10% of the time exists in this one conformation and then 90% of the time exists in this other conformation. And this is influenced by, like, a circadian rhythm. And when you apply AlphaFlow to some of these fold-switching proteins, it doesn't capture both of those fold-switch states. So in the example of KaiB, it actually only really predicts conformations that are relatively close to the slightly higher energy conformation, the one that's a little bit less likely, the one that it exists in 10% of the time, which is a little bit odd. Right? Because you would expect, and maybe want, your model to go for the ground state or the lowest energy conformation. And for some reason, it sticks kind of close to the fold-switch conformation. I'm not sure that there's, like, a good explanation for why. I think it goes back to the AlphaFold 2 predictions, because when you predict the structure using AlphaFold 2, you get the fold-switch state. You don't get the ground state. People have done all kinds of little hacks to try and get out the ground state instead of the fold-switch state. The most popular method they've hacked together is doing MSA subsampling.
Nathan Labenz (25:34) Can we talk about this MSA thing for a second?
Amelie Schreiber (25:36) MSAs, multiple sequence alignments, are basically this. A lot of times, the changes that happen only happen in certain regions of the protein, and then there are other regions of the protein that are highly conserved. And so when you do an MSA, you can look at which regions of the protein are really highly conserved. And this will give you some indication of which parts of the protein are gonna affect the function if you mutate them. So, like, if I have a region of the protein that evolution hasn't changed, or has changed very, very infrequently, and I change that, if I mutate that, it's probably gonna degrade the function of the protein or change the function of the protein, or maybe it'll decrease its thermal stability or something like that. So MSAs are just telling you evolutionary information, and AlphaFold2 uses MSAs. You can either use the database of MSAs or you can have it compute them on the fly to inform how it predicts the structure of the protein. Because when you have proteins that are very closely related evolutionarily, their structures are often very similar. That's not always true. I can have two proteins that are very similar but have very different structures. That happens all the time. But if you have this extra evolutionary information in the form of an MSA, you have extra information that helps you predict the structure of the protein of interest.
Nathan Labenz (26:58) This is sort of calling to mind the classic image of the plane with all of the spots where the planes came back having been shot, and then all the places where they didn't observe hits, those were the planes that got shot down. So this is the evolutionary equivalent of that, where
Amelie Schreiber (27:14) Yep.
Nathan Labenz (27:14) We don't see changes in certain regions because they're super core to functionality. And that in turn is super useful for prediction making, because these parts are the really important parts, and the sort of interactions that they have are the whole point of what's going on.
Amelie Schreiber (27:33) And so clustering the MSA captures certain evolutionary information. Different clusters will focus on or accent particular conformations. And so if I take one of my clusters and predict the structure using that cluster as the MSA, I'll get a conformation out from AlphaFold2. And then if I pick a different cluster of sequences in my MSA, I'll get maybe a slightly different conformation or maybe a drastically different conformation. And if you do this right, sometimes you can get out the different conformations of these fold-switching proteins. And for KaiB, they've done this successfully, and for a few others, but it's not a super robust method. And I think AlphaFlow is more informative. Boltzmann generators or Distributional Graphormer, I think, are significantly better in terms of how they approach the problem, because they're gonna give you more robust information.
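The conservation signal Amelie describes can be sketched in a few lines: for each column of an alignment, measure how dominated it is by a single amino acid. This toy version uses made-up sequences and a simple majority frequency rather than the entropy-based scores real MSA tools compute:

```python
from collections import Counter

def column_conservation(msa):
    """For each column of a set of aligned sequences, return the
    frequency of the most common residue. Columns near 1.0 are
    highly conserved and likely functionally important."""
    scores = []
    for col in zip(*msa):
        counts = Counter(r for r in col if r != "-")  # ignore gaps
        total = sum(counts.values())
        scores.append(counts.most_common(1)[0][1] / total)
    return scores

# Toy MSA of four aligned homologs; columns 0 and 3 never vary.
msa = ["MKVL",
       "MRVL",
       "MKIL",
       "MAVL"]
print(column_conservation(msa))  # -> [1.0, 0.5, 0.75, 1.0]
```

Under this toy metric, mutating a position scoring 1.0 would be the risky move she describes, while the 0.5 column has tolerated substitutions throughout evolution.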
Nathan Labenz (28:32) Okay. That's interesting.
Amelie Schreiber (28:33) So, yeah. I think AlphaFold Multimer works really well. You give it the two proteins, or multiple proteins, and you find out if there's an interaction there. It tells you what the quality of the interfaces are between the two proteins that you're predicting in the Multimer. You could generalize this. Right? And this is something that hasn't been done yet. So for people who are looking for research projects, this would probably be a pretty good one. If you generalize this to AlphaFold Multimer, do the same clustered-MSA thing with AlphaFold Multimer, and it works well, you could figure out how transient those interactions between two proteins are. Right? Because now you have two proteins or more that are in different conformations at different times, and AlphaFold will generate all these different conformations for you, and then you can analyze all these different conformations and figure out how transient the interactions between the two proteins are and get an idea of the dynamic side of things. So hopefully that gives you some idea. I think computational approaches are proving to be much faster, much more effective, and you can scale them. And this is really good, because we need things that are more computationally efficient so that we can do this for a lot of proteins, right, because we have a lot of proteins to do this for. We're working sometimes with millions or hundreds of millions of proteins. Having millions of variants of a protein and being able to assess them and determine their quality by some metric has become a lot easier, even just in the past couple of years. So a good example of how AI has sort of replaced wet lab methods: doing directed evolution in a lab, in a lot of ways, now that we have AI tools to do similar things, feels a little bit unnecessary.
Like why do we need to go and inject an animal and wait some amount of time for this to play out inside the animal, and then actually synthesize these things by hand in a lab somewhere, when we can do very similar things computationally and often get better results?
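A toy sketch of the "how transient is this interaction" analysis proposed above: given an ensemble of predicted conformations of a two-chain complex, count the fraction of conformations in which the chains are actually in contact. The coordinates and the distance cutoff below are made up purely for illustration.

```python
import math

def min_interchain_distance(chain_a, chain_b):
    """Smallest pairwise distance between two chains' (x, y, z) atoms."""
    return min(math.dist(p, q) for p in chain_a for q in chain_b)

def contact_fraction(ensemble, cutoff=5.0):
    """Fraction of conformations in which two chains are in contact.

    `ensemble` is a list of (chain_a_coords, chain_b_coords) snapshots.
    A value near 1.0 suggests a stable interface; a low value suggests
    a transient interaction.
    """
    in_contact = sum(
        1 for a, b in ensemble if min_interchain_distance(a, b) <= cutoff
    )
    return in_contact / len(ensemble)

# Toy ensemble: two single-atom "chains" that touch in 1 of 4 snapshots.
ensemble = [
    ([(0, 0, 0)], [(3, 0, 0)]),   # in contact (3.0 <= 5.0)
    ([(0, 0, 0)], [(20, 0, 0)]),  # apart
    ([(0, 0, 0)], [(25, 0, 0)]),  # apart
    ([(0, 0, 0)], [(30, 0, 0)]),  # apart
]
print(contact_fraction(ensemble))  # → 0.25
```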
Nathan Labenz (30:34) Hey. We'll continue our interview in a moment after a word from our sponsors. So it seems in some sense surprising that you would be able to make a new black box system to make these predictions, and that it would be faster and more accurate, as opposed to just running the physics. I have a hypothesis on this, but I'm
Amelie Schreiber (30:54) Yeah.
Nathan Labenz (30:54) How is that leap happening?
Amelie Schreiber (30:56) Okay. So I think the key here is compression. Neural networks are compressors of information. Whereas in molecular dynamics, you could simplify things by simplifying the forces, or simplifying the model in terms of how complicated the physics are that you're using to model the problem. Like, if I strip everything down and just do bare-bones Newtonian physics, I can simplify things that way, but there's no real compression happening. For these AI models, you can think of them as functions, but you can also think of them as compressors of information. You're taking something complicated, and sometimes noisy, and you're compressing it and providing a representation of it that is more compact. Traditionally, think of a deep learning neural network: you've got data, maybe you have a train/test/validation split, you train on your training data and you see how it performs on your test data, and that's your trained model. But you can also train models with physics constraints, right? There are some approaches that people are using that are completely data free and are based on physics, and the model is learning the physics and compressing that physics. You get something that is faster, like orders of magnitude faster, right? You get something that produces your answer in like a minute instead of 4 hours or days. And if you've done it right, it generalizes to systems that it hasn't been trained on before. And people are doing this for this problem in particular, for getting the Boltzmann distribution of a protein and getting all these ensembles of conformations and the transitions between the states. Does that make sense?
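A deliberately tiny illustration of the data-free, physics-constrained training described here: instead of fitting simulation trajectories, we fit a one-parameter force model directly against a physics residual (Hooke's law), with hand-rolled gradient descent. Real physics-informed models and Boltzmann generators are vastly more elaborate; this only shows the shape of the idea.

```python
# Toy "data-free, physics-constrained" fit: learn a force model F(x) = a * x
# by minimizing its residual against Hooke's law F = -k * x at sample points.
# No simulation data is used; the physics itself is the training signal.

k = 2.0                                  # "true" spring constant
xs = [i / 10 for i in range(-10, 11)]    # sample points in [-1, 1]

a = 0.0                                  # learnable parameter
lr = 0.05
for _ in range(300):
    # gradient of mean (a*x - (-k*x))^2 with respect to a
    grad = sum(2 * (a * x + k * x) * x for x in xs) / len(xs)
    a -= lr * grad

print(round(a, 3))  # → -2.0  (the model has recovered F = -k*x)
```

The point of the sketch: the "training data" is just a grid of sample points where the physics is enforced, yet the learned model generalizes to any x, which is the flavor of the data-free approaches mentioned above.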
Nathan Labenz (32:31) Yeah. That was pretty much my hypothesis, that it's essentially learning higher order concepts beyond the raw physics. There's this striking observation that comes up over and over again that, okay, yeah, language models, they're only trained to predict the next token, but they not only seem to be generalizing certainly beyond the narrowly defined bounds of their training data, and perhaps to some degree even more than that, it's a hotly debated topic. But what is pretty clearly demonstrated at this point is that in the middle to late layers of a language model transformer, the techniques are there to say, okay, this pattern of activations seems to correspond to this higher order concept that we care about. Which is kind of a miraculous thing, that it's just predicting the next token, but it's learning these concepts of justice and fairness and ethics and whatever that are obviously useful to predict the next token, and that's presumably why they're arising, but not something that's been, like, specifically coded for. So I guess in the application to biology, basically the same phenomenon is happening, where raw data or raw simulated physics, whatever, is the input. I'm guessing that the token level vocabulary of a protein would be the 20 amino acids, but the sort of higher order concepts are like, oh, this chunk of a thing is reused a lot. And these two chunks of things interact with each other in a particular way. Give us a little more intuition of that. What are the higher order concepts that these things seem to be learning?
Amelie Schreiber (34:14) Yeah. So the concept that I think you're trying to grasp onto is protein motifs. Motifs are like recurring patterns that happen in proteins. They're short little sequences of amino acids that recur often in lots of different proteins and generally have a very similar structure across different proteins. If we put aside dynamics for a moment and we just look at structure prediction, there's another model, ESMFold. This is an alternative model to AlphaFold2 that does essentially the same thing. It predicts the 3D structure of the protein in some low energy state, right? It doesn't perform as well as AlphaFold2, but it does pretty well. And it's a language model. It's something that people call a protein language model. It's built on the BERT architecture actually, which sounds kinda bad, right? Because BERT's this older model that's only used for specific things now. The GPT models have overshadowed or outshined the BERT models at this point. But in biology, this actually makes a lot more sense, because you have the masked language modeling objective that you train on for protein sequences: you just mask out some of the amino acids in the protein sequence and have it predict what those are, right? And just by training it to do this, and then putting a folding model on top of it called the Evoformer, which actually comes from the AlphaFold2 architecture, you don't train on any physics, but somehow you learn how to predict 3D structures of proteins. And so in this case, it's almost like physics mostly isn't needed. If you just wanna predict a static structure of a protein, you can get pretty good results just using a BERT-type architecture with a masked language modeling objective, training on millions of proteins. You get something that will fold proteins for you pretty darn well. There are some really nuanced architectural differences. AlphaFold2 uses multiple sequence alignments, and it also uses templates.
And in general, it is better performing because it has these extra added things in the architecture that improve its ability to predict the structure. But even there, there isn't really a lot of physics explicitly happening. We're not giving it forces and potentials and things like that. And yet somehow we're able to predict the structure of the protein with really high accuracy for a lot of proteins, for most proteins. I guess that's another example of where AI can be a lot better than molecular dynamics, because if you wanna model a protein actually folding in a molecular dynamics simulation to get the folded structure of the protein, this is pretty hard and time consuming and computationally intensive. Whereas for AlphaFold2 or ESMFold, you give it a big protein and it takes a few minutes. These are language-model-like models, especially ESMFold. ESMFold is pretty much just a language model. And so they are learning higher order concepts for sure, just as they do for natural language. You can do topic modeling and things like that on these models, where you're looking at groups of amino acids instead of just the individual vocabulary elements. Another thing people have found is the attention maps recapitulate the contact maps for proteins. The contact map is like a 2D matrix representation of all the contacts between the different amino acids in the protein. And it turns out the attention map, the matrix that you get from your attention mechanism, recapitulates that and is highly correlated with those contact maps. There's some work on this, maybe 3 years old now, called BERTology Meets Biology. And they do a really in-depth study of how to pull out these different things, active sites and binding sites and motifs and things like this, based on the attention maps in the protein language model. But I would say training a model to predict binding affinity is a hard thing to do.
And there's actually a big problem with a lot of these models. Most models that try to predict protein interactions don't generalize well. When people are training them, a lot of times people don't split their data in the right way. If your training data has sequences in it that are very similar to your test data sequences, you're gonna get overfitting and you're not even gonna realize it. And a lot of the protein interaction models that have been trained don't take this into account and they don't split their data based on like sequence similarity or structural similarity.
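For concreteness, the contact map referred to above, the 2D matrix that protein language model attention maps turn out to correlate with in the "BERTology Meets Biology" line of work, can be computed from coordinates like this. The toy C-alpha trace and the 8 Å cutoff are just common illustrative conventions, not part of any specific tool.

```python
import math

def contact_map(coords, cutoff=8.0):
    """Binary contact map: entry (i, j) is 1 when residues i and j are
    within `cutoff` of each other, else 0.

    `coords` is a list of (x, y, z) positions, e.g. one C-alpha per residue.
    """
    n = len(coords)
    return [
        [1 if math.dist(coords[i], coords[j]) <= cutoff else 0 for j in range(n)]
        for i in range(n)
    ]

# Toy C-alpha trace: three residues on a line, 6 units apart.
coords = [(0.0, 0.0, 0.0), (6.0, 0.0, 0.0), (12.0, 0.0, 0.0)]
for row in contact_map(coords):
    print(row)
# → [1, 1, 0]
#   [1, 1, 1]
#   [0, 1, 1]
```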
Nathan Labenz (38:46) The overfitting piece, I didn't quite understand it.
Amelie Schreiber (38:48) When you're training your neural network, you want to make sure that your test data doesn't have a bunch of sequences in it that are highly similar to your training data, because it's almost like you're training on the same thing twice, and you overfit this way. You don't get a good indication of how well your model generalizes, and a lot of people do this. A lot of people coming from deep learning don't have a super strong background in biology, and they make this mistake very often. And so there are just tons and tons of models out there that don't generalize and that are very overfit, because they didn't split their data based on sequence similarity and/or structure similarity.
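A sketch of the similarity-aware split being described: cluster sequences at some identity threshold, then assign whole clusters to train or test, so near-duplicates never straddle the split. Real pipelines use dedicated tools like MMseqs2 or CD-HIT for the clustering; the greedy scheme and toy sequences here are only illustrative.

```python
def identity(a, b):
    """Fraction of identical positions between two equal-length sequences."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def similarity_split(seqs, test_fraction=0.25, threshold=0.4):
    """Split sequences so no test sequence is >= `threshold` identical
    to any training sequence.

    Greedily cluster at `threshold` identity, then assign whole clusters
    to test until the test set reaches its target size, and the rest to
    train. Because clusters never straddle the split, near-duplicates
    can't leak from train into test.
    """
    clusters = []
    for s in seqs:
        for cluster in clusters:
            if identity(cluster[0], s) >= threshold:
                cluster.append(s)
                break
        else:
            clusters.append([s])
    test, train = [], []
    target = test_fraction * len(seqs)
    for cluster in clusters:
        (test if len(test) < target else train).extend(cluster)
    return train, test

seqs = ["MKVLA", "MKVLS", "QWERT", "HHHHH"]
train, test = similarity_split(seqs)
print(train, test)  # → ['QWERT', 'HHHHH'] ['MKVLA', 'MKVLS']
```

Note the two near-duplicates (`MKVLA`, `MKVLS`) land on the same side of the split; a naive random split would frequently separate them and inflate test metrics.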
Nathan Labenz (39:25) Interesting. Okay.
Amelie Schreiber (39:26) Yeah.
Nathan Labenz (39:26) That's not really done in language modeling. Right? In language modeling, you could certainly have lots of sentences that are very similar, and you don't draw conceptual lines between different parts of the data and try to generalize across those lines. So what is the difference in the biology context? My intuition says that to the degree that these models are learning the higher order concepts that really matter, they should be able to learn those kind of regardless. And, again, in the language model case, we don't really have a concept of overfitting, at least in the large scale foundation models.
Amelie Schreiber (40:04) I mean, there's still a concept of overfitting in NLP and training big giant models, Claude or ChatGPT or whatever. You can definitely still overfit these really big chat-type models, like really big LLMs and stuff. But first of all, they often only train on one pass through the data. Right? They don't do multiple passes. And so partially because of that, overfitting is not as much of an issue, because you're not training multiple times on the same data.
Nathan Labenz (40:30) Well, I do think that is changing. One
Amelie Schreiber (40:31) Yeah.
Nathan Labenz (40:32) little nugget from the recent Zuckerberg Dwarkesh interview that definitely made my ears and brain light up was when he said they trained these latest Llama models on 15 trillion tokens. He said we stopped because, at some point, we need to move on to Llama 4. But he said we could have rotated the high value tokens through again. And I was like, oh, interesting.
Amelie Schreiber (40:54) Yeah.
Nathan Labenz (40:55) Anyway, that's a bit of an aside, but I'm still not quite grokking it, to be honest.
Amelie Schreiber (41:01) Okay. So let's see. You have your training data and your test data. Your test data is supposed to tell you something about how well the model generalizes. Right? If the metrics on your test data are substantially worse than the metrics on your training data, then you've overfit. Right? So to get a good idea of how well your model is generalizing to unseen data, you want at least part of your test data to be very dissimilar from your training data. And a big thing that people discuss is whether or not these things generalize to data that's out of distribution. And if you train them in the right way, you can get models that generalize well to out of distribution data. That means it's picked up on some deeper concept that generalizes to areas that it hasn't seen before, and it's using those really deep concepts to make predictions or generate things that are out of distribution. This is not something that's impossible to do, you just have to do it right. And in the case of biological sequences like proteins, the way that you do this is you look at the sequence similarity between the protein sequences in your training data and your test data. And you can also look at structural similarity, because you can have different sequences that fold into the exact same structure, more or less. And so sometimes, depending on what your model architecture is and what your goal is, you may also wanna split based on structural similarity and not just sequence similarity, to make sure that your model is generalizing and not overfitting. Does that make sense? Yeah.
Nathan Labenz (42:43) I think so. For any given architecture, presumably, the very best performance would be if you trained on all the data. But in order to effectively evaluate your architectures along the way, you need to have some holdout to evaluate against. And these similar-enough sequences basically are a data leak, in the same way that if you find out after you do your thing that, oh, hey, we actually had MMLU in the training data for the language model, then it means, oh, well, we can't really trust the MMLU scores anymore. Right? And the model still may be good or may not be so good, but if you train on MMLU, we can't really score you on MMLU. This is a similar
Amelie Schreiber (43:25) You can't trust that. Yep. Exactly. And there are even more nuanced ways of splitting the data too, for specific things. They trained a new version of DiffDock. I think they call it DiffDock-L, because it's larger. It's the large DiffDock. And they use this method called confidence bootstrapping, which is sort of like reinforcement learning. And they also split their data in a really unique way. They split their data based on the domains that are present in these interactions, to train a model for docking. So you have a ligand, a small molecule ligand, that you're trying to dock into a protein, and you're doing blind docking, which is a hard problem. And they get substantially better performance, and it generalizes a lot better in this new version of DiffDock, because of the way that they trained it and the way that they split their data. And they have really strong confidence that their model is generalizing to out of distribution data, and that it's able to predict on things that it hasn't seen before that are very dissimilar from what it has seen.
Nathan Labenz (44:27) So, in this way, just to make sure I understand, what they've demonstrated is that you should be more confident in doing this stuff in totally new regimes based on the fact that they've been very meticulous in the data split. Whereas if they throw everything into the training, then you would have a hard time knowing to what degree to be confident in out of distribution stuff. Yeah. Okay. Cool.
Amelie Schreiber (44:50) And they did a really good job with that model. I think their performance boost was pretty big. They got a lot better performance out of it too. So it's definitely a good model to look into if you're looking to get into this stuff. Something else we should probably get into is training models that are like AlphaFold but do more than just proteins. You might have complexes where there's a protein and a small molecule, or a protein and DNA, or a protein and RNA, or a metal or whatever. There are all these other biomolecules that you might wanna model in complex with each other. And to generalize AlphaFold Multimer to this new situation is pretty hard, but people have managed to do a pretty reasonable job of it. There's still room for improvement on these, but a new model that just came out recently is RoseTTAFold All-Atom. RoseTTAFold All-Atom will let you predict complexes that have proteins and small molecules, proteins and nucleic acids, and proteins and metals too. It's a big step forward, because now you can apply this idea to these other complexes, and you can get this score that tells you how likely there is to be an interaction there and how strong that interaction is. And you could also probably do the AlphaFold-type method with something like RoseTTAFold All-Atom as well and get something like conformational ensembles out of it. And something like that is probably coming soon. So yeah, I would say now a lot of the more interesting models are more influenced by diffusion and flow matching models. A lot of the generative models that we're getting that are predicting the Boltzmann distribution, or that are generating new protein structures for you with specific shapes and functions, or that are allowing you to design new sequences that fold into a particular backbone, a lot of them are more influenced by diffusion and flow matching than language modeling and transformers.
Some of them use transformers, but they're starting to be a lot more influenced by DALL-E type models, I would say. You're starting to get this really fine grained control over what kinds of proteins or small molecules or nucleic acids you can generate. And there are actually some models that have come out that are text conditioned. So a couple of the models that I mentioned were ProteinDT and MoleculeSTM. These are generative models that are text conditioned. They use the exact same method as CLIP. They use contrastive learning, and they have captions and molecules or captions and proteins. And they allow you to type in a natural language prompt and get out a molecule. So with ProteinDT, for example, or MoleculeSTM, you can give it a text prompt in natural language describing the properties of the molecule, in just natural human language, saying this molecule has these chemical properties, and it has this sort of bias away from or towards these amino acids, or it interacts with these other molecules in such and such a way. You can give it these natural language text prompts, and it will generate proteins and small molecules for you that satisfy these constraints or that correspond to the text prompt that you give it. And there are more models like this that have come out recently that do similar things. They're text-conditioned diffusion models or flow matching models that are generative and that produce molecules with specified properties based on natural language text, which in my mind is kind of mind blowing. I think that's amazing. Having a model that will just generate molecules for you that fit natural language descriptions, that's crazy. Like, that's amazing.
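The CLIP-style contrastive objective that ProteinDT and MoleculeSTM borrow can be illustrated with fixed toy embeddings: paired (caption, molecule) embeddings should score a lower loss when correctly aligned than when shuffled. Everything below is a toy with hand-picked vectors; real models learn the embeddings end to end.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def clip_loss(text_embs, mol_embs, temperature=0.1):
    """CLIP-style contrastive loss over (caption, molecule) pairs.

    Row i of the similarity matrix should put its probability mass on
    column i, the matching molecule; we average the cross-entropy of
    each row toward its own index.
    """
    n = len(text_embs)
    loss = 0.0
    for i in range(n):
        logits = [cosine(text_embs[i], m) / temperature for m in mol_embs]
        log_denom = math.log(sum(math.exp(l) for l in logits))
        loss += -(logits[i] - log_denom)  # cross-entropy toward match i
    return loss / n

# Aligned toy embeddings: each caption points at "its" molecule.
texts = [(1.0, 0.0), (0.0, 1.0)]
mols = [(0.9, 0.1), (0.1, 0.9)]
aligned = clip_loss(texts, mols)
shuffled = clip_loss(texts, [mols[1], mols[0]])
print(aligned < shuffled)  # → True
```

Training pushes embeddings toward the "aligned" regime, which is what later lets a text prompt retrieve or condition generation of a matching molecule.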
Nathan Labenz (48:40) Yeah. No doubt. Okay. In actual work, like, what are the cycles that are used to actually make progress on questions of interest? Like, you have a target, you wanna make some modification. How do these things fit together to allow us to work a lot faster and get more value from our limited wet lab resources?
Amelie Schreiber (49:00) Are you familiar with RFdiffusion at all?
Nathan Labenz (49:03) No, not really.
Amelie Schreiber (49:03) RFdiffusion is a diffusion model that took the RoseTTAFold backbone and trained it as a diffusion model to denoise 3D structures of proteins. You don't give it a text input, but you can condition it on other types of things. It's a little more specific, and it's a little more geared towards people who wanna perform surgery on proteins. RFdiffusion doesn't usually get used on its own, because RFdiffusion just works at the level of structure. It's doing the denoising process on the structures of the proteins. It doesn't know anything about the sequence of the protein. So you need a separate model like LigandMPNN to design the sequence for that 3D structure. So they kind of go hand in hand. RFdiffusion designs the structure, and then LigandMPNN designs a sequence that will fold into that structure, so that you have something you can actually go and synthesize in a lab. You need a protein sequence to synthesize to get the protein structure. So once you've generated the 3D backbone of your protein, you design a sequence for it using LigandMPNN. So if you wanna design new proteins with specific structural properties, and you wanna get your hands dirty and perform surgery on these proteins and graft them together, or take pieces from this one and graft them onto that one, or generate a protein with a specific sort of fold or tertiary structure, or maybe generate a protein that binds to a specific protein with really high affinity and specificity, these are the things that RFdiffusion is good for. There are several different versions of RFdiffusion. There's a new version that just came out called RFdiffusion All-Atom, which allows you to generate proteins that will bind to small molecules. So you can generate proteins with binding pockets that are highly shape complementary to a small molecule ligand.
And with RFdiffusion and RFdiffusion All-Atom, you can also do something called motif scaffolding, which is where you pull out motifs from a protein that you like, maybe they're binding sites or active sites, or maybe there's a particular segment of the protein that binds to a different protein that you're interested in or performs a specific function, and you can pull these pieces of the protein out and scaffold them, building a new protein around those pieces that holds them in place, so that you have a new protein with a very specific 3D structure that performs a very specific function. It's been a very influential model and a very useful model, and it's definitely one of my favorites.
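The structure-then-sequence loop described here, RFdiffusion for backbones, then LigandMPNN for sequences, might be orchestrated roughly as below. The function names and signatures are hypothetical stand-ins, not real APIs; in practice each stub would shell out to the actual tools, and candidates would then be scored with a structure predictor.

```python
import random

# Hypothetical stand-ins for the real tools. These stubs only fabricate
# plausible-shaped outputs so the orchestration logic can be shown; they
# are NOT the real RFdiffusion or LigandMPNN interfaces.

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def rfdiffusion_backbone(length, seed):
    """Stub: pretend to sample a backbone; returns a fake coordinate list."""
    rng = random.Random(seed)
    return [(rng.random(), rng.random(), rng.random()) for _ in range(length)]

def ligandmpnn_sequence(backbone, seed):
    """Stub: pretend to design a sequence that folds into `backbone`."""
    rng = random.Random(seed)
    return "".join(rng.choice(AMINO_ACIDS) for _ in backbone)

def design_candidates(length=50, n_backbones=3, seqs_per_backbone=2):
    """Structure first, then sequence: many backbones, many sequences each."""
    candidates = []
    for b in range(n_backbones):
        backbone = rfdiffusion_backbone(length, seed=b)
        for s in range(seqs_per_backbone):
            seq = ligandmpnn_sequence(backbone, seed=1000 * b + s)
            candidates.append({"backbone": backbone, "sequence": seq})
    return candidates

designs = design_candidates()
print(len(designs), len(designs[0]["sequence"]))  # → 6 50
```

The fan-out shape (hundreds of backbones, hundreds of sequences per backbone) is the part that carries over to real pipelines; only the stubs are fake.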
Nathan Labenz (51:41) What exactly is it that you're specifying? You're saying, I want it to have certain properties, and then it's filling in the sequence that gets you to those properties?
Amelie Schreiber (51:52) Yeah. The most basic usage of RFdiffusion is unconditional generation of a protein. You just tell it some length or range of lengths, and it will generate proteins of that length that are brand new, that have never been seen in nature before, that very often are very designable, and that you can actually synthesize and create in a lab. And that's just completely unconditional generation. You don't give it any sort of constraints. It just generates a new protein for you of a particular length. One of the really interesting functions of RFdiffusion is it will design binders for you. So if I have a target protein and I want to design a protein that binds to it with really high affinity and really high specificity, RFdiffusion's really good at that. It'll design binders for pretty much any protein you can give it. And very often, the proteins that it designs that bind to your target protein are very high affinity, very high specificity, and often very thermostable. And using LigandMPNN, you can increase your binding affinity and other properties even more. And then the other functionality that's really useful is the motif scaffolding, which is where you pick out pieces of a protein or multiple proteins, and you build a new protein that holds all those different pieces in place in a particular way. And so one thing you could imagine doing with this would be building or designing an adjuvant, or grafting two proteins together. Or let's say I have one part of a protein that binds to protein A and I have some other protein that binds to protein B, and I wanna design a protein that binds to both protein A and protein B so that I can get a complex with three proteins in it. Right? I wanna have a protein that binds to protein A and protein B.
So what I can do is take the pieces out of the protein that bind to protein A, and take the pieces that bind to protein B, and scaffold them together into a new protein, and now I've got a new protein that binds to both of those proteins. And I can do things like build out protein interaction networks this way. So if I wanna design a particular protein interaction network, this is really useful for that. That's a pretty technical and involved thing to do, it is a nontrivial thing to do, but you can do it with RFdiffusion. You can design proteins that will bind to multiple different proteins in multiple different contexts and build out these networks this way. And this gives you a way to modulate protein interaction networks, right? Because you can modulate your protein interaction network by blocking particular interactions by designing a binder, but you can also add in interactions that weren't there before using the motif scaffolding, or maybe binder design and motif scaffolding together. So it's very much geared towards people who are really interested in performing surgery on proteins and protein interaction networks. And it's very useful in a lot of different contexts. And now with RFdiffusion All-Atom, also with other interactions as well, because with the all-atom models, you can model more interactions than just the protein interactions.
Nathan Labenz (55:12) So how does this in practice get used? Maybe the cancer example from the top is a good one. In that scenario, we already had an understanding that the cancer cell has this thing that disables the immune cell. And so then you could say, okay, let's brainstorm some ideas here. Right? If we can bind something to that bit of the cancer cell that would cap it, blunt its ability to then interfere with the immune cell, then the immune cell presumably would do its job still and kill the cancer. Of course, you've got highly tangled webs of interaction that you're doing this within, but you've got something local like that. Now kind of take me through the cycles and the practical application of some of these tools for a specific problem of that sort.
Amelie Schreiber (56:05) Yeah. So the one we were talking about earlier with the cancer, with PD-1 and PD-L1, is a really good first example, and I think a really good place for pretty much anybody to start, because you can very easily and very quickly design a binder to one of these proteins with RFdiffusion. It's really good at designing binders for pretty much anything. If I have this PD-1/PD-L1 interaction between these two proteins, essentially what's happening is the cancer is turning off your immune system's response to it, and then your immune system doesn't effectively combat the cancer. But if I design a binder to one of these two proteins to block that interaction, and my binder outcompetes that interaction, so it has a higher affinity than the PD-1/PD-L1 interaction has, then I can successfully block that interaction and prevent the cancer from turning off the immune cell's responses to it. Another way that RFdiffusion can be used, so let's say you have a protein that binds to a target, but it doesn't bind very well. The affinity is low and the shape complementarity isn't terribly good. I can use RFdiffusion and ProteinMPNN or LigandMPNN to modify that and increase the binding affinity. So something I can do with RFdiffusion is partial diffusion. I can take my protein of interest that binds to my target and perform partial diffusion, which is where I add a little bit of noise. I don't completely noise it, I don't turn it into a Gaussian distribution or anything, I just add a little bit of noise into it and then have RFdiffusion denoise that. And what it'll do is denoise it into a structure that's kind of similar to the one that I started with, but it's also a little bit different now, because I added that noise and then denoised it with RFdiffusion. And if I do this and I give RFdiffusion the target protein that I'm interested in, I can improve the shape complementarity between those two.
So the new denoised version of my original protein is gonna have a higher shape complementarity to my target. And then I can go and design sequences for that backbone that are different from my starting sequence. So maybe in my starting sequence there were some residues that had unfavorable chemical properties at the interface. Maybe I need to use different amino acids to improve some chemical properties so that the binding affinity goes up. I can do that with ProteinMPNN or LigandMPNN. So LigandMPNN is the newer all-atom version of ProteinMPNN. It's an improvement on ProteinMPNN, and it also generalizes to contexts where you have other kinds of molecules, other than just proteins. And I design a sequence with LigandMPNN, and I can tell LigandMPNN to bias certain residues towards certain amino acids or away from certain amino acids, like at the interface for example, to get those more favorable chemical properties to increase my binding affinity further. There are also versions of LigandMPNN that will allow you to specify whether a residue is a buried residue, an interface residue, or something else. Yeah, so using partial diffusion with RFdiffusion to slightly noise and then denoise the structure that you already have can help you increase things like binding affinity and thermal stability and other properties. Then let's see, what's another application of RFdiffusion? Okay. Symmetric generation. You can design symmetric oligomers. Let's say I want to have a protein that interacts with other copies of itself, and if I have enough copies of this protein, it forms a symmetric complex. For example, let's say I have a protein that fits together with two other copies of itself in a triangle formation, and it's symmetric under the cyclic group of order 3. So I've got an order-3 rotational symmetry happening. I can use RFdiffusion to generate proteins with symmetry like this. And it has other symmetries that it can use.
So there are dihedral symmetries, icosahedral symmetries, and tetrahedral symmetries in addition to the cyclic symmetries. And for the cyclic and the dihedral, I can choose any order cyclic group or dihedral group for my symmetry. And RFdiffusion will design a symmetric complex of proteins where there's like multiple copies of the protein in this nice symmetric structure. This occurs in nature in multiple places. One example is in viral capsids. And David Baker's lab also recently published some work where they designed these symmetric oligomers as biosensors. And I think they also designed some to help with drug delivery. So the pocket that forms inside a cyclically symmetric complex of proteins, you can design that in such a way that it captures a small molecule drug really well and delivers it to somewhere in particular. Right? So that's another application of RFdiffusion that's really useful. What else? There's the binder design, fold conditioning, which is where you can tell it specific tertiary structures to condition on. Maybe I wanna design something like a TIM barrel. I can tell RFdiffusion to generate something that has the rough 3D shape of a barrel. So I can tell RFdiffusion specific tertiary structures to generate, and it will generate these different folds for you. It's called fold conditioning. Yeah, so here's a good example of how to use motif scaffolding. Let's say I want to design an inhibitor for two different proteins that are interacting with each other. I could extract the motif corresponding to the interaction and then use that to design a new protein, scaffold that with the motif scaffolding. And then I can optimize it using partial diffusion and LigandMPNN to design sequences that have favorable chemical properties.
And then I can check and see using the LIS score from AlphaFold Multimer how well my new protein interacts.
Nathan Labenz (1:02:41) If I'm getting it, it's like, okay, we have an interaction that is problematic. We wanna interfere with it one way or another. We can, like, cap the one thing or cap the other thing. And so there's this kind of cycle of: generate a 3D structure, that's one model. Generate a sequence to do that, that's another model. Refine those to look for favorable chemical properties. And then you're validating this, to the degree that you can validate it before you're actually doing any actual lab work, with something like AlphaFold Multimer and looking for a high score there. So how quick does this happen? It seems like we're talking orders of magnitude speedup compared to the pre-AI way of doing this? And how accurate is it?
Amelie Schreiber (1:03:31) Yeah. I mean, with RF diffusion, I can generate a single backbone in a minute. Even a pretty big one, it just takes a minute to generate a backbone. And if you have a really good GPU, that's way faster. If I'm just working on my laptop or something, I can generate something in just a minute. And then ligand MPNN is significantly faster than that. So designing a sequence for the 3D structure is actually really fast. Per protein and per sequence, it's less than a couple of minutes to do that whole thing. And so I can design hundreds or thousands of backbones and then design hundreds or thousands of sequences for each backbone with ligand MPNN. So it's pretty fast and pretty computationally efficient. And they've made some improvements to RF diffusion to reduce how many time steps you need to actually get a good structure out of it. And time and time again, when they actually synthesize these and check for things like thermal stability or specificity or binding affinity, very often they're very high. And if I have low thermal stability or low binding affinity or low specificity, very often when I use these models to improve that, they improve it dramatically. So they're pretty effective. They're not perfect, but they're pretty good. It's not a hard thing to design a protein with RF diffusion and then design a sequence with ligand MPNN and get something that really does fold into that structure with high confidence, or that really does interact with my target protein with high affinity and specificity. Yeah.
Nathan Labenz (1:05:05) So how does this then fit into the broader validation loop? Going back to our cancer example, right? This part of the cancer cell interacts with this part of the immune cell and disables it. So you wanna put essentially a physical cap on that thing so that it's blocked, that interaction can't happen, and the cancer can't disable the immune cell. Okay, cool. So I'm gonna generate potentially thousands of shapes, thousands of sequences per shape, then potentially run millions of things through AlphaFold Multimer, getting scores. And then what do I do at the end of that? Do I take, like, the top 100 and go actually try them in a living cell?
Amelie Schreiber (1:05:50) Yeah. Exactly.
Nathan Labenz (1:05:50) And you're saying these tend to work. And then I guess we also have questions of, like, side effects as another big downstream question, right? We don't know what else this thing could do when put into the full environment of the cell.
Amelie Schreiber (1:06:05) Well, that's where the specificity comes in, right? Generally speaking, RF diffusion is capable of designing binders, for example, with really high specificity. So they bind to the target and pretty much only bind to the target. Yeah, and that's a really good thing, right? Because if you design a binder that just interacts with a whole bunch of other stuff, it may never make it to the target. It may cause other side effects, like you're saying, have off-target effects. But yeah, RF diffusion is pretty capable of designing really high-specificity binders. With motif scaffolding there's more nuance, because you're performing surgery on proteins and pieces of proteins, and so that can get a little more nuanced and a little more complicated. But if you're just designing a binder with RF diffusion, it's pretty easy to design one that has really high specificity and that doesn't have a lot of off-target effects and stuff.
Nathan Labenz (1:06:55) So what's the bottleneck on this process? Is it identifying targets? Especially if you can design things that are that specific and they don't have much in the way of other knock-on effects, I mean, it seems like we should be curing a lot of diseases pretty quick here.
Amelie Schreiber (1:07:13) I think, I mean, diseases are pretty complicated, and sometimes it's not just one thing that you're targeting. Sometimes it's multiple targets and multiple kinds of interactions. And so you have to think in terms of entire protein interaction networks, or even interaction networks that involve other molecules as well, and modulating those in very specific ways. And often figuring out what parts of a protein interaction network to change or to modulate, and how to do that, what sort of changes you should make to this interaction network, that's a pretty complicated problem. And so I think part of it is identifying specific targets within an interaction network, because some of the interaction networks can get really complicated, and figuring out which parts of them to modulate is not an easy thing to do. And then computationally, there isn't much of a bottleneck. If you just have a good GPU, you don't really need multiple GPUs even. You just have a good H100 or something. You can get a lot done with that with RF diffusion and, like, an MPNN. The hardware requirements are not very taxing. They're probably a little less efficient than I would like them to be, because most average people are not gonna have access to an H100. Like, a lot of people are gonna be using much lower-tier GPUs and stuff. But with RF diffusion, for example, there's a Colab notebook that you can run on a T4. You can run it in Google Colab, and it takes a little while, it's not as fast, but the computational bottleneck is not a huge bottleneck right now. I think it could be a little better, but the models aren't huge models. These aren't billions of parameters. I think RF diffusion has a couple hundred million parameters or something like that, and AlphaFold, same thing, it's a couple hundred million parameters or something. So they're not really big models that require really excessive hardware to do the computations.
And I think the slowest part of that pipeline, where you design binders or design motif scaffolds with RF diffusion or partial diffusion or what have you, then design the sequence with protein MPNN or ligand MPNN, and then validate with AlphaFold or AlphaFold Multimer, the slowest part in that whole thing is actually the AlphaFold Multimer part or the AlphaFold 2 part, where you're validating and predicting the structure of whatever you've generated and designed with the other two models. That's actually the slowest part in the whole thing. But if you're able to predict a lot of structures with AlphaFold, then you're fine. There's no real computational bottleneck there. And as far as why we're not curing a bunch of diseases, a lot of this also has to do with how long it takes to get these things to market. The traditional system that's set up right now for drug discovery and protein therapeutics makes it hard and slow to get new things through. It's a very slow system, and it wasn't designed for large high-throughput methods. It just takes a lot of time to get a new drug or a new protein therapeutic through and approved and get it through all the testing, like, the different clinical testing phases.
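The generate-design-validate loop discussed above can be sketched as a simple filtering funnel. The three model calls here are random stubs standing in for RF diffusion, ligand MPNN, and AlphaFold Multimer (which would be the slow step in practice); only the control flow and the final ranking are the point.

```python
import random

def design_backbone(i):
    """Stand-in for RF diffusion generating backbone i."""
    return f"backbone_{i}"

def design_sequence(backbone, j):
    """Stand-in for ligand MPNN designing sequence j for a backbone."""
    return f"{backbone}/seq_{j}"

def multimer_score(sequence):
    """Stand-in for an AlphaFold Multimer confidence / interface score."""
    return random.random()  # real pipeline: a score like LIS, in [0, 1]

def top_candidates(n_backbones=100, seqs_per_backbone=10, keep=100):
    scored = []
    for i in range(n_backbones):
        backbone = design_backbone(i)
        for j in range(seqs_per_backbone):
            seq = design_sequence(backbone, j)
            scored.append((multimer_score(seq), seq))
    scored.sort(reverse=True)   # best predicted binders first
    return scored[:keep]        # roughly the "top 100" that go to the wet lab

random.seed(0)
best = top_candidates()
```

The funnel shape is the design choice: backbone generation and sequence design are cheap enough to overproduce by orders of magnitude, and the expensive structure-prediction score is only used to rank and cut.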
Nathan Labenz (1:10:35) Are we, like, stuffing the pipeline at this point with new things? I mean, the biggest reason this used to be hard, of course, is that it's slow, and there are a lot of safety steps, perhaps redundant steps, in there. Especially if you could say, hey, we have very high confidence that this is a highly specific thing, that would really do a lot to inform what the safety profile might be. But in the past, the biggest problem was that either it wouldn't work, right, or it would have side effects that were intolerable. And if you could take both of those things, not entirely off the table, but if you could sort of say, hey, we can be much more confident now, seemingly, like, at least an order of magnitude more confident, that any given thing is gonna do what you expect it to do and that it won't do other things that you don't want it to do, if I could improve both of those by an order of magnitude, then you're seemingly two orders of magnitude more likely for things to work, which would take you from sort of sub-1%, a lot of shooting in the dark, to a lot of the clinical trials being expected to work. Is that the kind of era that we're entering into now, where clinical trials should go from, like, a roll of the dice to something where you'd be more, like, disappointed when they don't work?
Amelie Schreiber (1:11:54) Yeah. I mean, I think we're definitely moving into an era where all this stuff is gonna speed up a lot. And there are a lot of situations where you wanna understand more than just a static structure. Like, we've been talking about having models that give you more dynamic information and stuff like that. And those are so recent that they haven't really been adopted by a lot of people yet. And I think once those get used more, that'll speed things up a lot and improve things a lot as well. But, yeah, I think also part of it is just adoption. Like, a lot of researchers are not using this stuff yet because it's so new. It's such a new paradigm, such a new methodology. There isn't a ton of information out there about how to use these models. A lot of researchers don't understand how to use them effectively, or how to use them at all. Because right now, you've gotta go through this somewhat complicated process of setting up these models. And you don't have to know a lot of coding, but you do have to code some. And you have to understand how to set these things up in, like, a conda environment or something like that a lot of times. And so if you're not coming from a programming background, or you don't have some experience with programming, a lot of these models are kinda hard to touch. They're hard to use, because you can't really use them unless you know at least a little bit of coding.
Nathan Labenz (1:13:16) It seems to be quite a different skill set. I've actually studied this stuff, and the kinds of things that I was taught to do, there were no Colab notebooks involved. You know, physical techniques for separating chemicals from one another was a big part of where the time went.
Amelie Schreiber (1:13:31) Yeah. And that's a whole other thing too. Like, drug synthesis and small molecules is a whole other story that we can get into, because there are really interesting diffusion models out for those as well that'll generate molecules with specific properties. But, yeah, it's a very new and a very unique skill set. Most people working on these models are still developing their understanding. There aren't very many experts on this stuff. There aren't a lot of people that are training people to do this. I think we're all learning together. I'm definitely still learning a lot from my coworkers and other researchers in the field. And it's kind of mind-blowing how fast this stuff is developing too. It's developing incredibly fast, which is really good in my opinion. I wanna see this stuff proliferated and developed and used, because we're gonna solve problems much faster this way. There are some platforms that are popping up that are making these things a lot more accessible. One that I would mention: 310 AI is working on a tool that they're calling Copilot, and it's essentially a chat interface that uses tools. So you can talk to it in natural language, like you talk to ChatGPT or something, and it knows how to use function calling to use other models as tools. So you can say, generate a protein with such-and-such property, and then it'll use a particular model to do that, generate some protein with that particular property. And then you can say, increase binding affinity with such-and-such protein, and it'll modify your protein for you to increase the binding affinity or something. Or you can say, dock this small molecule to this protein, and it'll call on DiffDock and dock the small molecule to the protein for you and then return that. And that's all using a chat interface, which is making a bunch of these models a lot more accessible to people. And right now it's pretty good already.
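The function-calling pattern behind a tool like 310 AI's Copilot can be sketched generically: the chat model emits a tool name plus arguments, and a dispatcher routes the call to the right specialized model. The tool names and the call format below are invented for illustration; 310 AI's actual interface is not public API and may look quite different.

```python
# Each "tool" wraps a specialized model; here they are trivial stand-ins.
TOOLS = {
    "design_binder":   lambda target: f"binder for {target}",        # e.g. RF diffusion
    "design_sequence": lambda backbone: f"sequence for {backbone}",  # e.g. ligand MPNN
    "dock_ligand":     lambda mol, prot: f"{mol} docked to {prot}",  # e.g. DiffDock
}

def dispatch(call):
    """Route a parsed function call from the chat model to a tool."""
    return TOOLS[call["tool"]](*call["args"])

# What the chat model might emit after "dock this small molecule to this protein":
result = dispatch({"tool": "dock_ligand", "args": ["aspirin", "COX-2"]})
```

The accessibility win is that the user only ever writes natural language; the LLM's job is reduced to choosing a tool name and filling in its arguments, and each underlying model stays behind a stable interface.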
It uses a pretty wide range of tools for different things. Like, it'll use protein MPNN or ligand MPNN to design sequences for a structure. I'm not sure if it uses RF diffusion yet, but they just keep adding more tools to it, and they probably are gonna start including RF diffusion and some other models as tools as well. And once they have a good selection of tools for it to use, that'll be really good, and that'll lower the barrier to entry for a lot of people. So 310 AI, depending on how good that gets and how fast it gets really good, I think it's gonna make a lot of these models a lot more accessible and really increase adoption of these techniques. And I think that's gonna have a huge impact in the near future. And I think that's gonna be really big and important, because adoption right now is pretty limited. Just using a really simple metric: if you look at how many views a YouTube video on RF diffusion gets, it's not that many, right? There aren't that many people watching
Nathan Labenz (1:16:32) This is gonna be the one that goes viral and changes it all.
Amelie Schreiber (1:16:35) I hope so. That would be great, because adoption is gonna be big, I think, because the more creative people you get using these things, the more you're gonna see creative uses of them and novel approaches to solving problems that people weren't even thinking of before. Like, when have you ever been able to design proteins that build out complicated protein interaction networks using binder design and motif scaffolding? That's so brand new within the last couple of years that the adoption just hasn't caught on yet. So hopefully, more people will start experimenting with these models and really learn how to use them well, and we'll see a lot of really interesting novel techniques popping up in the near future.
Nathan Labenz (1:17:20) Yeah. This seems pretty remarkable. So it seems like the problem then shifts to the case that, given a target, we can
Amelie Schreiber (1:17:27) Mhmm.
Nathan Labenz (1:17:28) Pretty reliably and, like, even relatively computationally efficiently come up with something that will hit the target, not hit other things, you know, not cause a lot of collateral damage. Now it's like target identification becomes the thing that really matters the most. And this seems like it's happening in a lot of areas at once. Certainly thinking about, like, a military environment. We've got really good missiles that can hit very precise targets, but then the targeting obviously becomes the high-stakes decision. And in lots of business operations contexts too, right? Like, figuring out the right thing to do is often the hard part. So what do you think are the prospects for that sort of thing? I've been working through this and thinking we've got quite a tech stack for it. Obviously, we can sequence lots of DNA. We can also pull out from an individual cell now the state of the transcriptome, what genes were being expressed at any given time, what proteins were being created. I don't know quite the degree to which we can do that with small molecules with a really localized sample. But it seems like we've got a lot of ability to generate a lot of data and
Amelie Schreiber (1:18:41) Mhmm.
Nathan Labenz (1:18:41) To probably then create a foundation model of some sort. This is kind of where Evo, I sort of sense, is going, even though it's not even doing all this stuff yet. But already
Amelie Schreiber (1:18:53) Mhmm.
Nathan Labenz (1:18:53) They're training a language model on DNA sequences, purely bacteria and phage DNA from what I understand. And
Amelie Schreiber (1:19:03) Yeah. I was a little disappointed by that, but I think part of their reason for doing such a specific selection was for safety reasons. I think they're trying to be a little cautious about what they train on and what the model is capable of doing, for safety reasons. But I was a little disappointed that they only trained on that data, for sure. Hopefully, there'll be another version at some point that's trained on other data as well. But, yeah.
Nathan Labenz (1:19:30) I'm sure there's a lot more to come. So they train on all this data. Now they can generate sequences in the same way that a language model can generate text, right? Autoregressive, byte by byte, base pair by base pair generation. And then there are these really interesting things that they can do downstream of it, where, for example, there's a gene essentiality score or test. Basically, if you change a sequence in a particular gene and generate from that changed sequence, it seems that the model has developed a sort of higher-order understanding of how things fit together, such that
Amelie Schreiber (1:20:16) Mhmm.
Nathan Labenz (1:20:17) If you do make a change to something that really matters, then you see sort of an unraveling of the later generation. Like, you see a very high perplexity downstream of this changed sequence, and that reflects the fact that, hey, if you've changed that sequence, I can't really make any confident predictions now, because we're outside of the set of things that can work. And so once you change that, it's all kind of noise. Whereas if you change something and predictions remain confident downstream, then you infer that must not have been super important. That was an area where a change can more easily be tolerated in that particular sequence, and we can infer that from the fact that it continues to make confident predictions downstream. That's pretty remarkable stuff, and it suggests to me that this gets better as we scale up the dataset a lot more. This one was 300 billion tokens. I mean, that's not nothing, but it's not much compared to what we could easily imagine doing. Certainly not much compared to the 15 trillion tokens that Meta trained Llama 3 on. But then also more modalities, right? I mean, we're seeing this in the language models, where you can have, obviously, language integrated with image and plenty of other things, audio now with Gemini 2. It seems like you could imagine going from just the DNA to the DNA plus maybe the transcriptome or the proteome, or whatever the state of a cell is. And you can even imagine scaling this up another degree to the systems level too. But learning to predict the next state from the current state, it seems like we're getting really close to being able to do that.
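The perplexity test described here can be illustrated with a toy autoregressive "model". The stub below just prefers repeats of the previous base; a real run would use Evo's per-token log-probabilities, and the sequences here are made up. The point is the comparison: perplexity on the bases downstream of an edit rises when the edit is disruptive.

```python
import math

def next_base_prob(prev, base):
    # Stub "model": strongly prefers repeating the previous base.
    # A real run would query Evo's conditional token probabilities instead.
    return 0.7 if base == prev else 0.1

def downstream_perplexity(seq, start):
    """Perplexity of the bases from position `start` on, conditioned left to right."""
    logp = sum(math.log(next_base_prob(seq[i - 1], seq[i]))
               for i in range(start, len(seq)))
    return math.exp(-logp / (len(seq) - start))

wild_type = "AAAAAAA"
mutant    = "AAATAAA"   # single substitution at position 3

# A disruptive change makes the model less confident on what follows,
# so perplexity measured downstream of the edit goes up vs. wild type.
ppl_wt  = downstream_perplexity(wild_type, 4)
ppl_mut = downstream_perplexity(mutant, 4)
```

Here the inference runs in reverse of generation: instead of sampling, you score an observed (edited) sequence and read high downstream perplexity as "this position mattered", which is the gene-essentiality signal being described.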
I would kind of expect that in the next couple years at most, we would start to see large scale foundation models for biology that would predict, like, how a cell will evolve through time, how a system level description would evolve through time. And then you could start to do these, like, counterfactual or hypothetical perturbations and see, okay. If we change this, then how will that make things evolve? If we change this, how will that make things evolve?
Amelie Schreiber (1:22:27) And
Nathan Labenz (1:22:28) so then I guess what we would expect to be learned by those systems is what is now the black box, right, of all these, like, interactions that we have only mapped out, like, whatever 5% of. And then you could even imagine a situation where you start to do interpretability techniques on digital neural networks to then figure out what the actual pattern of interactions is in the actual biological systems. So Yeah. Is that where this is going over the next couple of years? I mean and then it seems like if we can achieve that, then you kinda have a fairly closed loop of, okay. We can now identify good targets at a pretty high rate. We can identify Yeah. Or design interventions that are pretty likely to be successful. The, like, specificity is already high. I mean, it's we're entering into sort of a super steep part of the s curve in terms of not just, like, understanding biological systems, but really being able to intervene in them. And it just seems like a totally different regime that we're headed for. So is that what you basically expect to see over the next couple of years?
Amelie Schreiber (1:23:33) I hope so. I hope that all of that comes to reality or comes into existence. I think it's about to get crazy, for sure. I think we're fast approaching a scenario where all of these technologies are gonna kind of converge, and we're gonna have a lot of power over editing and modifying our biology. And I think that's a very exciting thing, because there are a lot of problems that I really wanna see solved in my lifetime. And I think we are definitely moving in the right direction, and I think we are fast approaching a situation where we can actually solve a lot of these problems. The main thing that you're talking about is this complicated sort of hierarchical structure that's happening. And the level that I think at most of the time is the level of molecular interactions, but you can definitely take that up another level and look at how those interactions come together in a network to create a particular sort of phenotype, or to create a particular sort of state of the organism, right? And some of those states are diseased states that we don't want. But figuring out how to modify these interaction networks, I think, is somewhere that we really need to focus a lot. And one thing that's really useful, and that I really wanna see pushed further, is this notion of the LIS score with AlphaFold Multimer, and also seeing things like Distributional Graphormer generalized to complexes, to complicated complexes of molecules, and not just one or two proteins or something. I really wanna see more progress happen there, because once we understand all of the interaction networks and we're able to modulate them in very specific ways, we're gonna solve a substantial amount of problems. And modeling these interaction networks is becoming possible now.
I think before AlphaFold Multimer, and before some of these other docking methods, and before RoseTTAFold All-Atom came out, we really didn't have the tools to model all of these different interactions. But we have very recently just hit a point where we do have most of the tools that we need, if not all of them, to model all of these interactions. And once we implement something like AlphaFold or Distributional Graphormer for complicated complexes of molecules, that's really gonna be pretty substantial. And I'm not sure how I would approach the problem of determining what specific interactions or interaction networks to modulate. That feels like such a big problem to me, but I also know that there are a lot of biologists that know specific interaction networks they wanna modulate, and they wanna modulate them in very specific ways, and there's a lot of that, and now we can do that. I think we just need to get these tools into the hands of those biologists and make them really accessible, because they're not super accessible yet. I mean, a lot of them are open source and they're out there, and you can get the model weights, and a lot of them have the training code and the inference code and all this stuff available, but probably your average biologist is not gonna know how to use those tools right now. So making them accessible and easy to use is really where a lot of the work is gonna be as well. Because there are a lot of people that want to do these things, but they haven't quite gotten to the point where they've adopted all of these tools yet. And also figuring out how to use them all in tandem with each other, right? Because you don't wanna just use one of these models, you wanna use multiple and use them all together to solve a problem. And that requires you to learn multiple different models, how they work, and how to use each one of them. And that's not an easy thing to do for a lot of people.
Nathan Labenz (1:27:45) Yeah. This is an issue in AI even in much simpler use cases, like just using GPT effectively. The technology diffusion through society is really the big bottleneck right now, I would say. When people ask, if GPT is so great, why hasn't it changed productivity all that much? One really big reason is that people are not using it nearly as much as they could.
Amelie Schreiber (1:28:07) A huge reason. Yeah.
Nathan Labenz (1:28:09) If I had to kind of map out the story of the next couple years as I understand it, or as I'm kind of piecing it together from everything that you're teaching me, it is: right now, we have a big backlog of targets. And we have a pretty robust new set of tools that, used together, can design things that will hit those targets and not create too much collateral damage. It's hard to learn to use the tools, but once you have the skill set, it becomes relatively easy to take a target and crunch through a bunch of iterations and come to a bunch of candidates. And those are likely enough to work that we should start to see serious acceleration of the ability to find the solutions to these well-posed problems. And then
Amelie Schreiber (1:29:10) For sure.
Nathan Labenz (1:29:11) Sort of in parallel, we should probably also expect that Evo 5 or Evo 4 will be capable of dramatically better holistic modeling of the overall networks. And that will then, once we sort of deplete the current set of targets that have been painstakingly worked out through non-AI methods, why does it always seem like it happens at the same time? It's always these sort of gradual overlapping curves. But my gut says we've got a few years in front of us of picking the low-hanging fruit of, hey, we've got all these targets out there, and now we've got good methods to hit those targets. That's gonna take a while for people to learn the tools, do it, obviously validate it, and get clinical trials going in lots of different areas. And then as the current backlog gets worked through, that's probably a better way to phrase it, right around the same time, I sort of expect that all of a sudden the attention will turn to, oh my god, now we have these foundation models that kinda model the whole causal graph in all of its crazy complexity. And now we're actually going one level out and saying, now we can apply similar computational techniques to the identification of the targets. And we'll do that in sort of a similar way of being like, okay, I want this. This is what's happening and I wanna prevent it from happening, or this is what I want to happen proactively that isn't currently happening. Let me just kind of brute force my way through a bunch of perturbations, and, of course, we can get better than brute force too. But even just imagining a couple generations down of Evo, it seems that making a tweak and seeing what happens is going to be so radically accessible from a computational perspective that we'll then also just have this explosion of, like, quality targets to identify.
It seems like we may not be super far from that, and this is not even a world here of, like, any sort of AGI, right? We're talking, like, these are still simulation and tool type things that people would be using. We're not assuming anything here about AI agents doing the work, although you did have a little bit of that with the 310 AI kind of copilot. But is there anything that I should be, like, reining in my expectations on? I mean, are there things about, like, one tweet that I sent you was around to what degree can these things handle sort of point mutations or whatever? And there's maybe individual idiosyncrasy becomes a really hard problem at some point. But, like, how far does this paradigm that I'm sketching out extend, do you think? And what limits does it hit?
Amelie Schreiber (1:31:55) Yeah. So just briefly on, like, point mutations, or more complicated mutations where you mutate multiple things, and predicting how positive or deleterious that mutation or set of mutations might be on a protein or something. That's actually a capability that ESM has, and that's already two or three years old at this point. And Evo does it too; they got state-of-the-art performance with Evo on predicting which mutations were positive and which were negative, and which sets of mutations were positive or negative. And you can compute a few different kinds of scores that tell you about this, and you can actually build out evolutionary trajectories to show over time how things are likely to evolve, based on how positive or negative the impact of a mutation is on a protein or a DNA sequence or whatever. And Evo got state-of-the-art performance on this. They also did this with AlphaFold, I think they called it AlphaMissense or something like that, where they predicted all the single point mutations and all the effects that those have. And that's actually not a hard thing to do. So mapping out how a mutation affects a protein or DNA or something, and the course of evolution that's likely to occur, that's actually pretty easy to do now. I also wanna draw attention to another project that I'm aware of. It's not a model per se. I think they are using GPT-4; they may have trained their own in-house model, I'm not sure. But there's a company called Future House, and they're designing, like, an autonomous agent that will do a lot of this research for you, and it'll do things like literature search and review, and come up with hypotheses of different kinds of interaction networks that you might wanna modulate, or different targets that you might want, and it'll do it pretty well.
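The ESM/Evo-style mutation scoring mentioned above usually comes down to a log-likelihood ratio: the model's probability of the mutant amino acid versus the wild type at that position, with more negative scores meaning more deleterious. A minimal sketch, with a made-up probability table standing in for the model's per-position output (the real models expose these as token logits).

```python
import math

def mutation_score(site_probs, wild_type, mutant):
    """Log-odds of mutant vs wild-type amino acid at one position.
    Negative => the model considers the mutation unfavorable."""
    return math.log(site_probs[mutant]) - math.log(site_probs[wild_type])

# Hypothetical per-position probabilities from a protein language model.
site_probs = {"A": 0.60, "G": 0.25, "W": 0.05}

score_conservative = mutation_score(site_probs, "A", "G")  # mildly unfavorable
score_disruptive   = mutation_score(site_probs, "A", "W")  # strongly unfavorable
```

Summing these per-site scores over a set of substitutions gives the multi-mutation variant scores mentioned above, and ranking candidate mutations by score is one way to build the evolutionary trajectories she describes.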
That's a pretty new thing that just started happening, like, last year, I think, maybe late last year. And they're getting pretty good results with that. And that's actually using an LLM as the base, right, to build an agent to do this research. And then I think they're also building an autonomous lab that drives itself. So once the agent comes up with hypotheses or targets or what have you, the lab is autonomous as well. And that's a pretty exciting direction. I think that, coupled with some of these more specific tools for fine-grained operations on proteins and small molecules and DNA and RNA, when those two kind of converge, that's gonna be a really big deal. And I see that happening in the next year. The progress that they're making at Future House is really impressive, I think. So that would also be something that people should definitely look into. And I think your timeline is within a couple of years, for sure. I don't think it's gonna be that long before we start seeing substantial progress and changes. I mean, OpenAI just partnered with Moderna, right?
Nathan Labenz (1:35:17) Yeah. Hundreds of GPTs, I'm sure that's just the beginning.
Amelie Schreiber (1:35:21) And I think that's also a paradigm that's gonna match well with these other more specific tools, because when you enable an agent to use tools, you unlock a lot of possibilities, right? When you enable something that can review thousands of research articles, then develop targets or hypotheses to test, and then call on these specific tools to design molecules or proteins or nucleic acids to perform specific functions or hit specific targets, that's gonna move really fast. And I'm excited. I'm also a little nervous, because I think there's a lot of potential for misuse in the wrong hands. The sort of things that you can accomplish will be amazing and beautiful. We're gonna see a lot of health problems just disappearing. We're gonna see lifespan extended, and that's gonna really improve the quality of life for everyone. But we do have bad actors in the world, and that is something I worry about for sure, because in the wrong hands, we could be looking at very dangerous things as well. So having some kind of oversight for these things is very important, because we're moving into an age where you could target a specific group of people based on their genetics or something. That's both immensely useful and also very dangerous. And I have a lot of faith in the people that are working on these things. The people that are building these models and doing this research are really good people with a lot of good intentions and a lot of know-how and experience. That to me is very reassuring, but there's always some random jerk that has the potential to mess it up for everyone. We have to be prepared for that.
Nathan Labenz (1:37:20) Yeah, no doubt. On preparing for this: definitely one of the major updates that I've made is that when people talk about the bio risks from AI, the conversation I've heard most of has been, how does it compare to Google? How does GPT-4 compare to Google? Does it make it easier for you to get certain information or figure out how to do certain things? And in familiarizing myself to the degree that I have with all this technology, that all of a sudden feels very quaint already. This is not a question of comparing to Google. This is generating entirely new stuff. And I looked back not long ago at the list of mass extinctions in the history of the planet and what caused them. Of course, some were caused by totally exogenous shocks, like an asteroid hitting the earth. But the first one
Amelie Schreiber (1:38:18) Mhmm.
Nathan Labenz (1:38:18) On the list on Wikipedia is the oxygenation of the atmosphere. And it's simply that something pops up that either nobody knows how to eat, or it creates some waste product that nobody's prepared to deal with. And what we now breathe and depend on was at one point the cause of a mass extinction event. I try to keep these super zoomed-out perspectives in mind, and it does seem, I mean, tell me if you think there's any limitation to this, but with this sort of brute-force search through biological space, it seems like there's nothing conceptual that I could identify preventing, say, gain-of-function-type research on a totally different level. Those seem like the really dangerous things, and I don't know how you prepare yourself for that.
Amelie Schreiber (1:39:16) I think there are multiple things that we can do to help prevent some of these bad things from happening. And I think we're going to need to rely on the models that we build, especially the agentic models, to help guide us in some of this. Because at some point in the near future, we're gonna have agents based on really robust language models or something similar, and they're gonna be able to review thousands of research papers and do tests in a lab, taking in an amount of data that no human, or even group of humans, can really take in and process and digest and use. The scale is going to be much larger than a human can really work with. And so we're gonna have to rely on some of our models to help guide us. Also, going back to how it compares to Google: another argument that could be made for why not to worry about some of these things is, okay, maybe I can get on my computer and design some new thing that's really toxic, but I still have to go into a lab and synthesize all that stuff. And a lot of that part of the process is very highly regulated and watched. Right? I can't just go get a bunch of random chemicals and build this stuff in my house. It's harder than that. And synthesizing proteins is a nontrivial process. So a lot of the worry of, oh god, we're gonna have an AI that designs a deadly virus or a bioweapon or something, a lot of that is really overblown, especially at the moment. I think a lot of people overlook the fact that computationally designing a molecule is just one step in the process. You also have to do all the lab work and synthesis and work on some kind of delivery mechanism. And this is all stuff that's nontrivial to do, and that helps prevent bad actors from actually going through with some of this stuff.
Now, there may be state-sponsored bad actors with access to good labs and lab equipment that will circumvent a lot of that. But when you're working at a state-sponsored level, that's a matter of international relations and also national security that has almost nothing to do with AI. Right? The computational design of some random toxic molecule is just one step in a complicated process, and people don't often think of that. I see a lot of very influential, big-name people in the industry who came from the NLP and LLM side of things, and they don't really understand the biology and the process that goes into actually making these things. I think their worry is justified, but they're also not understanding the nuances and where the dangers actually are. So I think it's important for people to keep that in mind. Before we start getting all anxious and worried about some random person using Llama 3 to design a super flu or something, you have to remember it's very unlikely and very difficult for a random average person to go through the entire process of designing and synthesizing and delivering some kind of toxic molecule. That's a long, difficult process that an individual would be very unlikely to be able to accomplish, even with the help of a very intelligent LLM or a very capable agent. Now, when you have larger research-oriented companies working on these things that have their own wet labs, and they're building agents to run those wet labs and to computationally design the molecules and form plans on how to use them, there, I can see, okay, we need some kind of oversight. We need some people working on how to make sure that process is safe, how to protect that information and data and equipment. Right?
Because there are plenty of situations where something like corporate espionage is happening and people are trying to do nefarious things with some of the technology that these companies have. But again, that has very little to do with the AI itself, and more to do with how we're interacting with other countries and other research organizations. So, definitely, we need to be developing plans for how to use and regulate and oversee a really capable agent that is gonna go through the entire process of computationally designing and verifying and then also synthesizing and delivering. And that is technology that is in existence now. We already have agents that are doing most of this process. Right? And a lot of people are pushing for more of that, which I agree with, because we're not gonna be able to solve these problems on our own. We're gonna need some kind of really capable agent that can help guide us through these processes, because they're so complicated, and it's so difficult to understand the whole picture and all of its nuances, that I don't know if humans can get there without some kind of agent helping out: helping design the molecules and produce them and deliver them and so on. My concern is just making sure that the companies and organizations building and using this technology have some people overseeing the safety side of things, people concerned with red teaming and preventing things like corporate espionage, making sure that everyone using the technology is using it responsibly, and making sure that the companies are hiring not just capable people, but also people with good moral grounding and good intentions. That's not even the future; that's kind of now. That's already a concern that we need to address now, because we already have agents that are doing these things. Most people don't have access to them, though.
Most people don't have access to a really capable agent that can do this whole process, or most of this process. That's not something that even most companies have access to, much less individuals, because a lot of these models are closed. And I could see maybe some kind of state-sponsored group of researchers using open source models to build something similar and do something nefarious, but building such a thing is a very complicated process.
Nathan Labenz (1:46:24) Do we have any read on whether offense or defense is favored here, so to speak? In the sense that, for nuclear weapons, for example, if one of the major nuclear powers fires all its missiles, nobody can shoot all those missiles down. Seems to me that that is an offense-favored regime, and so we're kind of stuck in this mutually assured destruction paradigm, which is, yikes. We kind of need to get to a different paradigm, because we're under real threat of nuclear catastrophe as long as we've all got thousands of missiles pointed at each other and no viable defense. I don't have an intuition for
Amelie Schreiber (1:47:05) I think that
Nathan Labenz (1:47:05) if biology, like, works the same way or not.
Amelie Schreiber (1:47:08) I think this is less a question about biology and more a question about cybersecurity, because it's very similar in spirit to cybersecurity. And there as well, most often the attacker has the advantage over the defender. But there are a couple of things that may change that. One is really capable agents. If you have a really capable agent that's able to defend against attacks from humans who are just inferior to it, that's gonna be a really big part of protecting against bad actors. And I guess that's an argument for acceleration, actually. Because if you have the most capable agent, then probably your defense is gonna be a lot better than everyone else's. And if your agent is capable enough, it may be effective enough to ward off pretty much anything. I don't know how soon that's gonna come into the picture, but I think that is a good argument for keeping up the pace of development of LLMs and agents and things like that. The advantage of the attacker over the defender may end up shifting because of agents. We may end up in a situation where that's no longer the case.
Nathan Labenz (1:48:22) So I'm a little confused about why is the shift from biology to cybersecurity? Because I'm envisioning a world where, for example, EVO is open source, LAML 3 is open source. If EVO 3 is open source, then we start to at some point, we enter into a regime where, yeah, it may not be easy. It may be, you know, hard for 1 person. But at some point, it does get, like, lower than state actor level where somebody could launch some crazy attack. And then it's okay. If you create some superbug with certain properties, can we defend against it? I'm I'm a little bit unclear as to
Amelie Schreiber (1:49:01) Right.
Nathan Labenz (1:49:01) How you're I'm not sure if you're, like, equating that to cybersecurity or saying that's, like, primary somehow.
Amelie Schreiber (1:49:06) No, I guess that's a good point. A virus is probably as good an example as any, so let's say you develop a virus of some kind that targets a specific population. Developing a cure for that traditionally has been a very slow process. The fastest that we've ever done it was probably COVID, and that still took some time. Right? On the other hand, there is recent work that came out of, it was University of California, but I'm not sure which one. They recently published some research about universal vaccines, and they were able to design a vaccine that was applicable to a wide range of mutants of a virus. And they said that the method was highly transferable to other vaccines. So there's no reason this can't be applied to pretty much any vaccine, to develop universal vaccines against all variants of a virus, or most variants anyway. So that'll be helpful. That'll be good for defense. And as far as the applications of certain AI models to developing defenses, that's a very complex topic. Some of the new things that are coming out feel like they might be the answer, but it's a little too early to tell.
Nathan Labenz (1:50:26) Yeah. That's definitely a that's a very good data point. I mean, my kind of default would be just to think I have 3 kids, and I'm no expert in how babies develop in the womb, but it's definitely clear that a lot of things have to go right. Like, an unbelievable number of things have to go right in the proper sequence. At any point, if something goes wrong, like, that could be the end of it. Yeah. My general default model would be like, a lot of things have to go right and only kind of 1 or 2 big things would have to go wrong. And so it seems like there's a lot of surface area to defend and a lot of kind of places that could be attacked. But then, hey, if you can make a universal vaccine, then all of a sudden that does start to look quite a bit different. And I think you're right also to say a big part of this does seem to be sort of what is the prevailing, like, international relations regime because Mhmm. If it seems like pretty safe to say, if we get into a bioweapons arms race, we're gonna be in bad shape. We really have to have some
Amelie Schreiber (1:51:34) Yeah.
Nathan Labenz (1:51:35) More globally cooperative approach. Or the missiles have 1 really nice property, which is they don't, like, spontaneously escape their silos and self replicate around the world, whereas the the list of lab leaks is, like, quite long.
Amelie Schreiber (1:51:52) Yeah.
Nathan Labenz (1:51:52) It just seems like there's no way that we can get into an international bioweapons arms race and survive it. We just have to avoid that trap in the first place. And
Amelie Schreiber (1:52:04) Yeah. Also
Nathan Labenz (1:52:05) I hope we do.
Amelie Schreiber (1:52:06) What happens when you develop cures for a wide range of diseases and you're able to extend human lifespan significantly longer than what it is now and eliminate a lot of the diseases that we face. Once that exists, I don't know, maybe that's a paradigm shift. Maybe that's like a shift in human consciousness at that point. And we start thinking about things very differently because we're all used to thinking about everything in terms of being finite. And I think having an approach or thinking about things in a way that isn't finite anymore and thinking about things in terms of how valuable our health and our life is because of the fact that enables us to be with who we love for longer. The time that we have with the people we care about is in my opinion, the most valuable thing that we have. And once you enable people to have healthy lives with people they care about more or less indefinitely, that changes a lot. And I'm very excited to see. I hope that happens in my lifetime. I think it will. I think it'll probably happen within the next decade even, but I hope that really changes human consciousness to a point to where a lot of these problems, like just kind of start to go away because we stop thinking of everything in terms of finite resources, finite lifespans, finite time with the people that we care about. Like hopefully it's enough of a conscious shift in consciousness that we see some of these problems fading because a lot of them are cultural. Right? A lot of these problems or a lot of these, like, threats at the heart are very cultural. It's not about the technology, it's about how we use the technology and how we're interacting with each other when we use it. And that requires people to think differently. It's not a problem that can just always be addressed with some new technology or some new defense mechanism. We really have to change our thinking. 
I hope that when these things start becoming widely available, people's thinking will shift dramatically. Maybe that's where things are headed. I actually have a lot of hope and a lot of optimism that's where things are headed, because I think, overall, most people are good. And we can heal a lot of things with these tools, not just health problems but a lot of psychological things too, because they're gonna change the way that we interact with each other and the way that we perceive our environment and our relationship to it.
Nathan Labenz (1:54:46) That is beautiful sentiment and maybe a good place to end.
Amelie Schreiber (1:54:50) Yeah. I I think I agree with you. This has been really great, by the way. I had a lot of fun doing this, and I really appreciated this.
Nathan Labenz (1:54:57) Well, I appreciate you teaching somebody who doesn't know nearly as much as I suddenly feel like I really should about this area. So feeling's definitely mutual. Amelie Schreiber, thank you for being part of the Cognitive Revolution. It is both energizing and enlightening to hear why people listen and learn what they value about the show. So please don't hesitate to reach out via email at tcr@turpentine.co, or you can DM me on the social media platform of your choice.