Bioinfohazards: Jassi Pannu on Controlling Dangerous Data from which AI Models Learn

Jassi Pannu of Johns Hopkins discusses how frontier AI heightens risks of engineered pandemics, proposing a Biosecurity Data Level framework and defense-in-depth measures to restrict dangerous biological data while enabling research.

Watch Episode Here


Listen to Episode Here


Show Notes

Jassi Pannu, Assistant Professor at Johns Hopkins, explains how rapidly advancing AI is transforming biological research and raising the risk of engineered pandemics. They map today’s biosecurity landscape, from pathogen detection and DNA sequencing to vaccine development, and examine how frontier models can already troubleshoot lab work and bypass data safeguards. The conversation introduces a proposed Biosecurity Data Level framework to restrict only the most dangerous functional biological data while preserving open science. They close with a broader defense-in-depth strategy—Delay, Deter, Detect, Defend—including DNA synthesis screening, global pathogen surveillance, and practical tools like Far UV sterilization.

LINKS:

Sponsors:

VCX:

VCX, by Fundrise, is the public ticker for private tech, giving everyday investors access to high-growth private companies in AI, space, defense tech, and more. Learn how to invest at https://getvcx.com

Framer:

Framer is an enterprise-grade website builder that lets business teams design, launch, and optimize their .com with AI-powered wireframing, real-time collaboration, and built-in analytics. Start building for free and get 30% off a Framer Pro annual plan at https://framer.com/cognitive

Claude:

Claude is the AI collaborator that understands your entire workflow, from drafting and research to coding and complex problem-solving. Start tackling bigger problems with Claude and unlock Claude Pro’s full capabilities at https://claude.ai/tcr

Tasklet:

Tasklet is an AI agent that automates your work 24/7; just describe what you want in plain English and it gets the job done. Try it for free and use code COGREV for 50% off your first month at https://tasklet.ai

CHAPTERS:

(00:00) About the Episode

(05:59) From outbreak to vaccine

(17:08) Threat actors and data (Part 1)

(21:23) Sponsors: VCX | Framer

(23:53) Threat actors and data (Part 2)

(31:05) Gain-of-function research risks (Part 1)

(37:39) Sponsors: Claude | Tasklet

(41:03) Gain-of-function research risks (Part 2)

(48:05) AI models in biology

(01:00:51) Dangerous AI capabilities

(01:07:59) Biosecurity data level framework

(01:18:58) Policy, governance, and infrastructure

(01:28:53) Defense in depth vision

(01:40:43) Episode Outro

(01:45:02) Outro

PRODUCED BY:

https://aipodcast.ing

SOCIAL LINKS:

Website: https://www.cognitiverevolution.ai

Twitter (Podcast): https://x.com/cogrev_podcast

Twitter (Nathan): https://x.com/labenz

LinkedIn: https://linkedin.com/in/nathanlabenz/

Youtube: https://youtube.com/@CognitiveRevolutionPodcast

Apple: https://podcasts.apple.com/de/podcast/the-cognitive-revolution-ai-builders-researchers-and/id1669813431

Spotify: https://open.spotify.com/show/6yHyok3M3BjqzR0VB5MSyk


Transcript

This transcript is automatically generated; we strive for accuracy, but errors in wording or speaker identification may occur. Please verify key details when needed.


Introduction

Hello, and welcome back to the Cognitive Revolution!

Today my guest is Jassi Pannu, Assistant Professor at Johns Hopkins, who recently co-authored an important paper calling for the creation of access control systems meant to prevent the dissemination and misuse of functional biological data from which AI models could learn extremely dangerous capabilities, such as the modification or even de novo design of highly contagious & deadly viruses.

We begin with an overview of the biosecurity landscape today, including how new viruses are detected, how patient data is aggregated and analyzed in the context of a new threat, and what the pipeline from DNA sequencing to vaccine candidate looks like today.

The good news is that we are able to design new vaccines amazingly quickly, at least for viruses that are similar to others we've seen, but there is unfortunately a lot of bad news as well.  

In 2012, for example, two research groups independently published results showing that wild-type bird flu, which already had an estimated 60% fatality rate but couldn't spread between humans, could become mammal-to-mammal transmissible with just 5 mutations.  Such gain-of-function research has been broadly defunded since the COVID pandemic, but it does remain legal, and visibility into the experiments that private labs are conducting is low.

Governments, Jassi says, aren't likely to develop bioweapons capable of causing pandemics, for the simple reason that, short of vaccinating their populations in advance of an attack, they can't realistically expect to control them.  

But with AI capabilities crossing critical thresholds month by month, the threat from extremist groups and even lone actors is quickly moving from a theoretical to a deadly practical concern. 

Consider that Geoffrey Irving, Chief Scientist at the UK AI Security Institute, recently highlighted that today's frontier models can troubleshoot laboratory experiments from a cell phone picture better, on average, than PhDs.  

And in just the 10 days or so since we recorded this conversation, we've seen Andrej Karpathy's AutoResearch framework demonstrate that AI agents can run – and make research progress – for days on end.
Even more to the point, Anthropic just reported that Opus 4.6, when faced with a benchmark challenge it couldn't solve, spontaneously located the full benchmark dataset on Hugging Face and then figured out how to decrypt the solutions – which were encrypted in the first place precisely to prevent the answers from leaking into training data – all in order to get a single question right.

With reasoning AIs already able to spontaneously overcome such barriers to information, we should expect that future research agents will find and exploit any signal-rich data that exists anywhere on the internet.  

And with the smallpox sequence and the horsepox synthesis protocol already online, and biological data poised to grow super-exponentially in the coming years, we have real reason to worry and ample cause to get serious about implementing data controls before the situation gets out of hand.

Again, though, there is good news.  Recent work by the teams behind the Evo and ESM families of biofoundation models showed that strategically excluding key datasets, such as the DNA sequences of viruses that infect humans, dramatically reduced models' performance on dangerous tasks, while leaving desirable capabilities intact.

This means that the vast majority of biological data can remain open source & open access – and indeed Jassi & co-authors' proposal for a Biosecurity Data Level framework, which echoes the existing Biosafety Level 1 to Level 4 framework for physical wet labs, would subject only an estimated 1% of data – the data that connects pathogen sequences to dangerous properties – to additional restrictions.  And structures such as Trusted Research Environments, which allow researchers to run code on data without transmitting the data from its secure location, would still support valuable research.  

Once again, despite my personal history as a lifelong techno-optimist libertarian who broadly believes that data wants to and ought to be free, I find myself eager to support these control measures.

Of course, that's not the only opportunity we have to improve biosecurity, and toward the end we also discuss the broader defense-in-depth strategy that biosecurity experts recommend – Delay, Deter, Detect, and Defend – which includes mandatory pre-synthesis screening of sequences by DNA manufacturers, investment in wastewater monitoring and other passive global pathogen surveillance, and practical front-line defenses like PPE stockpiling and Far UV sterilization.

All of this is in everyone's shared interest, but it does require leaders to see beyond the current news cycle for long enough to make it happen.  I certainly hope they do, but also recommend taking individual action where you can, both to improve your own personal safety, and to support the consumer market for biosecurity products.  

My wife and I, for example, at our friend Jeff Kaufman's recommendation, purchased the Aerolamp Far UV light for use in my son's hospital rooms throughout his cancer treatment, and I'd welcome suggestions for other products that could help us minimize disease burden today while also serving as private insurance against pandemics. 

For now, I would simply emphasize that, by default, we are fast approaching a world in which a rapidly growing number of people, and perhaps autonomous AIs as well, will have the ability to create deadly, transmissible, self-replicating viruses, which could dramatically alter the trajectory of human history, and it really does seem like we should do something about it. 

With that, I hope you are properly alarmed by this scary but solutions-oriented conversation about the sorry state of biosecurity and the rapidly rising threat from biosavvy AI systems, with Johns Hopkins Professor Jassi Pannu.


Main Episode

Nathan Labenz: Jassi Pannu, assistant professor at Johns Hopkins, welcome to The Cognitive Revolution.

Jassi Pannu: Thanks for having me.

Nathan Labenz: I'm excited for this conversation. I think I'm going to learn a lot from it. So let's dive in. Recently, you co-authored, with a number of others, a call for controls on biological data. We've heard quite a bit about the possibility that AI systems of various kinds could create new sorts of bio risks, and this is one attempt to put some controls in place to hopefully cut that off at the pass before it really becomes a big issue downstream. I want to get into that from a bunch of different angles, but I think it would actually be helpful to take a step back and lay some foundations, because people who follow this feed know a lot about the AI side, but on average they probably don't know nearly as much about the current state of play in biology writ large. And there are some things here I realized I don't know as I was preparing for this. So for starters, take us into the moment, which we had not so long ago, and hopefully won't have again, but very well might, where all of a sudden there is a new outbreak of something we've never seen before, we don't know what it is, people are concerned, and patients in a particular hospital or a particular city are showing up with concerning symptoms. What happens? How do we turn that initial small patient population into actual knowledge about what it is we're dealing with?

Jassi Pannu: Yeah, I think that's a great place to start. Just to take a step back, it's important to ask: how does society come to the conclusion that there is a new virus circulating? What is our mechanism for making that decision? Because we don't have a global or national alert system for this kind of thing, in the same way that you have a radar system for ICBMs. Right now, we largely rely on symptoms. So patients get sick, they visit a hospital, doctors get concerned, and they run a series of tests. Doctors always start with tests for things that are common, like influenza, RSV, and rhinovirus, collecting samples by swabbing the nose and sending them to the hospital lab. It's when those tests are negative, and if the patient is very sick, that we'll go ahead and run further tests. We learned during COVID-19 that this whole process is pretty lengthy. It works well for familiar viruses, but it does not quickly detect new viruses. I'll just mention the case of influenza, because there the situation is a little bit unique. We've had several influenza epidemics in the past, so we do actually have a global system to collect new influenza sequences, but it operates on a contribution model: a national lab has to send the sequence up to the global repository, and it's an active submission process. We don't have a passive global alert system, and we can talk about the benefits of having something like that later. But once the community has decided, okay, it looks like there's a cluster of patients with a new virus, we need to figure out what this virus is. That's when we start to do more exotic tests like metagenomic sequencing, to try and figure out the sequence of that pathogen. And that's what happened during the COVID-19 pandemic. Based on research, we think COVID likely emerged around late November to December of 2019, but it was fully sequenced in January. And the way the decision to share that sequence happened was that a researcher in China sequenced the virus and publicly shared the sequence. They didn't request permission from the Chinese government, which is pretty different from how it might have happened for influenza virus. And that kicked off the process of designing diagnostics and trying to scale those.
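
To make the escalation logic Jassi describes concrete, here is a minimal sketch in Python. The panel contents, severity rule, and function names are illustrative assumptions for this episode page, not an actual clinical protocol.

```python
# Illustrative sketch of the diagnostic escalation described above:
# targeted panels for common pathogens first, untargeted metagenomic
# sequencing only when the panel is negative and the patient is severe.
# Panel contents and rules here are assumptions, not a real protocol.

COMMON_PANEL = ["influenza_a", "influenza_b", "rsv", "rhinovirus", "sars_cov_2"]

def run_panel(sample, targets):
    """Stand-in for targeted tests; returns the first pathogen detected."""
    for pathogen in targets:
        if sample.get(pathogen):  # hypothetical positive/negative lookup
            return pathogen
    return None

def triage(sample, patient_is_severe):
    hit = run_panel(sample, COMMON_PANEL)
    if hit:
        return f"known pathogen: {hit}"
    if not patient_is_severe:
        return "panel negative; routine follow-up"
    # Panel-negative AND severe: escalate to metagenomic sequencing
    # to characterize a potentially novel pathogen.
    return "escalate: metagenomic sequencing"

print(triage({"rsv": True}, patient_is_severe=False))  # known pathogen: rsv
print(triage({}, patient_is_severe=True))              # escalate: metagenomic sequencing
```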

Nathan Labenz: I recently had an experience with this. My son has had cancer, and people know about this if they've listened to this feed. I went through the process of sharing data with the hospital. Somebody, and I think it's a big part of their job based on how much time they spent with me, came by and said, hey, I'm from the data sharing world, and I'm here to answer all your questions and get a thousand signatures on a thousand pages so we can hopefully use this data for the betterment of all humanity. And I said yes to all that stuff. And I've still been getting stuff in the mail asking me to share data again with particular studies or groups or whatever. I'm wondering if, when it's a pathogen, that's maybe different, because what I'm sharing in my son's case is information very specific to him, whereas if it's a pathogen, it's not your pathogen, right? It's the world's pathogen in some sense. But how much consent or opting in do patients have to do, if any, to get this data from them and into the higher levels of analysis?

Jassi Pannu: Yeah. What you're pointing to is the fact that we have very different systems for dealing with individual-level risk versus societal-level risk. We have a really strong set of protections around making sure that your son's data isn't inadvertently shared, that you've consented to everything, and we're thinking about the privacy risks of all of that data. And that's really to protect your son, the individual patient. But when it comes to societal-level risks about a pathogen, where basically the globe is your patient, we don't have as good mechanisms for figuring out how to protect society from those risks, and we don't have as good mechanisms for protecting the data that could lead to those consequential outbreaks. But to focus on the clinical data: that very much depends on where you are, what country you're in, what system you're operating in. In the US, as it sounds like you've experienced, our system is very fragmented. It's different within different states and different networks, whether it's a public community hospital or a private hospital. And that fragmentation results in different people having to ask you for permission over and over again, because there isn't a single unified repository for all of that data. It is a bit different in other countries. If you look at the UK, they have the National Health Service, and what the UK has been able to do is create a research platform called OpenSAFELY. This was something that was stood up during 2020, and it provides researchers access to clinical data for 95% of the UK population. It's really interesting because it doesn't require giving the data to the researchers. It forces the researchers to come to the data. So the code comes to the dataset: you can submit your code to the platform, you never have to see the private data, and it's secure the entire time. There's been a lot of positive reception to OpenSAFELY, but we don't have an equivalent in the US where everything's unified in that way.
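
As a sketch of the "code comes to the data" pattern Jassi describes, here is a toy trusted-research-environment interface in Python. All class and function names are hypothetical; this is not OpenSAFELY's actual API, just an illustration of the idea that submitted code runs inside the secure environment and only aggregate results come back out.

```python
# Toy sketch of the "code to data" pattern. Hypothetical names throughout;
# not OpenSAFELY's real interface.

class TrustedResearchEnvironment:
    def __init__(self, records):
        self._records = records  # private data never leaves this object

    def submit(self, analysis_fn, min_cell_size=10):
        """Run researcher-supplied code inside the enclave; release only
        aggregate outputs above a disclosure threshold."""
        result = analysis_fn(self._records)
        if not isinstance(result, (int, float)):
            raise ValueError("only aggregate statistics may be released")
        if isinstance(result, int) and 0 < result < min_cell_size:
            raise ValueError("small cell count suppressed to prevent re-identification")
        return result

# The researcher writes code against the schema but never sees the rows.
def count_icu_admissions(records):
    return sum(1 for r in records if r["icu"])

tre = TrustedResearchEnvironment(
    [{"icu": True}, {"icu": False}] * 40  # stand-in for real clinical rows
)
print(tre.submit(count_icu_admissions))  # 40
```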

Nathan Labenz: Yeah, that's interesting. We'll come back to the different levels of protection or restriction that different data sets should have in your mind. But let me stay on the response narrative for a minute. It sounds like it was actually longer between when the virus first emerged and when it was first sequenced than it was from when it was first sequenced to when the vaccine was initially designed. I've read a couple of magazine pieces that tell the story of a, I don't know, 48-to-72-hour period where certain researchers got the sequence, did some analysis on it, identified a particular protein, which I think was the spike protein, and then plugged that into an mRNA platform. And my understanding was that the final vaccine that I got wasn't that different from the initial design that they put together in a couple of probably pretty long days, maybe as early as February of 2020. Has that pipeline changed at all? Obviously there have been so many ML models that have come out over the last five years that predict shapes of things better and what's going to bind to what better. But that's pretty hard to beat in terms of timeline. If that were to happen again today, would the process look much different from how it looked then?

Jassi Pannu: Maybe first I'll talk about what happened during the COVID-19 pandemic and those different timelines, and then we can reflect on how this will look in the future with AI. You're completely correct that during the process of designing COVID-19 vaccines (I'll focus on the mRNA platform vaccine as a primary example), the design process, the computational steps where you were looking at the spike protein and working backwards to figure out what mRNA sequence you're going to use for your vaccine design, was very fast. But that was clearly not the bottleneck. There were many other steps that took much longer. Yes, the process of knowing there was an outbreak, sequencing the pathogen, and figuring out what we were dealing with took quite a bit of time. But there was also a whole body of research that researchers relied on that I think doesn't get enough airtime. Once researchers realized that we were dealing with SARS-CoV-2, they were able to look at the body of research done on SARS-CoV-1, the original severe acute respiratory syndrome virus that emerged many years prior. It didn't spread into a global pandemic; there were isolated patient cases across the globe, and we actually managed to prevent it from becoming one. It was that research that told researchers the spike protein was extremely important. So they were able to say, okay, we know the spike protein is important, let's computationally design the vaccine. That part was very fast. And then it was turning that into an actual vaccine: all of the clinical components, the regulatory process, the different clinical trials that you have to go through, and then distributing that vaccine globally. Those steps took a lot longer than the actual computational design. Now, with advancements in AI models, people are very optimistic about being able to design new proteins, design antibodies, design vaccines. I would say there is still going to be the bottleneck of scaling in the physical world. Clinical trials remain a huge barrier. Scaling and deployment of vaccines across the globe is a huge barrier. So figuring out whether AI can speed up those steps will be really useful, and it's perhaps the neglected component of the overall pathway.

Nathan Labenz: One thing I've learned in looking at the question of security at frontier model developers in the AI industry, and at responsible scaling policies, is that there are a lot of levels to the game of security, and the different levels seem to correspond to different actors that you might be concerned about and how hard it would be to prevent them from doing bad stuff. For the likes of DeepMind and OpenAI and Anthropic, the general consensus seems to be that if a determined nation-state actor, China, were to want to steal the model weights, they would probably be able to do it, and there's not too much that could be done to prevent that. How would you map the threat landscape when it comes to the bio risk side? Is it a similar thing, where we have random crazy people, versus somewhat more sophisticated groups of people, versus all the way up to nation states? And what do you think is reasonable to expect we can actually stop with all of the measures that we'll talk about potentially developing?

Jassi Pannu: Yeah, it's a really important question. Within the realm of biosecurity, there are a lot of different threats that people refer to when they're thinking about chemical and biological weapons. There are toxins, small molecule toxins, protein toxins. Then there are organisms that are not transmissible between humans, things like anthrax, where we have reliable countermeasures, antibiotics that work against them. And then there's the more extreme end, pandemic threats: pandemic viruses that are novel. We've never seen them before. We don't have diagnostics for them. We don't have therapeutics or vaccines for them. When you think about that spectrum, the potential consequences of a pandemic virus are far higher than many of those other threats. This is all pretty obvious to us now, after having lived through one. But it's also interesting that a pandemic virus is not a particularly desirable weapon for a nation state. It's not targeted. It's not easy to protect your own population; you'd have to design a vaccine and vaccinate your entire population, and it's hard to do that without someone noticing. So in general, nation states are not the primary actor one is considering when thinking about pandemic threats. It is definitely more so the folks that are not motivated by rationality: smaller groups, terrorist groups, potentially lone actors. Those are the folks that people really are concerned about. And that's why a data control mechanism is most likely in the interests of all countries. I would say China is equally invested in making sure there isn't a future pandemic as the United States is. So I'm hopeful that there could be some international cooperation, or at least, if not cooperation, acknowledgement that data controls benefit both the US and China and nation states globally. In terms of data controls and how you can prevent the kinds of actors I outlined, lone actors and smaller groups, from getting access to your datasets, I think controls are meaningful there. Currently the default is sharing that data publicly for anonymous access. It's extremely easy to access, and even putting up minor barriers would make a difference. The other important thing to consider is that we want to make sure that defenders, people who are advancing countermeasures research and virology research, have access to that data, while limiting access for malicious actors. Controls can do that differential privileging, where you're privileging defensive use cases and limiting offensive ones: you can track who's using the data, give access to that crowd, and limit access to others.

Nathan Labenz: So can we maybe just map out the data landscape as it exists today? I want to do this for both the data landscape and the models that are spawned from the data. On the data side, I've heard it said many times that you can find the smallpox sequence on the internet. And then there's the question of, if that's true, why hasn't that turned into a crisis already? I have heard various accounts; I'd be interested in yours. But then, more broadly, that's a known sequence for a known problem. Probably, you tell me, but it strikes me that it's a small enough amount of data that it's pretty hard to control or clean up from all the places it might already have been replicated. It's a little hard for me to imagine a world where it's been scrubbed so thoroughly that somebody who wanted to find it wouldn't be able to, but maybe you have a plan for how we could get there. But then if we expand the scope of data of concern, there's tons of bio data in general, right? And I know most of that you're not looking to restrict. So how would you draw the circles, I don't know if they're concentric or not, from the most narrow, this is the smallpox sequence, which we probably shouldn't be passing out too freely, to somewhat larger, and then beyond that, everything that would be fine? How would you characterize those classes of data?

Jassi Pannu: Yeah, let's start with just the general broad categories of data. The most abundant biological data that currently exists is sequence data. We are swimming in petabytes and petabytes of sequence data, a lot of which we don't know the function of, simply because it's been extremely easy to collect and sequence that kind of data. There's something called Carlson's curve, which is the equivalent of Moore's Law for biology and DNA sequencing, and it has actually shattered Moore's Law, because it's become exponentially cheaper to do DNA sequencing. That's resulted in a lot of passive DNA sequencing collection, just sequencing everything. And that kind of data is available in government-supported repositories, things like GenBank, which is supported by NCBI, part of the NIH in the US, and that alone has 40-plus petabytes of unannotated, raw, frankly poor-quality DNA sequencing data. That is a large part of why there are efforts to build AI models using that data: because it's abundant. When we think about other types of data, there are protein sequence databases, and then there's the PDB, the Protein Data Bank, which formed the basis of AlphaFold and contains protein structure data. That data was collected by hand over the years. I'm sure a lot of us have heard the stories of painstaking experiments to figure out a protein structure, where one structure would be a grad student's whole PhD project. That data set is actually very small; it would definitely fit on a thumb drive, less than a terabyte of data. So the different types of data we have access to for biology are disparate, of different types, and of different sizes. When we're thinking about what in that whole landscape might be of particular concern, I would say it's currently an open question. There are some who think you could train an AI model on genetic sequence data alone and get a pretty functional model. And models like Evo 2 that have done this, trained on just DNA sequence data, can perform well on protein-related tasks. They can span scales; they can do genome generation. So there's optimism that you could do quite a bit with just sequence data.
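
For a rough sense of how sequencing cost has outpaced Moore's Law, here is a back-of-the-envelope calculation using widely cited, approximate NHGRI figures (roughly $100M per human genome in 2001, under $1,000 by around 2015); treat the numbers as order-of-magnitude illustrations only.

```python
import math

# Back-of-the-envelope comparison of cost-halving times.
# Figures are approximate, order-of-magnitude estimates.
cost_2001, cost_2015, years = 100_000_000, 1_000, 14

halvings = math.log2(cost_2001 / cost_2015)           # ~16.6 halvings
print(f"halving time: {years / halvings:.2f} years")  # ~0.84 years
print("Moore's Law benchmark: ~2 years per halving")
```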

Jassi Pannu: But there is this other view in the community that genetic data is really observational, and that what you need to advance biology is data that gives you insight into causality. That's where things like perturbation data or knockout data sets come in, where you're systematically knocking out different genes of a virus and looking at how that impacts function, or systematically looking at how viral proteins bind to human proteins. I'll broadly call that functional data that gives you some insight into causality. And there's thinking that incorporating that kind of data into training AI models is what's really needed to get you to making functional biological constructs that are viable in the real world. So in terms of the data controls we're proposing: as you rightly said, the vast majority of data should not be under control. There's actually been a huge effort in biology to make data open access and more openly shared, because that advances research overall, and we're fully supportive of that. The controls we're proposing are really on functional data that gives you insight into important features of viruses, features that, frankly, the US government has recognized as relevant to whether a virus is pandemic capable or not: transmissibility; virulence, or how deadly it is; and immune evasion, meaning can you modify a virus so that it gets around an existing vaccine or your own immune system. Those are the kinds of features the US government already tracks for wet lab research. If you're proposing a wet lab experiment that intends to enhance a pandemic virus to make it more transmissible, more virulent, or able to evade your vaccines, that's something the US government wants to know about and is going to ask you whether or not it's a good idea. Doing this in the computational domain is just extending that a little bit further. We're proposing to focus narrowly on that kind of data. The one other thing I'd add is that I completely agree: going out and scrubbing the internet of data that already exists is not going to be possible. It would be a Herculean effort and probably not worth it. What we're proposing is controls on data that's generated in the future, on new data sets. My thesis is that now that we know AI is quite promising for biology, there will be huge investment in creating new datasets. We're already seeing this with the US government's Genesis Mission, with the OpenAI Foundation's commitment to spend billions on datasets, and with the Chan Zuckerberg Biohub as well. So as really large-scale efforts get underway to generate not just observational but also causality-related information, perturbation datasets, we're suggesting that if that's done on pandemic pathogens, the data should probably not be shared for completely anonymous access. You should track who has access to it and have some controls around it.
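
As a toy sketch of how a data-level assignment along these lines might work: the level names, rules, and handling strings below are illustrative assumptions for this page, not the paper's actual Biosecurity Data Level criteria.

```python
# Toy Biosecurity Data Level (BDL) classifier. The rules and level names
# are illustrative assumptions, not the paper's actual framework.

FUNCTIONAL_PROPERTIES = {"transmissibility", "virulence", "immune_evasion"}

def assign_bdl(dataset):
    """Return a (level, handling) pair for a dataset description."""
    is_pandemic_pathogen = dataset.get("pandemic_capable_pathogen", False)
    measured = set(dataset.get("measured_properties", [])) & FUNCTIONAL_PROPERTIES

    if is_pandemic_pathogen and measured:
        # Functional data linking a pandemic-capable pathogen's sequence to
        # transmissibility, virulence, or immune evasion: the ~1% case.
        return ("BDL-high", "controlled access; authenticate and log users")
    if is_pandemic_pathogen:
        return ("BDL-medium", "registered access; monitor bulk downloads")
    return ("BDL-open", "open access; no restrictions")

print(assign_bdl({"pandemic_capable_pathogen": True,
                  "measured_properties": ["transmissibility"]}))
print(assign_bdl({"measured_properties": ["expression_level"]}))
```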

Nathan Labenz: So going back to the people doing wet lab experiments on gain-of-function-style premises, I'd be interested in your take. To put my cards on the table: I think that's not a good idea, when I look at the timeline from the emergence of the virus to the sequencing to the vaccine design in the COVID case. And I do take your point that there was some prior knowledge there that accelerated things; I'd be interested to hear how much that timeline would have moved if it weren't for that SARS-1 knowledge. But my zoomed-out and somewhat ignorant view is that it didn't take that long. Certainly the clinical trial part, and the manufacturing and distribution, took a lot longer. So if the argument is that we want to do these experiments now so we can shorten the timeline in the future, if something like this does happen, I would say you're not really taking the bulk of the time out by quickening the pace to a vaccine design. And I'd also hate to see one of these things get loose. So maybe we just shouldn't do that. Do you see it the same way? And then an obvious extension would be: should we apply the same reasoning to certain kinds of data generation in the first place? Is there a certain kind of data set that we should just say we're better off not scaling? It'd be interesting to know exactly what it is in these viral sequences that causes transmissibility or whatever, but we'd be creating something there that, while obviously a bit more upstream, could in theory escape in a similar way. So let's start with: what's your take on gain-of-function research in the wet lab? And then, does the same analysis apply to the data generation side?

Jassi Pannu: Yeah, excellent. And I think you can say your view with more confidence, because I think it's a very reasonable view. There was a lot packed in there, so if I miss anything you just asked, feel free to flag it to me. But to provide some background on what gain-of-function research is, and I don't know if you've talked about this before on the podcast, I'll give an example from 2012. In 2012, there were two experiments done by two different research groups looking at avian influenza, which at the time was thought to have a 60% case fatality rate, so highly lethal, but was not human-to-human transmissible. There were cases of humans getting avian influenza from animals, but it was not resulting in a global pandemic because it didn't efficiently transmit between humans, which is a happy accident for us, frankly. What the researchers did was animal experiments in ferrets, where they intentionally increased the transmissibility of that virus between ferrets. Ferrets are the standard mammalian model; they're meant to represent human immunity. So the hypothesis was that they were creating a human-to-human transmissible version of this highly lethal virus. When these experiments were submitted for publication, I believe simultaneously to both Science and Nature, those journals alerted the US government, wondering, frankly, what they should do with the results, because the manuscripts included the specific mutations that would be required to create that level of transmissibility. One of the groups found that it was only five mutations in the avian influenza virus that got you to human-to-human transmissibility. So that's the kind of work people refer to as gain-of-function research; the technical terms are dual use research of concern, or enhanced pandemic pathogen research. At the time, it was recommended that those publications, or at least the details of the mutations, not see the light of day. There were two concerns there. One is the concern that you raised, which is: wow, humans are working with these kinds of pathogens in the lab. We know that there's human error. What happens if someone dealing with the pathogen they just created to be more transmissible gets infected themselves, leaves the lab without knowing it, and triggers a global pandemic? That's a legitimate concern. There have been instances in the past where samples of really concerning viruses have been found in settings like the CDC. In one past example, the CDC found vials of smallpox in a freezer. They didn't know they were there, and they were still viable. There are supposed to be only two places in the world that have active samples of smallpox, and these were not known to be there. So overall, there's concern about lab accidents and human error, and that was one of the major concerns; it generally falls under the bucket of biosafety. The second concern was that, aside from dealing physically with the pathogen, there was the information related to the experiments: not only how the researchers did this and how someone else could replicate the same effort, but also the exact mutations that would be needed, and whether or not that was information that should be in the public domain.
And this relates very much to what you described. The horsepox synthesis protocol is in the public domain. The smallpox sequence is in the public domain. So theoretically, someone could put those two things together and try to create smallpox, even if they weren't able to get access to the physical specimen. And overall, there's just lots of information out there about protocols for doing reverse genetics, or other ways of rescuing live infectious virus, for pandemic pathogens. So in the wet lab field, this has actually been a huge debate for years and years: what do we do about this information? Should it be controlled in some way? So far, where policy and regulation have come down is a focus on controlling the physical specimen and on preventing experimental work that increases these concerning characteristics of pathogens. But it's been considered infeasible to go back and scrub the internet of data that's already out there. So we really have to figure out a mechanism for deciding this in advance, before it's out there and we can't do anything to pull it back.

Nathan Labenz: But it's still not illegal to do this, is that right? Do you need any special permission, or is it an ask-forgiveness-not-permission regime that we're in with such gain-of-function research, even in the physical realm?

Jassi Pannu: Yeah, it's a really good point. In the case of the 2012 experiments, there was no law those researchers were breaking. The mechanism the US government has used in the past has been, essentially: if you receive funding from the US government, you therefore have to follow certain policies. And this policy around not doing research enhancing pandemic pathogens is one way the US government has tried to handle it. There are some laws on the books for dealing with controlled pathogens, like the Federal Select Agent Program. For example, if you want to handle anthrax samples, you have to be a registered lab that is tracked under this program. But that's slightly separate. The other thing to consider is work that, for example, seeks to go into bat caves. People are collecting samples of viruses where there's a suspicion those viruses could be pandemic capable. They sample them in those caves, which is called virus hunting, then bring the samples back to the lab, manipulate them, and try to characterize them. This is also something that the US government and other governments used to spend money on, but after the COVID-19 pandemic, with the fallout from that, the lab leak hypothesis, and the political dynamics around it, a lot of that work has been defunded. But it's not explicitly illegal.

Nathan Labenz: I don't want to get too bogged down in this particular point, but is there a good reason for that? I do understand, of course, that we benefit tremendously from biomedical research broadly. And I could imagine you might say, actually, the border is a little harder to define and can be fuzzy, so it's hard to legislate. But if this is one that we're sleeping on without a really good reason, it might be time to start writing our representatives. So is there a good reason that this isn't more controlled than it currently is?

Jassi Pannu: I would love to see it more controlled than it currently is. I think the real reason that it isn't is that governments are good at legislating things that happen often, and what we're dealing with are pretty rare events that would certainly lead to extremely high-consequence harms, global pandemics, things that we don't want to see, but that just don't happen very often. So the push for policymakers to treat this as a live issue, something that needs to be legislated, comes and goes very quickly. We already saw with the COVID-19 pandemic that there was a lot of concern, and there still is debate as to the origins of COVID-19. We haven't resolved that question; the WHO director actually put out a statement just a couple of days ago saying that we still need to do work on resolving it. But policymakers have moved on to more pressing issues, because that's just the nature of policymaking: they have to put out fires today. In reality, what we need are both national and international rules. Right now, the WHO has rules saying that only Russia's Vector Institute and the US CDC can have access to smallpox; those are the only two places in the world. That's a great initiative, but it did not prevent a researcher from unilaterally publishing the step-by-step protocol for how to synthesize horsepox, the close relative of smallpox. What that highlights is that as synthetic biology, virology, and biomedical capabilities advance, we need some better way to make sure our regulations keep up. That's a concerning gap. So far, the reason this isn't a live issue, why we're not thinking about it day to day, is that it still requires a lot of expertise to synthesize any of these pathogens from scratch. It really is something you need a lot of background in. But this is where the concerns related to AI come up. In terms of the different types of AI models, whether or not they provide uplift, and what kinds of bio models could be used to do this, this is such an evolving and open question that people are trying to figure out. The hard part is that it's moving quite quickly, so it's hard to see how policymakers can keep up. But we're working on it.

Nathan Labenz: How much actual wet lab gain-of-function research do you think is going on today? Has it been dramatically curtailed by these strings attached to funding and by general awareness in the community that it's maybe not a good idea? Or do you think there is still a lot going on? I guess another reason there might not be a law is, well, everybody quit doing it because they realized it was a bad idea, and who needs to make a law against something nobody's doing. But is that the case? Do we even have any way of really knowing how much is going on?

Jassi Pannu: With regards to wet lab gain-of-function research, I would first want to say that the kinds of research we need for future vaccine design, like determining that the spike protein is important, which is how we advance our ability to create vaccines for new pathogens, don't require gain-of-function research. The gain-of-function research we're talking about is very narrowly scoped, and that vaccine-relevant work does not require enhancing the transmissibility of a pathogen, making it more virulent, or making it escape the immune system. Those kinds of experiments are really not needed for the vast majority of the advancements we would want in biomedicine. With that in mind, I would say that over the past few years, since the COVID-19 pandemic, a lot of this work has been defunded and reduced, just by US government funding mechanisms. Our blind spot is the work happening in private labs. We don't actually have any legal mechanism for going into a private lab and determining whether they're doing certain kinds of pathogen research, other than whether they've registered under the Federal Select Agent Program. And there have been instances, most recently involving laboratories in California, of labs handling certain types of pathogens that they really, frankly, shouldn't have been, without the containment protocols for them. I'll pause there. Overall, I think we're in a better spot than we were. People have recognized the downsides and risks of this research, and certainly governments are paying a lot more attention to it.

Nathan Labenz: It's funny, it echoes in a way the reduction in bad behavior that we usually see from one generation of large language model to the next, where it's like: we recognized this was a problem, did some stuff to try to curtail it, and reduced it by about 90%. Great news. The other 10% is out there for future work to contend with. Which is dizzying in the gain-of-function case. Okay, let's go with your segue. There are different kinds of models, obviously, in the AI space that people might be concerned with. In my own head, thinking ahead to this conversation, I had: of course we've got the large language models, which output text and can reason about things. First of all, they might just know facts that could be problematic. And they can use tools. I just did a conversation with Geoffrey Irving, who's the chief scientist at the UK AI Security Institute, and I had not realized before that frontier models these days are getting really quite good at troubleshooting lab experiments from cell phone pictures. They've now gotten to the point where you can just snap a picture of what you're working on, tell the AI that it's not working, and it will coach you through how to get it working. So that's one category: the know-how, the reasoning, the procedural stuff. Then we've got models that, as you alluded to with things like AlphaFold and that whole genre, are very good at making very specific predictions: what shape is this going to be, what's going to bind to what, and so on. And then there's the middle-ground hybrid, things like Evo and Evo 2, which are trained kind of like large language models on vast data sets. In many cases they are literal next-token predictors, albeit in the DNA or protein sequence domain, and I probably have the least intuition for those. Complicate that taxonomy for me if you want, but then maybe just go through and tell me how concerned I should be about those different kinds of models, or how they might be stitched together, in a world where there are no data controls and the models get to learn on everything we have.

Jassi Pannu: Got it. In general, I like this taxonomy. I was not as creative as you in coming up with the different capabilities these groups have, but I think about them largely the same way. So LLMs are trained on lots of biology information, from textbooks to scientific papers. They can give that information to someone who doesn't already have it, someone who's not already a biology expert, and that's usually called uplift. Broadly, that's a really great thing. That's LLMs teaching someone new biology, helping students learn; it's a quite useful capability. There's a subset of instances, for example, "how do I illegally obtain an automatic weapon?" or "how do I illegally obtain smallpox samples?", where the information is clearly not something that should be provided to the general public, and that's where frontier labs are working to apply classifiers and refusals to make sure that kind of knowledge is not being widely shared. Then, when you think about tools that can be used for biology tasks, people often call these biodesign tools. These are specialized models that are trained on biology data and used to do specific things. I think of this as a model that gives someone a new capability. It's not about knowledge; it's about what you can do. These types of models really require someone to already be an expert. You have to already be a computational biologist working with models to leverage these kinds of capabilities. But the interesting thing is that they can allow those researchers to do something that was just not possible before. Before the world of protein design and AlphaFold and structure prediction, it was just not possible to take a protein sequence and then, immediately, through computational methods, play with new designs or try to infer its structure and function. So those are really interesting capabilities. Again, the risks here are less about providing uplift to someone who didn't already know how to do that; it's about giving experts the ability to do new things with biology. And then there's this third category. I agree: models like Evo 2 and ESM3 are what I would call biology foundation models. They're trying to be general purpose in the same way that LLMs are.

Jassi Pannu: They're often also trained on different kinds of data. So AlphaFold obviously is trained on the PDB, but it also has MSA data, and ESM has different types of data it's trained on as well. These are models that are often much larger than biodesign tools, which can be small enough for an individual research group to train and host locally. Biofoundation models require more data and more compute; they're more expensive for groups to develop. So you often see these models developed by larger organizations: AlphaFold obviously being part of Google DeepMind rather than an independent academic lab, and ESM coming from EvolutionaryScale. These types of models are trying to infer the fundamental laws of biology. They're trying to understand how biomolecular components interact across different scales and to elicit the underlying laws that govern protein function and protein structure, and, in the case of Evo 2, to operate across different scales. Evo 2 is a model trained on just DNA sequence data, and what they were able to show is that it can actually help with tasks across sequence, protein, and genetic regulatory circuits. That's operating at different scales in biology and inferring laws that transfer between those scales. So overall, lots of interesting work is being done in all of these. I think the risk considerations can be separated across the different types of models, but what I would argue is that that separation will collapse over time, because what organizations are working towards are integrated workflows. Ultimately the dream is to have your AI agent design your experiments; it will be connected to your autonomous robotics, which will conduct those experiments; the data from those experiments will be collected and fed back into your biology foundation model, which will then be used to design future experiments, and so on, in a loop. These kinds of iterative feedback loops, where you're getting data from the real world, are where people are most hopeful about advancing biology, because right now we have these huge data sets that are messy and collected in an observational way. What these feedback loops would allow you to do is systematically perturb systems, systematically try to assess causality, and then use that information to further and further improve your in silico models. The dream would be that ultimately your in silico models perform so well that they start to replace some of the wet lab biology, and you get better and better predictions of what different drugs, for example, will do in cellular models and animal models. And then ultimately you'd get better predictions for clinical trials, so you have to do fewer of them and they have higher yields.
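
To make "next-token prediction in the DNA domain" concrete, here is a self-contained toy: a character-level bigram model over nucleotides. Real models like Evo 2 learn this objective with deep networks over very long contexts; this sketch only illustrates the prediction target itself.

```python
from collections import Counter, defaultdict

# Toy next-token model over nucleotides: a bigram frequency table.
# Models like Evo 2 pursue the same objective with deep networks and
# very long contexts; this only illustrates the training target.

def train_bigram(sequences):
    counts = defaultdict(Counter)
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    return counts

def predict_next(counts, base):
    """Most likely next nucleotide given the previous one."""
    return counts[base].most_common(1)[0][0]

model = train_bigram(["ATGCGATAATGCGA", "ATGAATGCATGA"])
print(predict_next(model, "A"))  # 'T' here, i.e. whatever the data favors
```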

Nathan Labenz: So where do we want to take this? I'm obsessed, by the way, with that idea of closing that loop, and also with a little hobby horse of mine I'd be interested in your take on: the latent-space integration of these different modalities. Obviously we've seen this with image and text, in the sense that I can now go to a Nano Banana model or whatever and give it an image and also some text instruction, and it understands those in a joint way, to a degree that wouldn't be possible if it was just a language model prompting an external image model with text. I could have a language model that uses an image generator as a tool, and we've seen that, but this deeper integration gives you much higher fidelity to the original, and you can do text and image prompting in a very natural, integrated, cohesive way. So I've been wondering, I assume we're going to see it, but on what timescale do you think we see that kind of thing in the natural sciences, and specifically in biology, where you might say to the model not, "Hey, language model, you can call this protein model as a tool," but rather, "You are both. And what I want, working from this protein as an example, is another protein that could do the same thing," and have that all be understood in the same set of weights? What do you think the outlook is for that sort of system?

Jassi Pannu: I agree that that is where the field wants to go, and it would be really useful to be able to develop that. I feel like there are a couple of bottlenecks along the way. One is the data generation piece; that is still something that requires scaling in the physical world and doing experiments in the physical world, and that itself will be bottlenecked by advances in robotics. If we were to suddenly see robotics speed up, and we're able to do a lot more laboratory work autonomously using robotics, then the overall picture in terms of data generation will also advance. So I guess I'm hedging; I'm not giving you a timeline. Perhaps in the next five to 15 years these are the kinds of advancements we would expect.

Nathan Labenz: You said we have huge amounts of just raw sequence data, but we're short on the causal graph, if you will, of "I did this and this resulted." Presumably a lot of that is locked up at pharma companies that have done some of this stuff, or even just in the clinical data. Do you think that if we had full access to all the data that exists, regardless of who owns it, how it was created, and where it's sequestered, we would already have enough data for that kind of thing to happen? Are we essentially recreating, for IP or privacy or whatever reasons, something we already have as a society? Or would you say no, not really, the clinical data is too messy, and maybe pharma doesn't have it?

Jassi Pannu: I would say pharma definitely has lots of highly valuable data that they do not share in the public domain, for reasons that are obvious, and pharma is actually using those internal data sets to develop proprietary models in-house; they're certainly trying to do that. If we were to suddenly wave a magic wand and the government said everyone has to play nice and share their data sets, how far would we get? I think we would get a little bit further, but these feedback loops and this new way of generating data are fundamentally a different approach. You can think of the existing way of approaching biology as a bit artisanal and a bit observational, and that results in data sets being messy, having a lot of bias, and being hard to work with. What we really need to do is shift towards a much more systematic approach, where we are generating data that is comprehensive and systematically probing every single aspect. That's where you really need the robotic aspect, to scale that data and replicate it carefully, rather than having multiple different humans trying to run the protocol; there are always differences between them when they're collecting data. So transitioning to a systematic approach enabled by robotics, that is not something you would get just by enabling data sharing across private companies.

Nathan Labenz: Gotcha. Okay, let's pop out of that rabbit hole and come back to the main topic. So we've got language models that can tell people things they maybe shouldn't know, and they can increasingly use all kinds of tools, including design tools. It's not clear to me at this point how well they could use something like an Evo 2. But when we think about those models and their capabilities, how do you think about which capabilities should not be created in the first place? I guess it's probably multiple things, but maybe I'll just leave it there. What capabilities do you think those different kinds of models should not have, in order to reduce the risk to society broadly?

Jassi Pannu: I think that the fundamental challenge of biology is that a lot of these capabilities would be useful on the defensive side, but it's when they're used offensively that they pose concerns. So it's the question of what capabilities we want, but it's also the question of what capabilities we should provide access to, and how broadly, and when. When we're thinking about advancing the future of AI for biology, the way I like to think about it is that we should really try to step on the gas for things that are clearly good and clearly do not have a lot of risks. Things in that bucket, to me, are things like virtual cell models, or ways of advancing clinical trials, or ways of making sure we can do better countermeasure manufacturing and distribution. So there are lots of things we could do that are clearly beneficial. And then there is a bucket of things where, if that capability were broadly accessible right now, it would be quite destabilizing. This is just a hypothetical example, and I'm not trying to say this is the current state of capabilities, but let's say there was a breakthrough where suddenly it's very easy to use an autonomous robotic system that's quite cheap to build or get access to, and that robotic system can very quickly synthesize a pathogen. That's obviously a futuristic scenario, but if it were possible, that's something where we probably wouldn't want anyone to be able to buy that device off the shelf. We'd want to know who has access to that device and what they're using it for. Other things that are trending into the more concerning capability bucket would be things like viral design. Even there, there are considerations: we know that gene therapy based on viruses is actually an advance that we would love to see. Or other purposes for viral design, like designing bacteriophages, which are viruses that only infect bacteria; they don't infect humans. But the challenge is that when it comes to artificial intelligence, a lot of the approaches are general purpose. So if it becomes quite easy to have an AI model that can design a bacteriophage, then the question is, well, it seems quite easy to repurpose that for pandemic human pathogens. And how many people do we actually want to have access to that kind of capability? It's probably a subset of legitimate researchers. It's not something that you would want widely accessible on the internet, especially in a world where we don't have as easily accessible countermeasures. It really becomes an offense-dominant capability: the design and acquisition of a pathogen becomes easy, facilitated by AI, and is uplifted in the digital world, while our countermeasures to a pandemic remain very much bottlenecked in the physical world. And that's a world that becomes very offense dominant.

Nathan Labenz: Even with things like a whole-cell model, would I be right to worry? Obviously one of the things you would want to do with a whole-cell model is throw stuff at it and see what happens to the cell, right? If you have that, there are all kinds of great things you might be able to do, but then presumably you could also pick your virus of choice and start evolving it in whatever way you want, potentially just brute-force throwing all these little permutations of a virus at the whole-cell model. It strikes me that these things are vulnerable to that kind of brute-force attack. If they're going to be good, they're going to be vulnerable to it. Is that right? Or is there any way around that conclusion?

Jassi Pannu: Yeah, you are embodying the debates that people constantly have in the biosecurity community. And I think that what you're saying is correct. For every kind of biomedical advance, especially in the AI domain, there are ways of envisioning how it could be used for harm. And because of that, you have to think about how direct the harm pathway is and how consequential the ultimate harm would be, and you have to try and draw a line somewhere. Compare, for example, a generative language model that had no data filtering, no data exclusion, was highly performant on viral genome design, and could do that for human pandemic pathogens. There, the pathway to harm is quite direct: someone who has very limited biology knowledge could use that model to generate thousands of potential designs, synthesize and test them in the wet lab, see which one is the optimal candidate, and then use that candidate. I mean, there's still a lot of work going into that, but it's a direct pathway. What you're describing, though, would require plugging in multiple different AI models: generating candidates with one model, then running them through a different predictive model, seeing the consequences, trying to figure out which viral candidate would cause, for example, a systemic inflammatory response, or would target certain organ cells. So I think there is a pathway to harm there; it's just that when you game it out, there are more steps involved and more expertise required. And then the ultimate question you ask yourself is, what kind of actor would choose that pathway over an existing weapons pathway? If it really requires high-level expertise that only a nation state has access to, is that nation state really choosing a biological weapon, or are they more likely to choose something else that a nation state would have access to? Those are the kinds of questions that security professionals try to game out.

Nathan Labenz: Okay, maybe let's get to the proposed solution. I've been coming at this from a ton of different angles. You've got a whole taxonomy of five levels of biological data. Obviously this is inspired by, or at least pattern-matched to, the levels of security around bio facilities. So maybe just take us through the levels, zero to four: what are the kinds of data that fall into these different levels? What would the access look like? What would the precautions look like? Paint a picture of the world that you envision.

Jassi Pannu: Great. I'll try and paint a somewhat visual picture. So for those that are imagining this, this is a five-tiered system that goes from level zero to four. And this is modeled on what some of you may be familiar with, which is biosafety levels, or BSL levels. These are the famous safety levels that biological laboratories use to determine, do I have to wear a spacesuit when I go into the lab, or can I just use a fume hood to deal with my samples? They determine the different containment approaches that are required. And actually, the BSL system was the basis for a lot of the frontier safety policies that different frontier AI labs have: the idea of having a tiered system, with mechanisms for control, roughly on four levels. But what we're proposing here is applying this not to the model and not to the physical pathogen, but rather to data. And the reason we're proposing this is that the entire biomedical research ecosystem, when it comes to AI, is built on open source models. Academics build open source models, they share those models openly, and other researchers manipulate and change those models. There are a lot of benefits to that whole open source ecosystem, and those benefits are what have prevented security approaches from being applied to models; it seems like that's going to be a pretty intractable route. So we were looking for a different approach, one that could preserve this open source model ecosystem while not distributing capabilities that would be particularly concerning, like viral design. That's how we ultimately settled on biological data. Biological data, especially the kinds that I described, the more functional data, is expensive to produce, requires a wet lab, and requires expertise to produce. That's why it's a potentially useful choke point. The tiering system we describe would preserve the vast majority of biological data as fully open access. That's what we're calling BDL0, where most data would be available to researchers. But as you go from levels one, two, three, up to level four, you have increasing levels of control based on how potentially concerning the data is. The way it's essentially broken up is that BDL1 is data that would allow you to infer viral patterns. It's pretty basic security: just requiring an account and understanding who the person is and whether they're a legitimate researcher. As you go up, you're getting more and more focused on properties of pandemic pathogens that would really directly lead to harm. These are the properties I mentioned to you, the ones that governments globally already pay attention to for wet lab research: when you're making a pathogen more transmissible; when you are making the host range larger, so it can infect more animals, or move from infecting only an animal species to also infecting humans; when you are manipulating the pathogen so that it evades the immune system. These are the kinds of properties where, if your data set has data directly linking those properties to pathogens, then it requires things like use approval. For example, you would go to the repository and say, I intend to do this kind of model development based on this data for this purpose. If you're a legitimate researcher with a good purpose for doing this, you will get approval, versus just having this data openly accessible for anonymous access.
So maybe I'll pause there.
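For readers who like to see structure as code, here is a minimal sketch, in Python, of the tiering logic just described. The BDL0 through BDL4 labels follow the conversation, but the exact controls attached to each tier below are illustrative assumptions for this sketch, not the paper's definitive requirements.

```python
from enum import IntEnum

class BDL(IntEnum):
    """Biosecurity Data Levels as described above: five tiers, 0 through 4."""
    BDL0 = 0  # vast majority of biological data: fully open access
    BDL1 = 1  # data allowing inference of viral patterns: basic account + identity
    BDL2 = 2  # more sensitive functional data (controls here are an assumption)
    BDL3 = 3  # data directly linking pathogens to pandemic-relevant traits
    BDL4 = 4  # most concerning functional data: per-use approval, secure hosting

# Illustrative control sets per tier; only the account/identity requirement at
# BDL1 and the use-approval requirement at BDL3+ come from the conversation.
REQUIRED_CONTROLS: dict[BDL, set[str]] = {
    BDL.BDL0: set(),
    BDL.BDL1: {"account", "identity_check"},
    BDL.BDL2: {"account", "identity_check", "institutional_affiliation"},
    BDL.BDL3: {"account", "identity_check", "institutional_affiliation",
               "stated_use_approval"},
    BDL.BDL4: {"account", "identity_check", "institutional_affiliation",
               "stated_use_approval", "secure_environment"},
}

def may_access(tier: BDL, satisfied: set[str]) -> bool:
    """True if a requester satisfying `satisfied` controls may access `tier`."""
    return REQUIRED_CONTROLS[tier] <= satisfied

# Example: a verified researcher with an approved use case, outside any
# secure hosting environment.
requester = {"account", "identity_check", "institutional_affiliation",
             "stated_use_approval"}
print(may_access(BDL.BDL3, requester))  # True
print(may_access(BDL.BDL4, requester))  # False: needs secure environment
```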

Nathan Labenz: How would you describe the magnitudes of those? Is the outer ring, BDL0, something like 99% of the actual raw data? And how small does it get when you get up to the uppermost levels?

Jassi Pannu: Yeah, I would say 99% is a pretty good guess. We don't actually have numbers to base this on, because there isn't a comprehensive tracking system for these kinds of data sets. But my guess, based on the research that we've done, would be that the highest security tier, the BDL4 tier, is a very, very small subset of all data, and it would be a very small number of specialized virology labs, for example, that would be affected. And frankly, those labs are probably already limiting access to the data sets they produce in some way; it's just not a formalized system. So BDL0, again, is the vast majority of data. We're talking about petabytes and petabytes of data, the vast majority being uncontrolled. The controls that we're proposing are on a very narrow slice of data, particularly getting up to BDL3 and 4, with, I would say, perhaps dozens or fewer laboratories affected. I am making that statement based on my knowledge of the field, but there probably needs to be some more comprehensive effort to figure out who exactly is generating this kind of data. And the visibility bottleneck that we currently have is what's happening in the private ecosystem: there's pretty good visibility into government-funded work and less so on the private side.

Nathan Labenz: So in terms of the impact this would have, one of the things I thought was really interesting in reading the recent paper calling for these kinds of controls was the report, which I hadn't realized, that there has been data holdout work done on a couple of the leading models, ESM3 and Evo 2 specifically. Could you talk us through a little bit of what that has looked like? And do put in a good word for me with Alex Rives, please, to get him on the show; I've tried. We did do one with Brian Hie on Evo, so I have a general sense of what that looks like, but I didn't get into what was held out, how much of the overall data, and how that affected performance in areas of concern. Did it affect other things too? I assume one thing people would be really worried about here is, I don't want to have a dumb model in general, right? If I slice out this data, what does that mean in terms of what it can't do that I want it to not be able to do? But also, are there things it can't do that I would wish it still could? Are we paying any costs along with the benefits?

Jassi Pannu: Yeah, it's an important question. I think we all have an intuitive sense that AI model capabilities are based on the data the model is trained on. That intuitively makes sense to us, but the degree to which it's true is an empirical question. And especially in biology, there is reason to ask: if I were to remove a very small subset of data, could my model just interpolate around that gap? If your model has really internalized a fundamental understanding of the laws that govern biology, does it really matter if you start segmenting out small pieces of data? So this was an empirical question, until, as you said, some of the leading biological AI model developers actually went about testing it. I'll just describe the two examples you mentioned: ESM3, which is a generative protein design model made by EvolutionaryScale, and Evo 2, which is a generative DNA language model from Brian Hie's group. Both of those groups had decided that they wanted to share their models, but they didn't want to disseminate the capability of using their models for viral design. The way they went about limiting that capability was limiting what went into the training data. I'll just speak about Evo 2, for example, given that I was involved in that work: we decided to remove the sequences related to viruses that could infect humans and viruses that could infect eukaryotic organisms, but there was still some information related to other types of viruses, for example those that infect bacteria, included in the training data. After doing the big pre-training run, we then did some evaluations where we actually checked to see whether the model was limited in its capabilities on certain tasks. A common task people use in this case is looking at how well the model can do certain viral protein-related tasks, and how well the model can generate sequences that correspond to functional viruses. Those were all things that the team checked, and they showed that the model's capabilities there were significantly worse than on other domains. Those evaluation results were all published as part of the Evo 2 manuscript. The interesting thing about the work the EvolutionaryScale team did for ESM3 was that they actually had both versions of the model. This wasn't something the Evo 2 team did, but the EvolutionaryScale team had both the model trained on all the data they had chosen to include, as well as the data-filtered model, and they were able to show a delta in performance on the same tasks, for example viral protein function prediction. So yeah, that's an inkling of some of the empirical work that has been done, and could be done in the future, to suss out how much it matters when you remove these kinds of data, and what particular kinds of data matter. These are all questions that could probably be explored a lot more.
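To make the mitigation concrete, here is a minimal sketch of taxonomy-based training-data exclusion in the spirit of what's described here: drop sequences from viruses that infect humans or other eukaryotes before pretraining, while keeping bacteriophage sequences. The `host=` FASTA-header tag and the host categories are invented conventions for illustration; the actual Evo 2 pipeline relied on curated taxonomy annotations and is not reproduced here.

```python
# Sketch of taxonomy-based pretraining-data exclusion. Assumes each FASTA
# header carries a host annotation like ">seq123 host=human", a made-up
# convention for illustration; real pipelines use curated taxonomy databases.

EXCLUDED_HOSTS = {"human", "eukaryote"}  # keep phage (host=bacteria) records

def filter_fasta(in_path: str, out_path: str) -> int:
    """Copy records whose host is not excluded; return the count removed."""
    removed = 0
    keep = True  # applies to the current record's header and sequence lines
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            if line.startswith(">"):
                fields = dict(
                    kv.split("=", 1) for kv in line[1:].split() if "=" in kv
                )
                keep = fields.get("host", "unknown") not in EXCLUDED_HOSTS
                removed += 0 if keep else 1
            if keep:
                dst.write(line)
    return removed

# Usage (paths are placeholders):
# n = filter_fasta("genomes.fasta", "genomes.filtered.fasta")
```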

Nathan Labenz: Could you give a sense of the order of magnitude of the capability reduction? Are we talking it just can't do that stuff at all anymore, or is it in the uncanny valley somewhere? I honestly don't have an intuition for what I should expect.

Jassi Pannu: Yeah, yeah. I can say, for example, in the case of Evo 2, it was much more along the lines of the model's performance being effectively random, rather than just reduced by a small amount.

Nathan Labenz: Cool. That's great news. I love it when something works. Okay, let me zoom out and take stock of all this. So we've got a ton of new data coming online. We want to facilitate sharing, we want to facilitate all this discovery, but we've got to have some of these different classes of data controlled so that models aren't trained on them and disseminated. Presumably this also means the models themselves, like the ones trained by the ESM team and the Evo team, also need to be controlled. And I guess we shouldn't be expecting government to do this: if we don't have government controls on actual wet lab gain-of-function research, it doesn't sound like we're going to get government control on this sort of thing. So what are we doing? Are we campaigning and trying to build private agreement and consensus? Is that the play, and how's it going?

Jassi Pannu: Yeah, great questions. Just to take a step back on what happens with the model and whether the models should be controlled: in the case of Evo 2 and ESM3, they had effectively neutered the concerning capability from their models, and that's why they felt more confident in disseminating them. Ultimately it made sense to share those models openly, because they had worked to reduce the concerning capability. What I was more so referring to is that if you do end up implementing data controls as we propose, and you get access to BDL3 or BDL4 data for the purposes of training a model, then what you wouldn't want is for that model to then be shared publicly, because effectively your mitigation didn't really do anything. We do make sure to mention that any AI model trained on that kind of secure data should also be shared in a secure way, if it needs to be shared. But yeah, in terms of what should be done next, the interesting thing about data controls is that there's a lot of policy precedent. We already do this in other domains: we do it for privacy-related data, we do it for human genomics data. So there's reasonable precedent for extending that same approach. My collaborators and colleagues who worked on this are optimistic that perhaps it could get picked up by policymakers, but of course it's a new concept, and so we'll keep plugging away at it. And then more broadly, in terms of what will happen with wet lab gain-of-function work and model capabilities, I think that actually both the Republicans and the Democrats have decided this is a bipartisan issue. It was under both the prior Biden administration and the Trump administration that some work on advancing regulation of wet lab gain-of-function work went ahead, so I was really glad to see that progress. And I think both the US CAISI and the UK AISI are doing really great work with regards to biosecurity and biological model capabilities. Speaking about CAISI, given that I'm more familiar with their recent work: they have put out RFIs, requests for information, to the scientific community to better understand this. We know that developers are doing some of these data filtering steps and trying to limit model capabilities; maybe we should have a more systematic approach to figuring out what all the capabilities are that we should be concerned about, how we can test for those capabilities, and how we effectively mitigate them. So I would just love to see CAISI be resourced and staffed to advance that line of work, because a lot of developers are trying to do it themselves in an ad hoc way, but they don't have the same security access and intelligence access that something like CAISI has.

Nathan Labenz: Do you envision a sort of centralized setup? I'm still a little bit fuzzy on the workflow of something like this. If, for whatever reason, I'm out here generating some sensitive data that maps viral sequence onto viral capability, and now I'm like, okay, I've got this data, I want to be a good citizen: what do I do? Are you envisioning a scenario where I take it to a central data bank, give it to them, and then scrub it from my computers and go on with my life? Because everybody can't be doing these sorts of security levels for themselves, right? So there would have to be some, maybe not totally centralized, but at least relatively small, countable number of organizations or entities that would have custody of the data. Is that the idea, that people would feed it in and then be expected to actually delete it from their own servers? Is it realistic to expect people would do that? How do you expect that to actually go?

Jassi Pannu: Yeah, yeah. This is all about how you operationalize the system we're proposing. The way this has been done in the past is through things called trusted research environments, or TREs, and OpenSAFELY, the system that I mentioned at the beginning, is one of these. The idea is that you have a secure environment that hosts the data and also enables researchers to bring their code to the data and answer questions. And obviously, in the age of AI, the ideal would be to have compute resources attached to that, although we'll see if that's possible. We considered different approaches to setting this up. One approach could be that the government does it; after we looked into it, that seemed less ideal, since a lot of researchers have been unhappy with some of the trusted research environments that governments have set up, and I think there are private actors that could probably do a better job of building really savvy trusted research environments that work well for researchers. We have some examples of this in the paper. So what we suggest is that institutions, not individual researchers, but institutions like universities or larger research collaborations, set up a trusted research environment if they want to do this kind of data generation. They would be given standards that the government would set, but they would be in charge of actually building and maintaining it, just because they're probably going to be better at doing that. The interesting thing about a trusted research environment is that you could actually imagine it being a boon for researchers. When you're trying to do AI research, what you want is a centralized, integrated platform where all the data sits and you have access to it all at once; it's much less useful to have individual data sets housed with individual researchers. So if we really wanted to step on the gas on countermeasures research for pandemic pathogens, maybe this is something we would want anyway: an integrated system that hosts all the data and makes it easy for researchers to interact with. What we're proposing to layer on top of that is some security that goes along with that system. So not only are you hopefully getting the benefits of the integrated platform, you're also getting the security controls, given the sensitive nature of the data.
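As a toy illustration of the "bring your code to the data" pattern that trusted research environments implement, the sketch below gates an analysis on explicit approval and returns only an aggregate result, never the raw data. All names and the approval flow are assumptions for illustration; real TREs such as OpenSAFELY rely on genuine infrastructure-level isolation and output vetting, not anything this simple.

```python
from dataclasses import dataclass

@dataclass
class AnalysisRequest:
    researcher: str
    stated_purpose: str
    code: str              # analysis code submitted to run inside the enclave
    approved: bool = False

class TrustedResearchEnvironment:
    """Toy model of a TRE: sensitive data never leaves; vetted code comes to it."""

    def __init__(self, dataset: list[float]):
        self._dataset = dataset  # held only inside the enclave

    def approve(self, req: AnalysisRequest, reviewer_ok: bool) -> None:
        req.approved = reviewer_ok  # stands in for human/institutional sign-off

    def run(self, req: AnalysisRequest):
        if not req.approved:
            raise PermissionError("analysis not approved for this dataset")
        # Real TREs isolate execution at the infrastructure level and vet
        # outputs for disclosure risk; exec with a tiny allowlist is only a
        # stand-in for "code comes to the data, aggregates come back out."
        scope = {"data": list(self._dataset), "sum": sum, "len": len,
                 "result": None}
        exec(req.code, {"__builtins__": {}}, scope)
        return scope["result"]  # aggregate result out; raw data stays inside

tre = TrustedResearchEnvironment(dataset=[3.1, 2.7, 4.4])
req = AnalysisRequest(researcher="dr_example",
                      stated_purpose="countermeasure target triage",
                      code="result = sum(data) / len(data)")
tre.approve(req, reviewer_ok=True)
print(round(tre.run(req), 2))  # 3.4
```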

Nathan Labenz: Gotcha. When it comes to monitors, I know you alluded to some rare points of cross-administration agreement. One of those, I understand, is the insistence, and I don't know if it's fully a requirement, but I think it's verging on one, that DNA synthesis companies apply certain classifiers to try to detect if somebody is trying to get a harmful sequence synthesized through their company. There are also things like wastewater monitoring. And I have a couple of questions on this. One is, is there an equivalent of wastewater monitoring for data? Would it make sense? I don't know if the data of concern is the kind of thing that could be identified if it's just hanging out on somebody's lab website; people might not even fully realize they've generated problematic data. So is there any system you could imagine that goes around and identifies data that's out there and classifies it as possibly harmful? Or is that so difficult a classification problem that it would be doomed from the start?

Jassi Pannu: It's a really interesting concept. I have not thought about it too much, to be honest. There are two approaches to figuring out what the data landscape is for this kind of data. One is a more active, contribution-based approach, where the government says, if you think you're creating this data, you have to actively contribute it to these repositories. What you're describing is a passive approach, where no one has to take any individual action: some system is flagging the data, and perhaps even automatically collects it and then scrubs it from its original source. That would be the dream. I'm not aware of any system like that, and I think it would probably take a lot of work to build. The default approach has been the first one, voluntary or active contribution by the researchers who are generating the data. But what you're describing would be cool, and it would essentially be the analogy to wastewater surveillance.
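Purely as a thought experiment, since the guest notes no such system exists, a passive scanner might look for the co-occurrence that makes functional data sensitive: sequence-like fields paired with pandemic-relevant phenotype labels. Everything in this sketch, column-name patterns included, is hypothetical.

```python
# Hypothetical heuristic for flagging possibly sensitive functional datasets:
# flag tables that pair sequence-like columns with pandemic-relevant phenotype
# labels. All column-name patterns below are invented for illustration.
import re

SEQ_COL = re.compile(r"sequence|genome|fasta", re.I)
PHENOTYPE_COL = re.compile(r"transmiss|virulen|immune.?evas|host.?range", re.I)

def flag_dataset(column_names: list[str]) -> bool:
    """True if the table appears to link sequences to concerning phenotypes."""
    has_seq = any(SEQ_COL.search(c) for c in column_names)
    has_pheno = any(PHENOTYPE_COL.search(c) for c in column_names)
    return has_seq and has_pheno

print(flag_dataset(["strain_genome_fasta", "transmissibility_index"]))  # True
print(flag_dataset(["strain_genome_fasta", "colony_color"]))            # False
```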

Nathan Labenz: When it comes to the quality of monitors in general, a big thing I always think about in the rest of the world, as it pertains to AI, is that we're in this weird in-between phase where things are coming online but the world hasn't really reacted much yet. For example, AI agents are coming online, and there's a lot of concern around things like prompt injection, but the world hasn't really become a very dangerous place for an AI agent so far, right? Not that many people have set up honey traps or prompt injection attacks to try to throw my OpenClaw off its path and talk it into doing something they want it to do. I assume that's going to happen a lot more, and there's going to be an arms race between techniques to prevent my OpenClaw from falling for it and ever more sophisticated jailbreaks. I assume there's got to be something analogous in the biological domain. For starters, these classifiers that the DNA synthesis companies are running: I'm guessing nobody has really tried to evade them yet. I wonder if you see that kind of dynamic developing on the horizon. Nicholas Carlini is a past guest who emphasizes that the attacker usually gets to act last: the defenses are set up, and then the attacker has the advantage of knowing what they're up against. Maybe not always, but often. Do you see any of those dynamics now, or do you worry about them in the future? And if you extrapolate them out, do you see us getting to a place where we have a high level of confidence that we're in a defense-dominant world and we're going to be able to keep all this stuff under control? Or is that itself still a very open question for you?

Jassi Pannu: Yeah, I have some ideas, but let's start with gene synthesis screening. There are a few different attack surfaces, I'll just describe them as that, and the first one is gene synthesis providers. What you're describing is a system that is currently voluntary, where gene synthesis providers, companies that make pieces of DNA and sell them to researchers, have implemented systems to check that someone didn't just order pieces of smallpox or pieces of Ebola. It's a twofold mechanism. The first is an automated mechanism that is essentially sequence matching: does the order match a sequence from Ebola? Does it match a sequence from smallpox? If the automated system thinks there is some degree of concern, it gets kicked up to a human expert, who then looks at it and determines whether it's something of concern or not. What's happening in parallel is some degree of KYC, know your customer: who did the order come from? Is it a researcher? Does this researcher work with these kinds of pathogens all the time? Have we spoken to them before, and has this issue come up before? Those are all the kinds of questions addressed as part of gene synthesis screening. Gene synthesis screening is something that 80% of companies already implement, and over the past years it's become much more cost-effective to do. Initially there was a bit of a cost barrier, and they did a lot of work to make this a cost-effective system, because you can imagine that if it's being run on every single order that comes in, it has to be cheap enough; otherwise it's really burdensome for these companies. So currently it's a voluntary system, and like I said, the vast majority of companies already do it. But the concern is, if I'm a bad actor, a voluntary system that 80% of companies use isn't really going to stop me from obtaining the sequences I want, because I'll simply go to the companies that don't do the screening. So what's now being advanced is the idea of making this a mandatory rule that all companies have to follow. That would really limit the access someone has to physical specimens that they might try to turn into an infectious pathogen. But in this world of where the research is going, and what capabilities people want to achieve, the dream of where the future of biology is headed is that I, as a biologist, no longer even have to step foot in the lab. I have my autonomous cloud lab that I can fully control; I'm sitting at home using Claude Code to help me, and I can just program some designs to be run on certain pathogens. To be completely frank, we're nowhere close to this world. It's still going to take a lot of work. Right now, the cloud labs you hear about still require a lot of human input. It might not be specialized biologist input, but there is still a person picking up samples from one bench and moving them to another, and that's a real bottleneck.
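Here is a minimal sketch of the two-stage screening flow just described: an automated sequence match against a database of sequences of concern, with any hit escalated to a human expert alongside know-your-customer context. The exact k-mer matching and the inline toy database are stand-ins; real screeners use alignment against curated, regulated databases, and the sequence below is made up.

```python
# Two-stage gene synthesis screening: automated sequence matching first,
# human review plus KYC context on any hit. Exact k-mer overlap and the toy
# "database" stand in for real alignment against curated databases of
# sequences of concern; the sequence here is invented.

K = 20  # matching window length, an illustrative choice

SEQUENCES_OF_CONCERN = {
    # placeholder entry; real systems match against regulated pathogen genomes
    "toy_agent_A": "ATGCGTACGTTAGCCGATAGGCTTACGATCGATCGTAGCTAG",
}

def kmers(seq: str, k: int = K) -> set[str]:
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def automated_screen(order_seq: str) -> list[str]:
    """Stage 1: names of concern sequences sharing any k-mer with the order."""
    order_kmers = kmers(order_seq.upper())
    return [name for name, ref in SEQUENCES_OF_CONCERN.items()
            if order_kmers & kmers(ref)]

def screen_order(order_seq: str, customer: dict) -> str:
    hits = automated_screen(order_seq)
    if not hits:
        return "clear"
    # Stage 2: escalate to a human expert, attaching KYC context.
    known = customer.get("verified") and customer.get("works_with", set()) & set(hits)
    return "human_review:likely_legitimate" if known else "human_review:flagged"

customer = {"verified": True, "works_with": {"toy_agent_A"}}
print(screen_order("ATGCGTACGTTAGCCGATAGAAAA", customer))  # escalated, likely ok
print(screen_order("GGGGCCCCTTTTAAAAGGGGCCCC", customer))  # clear
```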
But in a world where you do have a fully remote, highly sophisticated cloud lab, you can imagine that if I'm a bad actor and I have an army of a thousand agents that are just trying to hack their way into that cloud lab, you want to make sure that if your cloud lab has sophisticated capabilities related to pathogen creation and design, you have some cybersecurity around it. So that's a much more future-oriented thing, but something you could imagine becoming applicable later. And right now, there's a fundamental information infrastructure layer that we're missing, both for cloud labs and for gene synthesis screening. If I, as a bad actor, try to obtain sequences of Ebola in small pieces from one company and a few pieces from a different company, splitting up my order across different companies, there isn't some kind of system where all those companies are easily communicating that information to each other and checking those orders against each other in real time. This is bottlenecked by information sharing between companies. That's also a similar concern to why the Frontier Model Forum was created: you want private companies to be sharing security-related information, and you need a legal infrastructure for that. And yeah, who would house that? Would it be the FBI? Who's facilitating this? These are all policy questions that need to be addressed. But yeah, perhaps we can end on a more optimistic note.
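The cross-provider gap described here could, in principle, be closed by a shared registry that aggregates partial matches per customer across companies. The sketch below is a deliberately naive version of that idea: a real system would need hashed identifiers or secure computation for privacy, plus the legal infrastructure discussed above, and every interface here is invented.

```python
# Naive sketch of cross-provider order aggregation: each company reports
# partial matches per customer to a shared registry, which flags customers
# whose combined fragments cover too much of a sequence of concern. Purely
# illustrative; privacy-preserving matching and legal cover are prerequisites.
from collections import defaultdict

class SharedScreeningRegistry:
    def __init__(self, coverage_threshold: float = 0.5):
        self.threshold = coverage_threshold
        # (customer, agent) -> set of covered positions in the agent's genome
        self._coverage: dict[tuple[str, str], set[int]] = defaultdict(set)

    def report(self, customer: str, agent: str, start: int, end: int,
               genome_len: int) -> bool:
        """Called by any provider when an order partially matches `agent`.
        Returns True if this customer's aggregate coverage is now alarming."""
        self._coverage[(customer, agent)].update(range(start, end))
        frac = len(self._coverage[(customer, agent)]) / genome_len
        return frac >= self.threshold

registry = SharedScreeningRegistry()
# The same customer splits an order across two companies:
print(registry.report("cust42", "toy_agent_A", 0, 400, genome_len=1000))    # False
print(registry.report("cust42", "toy_agent_A", 350, 900, genome_len=1000))  # True
```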

Jassi Pannu: I'm happy to give my vision of what a defense-dominant world looks like and see what you think of it. Overall, I break this up into four broad buckets of interventions. Well, perhaps to take a step back: people often ask, what's the theory of victory here? How do we have a single unified strategy, like our strategy for nuclear deterrence? We have a single unified theory of victory for nuclear deterrence that has served us well for decades; how do we get there for biology? After having thought about this for some time, it just feels like biology is very different. It's a distributed technology. It's dual use. You want to give a lot of people access to it while trying to limit a subset. All these aspects make a unified theory of victory seem much harder to reach. So I think the most successful approach is likely to be defense in depth: a layered approach, multiple different defensive strategies applied at once. The four buckets I divide this into are delay, deter, detect, and defend. Delay is essentially limiting access to concerning capabilities. Gene synthesis screening would fall into that bucket: you're delaying the dissemination of the capability to get access to DNA fragments, for example. You could also imagine what we're describing for our data controls as part of delay. Then there's deterrence. Deterrence is figuring out how you can punish someone for using a biological weapon. We do live in a world where there's an international treaty against biological weapons, and I'm glad we live in a world with that treaty rather than without it, even if overall the treaty is on the weaker side in terms of the actual mechanisms we have to ensure people are complying with it. Where deterrence breaks down is if your actor is not rational and doesn't respond to typical punishment mechanisms. So that's a challenge. Then there's detection, which is also something we talked about: how do we get a distributed, passive surveillance system, like our radar system for ICBMs, that will just detect when there's a new pathogen without us having to go out and look for it, perhaps even for pathogens where there are no symptoms? When HIV was spreading early on, it would have been amazing to have known much earlier, and that's particularly important for pathogens that take a long time to produce symptoms in patients. So, some kind of global surveillance system; people often refer to this as bioradar or biothreat radar. It's not something we currently have. And then the last pillar is defense: what are our defenses once a pathogen is already circulating? People often think of defenses as things like vaccines and countermeasures, but I would encourage folks to be much broader in what they envision defenses to be. I'm sitting in my home right now, and I actually have defenses all around me. I'm drinking water that's been centrally filtered; I know there's no cholera, no pathogens in that water. I have screens on my windows; mosquitoes can't get through them, so I know I'm not going to get malaria even if there were malaria outside. There are already a lot of public health defenses built into our environment, but we don't have that for airborne transmission. And so one thing being explored, by organizations like Blueprint Biosecurity, which comes to mind, is built environment defenses.
They are exploring built environment defenses to sterilize the air, using approaches like far-UV light and glycol vapors. Could you passively sterilize the air so you don't even have to detect the pathogen? You wouldn't even need a vaccine; you'd just always know that you have passive protection around you. So I would say that's a pretty comprehensive approach, if we managed to do all of it. It would take a lot of work and investment to get there, but it would make us a lot safer.

Nathan Labenz: Cool. A brief shout-out to the Aerolamp folks for my son's hospital room, because he was obviously going to be severely immunocompromised much of the time, and we thought, what can we do? The cancer is obviously the biggest concern, but infection, when you're that compromised, is another thing that poses a real risk to kids in that position. So between the HEPA filters we bought and installed around the house and the Aerolamp dev kit, which we have mounted above his hospital bed every time he's been in the hospital, we've hopefully taken some of the risk of him getting any kind of infection off the table while he's going through all this. But yeah, that's a great vision. I hope we implement it. You are doing God's work by spending your precious time and energy on this. Anything else we didn't touch on, or any other calls to action or ways people can help you, that you'd want to leave people with before we break?

Jassi Pannu: I think we covered everything and really appreciate your interest. It sounds like you're one step ahead of everyone else in terms of already getting your kid outfitted with all the defenses he needs. So, yeah, it's great to chat with you.

Nathan Labenz: Jassi Pannu, thank you for being part of the Cognitive Revolution.

