Automating Scientific Discovery, with Andrew White, Head of Science at Future House



Read Episode Description

In this episode of The Cognitive Revolution, Nathan interviews Andrew White, Professor of Chemical Engineering at the University of Rochester and Head of Science at Future House. We explore groundbreaking AI systems for scientific discovery, including PaperQA and Aviary, and discuss how large language models are transforming research. Join us for an insightful conversation about the intersection of AI and scientific advancement with this pioneering researcher in his first-ever podcast appearance.

Check out Future House: https://www.futurehouse.org

Help shape our show by taking our quick listener survey at https://bit.ly/TurpentinePulse

SPONSORS:
Oracle Cloud Infrastructure (OCI): Oracle's next-generation cloud platform delivers blazing-fast AI and ML performance, with 50% less cost for compute and 80% less for outbound networking compared to other cloud providers. OCI powers industry leaders with secure infrastructure and application development capabilities. New U.S. customers can get their cloud bill cut in half by switching to OCI before December 31, 2024 at https://oracle.com/cognitive

SelectQuote: Finding the right life insurance shouldn't be another task you put off. SelectQuote compares top-rated policies to get you the best coverage at the right price. Even in our AI-driven world, protecting your family's future remains essential. Get your personalized quote at https://selectquote.com/cognit...

Shopify: Shopify is the world's leading e-commerce platform, offering a market-leading checkout system and exclusive AI apps like Quikly. Nobody does selling better than Shopify. Get a $1 per month trial at https://shopify.com/cognitive

CHAPTERS:
(00:00:00) Teaser
(00:01:13) About the Episode
(00:04:37) Andrew White's Journey
(00:10:23) GPT-4 Red Team
(00:15:33) GPT-4 & Chemistry
(00:17:54) Sponsors: Oracle Cloud Infrastructure (OCI) | SelectQuote
(00:20:19) Biology vs Physics
(00:23:14) Conceptual Dark Matter
(00:26:27) Future House Intro
(00:30:42) Semi-Autonomous AI
(00:35:39) Sponsors: Shopify
(00:37:00) Lab Automation
(00:39:46) In Silico Experiments
(00:45:22) Cost of Experiments
(00:51:30) Multi-Omic Models
(00:54:54) Scale and Grokking
(01:00:53) Future House Projects
(01:10:42) PaperQA Insights
(01:16:28) Generalizing to Other Domains
(01:17:57) Using Figures Effectively
(01:22:01) Need for Specialized Tools
(01:24:23) PaperQA Cost & Latency
(01:27:37) Aviary: Agents & Environments
(01:31:42) Black Box Gradient Estimation
(01:36:14) Open vs Closed Models
(01:37:52) Improvement with Training
(01:40:00) Runtime Choice & Q-Learning
(01:43:43) Narrow vs General AI
(01:48:22) Future Directions & Needs
(01:53:22) Future House: What's Next?
(01:55:32) Outro

SOCIAL LINKS:
Website: https://www.cognitiverevolutio...
Twitter (Podcast): https://x.com/cogrev_podcast
Twitter (Nathan): https://x.com/labenz
LinkedIn: https://www.linkedin.com/in/na...
Youtube: https://www.youtube.com/@Cogni...
Apple: https://podcasts.apple.com/de/...
Spotify: https://open.spotify.com/show/...


Full Transcript

Transcript

Andrew White (0:00) You will never be able to reduce biology to, like, these cartoon diagrams. There is, like, complexity at every single level, and it always plays a role. So I think it's a domain which is driven by observations and empirical measurements, rather than a domain that's driven by, like, some sort of virtual in silico model that you can drive. I think it sort of stops this, I don't know, ASI or AGI hypothesis that, like, a model that's so intelligent could just wake up one day and know how to cure cancer by just thinking through it. Sometimes it seems like we can solve the problems with empirical, like, machine learning models, sometimes it looks like we can solve them with first-principles methods, and I don't know. The only thing that does work 100% of the time is measuring it in the lab, and maybe all these methods are just approximations, and all they can do is increase your hit rate or decrease the number of experiments you have to do in a loop in the lab. Future House is really Sam Rodriques's brainchild, I think. Sam came up with this idea of a future house, which is basically like an FRO. Like, we want to be at the scale of that level, like $20 to $50 million and, like, a few-year time scale, but it's not like a five-year goal. It's like a moonshot project that you may not accomplish in five years, or maybe you will know if you can accomplish it in five years, and then you'll need to go get more money to do it, or it'll be commercializable at that time.

Nathan Labenz (1:13) Hello, and welcome back to The Cognitive Revolution. Today, I am excited to share my conversation with Andrew White, professor of chemical engineering at the University of Rochester, and now cofounder and head of science at Future House, an Eric Schmidt-backed focused research organization that's building increasingly autonomous AI systems to accelerate scientific discovery. We begin by briefly discussing Andrew's background in statistical mechanics and molecular simulation, how his AI journey began during a 2019 sabbatical, how he came to write a textbook on deep learning for molecules and materials, and ultimately his involvement with OpenAI's GPT-4 red team in 2022, which is where we had first crossed paths. From there, we unpack two of Future House's major recent releases, PaperQA and Aviary. PaperQA is a question answering framework that works across entire bodies of scientific literature using a mix of techniques, including keyword expansion, full text search, contextual summarization, and large language model powered relevance filtering, to achieve superhuman performance on question answering, contradiction detection, and Wikipedia-style citation-supported topic summary writing. Here, Andrew emphasized Future House's philosophy of optimizing for results rather than efficiency. They are willing to spend whatever compute or token budget is required, and to wait for however many seconds are required, to achieve the best possible output. This quality-first approach, as you'll hear, is one that I think a great many AI builders should take inspiration from. Aviary, meanwhile, Future House describes as a gymnasium framework for training language model agents on constructive tasks.
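For intuition, the retrieval loop described above (keyword expansion, full text search, contextual summarization, and LLM-based relevance filtering) can be sketched in a few lines of Python. This is an illustrative toy, not Future House's actual PaperQA implementation: every function name here is hypothetical, and the language model is abstracted as a plain callable.

```python
# Toy sketch of a PaperQA-style pipeline (illustrative only, not the real PaperQA).
# `llm` is any callable: prompt string in, completion string out.

def expand_keywords(question, llm):
    """Keyword expansion: ask the model for extra search terms."""
    extra = llm(f"List comma-separated search keywords for: {question}")
    return [question] + [k.strip() for k in extra.split(",")]

def full_text_search(corpus, queries):
    """Naive full-text search: keep documents matching any query term."""
    return [doc for doc in corpus
            if any(q.lower() in doc.lower() for q in queries)]

def summarize_and_score(question, doc, llm):
    """Contextual summarization plus a crude relevance score for filtering."""
    summary = llm(f"Summarize, as it relates to '{question}': {doc}")
    score = len(set(question.lower().split()) & set(doc.lower().split()))
    return summary, score

def answer(question, corpus, llm, top_k=2):
    """Full loop: expand, search, summarize, filter, then answer from context."""
    queries = expand_keywords(question, llm)
    hits = full_text_search(corpus, queries)
    scored = sorted((summarize_and_score(question, d, llm) for d in hits),
                    key=lambda pair: -pair[1])[:top_k]
    context = "\n".join(summary for summary, _ in scored)
    return llm(f"Using only this context:\n{context}\nAnswer: {question}")
```

Because the model is just a callable, the same control flow works with any backend; the real system layers much more sophistication (citation handling, contradiction detection) on top of a loop of this general shape.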
In addition to creating conceptual clarity by distinguishing between agents, which contain core models and memory, and their environments, which provide tools and interfaces, this project introduces an interesting representation of agent systems as stochastic computation graphs, and, very interestingly, shows how agent systems can be trained end to end even when black box commercial models are used at key nodes. I found Andrew to be so thoughtful in his responses, and he was sufficiently generous with his time, that I took the opportunity to ask a bunch of related questions along the way as well. One answer that continues to rattle around in my head was Andrew's argument that because better conceptual frameworks and automation platforms are quickly reducing the cost of real world experimental work, perhaps machine learning models' ability to run experiments in silico will ultimately prove less transformative than I had been expecting. There are, as you'll hear, a bunch more. And this was, believe it or not, Andrew's first ever podcast appearance. It took me months of friendly persistence to make it happen, but I think you'll agree that his skill as a scientific communicator is excellent, and he really should do a lot more of these going forward. If you're finding value in the show and want to help us spotlight more unassuming AI thought leaders, we'd appreciate it if you take a moment to share this episode with friends, write a review on Apple Podcasts or Spotify, or just leave us a comment on YouTube. Your feedback and suggestions, including for more pioneers of AI for science, are welcome too. You can contact us via our website, cognitiverevolution.ai, or feel free to DM me on your favorite social network. With that, I hope you enjoy this deep dive into frontier applications of large language models for scientific research with Andrew White of Future House. Andrew White, cofounder and head of science at Future House. Welcome to The Cognitive Revolution.
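The agent/environment split described above can be pictured with a gymnasium-style interface: the environment owns the tools and task state, while the agent owns the model and its memory. The sketch below is a hypothetical illustration of that division of labor, not Aviary's actual API; all class and method names are made up for the example.

```python
# Hypothetical gymnasium-style loop for a language agent
# (illustrates the agent/environment split; not Aviary's real API).

class ToyLiteratureEnv:
    """The environment owns tools and task state."""
    def reset(self):
        self.turns = 0
        return "task: summarize the paper"      # initial observation

    def step(self, action):
        self.turns += 1
        done = action == "submit"               # episode ends on submit
        reward = 1.0 if done else 0.0           # sparse terminal reward
        return f"tool output for {action}", reward, done

class ToyAgent:
    """The agent owns the policy (here a trivial script) and its memory."""
    def __init__(self):
        self.memory = []

    def act(self, observation):
        self.memory.append(observation)
        # Search twice, then submit; a real agent would call an LLM here.
        return "search" if len(self.memory) < 3 else "submit"

def rollout(agent, env):
    """Standard gym-style interaction loop."""
    obs, total, done = env.reset(), 0.0, False
    while not done:
        obs, reward, done = env.step(agent.act(obs))
        total += reward
    return total, env.turns
```

Keeping the boundary this clean is what lets the same agent be dropped into different environments, and what makes the whole interaction loop something you can optimize end to end.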

Andrew White (4:42) Happy to be here.

Nathan Labenz (4:44) I'm really excited for this. This is one I've been, you know, begging and pestering you to do for a long time, and I'm honored that this is your first ever appearance on a podcast. So a lot to dig into, and I think it's gonna be a lot of fun.

Andrew White (4:56) Awesome. Thanks.

Nathan Labenz (4:58) So we first met actually on the GPT-4 red team, which is now more than two years ago. And I usually don't do people's backstories too much, because so many of them are "ChatGPT came out, I knew it was gonna be a big deal," especially in the entrepreneurship space these days. But being part of that GPT-4 red team clearly means you were on this earlier than most. And in looking into your background, I see a chemical engineering background. I don't see, at least on the level of, like, LinkedIn, as much of an AI story. So I'd love to hear kind of your general story: what was it that helped you catch the wave early when you did, and how did you find yourself getting involved with the GPT-4 red team?

Andrew White (5:40) Yeah, so, I mean, it goes back for a while. So my training, my PhD, my postdoc, is in a field called statistical mechanics. And statistical mechanics is like a branch of thermodynamics, but it's basically statistics of systems with extremely high degrees of freedom. So it's usually simulations of gases, liquids, proteins, things like that, or quantum systems. And so I've been working on this for a while, and then maybe in 2019, I got invited to UCLA for sabbatical. It's something called the Institute for Pure and Applied Mathematics; this is called IPAM. It's like an applied math group, and they wanted to do a topic on machine learning and physical sciences. So I went out there and I met with, you know, a lot of people were there. Yann LeCun was there. I think Yoshua Bengio stopped by for a talk. There was a guy named Pat Riley, who has actually just started a new company, but he was leading part of Google Accelerated Science for a long time. He did a lot of their fusion research, DNA-encoded libraries. It was a really great group of people, and we were all trying to figure out how these systems could be applied. At the time, it was pretty classical sort of machine learning. You do, like, feature engineering and fit them. And there's one guy, what's his name? Matthias Rupp was there, who had basically worked on this with, oh yeah, Anatole von Lilienfeld. He and Matthias had built this system that was already state of the art for predicting energies of small molecules, and so people knew this kind of thing was going to take off in chemistry and physics as well. I think people were really excited about it. So then, I didn't really understand the field that well. I went on the sabbatical and I went to a meeting in Tokyo on materials science. Sergei Kalinin was there, who's a big name in AI materials, and Lee Cronin was there, who's a big name in chemistry and origin of life and machine learning now. And it was a great meeting.
I got to learn a lot about how these people were thinking about it. And Lee is a very controversial figure in chemistry, not because of anything he's done wrong, but basically because of his very strong opinions, and I think it was really cool to talk to him. He had some very strong opinions, and Sergei did as well, and it was great to learn about their thinking in the field. When I came back from all this, this trip to Tokyo and the sabbatical, I ended up sitting down and writing a textbook called Deep Learning for Molecules and Materials, and it was kind of like course notes slash textbook. I wanted to learn to use this new Jupyter Book, which is like an executable book. So I wrote this out, and it's been pretty popular since then, but I started teaching the class and learning more about it. And I started using language models pretty early. There was a paper by Gerbrand Ceder's group and Kristin Persson, and they had done this work on, like, word2vec and materials science. And that paper just blew my mind, that you could basically embed the representation of a material using natural language as opposed to trying to, like, come up with the right features. And so I started working in language and chemistry, and I worked with Glen Hocky, who's at NYU, and basically we were exploring, can you do things like drive VMD, which is a molecular dynamics visualization engine, with voice recognition. And we were using what's called Codex. Was that the original OpenAI language model? Was it Codex?

Nathan Labenz (8:36) That was their code 1, as I recall. They had GPT 2, 3 That's right. Codexes right in that time frame.

Andrew White (8:45) Yeah. I think that's right. So I think they came out with davinci in maybe January or February, and then they came out with Codex a few months later. And so we were driving MD, like the dynamics simulations, with this, which was really cool, because VMD is written in this obscure language called Tcl. I mean, I don't know, I think it's obscure. Maybe some people think it's bread and butter, but no one knew how to program in Tcl, so every time we had to write scripts for doing MD analysis, molecular dynamics analysis, we'd basically Google a whole bunch, copy and paste code, email friends. And so to have this language model that could just write the code a priori was super cool. We wrote like a position paper. Basically, we wrote like a two or three page paper that was published in Digital Discovery, which is like a pretty cutting edge journal started a few years ago by Alán Aspuru-Guzik, who's right now, like, just a huge towering figure in AI materials. And we wrote this article about how we think the field could change with language models, and, you know, I put it out on Twitter, and at the time I was starting to use Twitter more and that was becoming a very exciting area. Some people at OpenAI were, like, trying to think about chemical, biological, radiological, nuclear safety. So CBRN is a word that people throw around a lot in AI safety. And not just AI safety, this is like a topic of terrorism and conventional warfare. So they were curious about CBRN risk, and they sort of were looking at who was working on language models and chemistry. It was a very short list of people, and they got my name and they got Glen's name, and they asked us to participate. And this was in August 2022? Yeah, August 2022. So I started working on the red team there, and, I mean, it was all, like, nobody knew what they were doing. It was, like, completely new ground. Like, I remember, like, just trying random stuff, and it was a really exciting time.
And I didn't know what I was doing really with language models and chemistry until a few, like, really, I think, groundbreaking papers came out, like the MRKL paper and the ReAct paper. I think that really changed the perspective of what you could do. But originally, back in August and September, like, I was just, like, really confused on how these models could be useful in chemistry, but it was still, like, pretty cool to try a bunch of random ideas. And so that's how I kind of got into it. Happy to go more into this story, but that's sort of how I got to the red team point.

Nathan Labenz (10:55) This is a little bit of a sidebar from our main topic, but I kinda came away from that red team experience, I would say, kind of alarmed. I was like, man, the pace at which these systems are getting better is really incredible. And I think now, looking back, I sort of infer that GPT-4 was more of a, like, maximalist moonshot attempt than I maybe understood it to be at the time. But, nevertheless, the leap from GPT-3, and even the instruct series, all that stuff, to 4 was just so eye opening. Then I was like, but the control measures are not keeping up. At least they did not seem to be keeping up at the time of that thing, and so I got a little freaked out there. Did you handle it with more equanimity than I did, or were you also a little freaked out?

Andrew White (11:43) I think I got a little bit spooked at first, because it was the first time you could really see coherent, responsive answers. So I remember trying things like, you know, let's make nerve gas, like here's step 1, here's step 2, and I would ask it how do you synthesize these things, and it looked really convincing at first. And it was like, really, like, oh my gosh, this is going to change how we think about deploying models. But then I started drawing up the chemical structures and looking to trace the atoms and how they move from step to step, and I realized that it was all hallucinated. It actually had no conception of how to move atoms between molecules. And so it was like a strange sort of thing, where at first I thought that it was really technically brilliant, and then I realized it was actually, like, really good at bullshitting, I guess. And it's a strange thing, because you, like, don't know if it's good or if it's bad, and the reasoning that it gives is actually quite plausible. And so for the first, like, month, I was like, this is crazy. And then for the next two months, like maybe October and then early November, I was like, there's nothing to this. Like, it's just gonna be really cool, a nice incremental improvement. But then I started hooking up tools to the system. I think in, like, late November, early December, I was starting to use some of the ideas that, like, LangChain was showing with this stuff. I think it was around when the ChatGPT API was released. So people were showing that you could, like, sort of do these sort of ReAct calls. Maybe that wasn't... no, I don't know if ChatGPT was available then, but, like, people were still doing it with davinci at the time. And then I started hooking it up to tools. And, again, after that happened, I was like, oh my gosh, actually, this is maybe a big deal.
And so that led to ChemCrow eventually. But I think, yeah, near the end of December, what was surprising to me was, like, (a) the mitigations that they were trying to build had no influence whatsoever once you started doing tool calling, and then (b) that you could really make a lot of good progress if you sort of put the model on these rails of only using tools, as opposed to just sort of free form doing chemistry by editing the molecule.

Nathan Labenz (13:40) It's funny you talk about the hallucination in chemistry in particular. I did an experiment with it, asking it to be my chemistry tutor, and it just kept getting the stoichiometry wrong, and it could not catch it; when I got confused, it would get confused, and then we were just off on a mess of confusion together. And I often think back on that in particular as a good indicator of how much progress there has been since then, even though, you know, we haven't yet seen the next 100x compute scale up. It's like, these days, it makes a pretty passable chemistry tutor, and they've really ironed out a lot of those behavioral weaknesses to get it to actually do useful stuff in obviously a lot more scenarios.

Andrew White (14:26) Yeah. I think what's surprising about these models is that they learn the fundamentals actually quite well. Like, if you ask it to reason through, like, what the rules are for this chemical reaction, if you ask it to reason about, like, this molecule's properties, that actually is quite good in general. What's strange is it misses things like counting the electrons, and it misses things like counting the atoms and stoichiometry. The fascinating thing is that it learns backwards from how you would expect when having machines learn to do, you know, STEM problems: it seems to struggle with some of the more, like, robotic steps, and it seems to do quite well with some of the more freeform steps. And so it's one of those situations where you're going to see, I think, like, a grokking effect. Like, once it can count the atoms and once it can count the electrons really well, then you're going to see a big unlock, because it's not going to need any more additional reasoning things. So that's why I'm pretty optimistic about things like o1 and some of the more rumored models that are coming up after that. I think they could actually sort of unlock chemistry to a point where you don't need to be on rails with these tools moving the molecules around.

Nathan Labenz (15:33) How would you describe o1? Because you're still involved in the predeployment testing with OpenAI, right? So I imagine you did another kind of deep dive on, like, can this thing give an accurate nerve gas synthesis plan, and whatnot?

Andrew White (15:47) Yeah. I would say that, like, so I was on the o1 technical report, and, yeah, this time around, I think things have gotten a lot more systematic, so people have a much better idea of, like, where the benchmarks are. Still, for me, we decided to do some more exploratory stuff during the red teaming, because of course there's already benchmarks now, so I don't think we need to be running the benchmarks. I mean, we put out LAB-Bench from Future House, which is a benchmark of biological and laboratory protocols, and that was tested on o1. And so we were trying some really, I think, more crazy ideas in the red teaming. And this time around we got pretty far, but it still got hung up on just some key steps. And so it's like a very strange experience, because we, I think, had some pretty wild ideas, and the models got really close to completing them, but it gets stuck on, like, basically one step in the process. And one thing that we didn't really get a chance to try this time around is consensus. So if you look at these evals, like, if you run k equals 32, like, you run it 32 times and take a majority vote, that can actually sort of get over these little missteps. And so I think, like, with the model that's available today, I don't think we're getting to that point, but I think we're very close. So I think the sort of next release is actually going to sort of allow us to do this synthesis planning, protocol planning. And I think OpenAI, like, addressed this a little bit. In their report they showed that on the laboratory benchmark, it's called ProtocolQA I think, which was from Future House, o1 exceeds human levels. ProtocolQA is like an assessment of laboratory plans in the domain of biology, and o1 is basically above human level there. So I think that sort of shows you that it's getting to this ability to long term plan and not make these kind of trivial mistakes that get it caught up. But I think in chemistry specifically, it's still not there.
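The "k equals 32" consensus idea Andrew mentions is simple to sketch: sample the model many times and take the majority answer, so that a single misstep on one sample gets outvoted. Below is a minimal, hypothetical version (the actual evals are of course more involved, and `ask_model` here stands in for any sampled model endpoint):

```python
from collections import Counter

def consensus_answer(ask_model, question, k=32):
    """Sample the model k times and return the majority-vote answer,
    along with the fraction of samples that agreed with it."""
    answers = [ask_model(question) for _ in range(k)]
    best, count = Counter(answers).most_common(1)[0]
    return best, count / k
```

Because `ask_model` is just a callable, this wraps any stochastic model. Note that the vote only helps when errors are scattered across samples rather than systematic: a mistake the model makes every time will win the vote too.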
Biology, I think it's actually getting very close, which is strange because usually biological sequences are very long and so, like, they don't fit in the context window. But a lot of the sort of ancillary things about biology, like how do you do cloning, like what are the steps to get to this process, those things I think are very close.

Nathan Labenz (17:55) Hey. We'll continue our interview in a moment after a word from our sponsors. Does that reflect a sort of structure of biology? I'm kind of thinking of Dario's essay recently on what drives progress. I mean, of course, it's on the big picture of AI, but maybe we'll get to the big picture in a second. But his account of what drives progress in biology and medicine is basically that there's a small number of platform technologies that you sort of understand at a conceptual level and use on your problem. Versus, I was actually a chemistry undergrad myself, and my experience of chemistry was like, there's a ton of different things and a ton of different reactions, and, like, a few sort of platform measurement technologies, yes, but in terms of, like, action steps, not so many platform technologies, I guess, is how I would describe it.

Andrew White (18:46) Yeah. I think, I mean, Future House was built to be like a moonshot on automating science. And we picked biology for a couple reasons, similar to what Dario talked about. One of the reasons is that, yeah, it's a lot more platform based. So, like, let's say you want to design proteins. There is a standard way to make a protein. Like, you can use cloning, or you can do cell free protein synthesis, or you can make it on a machine, but basically there's no stop to be like, can we make this protein? Whereas you go to chemistry, every single molecule is very bespoke to make, and there's a lot of questions about whether you can make it at scale, and it's not like a trivial step. So I think in some ways biology is great because it is kind of like a platform. Like, sequencing is almost free and synthesis is quite cheap. You can basically make a 100 amino acid protein; you can order the gene to make that for like $25. So I think biology is a very interesting topic because (a) it's pretty clear how to test hypotheses, and it's pretty cheap to test hypotheses, and (b) I think what's unique about biology is that there is no limit to intellectual tasks. Basically, there's always going to be another organism genome to explore and annotate all the functions. There's always going to be some new protein that does something unusual. There's always going to be more dark metagenomic data where we don't know what this genetic material is. So in that sense, there's actually sort of an unlimited amount of complexity to explore with whatever method you want to use. Whereas if you go to something like physics, I mean, there are, like, I think, a lot of interesting topics in physics, but in some sense it's pretty expensive to get data in physics, and it's also like there's not a ton of complexity to, like, elucidate. Like, a lot of the time it's a very reductionist thing in physics. You're trying to, like, reduce it down to some equations or some relationships.
Whereas in biology there's just so many things to annotate. In biology, we kind of already know the universal law. I mean, it's evolution. And you go to physics, and it's like trying to find the law. And so what's cool about biology is that you kind of already know the reductionist point of view, and you're trying to, like, look at the more complex systems and understand how they work and how they fit together. So I think it's a very cool topic for those two reasons.

Nathan Labenz (20:55) That's a really interesting comment, that you already know the reductionist point of view. How likely do you think it is that there are, like, major aspects of the full biological paradigm that we just have very little grasp on right now? I'm thinking, for example, of Michael Levin, who's been a past guest and has this idea that electrophysiological or bioelectric signals at, sort of, tissue levels are an interface that probably will be programmable. He's basically on a quest to kind of turn that into another one of these platform technologies, and yet he seems to be, of course, there are other people in that general space, but, like, relative to the rest of biology, it seems like it's quite a niche. So I wonder how many of those other things you think, like, what the sort of dark conceptual matter might still be out there.

Andrew White (21:44) Yeah, yeah. We were talking to Ed Boyden about, like, optogenetics, and he was saying, like, how did people discover... I think his lab was one of the pioneers there. How did you guys come up with this stuff? He basically said that they made what he calls a tiling tree. Basically, you know, he wrote down the goal that he wanted, and he wrote down, like, all the ways that they could try to attack it, and then they, like, sort of cut up the space of ideas by, like, okay, we could deliver something to the brain, or we could deliver a signal to the brain, or we could deliver, you know, via surgery, something physical to the brain. And it's like, okay, what kind of ways can we send information to tissue? We could try magnetic waves, we could try radio waves, we could try light that penetrates tissue, right? So you sort of, like, go down these paths, and it's, yeah, I think it's a field where I don't think we're going to run out of ideas ever. Like, there's always interesting topics to explore. I think that's sort of the thing about biology, is that you just have all these very complex systems with all these different interactions, and it's a very dense topic. I think I had an argument with somebody about, how do you ever get to the bottom of knowing a protein, right? People would say, well, once you have the crystal structure, you know it. Then it's like, okay, well, now you need the statistical ensemble of the protein, right? And then it's like, okay, but what about post translational modification? Okay, what about things like the reactions in the protein? There can be these methylations, or there can be, like, these, whatever, chemical modifications to the protein. And then there can be things like, okay, what if it has, like, a water wire that sends protons to the active site, right? You've got to model that, because the diffusivity of a proton by a water wire is different than the diffusivity of a proton by hopping on a hydronium ion.
And it's basically just like, you will never be able to reduce biology to these cartoon diagrams. There is, like, complexity at every single level, and it always plays a role. So I think it's a domain which is driven by observations and empirical measurements, rather than a domain that's driven by, like, some sort of virtual in silico model that you can drive. And so I think it's going to be a domain which, I think, sort of stops this, I don't know, ASI or AGI hypothesis, that, like, a model that's so intelligent it could just wake up one day and know how to cure cancer by just thinking through it. It's really, like, a domain where you have to get out and you have to measure things repeatably and get into this loop. And being super intelligent may not actually scale as much as just having a lot of good hypotheses that you can test. Anyway, so I don't know if that answered the question, but I was just...

Nathan Labenz (24:08) Okay. There's a couple provocative ideas there that I want to come back to, but let's... you've mentioned Future House a couple of times, and a couple of the projects that you guys have put out. Let's just make sure we set the stage appropriately with: what is Future House, how did you get hooked up with Eric Schmidt, who I understand is one of, if not your principal backer, and what's kind of the big audacious goal that you guys are chasing?

Andrew White (24:30) Yeah. So, yeah, how did we start Future House? So I think, to be honest, Future House is really Sam Rodriques's brainchild, I think. So Sam, who's the CEO, he and I are the co-founders of Future House. Sam had this vision of alternative mechanisms to do science. So Sam is the first one to sort of write down this idea of focused research organizations, and then he worked with Tom Kalil and Adam Marblestone, and they started Convergent Research, which basically has been launching these focused research organizations. And these are like five year non-profits that are funded, like, between $20 and $50 million, that are built to just answer one very specific question that is not being funded in academia because it's, like, too big of a question, but not being funded in industry because it's, like, not a commercializable question. So, like, one example is making new model organisms, or, like, understanding, building the connectome for the brain, or building technology to do the connectome for the brain. There's another one that's, like, doing Lean, like, trying to build all the documentation and scaffolding for Lean so that it can take off as, like, you know, the programming language of mathematics. But Sam sort of kept thinking down this path, and he came up with this idea of a future house, which is basically like an FRO. Like, we want to be at the scale of that level, like, you know, between $20 and $50 million, and, like, a few year time scale. But it's not like a five year goal. It's like a moonshot project that you may not accomplish in five years, or maybe you will know if you can accomplish it in five years, and then you'll need to go get more money to do it, or it'll be commercializable at that time. And so that's what Sam was sort of thinking about. I happened, at this time, to have, you know, finished ChemCrow; we had explored what can be possible with LLMs, and we were exploring more stuff with combining, like, RAG and tool calling and language models put together.
And so I told Sam, why don't you build it around AI, like automating science, right? Sam had been exploring ideas of new microscopy technologies, new ideas in sequencing, or understanding the brain, but I said you should focus on automating science. And so Sam and I sort of brought it together: he brought, I think, the model, and I brought what could be the topic, and we pitched it to Eric, and he was very excited about the idea and he liked the topic. And Eric for the last few years has been really excited about AI, so I think it really fit into his thinking of what the future holds. And so then we basically put together this organization from scratch. I mean, there have been organizations like us: Arc is one example, or Altos or Arcadia. These are all sort of non-traditional research organizations. I think the real difference between them and us is that we are mission focused, whereas I think Arc is maybe more of a traditional set of PIs with their own ideas. We are focused on one mission, and one of the cool things about having a mission is that you can be a nonprofit. If you are a nonprofit and you don't have a mission, it gets very confusing about what you're trying to do. Like how do you make decisions about what to buy, what personnel to hire, right? Because if you're not trying to make a profit, it's unclear what you're trying to optimize for. We have a very clear mission, and it's an impossible, audacious mission, so we're never going to run out of things to do on this mission, and it helps guide our direction. So anyway, that's kind of how the idea got started.
There have been lots of things we've done along the way, but we've sort of been around this nucleus of trying to automate science. And yeah, Eric has been the primary backer, but you know, we've raised funding from other sources, we've gotten grants from organizations, and so I think we're trying to go bigger in the next year or so in the scale of what we're trying to do.

Nathan Labenz (28:06) Cool. So I'll just read your mission statement; I copied it right off the website: "Our 10-year mission is to build semi-autonomous AIs that can scale scientific research to accelerate the pace of discovery and to provide world-class access to cutting-edge scientific, medical, and engineering expertise." So that is pretty similar to what I hear when I read Dario's Machines of Loving Grace vision. The "semi-autonomous" is definitely an interesting callout there. In his vision, it seems like they're sort of, on a 10-year timescale, likely to become fully autonomous. He sort of envisions them being like the PIs, and, you know, the humans and maybe some other AIs are more like the helpers. So in what way do you envision, even in the long term of AI, these things being not fully autonomous?

Andrew White (28:59) Yeah. I guess there are two wrinkles here. One wrinkle is that I'm actually pretty bearish on laboratory robotics. You know, we may get there, but there have been so many companies, in biotech especially and I think in AI, that tried to swallow the mission of automating the physical world and just died because of it. And I don't want to say Emerald Cloud Lab or Ginkgo Bioworks have died; they've made great progress and done great things, but this commitment to automation is a huge challenge. We can only solve so many grand challenges at Future House, so we decided that we're not trying to automate the lab. So one difference from a fully autonomous AI system that can do science is that such a system is going to have its own lab, its own robots; it can do all the protocols there. We're not trying to solve that part of the scientific equation. If somebody solves laboratory robotics tomorrow, great. We'll happily buy some, sign a partnership, whatever. But we're not trying to automate inside the lab. We do use standard lab automation: we have robot arms, we have liquid handlers, we have acoustic liquid handlers. We do all the regular automation you'd see, but we don't have it as part of the mission, because we don't want to get stuck on that aspect. And then another thing that differentiates us is I don't think we're going to have a system in 10 years where we can just say "explore the metagenome," and then we come back in a year and it will have annotated everything interesting in all of the genetic information on the planet. I think it's more likely that we have a disease, a patient population, and we ask: what could be the biological mechanism for this disease? Okay, here are the biological mechanisms. Which one of these could we target with a biologic?
Which one can we target with some molecule? What's a good starting point? So I think it's going to be semi-autonomous in the sense that we're going to have a very clear quest and a very clear set of parameters, and we're going to iterate with the system. And this gets back to the idea that biology is really an observational, empirical-data-limited field: it's very unlikely we'll have something like you might have in mathematics, where you can just send it off to do math and you'll just be getting theorems or proofs, I don't know, once a quarter you get a dump of some. I think that's feasible in something like math, where there is no empirical limitation. In biology, you're not going to just take some system that says "do these 36 mouse measurements," and you're not going to go order a mouse from a CRO and go do the experiment that way. It's always got to be, I think, done hand in hand with humans.

Nathan Labenz (31:25) Just as a brief aside on these companies like Emerald Cloud Lab and Ginkgo Bioworks that you mentioned that are doing this lab automation, my understanding right now is that they have kind of, like, not exactly human in the loop, but sort of, like, humans fulfilling some of the tasks. So they present to you a programmable interface, but behind the scenes, there's, like, a mix of robots and people, you know, running certain parts that are just

Andrew White (31:52) That's supposed to be a secret.

Nathan Labenz (31:53) I don't know. Is that Yes. I think somebody might have said it on the podcast. I won't name any names, but I don't wanna get into being in trouble. But I think the secret is at least somewhat out. So

Andrew White (32:04) Yeah. I mean, like, I don't know. I guess the thing is that some people call them biorobots or wet robots, which is when you have a human do the tasks that are really hard to automate, and there's no reason to automate them because it takes a human, whatever, 2 minutes. So yeah, I'm not really a purist in this sense. I don't actually care or know when something is no-human-allowed. I think there's a term, "lights-out automation": if you turn off all the lights in the building, is it still working? If so, then humans aren't in the loop. So yeah, you're right that Ginkgo and Emerald, they'll automate whatever. It'll get to 98%, and that's, you know, as far as you want to go to get to the scale you want. I think there's another one, what's it called, Medra, I think, is a company that's trying to build human-gripper-style lab arms that move around the lab. So then rather than having to build these bespoke instruments like ECL and Ginkgo Bioworks build, you go to a regular lab and you just put in a bunch of robot arms and they move the stuff around. I think that's another path which can get there. But yeah, you're right, they are not 100% automated yet, and it's unlikely that they will be 100% automated for a long time, because just getting that last step is a bunch of work for very little gain as far as throughput goes.

Nathan Labenz (33:20) Hey. We'll continue our interview in a moment after a word from our sponsors. These are the human jobs of the future: our comparative advantage is in manipulating some annoying thing in a huge value chain that we don't understand, but taking just, you know, Chinese-room-style instructions of "walk this thing over to this machine, make sure the gunk at the bottom is tapped out." It could be quite a funny vision. So, okay, that was one quick follow-up. The other one, in terms of a clarifying follow-up: I get the distinction between, like, "go off and analyze biology" versus "we have one quest," at the level of, like, it can't just be purely internal thought. Like, you can't o1 your way into knowing what actually happens in a biological system. I get that. Is there another distinction there that you're trying to draw? Because you could still say "go off and explore biology and come back" if it had tools, right, or the ability to make these Emerald Cloud Lab API calls.

Andrew White (34:25) Yeah, I mean, you're right. You can imagine a world in which you have, I don't know, a "science dojo" or something, where you have a bunch of tools in it, and one of the tools is "go run laboratory experiments." Maybe when you call that tool it's dispatched to a person who's got to go figure out how to measure that, right? And so you can get to the point where you're really trying to automate the scientific method there. And yeah, that's a great idea. I'm hoping we'll have more to share on that topic soon, but yeah, I think you're right. This is a completely valid path. Whether that's semi-autonomous or autonomous, I think, is a minor point of contention. But I think it may be possible that this is something where you can just have very good tools, and you can slot a machine in there, and it can do well. One thing I always think about, though, is that we've been building these interfaces for these LLMs where you basically have a big part of drug discovery wrapped up in a set of tools. You have a bunch of literature research wrapped up in a set of tools. You make everything on rails through these language models, and you put, say, o1 in there and say, okay, now use these tools to discover something new. I always wonder: what if you put a PhD student in there, or a gig worker from Mechanical Turk? How well would humans do in this setting? Because it's something we've never really tried before; nobody has ever tried to make a unified API for doing science and then put a human in the loop there. So I think we still have to see how good humans are at this task. I think it's a very interesting new setting where you have actually reduced accomplishing science to a set of, whatever, 25 tools that can be called programmatically. What if you put a person on the other end of it rather than a language model? Maybe this is the way of doing science.
Or maybe humans are bad at it, and then the models will be bad at it, and it's just not a way of doing science. So anyway, I don't know. I think you're right. That is a thing to explore, and we are trying to explore it.
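As a thought experiment, the "unified API for doing science" Andrew describes, a fixed set of tools callable the same way whether the caller is a language model, a PhD student, or a gig worker, can be sketched in a few lines. Every name here (the tools, their behavior) is a hypothetical illustration, not Future House code:

```python
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass
class Tool:
    """One entry in a hypothetical 'unified science API'."""
    name: str
    description: str
    # Same call signature whether the tool is backed by software,
    # a lab robot, or a human technician who fulfills the request.
    run: Callable[[str], str]

def make_registry() -> Dict[str, Tool]:
    # Purely illustrative tools; a real registry might wrap literature
    # search, assay dispatch, analysis code, etc.
    return {
        "search_literature": Tool(
            "search_literature",
            "Query papers and return cited snippets",
            run=lambda q: f"[snippets for: {q}]",
        ),
        "run_assay": Tool(
            "run_assay",
            "Dispatch a wet-lab measurement; may be fulfilled by a robot or a person",
            run=lambda spec: f"[queued assay: {spec}]",
        ),
    }

def call_tool(registry: Dict[str, Tool], name: str, arg: str) -> str:
    # The caller (LLM agent or human) only ever sees this interface;
    # who or what fulfills the call is hidden behind it.
    return registry[name].run(arg)

registry = make_registry()
print(call_tool(registry, "search_literature", "GLP-1 receptor agonists"))
```

The point of the sketch is the interface, not the stubs: because `call_tool` is the only surface, you could swap an o1-style agent for a Mechanical Turk worker behind the same 25 tools and compare them directly, which is exactly the experiment Andrew is describing.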

Nathan Labenz (36:11) Yeah. That's quite interesting. I guess maybe before going into more depth on your specific projects, and we do have plenty of time and definitely wanna get into those in technical detail because our audience loves the nitty-gritty stuff: one of the things that you said that caught my ear, and I wanted to make sure I understood, was something to the effect of, for biology, you can't just, like, run in silico experiments and go from there to real understanding. My working mental model of how this stuff is gonna work, with, you know, ever-better AlphaFold and ESM 3 and 4 (and I wanna ask you actually which of those things you're most excited about, and what you see as the most important trends there), has been: you do the sort of experiment in silico, and you measure the value of that by how often it is in fact predictive. Of course, you still have to go do the wet work to do the validation. But at some point, you might have a 10 or 100 times higher hit rate than we used to, because you can do these sort of "yes, this seems close" checks; you almost need a classifier on top, like, is it worth actually going and doing the real-world experiment to validate? And it seems like there's potentially orders-of-magnitude improvement there in terms of how much value we can get out of existing capacity for these real-world experiments. Is that your mental model too, or do you see it very differently?

Andrew White (37:50) Yeah. This is something I haven't really figured out for myself yet. Let me tell you about a few things in this domain that I think people miss. For example, right now you can do very accurate free energy calculations to see if a small molecule will bind to a protein, and the cloud compute cost for that is maybe $10 or $15. And the cost of making organic molecules and testing them against proteins has also come down. So the cost of the equivalent experiment is maybe $20 or something; you could probably get it down to like $5, depending on the protein and what your catalog of molecules is. So right now you're at this point where actually both the cost of the chemistry and biology and the cost of the calculations have decreased at the same rate. Now, you would probably bet long term on Moore's law or on better machine learning models, and that might be true. So we may be getting near a crossover point where we really do start replacing a big part of it with these in silico models. But that's one example, small-molecule binding affinity, where I think they're close and both costs have been dropping. If we zoom out a little bit, there was this idea maybe 15 or 20 years ago that we could solve protein folding, and a lot of biology, by doing molecular dynamics. And there was this great effort led by D. E. Shaw Research, where they basically put together some of the smartest people: hardware engineers, programmers, chemists, biologists. They put them all together, in Times Square actually, in a very nice building. And they tried to go down this path of building the chips and the data center, from the atom all the way up to the macro scale, to simulate proteins as fast as possible. And they simulated protein folding; you know, they were getting to milliseconds of protein simulation time.
And you can make these charts: basically, okay, by 2030 we can simulate organelles, by 2050 we can simulate whole cells, we can simulate a whole cell division cycle in an afternoon, and then we can basically run these as virtual cell models. The only problem is, molecular dynamics doesn't do chemical reactions, and it turns out that a lot of biology is acid-base chemistry. A lot of biology is actually ATP/ADP and NAD+/NADH. So you can't just model a cell by showing where all the atoms are and how they're moving; you actually have to model the chemical reactions. And so then people have built these reactive force fields, including machine-learned reactive force fields: maybe we can model how these things work, not from first principles but empirically. And then you find out that actually the way a proton moves through water is a quantum effect. You can't model that with just classical or machine-learned force fields; you actually have to have these electron density calculations, and you have to think about the Born-Oppenheimer approximation to model the system, and that actually matters: biology works differently if you don't have this effect. And so you never reach the floor of complexity when you try to model these systems, and it's a really hard thing for me to figure out what the answer to your question is. Yes, AlphaFold has solved a lot of these things, but there's a big part of proteins that are intrinsically disordered. And yes, you can then maybe capture another 80% of those with something like CALVADOS, this force field from Kresten Lindorff-Larsen's group, or Martini; these are things that try to model coarse-grained proteins. Maybe you capture another 80%, but then there's some fraction of those that are not really evaluable, because, whatever, you can't just use these coarse-grained models.
So I don't know; there are a lot of things I've talked about here, but the point is that sometimes it seems like we can solve the problems with empirical machine learning models, sometimes it looks like we can solve them with first-principles methods, and I don't know. The only thing that works 100% of the time is measuring it in the lab, and maybe all these methods are just approximations, and all they can do is increase your hit rate or decrease the number of experiments you have to do in a loop in the lab. So anyway, that's a long-winded way of answering a question that I don't know the answer to.
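The "increase your hit rate" framing from this exchange can be made concrete with back-of-envelope arithmetic: compare the expected cost per true hit when you wet-test candidates blindly versus when you pay for an in silico filter first. The dollar figures below are in the spirit of the numbers mentioned in the conversation, but the hit rates and filter quality are illustrative assumptions:

```python
def cost_per_hit_unfiltered(base_hit_rate: float, exp_cost: float) -> float:
    # Blind screening: on average 1/base_hit_rate wet experiments per true hit.
    return exp_cost / base_hit_rate

def cost_per_hit_filtered(exp_cost: float, sim_cost: float,
                          precision: float, flag_rate: float) -> float:
    # Simulate candidates until one is flagged (1/flag_rate simulations per
    # flagged candidate), then wet-test flagged candidates until one is a
    # true hit (1/precision experiments per hit).
    sims_per_hit = 1.0 / (flag_rate * precision)
    exps_per_hit = 1.0 / precision
    return sims_per_hit * sim_cost + exps_per_hit * exp_cost

# Assumed numbers: ~$20 wet experiment, ~$10 free-energy calculation,
# 0.1% base hit rate; a filter flagging 1% of candidates at 10% precision.
print(cost_per_hit_unfiltered(0.001, 20.0))
print(cost_per_hit_filtered(20.0, 10.0, precision=0.10, flag_rate=0.01))
```

With these assumed numbers the filter roughly halves the cost per hit, but the same arithmetic also shows Andrew's crossover worry: when the simulation costs about as much as the experiment, the filter only pays for itself if its enrichment over the base rate is large.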

Nathan Labenz (41:47) Yeah, that's really very interesting. When you speak about the price of the actual experiments coming down, are there caveats there? Like, you said, for example, it depends what your molecule catalog looks like. What if I am trying to hypothesize a new smallish molecule that doesn't exist in nature, maybe never before synthesized, and I'm like, I'm looking for something that can do this thing; I'll generate a huge number of hypothetical candidates and then run them through a model. I can't synthesize those cheaply, right, still today?

Andrew White (42:24) Yeah. I think this is still an open question of what's the right approach, but there are these things called virtual catalogs that have reached, I think, I don't know, 80 billion molecules. So there are like 80 billion molecules that people are pretty sure they can make, and so what you can do is actually just work with those instead of a generative model and basically screen these 80 billion molecules. I don't know what the number is right now; it's somewhere between 50 and 100 billion. There's something called ZINC, you can look it up: ZINC is a database of hypothetical molecules, and it's in the dozens of billions. Basically, these things are made with a combinatorial rule set: here are all the things you can buy cheaply from petroleum side products, here are all the reactions that we know, and here are the ways you can combine them all. So you have this combinatorial explosion of molecules. At this point there's a pretty much unlimited number of molecules that you can order from a catalog. One caveat here is that all these molecules come with something like an 80% success rate guarantee; that is, you can order it, and 20% of the time they'll fail to fill your order, which is fine, you can work around it. But then there are a lot of groups that have built very good models for just making an arbitrary molecule. Philippe Schwaller and his collaborators developed these molecular transformers, which basically can predict the outcome of a reaction, and then you can combine that with some search method and basically predict how to synthesize arbitrary molecules. There are commercial products that can do this: IBM has one, there's a company called PostEra that has one, and so basically it frees people who design molecules from having to think too hard about synthesis.
So I think in some sense synthesis has been solved. But when you're late in a drug discovery program, usually you're centered around some lead, and then things get more expensive. You want to modify it, and thanks to internationalization you can hire a chemist in China or India, and maybe the cost of one of these chemists is like a $100,000-a-year salary, and that includes the lab and all the reagents and things like that, and they can maybe make 20 molecules a week. So it's still, you know, it's not free, but it's reached the point where it's not too expensive. And so I think we do have a lot more freedom now in the chemistry we can design. But on the other hand, the kinds of targets people are trying to hit and the kinds of molecules people are trying to make have also grown in complexity, with these new categories of drugs like PROTACs, molecular degraders, this induced-proximity work where you make these big fat molecules that have to accomplish two or three tasks in the body. And so you're back at the same point you were maybe 30 years ago, where you have a whole bunch of complexity you're trying to pack into a small molecule. And so even though in theory we can make lots of molecules, you're trying to do so much in a limited number of atoms that you're back where you were: you need to be very careful and clever, and you have only so many experiments. So anyway, I don't know. I actually don't remember what the question was, but I just talked a whole lot.

Nathan Labenz (45:04) Yeah, well, I think you answered it. It was just that my experience back as a chemistry undergrad was that small molecules were, in many cases, a whole PhD to synthesize. Some were easy, but many were hard, and there was not, you know, there was not like an 80% success rate on just picking one out of a space of 100 billion.

Andrew White (45:28) Yeah. Yeah. I do want to say that I think that is reflective of a certain shift in mindset. So historically, maybe when you did your undergrad, a lot of people would start with a natural product. This is the older idea of drug discovery, phenotypic drug discovery, where you grind up a bunch of dirt or frog goo or something.

Nathan Labenz (45:47) It's always the Amazon frog, yes.

Andrew White (45:48) Yeah, exactly. And you put that on a cell and you see, okay, does this cure cancer in my model? And then, okay, it cures cancer in my model, so now we've gotta find out what's in the frog goo: you run it through the mass spec, you run it through the NMR, and okay, now I know what the compounds are, and now I need to start making these things and testing them. And this leads to this natural-product organic chemistry where you're trying to make pretty complicated molecules, because they're not made from room-temperature chemical reactions; they're made by enzymes in some biological system. So those are definitely not an 80% success rate. But I think a lot of the drug discovery community moved on from natural products. Rather than starting with these natural products, which are very complicated but often active molecules, they start with, I would say, more petroleum-derived compounds. These are things which are just part of the global supply of chemistry, a lot of which is driven by oil. Because of that, the molecules that you see in drug discovery today are very different from what you would see maybe 30 or 40 years ago, when they came from natural products. The chemistry is usually easier because, by default, from the bias of where we get them, they're coming from easier reactions: a lot of amine chemistry, a lot of simple basic building blocks, maybe 75 basic chemical reactions that are done in med chem now. So the compounds look different, and some people argue that we've lost something by doing that. There's actually some nice effort lately with something called COCONUT, where people try to put as many natural products together as possible and try to get people to train their generative chemistry models on those instead of the more petroleum-derived catalogs.
Or there are companies like Octant Bio that are doing phenotypic drug discovery with these petroleum-derived compounds, whereas most people are like, okay, I know I want to hit this protein, I'm going to hit it with these small molecules, design the molecule to fit the protein. Sri Kosuri's company, Octant, they're like, okay, I want to fix this cell, and so they're working in the cell model, which is closer to what people used to do with natural products. So anyway, there are lots of ideas in the field, and that's why you see, I think, a change in the chemistry, where it's become routine because we sort of made this decision to double down on structure-based drug discovery with petroleum-derived compounds.

Nathan Labenz (47:46) Yeah, fascinating. Okay, that's a really good update for me. I haven't stayed super close to the latest in chemistry

Andrew White (47:51) Yeah.

Nathan Labenz (47:52) And that's a good chapter for me to add to my understanding. One other kind of conceptual question, since you've got great answers to these: what do you think about the multiomic model approach? You know, there was Evo; for me, earlier this year, I think, it was a real seemingly blockbuster moment where I was like, okay, it's gonna happen. And by "it," I mean that the models will learn higher-order representations of the space. That's happened in language, but we've got, like, lots of weird debates around "well, they learned all that stuff from us, so it doesn't really count" or whatever. But in the biological realm, there's probably ample stuff there that we don't know, and if these models can figure out how to represent it in an effective way, which Evo seems to be showing a little bit of a preview of, then maybe mechanistic interpretability techniques developed for machine learning become the backdoor into understanding what actually matters, as, you know, hopefully learned from a huge amount of multiomic data. What's your feeling on that approach?

Andrew White (49:06) Yeah. Gosh. This is a topic that I think is maybe older than Evo. So if you go back a little bit to these companies like Recursion or insitro, and to some extent Calico, although I don't actually understand what Calico does: basically, they try to build this foundation model of biology. You build some big matrix of small-molecule drugs, transcriptomics data of what genes are being turned on or off, and then you take pictures of the cells, and that's the phenotype. And you can build this big model of how molecules affect the genes, from the transcriptomics, and the phenotype, from the imaging. And then a big model can come out of that, and the biology would just sort of fall out, or the new targets would fall out. And this is just one step in the long promise of genetic information, right? There was the Human Genome Project: we could figure out all the genes. Then it was like, okay, we need to see how they change, and we get something like GWAS, where it's like, okay, we're going to finally figure it out now, but GWAS didn't lead to this. And we have things like DepMap from the Broad, where we're trying to understand lots of cells and tissue. I think the next one is these genetic foundation models. I think they can be very good for things like metagenomics, for understanding transcription factors, for understanding regulation networks, but I don't think they're going to be the sort of silver bullet that unlocks any deeper understanding of diseases or of what goes on inside a cell. I think they're going to be one more model, one more tool in the toolbox, but I don't see them as really moving the needle that much. So yeah, I don't know. I'm kind of pessimistic on this thing.
It's one of those fields where people have tried really cool-sounding ideas that should have worked eventually, but every single one has failed. So when you get old like me, you're just cynical, and with every one that comes out you're like, well, we'll see. I don't know; you can read Derek Lowe. If you read Derek Lowe's blog, I'm not going to say he's cranky, but he's seen a lot of waves of new ideas in biology, and it's pretty rare for things to work out. A cynical position is the default, I think, right now. But certainly, for just understanding biology, awesome. For curing diseases, I think we'll have to wait and see if it really moves the needle.

Nathan Labenz (51:19) Do you think that could be because... I think you could tell a very similar story about machine learning up until, like, 5 years ago, or maybe 10 years ago depending on exactly when you wanna start the clock. There's, like, a sport now of saying everything was invented in the eighties, and I think to a large degree that is true: most of the ideas that are now really working with sufficient scale, people had at least an initial take on a long time ago, and they just didn't have the scale of data or compute to get there. I don't know if you wanna approach this from a handicapping perspective, but do you think that there is a chance that all these things maybe were the right ideas, and we just haven't got quite enough scale to actually grok, so to speak?

Andrew White (52:09) Yeah, yeah. I think you're onto something here: one of the challenges of biology is the latency from idea to feedback. Machine learning is great because you can try things, and as compute has gotten good, you can see the results of those things quickly. In fact, you see a problem right now where people have cool ideas that work at a billion-parameter model, or maybe a 500-million-parameter model, and when you scale up to a 10-billion- or 40-billion-parameter model, all of a sudden it just doesn't work. Or it works at 500 million parameters, and you go to a trillion tokens and it doesn't work. And so you see this thing where, when you scale up, it costs more time and there's longer latency, and that is a great filter of ideas: you find out, okay, actually the simplest ideas, or these particular ones, are the right ones. In biology we kind of have this as well. You can do pretty quick experiments on a protein in a well, right? Just a little protein, you can manipulate it pretty quickly. You put it in a cell, and then okay, it's a little bit longer, because you've gotta wait for the cells to grow and you've gotta transform the cells somehow. Then you go to tissue, and then you go to a model, a mouse model, right? Like, let's say I want to cure aging. Well, unfortunately, mice live a long time, and so you've gotta wait for those mice to die or not. And so you keep running into these latency problems, and when we get to drug discovery, it's at a latency where humans just suck at it, because it takes, whatever, 7 years from mechanism to phase 2 trials, and it's phase 2 clinical trials that tell you if the drug worked or not. And 7 years means all the people there have moved on, right? 7 years means that people forgot why they were doing this in the first place. And so you have these feedback loops that take so long, and that is really hard.
So I agree, it could be something like: we figured it out, but we won't know for 30 years, because we need to try a few cycles to see if it works. Or maybe Recursion, maybe insitro, maybe Calico, they all got it right. But insitro just got their first platform asset, I think, into phase 1, or to IND. So it took 7 years for them to get from company conception to getting their own molecule into the clinic. You look at Recursion; people love Recursion. They do great social media, they have a big GPU cluster, everyone is excited about them; I hope their stock price keeps going up. But Recursion, I think all of the assets that are in the clinic right now from them, they bought from other people. So they haven't really gotten their platform going yet. So we won't know for a while if their idea of building a big foundation model is the right answer. And then there are things like Atomwise, a very early biotech doing convolutional neural networks on small molecules binding proteins. Great idea. They had great work, great results, great people. But what happens is that some of these biotechs die because they picked the wrong target, not because they had a bad platform or the tech wasn't right. And so it's just really hard to separate cause and effect with these very long latencies, very expensive experiments, and so many variables going on. So I think it may be the same as machine learning. There may be some bitter lesson, but we're not going to learn it until 30 years from now, or until we figure out a new way to do experiments, right? Maybe human organoid research is going to get us to lower latencies on getting feedback from ideas. Maybe, you know, maybe RFK Jr. is going to come up with a whole new way of doing FDA approvals, so we'll get faster clinical trials. There's this stupid thing in drug discovery: the most important bottleneck right now is the time to enroll the first patient in a clinical trial.
So if you wanted to cure diseases, cure cancer, if you wanted to save the world in drug discovery, you should be working on better online ads for clinical trial site enrollment. That should be where people invest the effort. But that's not a sexy thing to invest the effort in. And that is, I don't know, something like 60 days, I think, the time to enroll the first patient. And that 60 days is very amenable to innovation, but people put their innovation into finding a new hit from a machine learning model, or building a new foundation model to maybe improve your biological mechanism by a little bit.

Nathan Labenz (55:56) If there's one thing I do have some hope for the Trump administration doing, it would be some sort of data liberation out of electronic health records, to then have LLMs crunch through and more proactively identify the possible patients, as opposed to having to filter everything through the click of an ad and then, you know, collect the data manually.

Andrew White (56:17) Yeah, actually, if you have a direct line to RFK Jr. or Donald Trump, or if, you know

Nathan Labenz (56:21) I'm a couple of months away, but it wasn't in the ask.

Andrew White (56:25) Okay, here's something that they can do. Release all of the IND packets for every drug. Every drug which is clinically approved has filed an IND packet, which contains a ton of valuable toxicity, pharmacology, and pharmacokinetic data. And all that data is no longer competitively useful to them, because if it's already clinically approved, there's no competition, it's already under patent protection, it's already being sold. If they were to release all that data, there's a huge trove of data that we could use to better fit machine learning models. It's very expensive data to get, and it's all kept at the FDA for, I think, no really good reason. So anyway, that is a

Nathan Labenz (57:02) I heard there might be some of it in just, like, folders at Mar-a-Lago, so you could also take a left turn past the bathroom, and who knows what you might find. Alright. I joke, but I shouldn't joke too much. I don't wanna get on any retribution lists. Okay. So this is a great foundation. I think it's always really helpful to understand the worldview that motivates the work. Let's finally get into Future House. I think I have a growing intuition for the why of the AI scientist: it's sort of like, you don't really believe in any other silver bullet, and so we just need more scientists that can grind away at this for a long time. And, obviously, an AI scientist has certain strengths. I've particularly dug in on the PaperQA and then the new Aviary papers, but maybe give us a brief history from inception to these most recent ones, highlight any other works, and the overall strategy to date. Yeah.

Andrew White (58:06) So I guess Future House started with ChemCrow. ChemCrow is an early paper where we basically wanted to do a full scientific discovery process with language models and these tools. I think the most important sort of problem we did in ChemCrow was we combined retrosynthesis predictors, literature search, code execution. We combined that with GPT-4 and we asked it to design a new dye. Basically we said, okay, I want you to make a molecule that's novel, with two steps, like a two-step chemical reaction that's never been made before, and have it absorb light at a specific wavelength. And we gave it the supporting information from a paper to start with. So it basically fit a machine learning model to the supporting information to predict the wavelength of light absorption in molecules. Then it used the retrosynthesis tools, did some searches, to try to find a new molecule that would absorb at the specific wavelength. And then it came up with a reaction procedure, and we had a robot lab at IBM, IBM RoboRXN, which is like a cloud lab that we were able to use, and it was able to go through the synthesis procedure to make that new molecule. And then we were able to test it, and it actually was very close to the target wavelength. I think we were like 15 nanometers off or something. So it was a great closed-loop discovery. There were some problems. Basically we were in a rush, in a time crunch, so the robot didn't finish that reaction. So we actually had a person go in and do the last step. So it wasn't a completely AI-novel molecule, but we did other examples in the paper where it was all done with the robot. But I think it's a full loop, right? It does literature research, it plans the protocol, it goes and executes it, and it does the measurement. But you know, one of the things I realized is that the scientific literature is such an insane concept to me.
Like, it's really surprising that we did this as a civilization for hundreds of years. It's like a big networked artifact of all scientific progress. I could complain about how it's locked up behind publisher paywalls for hours, but I won't, I'll hold it back. But basically this artifact, to me, like 99% of doing science is knowing the literature, and the last 1% is just moving things around a little bit to get to innovation. And so when we started Future House, the first thing we wanted to go after is scientific literature. If you can work with literature and you can understand literature, see what's been done and what hasn't been done, see what's novel, what's not novel, that's going to be a huge fraction of the work of automating science. So the first sort of project we did was PaperQA. PaperQA I actually wrote before Future House. I was giving a talk in Copenhagen, Denmark. I think it was in January or February or something like that, and I was in the hotel. It was rainy, it was winter, so there wasn't much to do, so I was just messing around on my laptop and made PaperQA. And it was around these ideas of RAG systems. What was different about PaperQA was that instead of pulling the chunks of the papers and giving them to a final LLM to answer, which is sort of how RAG normally works, it basically does a MapReduce: it pulls up all of the chunks that could be relevant, and then it runs an LLM on each of these chunks to summarize them and re-rank them, and then gives them to the final language model to do the answer. And that solves two problems. One problem is that language models are extremely sensitive to distracting information in the context.
So when trying to work with the scientific literature, if you got a wrong paper in there, or a section of a wrong paper, it would just blow up the system with incorrect information. So one example: one of the questions you could ask it is, in the, I don't know, in the EBM trial, what was the size of the placebo arm? And it would pull up the wrong trial methods, and if that was in there, it would distract it from the right information. So that's what this contextual summarization step did. So anyway, that was PaperQA, and basically the concept was done very quickly, and then there was tons and tons of work: defining what are correct answers, how do you work with literature, how do you make sure you get all the papers, how do you consider citations, journal quality, how well do humans do, can we make this repeatable, can we iron out bugs. It took a long time, and we sort of capped off the project with something we call WikiCrow. Once we beat human-level performance by making lots of changes (we have an engineering blog that walks through all the A/B decisions we made to get to beating humans), we wrote 20,000, actually I think it came out to like 18,000, Wikipedia articles that covered the function of every gene in the human genome, with the exception of a few, like 2,000 or 1,500, that have no papers ever written about them. So that basically was a summarization of all knowledge of the human genome, which didn't exist until this was done, because Wikipedia only covered like 2,500 of them. So we added like 16,000 more articles to explain the rest of the human genome. I think that was a good mark showing we can do this at scale: we can answer literature questions as well as humans, and we can do it at like 75 per minute. So we could do what we call contradiction detection.
We have systems which can look for any contradictions or disagreement with any arbitrary statement from the literature, which is actually a really hard task, by the way: to check all 250 million papers against an arbitrary claim. We can do that at the scale of every arXiv paper that's posted per day. We can check every single one for any disagreement in the literature and mark it. Or we can now write the Wikipedia article for every disease that exists every 3 weeks, considering all new research that comes out. So I think PaperQA is a really good beginning-to-end: we beat humans, we deploy it, we build the infrastructure, we have an API, and we can run it at scale. And once we had that, it was like, okay, what comes next? So we took the lessons we learned in PaperQA into something called Aviary. What we did is we broke up PaperQA into two components. Basically, there is what we call the environment, and this environment is the tools that are available: look at citations, summarize this paper, do a Google Scholar search, do a keyword search over some corpus. We turned that into environments, and then we have the agent, which is the thing that drives the decision making. That's like, okay, let's go do another Google Scholar search. Okay, let's actually search Semantic Scholar. Let's see who cited this paper. And Aviary is this idea that, okay, let's make a bunch of environments of scientific tasks beyond literature. So like ChemCrow, there's an environment for ChemCrow, or we have an environment for designing proteins. We have an environment for doing molecular cloning.
So we have these environments, and then what we do is we make agents, and we've separated these two so we can try different agents, not just different LLMs. We have agents that we've made ourselves with their own LLMs, or agents that have memory, agents that can consider multiple actions, they can do reflection, you can have a reward model baked inside of an LLM. So we've tried to structure this agent interaction. And then we have an environment and an agent, and we train them together on some benchmark. So we wrote LAB-Bench, a collection of benchmarks that we think are relevant for doing science. When you train them, you basically put them together and you call it a crow. It's a crow because it has tools (that's the environment) and it uses language (that's the language model); put those together and it's like a crow, because crows are birds that can talk and also use tools. And we are now deploying them, and we built an aviary, which is a bunch of different crows that can do different tasks, and they're all connected. And we're slowly building a platform of intellectual microservices. Basically: go look for contradictions. Go do literature research on this topic. Design me a molecule that does this. Design a plasmid which can clone this specific protein in this specific organism. So putting all these things together, we're basically building the API for doing science.
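The environment/agent split described here can be sketched as a pair of interfaces. This is a hypothetical minimal version for illustration, not the real aviary API; all names are assumptions:

```python
# Sketch of the split: untrained tools live in the Environment,
# everything trainable (LLM, memory policy) lives in the Agent.
from abc import ABC, abstractmethod


class Environment(ABC):
    """Holds the (untrained) tools; turns actions into observations."""

    @abstractmethod
    def reset(self) -> tuple[str, list]:
        """Return the initial observation and the available tool schemas."""

    @abstractmethod
    def step(self, action: dict) -> tuple[str, float, bool]:
        """Apply a tool call; return (observation, reward, done)."""


class Agent(ABC):
    """Holds the LLM and the memory; everything here is trainable,
    including how message history is truncated or compressed."""

    @abstractmethod
    def act(self, observation: str, tools: list) -> dict:
        """Choose the next tool call given the latest observation."""


def rollout(agent: Agent, env: Environment, max_steps: int = 10) -> float:
    """Run one episode; a trainer would optimize the agent on this reward."""
    obs, tools = env.reset()
    total = 0.0
    for _ in range(max_steps):
        action = agent.act(obs, tools)
        obs, reward, done = env.step(action)
        total += reward
        if done:
            break
    return total
```

Because the two sides only exchange observations and tool calls, any agent (with any memory scheme) can be dropped into any environment, which is the flexibility Andrew describes.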

Nathan Labenz (1:05:52) That's fascinating. Cool. Really nice job laying out the progression and the vision there. Any number of follow-up questions. I guess, going back to PaperQA, we are now on PaperQA2, right? And the one thing that stood out most in terms of why it has worked well, because of course other people have tried to do this too, and I would say it's not easy to get above human performance in answering these questions, it sounds like one big insight was: process everything. I sometimes call this "Flash everything" in honor of Gemini Flash, because it's so cheap. I once set out the challenge for myself to spend a dollar a day on Gemini Flash, which is 13 million input tokens and, like, a lot of information to process. I actually have not achieved that. I've achieved it on a few days, but not consistently, because it turns out it takes work just to identify those targets. But it sounds like that's one of the big things: do this sort of filtering and identification of relevance with the language model, as opposed to just with an embedding approach. What other big drivers of incremental progress would you highlight for people who might wanna build a system?

Andrew White (1:07:12) Yeah. I think that was a big idea. Doing that solves a lot of problems: all of the effects of chunking, chunk size, quality of parsing, a lot of those things just disappear when you do this extra intermediate step. So it sounds like extra work, but actually it solves a lot of practical problems. It also increases the cost and the time to get a response, and I think that's why you don't see Perplexity or Elicit or anybody doing this kind of two-step process, because for a consumer-facing thing, you're going from whatever, 15 seconds or 5 seconds, to like a minute or 2 minutes. And I think that kind of reflects the philosophy: we just want to build the best possible system, forget cost, forget latency. I think that's
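The two-step, MapReduce-style flow described in this conversation might be sketched roughly as follows. All names are hypothetical; `llm` is a stub standing in for any chat-completion call:

```python
# Sketch of retrieve-then-summarize: run an LLM over every candidate
# chunk to summarize it in the context of the question and score its
# relevance (the "map"), then answer from the top summaries (the "reduce").
from dataclasses import dataclass


@dataclass
class Chunk:
    source: str
    text: str


def llm(prompt: str) -> str:
    """Stand-in for a real LLM API call; returns a canned score + summary."""
    return "RELEVANCE: 7\nSUMMARY: " + prompt[:60]


def contextual_summaries(question: str, chunks: list[Chunk], keep: int = 10):
    """Map step: summarize and score each chunk against the question,
    then re-rank and keep only the top-k. Distracting chunks drop out
    here instead of reaching the final answer model."""
    scored = []
    for chunk in chunks:
        out = llm(
            f"Question: {question}\nExcerpt: {chunk.text}\n"
            "Summarize only what is relevant and rate relevance 0-10."
        )
        score_line, _, summary = out.partition("\n")
        score = int(score_line.split(":")[1])
        scored.append((score, chunk.source, summary))
    scored.sort(reverse=True)
    return scored[:keep]


def answer(question: str, chunks: list[Chunk]) -> str:
    """Reduce step: the final model sees only the ranked summaries."""
    top = contextual_summaries(question, chunks)
    context = "\n".join(f"[{src}] {s}" for _, src, s in top)
    return llm(f"Answer using only these summaries:\n{context}\n\nQ: {question}")
```

The extra per-chunk LLM calls are what drive up cost and latency relative to plain RAG, which is the trade-off Andrew is pointing at.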

Nathan Labenz (1:07:54) That in and of itself, just to highlight it, is a really important philosophy that I try to Johnny Appleseed my way around the world spreading any chance I get. Engineers put way too much emphasis on cost especially. Latency I understand more if you have a user sitting there waiting; human time is still precious, even if tokens are, you know, increasingly super abundant. But I always emphasize: deliver me the highest-value thing first, at any cost and at any latency, and then we can kinda work backward from there. So that's good. Keep going.

Andrew White (1:08:31) Yeah. I

Nathan Labenz (1:08:31) just wanna hammer

Andrew White (1:08:32) that. Yeah, it's great. I mean, I think it's a philosophy people haven't really caught on to, because people, I think, are still stuck in this mode of a Google search, where it should come back fast and it should be amortized cost: they spend a bunch of money building a big index, and then they have low-cost queries. We didn't spend a bunch of time building a big index. We do all the processing just in time. So everything's more expensive, and we don't really amortize the cost. Another big effect is full-text search. Most search engines over research papers are not full text, and that just loses so much information. And that's why most benchmarks and most competing tools work with abstracts and titles, because those are accessible from search engines. It's very rare to get full-text search. So we have built our own full-text search. We open sourced a version of that in the PaperQA2 repo, so you can build your own index. We have our own internal thing; it's just exactly what you'd think, so you can use full-text search in Postgres or you can use Elasticsearch. And then Google Scholar is just really, really, really good. So a lot of our performance comes from using things like Google Scholar and Semantic Scholar, which recently has full-text search; that's a big change. Other things I would say: anything you can do to remove distracting information. A lot of our engineering blog is things like how to cut out potentially relevant things, how to do summarization efficiently. Like, we might consider 75 sources and then cut them down to 10, and that's better than considering 25 and cutting down to 15, because you don't have as big of a cut. So yeah, full-text search, so search is very, very important. And then this step, I think we call it in the paper RCS, retrieve and contextual summaries, sorry, rank and contextual summaries.
Those are the most important things.

Nathan Labenz (1:10:25) One thing that did jump out at me as I was perusing the code a little bit is that the control flow, like the actual agent itself, wasn't super complicated. The prompt didn't seem super insane. It was kind of like, here are your tools. It was much closer to what I think people would sit down and write in their first attempt than you might think, given how well it works. So I wonder if you had any reflections on that.

Andrew White (1:10:52) Yeah. I think what we have in the repo is pretty general purpose. If you look in the repo, we have these configs, and there's a WikiCrow config or a ContraCrow config. Basically, ContraCrow is just: look for contradictions, and that's, I think, quite a different one. And WikiCrow is: summarize these things. I think we had a philosophy change. We used to have a very detailed prompt, and we moved a lot of the complexity into the tool descriptions. So we have a package called aviary, and what it does is it will take a Python signature with types, and then the docstring, and it'll turn it into a tool for an LLM to use. And so the docstrings end up as the tool descriptions. So some of the complexity is kind of hidden away in the docstrings, where it's things like, okay, try doing multiple searches with different keywords. Actually, this is why Perplexity is quite good; there's Perplexity's Pro Search. It does what's called query expansion: whatever question you ask, it turns it into like 3 keyword searches with different phrasings, and that really helps. We don't use query expansion directly; we tell the LLM to try doing multiple searches with different keywords. That's actually a really effective way to get good retrieval of sources.
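One way the signature-plus-docstring-to-tool conversion described here could look (a hypothetical sketch using the Python standard library; the real aviary package's implementation differs, and `paper_search` is an invented example tool):

```python
# Turn a typed Python function into an OpenAI-style tool definition.
# The docstring becomes the tool description, so prompting guidance
# like "try multiple searches with different keywords" lives there.
import inspect
from typing import get_type_hints

_JSON_TYPES = {str: "string", int: "integer", float: "number", bool: "boolean"}


def to_tool_schema(fn):
    """Build a tool schema from fn's type hints and docstring."""
    hints = get_type_hints(fn)
    hints.pop("return", None)
    params = {name: {"type": _JSON_TYPES.get(t, "string")} for name, t in hints.items()}
    return {
        "type": "function",
        "function": {
            "name": fn.__name__,
            "description": inspect.getdoc(fn) or "",
            "parameters": {
                "type": "object",
                "properties": params,
                "required": list(params),
            },
        },
    }


def paper_search(query: str, limit: int) -> list:
    """Search the paper corpus by keyword.

    Try multiple searches with different keyword phrasings; expanding
    the question into 2-3 variants usually improves retrieval."""
    return []


schema = to_tool_schema(paper_search)
```

With this pattern, improving the agent's behavior often means editing a docstring rather than a central prompt, which matches the philosophy change Andrew describes.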

Nathan Labenz (1:12:03) Don't mention one

Andrew White (1:12:04) other thing. We have this tool which we don't really advertise, because we're not sure we can handle the capacity, but it's called hasanyone.com. Have you seen that? No? Okay, so we have this tool called hasanyone.com, and it's a very narrow version of the internal tool we use for PaperQA. It will basically do a search to see if anybody has done X, and it uses our paper database and our search tools and things like that. It's a way to see, because with PaperQA2 we changed some things for it to be more useful to people. Rather than using a full-text search you build and put on cloud infrastructure, it uses this Rust library called Tantivy, which will build the search index for you on a per-directory basis. I think that's much closer to what people want. We have our own full-text search that's on our cloud infrastructure, so you can use hasanyone.com to sort of get a sense of what we use internally.

Nathan Labenz (1:12:52) Gotcha. Cool. Would these things work if you just did a quick pivot to a totally different domain, like a social science? Could you go to economics or education?

Andrew White (1:13:04) Yeah, yeah. It actually works well in many different domains. We've used it in lots of domains. We actually have a phone number you can text, so, like, I'll be out at, I don't know, a party or something like that, and I can text PaperQA and it will text me back a response. I've asked questions all over the place: history, economics, whatever. It has access to all the papers in these domains. We've concentrated on getting access to papers in biology and medicine, but it has arXiv, ChemRxiv, medRxiv. It has the ability to download open-access papers. It has anything we have in our cache, so it's able to cover a lot of different domains. But you will see this unevenness. If you go to machine learning, it's great, because everything's on arXiv; there are no real problems with open access, things like that. If you go to something like internal medicine, like the New England Journal of Medicine, they have some open-access things, but they have the most annoying Cloudflare anti-bot stuff, so we're not able to access any open-access papers in the New England Journal of Medicine. So it's kind of field dependent, which journals are dominant. Does it impact the performance? Yeah.

Nathan Labenz (1:14:06) Yeah, that stuff is quite annoying. I mean, that has really been a big story of agents in general, web agents, whatever: they're so often hung up by literally the dumbest stuff. It's a very weird phenomenon. One thing that I would say is not so dumb, this is what I was going to ask before and finally remembered, is the challenge of just using figures effectively. I find, like, whatever, no need for false precision, but just reflecting on my own mode of absorbing information from papers, a lot of it boils down to the figures. And, unfortunately, most of the time still, when you put a PDF of an academic paper into a system, it just kinda skips those, maybe takes the caption, but doesn't really engage visually with the thing. Did you guys try to do anything about that, or is that something you're just kinda waiting for models to get better at and hoping to delegate in the future?

Andrew White (1:15:08) So we actually have a benchmark called FigQA in our LAB-Bench paper, and FigQA is an assessment of scientific figures, where we took pretty hard figures and pretty difficult questions. And actually, I think Sonnet, the October Sonnet model, beats humans on this benchmark now. So we've reached a point where the models are definitely at human or superhuman performance on looking at figures. So we are actually adding this to PaperQA now. We're really bad at naming things, so we call it PaperQA2, but we try to use semver to mark backwards compatibility. So we're actually working on PaperQA2 version 6, and that version 6 is going to have the ability to look at figures. And what we do is really simple: we parse the table, sorry, we parse the paper as text for the purposes of search, and then we feed the images into the models for the later stages, either RCS, or RCS and the generate-answer step, the final answer step. So we are looking at that now. I think the models have reached the point where they're good enough at looking at figures that you can actually do this at scale.

Nathan Labenz (1:16:12) And so you chunk and then you just kind of attach an image to the chunks and then at the very end you kind of include the associated images?

Andrew White (1:16:20) Yeah, yeah. I mean, we haven't written a paper on this or done a lot of benchmarking on this, but the current philosophy is that the purpose of the text is just for the search. So the figure caption is there for the search. And then what you can do is you just give the picture of the page, including the text around it, as opposed to some parsing of it. And yeah, I think these models are so good at this that there won't be any issues. I mean, you can go down this path. I think we did this for building FigQA: you can use Labelbox, like human annotation, to try to cut the figures out. There's a group, called like SciNet or something, I forget where they're from, that has written a pipeline with PyMuPDF to extract the figures. But it's really fraught, because PDFs are so different. You have a PDF from 1992 and the figure extraction is not gonna work, and then a paper from 2020, it might work. So yeah, I think it's really difficult to extract figures programmatically. So I think just taking a render, taking an image of it, is the way to go.
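The split described here (parsed text used only for retrieval, the rendered page image fed to the model at answer time) might be sketched like this. `PageChunk`, `retrieve`, and `build_messages` are all hypothetical names, and the keyword retrieval is a toy stand-in for real full-text search:

```python
# Sketch: search over parsed text, but hand the model page renders,
# so figures are seen visually rather than via fragile PDF extraction.
from dataclasses import dataclass


@dataclass
class PageChunk:
    text: str          # parsed text + figure caption: used only for retrieval
    page_image: bytes  # rendered page image: what the vision model actually reads


def retrieve(question: str, chunks: list[PageChunk], k: int = 3) -> list[PageChunk]:
    """Toy keyword-overlap retrieval over the parsed text."""
    words = set(question.lower().split())
    ranked = sorted(
        chunks,
        key=lambda c: len(words & set(c.text.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def build_messages(question: str, chunks: list[PageChunk]) -> list[dict]:
    """Multimodal message: the model sees page renders, not the parse."""
    content: list[dict] = [{"type": "text", "text": question}]
    for c in chunks:
        content.append({"type": "image", "data": c.page_image})
    return [{"role": "user", "content": content}]
```

The design choice is that text parsing only has to be good enough for search; figure understanding is delegated entirely to the vision model looking at the rendered page.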

Nathan Labenz (1:17:21) Yeah. Okay. Cool. That's quite interesting. So no love for, not to say no love, that's too harsh, but no need, in your estimation, for various tools? I'm thinking like the Marker API; there's a number now that are specifically about slicing up inputs. So like

Andrew White (1:17:40) this is a good question. I think our philosophy, going back to what we were talking about, is we just want to spend the most money and the most time, but with the lowest technical sophistication. There are a few of these companies now, and I think a lot of them are built around Detectron fine-tunes, where you basically fine-tune a model to parse patents or research papers. And I think long term, what's going to win is just taking a picture of the page and giving it to a model which is sufficiently competent to actually look at the page and answer questions about it. And right now, if you just take a picture of the figure, including the caption and everything, and give that to Sonnet, the October model, it will correctly answer questions about the figure. And so I think there is no additional need to go more complicated.

Nathan Labenz (1:18:26) Yeah. That makes me really interested in, and I'm kind of surprised I haven't seen this yet, the latest Claude score on the ARC-AGI prize. Because when they did the first 3.5 Sonnet, which was a major step up in coding, whatever, however many months ago, I took ARC-AGI puzzles to it, and I was really amazed by how poorly it could see, for lack of a more technical term. It, like, could not count blocks or do the basic atomic operations that you would need to be able to do in order to do the reasoning later. This is probably another example, or at least seems pretty likely to be an example, of something you kinda said at the top: they learn in sort of a reverse order. Because I bet that if it could see more effectively, the reasoning would be a lot stronger, but such basic mistakes were made at the visual level. This new one, of course, can count pixels and click accurately on buttons and all that. So where is that score, I wonder?

Andrew White (1:19:27) Yeah. Yeah. I think when we built our benchmarks, the first thing we tried to do was what I would call recall-oriented tasks. Like, here's a table, turn the table into JSON, or here's a figure, give me all the values in the figure. And the models are not very good at those recall-oriented tasks. But then we realized we don't actually care about that. We actually care about these precision tasks, like: based on this figure, does treatment A improve response? Or: based on this table, is method A better than method B? And it turns out they're very good at those precision-oriented questions; they get superhuman-level performance on them. But these recall-oriented tasks, like digitizing a figure or counting things, they don't do well at. And that's actually been a weird kind of mindset shift for me. I mean, you go to Gemini, you put a whole 100,000-token document in, you say, okay, what are the 50 main conclusions of this document? It'll do a shit job. But if you go, what's conclusion 1? And it gives the conclusion. What's conclusion 2? Okay, here's conclusion 2. What's conclusion 3? These models just work so much better in this sort of turn-by-turn or autoregressive way, where you ask single questions and get single responses. Whereas if you try to get these cohesive things, like summarize everything into JSON, they struggle a little bit more. So I think in figure understanding, if you ask it specific questions, it does well. But when you try to get all the information out of the figure, they don't do very well.

Nathan Labenz (1:20:46) Yeah, interesting. So what is the cost and latency to run a question? Maybe that varies to some degree by the scope of the question, but what do you roughly give in terms of guidance there? And is this something that you will productize in a full way? I looked at the GitHub repo; it has 6,400 stars, which is not a small number, but I also feel like the barrier of just having to figure it out is a lot, compared to if you guys were to commercialize it. I bet you'd have a lot of interested customers.

Andrew White (1:21:25) Yeah, this is a good question. So to answer your first question, you can see the real latency of the full system if you go to hasanyone.com and ask a new question. And it costs between maybe 15¢ and a dollar, depending on the question. If you ask, has anybody done X, and X is some moderately popular topic but a niche thing, it will do lots and lots of searches and lots of exploration to see if anybody's done it. But if you ask, I don't know, has anybody landed a UFO on the Washington Monument, it's not going to do a lot of searches. So it does depend on the question. But then, as far as the GitHub repo, we intentionally make it hard to set up, because we really only want to be interacting with programmers, hackers who know their way around these systems. I think if we had a one-click setup or something really easy to set up, then we're gonna have to go down the path of supporting lots and lots of people. And already we've run into so many problems, like it doesn't work on Colab, because Google will not do Python 3.11, or sorry, they won't do, I don't even remember at this point. But they basically are not updating Google Colab, and we use some Python 3.11 features, and that's already a big source of problems, and I just think if we make it super easy it could be an issue. Although Ethan Mollick was able to get it to work on Windows, which was a real testament; we were like, okay, we'll try to support Windows a little bit. So that was very exciting. But yeah, as for commercializing it, we get a lot of inbound. People want to use it for doing diligence on acquisitions, or diligence on investing in companies. People want to use it for IP searches. People want to use it for their PhD.
I will say that we generally let in academics, like PhD students, who ask for access to the full API, and we did do a workshop where we gave people access to the API with rate limits of like 5 questions per day. We have this Python API that uses our server, and so people do these sort of interesting projects where they have some idea-generating model which takes every good idea and runs it through PaperQA to see if anybody's done it, or if it's supported by the literature, or if it's invalidated. So there's a lot of cool ideas that people have here. But we are always caught in this tension: we want to measure success as an organization as number of novel discoveries. And does commercialization or revenue of PaperQA lead to more discoveries? I don't know. It could, if the right people are using it and the right sort of progress is being made. Or maybe you can argue that growing revenue of PaperQA is going to be beneficial for the org, so we can hire more engineers or something. And so I think we haven't figured out that balance between what is good for our mission and what is just chasing a mirage, or just going down this commercialization path for no really good reason. So we haven't worked that out, but that is something we think about a lot.

Nathan Labenz (1:24:01) Yeah, okay, cool. So turning to Aviary, I just had a couple of in-the-weeds questions on that as well. It seems there were a couple of novel things in this paper. One is just the high-level conceptual work to try to get clarity on what is an agent versus what is the environment. There's been a lot of confusion there, or at least people are doing things a bit differently. You were willing to put a stake in the ground, and it seems to me that the big thing is that memory is internal to the agent in your conception. I was a little less clear on, if somebody does it the other way, are you saying they're wrong, or where does that matter? But yeah, I'm interested in your philosophy of what an agent is.

Andrew White (1:24:47) Yeah, so how we broke it out in the paper is that I think a very practical working definition is that anything you want to train is going to be in the agent. Anything that you think is untrained will go in the environment. They interact via language: observations in one direction, and actions, tool requests or function calls basically, in the other. And when we go down this framing, it just simplifies a lot of things: we can try arbitrary agents, arbitrary trainers with agents, and then we can have arbitrary environments. It's a pretty opinionated interface, but it frees us up to really change things in all these different directions. Moving memory from the environment to the agent is again one of these opinionated ideas. But when we thought about what memory is, we thought about things like, okay, you can just append messages, but maybe you need to modify them; once there are so many messages, you need to compress them somehow. Or maybe you don't want to do memory that way: you want the 5 previous messages, and you want to use RAG for the earlier ones, right? And because there are so many ideas about memory, it's got to be a trainable thing. No matter how you write down memory, there are hyperparameters in there: how you truncate it, how you compress it, what you include in previous messages. Maybe you want to cut out any images in previous messages or something. So all these kinds of decisions made us realize that memory really is meant to be part of a whole system. And then another decision we made is that we frame the agents as compute graphs, stochastic compute graphs. And this is again, I shouldn't say unorthodox, but I don't think a solved problem, because a lot of people view agents as these kinds of state machines that are somehow going through multiple states.
If you do it as a compute graph, it's a forward pass, you know, from left to right; you just shoot once. There's no recursion in it. And there's also no state in it, so the way that you actually get state is that the input to the compute graph is the agent's previous state. The agent basically takes in a previous state and an observation and emits a new state and an action. And this framing is, again, not how I think people naturally feel about agents. I don't think people think about them as compute graphs that much. They like to think about them as programs or some sort of stateful operation. But this just does a lot of things for us. It allows us to do backprop over the compute graph. It allows us to efficiently execute them. It allows us to efficiently serialize and deserialize these things as compute graphs without having to think too carefully about the state. So these are very practical decisions about how you make progress on learning. The version of the paper I sent you didn't have any learning results in it, but it will have learning results showing improvement, showing basically different training strategies and generalized learning across different environments and with different agents. And so I think the proof is going to be in the pudding: what can you get done with this framing? I think we can get a lot done with it, and a lot of it is motivated by the fact that we just have to get past zero-shot. A lot of agents you see today are all zero-shot, and in the zero-shot regime, where you basically just work on the prompt and tune the hyperparameters manually, these opinions don't matter. You can build a zero-shot agent in Python; you don't need a framing, you don't need a framework. And I think some of the effort out there has been put into things like observability or traceability or developer velocity.
I think what ours is really built for is: can you train them? And that I haven't seen as much in the field. I think DSPy is the closest to what we've built, but in DSPy there's not really an environment, there's not really RL going on. Whereas we really do online RL; we do online PPO with some of our models.
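The agent/environment split Andrew describes can be sketched in miniature. Everything below (class names, method signatures, the string observations) is illustrative rather than Aviary's actual interface; the point is only the shape of the contract: memory lives in the agent's state, the environment stays untrained, and the agent is a function from (previous state, observation) to (new state, action).

```python
from dataclasses import dataclass, field

# Illustrative sketch of the agent/environment split -- these class and
# method names are assumptions, not Aviary's real API.

@dataclass
class AgentState:
    # Memory lives inside the agent's state, so truncation/compression
    # policies become trainable choices rather than environment details.
    messages: list = field(default_factory=list)

class Environment:
    """Untrained world: emits observations, executes tool calls."""
    def reset(self):
        return "observation: task description"
    def step(self, action):
        # Returns (observation, reward, done).
        return f"observation after {action}", 0.0, True

class Agent:
    """Trainable policy: maps (prev_state, observation) -> (new_state, action)."""
    def step(self, state, observation):
        new_state = AgentState(messages=state.messages + [observation])
        action = "call_tool(...)"  # in practice: an LLM tool request
        return new_state, action

env, agent = Environment(), Agent()
obs, state, done = env.reset(), AgentState(), False
while not done:
    state, action = agent.step(state, obs)
    obs, reward, done = env.step(action)
```

Because the agent's step is a pure function of its inputs, the whole rollout unrolls into one left-to-right compute graph with no hidden state, which is what makes serialization and backprop straightforward.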

Nathan Labenz (1:28:05) Okay, yeah, that's quite helpful. I think the thing then that I want to understand better is this black box gradient estimation, which is: once you've taken an agent and said, okay, we're going to iron out or uncycle the cyclic things, eliminate any structural barriers to making this thing end-to-end trainable, you still have a challenge of, well, what if I'm using Claude or o1 or whatever as part of that compute graph? And I tried, but I didn't develop a great intuition for what you are doing. The quote that I pulled out, describing black box gradient estimation, that I would love to understand is: "To obtain these estimates, we model the behavior of a language model and embedding nodes around the current configuration values with a multilayer perceptron and backpropagate through it." Explain it like I'm, maybe not 5, but, you know, a lowly grad student or

Andrew White (1:29:07) Yeah. So this work came from Sid and Albert, and they're brilliant. Albert actually did RL with Gianni De Fabritiis, who has a very good biotech company, actually. And Gianni won, I think, second place in the Unity RL challenge. These guys are, like, RL geniuses. Albert was also, I think, the first author on TorchRL. So this guy is a genius, and I actually don't think I fully understand what they did. Basically, John Schulman wrote this paper on stochastic computation graphs, which lays out the framing of them, what the nodes are, what the edges are. And we had built something like this, and we found this paper, and I'm like, okay, this paper actually describes what we built quite well. And he had written out these terms, how to backprop through stochastic nodes, which is not trivial to do. Then Albert and Sid basically decided to extend this to work over black box nodes. And this is kind of a no-free-lunch sort of thing, where you really have to build a surrogate model, right? Because you can't backprop through this black box model: we don't know what happens if we change the temperature on the input versus the output, because there's no way to trace the gradient through; we're calling Anthropic's API, we're calling OpenAI's API here. What they did is they tried to model the variance explicitly with an MLP. So what you do is, when you go to make a call through the black box model, you pick a temperature of 1, you call it, whatever, 5 times with that temperature of 1, and that gives you an estimate of what I would call the aleatoric uncertainty: if I just call it over and over again, how much noise is there in that process? And then you can basically use that noise to measure the aleatoric component.
And then the epistemic component would be: what happens if I change these things, how much does the output change? You can change these parameters a little bit and get an estimate of the change there. And then you can model this with an MLP, and the MLP is never going to be universal. Obviously, if the MLP could model the gradient everywhere, then you would have reproduced GPT-4, right? So what this MLP does is give you an estimate of the local gradients by just watching how these input changes affect the output. It's very expensive, because you're talking about running, whatever, 25 calls per input to estimate the gradient. So what you want to do is basically do a rollout: just do forward inference using the agent a few times, so you get a bunch of rollouts, and then you go do one backprop step, and you can choose which set you want to focus on. Then you can estimate these gradients, and you can try to do things like cluster them, so here's a bunch of similar inputs and we'll use the same MLP for them, and then you can do this backprop. It's very cool to see in practice, because in the paper we show you can optimize the temperature for your model. You can optimize, I think it was, a parameter in the prompt. You can optimize some hyperparameter upstream, like how many periods you have in your prompt. A very unusual way to do backprop. Now, in practice this is really more of a trick, I think, to show that it's possible. It does not lead to very good optimizers, because usually things upstream of the LLM have some optimum, like temperature 0.2 is optimum, and once you know that, there's not really much benefit in trying to change it. It's very rare for there to be correlations.
You know, it's rare that you need to change temperature at the same time as changing the compression factor in the memory. They're just not that correlated. So I think it's a really cool technical achievement, and it's cool that it shows you can do backprop through these black box models, but as far as whether the outcomes change, the outcomes are not that important.
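A toy version of the idea: treat one LLM call plus scoring as a noisy black-box function of a hyperparameter, average repeated calls to wash out the sampling noise (the aleatoric part), and perturb the hyperparameter to estimate a local gradient. This sketch uses a plain finite difference where the paper fits an MLP surrogate around the current configuration, and `black_box_score` is a made-up stand-in rather than a real API call; the true optimum is placed at temperature 0.2, echoing the example in the conversation.

```python
import random

random.seed(0)

def black_box_score(temperature, noise=0.05):
    # Stand-in for "call the API at this temperature and score the output".
    # Quadratic with an optimum at temperature = 0.2, plus sampling noise.
    return -(temperature - 0.2) ** 2 + random.gauss(0, noise)

def mean_score(temperature, n_calls=25):
    # Repeated calls average out the aleatoric noise in the black box.
    return sum(black_box_score(temperature) for _ in range(n_calls)) / n_calls

def local_gradient(temperature, eps=0.05):
    # Central finite difference: how does the averaged score change as we
    # nudge the hyperparameter? (The paper fits an MLP here instead.)
    return (mean_score(temperature + eps) - mean_score(temperature - eps)) / (2 * eps)

# One gradient-ascent step on the hyperparameter, starting far from 0.2.
t = 0.8
g = local_gradient(t)
t_new = t + 0.5 * g
print(t, g, t_new)
```

The expense Andrew mentions is visible here: a single gradient estimate already burns 50 black-box calls, which is why this works better as a demonstration than as a practical optimizer.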

Nathan Labenz (1:32:39) Okay. So this would be really powerful if you're using an open model, but not so powerful if you are using a closed model, though you can still get something out of it.

Andrew White (1:32:50) Yeah. In the results we will have on the optimizers, what we do is kind of split the models in two. We have a black box LLM, and then we have a second model which evaluates the outputs from the black box LLM. So the black box LLM generates a few outputs, and then we have a reward model, or technically a Q-learning model, that will look at these outputs and choose the one it thinks will be the best. And in that setting, you have normal backprop through this second model, which is an open or white box model; it's just a regular open-source model that we're using for the Q-learning. Then you have the black box model, and you don't need to propagate into it, because things upstream of it are not that important. But anyway, I think the strategy that we've been using most successfully is this sort of hybrid Q-learning, where you have an open-source model as the guide or the reward model for the closed-source model, which is generating the arguments to the actions, and the Q-learning one is evaluating which one is best given the situation, based on past experience.
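A minimal sketch of that hybrid setup, with stand-in functions in place of the real API calls and the real trained Q model: the closed model proposes k candidates, and the open scoring model picks which one to emit to the environment. Both function bodies here are fabricated placeholders for illustration only.

```python
import random

random.seed(1)

def closed_llm_sample(prompt, k=8):
    # Stand-in for requesting k completions from a black-box API.
    # Each fake candidate carries a fake quality number for the demo.
    return [f"{prompt} -> candidate {i} (quality {random.random():.2f})"
            for i in range(k)]

def q_model_score(candidate):
    # Stand-in for the trainable open model scoring a candidate action;
    # here it just reads the fake quality number back out of the string.
    return float(candidate.rsplit("quality ", 1)[1].rstrip(")"))

# Generate k candidates with the closed model, pick one with the open model.
candidates = closed_llm_sample("design a cloning protocol", k=8)
best = max(candidates, key=q_model_score)
print(best)
```

The appeal of the split is that all the gradient flow happens in `q_model_score`'s model, which you own, while the black box is only ever sampled.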

Nathan Labenz (1:33:50) So, I don't know if this is maybe still in progress at the time we're talking, which is a little bit ahead of when the paper actually comes out, but on some of these more frontier tasks, you pick which one you want, but you've got high-end tasks here like "design me a new protein that's a little different from this old protein and will have improved properties." How much improvement are you seeing with these sort of end-to-end trainable agents?

Andrew White (1:34:17) Yeah, so it depends on the task. We should see maybe a 5 to 10 point improvement on the PaperQA task, answering literature research questions. In CQA, I think we should see pretty high improvements, maybe from 50 to 70, so 20 points. And these kind of reflect how formulaic the tasks are in the environments. In the PaperQA environment, we're basically trying to answer arbitrary questions, basically taking multiple choice tests by "cheating" by looking at the literature. And the questions can obviously be anything, right? The question can be, consider these 5 papers and tell me which one was different. The actions that get chosen are quite hard, so it's not surprising that we don't really get a huge benefit. In CQA, there's a molecular cloning task, and we get a pretty large benefit, like this 20 points. The reason why is that the questions are quite formulaic, so a Q-learning model will be able to outperform, or doing iterative expert improvement can improve it a lot, because you're really doing a specific task with a set of actions. So I think it depends on the environment. This gets back to something like, I don't know how useful these training procedures will be in the long run. We felt helpless before we started this project. Basically, we could put in GPT-4o and make an environment, and then, okay, it doesn't work. We try changing the prompt a little bit, throw in some in-context learning examples, we change temperature, but there's really no way to gather more data and get better. So having this environment at least gives us the ability to see improvement from rollouts, improvement from gathering data. But I don't know if this is the right answer. Maybe this is just a stepping stone on the way to some better approach.
Maybe, you know, Mistral will come out with an endpoint in a few months where you put up an environment and then they will train their models on it, as long as you give them a reward model. So it's unclear to me if this is the right way, but at least we have, you know, taken our destiny into our own hands and can actually train models in these environments and make progress.

Nathan Labenz (1:36:25) And so do I understand correctly that the main driver of improvement, and this would be consistent with your philosophy of spending what it takes to get good results, is that you are doing a lot of generations, and the reward model, or the Q model as you call it, essentially the one responsible for discriminating between those generations and determining which one to actually go with, is the main driver of progress? Does that mean then that at runtime, given a new question, you still have to do a bunch of generations, and you're actually making a runtime choice?

Andrew White (1:37:03) Yeah. I mean, the Q-learning model is just an open-source model. We tried Phi, we tried Llama. I don't know which one we'll eventually put into the paper, but that model is basically going to be tuned for the specific task, though we still measure on withheld questions. When we evaluate performance, we're going to new question types, so it should be generalizable within the domain of using that environment for answering multiple choice questions. Now, if we were to take that Q-learning model and the task is now to write a literature review, that model is probably not going to be great. And this comes to the question of how engineered you want these agents and environments to be. I don't know the answer; I think it's an open question. But it's one of these things where you have to be able to write down, like, 150 tasks or 500 tasks, and the model can get good at doing those tasks. But for tasks that are not on your list of 500, it's unclear how well you'll do.

Nathan Labenz (1:37:58) Yeah. Okay. So not to get too lost in the weeds, but just to be sure I understand the way the system works: given a new question at runtime, if I'm using a black box, if I'm using o1 or whatever to do my core function of trying to answer, do I have to do multiple generations and then have the Q model pick the best?

Andrew White (1:38:22) Yeah, that's right. So usually with these models, you can pass in how many completions you want.

Nathan Labenz (1:38:28) So you can do

Andrew White (1:38:29) Yeah, exactly, you do like k equals 8 or something like that. It will generate 8 of them, and then the Q-learning model will decide which one to actually emit to the environment, because, you know, once you make a step in the environment you can't go back. Although we have experimented with tree search: we have actually done tree search, and it was a strategy to get ground truth or gold rollouts. Basically, we start with the black box model in this setup, and instead of having the Q-learning model pick one of the 8 outcomes, we try all 8, duplicating the environment 8 times. Obviously you can't keep branching by 8, because it's too high a branching factor and you run out of space. But you can pick a branching factor of 2 or something, or maybe do some sort of beam search, and you can get very deep. We used that strategy to generate trajectories that had positive rewards, because in these tasks, sometimes you don't get any positive rewards if you just start with nothing. So we have used this, basically Monte Carlo tree search: you go through these things by duplicating the environment until you get to a positive reward.
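A toy illustration of that environment-duplication trick, with an invented three-step environment, a sparse reward, and a branching factor of 2. The real environments and rewards are of course far richer; the point is just that copying the environment at each step lets the search back up and try another branch, something a single irreversible rollout cannot do.

```python
import copy

class ToyEnv:
    """Invented environment: reward only arrives after three 'good' actions."""
    def __init__(self):
        self.depth = 0
    def step(self, action):
        self.depth += 1
        # Sparse reward: only one specific action sequence pays off.
        reward = 1.0 if (action == "good" and self.depth >= 3) else 0.0
        return reward, self.depth >= 3  # (reward, done)

def search(env, trajectory=(), branching=2):
    # Depth-first over duplicated environments until a positive reward.
    for action in ["good", "bad"][:branching]:
        child = copy.deepcopy(env)  # duplicate the environment at this node
        reward, done = child.step(action)
        path = trajectory + (action,)
        if reward > 0:
            return path          # found a gold trajectory
        if not done:
            found = search(child, path, branching)
            if found:
                return found
    return None

gold = search(ToyEnv())
print(gold)
```

The recovered `gold` trajectory is exactly the kind of positive-reward rollout that can then seed training, which is the role Andrew describes for tree search here.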

Nathan Labenz (1:39:28) Yeah, okay. Cool. I personally would love to see highly engineered systems have a big place in the future. I don't know if you've ever encountered Eric Drexler's Comprehensive AI Services manuscript, but I do think there is something very attractive from a big-picture safety and control standpoint to the idea that each of these things is kind of narrow. There's something in here that doesn't generalize well, and therefore it's kind of up to us to architect the overall AI super-system that's going to run a lot of society, but we hopefully will have legibility on all these little sub-pieces. I don't know if it goes that way, though. I mean, I kind of feel like o2 might just do it all, and then we're back to the big black box. What is your expectation for whether narrow has a place, especially if you're willing to spend on the compute, or whether the great generalists rule the day?

Andrew White (1:40:26) You know, o2, I think it's going to be really hard to run it on every task in the lab. I think it's going to be quite inaccessible to run at the scale we use things like GPT-4o or even GPT-4. So because of that, I think the kind of systems we're building, where you can have some very good model generate a bunch of rollouts that are ground truth or very good, or it can generate a plan and then you have a simpler model execute the plan, I think that might actually be what has to happen eventually. I think we've gotten lucky right now that inference is actually the price that it is. We might start seeing a regime where there are very large, expensive models with a lot of test-time compute, a lot of inference-time compute, and because of that they can be used to help you train or scaffold simpler models. So I think that may actually come to pass, especially with the rumors in the Valley. I don't know how to say this, but given the rumors in the Valley, I think it's unlikely we'll see the same sort of test-time inference properties for the next models.

Nathan Labenz (1:41:37) Yeah, you look at those graphs of performance over inference time, and the inference-time scale is logarithmic. So that is another one of these nice straight lines, with the caveat that you're going through a lot of orders of magnitude over the course of that graph. And that is not going to be... I've been very impressed by how egalitarian the developers have been when it comes to doing things like making GPT-4o free, and Claude 3.5 is also free in limited quantities for people to use. Just the fact that money is not really a barrier, hasn't been a barrier, to getting access to the world's best AIs. There's been kind of an Andy Warhol thing: the president drinks Coke and you drink Coke; the president uses ChatGPT and you use ChatGPT. It does seem like that is about to break with this new

Andrew White (1:42:29) Yeah.

Nathan Labenz (1:42:30) paradigm. It's probably going to break both across people, but also for individuals or for teams: you're going to have to budget, basically.

Andrew White (1:42:41) Yeah, I think you're right. And this is something we've experienced at Future House, right? Back to the conversation about PaperQA: PaperQA costs, whatever, a dollar to use and takes 3 or 4 minutes to run. And that changes how you think about using these systems. I think long term that's why I'm still excited about Future House: the way you can orchestrate these things, where you can do 1000 things that each take 4 minutes and each cost, whatever, 50¢, and you pay $500 for this giant intellectual task. That's just a very different way of thinking about how we normally do things. So I think the way that we automate science may not be the same kind of science that humans do; we may be automating a sort of different branch of science. Some of the things we do here, for example: we run a combinatorial literature search. Let's look at every antibody or every surface receptor against every disease ever reported in papers, in this tissue, right? We can just go and do that sort of combinatorial literature search, get out a matrix of findings, and then do some, you know, light machine learning on those findings. And that kind of task, considering 10,000 or 20,000 papers at once, is not something a human is doing right now. It's not really displacing human science; we're doing a new category.
So I do think it's going to require a paradigm shift in how we think about these models. I think we're going to get a little bit further away from, say, the advanced voice mode on ChatGPT, this sort of talking with a very low-latency, very intelligent person, and toward these more engineered systems that are very high throughput: very shallow intelligence distributed at scale, or maybe one very high intelligence on a few of these things, and then you do this big map-reduce, where you look at lots of different areas and do some simple intellectual task. Anyway, I don't know what's going to happen. I'm very happy that we've got Aviary out, where we can actually train these systems completely. There's been good work on training models; there's been good work on training prompts. But to see a complete system where you can go into arbitrary environments, have complex agents, and be able to train them, I think, is very exciting for us. But let's see what the future holds.
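The combinatorial-search pattern Andrew describes can be sketched as a map step over every (receptor, disease) pair, followed by a reduce step over the resulting matrix of findings. `ask_literature` here is a placeholder for a real PaperQA-style query, and the receptor and disease lists are invented for illustration; in practice each cell would be one paid, minutes-long literature question.

```python
from itertools import product

receptors = ["CD19", "HER2", "EGFR"]          # illustrative list
diseases = ["lymphoma", "breast cancer"]      # illustrative list

def ask_literature(receptor, disease):
    # Placeholder: in practice this would be one PaperQA query costing
    # ~$0.50 and taking a few minutes; here we fabricate a stub answer.
    return {"question": f"Has anyone targeted {receptor} in {disease}?",
            "answer": "unknown"}

# The "map" step: one literature question per (receptor, disease) cell.
matrix = {(r, d): ask_literature(r, d) for r, d in product(receptors, diseases)}

# The "reduce" step: light downstream analysis over the matrix of findings.
n_queries = len(matrix)
print(n_queries)
```

The economics Andrew describes fall straight out of the shape of this loop: the number of queries is the product of the list lengths, which is how a $500 bill for one "giant intellectual task" arises.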

Nathan Labenz (1:44:46) Yeah, that's cool. Do you have any thoughts on what you're looking out for next? I guess you could answer this in terms of what needles might move. I personally always look at novel hypothesis generation as one big thing where I think we really still haven't seen the tipping point hit. Mustafa Suleyman, previously of Inflection, now at Microsoft, was just talking about something that I do think would be, maybe not so much for your use cases, but in general, a huge game changer, which would be effective long-term memory, where the thing doesn't have this sort of brittle context window, but some next paradigm for memory. So I'm just interested in general: what are the around-the-corner sorts of things that you think will be the biggest difference-makers if and when they do land?

Andrew White (1:45:41) I think, like, diversity in outputs. Have you seen Aidan's benchmark before?

Nathan Labenz (1:45:48) Aidan?

Andrew White (1:45:49) Yeah. I don't think so. So this is a benchmark where they basically ask the model to generate explanations for what caused World War II, and generate 100 of them. And you'll find that once you get past 4, it's just a restatement of the first 4 with different punctuation or different phrasing. And this is really a big difficulty with these models: you would think that you could generate 100 hypotheses and have them explore all of them simultaneously. But in fact, you can't make them generate 100 novel hypotheses. Although you think you're doing it with beam search or top-k sampling, what actually happens is just changing the punctuation or the phrasing; it's really only 1 or 2 ideas. So I do think we have a problem that these models really can't generate a lot of novel hypotheses, and they can't split their thinking that well across different viewpoints. I think that holds us back in some settings, because there's an upper limit to how many ideas you can explore simultaneously. Another problem, like what you said before, one thing I'm really watching, and I don't know if there's a needle moving or if someone can build a needle for this, is how bots can interact with the internet. The internet has gotten quite different from what it was 10 years ago. You can't just curl the web anymore. Most websites have anti-bot stuff turned on. Most websites are hostile. Reddit used to be a great source of training data, Stack Overflow used to be a great source of training data; they're both now hostile toward any sort of scraping effort. X changed to be closed. X, Twitter, used to be my favorite platform, because everything written there was public, searchable, findable, citable, and it was a public discourse.
Now it's kind of indistinguishable from Facebook, in the sense that it's a closed-off walled garden, no bots allowed. Bluesky I was very excited about, but they've already been overwhelmed by bots posting there, so I don't know what the future holds for them as a savior, a protocol accessible by both LLMs, AIs, bots, and people. So I think we're reaching this sort of, I don't know, dead internet theory, where (a) the spots that are for people are being infiltrated by bots, and (b) people who want to make bots that just do good things or do good work are being stopped by websites because they're caught in the crossfire. So I don't know what happens next there, but we already have a problem where open access papers are in a pretty sore state. We've been slowly getting better in science at making things open access, but already open access papers are no longer accessible programmatically. And even if you agree to a terms of service to download a paper, it's not accessible anymore as open access. And so I worry about whether we need to build something different from the internet for these systems to interact with, or something different from, you know, hosted PDFs. Whether we need to be more deliberate in how we think about building the world model for the next generation of AI systems.

Nathan Labenz (1:48:46) Anthropic just put all the Claude docs out there in one big text file, and that feels like the first tenth of a percent of a step in that direction.

Andrew White (1:48:54) Yeah, yeah. I don't know. I keep thinking about this problem. There's the Claude computer-use model, you can think about that, or I think Molmo was a great model for how you point and click at things, and you could use that to drive things over the internet. But, whatever, 10 years ago I'd probably be writing a Python or Perl script to just go to websites and download the stuff, I don't know, use Beautiful Soup and just grab it, and there was no problem. But nowadays you can't even get access to basic documentation without being hit by these anti-bot things. And I don't know, I would pay money. If I could give every website I want to scrape a quarter, and that offset the cost of egress, then great, I would totally pay that. But man, it is getting really hard to do anything novel and fun on the internet these days. It's a big commitment if you really want to work on the internet with these systems.

Nathan Labenz (1:49:47) Well, how about in closing: what is next for you guys at Future House? What are you looking for? You can give the pitch for any profiles that you would like to have reach out to you.

Andrew White (1:49:59) Yeah. I think Aviary is a big step for us, to have a playbook to build these agents and environments. So I'm really excited for us to put all these pieces together. I think we're going to have some very cool demos of open-ended science once you have all these systems interacting. If anybody has ideas for PaperQA, we have this API. We already work with people on projects, and we have some really cool stuff going there, but we'd love to hear from other people with their ideas. We're also open to people's ideas on strategy. There have been some great nonprofit tools built, like Semantic Scholar, which came out of the Allen Institute for AI, and they actually struggle a little bit too with how to make a free tool available and open source. Crossref is a nonprofit; they sell their services for a small amount of money to companies. I don't think anyone has figured out this way of getting intellectual services out there in the nonprofit setting, and we want to, and I don't know how we deliver that. So we're still exploring that domain as well. And we're also looking for great candidates: people who want to revolutionize the field, people who want to be automating science, we're always looking for them. And we're also looking for academic groups that want to try these things on their own ideas. Yeah, our goal is to deliver novel discoveries, and that can come from us doing it, or from us getting every scientist in the world using our tools to speed up literature search, or to speed up peer review, or to speed up designing molecular cloning protocols.

Nathan Labenz (1:51:23) Yeah. Cool. Offering those things as a service, I do think... obviously, a lot of infrastructure and a lot of maintenance love would have to go into that, but it does seem like it could be a massive unlock. It feels to me like there's something there. That platform is one that I imagine a lot of people would be really excited to build on.

Andrew White (1:51:45) Yeah. Yeah. Awesome. Thanks for having me.

Nathan Labenz (1:51:48) This has been worth the wait. Andrew White, co-founder and Head of Science at Future House, thanks for being part of The Cognitive Revolution.

Andrew White (1:51:56) Thank you.

Nathan Labenz (1:51:57) It is both energizing and enlightening to hear why people listen and learn what they value about the show. So please don't hesitate to reach out via email at tcr@turpentine.co, or you can DM me on the social media platform of your choice.
