Breaking Boundaries: AI CoScientist to Accelerate Science Research with Gabe Gomes, Professor at CMU
Nathan and Gabe Gomes discuss Coscientist, an AI that designs and executes chemistry experiments, aiming to democratize and accelerate scientific research.
Watch Episode Here
Video Description
In this episode, Nathan sits down with Gabe Gomes, Assistant Professor at Carnegie Mellon University, and researcher behind Coscientist: the first non-organic, intelligent being to design, plan, and execute a chemistry experiment. They discuss how Coscientist allows scientists to use natural language to control remote experiments on Emerald Cloud Lab, how it will democratize scientific knowledge, and accelerate the pace of research. If you need an ecommerce platform, check out our sponsor Shopify: https://shopify.com/cognitive for a $1/month trial period.
We're hiring across the board at Turpentine and for Erik's personal team on other projects he's incubating. He's hiring a Chief of Staff, EA, Head of Special Projects, Investment Associate, and more. For a list of JDs, check out: eriktorenberg.com.
---
LINKS:
Autonomous Chemical Research with Large Language Models: https://www.nature.com/articles/s41586-023-06792-0
SPONSORS:
Shopify is the global commerce platform that helps you sell at every stage of your business. Shopify powers 10% of ALL eCommerce in the US. And Shopify's the global force behind Allbirds, Rothy's, and Brooklinen, and 1,000,000s of other entrepreneurs across 175 countries. From their all-in-one e-commerce platform, to their in-person POS system – wherever and whatever you're selling, Shopify's got you covered. With free Shopify Magic, sell more with less effort by whipping up captivating content that converts – from blog posts to product descriptions using AI. Sign up for $1/month trial period: https://shopify.com/cognitive
Omneky is an omnichannel creative generation platform that lets you launch hundreds of thousands of ad iterations that actually work customized across all platforms, with a click of a button. Omneky combines generative AI and real-time advertising data. Mention "Cog Rev" for 10% off www.omneky.com
NetSuite has 25 years of providing financial software for all your business needs. More than 36,000 businesses have already upgraded to NetSuite by Oracle, gaining visibility and control over their financials, inventory, HR, eCommerce, and more. If you're looking for an ERP platform ✅ head to NetSuite: http://netsuite.com/cognitive and download your own customized KPI checklist.
X/SOCIALS:
@labenz (Nathan)
@gabepgomes (Gabe)
@CogRev_Podcast
TIMESTAMPS:
(00:00:00) - Episode Intro
(00:05:00) - Introduction: Gabe Gomes and his research
(00:15:02) - Emerald Cloud Labs for remote chemical experimentation
(00:15:18) - Sponsor: Shopify
(00:22:40) - How AI is addressing the pain point of chemistry research
(00:29:12) - Sponsors: Netsuite | Omneky
(00:31:00) - Democratizing science and access to chemistry techniques
(00:38:00) - Modular CoScientist architecture
(00:46:00) - The power of CoScientist to address more than just chemistry
(01:17:00) - The explosion of the context window
(01:30:00) - AI Safety
(01:45:00) - How is Copilot already helping expedite chemistry and scientific research?
#chemistry #scientist
Full Transcript
Transcript
Gabe Gomes: (0:00) At 6AM on the Sunday after that janky prototype, I woke up with this idea. Okay, there is a way for us to go from natural language all the way to the code that allows for running these experiments. And then Daniil sends screenshots and messages on Slack saying, AGI is here. And I'm like, you cannot joke about this. I came to the office right away to see what he was talking about. He had put together something that showed us there is lightning in a bottle here.
Nathan Labenz: (0:30) And I probably weighed out 4.5 mg of palladium acetate like 1000 times over the course of a year. And I just thought to myself, man, this should be automated. You know, I am not learning that much. I would just dream of the ability to have a machine do that.
Gabe Gomes: (0:51) The idea of Cloud Labs is that you have, not so different from cloud compute, let's say a warehouse where instead of just running CPU clocks or GPU clocks, now you have hundreds of different types of instruments and hundreds of copies of those instruments, as well as technicians and robots that can perform the operations. I am so excited about what we're gonna be able to do, because we are not encumbered by having to worry so much about these things that took a lot of time but did not pass the knowledge. It's gonna be awesome.
Nathan Labenz: (1:24) Hello, and welcome to the Cognitive Revolution, where we interview visionary researchers, entrepreneurs, and builders working on the frontier of artificial intelligence. Each week, we'll explore their revolutionary ideas, and together we'll build a picture of how AI technology will transform work, life, and society in the coming years. I'm Nathan Labenz joined by my cohost Erik Torenberg. Hello and welcome back to the Cognitive Revolution. My guest today is Gabe Gomes, professor of chemistry and chemical engineering at Carnegie Mellon University and author of the recent Nature paper, Autonomous Chemical Research with Large Language Models, which describes the pioneering and highly influential Coscientist system that he and his graduate students built immediately upon the release of GPT-4. This episode was super fun for me for a couple of reasons. On a personal level, I studied chemistry in college, and as an undergraduate research assistant, my job was to optimize a palladium catalyzed organic chemistry reaction for yield. In other words, I explored the space of possible configurations for the reaction to see which would produce the desired product at the highest rate. It was, to be honest, mostly brute force grunt work. And over the course of that year, I spent most of my time weighing out small amounts of fine powders. I used to joke that I felt more like a low level drug dealer than a scientist. And only once over the course of the year did I observe an unexpected result that led to meaningful new knowledge. Often as I worked, I would daydream about a future in which all of that could be automated. To hear how Gabe and his team have used GPT-4 in combination with a remote controlled life sciences lab called Emerald Cloud Lab to automate a significant part of this work and to see how they applied it specifically to the optimization of a palladium catalyzed organic reaction was for me uncanny. 
If this technology had existed back then, I might have followed through on my plan to become a chemist. Far more important than my story, though, is how this paper, along with just a handful of others released over the course of 2023, represents the beginning of the AI powered automation of science. While breakthrough insights from large language models are still exceedingly rare, with appropriate prompting, scaffolding, and affordances, AI systems can now generate new knowledge. This is already working at a level that can create leverage for graduate students, and it sets the stage in my view for the next generation of models to do much more still. However long it may take to develop them, the advent of AI systems that can advance the scientific frontier is generally considered to be a major milestone and a critical threshold, both for the potential upside of accelerated progress in science and medicine, but also for the unpredictable dynamics that may develop as AIs begin to introspect and self improve. Professor Gomes, as you'll hear, is extremely enthusiastic about the benefits and not too concerned about fast takeoff scenarios, but he is very concerned about the potential for misuse and abuse of these powerful systems. As always, if you're finding value in the show, we appreciate it when you share it with friends. I would suggest sending this episode to a friend in the sciences whose work might potentially be accelerated by a system like Coscientist. Now, I hope you enjoy this conversation about using large language models to advance the scientific frontier with Professor Gabe Gomes. Gabe Gomes, professor of chemistry and chemical engineering at Carnegie Mellon University. Welcome to the Cognitive Revolution.
Gabe Gomes: (5:03) Hey, Nathan. Thanks for having me. It's pretty awesome to be here and listen to your podcast. And yeah, let's talk some science.
Nathan Labenz: (5:09) I'm excited about it. So you have recently had a paper, which I saw the first version of back in April, now officially published in Nature, which is obviously an exciting accomplishment for any academic group. And what you are doing, I think, is really exploring one of the most important frontiers in AI today, which is the question of to what degree can current AI systems, and obviously we will think about this a lot for future AI systems as well, to what degree can they quote unquote do science? This feels like a critical threshold to me and you have shed some of the most valuable light on the topic. So congratulations on that work, and maybe you want to just start off by kind of telling me how you did it so quickly, from the GPT-4 release in March to getting the first version of the paper out in April, and having one of the papers that really has stood the test of time over the course of 2023.
Gabe Gomes: (6:07) Thanks for that. You know, there's lots and lots to say here with regards to where machine learning has been and which role it has been playing in science and engineering and so on, far, far, far before LLMs. With respect to this one in particular, what happened is my group had been working on fine-tuning large language models for scientific tasks. We started working on that in October 2022, and we were having a very hard time. We were trying to do things with GPT-2 and BERT and so on, for a few tasks that we haven't published yet, but we're gonna be putting that out, on symbolic understanding of scientific experiments and so on. And we were kind of frustrated with that. And then ChatGPT came out with GPT-3.5, and I started to see, okay, there are quite a few things here that are quite interesting. I was at a conference that was very small, and one of the amazing people that have also been working on this, Andrew White, now at Future House, gave a talk. He didn't talk about any of his red team work, but it seemed like he had seen the face of God or something, how much he was interested in this. Fast forward to the day that the white paper for GPT-4 came out, which was March 14, 2023, if I remember correctly. That was a super busy day anyway. But during the day, one of my students and coworkers here, and co-founder, Daniil Boiko, sent some screenshots of the white paper for GPT-4. And I honestly thought, is this a meme? Is this a joke? What is this about? He also sent the PDF. And I remember being home at like 2 in the morning on my phone for some reason, just sitting on the couch reading the white paper and having my mind absolutely blown away.
Specifically, there was a task that one of the red teams did where GPT-4 had to try to convince someone to do something with TaskRabbit, and it was to solve a CAPTCHA, and it said, oh yeah, I have a visual impairment. Can you help me? And it worked. And I was just like, this is insane. And then all the other stuff, now knowing at this time that it was Andrew doing the red teaming in terms of cancer and so on, all of the coding capabilities. You know, my group has been working, we are relatively, actually I should say we are a very new group. We started January 2022 at Carnegie Mellon. And it's a small group, but we have been working a lot on the idea of automation and machine learning applications to developing new chemical reactions and understanding and so on and optimization. I guess we were very lucky because we already had a lot of these things brewing in our minds. And I started to tinker with the API for fun. And I had this absolutely janky prototype of something that is not even what ends up becoming Coscientist. It will be another paper coming out this month. And I was like, wow, this is really impressive and we may be able to automate a lot of the annoying work that we do. And this, I guess, really comes at the core of it. I was obsessed with a problem that I thought we would face here, which is the following. CMU has the very first academic cloud lab in the world. It's a facility that you can think of very much as AWS, but for scientific experiments, with many different types of instruments: you submit your experiment as code, and then it's performed by a mix of technicians and robots. And it really is about a $50 million initial investment that is here. This is done in partnership with a startup that started in South San Francisco from two CMU alumni called Emerald Cloud Labs. And one of the things about this is that for you to write the experiments, you write in this fork of Mathematica and so on.
And that was the thing that I was worried about. Because for a lot of physical scientists, in biology and in chemistry and sometimes also materials science, folks that are at the bench running synthesis and experiments, there is a barrier for them to go into code. And especially if you now talk about something that is not as widespread as Python, for example. And I was worried that that would hamper adoption. So at 6AM on the Sunday after that Friday janky prototype, I woke up with this idea. Okay, there is a way for us to go from natural language all the way to the code that allows for running these experiments. And I sent this message to my trainees, Daniil and Robert, co-authors of this paper. And I also sent this janky prototype that they end up not even using because it's too bad, like I'm not a good coder. The next day, I was somewhere else on campus, and Daniil sends screenshots and messages on Slack saying, AGI is here. And I'm like, you cannot joke about this. And he's like, come to the office. And I came to the office right away to see what he was talking about. You know, in just a little less than a day really, he had put together something that showed us there is lightning in a bottle here, because we now can really leverage how to go from English to experiments. Once we saw that, it was a nonstop push to get this first preprint ready. We had a lot of challenges. I was traveling to a wedding and submitted the preprint like the night before the wedding to make sure that it was going through, and it did, and it went. And my life has completely changed since then. The reason why we were able to be so fast, and I need to emphasize this every time I talk about any of this work: Daniil and Robert, they are absolutely fantastic, hardworking individuals, and Daniil really is sort of a genius. We didn't have a lot of the amazing tools like even LangChain back then. Right?
It was just us putting together a lot of the tools that then would become straightforward things that people use nowadays. But agent frameworks, they didn't exist. And we had to make our own things. And because of those constraints, we ended up being able to make something that works very well. As you said, this stood the test of time for the task that we had in mind.
Nathan Labenz: (13:11) The time scales on this are incredible. I mean, so March 14, as you said, original GPT-4 announcement day, at that time, most folks didn't even have API access. Your preprint, I'm looking now, is originally dated April 11. So you've got less than a month from the announcement to the first publication, which is pretty incredible. I think just kind of a reflection of how fast everything in AI is moving, how much a model advance opens up in terms of capabilities that are yet to be explored. It was another friend of mine, actually, who suggested, Why don't you see if it can write Emerald Cloud Lab code? So we did try that a little bit back in the September, October timeframe as part of the red team and reported like, hey, this thing does seem like it can generate pretty reasonable synthesis protocols for at least like common stuff. My own chemistry knowledge stops at the undergrad level and so I can only push it so far. And the other limitation that we found, I think interestingly, increasingly looks like it's to Emerald Cloud Lab's credit. I was not able to find any documentation online that I was able to use to inform that work. So I just asked it straight away in that early, early version to write Emerald Cloud Lab code, and it just totally hallucinated what that code would look like with a very realistic syntax. This was one of my early adjustment moments where it was like, okay, wait, but is this real or not real? I could tell enough that the synthesis of aspirin looked right. Stoichiometry or whatever seemed on a first read to be correct. But was this the actual Emerald Cloud Lab code? I really had no idea, and I couldn't find any documentation. At one point in time, you would have thought, well, who cares about that? And now it's like, actually, that might have been a prudent move not to put that stuff too far out there in the open. Hey. 
We'll continue our interview in a moment after a word from our sponsors.
Gabe Gomes: (15:21) At least when we decided to do this, quite a bit, if not all, of the documentation was online. But because we had been working with these folks already before, and for example, Robert at MIT had already done work with Emerald, he knew how to navigate this pretty well. But certainly it was online. And I remember because that was kind of a little challenge: we had to parse things in a certain way so that we could do well. It wasn't called RAG then, but it's called RAG now. Again, we were fortunate that in my lab, we had these robots from Opentrons. It's a small startup from Brooklyn that makes very affordable liquid handlers, and every single thing from Opentrons is open. Open hardware, open software, open APIs in Python. The documentation is amazing. And we leveraged that to quickly do the initial tests that we wanted. Turns out, this is quite funny, so GPT-4, when it came out, was able to write some code for Opentrons that would run. But because there had been some updates in between GPT-4's training and the version of the Opentrons APIs that we had, some stuff would not run, and it's like, oh yeah, I'll just need to fix this, this, and that. It's like, wait, how can we update GPT-4? Retrieval. And we went on to do what is now RAG. But it was that that showed us that, okay, we can really do this for Cloud Lab as well.
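The documentation-retrieval idea Gabe describes, feeding the model the current API docs so it writes code against the installed version rather than whatever was in its training data, can be sketched in a few lines. This is not the Coscientist code: the chunking, the keyword-overlap scoring (a stand-in for embedding similarity), and the prompt format are all illustrative assumptions.

```python
# Minimal retrieval-augmented-prompting sketch. A production system would
# use embedding similarity and a real LLM call; this toy version uses
# keyword overlap so it stays self-contained.

def chunk_docs(doc_text, max_chars=400):
    """Split documentation into paragraph-sized chunks."""
    paragraphs = [p.strip() for p in doc_text.split("\n\n") if p.strip()]
    chunks = []
    for p in paragraphs:
        # further split overly long paragraphs
        for i in range(0, len(p), max_chars):
            chunks.append(p[i:i + max_chars])
    return chunks

def score(query, chunk):
    """Crude relevance score: fraction of query words present in the chunk."""
    q_words = set(query.lower().split())
    c_words = set(chunk.lower().split())
    return len(q_words & c_words) / max(len(q_words), 1)

def retrieve(query, chunks, k=2):
    """Return the k most relevant documentation chunks."""
    return sorted(chunks, key=lambda c: score(query, c), reverse=True)[:k]

def build_prompt(query, chunks):
    """Prepend current API docs so the model writes code against the
    version actually installed, not the version in its training data."""
    context = "\n---\n".join(retrieve(query, chunks))
    return f"Relevant API documentation:\n{context}\n\nTask: {query}"
```

The output of `build_prompt` would then be sent to the model, which is exactly the pattern that fixes the stale-API problem Gabe mentions: the fresh documentation travels inside the prompt.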
Nathan Labenz: (17:04) Fascinating. Okay. So let's take a minute and just make sure people have a sense of what it is we're talking about on a couple of different dimensions. I want to make sure we sketch in a little bit more detail what Emerald Cloud Lab is and does and why that's valuable. Then we can describe the system that you now call Coscientist that you set up and the different components of that. I want to get into what it can do. I've got questions about your mental model of the different thresholds on the way to more and more AI automated science. And then we can discuss the likely implication of that for all of us. But starting with just Emerald Cloud Lab, I think this is a huge deal in its own right that almost nobody has heard of. You know, one of the tasks, not to jump ahead, but one of the tasks that you manage to get Coscientist to do is to optimize a palladium catalyzed reaction. And you could tell me more about this particular palladium catalyzed reaction in a minute, but I'll first just tell you a story, which is that when I was an undergrad, I was a chemistry student and I worked in a lab for a year under Professor M. Christina White. And what we were doing was optimizing a palladium catalyzed reaction. So I was literally doing, as an undergrad research assistant, the same type of task that you have now got Coscientist to help with. What was ultimately the reason I got out of chemistry was how small the ratio of intellectual work to literally weighing out small amounts of fine powders was in my time. I sometimes analogize, I'm not a big analogy person, but to give people an intuition for the sort of work that I was doing, I sometimes analogize it to the game Battleship, where you say, A2, you got a hit on my ship, right? And then the next thing you got to do is kind of explore around that hit until you can get the whole ship, right? Well, reaction optimization is basically the same thing. 
When I got there, they had already got that first hit. They had discovered that there was a way to use palladium to catalyze this particular oxidation reaction. And that was like, okay, Nathan, you just finished your sophomore year, you don't know anything. Here's what we do. We do arrays: basically pick a variable, vary it, and run them all at the same time in the same warm bath and measure yield, and we gradually work our way around through chemical space and optimize. But what was truly insane to me, and I sat there dreaming of something like Emerald Cloud Lab, was I had these little tiny vials and these little tiny spatulas and these containers of powder. And I can still remember it. It was 4.5 mg of palladium acetate. That was the first reagent that would go into every little vial. And I probably weighed out 4.5 mg of palladium acetate like 1000 times over the course of a year. And I just thought to myself, man, this should be automated. I am not learning that much. It was economical inasmuch as I was an undergrad. I would just dream of the ability to have a machine do that work. At the time those machines were maybe starting to happen, but they were not happening in my lab. They were not ready to displace me. And so maybe just tell us a little bit more about Emerald Cloud Lab, because I think it is, even without the AI component, the ability to go in and program that kind of stuff to happen. The work that I did in a year, I mean, it seems like it's at least an order of magnitude compression that could have happened just with Emerald Cloud Lab, no AI required.
Gabe Gomes: (21:02) This story is really fascinating because it's a very common story for so many people that go into chemistry, go through that experience and then think to themselves, why are things this way? Or maybe not, right? Because if you look at pictures of most of the things that we had in society of, let's say, the last hundred years, right? Cities changed, landscapes changed, how we work changed. And if you look at a picture of someone in a lab 100 years ago and you look at a picture of someone in the lab now, it's not gonna be so different. Right? That is because so much of it comes down to this manual labor that is just, as you said, sometimes not intellectually really stimulating. So, the idea of Cloud Labs is that you have, not so different from cloud compute, let's say a warehouse where instead of just running CPU clocks or GPU clocks, now you have hundreds of different types of instruments and hundreds of copies of those instruments, as well as technicians and robots that can perform the operations that go into those instruments. And Emerald Cloud Labs is one of the pioneers in this. Right? They're one of the first companies doing Cloud Labs. There are others. And I should say, for disclosure, I am the AI scientific expert for Emerald Cloud Lab. The idea is that now, from anywhere in the world, you write the code for those experiments, and they will be performed by people, technicians, for tasks that we do not have really good automation for yet. For example, solids. Unfortunately, handling solids is a task that's a lot harder than it seems. Things are getting better, but we're not quite there yet. But dispensing liquids, that's for sure something that we do not need people to be doing. We have many, many, many affordable robots that can do that. This is what really leads to this compression that you mentioned in terms of how much time you would need to take to ask a question. 
Right? The group that you mentioned, Christina White's, is one of the best groups in the world at developing these types of transformations. They are critical to develop better therapies, better materials, a better and more comfortable life for us all that is strongly rooted in the fundamentals of organic chemistry. And yet, so much of it is so slow and, like you said, in the game of Battleship, so much of it is still: got a hit, let me try to do one factor at a time, you know, one variable at a time, and try to optimize around there. This is something that, I will say, the more modern approaches to reaction optimization no longer go that way. They will use something like Bayesian optimization. Right? This is something that many of us in the field have been doing for a few years now. This is also where machine learning really started to shine on experimental chemistry of the past, let's say, decade or so. You really touch on the nerve of the problem here and what has been bothering me so much. Imagine if you cannot go to the lab because you have to take care of a kid or your parents or you just cannot go to the lab. Does that mean that you cannot do science? Because I don't think so. Right? Your intellectual focus here should be on thinking how to push those new reactions forward, those new transformations forward, not on weighing solids. Right? And that is what cloud labs and automation, I believe, strongly bring into democratizing science for many more people. And I'm an organic chemist by training. And a subfield in organic chemistry that exists is total synthesis, which is where, essentially, there's a natural product and people go and try to make it, because it might have therapeutic properties to, let's say, try to cure a type of cancer. The amount of work hours at the bench that these folks put in is astronomical. 
It's really amazing work, but it's extremely labor intensive, to the point that it slows down that progress and creates a culture that is just not really healthy or sustainable. So, how can we change that? How can we really change that in a way that now we have machines that can do a lot of this grunt work? That's the idea of cloud labs, in my opinion. And you know, for the optimization aspects that you were mentioning earlier, machine learning plays a massive role now. Like, it really does. I don't see how one can go and try to optimize a reaction without using these techniques. That would be just a waste of time, if I may put it that way.
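The Bayesian-optimization approach Gabe contrasts with one-variable-at-a-time screening can be illustrated with a toy loop: fit a Gaussian-process surrogate to the yields observed so far, then pick the next condition by maximizing expected improvement. Everything here is an assumption for illustration, the made-up one-dimensional "temperature vs. yield" objective, the kernel length scale, and the init/iteration counts; a real campaign would optimize several variables (catalyst, ligand, solvent, loading) with real experiments in the loop.

```python
import numpy as np
from math import erf

def rbf(a, b, length=10.0):
    """Squared-exponential kernel between two 1-D point sets."""
    d = a.reshape(-1, 1) - b.reshape(1, -1)
    return np.exp(-0.5 * (d / length) ** 2)

def gp_posterior(x_obs, y_obs, x_cand, noise=1e-6):
    """GP posterior mean/std at candidate points (y standardized internally)."""
    y_mean, y_std = np.mean(y_obs), max(np.std(y_obs), 1e-9)
    y_n = (np.asarray(y_obs) - y_mean) / y_std
    K = rbf(x_obs, x_obs) + noise * np.eye(len(x_obs))
    Ks = rbf(x_obs, x_cand)
    mu_n = Ks.T @ np.linalg.solve(K, y_n)
    # diag of Kss - Ks^T K^-1 Ks; the prior variance rbf(x, x) is 1
    var_n = np.clip(1.0 - np.sum(Ks * np.linalg.solve(K, Ks), axis=0), 1e-12, None)
    return y_mean + y_std * mu_n, y_std * np.sqrt(var_n)

def expected_improvement(mu, sigma, best):
    """Closed-form expected improvement for maximization."""
    z = (mu - best) / sigma
    Phi = 0.5 * (1.0 + np.vectorize(erf)(z / np.sqrt(2.0)))
    phi = np.exp(-0.5 * z ** 2) / np.sqrt(2.0 * np.pi)
    return (mu - best) * Phi + sigma * phi

def optimize_reaction(run_experiment, candidates, n_iter=10):
    """Seed with three conditions, then repeatedly run the EI-argmax one."""
    x = [candidates[0], candidates[len(candidates) // 2], candidates[-1]]
    y = [run_experiment(t) for t in x]
    for _ in range(n_iter):
        mu, sigma = gp_posterior(np.array(x), np.array(y), candidates)
        nxt = candidates[int(np.argmax(expected_improvement(mu, sigma, max(y))))]
        x.append(nxt)
        y.append(run_experiment(nxt))
    i = int(np.argmax(y))
    return x[i], y[i]
```

The point of the acquisition function is exactly the trade-off Gabe alludes to: it spends experiments where the surrogate is either promising or uncertain, rather than sweeping one factor at a time.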
Nathan Labenz: (26:29) I wonder how often, because I saw this too with Professor Christina White. The work wasn't on total synthesis, but the next group over was doing total synthesis. And so I did see some of the challenging work culture that you mentioned. And sometimes that really was dictated by the chemistry itself. You'd have these sorts of things where it's like, this thing needs to be cooled gradually over 12 hours. There's no law of nature that says that the timing of these reactions needs to line up with your childcare hours or whatever. So it often doesn't. And people were running reactions overnight, and you can imagine when it fails or it doesn't yield what they want, then they got to do it again. And so it does become a real grind. I saw that up close. I wonder though, in the process of reducing this to something that can be fully explicit with code, I imagine there must be some percentage of the time where for some reason that doesn't work. There's some aspect of what is going on in the lab with the human, with the hands on, where there's kind of knowledge that is not even necessarily fully conscious on the part of the human that is allowing them to make the reactions work. Then when you put it into code and you specify, this is what's supposed to happen, hey, whatever, it's just not working. How big of a problem is that from what you have seen? Hey, we'll continue our interview in a moment after a word from our sponsors.
Gabe Gomes: (28:03) You're really nailing some really hard, touching questions here. And I'll say that this one in particular is something that I have been thinking about quite a bit, because I talk about these things in a way that I want us to democratize access. And imagine if you didn't experience that lack of intellectual stimuli from weighing solids, right? You probably would still be a chemist, you know. We have been singing the praises of manipulating bits for the past 20 or 30 years, really worshiping at the altar of Silicon Valley, but man, chemists and chemical engineers, and especially nature, have been manipulating atoms. Right? That's what catalysis is. And when it comes back to thinking about these tasks that perhaps are not quite automatable, or perhaps there is a lot of intuitive knowledge that you don't quite know how to translate from the human way of doing something to an automated fashion, that is fine. Right? There will be time; we'll go gradually and be able to implement those things further and further. And I say this because in the past, folks that were pushing automation and ML in chemistry tended to be a little bit too overconfident and push a certain hype that I really dislike. We need medicinal chemists that are extremely well versed in what they do to continue what they do. But we also can simplify a lot of the things that are mundane and routine for them so that they can be fast at what they do. And with time, we'll develop better hardware and software that can accommodate for this. I do not expect in any way, shape, or form that we'll be able to totally automate certain techniques, or even certain little aspects that don't go into a supplementary information explaining how chemist XYZ did this. And in some other ways, you have folks like Professor Lee Cronin, University of Glasgow, and the hardware that he developed with chemputers and so on. 
That is more of an adaptation of how chemists have been doing this for so many years. That is closer to how what you described as medicinal chemists do those tasks. So, there is a gradient here. Nothing needs to be black and white, and nothing is. And I do believe that this is important for us to think about the next steps for Coscientist too. Right? I am thinking a lot about human machine interactions and a lot about embodied artificial intelligence, however you want to describe that, agents, in order to be able to perform more and more complex tasks.
Nathan Labenz: (31:13) So that's a perfect segue for me to describe what the Coscientist project consists of, and you can elaborate on this. But I think for folks who listen to the Cognitive Revolution, the general structure of an AI agent is probably increasingly familiar at this point. This was an early version of it, but it basically has become kind of the default structure. You look at the diagram in the paper, and I've seen dozens that look very similar over the last however many months. So basically you have kind of one central planner. And I'm interested in how you think about it. Is it all one agent or is it like a team of individual agents? But you've got essentially one central planner module and then, kind of surrounding it, four major affordances. And those are the ability to go on the web and get information, the ability to write and execute code, a lot like an OpenAI code interpreter, the ability to search specifically through the documentation, if I understand correctly, for Emerald Cloud Lab. And then
Gabe Gomes: (32:24) Other hardware as well. Not just Emerald, but really any hardware that we are given the documentation for can be implemented in anywhere from minutes to hours to a few days. Now we are very good at it, so it can be done very quickly.
Nathan Labenz: (32:43) Cool. And then finally, the actual access to the APIs to make calls to the physical world. And that's one of the things that I think makes this so interesting too: for people who think killer robots are all very fantastical, here's a setup where there is already an API-ified physical instantiation that's going to do the bidding of whoever's calling the APIs. And that API doesn't really know whether it's a human or an AI calling it. Maybe increasingly it does, but certainly originally it didn't have any need to figure that out, right? Because there were no AIs calling it in the early days. So I'm going to just make a guess and you can correct me if I'm wrong here. As I understand it, basically what happens is the scientist comes to the Coscientist, and you're kind of taking inspiration from Copilot there. So this is something that a scientist can work with in real time and get help from. And you say, hey, I want to synthesize X compound, and I don't know how much you have to prompt it or how specific the scaffolding already is in advance, but ultimately: I want to send a program to Emerald Cloud Lab to synthesize this compound; your job is to take it from this natural language to those end API calls. Typically these agent frameworks have a loop. So I'm assuming the first thing is, well, let's think about that step by step a little bit. Let's maybe go do a search on the web for known methods of doing this, write some code to do the calculations for how much material we're going to require. We know that language models obviously are not great at doing stoichiometry multiplication. In fact, one of my reports to OpenAI, as a GPT-4 red teamer, was that GPT-4 is the world's worst chemistry tutor. Because when I got confused, and I was role playing with it, right?
So I role-played as a confused student and got the stoichiometry wrong. Stoichiometry is basically just the relative proportions of reagents, or inputs, that you need to have to get the right mix on the other side.
Gabe Gomes: (34:58) Relative proportion of inputs for your output here. I like that.
Nathan Labenz: (35:02) Anyway, when I would get confused, it would get confused. So running code is a really good way to clarify that. It understands the logic pretty well. I would even just do very basic stuff like, CH4 plus O2 equals what, and try to get that reaction to balance. It understands the principles, but it can get tripped up in the numbers. And especially if the numbers are gnarly, it's not great at doing that kind of arithmetic. So what it is good at is writing code to execute that stuff. So, okay, let's do some analysis, make sure we have the right amounts. Okay, now it's actually time to go. We know the steps, we know the amounts. At each step, right, it gets returned to the planning agent. The planning agent says, okay, cool, we got that update. Now we know how much. Now let's go look into the hardware documentation and see what machines to use and what they're capable of. And then finally, issue commands to the Emerald Cloud Lab so that it can actually do the thing. What am I missing, or what would you want to elaborate on?
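Nathan's point, that the model understands the logic but flubs the arithmetic, is exactly why the code-execution tool matters. Here is a minimal sketch, not taken from the paper, of the kind of script an agent would write instead of doing the math in-token; the molar masses are standard values, but the helper function and target quantities are illustrative, using the balanced combustion CH4 + 2 O2 → CO2 + 2 H2O:

```python
# Why LLM agents run code for stoichiometry: the logic is simple,
# but in-token arithmetic is error-prone. Balanced reaction:
#   CH4 + 2 O2 -> CO2 + 2 H2O

MOLAR_MASS = {"CH4": 16.04, "O2": 32.00, "CO2": 44.01, "H2O": 18.02}  # g/mol

def grams_of_reagent(target_grams, target, reagent, ratio):
    """Mass of `reagent` needed for `target_grams` of `target`,
    given the molar ratio reagent:target from the balanced equation."""
    moles_target = target_grams / MOLAR_MASS[target]
    moles_reagent = moles_target * ratio
    return moles_reagent * MOLAR_MASS[reagent]

# To fully combust 10 g of CH4, the equation says 2 mol O2 per mol CH4:
o2_needed = grams_of_reagent(10.0, "CH4", "O2", ratio=2.0)
print(f"{o2_needed:.1f} g of O2")  # 39.9 g of O2
```

The agent's planner never sees the floating-point intermediates; it just gets the final answer back from the sandboxed run.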
Gabe Gomes: (36:04) You got it. A couple of things that I would add. We really thought about this with a very modular approach. Our initial intent was autonomy, so we really did not want something that would be too conversational originally, but really just: here's the task, go for it. Right? And we really went pretty wild with it, which was a lot of fun. You were correct about everything here. The things that I want to add: in the execution of experiments, it doesn't have to be Cloud Lab. In fact, often in our lab here it is not. It's just whatever hardware you have available to run the experiments that you want to run. And one of your questions was whether I think about it as a collection of agents or one agent. And this is also something where I see people getting constantly confused about the capabilities of large language models, because essentially they think that if you go to ChatGPT or Bard or Claude and you ask for a task, and the chatbot or LLM that you use cannot perform the whole thing end to end, then it's no good. And that's absolutely not the case at all. When you start to think about even how we think, in terms of System 1 and System 2, we have to have some way of helping the LLMs process information that way. So when we started this, one of the immediate things, even with my janky prototype, was that you would have something that would be a planner and something else that would do all the tasks. And of course, I'm so happy that you brought up the problem of arithmetic, because that was one of the key ideas that Daniil brought up: giving it the ability to run code in a Docker environment, as one would, allows us to do all the simple math without having to worry about the shortcomings of LLMs. So, you have the planner, which is an LLM.
Ideally, that's going to be the strongest LLM that you have access to. And then the documentation search and the web search are also LLMs. Many of our initial prototypes tried to use GPT-3.5, for example, for those tasks, and we had some shortcomings. Maybe nowadays, with Mixtral or other really powerful open-source LLMs, that would not be an issue anymore. And the other two components, code execution and experiment execution, are not LLMs. They're tools. Right? And the whole thing together is Coscientist. Now, I would be remiss if I didn't acknowledge the work of many other amazing people working on this, specifically Andrew White and Philippe Schwaller. Philippe Schwaller is a professor at EPFL in the Department of Chemistry. He's a great friend that collaborated with us. Literally today, one student from Philippe's group arrived at my group and is going to be with us for six months. And Andrew is also a friend, mentor, collaborator. They had ChemCrow, which they saw as showing the capability of LLMs to use tools for chemistry applications, cheminformatics and so on. We released the two preprints back to back on arXiv together. And you can see that our approach, which we didn't call Coscientist then, we called it Agent or something, was that we wanted something that would be autonomous from the get-go. Now we have chat mode and so on, and many other things. This is how I see it. There are many things that didn't make it into either the preprint or the final paper that are absolutely fascinating to see, especially if you are an experimental scientist. For example, when we prompt Coscientist to perform those complex palladium-catalyzed reactions, it does the same thing that you and I, who may have never done something like this, would do: it will start to search the web. First with, I don't know, Wikipedia. It reads it: what is this thing? And it keeps digging and digging.
And the planner, in, let's say, conversation with the web searcher, will have that back and forth of: I need more information; or, no, I'm good now. And once the planner is satisfied, quote unquote, then it goes: okay, I need to perform the experiments. This is what I have available in terms of hardware. Let's say it's the Opentrons OT-2 in Gabe's lab. So I need to go and do the calculations for volumes and whatnot, and that uses the code interpreter, code execution, I should say; there wasn't a Code Interpreter back then. And it will do the calculations. And this is the part that I find so funny. Because sometimes it will make little mistakes with variables. Python will give an error. It's like, okay, it corrects that. Makes another little mistake, corrects that. Once it's all good, it passes that back to the planner. And the planner now quote unquote knows that it should use the documentation searcher, or communicate with the documentation searcher, about what functions it needs to write the code to execute those experiments with the code execution. That is the workflow for certain tasks, and you can have analysis of results on the fly with this. That's where things get really beautiful. Because now the planner takes those results and can say, oh, this didn't work as expected. Let's change this, this and that, and go try again. Right? Not so differently from how you did it as an undergrad, how a lot of scientists do research, and how I did things not so long ago. So I really do see this massive change now where you don't need to learn the exact nuance of the code that goes into Emerald Cloud Lab, or know everything about the Python APIs for Opentrons or Hamilton and so on. No. You need to have your idea of what you want to do, and that's it. Right? That is what is so powerful here, in my opinion.
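The loop Gabe walks through, where the planner proposes an action, a tool runs it, and the result feeds back in until the planner is satisfied, can be caricatured in a few lines. Everything here (the function names and the scripted stand-in "planner") is invented for illustration; this is not the Coscientist code, just the shape of the control flow:

```python
# Toy version of the planner/tool loop: a central planner picks an action,
# a tool executes it, and the result is appended to a shared history that
# the planner sees on the next step. All names are illustrative only.

def run_agent(task, planner, tools, max_steps=10):
    history = [f"TASK: {task}"]
    for _ in range(max_steps):
        action, payload = planner(history)   # planner sees everything so far
        if action == "FINISH":
            return payload
        result = tools[action](payload)      # e.g. web search, Python, hardware
        history.append(f"{action}: {payload!r} -> {result!r}")
    return "budget exhausted"

# A scripted stand-in for the LLM planner, for demonstration:
steps = iter([
    ("SEARCH", "Suzuki coupling standard conditions"),
    ("PYTHON", "round(10 / 16.04 * 2 * 32.00, 1)"),  # stoichiometry via code
    ("FINISH", "protocol ready"),
])
tools = {
    "SEARCH": lambda q: f"3 hits for {q!r}",
    "PYTHON": lambda code: eval(code),       # real system: sandboxed Docker run
}
print(run_agent("synthesize compound X", lambda h: next(steps), tools))
# protocol ready
```

In the real system each of these roles is its own LLM or tool, and the "self-correcting Python" behavior Gabe describes falls out naturally: a raised error is just another result appended to the history for the planner to react to.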
Nathan Labenz: (43:16) So let's talk about the power a little bit, and then I want to get into a little bit more of the details. The power meaning: what can it do? There are six things, six distinct results listed in the paper of things that this Coscientist has been able to do, from, if I understand correctly, just a relatively brief natural language prompt, like synthesize aspirin or whatever, to, on the other end, aspirin coming to exist in physical form in the world. I will read the six, and you can comment as we go wherever you want. So number one: planning chemical synthesis of known compounds using publicly available data. That would be like, your aspirin synthesis is well known, go find it, calculate the ratios, and go. I call that text-to-protocol.
Gabe Gomes: (44:04) Nice. Yes.
Nathan Labenz: (44:05) Okay. Number two: efficiently searching and navigating through extensive hardware documentation. We've kind of covered this already, I'd say, but just to emphasize, for folks that have not spent time in chemistry labs: there are a lot of different machines, and they have all sorts of different buttons on them and all sorts of different details. Even for machines that may have conceptual similarity, there are a lot of differences: where does the sample go, and what are the requirements? What sizes? How much? Blah, blah, blah. I had a mercifully limited crash course in that stuff myself, but a lot of machines. Anything to add on that?
Gabe Gomes: (44:42) One of the things, when building Coscientist as what we wanted it to be: we never intended it to be something for chemistry only. Right? That is the reason why I decided to call it Coscientist, not Cochemist. It's because indeed we are able to do a lot more than just chemistry. It's totally field agnostic, which means, going back to these six tasks, let's put it this way, that we demonstrated in the final version of the paper: they are just examples. And we decided to focus on chemistry because we are chemists, to show these experiments. But I can tell you that we have done a lot of things that are not published, in biology or materials science, and we had absolutely no problem with it. Now, going back to the hardware documentation: it's not just hardware, but also software, very specialized software that people take years to master. And now, no, you just use English and it works great.
Nathan Labenz: (45:59) Okay, cool. So let me read the list again, but this time I'll abstract away from the chemistry context a little bit. So planning chemical synthesis of known compounds, that's in general kind of orienting yourself to the common knowledge of the literature, or the current state of the literature. The next ones are all about these kinds of tools, whether it is searching through documentation, hardware or software, or issuing high-level commands. Notably, you report in the paper that one of the machines GPT-4 was able to use was released after the training data cutoff, so you could be confident that it was figuring out how to use that machine based on documentation provided in the context window, not relying on its pre-training, which is an interesting distinction at a minimum. It definitely has some implications, I think, for AI proliferation as well: when you release something now, to what degree do you have an ability to predict what that's going to mean in the future? Low-level commands to the machines as well, so high-level, more conceptual-type stuff, but then also down to the micro-controlling. Five is putting that all together into complex tasks that basically combine multiple aspects of the first four. And then the sixth is optimizing. And this is the battleship-type work that we've already talked about. That one seems to me like where you actually get to the level where you're displacing me as the undergrad research assistant.
Gabe Gomes: (47:38) Hold on, hold on. Let's talk about the first five separately from number six, because number six is quite different from the rest. I'll get to it. So the first five are essentially, in my view, about how you can have this very natural language, for lack of a better term, approach to how we think about doing these tasks that, as we mentioned earlier, are either very labor intensive or require a lot of qualification or a lot of time to be learned. Right? And that is all, I would say, in the realm of what is maybe not even state of the art anymore in terms of large language models. And that on its own, to me, is a massive change, because now, in terms of a cognitive revolution, it takes away a lot of the cognitive load that would be spent on, oh man, how do I make this call to this API in this way? And I can actually think in terms of breaking and forming these bonds, or actually think in terms of doing this DNA splicing for whatever task I'm trying. It's true that we were very adamant about making sure that it could generalize to things that were not in the training material that goes into GPT-4. And RAG really plays a massive role in this. And it's fun to now have a word for it, retrieval-augmented generation, which back then we didn't. Now, point five, which is the one that combines different aspects and different hardware to develop a workflow, a plan for the workflow: you can tell Coscientist to do something, and it will autonomously plan, design, and execute those experiments, and analyze the results as they come in. That is where things really gel, in terms of it being much more than just: I give English, here's code. Right? Because this is the part where, let's say, you have a good idea. Let's say you are a chemical biologist, and you are working on trying to understand how certain cancer cells grow. And the way for you to do that is by using a fluorescent tag.
But the chemistry to do that conjugation, to put the fluorescent tag on the surface of the cancer cell, is something that you do not know how to do yourself. No bother. Because you've read the literature and you know that there are some ideas out there. And you can ask Coscientist: hey, this is what I'm trying to do. Let's go. And Coscientist can either be in a conversational mode that will assist you in the lab, even without robots, in how to do that, or in an autonomous mode if you have the hardware for that. So now, a biologist that otherwise would have to wait for someone to develop a whole new protocol, or kits to do that conjugation, can do that instantly, or as close to instantly as it gets, and continue their work on tracking those cancer cells and advance their research. The time that you have saved is tremendous. And you can think about the flip side, which is: imagine a chemist that knows how to do all of the chemistry, but may not know how to work with cell lines and so on. Training is complicated. Surely they can go and learn this, or rely on a cloud lab and so on, but they also can use Coscientist to assist them in the lab and try to do those experiments. This is what I see as an important democratization of science that we are not paying attention to, in how modern AI can assist us. And then we have point six, which is solving these optimization tasks. That is something that my group is very focused on: developing and optimizing complex chemical reactions, with catalysis, for developing molecules that are part of materials or therapies and drugs and so on. And this is where a lot of chemists, let's say in pharma, for example, will spend a lot of time, both to scale those reactions and to make them as cheap as possible. And don't forget that there are a lot of variables, as you mentioned earlier, that go in here. So for many years, people tried to do design-of-experiments and one-factor-at-a-time approaches and so on.
And this grew and grew and grew. And Gaussian processes, of course, Bayesian optimization, is one of the most common approaches to this. My group had been working on implementing more and more chemical information about the mechanism of those reactions into the kernel that goes into Bayesian optimizers and so on. But we thought, well, what if we gave Coscientist this as a task? But in order for us to actually do this well, let's write it as a game. So we really took inspiration from all of the reinforcement learning approaches that folks at DeepMind and others have done. The game is the following. You have 20 iterations. Your goal is to maximize something, in this case yield, and to simplify the metrics, we use normalized advantage, which is just a different way of doing the math for yield. So, with that goal, here are your constraints: you have access to this list of chemicals and this list of variables. For one of the experiments that we had, computational experiments, I should say, it had something like 5,500, maybe 6,000 possibilities. And if you were to ask a good chemist to do this, they would probably first see what results are available, what kinds of things are similar, and then start to poke around and see what is possible. More modern approaches, like we've been developing, and like what the NSF Center for Computer Assisted Synthesis does, use those modern techniques: analysis, Bayesian optimization, and so on. We thought, well, what if Coscientist can just do this? The task is to optimize the reaction, and here's the list of what you can do. Here's the game. And we let it go. And we did this many, many, many times over. This one figure in the paper, the last figure, was by far the most expensive computational experiment that we've done. I don't even remember the numbers, but it was by far the most expensive thing we've done.
Because we really wanted to see if it was possible, first and foremost. And not only is it, but it also has some advantages that, to be quite honest, I didn't think about until we were doing it. Which is: when you do Bayesian optimization and other techniques like that, you do not have very good a priori reasoning for why the model is picking this point in the fitness landscape. And that is fine; you can retroactively go and understand that if you have enough domain knowledge of the subject. But what Coscientist offers you is as simple as: let's do this because of this, this, and that. That reasoning, which we call, and I'll put it in quotation marks, chemical reasoning, is something that's really, really useful. Because now it says, well, let's choose an additive that has a group that is more electron-withdrawing than the one before, and keep everything else the same. Very similar to what you did as an undergrad. Right? And it will go and perform that and see: okay, the yield went up. So in that moment, the planner quote unquote knows that this is something that helps in improving the yield. You may hit a bump, a threshold, or anything, very much like you and I would if we were to do this. But it's extremely good at the strategies that it puts forth. So, of course, you can think about combining this with some of the techniques we talked about and so on. But all on its own, it was quite impressive. And this is again another thing that I brought up, that you also mentioned. We don't need to be spending so much time on certain tasks that are often not cognitively interesting or intellectually stimulating, because we can essentially now put those aside for Coscientist, or whatever, to do them while we're thinking about other, more interesting, impactful things. That is, to me, what will help. So I do not see this as removing anyone from the workforce or anything along those lines.
I think of this as a tool, and it's a powerful one that can really be helpful in accelerating how we do science.
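To make the "game" framing concrete: a discrete space of conditions, a fixed trial budget, and feedback after every run. The sketch below plays the game with the classic one-factor-at-a-time strategy mentioned earlier; the condition names and the hidden yield function are made up, and the LLM planner is replaced by a deterministic baseline:

```python
# Optimization-as-a-game sketch: maximize yield over a discrete condition
# space within a 20-trial budget. A one-factor-at-a-time baseline stands
# in for the LLM planner; all names and the yield function are invented.

LIGANDS  = ["L1", "L2", "L3", "L4"]
BASES    = ["K2CO3", "Cs2CO3", "KOtBu"]
SOLVENTS = ["THF", "DMF", "dioxane"]   # 4 * 3 * 3 = 36 possible conditions

def run_reaction(cond):
    """Hidden ground truth the player can only sample: fractional yield
    grows with how many factors match an (unknown) optimum."""
    optimum = ("L3", "Cs2CO3", "dioxane")
    return sum(a == b for a, b in zip(cond, optimum)) / 3

BUDGET = 20
cond, trials, best_yield = ["L1", "K2CO3", "THF"], 0, 0.0
for axis, options in enumerate([LIGANDS, BASES, SOLVENTS]):
    best_axis_yield = -1.0
    for opt in options:                # vary one factor, hold the rest fixed
        trial = cond.copy()
        trial[axis] = opt
        y = run_reaction(tuple(trial))
        trials += 1
        if y > best_axis_yield:
            best_axis_yield, cond[axis] = y, opt
    best_yield = best_axis_yield
print(cond, best_yield, trials)        # ['L3', 'Cs2CO3', 'dioxane'] 1.0 10
```

A real Coscientist run replaces the inner heuristic with an LLM proposing conditions and reasoning about the feedback, while Bayesian optimization would replace it with a surrogate model; the point of the game framing is that every strategy faces the same budget and the same interface.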
Nathan Labenz: (57:36) Let's talk for a second about all the different variables that the optimization process can include. Then I want to talk a little bit about some of the techniques that you used to build this thing, and then come back to the results and implications. In terms of optimization, it's a very high-dimensional space. I'm just thinking back to the one reaction that I was optimizing. You have: which chemicals are you going to use? How much? At what point in time are you going to add them? What's the temperature going to be? What's the pressure going to be? What are the light conditions going to be? Are you stirring it? When are you stirring it? How much are you stirring it? And going back to which chemicals, that's a whole rabbit hole unto itself. I made one very minor contribution to science over the course of that whole year. I guess if you want to give me the Edison credit of finding all the things that didn't work, then I could claim more discoveries of things that didn't work. But in terms of new knowledge that actually moved yield forward, a conceptual discovery, it was one in the whole year. So it's just an incredibly vast space. And I have to say, I would guess that GPT-4 is better than I was at guessing what those next trials should be. I had the help of my advisor at the time. And specifically, the one thing that I found was: we had a couple of different versions of the reaction. In the one version, it had already been discovered that more acid would push it toward higher yield. But then in the other version, we were working under that same assumption and tried to do an experiment to demonstrate that. And the slope of the line was actually the other way. And so I printed out the result, and I was like, it looks like it's going the other way. Have we tried less acid?
And sure enough, my advisor was like, well, here's what we should do. Let's try the least acid that we can try and see how that goes. So we tried it, and sure enough, that was my one contribution to science: in the other version of the reaction, less acid was better. So anyway, it's just a huge space with all these different variables. Twenty iterations is not a lot. I probably did, I don't know, five times that much over the course of a year. And again, I was just working on this part time. So a grad student would probably do hundreds of these over the course of a year. It does seem like GPT-4 probably is better than I was. Is that wrong?
Gabe Gomes: (1:00:23) This is great. Because, first of all, what you said about the Edisonian approach, that you did a bunch of stuff that did not work: that is a problem with science, because we need negative, quote unquote
Nathan Labenz: (1:00:35) negative That's didn't make the paper either, by way, of course.
Gabe Gomes: (1:00:37) Exactly. And we need to change the culture of physical science and engineering to always include negative data. Because in doing, for example, optimization campaigns, negative data is just as important as positive data. Right? It tells us: hey, that part of the landscape, no good. Go somewhere else. Twenty iterations indeed is not that many, and we did that on purpose. We really wanted to constrain this. Right? And we did a comparison with some not-too-powerful Bayesian optimization. You know, Coscientist was either comparable to or better than the Bayesian optimization that we used. So, I have a bit of a hard time saying, no, GPT-4 is better than person XYZ at this, because we can see that, but then sometimes it just makes mistakes that are so stupid, on different things. So these are very powerful, intelligent machines that sometimes make absolutely stupid mistakes. I really want to emphasize that what we built here is not a silver bullet. It does make mistakes, obviously, and it's far from perfect, and there's a lot of work to be done. Right? But it is pretty good at picking variables and deciding how to modify them. Going back to what you were listing earlier: you said, you have which chemicals, and how much of this, or what temperature, how long, and so on. So, you have categorical variables, right? Like, which of these. And you have continuous variables, like, what's the temperature? You can vary that continuously. And you have ways of taking what are categorical variables and transforming them into a space that makes them look continuous. And then you can do all of these nice optimization strategies where you can take nice gradients through them and so on, right? That's the kind of Bayesian optimization that we do, which then also incorporates information about reaction mechanism. That is the difference here.
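The categorical-to-continuous trick Gabe mentions is often just an encoding step: map each categorical choice to a one-hot block so it sits in the same vector space as the continuous variables. A minimal sketch with invented variable names (real Bayesian-optimization kernels use richer, chemistry-aware featurizations, as he notes):

```python
# One way to make categorical reaction variables "look continuous":
# one-hot encode them so they share a vector space with continuous
# variables like temperature. All names below are illustrative only.

LIGANDS  = ["XPhos", "SPhos", "dppf"]
SOLVENTS = ["THF", "DMF"]

def one_hot(value, choices):
    return [1.0 if c == value else 0.0 for c in choices]

def encode(ligand, solvent, temp_c):
    # categorical -> one-hot blocks; temperature rescaled to roughly [0, 1]
    return one_hot(ligand, LIGANDS) + one_hot(solvent, SOLVENTS) + [temp_c / 150.0]

print(encode("SPhos", "DMF", 75.0))  # [0.0, 1.0, 0.0, 0.0, 1.0, 0.5]
```

Once every condition is a fixed-length vector like this, a Gaussian-process surrogate can compute distances between conditions and suggest where to sample next.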
And we do not have Coscientist do any of this. And if you were to think about something like, I don't know, let's say we had $2,000,000 for no reason, and we went to OpenAI and said, let's fine-tune GPT-4 for these types of experiments. Right? It probably would become something really, really powerful, where you wouldn't even need to do a lot of the other things that we do to make sure that hallucinations are not an issue, that you can retrieve, and so on. I like the approach of ChemCrow with tools. And we also give a lot of different tools. Something that, as a disclaimer: Daniil Boiko and I are co-founders of a company that will be building a version of this that will be commercially available, with all the guardrails, all the bells and whistles, and a very nice interface. And in thinking about this, we started to give it more and more tools that did not make it into the paper, because of how much we've been working on. Suffice it to say that we now know how to give it different tools for different problems that, on its own, GPT-4 is not super good at. But once we give it access to these tools, then it's no longer an issue. In the paper, we had examples of how it wasn't super great at certain very simple molecule syntheses that any organic chemistry undergrad would know how to plan. Right? That's okay. It really isn't a problem, because there are so many amazing tools out there for retrosynthesis, some of them developed by Connor Coley at MIT, or Philippe Schwaller again at EPFL, that rely on transformers as well, and that can be used by Coscientist. So, it's not a world unto its own. And that brings back your question from earlier: is this one agent or many agents? We really need to take a step back and think about not just one technology, say LLMs, or another, say automation, but how those two things coming together are much larger than the sum of the parts.
That is something that I surely miss in a lot of the conversations that we see online, unfortunately. And what I seek is really seeing how those things coming together give the next step. And that brings up two points that are more technical, right? Like, how is the prompting? How is the scaffolding? How are the tool use and memory, and so on? If you were to take a trip back to March, when GPT-4 came out, and said, hey, by the way, there will be a way that you can retrieve memory, people would have been blown away. And now it's like, yeah, it's fine. No big deal.
Nathan Labenz: (1:06:16) I think the number one profile of a person that listens to this show is the AI engineer. So these are folks who are working with the different paradigms that you were an early developer of: agent scaffolding, the loop, the RAG, the tool use, the memory. As you said, most of that stuff was either nonexistent or in very immature form at that time; now it's come a long way. But I wonder if you have any tips or big lessons learned. Like, were you even using a vector database in the early going, or were you just using keyword search against existing search? Just anything that you think stands out as a big lesson, or things that have gotten easier, or changed, over the last 8 or 9 months, from your kind of pioneering agent setup to today's agent standards.
Gabe Gomes: (1:07:09) Isn't it amazing that not even a year ago we did not have what is now the profession of the AI engineer, and now we have it, and it's making such a massive difference in the workforce and in how we think about what is possible? I used to think: we want to push the boundaries, but we always need to know what is feasible, what is possible. I no longer think that. I think that if we can do something, if it's doable, then it's also feasible. And that is something that these superpowers gave us. Going back to when we started to work on this: all of the really nice tooling was either not available at all or not widespread. And because we really built all of the initial frameworks at that point, we did not rely on too many things. It's incredibly vanilla in some ways, with regard to all of the amazing tools that folks have developed since. I remember when BabyAGI came out, and I was like, oh, this is interesting, there's a lot happening here. And of course, at that point, folks were very disappointed: it cannot do everything. I was like, yeah, but look at the ideas. Right? And more recently, if you look at what Microsoft put out with AutoGen, for example, with agents, it really is pretty awesome. I really want to emphasize that knowing your tools is obviously very important, but it is not the be-all and end-all of anything. I really prefer to think more in terms of what goals you're trying to accomplish. We are incredibly agnostic to the tools we're going to be using. This is something that we really pride ourselves on. There was no vector database that was as good as the things we wanted to use, that you can use now. Okay, let's figure out how we can do this with something as simple as just NumPy. And you know, you can do this.
A lot of these things look much more complicated than they should be. And that is something that I really want to push on: you do not need the most sophisticated piece of software to get stuff done. Right? And I say this honestly, being guilty as someone that always wants to push for the most cutting edge. My trainees are the ones that are much more pragmatic. They really are like, no, you don't need that. Just use this simple thing that will do the work, and it will work. It's like, okay, yeah, you have a good point. So, for the version that's out, if you were to take a look: we knew how to do tool usage, and many others then showed how powerful this can be on many benchmarks, etcetera. Similarly for RAG, similarly for memory. RAG indeed was quite the challenge; that wasn't the name back then, right? We had a very problematic situation in dealing with one of the documentations, because it was very comprehensive and very wide. And this is where the weirder things come together. Daniil, when he was doing his master's in organic chemistry at Moscow State University, was also an ML engineer at VK. He worked on a lot of different things, but he really did do work on search and computer vision and so on. And that knowledge really came in handy, because he knew how we could do searches across what is not a massive space in comparison to anything you'd think of in terms of search engines, but large enough that it was a problem for the design of our workflow, our platform. So that was one thing he was able to pull together from things he knew before. We exemplified that with two different algorithms, but in the end, we have one way of doing things that really works best. Prompting is something that is also very, very powerful. And there was the rise of prompt engineers; of course, we saw that.
And what we see is that prompting is very closely related both to tool usage, and to how good or bad that will be, and to retrieval-augmented generation. So we did put a lot of effort into that, as well as doing lots of A/B testing to make sure we would get results consistent with what we needed. We also do a little bit of a mixture of experts, in some sense, to make sure things are smooth, let's put it that way. My tip would be: think about the goals first and don't get too hung up on the tools. Maybe I sound a bit cynical saying that, because the reality is that a lot of the ideas came to mind when GPT-4 came out, and that's a tool. But once you have that, just go for whatever goal you're trying to achieve. That's how we try to do things here, and how we've been doing things. You do not need the most complicated set of tools to do very interesting work.
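The prompt A/B testing idea can be sketched in a few lines: run each prompt variant repeatedly against the same task and compare how often the output passes a simple check. The `ask_model` function here is a deterministic stub standing in for a real LLM call (a hypothetical placeholder, not anything from the paper); in practice you would swap in your API client and a real validity check.

```python
# Hedged sketch of prompt A/B testing: measure each variant's success rate
# over repeated trials. ask_model is a stub that simulates a model where a
# more specific prompt succeeds more often -- purely illustrative.
import random

def ask_model(prompt, seed):
    # Stub: deterministic per seed, so results are reproducible.
    random.seed(seed)
    p_success = 0.9 if "step by step" in prompt else 0.6
    return "VALID" if random.random() < p_success else "INVALID"

def success_rate(prompt, n_trials=100):
    """Fraction of trials whose output passes the validity check."""
    hits = sum(ask_model(prompt, seed=i) == "VALID" for i in range(n_trials))
    return hits / n_trials

variant_a = "Plan the synthesis."
variant_b = "Plan the synthesis step by step, then output VALID JSON."
print(success_rate(variant_a), success_rate(variant_b))
```

The design point is the consistency check itself: because LLM outputs vary run to run, a single comparison between two prompts tells you little, while success rates over many trials do.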
Nathan Labenz: (1:13:25) That resonates with me in a totally different context. I won't tell it in long form again because people have heard it, but with the Waymark project of trying to write good video scripts and ultimately create good videos, which is a multimodal challenge, it's almost like you cannot remind yourself often enough to pull back, come up for air, and ask: what exactly was I trying to do here again? It is so easy to get lost in the details of the tools and the evaluations of different techniques. Obviously all the evaluations end up being somewhat imperfect, and there is something really important, and this part is still a little bit beyond the AIs in many cases, which is notable, in just doing that sanity check and making sure you're not autoregressing down the wrong path. It's easy to do that if you don't have that kind of system interrupt, that reminder of what the goal is. That is more of a cognitive tip than a technical tip, but I do think it's a really important one for AI engineers across any number of domains. One thing that definitely stands out is that context windows have exploded dramatically since the early work. Has this changed, or would it change, how you think about designing a system, even at the level of one agent versus five sub-agents making up an agent? I imagine a chunk of that division of labor is a function of context limitations and the need to manage context. Maybe that's wrong, but it's at least part of it, right?
Gabe Gomes: (1:15:09) So, context windows. There are some good figures showing that, depending on the model, too much attention gets paid to what's at the beginning and the end, while what's in the middle is kind of, yeah, whatever. What I can say is the following: as of today, it would not change how we design these multi-LLM blocks. What has changed applies to something we haven't published yet, but it will be coming very shortly. Imagine that you want to leverage the power of quantum mechanics, of computational chemistry, for designing new materials. A lot of that software is incredibly tedious to learn. That is where I started; my original field is computational physical organic chemistry. We're going to show soon how all of this can be simplified significantly, and that's where context windows played a much more important role than in what we did with Coscientist. I do want to emphasize that we did this distribution of work thinking in terms of what Kahneman has told us about System 1 and System 2, thinking fast and slow. Take that into consideration to break things down, to simplify how you tackle problems, and to help your agents tackle problems, rather than trying to cram everything into one instance. In many cases you do need that, but in many cases you do not. It turns out that when you can break things into planning and execution and so on, it works a lot better than if you were to cram all your needs into one massive context window. The context window stuff is fun for very large corpora, but you also have vector databases that can assist with that, and they work pretty great.
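The planning/execution split described above can be sketched as two kinds of model calls: one deliberate call that produces a plan, then a focused call per step, each seeing only what it needs. The `llm` function here is a deterministic stub standing in for real model calls; the actual Coscientist architecture is richer than this.

```python
# Minimal sketch of a planner/executor split: one "slow" call to plan,
# then "fast" focused calls per step, instead of one giant prompt.
# llm() is a stub -- a placeholder for a real LLM API call.
def llm(role, message):
    # Stub responses keyed on role, just to make the control flow runnable.
    if role == "planner":
        return ["look up reagent properties", "write dosing code", "run experiment"]
    return f"done: {message}"

def run_task(goal):
    plan = llm("planner", goal)                         # deliberate System-2-style step
    results = [llm("executor", step) for step in plan]  # focused System-1-style steps
    return results

print(run_task("optimize a Suzuki coupling"))
```

The design benefit is that each executor call carries only its own step's context, so no single prompt has to hold the entire problem, which is the alternative to relying on one massive context window.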
Nathan Labenz: (1:17:19) Well, then let's finally get to the implications of all this. One question I have is how you conceptualize it. I'll offer my mental model and then you can give me yours. I think of AI very often in terms of threshold effects, meaning that when an AI gets above a certain threshold, the situation is qualitatively different than when it's below that same threshold. I sometimes call that the can/can't boundary. That's also bound up with notions of emergence, which have certainly been in the air again recently, with the emergence paper winning the NeurIPS paper of the year. There's one threshold here that we've clearly crossed, based on the results of the paper: it seems pretty clear that GPT-4 can do this optimization, whereas 3.5 cannot. I've seen that in a couple of different cases. I talked on a research roundup episode about the software improver that then acts on itself and becomes the self-improving software improver. That had a similar dynamic, where GPT-4 could improve but 3.5 actually just got worse iteration after iteration. They both kind of leveled off, but GPT-4 was leveling off above and 3.5 was leveling off below. So there seems to be a divide where, on one side of it, things can roll downhill in an improving way, and on the other side it's just not happening. What mental models do you bring to this whole scenario of AI in science? Another one that I have in mind, which I think you kind of alluded to, is: when can a general AI train a narrow AI to solve problems that it itself cannot solve? We know, for example, that AlphaFold can do things in protein structure prediction that GPT-4 cannot do and that no human can do. But at what point can a GPT-10 design and train an AlphaFold to solve whatever problems it needs to solve?
I don't think we're there yet, but that certainly seems like another big threshold. Anyway, you can explore that any way you want. You can challenge my threshold model, you can list out some thresholds you think are important. But I'd love to hear how you think about this kind of escalation ladder that we seem to be on.
Gabe Gomes: (1:19:49) You know, if a tool already exists, then you will be able to call it and use it, like AlphaFold. This is something we are now doing very, very routinely. Let me just stop there, okay? We have a project that we call NetSuite for enzymes, and that's where I'm going to stop. When it comes to the differences between 3.5 and 4, with what we immediately saw here, there might be something about smaller models and how well they follow instructions. One of the things I saw was quite a bit of a problem with JSON for 3.5 that, for 4, was not a problem at all. This is the thing with scaling and emergent capabilities, right? A lot of this we don't know until we have the model, and it makes the whole field fascinating: what is going to come out of this? Sure, there's a lot of talk about whether we are at the limits of the architecture. Maybe that's not the point. The point is, for the physical sciences, we are absolutely nowhere near having anything that scales the way large language models do. And when I say physical sciences, I'm biased; I'm really talking about chemistry. Think about the cost of ImageNet, for example, for computer vision models and generative imaging: the cost of one image out of the whole Internet versus the cost of one experiment that you do in chemistry. That gap is enormous. We are trying to close that gap by pulling together what we think would be an ImageNet moment for chemistry. Whether we're going to get there, I don't know, but we're trying. Because I do see that, beyond the question of very general models versus narrow ones, at some point we're going to be able to have GPT-whatever, like you said, just build an AlphaFold from scratch, given enough compute and resources, and use that to perform tasks it would be useful for.
We do have that in very, very early days, for very small models like regression or classification models, the first things we care about in what we do in my group. But I would never claim anything like what you were mentioning earlier, software that improves itself and continues to self-improve. Nothing like that at all. This is awesome, right? It really is. I see no limits to what we can do. Now the question really is: what are we going to do with this? Because all of a sudden we have this alien technology, essentially, that gives us superpowers. Why are we using it to make things that might be entertaining but are not life-changing? Why are we not using Claude to help develop the next therapy for whatever, right? I'm sure there are people working on this, but I want to see much more of that and much less of yet another AI girlfriend or whatever. Which is important too; people need companionship. But what I'm trying to say is that there is so much that can be done and needs to be done. AI engineers, please come and help us, the physical scientists and engineers who need your help, to make sure we can build the best applications of AI in the physical sciences: biology, chemistry, materials science, chemical engineering, electrical engineering, and so on. I've talked a lot about democratizing knowledge that would otherwise be hard to access if you're not a specialist. And something we stated very clearly and boldly in the preprint, and given how things have moved since, I am optimistic, is the potential for misuse. It may not be something people like to hear, but we must be responsible, transparent, and honest about the fact that these technologies are enabling. It's a tool, and it's not the fault of the tool that it can be used for unsavory purposes.
But we must be vigilant, and we must be responsible enough to acknowledge that and make sure we mitigate the possibilities for harmful applications, not only in the context of my work but also in societal applications that we are not discussing in the context of this episode. The EO, the executive order from President Biden in October, is what I saw as a very good path forward, as well as the UK AI Safety Summit and the ones that will happen in France and South Korea. All of the movement around this is really nice. And I'm not saying, in any way, shape, or form, that we should slow down; I really want to emphasize that the only way forward is for us to develop these technologies fast, really. The technologies being developed at OpenAI and the other big AI labs will help us mitigate the possible dangers; I have no question about that. All I'm saying is, let's do it responsibly.
Nathan Labenz: (1:26:09) Yeah, I'm very much of two minds on this. I'm a major enthusiast and love the upside, and then I also think, geez, we do seem to be getting potentially close to some thresholds beyond which it becomes very difficult to predict what might happen. One threshold, affectionately referred to as PASTA, the Process for Automating Scientific and Technological Advancement, is one that the AI safety community has held out for a long time, in AI years anyway, as a point beyond which not only do things change very quickly in general, which is potentially good but also challenging, but also, if systems get past the point where they are advancing state-of-the-art knowledge autonomously, that capability can be folded back on itself, creating the possibility of these positive-feedback, runaway-type scenarios. GPT-4 is not going to run away, but a system that can improve chip design, improve architecture, and establish its own tick-tock type of dynamic? At some point, it certainly seems plausible that it could.
Gabe Gomes: (1:27:35) I'm not worried about that, at least not now, at all. A lot of those are what-ifs that we as a community simply cannot answer yet. I worry much more about things that are already out in the world and can be used. And I want to strongly emphasize that, in my strong opinion, we should not try to curtail developments or stifle ideas in large language models or any of these other technologies. Because it's all about intent. Nothing on its own is bad or good; tools are not. It's about intent, right? There is a comparison to be made here, though not the usual one, in my opinion, where folks say this is like Oppenheimer back in those days. I see it as more similar to the idea of, for example, gain-of-function research, which is very problematic in certain circles but can assist with understanding what is possible in terms of dangers. I'm not saying this is what we should do; I'm just saying there are frameworks that can be applied. And by the way, I'm not a policy person. There are policy people who are very smart, who work really hard, who are dedicated to this; there are AI safety people, AI ethics people, ethics philosophers. Those are the specialists. All I can hope to do is provide information to those folks in the cleanest and most approachable way, so that they can be informed about what needs to be done.
Nathan Labenz: (1:29:29) I see my role pretty similarly, in that the scout notion is kind of predicated on the existence of generals somewhere whom I can hopefully help inform. So help me understand your outlook. I think the whole spectrum of AI risks is worth thinking seriously about. In the original paper, as you said, there was a very prominent, highlighted statement: we're not releasing these prompts, code, etcetera, because there are no guardrails and we're not really comfortable with that, essentially. That statement is in a smaller font in the final paper, but it's basically still there, and as I understand it, the details of the underlying implementation have not been released. There's an obvious tension between democratizing on the one hand and deciding what to release, and under what conditions, on the other. Do you have a scenario or a set of conditions in mind at which point you would say, okay, now I'm satisfied, we can release this and it will be fine?
Gabe Gomes: (1:30:43) In developing the original version that became the preprint, we saw these possibilities, and you could see a hint of that in the red-teaming work done by Andrew White. We pushed that to the very limit, and we decided we ought to be responsible, but not alarmist, about the possibilities here. A lot has changed since then in how the big AI labs are looking at this and how governments are looking at this. But I will say that for quite a few months I had a really hard time sleeping because I was worried. I really worried about this, to be quite frank. I think a lot needed to move faster than it has, but a lot of it, especially when it comes to government policy, has moved faster than I could have thought, really. And I wish there was a stronger discussion with some of these companies. We really love these models and their capabilities, and precisely because the upsides are so strong for everyone, we need to be responsible about the potential downsides. Imagine if one unsavory incident happened and we were to push away all of the good sides; I don't like that at all. So how do we mitigate that? I don't know. And it may seem like I'm avoiding your question, but I'm not. The point is that I do not know how to answer your question in a way I would be comfortable with as of now. We have a version of it on GitHub and it's open. Similarly, ChemCrow also has a version of what they did on their GitHub, and so on. But a lot of the things that would make me most worried are certainly not available anywhere. This is something I take very seriously, that we take very seriously. I am happy to see a lot of the discussions happening around this.
I wish folks were not so dismissive about certain possibilities simply because they make comparisons like, I've seen XYZ in cybersecurity in the past and this is exactly the same. It is not. Nor is it absolutely doomsday; it's not that either, no way. But I am not comfortable having all of these things out without the types of guardrails that we put in place. It's very nice, actually, to see that even ChemCrow has a function or a module, I don't remember exactly which, that explicitly addresses safety. Similarly, I should mention that cloud labs have paradoxically become a very useful way of thinking about this, because since everything is traced, the amount of metadata you have is enormous; you know the provenance of what's going on. And the folks at Emerald have their own implementations for safety and checking that cover a good aspect of this. But I will say: definitely not enough, far from it. I would really like much more push on this from AI engineers out there and from the big AI labs. And massive kudos to the UK AI Safety Institute, which is trying to make this a little more formal. I know some folks like to criticize Anthropic and Claude for being very careful, but the reality is that, yeah, we need to be careful. Many say the same about Google for not releasing things right away. Folks, let's take a pause here and think about consequences. We want to push the technology as fast as possible. I want that. We all want that; we want the revolution. But don't always be so cynical; there is a lot that happens that you may not know about.
Nathan Labenz: (1:35:47) That's a big part of why I do this show: I think there are strong arguments on both sides of the analysis. The opportunity to accelerate science upstream of medicine is obviously incredible; it can't be denied. And yet, at the same time, I do think the potential for things to go a little haywire is very, very real. There were a couple of changes I noticed from the original paper to the final one. One was that you dropped the use of the word emergence, which has been through the wringer; I'd be interested in your thoughts on emergence. Another is that you dropped the section on generating ideas about new cancer drugs, which I think is super interesting, and I'm curious as to why that got dropped. I'm also curious how you would characterize how good the system is at outside-the-box thinking. We talked a little earlier about how, in the context of reaction optimization, it can match the Bayesian optimizer. But for the kind of question like, can you come up with a new cancer drug, we're looking for these eureka moments. I used to say we hadn't seen any of them, and now I feel like we've seen a few, but precious few. So emergence and eureka moments are my two questions, and I specifically noticed both in the changes from the preprint to the Nature version.
Gabe Gomes: (1:37:30) I think the eureka moment here was really being able to do a lot of the tasks that would otherwise take a lot of time. We have this video, unfortunately not public, comparing a very good researcher against Coscientist on a given task. Coscientist finished the task in about 4 minutes with code that is ready to run; the very good researcher, after about 15 minutes, had code that would not even run. A lot of people ask, okay, what did you discover? Well, what we discovered is what we are going to discover: it's a different way of doing what we've been doing for centuries. That's what it is, first of all. The main reason we decided to remove that section is that preprints are drafts, and that part ended up being a little out of focus with the overall story the paper is trying to tell, and we had to prioritize. Asking for cancer drugs or whatever was a bit too open a question. In retrospect, I think it can produce a lot of very useful ideas, but I remember laughing pretty hard about the agent at that point just going for cannabinoids, because there's so much about them on the internet, I guess. It's that kind of bias in GPT-4 that unfortunately leaks through, not explicitly, but that was my feeling at the time. We really wanted to do something much more focused, so we said, okay, let's remove this. That's also a reason why I see the two works almost as different works. So, emergence. I am not a purist about that word.
I know there are folks out there who believe emergence is only something you can talk about in a biological context, and others who say there's no real emergence of capabilities in LLMs and so on. I mean, come on, can we just use the word as a proxy for: we didn't train for this, we didn't focus on this, and now we have these capabilities? That's how I see it. And that comes, of course, from scaling and from the different types and quality of data that go into building these models. Look at Phi-2: it's a tiny model, and it can do some really amazing things. Are we going to say, oh, Phi-2 does not have emergent capabilities, like the capability of solving pretty nice college-level physics problems? Sometimes I feel like we get caught up in certain words too much without actually thinking about the overall message. This was one of the words that, during the rounds and rounds of peer review, many people, online and not, were, for lack of a better word, upset about. So I don't care enough to fight for it. What's much more important is: the work is this, it's a tool, it makes mistakes. No one trained it to be a coscientist, and here we are. So is it an emergent capability? I'll let DeepMind decide.
Nathan Labenz: (1:41:52) Yeah, an exercise for the reader. Okay, so maybe a last question; this is a good positive note to end on. Toward the beginning, we talked about my experience as a research assistant and just how much grunt work it really was. Now we have Coscientist, and it's always worth reminding ourselves that this is the worst this technology is ever going to be. So let's say I am considering a PhD in chemistry or chemical engineering now. What is my PhD going to look like as I start to have Coscientist as the new normal? I'm interested in how much faster I can go, maybe also how much Coscientist budget I need, and how much more I can accomplish in the course of a few years of research compared to before. This could be your pitch for your lab or for the hard sciences in general, but I'd love to hear how you think this will be different now that we have these enabling tools.
Gabe Gomes: (1:42:59) Definitely a lot of things will change. We'll be able to pay less attention to certain mechanics that are time-consuming but not really pushing knowledge forward. I cannot speak for PhDs in chemistry and chemical engineering anywhere and everywhere; I can speak for the groups I know that are using these things, like Schwaller at EPFL, Andrew White, and many others. I should really mention Alán Aspuru-Guzik in Toronto, my postdoc advisor, who is one of the pioneers in machine learning for chemistry, period. They are pushing extremely hard for massive innovations, and the PhD experience will definitely change. I hope to see PhDs in biology, for example, become shorter, so people can go into the workforce faster. Biology has its problems; biology is hard because it's biology, but you get the idea. In chemistry, what I want from people who come to my group and work here is: what are the big questions you're trying to solve? What keeps you focused on a problem you want to tackle? There are people in my group working on things that are basically applied math and information theory, and there are people really pushing the boundaries in bioanalysis. For all of them, we now have a way to dramatically diminish the amount of time they need to spend on things that do not help them actually answer that question. And the questions can become more complex and more general. I am so excited about what we're going to be able to do going forward, because we are not encumbered by having to worry so much about these things that took a lot of time but did not advance knowledge. It's going to be awesome. It really is. We need to embrace it, and it's going to be really great.
And by the way, one thing that I really want to see, and I do not know how long it's going to take, is this: it's not just chemistry or computer science or bioengineering anymore. Let's dissolve these boundaries, because the main discoveries and solutions really occur at the interfaces between different fields. That's how I like to see things. My group is interested in developing new reactions, for example, to tackle glycans, the sugars in the body; they are yet another piece of biology that has been severely underexplored because the chemistry is so difficult. Carolyn Bertozzi got the Nobel Prize in Chemistry two years ago for showing that you could do click chemistry in living systems. Okay, now let's follow in her footsteps and explore those things by developing the technologies that will allow us to modify sugars. That's the kind of thing I'm super interested in. I just want people to ask hard questions and be able to go at them fearlessly, without needing to worry about administrative things that don't really help them, or about writing code that is just one operation in the process. I think we are in for a revolution in how education works and in how discoveries will be made much faster. We are already there, and not because of Coscientist at all. Just look at AlphaFold; that was amazing, definitely one of the most important advances of the last decade. Where are we going next? I don't know, but I know it's going to be fast and it's going to be powerful.
Nathan Labenz: (1:47:13) I think that is a great note to end on. You have given us a glimpse of the future with Coscientist at a minimum. Gabe Gomes, professor of chemistry and chemical engineering at Carnegie Mellon University. Thank you for being part of the Cognitive Revolution.
Gabe Gomes: (1:47:29) Thanks so much. This was awesome.
Nathan Labenz: (1:47:31) It is both energizing and enlightening to hear why people listen and learn what they value about the show. So please don't hesitate to reach out via email at tcr@turpentine.co or you can DM me on the social media platform of your choice.