Radically Better Reasoning: Elicit's Andreas Stuhlmüller & Jungwon Byun on World Models for Research

Show Notes

Andreas Stuhlmüller and Jungwon Byun return to discuss how Elicit is building trusted reasoning workflows for scientific research as frontier models grow more powerful but less transparent. They explain process supervision, domain-specific reasoning primitives, and world models that make evidence, causality, and counterfactuals more inspectable. The conversation also covers life sciences use cases, evaluating conflicting evidence, automated software engineering at Elicit, token costs, Gemini, and why legible reasoning may still beat neuralese.

Mercury: Command is Mercury’s new conversational interface, giving you natural-language access to your finances and helping you take actions within your existing permissions and approval policies. Visit https://mercury.com to learn more and apply online in minutes.

LINKS:

Sponsor:

Claude:

Claude by Anthropic is an AI collaborator that understands your workflow and helps you tackle research, writing, coding, and organization with deep context. Get started with Claude and explore Claude Pro at https://claude.ai/tcr

CHAPTERS:

(00:00) About the Episode

(03:38) Special Sponsor

(05:26) Mission and evolution

(13:56) Customer fit and markets (Part 1)

(18:19) Sponsor: Claude

(20:11) Customer fit and markets (Part 2)

(25:44) Calibrating model confidence

(31:37) Monitoring reasoning traces

(41:33) Verifying fuzzy tasks

(52:40) World model representations

(01:01:37) Planning versus experiments

(01:11:53) Automating Elicit’s work

(01:17:52) Token economics and models

(01:28:57) AI for science

(01:36:45) Improving reasoning quality

(01:41:50) Episode Outro

(01:44:57) Outro

PRODUCED BY:

https://aipodcast.ing

SOCIAL LINKS:

Website: https://www.cognitiverevolution.ai

Twitter (Podcast): https://x.com/cogrev_podcast

Twitter (Nathan): https://x.com/labenz

LinkedIn: https://linkedin.com/in/nathanlabenz/

Youtube: https://youtube.com/@CognitiveRevolutionPodcast

Apple: https://podcasts.apple.com/de/podcast/the-cognitive-revolution-ai-builders-researchers-and/id1669813431

Spotify: https://open.spotify.com/show/6yHyok3M3BjqzR0VB5MSyk

Transcript

This transcript is automatically generated; we strive for accuracy, but errors in wording or speaker identification may occur. Please verify key details when needed.

Introduction

[00:00] Hello, and welcome back to the Cognitive Revolution. Today, I'm excited to welcome back Andreas Stuhlmüller and Jungwon Byun, co-founders of Elicit, the AI platform for scientific research that's on a mission to radically improve the quality of reasoning that supports high-stakes decisions. Elicit was founded on the belief that process supervision, where models are evaluated and rewarded for the quality of their step-by-step reasoning rather than just their final answer, would improve the consistency, reliability, and legibility of AI workflows. Of course, with the rise of reasoning models, which can do much larger and more challenging tasks, but generally hide their chain of thought from users, Elicit faced a challenge: how to harness the power of frontier models while still ensuring that famously unwieldy LLMs actually do what they're supposed to do. Their answer, as you'll hear, is an interesting synthesis. By creating a DSL, or domain-specific language, that defines reasoning primitives, which they can then deliver and optimize as discrete reasoning microservices, they allow frontier reasoners to dynamically create structured workflows that are then guaranteed to run as defined. Today, they work with seven of the top 20 life sciences companies, supporting everything from the ranking of candidate drug targets to the defense of drug launch and pricing decisions for regulators and payers. And now, the frontier is shifting to external world models, structured representations which can take a variety of forms that make a model's understanding of complicated bodies of evidence as explicit and self-consistent as possible, with the goal of supporting reliable causal and counterfactual analysis. As Andreas puts it, these world models are a form of continual learning that humans or other AIs can inspect and understand.
Of course, we cover a lot of important details along the way, from the reasons that they believe that LLMs are still too easy to push around to serve as reliable decision support tools on their own, to how Elicit thinks about evaluating the source and quality of new and at times contradictory evidence, the promise of certificates of reasoning that would prove that the appropriate reasoning steps were in fact carried out as intended, and how Elicit is automating their own work with a system that they call The Line, which now delivers 30 to 50 code changes per week, along with their goal of getting this system running well enough that the company continues to make progress during the humans' year-end vacation. Plus, how much they're spending on tokens as a company and individually, where Gemini fits into their stack, and why they are optimistic that legible reasoning will win out over neuralese in the end. Andreas and Jungwon are really exceptional at making time to zoom out and consider the big picture, even as they run their company day-to-day. And their hope and bet is that if we prioritize truth-seeking now, we may be able to create a positive feedback loop in which better reasoning begets better reasoning, and that this could still happen in time to steer the singularity in a positive direction. Very few for-profit teams have been as consistent and disciplined in pursuit of their mission, despite the rapidly changing AI landscape. And so, for many reasons, I hope they are right and successful. With that, I hope you enjoy this conversation about how fluid intelligence can orchestrate trusted reasoning workflows, and why continual learning might be best instantiated outside of the model weights, with Andreas Stuhlmüller and Jungwon Byun, co-founders of Elicit.

[03:38] The Cognitive Revolution is brought to you by Mercury, the fintech that more than 300,000 ambitious companies and individuals trust to run their finances. I've wired AI into nearly every corner of my life. My email, my messages, my calendar. I even gave Mercury virtual cards to my agents with low limits and category and merchant restrictions for their autonomous use. But still, my AI's access to my financial data has remained limited. With a normal bank, I might export a bunch of statements and have my assistant process them for me. But for real-time, up-to-date information, and certainly for taking any action, trying to get your agent to use the bank via the browser is just too hard, too slow, and too error-prone to be worth it. And that's why Mercury's new conversational interface, Command, is such a big deal. It's built directly into Mercury, which means you get natural language access to your finances without exposing anything outside of your bank account. No exports, no spreadsheets, no pasting your transactions into third-party tools. I really think a lot of people are going to prefer it this way. And it can already help you take actions too, with everything bound by the permissions and approval policies that you've already set up in your account. I am genuinely impressed to see this level of AI integration in banking in 2026. And so I invite you to join me in the future. Visit mercury.com to learn more and apply online in minutes. Mercury is a fintech company, not an FDIC-insured bank. Banking services provided through Choice Financial Group and Column N.A., Members FDIC. Thank you to Mercury for supporting the Cognitive Revolution, and now on with the show.

Main Episode

[05:26] Nathan Labenz: Jungwon Byun and Andreas Stuhlmüller, co-founders of Elicit. Welcome back to the Cognitive Revolution. Thank you, Nathan. To be here. It has been two years, guys, since our last podcast. Time flies. We've had a couple of opportunities to talk offline, but two years is an eternity in the AI space. And so I'm really excited to catch up on everything you guys have built and what you've learned and the sense that you're making of the rapidly changing AI landscape and how you can put a little positive nudge into history. Maybe for starters, let's just do a quick little review of the motivating mission that you guys started with around elevating reasoning and the quality of decision making. And then with that context, we can go into a brief history of the last couple of years, because I think you guys were on this trend even before there were such things as reasoning models. And I think the reasoning models are not exactly what you had in mind, but obviously very relevant to where we are today. So kick it off with a little reminder of the motivation behind Ought and Elicit.

[06:37] Andreas Stuhlmüller: So our mission is to radically improve the quality of reasoning, especially for high-stakes decisions. And as you said, we've been at it for a while. We started as the non-profit Ought. We're working at Elicit, an AI research company now. And wow, what a decade it's been the last two years. So just a few days ago, I was going back over kind of my PhD from a long time ago in machine learning. I was like, what would it look like to redo some of that work now? And so I went to Elicit and I was like, hey, let me just upload one of my workshop papers where I know I didn't do a good job. And I redid like some of the computational experiments in Elicit. I was like, hey, do the data analysis. It was running in the cloud. And like 10 minutes later, it had redone the paper. It was mind-blowing. And so if you think back two years ago when we talked, I think we had launched paper search and summarization in Elicit. And then over the course of maybe the next year or so, we had launched systematic lit review, kind of the fixed search, summarize, screen, extract, write flow. And now we have these much more flexible research agents. So it's obviously been an incredible, incredible trend from very short tasks, short time horizons, low flexibility, to agents that do much more extensive research, long time horizons, quite a lot of flexibility. So excited to review what this all means and where it's all going.

[08:12] Nathan Labenz: Could you give me a double click into what exactly you had in mind when you were talking about how you were going to improve the quality of reasoning, how that compares to the more and more RL on top of a chain of thought paradigm that everybody now is very familiar with, and then how that has maybe enabled what you're doing with Elicit, maybe in some ways competed with it, and how it has shaped your business and product strategy to go in this more principled, a little less bitter pilled direction with what you're trying to build.

[08:48] Andreas Stuhlmüller: I like bitter pilled. Yeah, so last time we talked, we talked a lot about process supervision, right? How do you know that a system is doing good research for you? You can either look at the output and be like, that looks good to me, or you can look at the process it went through, you know, what papers did it look at, why did it look at them, how did it choose what to do? And back in the day, when we were young and naive, we all thought it would be better to know that things are correct for the right reasons, that the process was good. And so we leaned a lot into that. And in many ways, I think the justification for that has still been borne out. So let me talk about a quick anecdotal experiment we ran, I think, just two days ago, where we told a few research agents, hey, look through and analyze about 100 papers on toxicology risk for a particular type of cancer drug. And then, so we tell that to Claude, we tell that to ChatGPT, tell that to Elicit, and then we ask, hey, how many papers did you actually analyze? And then as the models like to do, they're like, you know, that's a fair and important question to ask. Let me be direct. I did not analyze 100 papers. You're right to push back. I didn't do it. And that, I think, is like a failure of process in many ways. Because you're like, I told you what to do, you didn't do it. And so I think the reason for that is because the models are not trained on process. The models are trained to produce outputs where, if you didn't check and you just looked at the result that said, hey, here's my analysis, you wouldn't have caught that. I think the fundamental problem still exists. And then the question is like, what do we do about it? And to what extent does the solution look like better checking of the outputs versus checking of the process in various ways? I actually think this question is still open. I can speak for what Elicit does and maybe briefly for what the rest of the ecosystem does.
Elicit addresses this by, we have like a little kind of domain-specific language that the research agent can write that orchestrates other calls to agents. Screen all these papers, then extract data from all these papers, and you run the process, you know the process does what it said it was going to do. I think actually the model companies are going a little bit in that direction too. I haven't been following it extremely closely, but I think Anthropic recently launched like a workflows feature that has similar properties. And so even though I think on some level, yeah, all the models are outcome trained and that's why we see these like quite ridiculous artifacts of models being like, oh, sorry, I didn't do it. I think the problem still exists and like one level up from that, people are trying to patch it. That's where I see ourselves in this game.
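A minimal Python sketch of the idea described above, under stated assumptions: a workflow is declared as an explicit list of steps and executed by ordinary code, so the count of items each step processed comes from the run itself rather than from a model's self-report. The names `Step`, `RunLog`, and `run_workflow`, and the `screen` and `extract` stand-ins for model calls, are hypothetical illustrations and not Elicit's actual DSL.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Step:
    name: str
    # Returns the (possibly updated) item, or None to drop it from the run.
    fn: Callable[[dict], Optional[dict]]


@dataclass
class RunLog:
    # Number of items each step actually processed, recorded by the runner.
    counts: dict = field(default_factory=dict)


def run_workflow(steps, items):
    """Apply every step, in order, to every item that survived the previous step."""
    log = RunLog()
    current = list(items)
    for step in steps:
        log.counts[step.name] = len(current)
        results = [step.fn(item) for item in current]
        current = [r for r in results if r is not None]
    return current, log


# Hypothetical stand-ins for calls to a model or sub-agent.
def screen(paper):
    return paper if "toxicology" in paper["abstract"] else None


def extract(paper):
    return {**paper, "n_words": len(paper["abstract"].split())}


papers = [
    {"id": i, "abstract": "toxicology study" if i % 2 == 0 else "unrelated"}
    for i in range(100)
]
kept, log = run_workflow([Step("screen", screen), Step("extract", extract)], papers)
print(log.counts)  # {'screen': 100, 'extract': 50}
```

The point of the design is that "how many papers did you actually analyze?" is answered by `log.counts`, which the runner writes, not by asking the model.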

[11:30] Jungwon Byun: Yeah, I think it's worth emphasizing that a little bit more because it's a core part of how Elicit is built differently and therefore what you can use Elicit for. Like you said, it's been two years since we chatted. A lot has happened in that time. A big part of what we've been working on for the last year is just rebuilding Elicit on top of this much more agentic infrastructure. We started working on that in about March or April of 2025. In retrospect, it was maybe a little bit early, but at the time it felt quite late because obviously a lot of advances happened over the last two, three quarters. But a big part of what we spent the last year thinking about was designing how do we preserve these benefits of transparency, systematicity at scale without losing the flexibility and raw power of these models? I think that's the core design question we've always wrestled with because when you want to deploy these models at scale for really high stakes decisions, you need them to behave in a certain way, which is often contrary to their kind of fuzzy nature. But you don't want to be overly deterministic because then you run into the bitter lesson issue, right? So like threading the needle is what we're always struggling with. So a lot of our time last year was spent on this kind of technical design question, and we decided to design our own programming language to solve this problem so that the models could run reasoning computation, be able to call these reasoning primitives at scale in a more trustworthy way. And what we were trying to accomplish for our end users was the ability to say, you can run this process with the model over 10,000 objects, 10,000 documents, 10,000 drugs, 10,000 targets, genes, whatever, and the same process will be applied to number 5 as number 9,999. And there's just no other model that can make that guarantee.
And unfortunately, there are lots of models that claim that they have done that or can do that and are just completely wrong about it. So when we think about who are our users, how do we build a differentiated product that really meets their needs, what use cases can we enable? We are powering people who want to be able to rigorously synthesize evidence at a very large scale and get every single thing right all the way to the nth degree. That's a very different interaction than just riffing with a model. And so that's, and we support some of these lightweight use cases too, but I think there it's almost more like getting to parity. And I think where Elicit's really differentiated is that trust at scale.

[13:56] Nathan Labenz: Yeah, I should always remind myself and be clear that none of these things are really true binaries in the sense that it would be wrong to characterize Elicit as being not bitter pilled because I remember last time one of your big principles was how can we allow you to spend more money to buy more compute to get better results? And that's almost a restatement of the bitter lesson in some way. And at the same time, the frontier companies, as you said, are doing some of this. And I assume that in their research agents, in particular their deep research products, they are presumably doing some sort of at least rubric-based reward on the final reports that the models are outputting, albeit, if I understand correctly, still at least intending to avoid putting any optimization pressure on the chain of thought. So there's always little aspects of gray to this. So who are the customers now? There are these deep research things. I use those pretty frequently. Who is the sweet spot that is like, a deep research agent isn't enough for me. I really want to go way bigger, way more systematic, be very sure that I'm performing the same analysis in a way that I can count on. Who are those customers now that you're finding product market fit with?

[15:21] Jungwon Byun: Yeah, I think of it as maybe a funnel. So the largest group of users are academics, individuals, people like yourselves who value the kind of deep research function. They generally want a fast but robust synthesis of the literature or of the evidence. And where Elicit really differentiates there is it's a bit more technical by default. So it's not really assuming a lay audience as much. And most importantly, everything is well cited. So by default, you can get the models to cite things if you really push them to. And then sometimes they realize they were referencing sources that didn't exist, which is unfortunate. So they still make those mistakes. Elicit doesn't do that. Elicit just assumes every single claim needs one or more citations from vetted databases and data sources in the academic literature. So a lot of people still find value in just that kind of core offering. Then there are academics and researchers who, like you said, want to get a lot more systematic. They want to do a systematic literature review where they're really thoughtful about what evidence should I be looking at? What information out there is relevant to my research? What should I be excluding? How do I apply the same process to all the papers, extract detailed data from charts and figures, and then synthesize all that and weight the findings by quality or do a landscaping of the market or the research in a similar way? And then increasingly, one of the other things we've really been developing over the last year is our life sciences motion and playbook. So we work now formally with, I think, seven of the top 20 life sciences companies. And we work pretty much across the entire development life cycle. I would say our biggest concentration is in early stage research.
So working with discovery biologists, some of those toxicologists that Andreas mentioned, as well as researchers in the kind of late and post clinical stages and commercial and medical teams. The early stage researchers are often the ones that have a little bit more iterative processes. They're like, I have this idea for an experiment. I'm in immunology. I want to figure out a way to repress these kind of rogue T cells. What's the right mechanism to do that? And then getting ideas, finding other experiments that have been run, trying to reproduce those experiments. They also are often systematic. So they're the ones that want to apply a reasoning process over thousands of genes and targets and do a tournament style ranking. And then on the commercial and medical side, at this point, often you've had validated phase two, phase three trials. They're starting to think about what's the launch strategy of a drug. Exactly which populations do we go after? Who pays for this drug? What are the other kind of alternatives out there? How much more compelling or cost effective is our product? They have to be very evidence-based in every claim that they make and in justifying to regulators or payers exactly why this drug makes a difference in the world. So that's where within life sciences, we're seeing a lot of pull.
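The conversation does not say how Elicit implements "tournament style ranking", so the following is only one simple reading of the phrase: a round-robin in which every pair of candidates is compared once and candidates are ordered by number of wins. The `prefer` function is a hypothetical stand-in for whatever judgment (for example a model call applying a fixed rubric) decides each pairing, and the target names and scores are made up for illustration.

```python
from itertools import combinations


def round_robin_rank(candidates, prefer):
    """Rank candidates by pairwise wins; prefer(a, b) is True when a beats b."""
    wins = {c: 0 for c in candidates}
    for a, b in combinations(candidates, 2):
        winner = a if prefer(a, b) else b
        wins[winner] += 1
    # sorted() is stable, so ties keep their original input order.
    return sorted(candidates, key=lambda c: wins[c], reverse=True)


# Made-up scores standing in for a per-pair reasoning process.
scores = {"TARGET_A": 0.6, "TARGET_B": 0.2, "TARGET_C": 0.9}
ranking = round_robin_rank(list(scores), lambda a, b: scores[a] > scores[b])
print(ranking)  # ['TARGET_C', 'TARGET_A', 'TARGET_B']
```

A full round-robin needs a comparison for every pair, which grows quadratically; for thousands of targets a bracket or sampled-pair scheme would likely be needed instead.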

[18:19] Claude: Claude by Anthropic is an AI collaborator that understands your workflow and helps you tackle research, writing, coding, and organization with deep context. Get started with Claude and explore Claude Pro at https://claude.ai/tcr

Main Episode

[20:12] Nathan Labenz: Something really top of mind for me that you kind of touched on there is being evidence-based, and kind of the definition of that, both at the early stage literature review, but especially as you get later, when you start to think about markets and that sort of thing. It strikes me that perhaps in many cases, the best information is not necessarily in an academic paper, or at least if I was naively thinking about approaching such a problem, I would think I would want to cast a pretty wide net. I also experienced this a little bit myself. Fortunately, my son, as regular listeners know, is pretty much cured from his cancer and we're getting back to normal life and all that's great. And I didn't have to go down the path too deeply to really try to get a synthesis of what the best second line treatment would be for him. But when I was contemplating that and talking to doctors about it, I found there was a very narrow box drawn around what kind of information would be considered. Our oncologist, who I think quite highly of and have a good relationship with, will say there's no evidence for that, for things where I'm like, there's definitely some evidence, right? It's just not this sort of gold standard RCT, peer reviewed, yada, yada, yada. And obviously that stuff is great where you can get it, but all this is a build up to ask, and it may be different for different kinds of analyses: How do you think about how wide the net should be as you go out and look for evidence? When is it appropriate to truly limit yourself to published academic work? When is it appropriate to consider blog posts or even insightful tweets that somebody has put out? How do you guys think about that? Does it vary by use case? How do you think about weighting these things and calibrating to the level of credibility that they should have?
That seems to me an opportunity for honestly just major improvement even at the clinical level because there's so much blind eye turning it seems to me today to evidence just because it's not right at the very highest level of quality or credibility.

[22:25] Jungwon Byun: Yeah, that's right. I think the core question here is how do you differentiate and discriminate against evidence based on evidence quality? And how do you identify when a certain level of evidence or quality is relevant or the best thing for your decision? And so even within published literature, there's a gradation, right? There are more- Important to remember, yes. Yeah, there are high impact journals, there are well-regarded authors and institutions. And so this is a question that we've wrestled with for a long time because the way humans have solved it is there are these very lossy proxies: citation counts, journal impact factor, I know the guy, right? And those are rough approximations, but deeply imperfect. And there are so many examples of groundbreaking research that was not published in the top journal, like one of the foundational CRISPR papers, for example, was actually published in a tier-two or lower-tier journal. And I think the promise of language models is that yes, we can continue to use these heuristics that have been helpful, but also we can think a little bit more from first principles about the quality of work and its appropriateness for the decision at hand. So that's what Elicit has always really tried to do, and not just blindly rely on citations, but give the researcher the chance to say, okay, for this research project, I'm going to look at case studies, or case studies are not good enough in my research domain because there are plenty of RCTs, or I'm gonna look at studies with a sample size of 10, or I'm just gonna look for a sample size of 1,000 or more, because depending on whether you're working in rare disease or oncology, the shape of the research that's available, like you said, is different, right? So I think one principle we have is like, how do we look at the actual methodology and the content of the research, not just at metadata, to determine quality.
How do we equip the researcher to express their own judgment and expertise based on what they know about the domain and what is specific for their project? Because I do think it is specific to the domain. As we've built out the research agent, we have started to bring in a lot of other data sources, web sources like company filings and things like that, because like you said, there's a lot of information that's not just in published papers. Publication takes a long time, so a lot of recent things are not going to be there. And research is very interdisciplinary and holistic. So when you make a scientific decision, you want to consider the science, but you also want to consider the kind of commercial implications of that as well, or the policy implications. So we have introduced a lot of data sources, and I think the agent still has a relatively fuzzy understanding of which sources are higher quality. But I think we want to kind of structure that a bit more. And again, the user has a lot of control in how they can express what they believe quality means. But I think we do want to also be a bit more opinionated for researchers that are not experts. And then the last thing we're thinking about is how do we express the confidence level of claims? So not just at a source level, say this is a high quality paper, this is not, this is a high quality blog post, this is not, but when Elicit synthesizes all this information for you, how can we say this treatment works, that claim, how well supported is that claim? Is it you don't need to worry about it, like 99% likely to be true, or is it like there's conflicting results in the evidence? So we've been thinking a lot more about breaking insights down into claims and then giving people kind of confidence levels so that they know what to do with that. I think that's another important part. It's not binary. It's like, how can I use it? That's the part that's really important.

[25:44] Nathan Labenz: How good are models at that today? I always think back to the original GPT-4 model card where they showed the base model calibration being pretty good, where when asked to express confidence in claims, when it said 20% likely, it was actually pretty close to 20% of the time that it was correct. And honestly, to a pretty remarkable degree, the base model was reasonably well calibrated. I understand that pretty much just came out of pre-training as an emergent property. But then with reinforcement learning, you have mode collapse, and I don't want to overstate the importance of mode collapse, which I do think was like one wave of AI denialism for a minute there. But there was some form of mode collapse where all of a sudden you're like getting much less well calibrated self-assessments from the model in terms of how confident it was that the claims were accurate. Obviously since then we've had a tremendous amount of additional RL applied on top of models. I haven't seen those kind of calibration studies as recently. I could believe that maybe they've gotten better because maybe that's part of a rubric that's being used. I could imagine it's even gotten worse because just left to its own devices, RL maybe takes you the other direction. What do you guys see in terms of how good they are natively? Can you prompt your way to good performance on these sort of confidence questions? Do you have to fine tune? If they're not that great, do you want the frontier models to get better? Would you advise them on how they should think about getting better at that? Tell me everything.
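The notion of calibration in the question above (claims given 20% confidence should turn out true about 20% of the time) can be checked with a few lines of Python. This is a generic sketch, not the method used in any model card: `calibration_table` groups stated probabilities into bins and compares the mean stated probability with the observed frequency, and `expected_calibration_error` summarizes the gap. The forecasts are made-up data for illustration.

```python
from collections import defaultdict


def calibration_table(forecasts, n_bins=5):
    """forecasts: iterable of (stated_probability, outcome_was_true) pairs.

    Returns one row per non-empty bin: (mean stated probability,
    observed frequency of true outcomes, number of forecasts).
    """
    bins = defaultdict(list)
    for p, outcome in forecasts:
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, outcome))
    rows = []
    for idx in sorted(bins):
        ps = [p for p, _ in bins[idx]]
        hits = [1 if o else 0 for _, o in bins[idx]]
        rows.append((sum(ps) / len(ps), sum(hits) / len(hits), len(ps)))
    return rows


def expected_calibration_error(rows):
    """Size-weighted average gap between stated and observed frequencies."""
    total = sum(n for _, _, n in rows)
    return sum(n * abs(mean_p - freq) for mean_p, freq, n in rows) / total


# Made-up forecasts: well calibrated at 20% and 80%.
good = [(0.2, i < 2) for i in range(10)] + [(0.8, i < 8) for i in range(10)]
# Made-up forecasts: says 90% but is right only half the time.
overconfident = [(0.9, i < 5) for i in range(10)]

ece_good = expected_calibration_error(calibration_table(good))
ece_over = expected_calibration_error(calibration_table(overconfident))
print(round(ece_good, 3), round(ece_over, 3))
```

A perfectly calibrated set of forecasts scores near zero; the overconfident set scores about 0.4, the gap between 90% stated and 50% observed.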

[27:32] Andreas Stuhlmüller: Yeah. So first, I think the token probabilities are just useless now, so we can pretty much ignore them. And so when we talk about calibration, mostly we have to talk about the verbalized calibration, right? If you ask the model, how confident are you in this thing? Are you confident? And I think that is probably much better now than if we compare the, what was it, GPT-3, GPT-4 token probs against the verbalized probabilities now. I would rather take the verbalized probabilities, to be honest. I think those will probably do a better job at capturing, kind of for complex situations, what the assigned probabilities should be. I still think the models right now are easy to push around, and that is one of the main things we are trying to address with better scaffolding. Right now, often you ask the model, hey, how likely is it that this kind of trial will fail, say? And the model is going to say something like 30%. And then you're like, but have you considered... you can almost say anything here. You can say, have you considered that on average clinical trials like fail X percent of the time? And it'll be like, oh no, you're right. Actually, you know, this probably should have been higher. Or have you considered that in this particular case, actually this molecule has like pretty little research behind it? And it'll be like, oh yeah, you're right. And the models are getting better at it. But I think fundamentally right now, these probabilities are very unstable compared to when you talk to an expert, you can throw these things at them and mostly they'll be like, yeah, I've considered it. That doesn't really change my view here. So a big question is like, how do you get kind of stable probabilities out of the models? Which isn't to say they should never update. They should obviously, if you present truly novel information to them, they should update.
But right now, it does not feel like they have a coherent world model behind a lot of the probabilities that they express to you.

[29:24] Nathan Labenz: When you say they're easy to push around, like how easy are they exactly? Are we talking like even seemingly semantically equivalent rephrasings get big differences in outputs, or is it a sort of more conceptual difference in framing that is required, or you throw in a red herring fact and it can move the needle? Just how blowing in the wind are they?

[29:52] Andreas Stuhlmüller: It's hard to characterize. I think maybe part of the issue is it's hard to predict what will push them around exactly. And I think if I could give a better answer to this question, I would also have an easier time fixing it. I don't think it's like extremely systematic right now.

[30:12] Jungwon Byun: But maybe that's partially what we've been trying to invest a lot more in, and I think the next phase of our role as humans, and what humans have to do, is work through these really gnarly evaluation problems. I very much believe in the next few years, generation will more or less be a solved problem. Historically, humans have done all the generation. They've done all the work. You open up a blank document, you put your thoughts in, you start from nothing, you create. And that's just not how work is going to happen anymore. Increasingly over time, AI is going to take the first pass at everything. And then the work left to humans will be around evaluating that. Was it the right thing to do? Was it done well? Can I use it for this use case? Can I trust it? And I think people are worried about kind of job loss, but I do think there's a major job transition opportunity or skill transition opportunity where now what we need to do is think about what does good look like and what are these failure modes that AI systems can run into? How do we start documenting them? How do we start codifying best practices, getting good examples to point models at? How do we start building better verifiers? There's actually a lot that needs to happen because we can't even articulate for ourselves what good looks like or why this is a bad thing or how often this happens. So I feel like there's a lot of work around that, and certainly we are trying to make that transition as a company. I think there needs to be like more infrastructure built around evaluation and articulating what good looks like so that we can point these models towards them.

[31:38] Nathan Labenz: I want to ask how that interacts with chain of thought. This is very top of mind, I keep talking about it, but not too long ago I attended this event in San Francisco called Recursive, which was all about the potentially soon coming phenomenon of recursive self-improvement. A sort of scary, striking takeaway from that event and all the conversations and presentations there was that we're really heavily, and I would say problematically, indexed to chain of thought monitoring. It's chain of thought monitoring kind of all the way down in terms of the plan for how we're gonna keep a recursive self-improvement process on the rails. And so there's this very strong sense that we've gotta maintain freedom for the models in the chain of thought so that we can monitor it. Because if we do apply pressure, per the obfuscated reward hacking paper, we'll drive the bad behavior underground and then we'll be doubly worse off, because we'll still potentially get the bad behavior and we won't see that it's thinking about it. How do we square that? I guess the naive answer would be maybe you just say you can think whatever you want in the chain of thought, but I still want a systematic account, a systematic reasoning trace, in the actual final output. And then I just reward that and maybe that works. But maybe you have a more nuanced view on how companies should think about where and what kind and how much pressure to apply to the reasoning process that builds up to the final answers that models give.

[33:22] Jungwon Byun: I can give a high level take and then I'm sure Andreas will have a more technical response. One of the projects we worked on at Ought that, like many Ought projects, was a bit early for its time was that we probably built the first language model observability tool. We built this thing called ICE. It stood for Interactive Composition Explorer. When was this? It must have been 2021 or so. And yeah, we anticipated this problem even five years ago: that at some point language model traces would get so large, we would not be able to debug them for ourselves. How would we visualize that? How would we maintain oversight? How might we audit that? And so we built this visualization. And one of the things we learned was that the best way to troubleshoot is not necessarily to chronologically go through all the steps that the model took. Repeating the generation process is one way to check what happened and to build trust in it, but it's inefficient. It doesn't really scale, and it's a very difficult way to check. And so what you often need is a different layer of reasoning checks, almost like logical consistency checks, which are not "let me go through everything you did, steps 1, 2, 3, 4, 5, and see if it was correct," but rather, for example, a sensitivity analysis: how sensitive are my findings to different changes in input parameters? Logical consistency checks, things like that. And so that's where I think we need a lot more investment in infrastructure and in building independent checks that don't rely just on chain of thought monitoring. That's my high level take, but yeah, I'm sure you have a more technical version of that answer.
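
As a rough illustration of the sensitivity analysis mentioned above, here is a minimal sketch in Python. The toy model `expected_benefit`, its parameter names, and all numbers are hypothetical and chosen only for illustration; this is not how ICE or Elicit implement such checks.

```python
from typing import Callable, Dict


def sensitivity(model: Callable[[Dict[str, float]], float],
                baseline: Dict[str, float],
                rel_change: float = 0.10) -> Dict[str, float]:
    """Bump each input by rel_change (one at a time) and record the output change."""
    base_out = model(baseline)
    deltas: Dict[str, float] = {}
    for name, value in baseline.items():
        perturbed = dict(baseline)              # copy so the baseline stays intact
        perturbed[name] = value * (1 + rel_change)
        deltas[name] = model(perturbed) - base_out
    return deltas


# Hypothetical toy model of a finding (assumption, not from the episode).
def expected_benefit(p: Dict[str, float]) -> float:
    return p["effect_size"] * p["adherence"] - p["side_effect_cost"]


baseline = {"effect_size": 2.0, "adherence": 0.5, "side_effect_cost": 0.4}
deltas = sensitivity(expected_benefit, baseline)

# A finding is fragile if a small input change flips the sign of the conclusion.
flips = {
    name: (expected_benefit(baseline) > 0)
    != (expected_benefit(baseline) + delta > 0)
    for name, delta in deltas.items()
}
```

The point of the design is that the check never replays the model's reasoning steps; it only probes how the conclusion responds to changed inputs.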

[34:58] Andreas Stuhlmüller: Yeah, let me maybe first restate part of what you said in different language. So I think you can either check the process or you can check the outcome. Those are your two options, right? And then when you're checking the outcome, you still want the outcome to somehow contain a certificate that the right reasoning was done. And what can that certificate be, right? It can be like, here's how my conclusion would change if I had a different input. Here are the things I looked at on the way. Here are citations to the literature. I do think this is a very underdeveloped field in my mind. In mathematics, it's very developed, right? You can have your formal proof if you want, and that proof is checkable. I don't think it's very developed in more fuzzy domains, probably partially because it would just be too much work for humans to produce legible certificates, and humans don't even really have great ability to introspect on their own thoughts. But I think in principle, even if you didn't supervise the process, you could produce certificates of reasoning that then let you check the reasoning, even if you didn't check the process that generated that outcome. So I'd be very excited to see more work in that direction. The other clarification I wanted to make is, for me, it's worth distinguishing chain of thought and the reasoning process, or the chain of process or whatever you want to call it. Because with chain of thought, people often think about the thoughts the model writes down, the reasoning tokens. And then, how much can you trust them? I don't know. I think for OpenAI, we don't even get them anymore these days unless maybe you apply for a special permit to see them. But you do see the tool calls. And I think the tool calls actually are important reasoning facts in and of themselves.
Take, for example, let's say I download a new paper from arXiv, the model hasn't seen it before, and I ask the model, hey, what are the key results here? Now I can see which parts of the paper the model is reading, because it has a read tool that maybe reads, by default, the first few lines, and then it can scan other parts of the paper. And so sometimes I know the model didn't even look at the methodology section. And that is a part of its reasoning that is checkable. If it now tells me something about the conclusions of the paper where it really should have relied on the methodology section, I know it didn't do that. That's obviously a very simple example. These papers are small and so on, but the same thing applies at much larger scale, where the process still remains checkable in important ways, because we do see the tool calls and the tool calls are an important input into the model's reasoning.
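
The paper-reading example above can be sketched as a small coverage check over a tool-call log. This is a minimal sketch under stated assumptions: the `ReadCall` record, the section line ranges, and the logged calls are all hypothetical, and real agent frameworks log tool calls in their own formats.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class ReadCall:
    """One hypothetical read-tool call: the line range the model asked to see."""
    start_line: int  # inclusive
    end_line: int    # inclusive


def sections_not_read(sections: Dict[str, Tuple[int, int]],
                      calls: List[ReadCall]) -> List[str]:
    """Return the names of sections that no read call overlapped."""
    unread = []
    for name, (sec_start, sec_end) in sections.items():
        touched = any(c.start_line <= sec_end and c.end_line >= sec_start
                      for c in calls)
        if not touched:
            unread.append(name)
    return unread


# Hypothetical paper layout and tool-call log (illustrative numbers only).
sections = {
    "abstract": (1, 20),
    "methodology": (21, 120),
    "results": (121, 200),
    "conclusion": (201, 230),
}
calls = [ReadCall(1, 20), ReadCall(121, 230)]

unread = sections_not_read(sections, calls)
```

With this log, the check reports that the methodology section was never opened, so any claim that depends on the methodology can be flagged without reading the model's chain of thought.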

[37:48] Nathan Labenz: Yeah, I like that. How much of this do you think companies are doing today? It's got to be some, but obviously they're not telling us. There's a spectrum from, on the one end, closing in on the sort of thinking that you're doing and the granularity of process supervision that you would like to see, to, on the other end, purely RLVR with just a binary signal: did you get the right answer in the final answer box or not? How much of this do you think they are doing, based on model behavior?

[38:22] Andreas Stuhlmüller: These are obviously speculations. I think they do a lot. I think we know they do a lot of rubrics on the final output, even reasoning-adjacent rubrics. Did the model produce something that looked like expert reasoning in many ways? I don't think they do a lot of evaluation of the process. Again, I could be wrong, but I think there is a really interesting question, which is: if you had all the details of the process, let's say you are inside a lab, and you ask yourself, ex ante, would you expect this process to lead to the right answer? Not after the fact, checking it against the answer, was it correct? But did it follow the right sort of process? If you're making a forecast, for example, let's say you don't even see what forecast the model ends up with, you just look at what things it considered, what data sources it considered, what it wrote about the different hypotheses it considered. I think there's a really interesting project of thinking about to what extent the models are following a process you should expect to be good. And in my limited knowledge of what's going on, I don't think much work is being invested in that.

[39:35] Nathan Labenz: So the, I guess one implication of that for your work would be you might expect the next generations of the model to eat less of your scaffolding than it might eat other types of scaffolding, right? This kind of general pattern of like people build out scaffolding to compensate for the model's weaknesses, then the model companies take that feedback and train on it and the next generation, you know, you could clean out or eliminate a decent amount of that scaffolding. It seems like you think that maybe on the agentic side of go accomplish this project, get over these humps, whatever, you would expect more of that kind of scaffolding to be eaten in the next version than the sort of trust and reasoning scaffolding that you're building.

[40:26] Andreas Stuhlmüller: Yeah, these things are hard to predict. I do think one way to divide up the types of questions that people ask the models is this. There are questions where, once you see the output, it's easy to check, like proof-of-existence type questions. Can you write a program that passes some tests, for example, or replicates the behavior of an existing program? And then there are questions where you can't do that, where, for example, let's say you want to know, is there any paper in the literature that refutes this claim? And again, getting back to the certificates point from earlier, the model can't exhibit to you, hey, here's a short explanation for why there is no such paper. The proof is in the process in a way. The model has to prove to you, hey, look, I followed this strategy such that if there was a paper that refuted this claim, I would have found it. And here's the proof that I did that. And I expect those sorts of questions to remain a little bit more tricky for the models for a while, but we'll see. We'll see.
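
One common way to get partial evidence that "if such a paper existed, I would have found it" is to measure a search strategy's recall on a set of papers already known to be relevant. The sketch below assumes that technique; it is not described in the episode, and the paper IDs are hypothetical placeholders.

```python
from typing import Iterable


def recall_on_known_set(retrieved_ids: Iterable[str],
                        known_relevant_ids: Iterable[str]) -> float:
    """Fraction of known-relevant papers that the search strategy retrieved."""
    known = set(known_relevant_ids)
    if not known:
        raise ValueError("need at least one known-relevant paper")
    return len(known & set(retrieved_ids)) / len(known)


# Hypothetical IDs: papers the search returned vs. papers we already know matter.
retrieved = ["p1", "p2", "p3", "p5", "p8"]
known_relevant = ["p2", "p3", "p4", "p8"]

recall = recall_on_known_set(retrieved, known_relevant)
```

High recall on a held-out set does not prove that no refuting paper exists, but it turns "trust my search" into a claim about the process that a reviewer can check.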

[41:34] Nathan Labenz: Yeah, that reminds me of something that you recently wrote around Elicit's focus being on becoming the best at reducing these big, fuzzy, hard to verify tasks to sets, or graphs perhaps, of easy to verify tasks. You've alluded to it there, but I'd love to hear a little bit more about how you're going about that and also how complete you think that process can be. Everywhere I look these days, I feel like I see this same shape of a really interesting question that I don't know what to make of, which is basically: how do we get high level guarantees or conclusions or insights from low level steps? In biology, I might be able to say I've got all these proteins, or these genes are being expressed at this level, but do I know what's going to happen next at the cell or the tissue or the organism level? No, I don't. And similarly, it seems with these big judgment calls, should I prioritize this drug or that drug, we can break it down and get systematic. But it's not clear to me how close it gets to something where I'm like, yes, okay, I can really buy in and trust that. Or is there something that sits above all those steps still? Is it emergent, or is it just somehow lost, or are people doing some sort of metacognitive work that's hard to capture that is maybe still very critical to actually being effective at these tasks? So I guess, again, tell me everything, but I'm really struggling a lot with this. Formal methods, formal verification, is another area where I see this, where we can make all these low-level guarantees that this error or that error can't happen, but is the system itself going to behave how we wanted it to? I still don't know in a lot of cases how we make that leap.
So very interested in your take on these sort of, I usually think of it as like laddering up low-level things to high-level conclusions where you're approaching it from the other direction, which is interesting unto itself. Yeah, again, tell me everything.

[43:51] Andreas Stuhlmüller: Maybe, I guess, to quickly restate: why do we want to reduce hard to verify tasks to easy to verify tasks? It's because AI currently can be trained on easy to verify tasks. We know it's extremely good at RLVR coding, tough math, tasks like this. And it's, I would say, quite weak at a lot of fuzzy tasks. I notice it all the time when I try to use the models to help me plan our company strategy, for example. I think they're surprisingly useless. Even though they have access to all my context, and they're really quite good at being like, let me pull in the data, let me pull in your email, your Slack, I still find that they don't get it. In an important way, this is related to what we said earlier about how they're too easy to push around. It doesn't feel like they're building up a coherent model of what's going on. And I think an important reason for that is that it's a hard to check task. Then, okay, what to do? I think it depends a little bit on what your situation is: whether you're trying to create a reward signal that you can train the models on, or whether you're trying to do verification and checking for the purpose of understanding whether an already trained model can be trusted in a situation, or how to refine its behavior. I think if you're trying to create a reward signal, that's pretty rough, because the models are going to optimize pretty hard against your signal. And so it's not enough to do spot checks and be like, hey, here are some cases where we can verify that, for example, your company strategy was, I don't know, incompatible with some claim you made earlier. Whereas the goal may be to take an already trained model and understand how good it is exactly. Did it make some fairly obvious mistakes? Can I find places where it can improve?
I think then your reward signal, or your way of getting some easy to verify aspects of the hard to verify task, doesn't need to be bulletproof. You can make more stepwise progress, I would think. And so our situation is that we are not currently trying to train a foundation model from scratch or even post-train a model on this particular aspect. Our situation is more like, how do we get to the point where we can check many important properties of tasks? Are the claims the model makes internally consistent? Is it the case that if we broke it down in different ways, it would end up at the same conclusions, and so on? I think that is a fairly tractable project. And the project of how we fully reduce a high level task like company strategy into individual components that are all formally verifiable is, I think, a much rougher prospect. That isn't to say it's impossible, but you have less of an incremental feedback signal that you're on the right track there, I would say.
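
The "break it down in different ways and see whether it ends up at the same conclusion" check can be sketched as a simple agreement rate. This is a minimal sketch: the conclusion labels and the 0.9 review threshold are hypothetical, and in practice each entry would come from a separate decomposition run of a model.

```python
from collections import Counter
from typing import List, Tuple


def agreement(conclusions: List[str]) -> Tuple[str, float]:
    """Return the most common conclusion and the fraction of runs that reached it."""
    if not conclusions:
        raise ValueError("need at least one run")
    top, count = Counter(conclusions).most_common(1)[0]
    return top, count / len(conclusions)


# Hypothetical final conclusions from four independent decompositions of one task.
runs = ["prioritize_A", "prioritize_A", "prioritize_B", "prioritize_A"]

top, rate = agreement(runs)
needs_review = rate < 0.9  # hypothetical threshold for sending to a human
```

Disagreement here does not say which run is right; it only marks the conclusion as unstable, which fits the diagnostic use described above better than use as a training reward.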

[46:51] Nathan Labenz: So tell me a little bit more about what you're doing in practice. You said you're not trying to post-train a model. I know in the past there was a decent amount of fine tuning though, at least for specific tasks. I'm curious if there's still a fine tuning element to it. And then there are a lot of different ways you could think about spending a lot of tokens to try to get at this. You could run the decomposition process multiple times and check for consistency, which I think you're suggesting might be going on. You could do a more iterative thing where you get the AI to give you an output and then have some kind of specialist prompts, or perhaps even specialist fine-tuned models, come in and assess it in various ways, give it feedback, and then let it reason some more and try to improve on what it just did. And we do see some of that stuff. I just talked to some OpenAI forward deployed engineers who are basically using that process to climb the hill on filing accuracy, and that seems to be going quite well for them. So what techniques are you finding to be most effective in practice today?

[47:59] Andreas Stuhlmüller: Yeah, so first, we still do a bit of fine tuning. I think at this point it's more of a technique to make things have reasonable efficiency properties at scale than something to get the models to new behaviors that you couldn't otherwise elicit. Otherwise, we do a lot of the things that you pointed out. Let me maybe talk about one of them that we've been investing more in lately, which is what you could call world models, or knowledge representations that make the model's work more checkable. So what is the motivation? The motivation is maybe similar to the medical case you had. I had a friend who also had cancer, and it was a case where I then used Elicit to get a lot of the raw data for it. So I ran the systematic literature review flow for a few versions of the question, how do you address this particular type of cancer? And even after filtering down all the information, I did end up with a ton of papers. After filtering for just the most relevant papers, maybe it was still 5,000 papers or so. And then the question is, what do you do with that, right? You could try to somehow throw it all into a million-token context window, but I don't think it would actually work that well, and I think the model wouldn't be that good at coherently reasoning about it. And so then the question is, what else can you do?
I think one thing that people have tried, I don't know if you're familiar with it, Karpathy termed it LLM Wiki, I think, or something like that, is that you build this knowledge base as sort of an Obsidian-like knowledge base or a folder of markdown files, where you tell your model, hey, we're researching this topic, organize the information in a way that makes sense and do these iterative updates to it. Maybe it's a GitHub repo and the model gets to add new nodes to it, gets to move information from one file to another, gets to propagate information. And so I think that is a really interesting direction, because if you think about how we get to models that coherently answer complex questions, well, either it happens in the weights of the model or it happens in some explicit representation. And then what is the explicit representation? I think text files are appealing. It's a nice start. It's also very flexible. But then you ask yourself, what properties do you want this representation to have such that it actually helps with the research questions you have? And once you think about that, you're like, well, I want it to let me make predictions about what's going to happen in my case. I want it to help me talk about interventions. If I took this particular drug or pursued this particular type of chemotherapy over different time periods, what would happen? If I instead followed it with some immune therapy, what would happen? If I had done this different thing in the past, what would have happened? So those are questions about predictions, counterfactuals, interventions. Very soon you're like, wait a minute, that is a thing people have studied in the past. That sounds a lot like graphical models or like structured processing models. And so the question that we've been thinking about internally is, how do you get the best of both worlds?
You want the flexibility of language models that can reason about and transform these representations, but you also want to be able to answer these sorts of prediction, intervention, counterfactual questions that you care about. You want the answers to these questions to be internally consistent, such that it's not the case that one answer and another answer just don't make sense relative to some underlying coherent representation. So this is all to say we're thinking about how you can build the, I'm currently calling it world model, but I'm not sure if that's the best name for it. How can you build these representations that let you answer the classes of questions that I just talked about in a way that is internally coherent, and evolve them over time as you add papers one by one, adding information about different symptoms, different treatment strategies, and so on, and get them to be internally coherent? That's one big direction that we're thinking about for how to spend really enormous amounts of compute to deal with large amounts of data and turn them into something that can actually answer people's questions for projects that are much larger than a single query.
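
To make the prediction and intervention questions above concrete, here is a minimal sketch of an explicit representation that answers both from the same set of equations. It is a toy deterministic causal model with hypothetical variables and coefficients, not Elicit's world model format; true counterfactuals would additionally require noise terms inferred from what was observed.

```python
from typing import Callable, Dict, Optional

# Each variable is computed from earlier variables; dict order is the causal order.
Equations = Dict[str, Callable[[Dict[str, float]], float]]


def simulate(equations: Equations,
             do: Optional[Dict[str, float]] = None) -> Dict[str, float]:
    """Evaluate the model; variables listed in `do` are forced to the given value."""
    do = do or {}
    values: Dict[str, float] = {}
    for name, fn in equations.items():
        values[name] = do[name] if name in do else fn(values)
    return values


# Hypothetical toy mechanism (assumed coefficients, for illustration only).
toy: Equations = {
    "dose": lambda v: 1.0,
    "drug_level": lambda v: 0.8 * v["dose"],
    "tumor_growth": lambda v: 1.0 - 0.5 * v["drug_level"],
}

predicted = simulate(toy)                      # prediction under the default dose
intervened = simulate(toy, do={"dose": 2.0})   # intervention: force a higher dose
```

Because both answers come from one underlying structure, they cannot contradict each other, which is the internal consistency property described above.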

[52:40] Nathan Labenz: Yeah, that's cool. I am very familiar with the wiki line of thinking. And I have used it myself for creating a little wiki that my personal agents use to navigate my life, basically, on top of all the raw data. I first did monthly summaries, right? I exported everything. And then on a month by month basis, it was a few hundred thousand tokens a month. So I would condense that to a monthly summary. Then there was a layer that condensed to an annual summary. And then I came across that and I was like, okay, let's make a wiki. So we've got articles about all these things. A fun thing that I'm starting to do with it, it's a party trick, I guess, and I just did it last night, was I told a friend who I hadn't seen in a minute that, oh, there's an article about you in my personal wiki. I'll have my agent send it to you and you can read it. I haven't read it, actually. I don't know exactly what it says, but you can read it and tell me if you see any hallucinations or anything that you would object to in there. For me though, it's basically just trying to get down pretty black and white facts and help the model navigate those. So the nodes are pretty obvious and the edges are also just simple links, like this person I know from this organization and what have you. I haven't pushed it nearly as far into ideas or a structured decision making aid. So I'm curious about how that looks. I guess the other image that's coming to mind as I'm trying to conceive of this a little bit better is the Anthropic work on tracing large language model thoughts, where they have these graphs from tokens to outputs. And I'm always one to remember that the graphs that they present are somewhat clean, but there's a huge residual term all over the place on those graphs, which again recalls my earlier question on how much can you decompose? How much residual is there?
But tell me a little bit more about what these world models end up looking like in terms of... What are the edges is maybe one good way to start. Is it X causes Y with X percent probabilities attached to it? Or how do you think about describing those relationships in a way that balances, like capturing as much of the structure as you can, but also recognizing that there is going to be some residual. I'm really interested in how you're navigating that.

[55:01] Andreas Stuhlmüller: Yeah. We are trying to be not very prescriptive about it to the extent that we can. So I think there are cases where nodes and arrows are the right, or are a helpful, representation. In the cancer example, if you're trying to understand what the mechanism is, I think often it's just very useful to be like, well, I don't know, antibody and antigen need to bind, and then once they bind, in the cell a particular type of substance needs to be released in a particular way, and there is just a sequence of events that needs to happen. And I think in those cases we want the model to build these sorts of graph-like representations. But I don't think that's always the case. If you think about company planning, I'm thinking about what products Elicit should ship over the next few quarters and how that affects our revenue and user numbers and so on. I don't actually think I want it to be represented as nodes and arrows. I think that's more like I have a spreadsheet with maybe features, and then I have, I don't know, user numbers and margins and so on over time. And so that's a very different type of model. And I think often in the real world, no single model captures what's going on in its entirety. So one lens to look at what Elicit is doing is this spreadsheet of user numbers over time, but then maybe another lens to look at it is as, I don't know, the tech tree of Elicit the product and how we build it up over time. And those are kind of complementary. And ideally, those can both live in your knowledge wiki. And you can say, hey, language model, as you're trying to make predictions or help us evaluate plans, look at all of these representations. And then the challenge is, how do you get it to be the case that the model knows how these different representations relate when it makes a prediction?
It shouldn't be the case that sometimes it looks at the spreadsheet and it's like, well, it's going to look like this, and other times it looks at your tech tree and it's going to say, oh, okay, we're going to do this, and they're just two totally separate things. So there's some propagation of information that needs to happen between those different representations. So this is an ongoing research project. I definitely don't want to claim we have solved this problem. It's an ongoing research project within Elicit, and hopefully eventually within the world at large, which is: how do you make these more explicit, legible, fairly heterogeneous representations of knowledge that models can work on over time and improve? I guess, as one last thought here, one way to think about it is maybe as: how do you make progress on continual learning in a way that is not just stuff that lives in the weights of the language model, but is available to humans as a representation we can inspect and understand?

[57:47] Nathan Labenz: Have you found any particular data structures, particularly if they're open source and something I could also incorporate into my own personal AI infrastructure that work well? I'm just doing very simple markdown wiki as of now. Is there a next level that I should be considering? Graph database or I have no idea what it would be, but.

[58:10] Andreas Stuhlmüller: I think step one is just to use the representations people already find useful. And SQL databases are a pretty useful thing. So when I'm currently building my model of Elicit the company, step one is to ingest a lot of the information from Mixpanel and Attio and various other types of systems and put it into a representation that the model can then operate on. And a lot of those representations are just SQL tables.
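
As a small illustration of "a lot of those representations are just SQL tables," here is a sketch using Python's built-in `sqlite3` module. The table name, columns, and numbers are all made up for illustration; nothing here reflects Elicit's actual schema or data.

```python
import sqlite3

# In-memory database standing in for ingested product and CRM data (hypothetical).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE monthly_metrics ("
    "month TEXT PRIMARY KEY, active_users INTEGER, revenue_usd REAL)"
)
conn.executemany(
    "INSERT INTO monthly_metrics VALUES (?, ?, ?)",
    [("2025-01", 100, 1000.0), ("2025-02", 120, 1500.0), ("2025-03", 150, 1800.0)],
)

# A query a model (or a human auditing the model) could run against the table.
rows = conn.execute(
    "SELECT month, revenue_usd / active_users AS revenue_per_user "
    "FROM monthly_metrics ORDER BY month"
).fetchall()
conn.close()
```

One appeal of this choice is that every number a model cites can be traced back to a query that a person can rerun.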

[58:37] Nathan Labenz: Cool, interesting. Everybody's doing their own experiments in recursive self-improvement these days. That's what I'm noticing across the board. It's been striking that in the last, I don't know, 6 to 8 weeks, it seems like everybody's tipping into this moment of, maybe we too can be an experiment in recursive self-improvement. So you're now using Elicit as a test case for, can we get Elicit to build effective world models and then use these detailed representations that it itself has constructed to inform its own analysis of what it itself should become in the future? The hall of mirrors there is deep and fascinating. This obviously relates to a blog post that you put out not too long ago, which is called Planning is Unsolved. And I think, obviously, that's totally true. Or we could just ask the AIs to handle all this and retire to the beach, as I think an Anthropic person once famously put it. I do wonder a little bit, and this is probably a cultural question as much as it is a technology question, but when I think about my own company, which I'm now just the AI advisor to and not running, and I think about our planning process that we developed before AI, sometimes I'm like, maybe we should just scrap the whole thing. All these times that we come together and sit around the table and talk about this or that and try to convince each other that if we do this, it'll be more successful than if we do that. I'm often like, man, what we should do is build it all, launch all these things and see what happens. Coding has gotten cheap. So maybe the future of planning is less about guessing and more about fast iteration and actually making more contact with reality. Obviously, again, there's not a true strict binary there. But how are you guys thinking about that question?
And how are your clients, especially pharma companies, thinking about that question? There's also this, I think it's more of an aspiration than a trend at this point, but there's the notion of clinical trial abundance, which sounds great. Again, I don't think we're there, but you can imagine two very different visions for a pharmaceutical company. One is, we're gonna use these world models, we're gonna make much better decisions, and we're going to deploy these scarce resources and these precious few at-bats we have at clinical trials in the best way possible. And then there's this other vision that's, we'll do 10 times as many clinical trials, or 100 times, who knows, and that'll be a bigger unlock because we'll actually get the real answers in far more cases. Where do you want to be on that spectrum? Where do you think pharmaceutical companies should be on that spectrum? Maybe it's different. Maybe Waymark should be one place and pharma should be somewhere else.

[1:01:37] Jungwon Byun: Yeah, I think it's great that the cost of software engineering has come down, but it feels like that was just one of the bottlenecks and the others haven't moved. I don't think user attention or feedback is infinite. So I still feel cautious about just throwing things out there and giving people bad experiences or leaving people with bad impressions. And I also think it's definitely possible to get stuck on a local max. I would worry about that a bit. Let's say you just ship a feature, you're like, oh, cool, this works. Let's just keep going. And you might be able to keep going for some time, but then still end up maintaining, improving, and investing in something that wasn't the best possible thing you could have done. So I think there's still a lot of room for judgment and purpose. And there are places where I'm not sure the shape of the problem has changed that much. I feel like there's always an explore-exploit trade-off, and you want to navigate those two thoughtfully, and sometimes you want to do one, sometimes you want to do the other. And maybe the cost of exploring has gone down a little bit, but I think there are still some things where you know you have to get it right the first time. And it's not that the cost of all software engineering has literally gone to zero. There are still large software engineering projects. So I'm not sure it's changed that much. I feel like for now, it's mostly changed things at the margins, like fixing bugs, small admin features, things that we don't see as our core capability, that we want to fully automate and are happy to take liberal experiments with. And then things that we see as being core to our mission and our purpose and differentiator as a product still involve a lot of careful thinking and intent. And then with our customers, I find that many of them are just like, yes, there's a lot of excitement to build.
I guess my maybe unsurprising take on the build versus buy problem is that you should build internally the things that are your core comparative advantage and basically nothing else. For example, there are certain workflows that are just regulated across the industry. Every company has to do them pretty much the exact same way. And they are very particular. And as a result, they're very interface heavy. And I just don't think it's the core competency of pharma companies to design nice software. And your comparative advantage as a company is not going to come from solving this regulatory compliance problem. So, for example, systematic review, which Elicit solves, is one of those where I just don't think it makes sense for pharma companies to try and build this thing that every company has to do the same way and that is actually very involved. Other things, especially in early-stage research or even in development, like certain kinds of predictive models, make sense for a pharma company to want to build in-house. And then on the clinical trial point, I think, why not have both? There's a lot of interest in digital twins and in simulating trial effects digitally as much as possible to design the trial well. And I'm sure there are certain trials where, with the right regulatory framework and operational improvements, we might be able to take a lot more bets. I think especially in rare diseases, where a trial is actually almost like a treatment option, we could be much more flexible. And then I think in other domains, depending on the type of drug and what we already know about its toxicity profile, we would probably want to hold a higher bar. So again, my hope is that it's great to have multiple tools and options and choices.
And so maybe we can just build more tools, make the tools better, and then build a good framework for like when you reach for what tool?

[1:04:59] Andreas Stuhlmüller: I think there are many cases where you just can't do everything at once. So if you think about clinical trials from the participant perspective, right? Usually you can participate in one or at most two. And often there are many that you could participate in, but you still need to choose which of those you go with. And so I think that's a tough planning problem. And likewise, I think as a company, you have a certain amount of resources and they can be deployed one way or the other. I still find, especially as a company that is trying to be extremely mission focused and asking how do we actually, in the short time that we have, make an impact on the quality of reasoning and the impact of AI on reasoning quality, I think by default, you're probably just not going to accomplish that. And even if you hill climb on user metrics, you're probably just not going to accomplish that. So you need to think pretty carefully about how to get that to work.

[1:06:01] Nathan Labenz: Your call out of the patient perspective is a really useful reframing there because it's one thing to say, yeah, as a pharmaceutical company, maybe we can have both, we can run all the trials, or certainly as a SaaS company, I can potentially launch all the features, whether or not I should. But if you've got cancer, you've only got one body and you can't take all the drugs, right? That would obviously be ill advised. So I do think, at least for now, until the transhuman uploading future or whatever appears, we are going to continue to face these very stark choices about what should I do with my own individual body as a human, knowing that there's not a second copy of it and there's not really the ability to diversify across paths in many cases. Are there any examples of that you could share at the company level, especially if there's something where this sort of company model and recursive self-improvement paradigm way of working has led, at least in your counterfactual analysis, to a different approach than you might have taken were you just gathering around the table and sharing your intuitions like people used to do?

[1:07:20] Jungwon Byun: Examples of where automation and learning through doing and cheaper experimentation has led us to a different outcome.

[1:07:27] Nathan Labenz: Or maybe it's the same thing, but I'm thinking especially around this planning with a world model. Is there something where, because you had structured these various lenses and you were able to be more systematic and structured and really agree that this is the framework that we are using, maybe you avoided talking past each other or had some new synthesis, some new insight that you just don't think would have happened in the absence of that structured approach?

[1:08:00] Jungwon Byun: This is a partial example. I'm sure Andreas has a better one, but it's actually very timely. I just did this for an important hiring decision where ahead of time we had a rubric designed for what role we wanted to fill, and the role had been through a few different evolutions. We had looked at different personas. There was maybe a disagreement on exactly what type; it's an executive level hire. So some disagreement on exactly what we needed, and maybe that changed over time as the company grew during the course of the search. And then at some point, about a month or so ago, I wrote down a framework. And then we did extensive interviews, so many references, lots of back channels. There was just so much information I was getting. And I was starting to develop a take on what we should do with this candidate. But I really wanted to avoid recency bias. And I had a structured project. In this case, I used Claude, not Elicit. Elicit doesn't support hiring decisions exactly yet, or I don't use it for that case. And I created a project where I had all of the different meeting notes across all the different interviews and all the different email threads with this candidate and all the feedback submitted, and systematically asked Claude to fill out examples of evidence for every single dimension, and it ended up probably being about 20 different fields, like evidence that this person has consistently hit their goals, evidence that this person can hire a great team, evidence that this person is authentic or is culturally aligned. And I started with evidence and I went piece by piece because I don't trust Claude to fully execute in one go. And there was a bit of calibration. And then once I had the evidence, which was like hit quota so many years, blah, blah, blah, then it was like, okay, what is the evidence for and against? What decision do I make? How do I rate it on a scale of five?
And then all together I had this synthesized point of view. And then I was able to share it with the candidate. I think they really appreciated it as well because they said it was like the greatest comprehensive synthesis of professional validation they had ever received. So that was one case where compositional structured reasoning with an intentional process, applied at scale with AI, was able to both check my decision-making process and also give someone the gift of something that was very human and very detailed about them and everything that they had accomplished.

[1:10:03] Andreas Stuhlmüller: My example was going to be much more mundane. I think for me, I've been trying to do this more for just planning my week, where I think about, I have goals and this is what I want to accomplish in the long run, this year, this month. And then the question is, I have all these calendar blocks, like this podcast block, and I need to figure out, there are many different things I could do. What should I do? And which things depend on which other things? I think it's a pretty tricky problem to know, when you could be spending your time in many different ways, what is worth doing. I've been trying to get to the point where I can use automation as part of my weekly planning and think in a more structured way about, if I want to accomplish my monthly goal, this is where I need to be this week. How much time is that going to take? Is it maybe going to take like 5 hours to write a blog post? When can those five hours happen? And so I think this sort of backwards chaining, people do it informally, but I think there is a lot of constraint satisfaction and propagation of constraints that is pretty tricky for humans, and I think the models will help us with that.
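[Editor's sketch] The backward chaining Andreas describes can be illustrated with a few lines of Python. This is a minimal sketch under stated assumptions, not his actual system: the task names, the 40-hour week, and the function `latest_starts` are all hypothetical. It propagates a deadline backwards through task dependencies to find the latest hour each task can start.

```python
from collections import defaultdict

def latest_starts(durations, depends_on, deadline):
    """Return the latest start hour for each task so that all finish by `deadline`.

    durations:  {task: hours needed}
    depends_on: {task: [prerequisite tasks]}
    """
    # successors[t] lists the tasks that cannot start until t is finished.
    successors = defaultdict(list)
    for task, prereqs in depends_on.items():
        for prereq in prereqs:
            successors[prereq].append(task)

    latest = {}

    def latest_start(task):
        if task not in latest:
            # A task must finish before the earliest latest-start of its successors;
            # tasks with no successors only need to meet the overall deadline.
            finish_by = min((latest_start(s) for s in successors[task]), default=deadline)
            latest[task] = finish_by - durations[task]
        return latest[task]

    for task in durations:
        latest_start(task)
    return latest

# Hypothetical example: a 5-hour blog post split into three dependent steps,
# due by the end of a 40-hour work week.
durations = {"outline": 1, "draft": 3, "edit": 1}
depends_on = {"draft": ["outline"], "edit": ["draft"]}
plan = latest_starts(durations, depends_on, deadline=40)
print(plan)
```

In this toy case the outline has to start by hour 35 at the latest. A real planner would also need calendar availability and priorities, which this sketch leaves out.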

[1:11:13] Nathan Labenz: Yeah, that's cool. I think of that as building your own harness in a way, which is something I'm thinking about for myself too. Like how can I build up structures around me to keep steering me in the right direction, feeding me the information I need, and hopefully helping me become my best self, use my time as well as I possibly can by setting me up for success as much as AIs can do that. Do you find that you are following it? Do you find your, like, how good is it? And are you actually living by it yet? Or is it still, maybe I'll, maybe next week when it gets a little better, I'll actually do the plan that it gives me.

[1:11:54] Andreas Stuhlmüller: So it's still a very human in the loop process. I actually have two versions of it. I have one automated version, which I hate, and I have one interactive version where it walks me through the planning. And I still do the fully automated one just to see how good it is. And I want to know, maybe at some point I'll be like, well, yeah, I'm not needed here anymore. But generally I'm like, you just didn't fully understand what I'm trying to do. You made it too complicated and so on. So I'm still in the outer loop, but I think it's kind of interesting to think about it, right? Like, right now humans are the outer loop and they call LLMs. Maybe eventually the LLM calls you and the LLM is the outer loop and you're just the inner loop. Not sure that's a positive future, but it seems like it's part of the trend here. Yeah, so the line is our automated software engineering project. As for many companies, I think software engineering is the place where we have had the greatest success in doing quite extensive automation. And so, briefly, our overall company goal is that at the end of the year, when we go on vacation, we want the company to keep running and keep doing work in all of its functions. The first half of the year, we mostly focused on trying to make that happen for software engineering. And there we have a system which is called the line because it is like a factory line. So someone mentions a feature they would like to have on Slack, or a user mentions a feature, and we Slack emoji react to it with a little line emoji, or there's an integration with our customer support system. And then it kicks off this iterative process where first the feature needs to get specced out.
Then you need to iterate on the spec, it needs to be implemented, a video needs to be recorded of the feature being tested, then a code review needs to happen, then it needs to get merged into dev and then into prod. And we do have a fully automated version of this now. So for simple features, basically you just emoji react to, oh, I would like it if Elicit talked about its citations in a slightly different way. And it will just go through this entire process automatically. And there are various judgment calls it makes about where human intervention is needed. Like maybe the spec was too incomplete, and so it's like, okay, we need to pull in a human here. Or maybe the feature is too complex for the system to automatically review and we need to pull in a human here. But many simple features can actually flow fully automated through the line. And I think it has already been a significant unlock for a lot of simple bug fixes and features. But I think it's also setting us up for the future where, as each of these individual parts of the line improves, more and more of software engineering will be automated. And yeah, it's been pretty cool to see. Now we're merging maybe, I don't know, 30 to 50 issues per week fully automatically.
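[Editor's sketch] The stages Andreas lists (spec, iterate on the spec, implement, record a test video, code review, merge to dev, merge to prod) can be pictured as a simple staged pipeline with escalation points. The code below is an illustrative sketch only, not Elicit's implementation: the `Issue` fields, the `needs_human` checks, and the stage names are hypothetical stand-ins for the judgment calls he describes.

```python
from dataclasses import dataclass, field

# Stage order as described in the conversation.
STAGES = [
    "spec", "iterate_spec", "implement", "record_test_video",
    "code_review", "merge_dev", "merge_prod",
]

@dataclass
class Issue:
    request: str
    spec_complete: bool = True     # hypothetical signal: is the spec good enough?
    auto_reviewable: bool = True   # hypothetical signal: is the change simple enough?
    history: list = field(default_factory=list)

def needs_human(stage, issue):
    """Return a reason to escalate at this stage, or None to continue."""
    if stage == "iterate_spec" and not issue.spec_complete:
        return "spec too incomplete"
    if stage == "code_review" and not issue.auto_reviewable:
        return "too complex to review automatically"
    return None

def run_line(issue):
    """Push an issue through every stage, stopping at the first escalation."""
    for stage in STAGES:
        reason = needs_human(stage, issue)
        if reason:
            issue.history.append(f"{stage}: escalated ({reason})")
            return "needs_human"
        issue.history.append(f"{stage}: done")
    return "merged"

simple = Issue("Describe citations slightly differently")
complex_change = Issue("Rework the extraction flow", auto_reviewable=False)
print(run_line(simple), run_line(complex_change))
```

The design point is that escalation is decided per stage, so improving any single stage's judgment widens what can flow through without a human.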

[1:14:59] Nathan Labenz: Cool. That's really interesting. What do you think is going to drive the next step? If indeed we are successful in having the company continue to function through the holidays without you guys for two weeks, then one wonders, how long could it go? And do you ever have to come back, for one thing? But what will take us there? Is it just the next generation of models? Mythos is supposedly coming soon to a public API near you with supposedly much better long horizon performance. Is that going to be the biggest unlock? You've got the structure, now you just need to drop in a better model, or what else do you think is going to be needed over the next six months to actually realize that?

[1:15:39] Andreas Stuhlmüller: First, I don't actually expect the company will run fully automatically by the end of the year. I expect our lower bar is that within each function there are pretty autonomous workflows that run and that connect to some workflows in other functions. But I think a lot of the high level steering will still be very much needed. Yeah, what is needed for scale up? I think one big obstacle right now is the models are not fully calibrated about when human intervention is needed. You have to be pretty risk averse in how you use them. I think with software engineering, if it's the case that 80% of the time when the model says this is an automatically reviewable feature then it actually is, then that's not good enough because we don't want to break production 20% of the time. That's pretty rough. And so we have to err much more on the side of, if in doubt, it's not an automatically reviewable feature. And so I think that's the case in software engineering. I expect it's the case in other situations too. Like, you know, if you were to let the models drive through some customer interaction, for example, you probably want to be at least as sure as in the engineering case. And so it's actually not clear to me. If you're following the METR graph, right, there's the 50% success rate curve that goes up over time as the models get better, and then there's the 80% success rate curve. And the 50% success rate curve is much higher, obviously, than the 80%. And the 80% hasn't been going up quite as fast, I would say, as we would like, and often we want more than 80%. So depending on how the average case performance compares to the, I don't know, 95th percentile performance, just dropping in the next models might be good enough or might not, but I wouldn't automatically rely on it.
And it would be cool if, similar to fast mode, there were an ultra reliable mode or something, which isn't just think more, but has guarantees on certain classes of errors that you're never going to make.

[1:17:53] Nathan Labenz: Yeah, that's cool. A lot of really interesting thinking there. One big question that is generating quite different takes at the moment is, are people going to be able and willing to pay the exponentially rising token bills that the industry as a whole is currently seeing? You could analyze this in any number of ways. One would be your own internal work, right? Like where is your token budget as compared to your headcount budget today in engineering? And with the introduction of Mythos, if it really is, let's say, a lot better for some definition of a lot better, do you think you will shift that budget and just spend a lot more on tokens relative to humans compared to what you do today? And then do you think your customers will do that as well? And maybe it will break down by use case. For as much as we do hear a lot of complaining about token costs and anecdotally, oh, this company pulled back or that company hit budget, it still feels to me like there's a lot of value in the marginal intelligence and just getting better results. Tokens are still pretty cheap. In most companies, it's still small. I hear things like 5, 10% of what we're spending on headcount, and that's not that much. Maybe you didn't budget for it and that creates some discomfort in your organization. But on the fundamental economics, it feels to me like if you can just get lots better work for somewhat more, even maybe a multiple of the token cost, it still seems pretty rational to pay it. I guess that's my starting position. What do you guys think you will do? What do you think your customers will do?

[1:19:40] Jungwon Byun: I think often for our customers, at least the offering we're providing is displacing services spend. So the barrier is more about whether it can fully displace the services spend, and then also maybe getting over the mental hurdle of price anchoring for software. But certainly in terms of the dollars allocated to solving this problem, there are a lot more dollars than compute costs at the moment. So, yeah, we'll see. There are obviously human issues to overcome there, but I think from a dollars and cents perspective, there's still a lot of room. Yeah, I've seen mixed things in the news. And I think the industry, at least the pharmaceutical industry, is fairly disciplined about costs and ROI. So even if there's an initial period of heavy exploration, I think there's a lot of accountability around what that's delivering for the business. I think we'll continue to see that.

[1:20:32] Andreas Stuhlmüller: I think even internally at Elicit, I'm not sure how many more multiples of token costs we can easily spend. So maybe taking myself as an example, I spend maybe $2,000 or so per week on tokens. And maybe I could double it or triple it, I don't know. But not much more than that for sure. And that is already influencing my behavior to some extent. So I don't actually currently use fast mode for these models for that reason, because I don't feel the marginal returns are high enough for most tasks. So I expect it's unlikely that I'll be like, oh, I need to switch over everything I do; that's probably not going to happen. And both for my own usage and also, this is already actually the case in the Elicit app, I think it looks more like one smart orchestrator agent that then, you know, spins off many other agents that have to do simpler tasks that just don't need to be the largest model. And I expect that will just become increasingly important, this sort of dispatching to a model of the right size so that you get the intelligence when you need it, but you're not just multiplying your whole spend by some number that was an inefficient use of compute to begin with.
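[Editor's sketch] The cost argument for dispatching to a model of the right size is simple arithmetic, illustrated below. Everything here is hypothetical: the tier names, the per-million-token prices, the difficulty scores, and the routing cutoffs are invented for the example and do not reflect real model pricing or Elicit's routing logic.

```python
# Hypothetical tiers and prices in USD per million tokens; not real pricing.
TIERS = {"small": 1.0, "medium": 5.0, "large": 25.0}

def pick_tier(difficulty):
    """Route a subtask to the cheapest tier assumed able to handle it."""
    if difficulty < 0.3:
        return "small"
    if difficulty < 0.7:
        return "medium"
    return "large"

def cost(tasks, route=pick_tier):
    """Total cost for tasks given as (difficulty, tokens) pairs under a routing rule."""
    return sum(TIERS[route(difficulty)] * tokens / 1_000_000
               for difficulty, tokens in tasks)

# Made-up workload: mostly easy subtasks, one hard one.
tasks = [(0.1, 2_000_000), (0.2, 2_000_000), (0.5, 1_000_000), (0.9, 1_000_000)]

routed = cost(tasks)                             # orchestrator dispatches by difficulty
all_large = cost(tasks, route=lambda d: "large") # everything goes to the largest model
print(routed, all_large)
```

With these invented numbers, routing costs 34 dollars against 150 for sending everything to the largest tier. The real saving depends entirely on how accurately the orchestrator judges difficulty.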

[1:21:42] Nathan Labenz: So $2,000 is not a small amount. Does that make you an outlier within the company or is everybody doing that? If so, that would put your token cost presumably not at the level of payroll, because I assume you're paying your engineers more than that. But it would be at least a not insignificant share. And certainly if you were to triple it from there, you'd be getting into something on the order of magnitude of parity with human headcount. What are you doing with it all too? Because I use my $200 Claude Max and my Codex Pro and I honestly don't even hit my limits that often. Now this may be API, which might be 10 times more. And so that could be a big part of it, but I sometimes feel a little ashamed that I'm not redlining the account more than I am. How would you advise me, or what sort of personal bitter lessons have you learned where you're really finding that token maxing is worth it?

[1:22:42] Andreas Stuhlmüller: Yeah, I think I'm not sure I'm the top user of tokens at Elicit, but I'm probably at least in the top five. So I'm probably a little bit of an outlier. Second, yeah, I'm using the API. I could probably like save more money by being more clever about how to use various like pro accounts and stuff. I do have a fairly elaborate like system built on like Pi that like orchestrates between the different agents and uses like ChatGPT to double check Claude to then sometimes call Gemini to get like another take. And that is a little bit easier to do if you're on the API than if you're on the normal end user plans. That might just be part of the explanation here.

[1:23:20] Nathan Labenz: Any particular use cases that you find particular value in that you think other people are maybe sleeping on?

[1:23:28] Andreas Stuhlmüller: I don't know what other people are doing. I have found, as mentioned earlier, I think I do get a lot of use in planning, keeping my calendar in sync with my personal journaling system, in sync with my to-dos, making sure everything is coherent with my longer range planning doc. You know, when a new day starts, going over the last day, checking, are there any leftover tasks, moving them into the right place. So there's a lot of automation that is happening without me prompting it that probably contributes to those costs being higher. Similarly, for email, I have a pretty elaborate spec on which emails should be auto archived, and maybe every hour or so my models check that and go, okay, you know, let's just archive the emails that Andreas definitely doesn't need to read. And then on the more user driven side, I do a lot of cross-checking where I run, what are some fun things we could talk about with Nathan, and then, okay, call GPT and Gemini to double-check those things. And I do find the models getting cross-checked by other models often improves the results quite a bit. So for maybe, I don't know, 1/4 of my use cases, that already doubles or triples the cost. So that's another source of additional token spend.

[1:24:49] Nathan Labenz: Got it. Okay, cool. Another question I'd love to get your take on is, are we seeing convergence or are we seeing divergence in models? Because one notable feature of Elicit today is no model picker, at least from what I've explored recently. So you're making choices and it seems like you clearly think you know best, and even if people have a favorite model, it would not be a good idea, given all the validation and scaffolding that you have, to just go in and swap models in and out. How do you see this dynamic shaping up? There are again just such different takes: the model's a commodity, scaffolding's all that matters. No, the model's everything, scaffolding's a complement. They're converging, they're diverging. What is your take on all of that?

[1:25:42] Andreas Stuhlmüller: Yeah, I keep being shocked by how much the models are converging. I guess I should stop being shocked at this point because I'm just not updating, but it is a really interesting and surprising fact about the world that the models are so similar. That's not the reason why we don't offer a model picker. The reason is I think a lot of tasks still involve multiple models orchestrated in a way that we think makes the most sense, having particular models that are good at screening papers or extracting data. And even though the models are so similar, I think the differences are important in subtle ways. So for example, I think people love to hate on Gemini and so do I, but when we evaluated Claude Opus, I think this was 4.5, against Gemini 3 Pro at the time, I think Opus did better on extraction accuracy, but if you check what fraction of claims are directly supported by the evidence, I think Gemini actually beat it by at least 5% or so. So I think the models are still, I don't know, micro jagged enough that you can't say, oh yeah, this model is clearly the best, you should just use that. And so I don't want to put that on our users for the most part. I think mostly what our users pay for is for us to do the work of figuring out what models are good at what kind of thing and making sure those models are actually getting used in those places.

[1:27:10] Nathan Labenz: Yeah, interesting. So there is a place for Gemini in Elicit today.

[1:27:16] Andreas Stuhlmüller: Yeah, I mean, there was a place like 2 months ago. I actually don't know. I think even though I'm fairly on top of what's going on, our evals team is even more on top of it. So I don't actually know if it's still live, but there definitely was a place two months ago and maybe next week there will be a place for it again.

[1:27:34] Nathan Labenz: Yeah, cool. That in and of itself is an interesting reflection of how frequently you're swapping things out and how dynamic and competitive the environment is. Maybe three more questions, if you will. One, do you have plans to expose Elicit as a tool for random people's Claude Codes to use? That could be by allowing them to do it via API if they have an account, or even more broadly, it could be like through a sort of 402 code type thing, a 0.xyz I've recently been exploring as a way to get just pay per use access to a bunch of different tools. Yes, no, why not?

[1:28:20] Andreas Stuhlmüller: No, I mean, we haven't been advertising it that much, but we already have an MCP and API. A lot of people use the API. People can check it out at docs.elicit.com. And a lot of the work I do with Elicit is through the API. When I run systematic literature reviews, often I use the systematic literature review API and I iterate using various other models on the protocol for a while, and then I run it in the background and retrieve it. So I think that is an important use case that we'd really like to support. Not everything has to happen through the interface, and I think more and more will happen through APIs.

[1:28:58] Nathan Labenz: Yeah, okay, cool. I'm sorry I missed that in my prep, but I'll again point my agent at the documentation. So I'm on a little bit of an AI for science mini arc right now. I'd be interested in your take on other big picture approaches to AI for science. You guys are obviously coming at it with a very systematic reasoning angle. There is the sort of close the loop angle where we'll empower these models to actually run experiments through a cloud lab or whatever, and then they'll be getting feedback from reality. That seems like it could go somewhere quite interesting. Then there's of course training models on other modalities of data. We've of course seen how proteins fold, and there's, what if I do this perturbation to a cell, what's its next state going to be? You can go on and on in that domain. Any interesting takes that you think might be non-consensus that you'd like to share?

[1:29:57] Andreas Stuhlmüller: I think all of this stuff is like super exciting. I think sometimes people like come to us and are like, well, who's going to win in AI for science? And I think that's just an absurd thing to say because science is such a big space. And as you just said, there's like many layers of abstraction from, I don't know, understanding single cell dynamics through automated experiments, protein models to like making, like when we talk to the pharma companies, they're like, we are trying to make like a multi-year plan that accounts for the changing technological environment, but also accounts for the fact that clinical trials have certain intrinsic time scales. And there's just like such a different reasoning problem from modeling the single cell dynamics. I don't know, it's a big space, it's a big pie. I think I'm excited that people are excited about it. Yeah. Do you have any controversial takes here? I think how does it all come together is maybe a really interesting, like open question. When people think about what is the automated company of the future? Does it look like you take the existing, like top 20 pharma companies and they over time will morph into like a different functional form? Or is it going to be the case that a small biotech comes along and they're just like much more end-to-end integrated? And in 10 years, the top 20 companies will all be replaced by companies that we don't even know the names of today. I actually don't know how it's going to shake out. I think, yeah, I really think it could go either way, depending on how quickly people at the existing companies wake up and understand how much everything is going to transform. So yeah, maybe that's not a very interesting controversial take, given that I don't have a take between those two futures.

[1:31:36] Nathan Labenz: One big question I think about a lot is how integrated will the models themselves be? Obviously, we have the tool calling paradigm coming along very nicely. And this could be extended to a tool call to run an experiment in a cloud lab and get a result. And that result could come back as a data printout of the same sort that a human would read. And then there's this other paradigm of integration that we see, I think, a leading indicator of with image and now also video with Google's latest Omni model. There's this sort of deep integration of language and pixel space where, in the early ChatGPT image generation experience, you would talk to the model, or even if you gave it a photo, it would try to caption that photo, describe it, and then use language to go over to the separate tool call image generation model and ask for something. And of course, the people never quite looked like the ones that you put in, right? Because you just can't describe a face in language with that level of fidelity. But now you have this deeper weights level integration where I can give you an image and say, make this a line drawing or whatever, and it has both, right? It understands conceptually what I want, but it also sees in some sense the structure of the face and can preserve that through the transformation. So I really wonder if that's coming to all the modalities of science as well. I think it probably is coming, but I wonder if you think it's a good idea, because it certainly would in some ways make, or at least it naively seems to me like it would make, process supervision more difficult. With tool calls I can trace like, okay, you called the protein folding model. This is what you got back. Okay, there's where you went wrong, right?
I can see us digging in and interrogating those traces a lot better versus it's just, I asked you for this, you like spit out a new protein sequence because in your weights, you were, oh, I intuitively know what sequence will do that function that you just asked for. But obviously that could be really powerful in image, it's like a major unlock and I don't see any reason it wouldn't be a major unlock in designing new proteins or what have you as well. So do you think that's coming and do you think it would be a good or bad idea if it does in fact?

[1:34:03] Andreas Stuhlmüller: Yeah, I think it's a really interesting question. I think there's the bigger question behind this, which is where do continuous representations win? I mean, the prior should be, end-to-end optimization is strong, integrate everything. But at the same time, I think you might have expected neuralese, like language models not thinking in tokens but just thinking in weight space, to be more successful than it has been. I think a lot of people have experimented with it. And I mean, I hope no one succeeds, but on priors, I would have expected maybe people to succeed at it. And so often, I think people forget that there are benefits even to the models from the discretization that comes along with that. And I mean, I guess you can ask yourself, why is human language the way it is? Why does it have discrete words in the first place? Why do we think in words and so on? And so I think it's possible that in some situations, even the models will benefit from the discretization. Why don't the models just write programs in weight space, and we don't even have programming languages and just everything happens in continuous space? Maybe that's eventually the future, I don't know, but I wouldn't necessarily bet on it. And so I think the straightforward, everything will be end-to-end optimized in weight space take is probably a little bit too lossy to be a good predictor of what will actually happen. In the protein dynamics case, I could actually see continuous representations being pretty good, but I'm not sure that will be the case everywhere. I think discretization has benefits that people sometimes overlook.

[1:35:38] Nathan Labenz: Give me one more beat on what you think those benefits are, because I totally agree with you that I don't want to see neuralese take over or win out. And yet at the same time, when you say, why do we think in language? My immediate answer is, because it's all we have, right? When somebody throws me a ball, I don't think in language about it, because I'm a physically embodied person who has intuitive physics and I just catch the ball. And I feel like if I had those sorts of senses for how proteins fold, I'd probably use them, but I probably wouldn't think about them.

[1:36:11] Andreas Stuhlmüller: I mean, I think the fundamental property you get from discretization is error correction, right? If I say a word a little wrong, you can still round it off to, okay, that's the word. And so at every step, you get this little bit of error correction. And so if you're trying to chain together many words, you don't get the compounding errors. I'm not an expert on this, but I think this is roughly why we don't have analog computers these days, why we have discrete computers: because you get these nice error correction properties. And so that could be one thing you might lose if you're trying to push everything into weight space.
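
The error-correction point above can be illustrated with a small sketch. This is a toy model, not anything described in the conversation: the `relay` function, the uniform noise, and the noise bound of 0.3 are all assumptions chosen for illustration. A value is passed through many noisy steps; in the discrete version it is snapped back to the nearest integer after every step, which undoes the noise as long as each perturbation stays below half the spacing between symbols.

```python
import random


def relay(value, steps, noise, snap, seed=0):
    """Pass a value through `steps` noisy hops (hypothetical toy model).

    Each hop adds uniform noise in [-noise, noise]. If `snap` is True, the
    value is rounded to the nearest integer after every hop, which plays
    the role of rounding a slightly mispronounced word back to the word.
    """
    rng = random.Random(seed)
    x = float(value)
    for _ in range(steps):
        x += rng.uniform(-noise, noise)
        if snap:
            # Discretization step: small errors are removed before they compound.
            x = float(round(x))
    return x


# Same noise sequence in both runs (same seed), only the snapping differs.
analog_error = abs(relay(7, 1000, 0.3, snap=False) - 7)
discrete_error = abs(relay(7, 1000, 0.3, snap=True) - 7)

print("continuous drift:", analog_error)
print("discrete drift:  ", discrete_error)
```

In this sketch the snapped run ends exactly where it started, because each perturbation is smaller than 0.5 and is rounded away immediately, while the continuous run accumulates the noise as a random walk. If the per-step noise could exceed 0.5, the discrete version would also make occasional errors, so the guarantee depends on the noise staying small relative to the symbol spacing.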

[1:36:46] Nathan Labenz: Yeah, interesting. Okay. Last one, zooming out and just going back to your original mission of radically improving the quality of reasoning in science and in society. How do you think we're doing? We've got inference everywhere. Is it serving us well? And how would you handicap the trajectory that we're on as we think about recursive self-improvement possibly soon, transformation possibly soon? Is the quality of reasoning on track to rise to the level that we need it to? What's the state of the species, so to speak?

[1:37:23] Andreas Stuhlmüller: What is the state of the species? I don't know. I mean, there's this meme where a person jumps from the roof and is like, "so far so good," as they're halfway down. So I think in many ways the models have probably improved our reasoning. I think I probably get better answers for many questions that I care about than I did in the past. But also I think that is not necessarily indicative of what will matter most to the species. So when I think about what is going to matter most, it's the decisions that governments and big AI projects, or maybe other large organizations, are going to make as AI transforms everything. And there I think the game is still open. I think in many ways, we're still extremely early. I think AI has transformed basically nothing. People always say, well, coding is getting automated, but I don't know. Most intellectual work in the economy is not coding. It also is employing many people. And so I think we've seen nothing yet. And so we're still early. And also I think all the big decisions are still coming up. As for how AI will impact epistemics and good reasoning, I think it really could go either way, because the models are optimized to look good and be persuasive. I could see a worsening of epistemics happening if you don't make this an explicit priority. At the same time, I don't know. I mean, I think the models can be optimized for truth seeking too. And if you prioritize those interventions, there's maybe a basin of attraction, where once you become more truth seeking, you realize, okay, what are the most important interventions? Oh, I guess we should prioritize being better at forecasting the results of what we do, and then you become better at prioritizing more epistemics-related interventions.
So I do feel like, for better or worse, we're still before the point of no return in either direction, actually. And I'm excited for people to build tools in this space, or to try to build tools in this space. I think people can help advocate for adoption of better epistemic tools. It does feel like a very small area relative to how important it seems to me to be for the future of our species, as you put it.

[1:39:36] Nathan Labenz: It was the smartest of times, it was the stupidest of times. And here's hoping that our better angels win out when it comes to better reasoning and better decision making. This has been an excellent conversation. I really have enjoyed the update on Elicit, and I appreciate how consistent and disciplined you guys are about it. It's not easy, obviously, running a business and trying to make sure you're staying true to that North Star mission. I think you guys do a really admirable job of holding yourselves accountable to trying to find the right balance between those. And there are many echoes of the hard work that you've put into that in this conversation. So I really appreciate the time and encourage you to keep up the great work and the disciplined reasoning. Anything else you want to leave people with before we break?

[1:40:28] Andreas Stuhlmüller: I think knowing when the models are making you better or worse at decision making is actually pretty subtle. And I think not that many people are paying close attention to it. I am reminded of the METR study, now from a while back, where engineers thought they were being made more productive, but were actually at a slight discount relative to unassisted work. And that was in engineering. It's probably no longer true in engineering, but I could imagine it for more complex decisions if you're not paying attention. Sometimes using the models regresses you to the mean or cuts off avenues of investigation, and at other times it opens up the space and makes you think more clearly. And so I find it helpful to just introspect: am I actually getting benefits here, or how is this changing my behavior? And so thinking about that and sharing it with the world does seem to me like just a clearly net good thing.

[1:41:23] Nathan Labenz: Yeah. For the time being, we still have some agency over this process, and I think that's a great reminder to maintain an ownership mindset and hold ourselves accountable to doing our very best work and not getting lazy and letting the AIs lead us around. Excellent. Jungwon Byun and Andreas Stuhlmüller, founders of Elicit, thank you again for being part of The Cognitive Revolution.

[1:41:49] Andreas Stuhlmüller: Thanks, Nathan.

Outro

[1:44:57] If you're finding value in the show, we'd appreciate it if you'd take a moment to share it with friends, post online, write a review on Apple Podcasts or Spotify, or just leave us a comment on YouTube. Of course, we always welcome your feedback, guest and topic suggestions, and sponsorship inquiries, either via our website, cognitiverevolution.ai, or by DMing me on your favorite social network. The Cognitive Revolution is part of the Turpentine Network, a network of podcasts, which is now part of A16Z, where experts talk technology, business, economics, geopolitics, culture, and more. We're produced by AI Podcasting. If you're looking for podcast production help for everything from the moment you stop recording to the moment your audience starts listening, check them out and see my endorsement at aipodcast.ing. And thank you to everyone who listens for being part of the Cognitive Revolution.
