The AI Reasoning Revolution with Ought's Jungwon Byun and Andreas Stuhlmüller


We've looked forward to today's episode since we launched the show! Andreas Stuhlmüller and Jungwon Byun are the co-founders of Ought, a product-driven research lab that develops mechanisms for delegating open-ended thinking to advanced machine learning systems. Their flagship product, Elicit (elicit.org), is an AI research assistant that helps researchers accelerate time-consuming workflows, starting with literature review.

Also, check out the debut of Erik's new long-form interview podcast, @UpstreamwithErikTorenberg, whose guests in the first two episodes were Balaji Srinivasan and Marc Andreessen. This coming season will feature interviews with Ezra Klein, David Sacks, Katherine Boyle, and more. Subscribe here: https://www.youtube.com/@UpstreamwithErikTorenberg

LINKS REFERENCED IN EPISODE:
Ought: https://ought.org/
Elicit: https://elicit.org/


TIMESTAMPS:
(0:00) Preview
(2:00) Nathan introduces the founders of Ought
(4:20) Why doesn't AI serve better reasoning?
(6:55) What limits the current paradigm
(8:40) Reflections on the last six years of Ought’s research experiments on "composable thinking"
(13:10) Error correction as a shared challenge between human and AI systems
(16:45) Sponsor: Omneky
(18:00) Interpretability by construction product philosophy
(25:00) Nathan’s personal experience using Elicit as a research assistant
(30:00) Concerns about model reasoning, and the importance of going a step further
(36:00) What customers of OpenAI Foundry should consider
(43:15) Evaluation challenges
(48:15) Embeddings challenges
(51:00) Vision for an assembly-line paradigm for corporate knowledge work
(56:00) Ought's short-term approach to building: Understanding human ways of teaching the model to be more helpful
(59:00) Wishful thinking versus real helpfulness
(1:03:00) What's next for Elicit: expansion and new workflows
(1:17:00) Zapier for reasoning
(1:23:00) What are the most fundamental "magic questions" for all domains?
(1:31:43) Significant impact of GPT-4
(1:36:00) How people are using Elicit
(1:44:00) AI Uncertainty and reason for hope
(1:48:00) 3 lightning-round questions

TWITTER:
@CogRev_Podcast
@jungofthewon
@stuhlmueller
@labenz (Nathan)
@eriktorenberg (Erik)

Thank you Omneky for sponsoring The Cognitive Revolution. Omneky is an omnichannel creative generation platform that lets you launch hundreds of thousands of ad iterations that actually work, customized across all platforms, with a click of a button. Omneky combines generative AI and real-time advertising data. Mention "Cog Rev" for 10% off.


More show notes and reading material are released on our Substack: https://cognitiverevolution.substack.com/


Full Transcript


Jungwon Byun: (0:00) There's a sense of wonder that has been lost, and I think it's easy to lose sight of how much about ourselves and our world still remains to be discovered. So I really want tools like Elicit and AI more broadly to not just amplify existing research efforts, but help us discover entirely new ways of doing research, entirely new research methodologies, entirely new research domains that weren't possible before. Now using this human artifact of text as a data source, we can analyze ourselves more intelligently so that we can better understand who we are and how we relate to each other and what kinds of systems we can build to help each person and every group flourish more. And for AI systems to be able to generate very decision-relevant but simultaneously rigorous one-pagers, policy memos, and answers to people in very high stakes decisions so that we can just become way more intelligent about how we can govern this world.

Nathan Labenz: (0:58) Hello, and welcome to the Cognitive Revolution, where we interview visionary researchers, entrepreneurs, and builders working on the frontier of artificial intelligence. Each week, we'll explore their revolutionary ideas, and together, we'll build a picture of how AI technology will transform work, life, and society in the coming years. I'm Nathan Labenz, joined by my cohost, Erik Torenberg. Before we dive into the Cognitive Revolution, I want to tell you about my new interview show, Upstream. Upstream is where I go deeper with some of the world's most interesting thinkers to map the constellation of ideas that matter. On the first season of Upstream, you'll hear from Marc Andreessen, David Sacks, Balaji, Ezra Klein, Joe Lonsdale, and more. Make sure to subscribe and check out the first episode with a16z's Marc Andreessen. The link is in the description. Hi, everyone. I'm recording this in the hospital maternity ward where my wife Amy and I just welcomed our third child this week. Thankfully, both are doing well, and we should be home by the time you're listening to this. Today's episode is one I have looked forward to since we launched the show. Andreas Stuhlmuller and Jungwon Byun are the founders of Ought, a product-driven research lab that develops mechanisms for delegating open-ended thinking to advanced machine learning systems. Their flagship product, Elicit, online at elicit.org, is an AI research assistant that helps researchers accelerate time-consuming workflows, starting with literature review. I was particularly excited to have Andreas and Jungwon on the show for two main reasons. First, in my experience, Elicit works well, notably better than any comparable product I tried in early 2023, and so much so that it adds real value to research workflows. No small accomplishment given the inherent difficulty of research tasks. Second, and likely more importantly in the long run, Ought has helped lead the AI industry toward a process-focused approach, which emphasizes task decomposition and the training of high-quality reasoning processes rather than focusing solely on correct or incorrect outcomes. We discuss how this process-centric approach works in practice today, why it's likely critical to building AI systems that can help with novel and important questions where little or no training data exists, how it might help us avoid catastrophic AI outcomes, and more. I hope you enjoy this conversation with Andreas Stuhlmuller and Jungwon Byun. Jungwon Byun, Andreas Stuhlmuller, founders of Ought, makers of Elicit, welcome to the Cognitive Revolution.

Jungwon Byun: (3:47) Thank you.

Andreas Stuhlmüller: (3:48) Hey, it's good to be here.

Nathan Labenz: (3:50) Yeah, I'm really excited for this conversation. I have been a fan of what you guys are building at Elicit for some time and had some really great product experiences using it in a couple of research questions that have come up in my own everyday life. And so I'm really excited to learn more about the product and also the philosophy that is behind it. I wanted to start with just a question drawn from your website. I went and read the mission statement, and there's a really interesting claim, which I think is true, but maybe underappreciated by many. And you say, it's especially unclear that ML will help in matters that require substantial thought. I think that's really apt in any sort of writing or work that I'm doing: I find that for the best work, the AI is currently the least helpful. So maybe you guys could just start by unpacking that. Why do you see that as the case? And then obviously, we'll transition into what you're doing about it.

Andreas Stuhlmüller: (4:51) Yeah. It's actually interesting. We wrote that six years ago, and I think it's still true. Roughly speaking, I think there are two things that machine learning can do. One is it can imitate some behavior. For example, it can write text like a human can write text. Or you can use reinforcement learning and can optimize some reward, like win games of Go. I think essentially all cases where AI has substantially exceeded human capability are cases of reinforcement learning because that's what you need to get to unforeseen creative moves. And so then the question is, well, when does reinforcement learning work? And I think it works best in domains where you have a pretty well-defined task, like play a game of Go, and you have a pretty clear objective, like win the game. But for helping us think more clearly or arrive at truer beliefs or better decisions, there isn't an obvious such objective. So there's a risk. And the risk is that we'll use a proxy objective or something that is similar to the thing we want, but not exactly the thing we want. The most natural such proxy objective is, does it look good? Is it producing stuff that looks persuasive or impressive or helpful seeming? But that's not the same as actually being well reasoned, robustly helpful, arriving at really good decisions. I think actually right now that's not a big problem. I think in terms of behavior, these models are still mostly in the imitation regime. They can help us write stuff or brainstorm stuff or write code, and I think that's all great and actually is helpful. But if you ultimately want them to be as helpful for better thinking as they are at playing Go, which is better than the best human, then I think there's some work to be done to make that work.

Nathan Labenz: (6:33) So there are at least two issues there, right? There's the sparse reward issue, and there's also the human unreliability in rating or valuation. Are those the core two issues that you see limiting the current paradigm, or are there others that you would want to highlight?

Andreas Stuhlmüller: (6:54) Those are real things, but I think it's more that there's not really a clear task definition. So I think the most natural task is I ask the model, hey, what decision should I make in this setting? And then the model is like, here are some considerations, and based on these considerations you make that decision. And then I can be like, well, does that sound good? Does it not sound good? I think I have some resolution in judging that, but I think if I want the model to eventually exceed how good I am at making these decisions, then something else needs to happen. I think there are various ideas for how to address that, but I think the most natural paradigm doesn't solve the problem of exceeding human capability at reasoning and decision making in a really helpful way.

Nathan Labenz: (7:45) So that feeds into the next part of the mission statement, which is really your take on solving it. You say, at Ought, our mission is to automate and scale open-ended reasoning so that future improvements in ML help as much with thinking and reflection as they do with tasks that have clear short-term outcomes. And it's interesting that you say you wrote that six years ago. I hadn't realized that the mission had been established that early. I'd love to hear a little bit of what you thought you were going to be doing when you started off and how that has compared to reality. Because I'm sure if you're anything like me, you've had a few surprises and updates over the course of the last six years. How was that expected to cash out versus how it has actually cashed out over time?

Andreas Stuhlmüller: (8:28) I think the surprising thing is actually how in line with expectations everything is. I think maybe timelines are somewhat shorter than we expected, but roughly speaking, when we started out, we were like, well, machine learning can't actually help with even the simplest thinking tasks yet. This was before GPT-2. There were no language models. I guess there was a thing called language models, but it's very different from what you would call a language model today. And so we were like, well, how can we still get ahead of the situation and study how to use future AI systems to help with that? Well, I guess humans are our closest standard for AI systems. We mostly ran experiments with humans thinking, well, how can we decompose complex tasks into tasks that humans who don't know the big picture can help with, break it down, and all that. I think it is crazy how much AI systems now look like these context-free humans that we ran simple toy experiments with back in the day. I think I've been surprised by how much things played out in line with expectations, actually.

Jungwon Byun: (9:35) Yeah, it's crazy. When we started, the aggressive AGI timelines were 2100. That's what the crazy people believed. They're like, oh, maybe we'll have AGI in 2100. And then over the span of a few years, that's gotten condensed by so much. So yeah, when we started, we were doing more research experiments. But even then, I think this is you, Andreas, we were trying to break down tasks to be very context-free. So basically giving our human participants a context window where they could only read so much text to do a specific task. And even details of our experiments like that have actually generalized quite a bit to what language models have actually ended up evolving into. But when we started, because we had no line of sight to when ML would be actually good for helping with reasoning, we ran experiments. It was way more researchy. And then as language model progress happened, we realized building a product around that and directly trying to help people with reasoning would be way more impactful.

Nathan Labenz: (10:34) Yeah, that's fascinating. So can you describe those experiments in a little bit more detail? The idea of giving a human a context window, I think, is pretty fascinating. Procedurally, what did that look like in the early days?

Andreas Stuhlmüller: (10:47) Yeah, there were a few different versions of those experiments. I'll describe one simple one, what we call the relay programming, where basically there's a programming task, like a programming puzzle you might get at one of these programming competitions. And every participant only has, let's say, a minute to make progress on the task and then pass on their notes to the next participant. Where the thought was, if you could make that work such that a person who doesn't know the big picture can make small contributions that move the state of thinking on that problem forward such that you eventually solve it, then you could get to mechanisms where you can solve much more complex problems. There's a fundamental question of how composable is thinking, how much do you need to know the big picture, need to know some mental model where you load everything into your head, and to what extent can you get away with not doing that? And so one of the settings where we studied that was in the setting of programming.

Jungwon Byun: (11:49) I think the high-level research question was, in a world where we have superintelligent AI systems that are smarter than us and so can do work that we can't evaluate directly, can we break down complex tasks into smaller tasks that are easier for people to evaluate? So part of the reason why in some of our experiments we gave people context windows was because if the AI system can read an entire book, but you can't, then that's really hard to evaluate. And so can we then break down work into smaller pieces such that if a human wanted to spot check parts of it, the amount of work that they would have to spot check is manageable for them to do? That's what we were studying.

Nathan Labenz: (12:29) So how productive were people able to be in this one-minute context window environment? Were teams or groups able to solve substantial problems? I would expect that honestly to be very hard. And I think this is a really interesting comparison now as well, because we do have obviously a lot of debates around what do the latest systems mean? Just how capable are they? How should we understand them? But I've not previously considered this notion of testing the human on the AI's turf. So I'm guessing we didn't do nearly as well as GPT-4 would currently do, certainly if you give it a minute as your hard limit to make progress.

Andreas Stuhlmüller: (13:13) I think it is very challenging, and the challenge is actually somewhat similar between humans and AI systems. One challenge is you need to get rid of errors more quickly than you introduce them. When you do a minute-long task, there's some chance you get it wrong. The more such tasks you compose together, the more likely it is that at least one of them goes wrong. Even if you only have a 10% probability of getting a subtask wrong, if you do 20 tasks, it's almost certainly going to happen. So you need some error correction mechanism, and that error correction mechanism itself needs to be composed out of these one-minute pieces. If you are not getting rid of errors more quickly than you introduce them, then you're doomed. I think that is a thing that we're currently facing with language models as well. If you compose together many tasks, the models are still somewhat flaky, and that limits scalability of both the humans we work with and current models to very complex compositions of tasks.
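
A quick sketch of the arithmetic behind this point, as a minimal Python example. The 10% per-subtask error rate and the 20-step chain are the numbers Andreas uses above; the 90% catch rate for an added checking step is an illustrative assumption, not a measured figure.

```python
# Illustrative sketch of the error-compounding point above.
p_error = 0.10   # chance a single one-minute subtask goes wrong (Andreas's example)
n_tasks = 20     # number of composed subtasks

p_any_error = 1 - (1 - p_error) ** n_tasks
print(f"P(at least one error in the chain) = {p_any_error:.2f}")  # ~0.88

# Add an error-correction step that catches 90% of mistakes (assumed rate):
# the residual per-step error rate falls to 1%, and the chain mostly survives.
p_residual = p_error * (1 - 0.90)
p_chain_ok = (1 - p_residual) ** n_tasks
print(f"P(error-free chain with checking)   = {p_chain_ok:.2f}")  # ~0.82
```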

Jungwon Byun: (14:20) In addition to the programming task, we also did text-based tasks. This was before a lot of the language models. We would have tasks where one person would generate a summary of a movie review, and then other people would have to decide was this summary accurate or inaccurate. Everyone could only see little parts. No one except maybe the original generator could read the whole summary. We ran into a lot of similar issues there as we are with language models. One specific flavor of this error propagation issue is that the interaction can converge in this suboptimal place where the people might get derailed by a particular line of reasoning and then get stuck there and miss the overall point of the summary or not find the most efficient way to decide if the summary was accurate or not. The other thing that we struggled with there is that the context windows were really too limited, and there needed to be some better mechanism for passing along state across different participants. We still haven't figured that problem out with language models either. What is an intelligent way to pass state so that you're not starting from scratch with every call to the language model?

Andreas Stuhlmüller: (15:38) Maybe here it's worth zooming out and briefly asking again, why does this sort of context-freeness matter? I think Jungwon mentioned this earlier a little bit, but the key thing is really how can you make it so that humans can evaluate what's going on? One world is the world where in the future, these context windows are obviously going to grow or going to go away eventually. In that world, AI systems are building complex internal mental models. We can't see those models. They make decisions. We're left asking, why did it make that decision? It's really hard to tell. Contrast that with the world where all the state that is being passed along is explicit. Then you can say, okay, this is what the model knows at this point. This is the local decision it made. It said, here's the next thing we need to check out based on the things we know so far. That's a thing you can evaluate. We care a lot about getting AI systems to a setting where what they do is transparent and where humans can evaluate what's going on. That has driven those experiments with humans in the past and also drives a lot of our work with language models now.

Sponsor: (16:45) Omneky uses generative AI to enable you to launch hundreds of thousands of ad iterations that actually work, customized across all platforms with a click of a button. I believe in Omneky so much that I invested in it, and I recommend you use it too. Use CogRev to get a 10% discount.

Nathan Labenz: (17:03) It's funny. I would say that's pretty prescient, really, because as you're describing the challenges in performing tasks that way, and then you said to zoom out, why do we care about this? My immediate thought was, having used language models a lot, it's pretty obvious why I would care about it now. It may not have been nearly so obvious five or six years ago. That is a really interesting history as to how you were exploring cognitive space. That's also a pretty natural way to bridge toward how you're building today. It seems like this product philosophy of interpretability by construction reads almost like a mantra for what you guys are doing now. So fascinating that you started with that even before the technology shaped up the way that it has. Tell us about this product philosophy of interpretability by construction, how it feeds into Elicit and what Elicit is today. Then I also want to hear what your vision is for Elicit 2 and maybe end of the decade type timeframes.

Andreas Stuhlmüller: (18:10) We mostly think about it as how can we make language models follow processes that humans can understand, as opposed to these latent, unobserved things that are only in the language model's head that have been optimized end-to-end to accomplish some goal, but we don't actually know what's going on. The most common way that we and also others refer to this is this distinction between processes and outcomes. I think the most natural framework for most machine learning is to think about it as outcome-based. You have some metric—predict the next token or try to optimize reward in a gameplay setting. You're trying to literally set all the floating point numbers in your matrices to make that final number go up as much as possible, and you don't really care about the meaning of the internal components. So they might get inscrutable. Contrast that with more programming or just natural programming. Usually, if you're programming, you need to write functions. These functions need to have independent meanings that you can evaluate even outside of a particular context. If that's not the case, then it's terrible spaghetti code that no one will understand and it's hard to debug, hard to improve. If we're on the spectrum from more outcome-based systems to more process-based systems, our philosophy is always try to push as much as possible for being in the process-based setting where you build on human-understandable task decompositions. You directly supervise the individual reasoning steps and try to get to this transparent setting. The obvious objection is, how is that going to scale? That seems like it involves a lot of manual labor. Isn't the whole bitter lesson of machine learning that you just want to scale up your matrices and optimize them end-to-end? I think there is a lot of truth to that. But the idea that pushes against that is actually we can use language models and AI more generally to help with the task decomposition itself. So it doesn't actually have to be a manual thing. The process of decomposing tasks into understandable pieces is itself a thing that AI can help with. Even in the world where end-to-end optimization is, in principle, more powerful, we think that the benefits from transparency and understandability are enough that it's worth pushing really hard in that direction.
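
A minimal sketch of the process-based framing Andreas describes, assuming a generic completion call `llm(prompt)` (a hypothetical stand-in for any model API, not Ought's code): the model proposes a human-readable decomposition, then each step is run and recorded separately so every intermediate result can be audited on its own.

```python
# Sketch: let the model propose a decomposition, then run each step separately.
# `llm` is a placeholder for any language model completion call.

def llm(prompt: str) -> str:
    raise NotImplementedError  # hypothetical model API

def propose_subtasks(task: str) -> list[str]:
    plan = llm(f"Break this task into short, independently checkable steps, one per line:\n{task}")
    return [line.strip() for line in plan.splitlines() if line.strip()]

def run_process(task: str) -> list[dict]:
    trace = []
    for step in propose_subtasks(task):
        result = llm(f"Overall task: {task}\nCurrent step: {step}\nDo only this step and report the result.")
        # Each (step, result) pair has independent meaning, so a reviewer can
        # audit the process without reconstructing hidden model state.
        trace.append({"step": step, "result": result})
    return trace
```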

Jungwon Byun: (20:48) The way I think about the interpretability by construction point is, the current world we're in is, as Andreas alluded to, do whatever it takes to score well on this feedback metric. We often don't understand why a particular method works really well, or it's not human-legible what "whatever it takes" is. Scaling this up to superhuman levels and scaling this up with a lot of compute is very concerning. We don't really want to do whatever it takes to score well on this metric. Then there's an after-the-fact approach to interpretability, which is, okay, I get this result from this answer. Now let me figure out how it arrived at that. But I think that is also still concerning at superhuman levels. Even if you ask the model to justify why it did what it did, it can rationalize. It can make up reasons that were not its actual reasons. So interpretability by construction is, it was trained on human-legible processes up front. You have this guarantee that this is what it did because it was programmed to do that. So you can always go back and just see what were the steps you followed to get this answer. That's the distinction—interpretability by construction versus just trying to understand why the model did what it did after the fact and using the model to generate explanations that way.

An example of this that we have in Elicit is this feature that runs through a trustworthiness checklist for papers. Currently, Elicit broadly is a research assistant. We've been focused on making it useful for literature reviews. Right now, when you run Elicit, you get back a table of papers. We automate the process of finding relevant papers and extracting details and summarizing the paper so that you can very efficiently figure out if a paper is relevant to your research question or not. But good information is not just relevant. It's also trustworthy and something you should actually update in light of. Once we set up infrastructure for finding relevant information better with language models, we then wanted to tackle, how can we evaluate, of all of these relevant papers, which ones should most update the user's question? We found in talking to a lot of researchers that there are very good standardized processes that exist for this already in academia. We learned about things like strength of evidence checklists, risk of bias checklists. There are just these checklists that researchers already make to evaluate lots of different papers. So we started to just automatically run that checklist over every single paper.

I think a tempting thing here would have been to try and come up with some score for trustworthiness, and probably other tools or other attempts have been made. How do we think about citation count and what journal it was published in and then give it a rating? But ratings can be easily hacked. So what we're doing is still giving the researcher all the pieces they would need to evaluate trustworthiness without distilling a lot of that richness and losing a lot of context by putting it into a score. That checklist currently has things like, what was the sample size? Are there any conflicts of interest with the funding sources? Did they control for a bunch of different things? What is the actual number that they reported on and who is it based off of? In the future, we could take this and condense it into a score if we wanted to. But because we did this in this compositional way, if we decide to summarize that checklist into a score, it'll be very easy for someone to then expand the score and see the checklist again. That's interpretability by construction. Do it in pieces first, then you can collapse it. Once you collapse it, it's also easy to expand it again and see what went into the score. Whereas if we had just collected a large dataset and trained on what people rate papers as, no user could then go and really get a detailed understanding of why this score in this specific instance.
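
A simplified sketch of what running a fixed checklist over each paper might look like; the questions echo the examples Jungwon gives, and `ask_about_paper` is a hypothetical grounded question-answering helper rather than Elicit's actual implementation. The point is that the output stays itemized, so it can later be collapsed into a score and expanded again.

```python
# Sketch: run the same trustworthiness checklist over every paper and keep
# the per-item answers instead of collapsing them into an opaque score.

TRUST_CHECKLIST = [
    "What was the sample size?",
    "Are there any conflicts of interest with the funding sources?",
    "What confounders did the study control for?",
    "What exact effect was reported, and for which population?",
]

def ask_about_paper(paper_text: str, question: str) -> str:
    raise NotImplementedError  # hypothetical grounded QA call over the paper

def run_checklist(paper_text: str) -> dict[str, str]:
    # Itemized output: a UI could summarize it into a score later, but the
    # underlying answers remain available to expand and inspect.
    return {question: ask_about_paper(paper_text, question) for question in TRUST_CHECKLIST}
```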

Nathan Labenz: (24:57) It really does work amazingly well, at least for many of the use cases that I've tried. So I would definitely recommend that our audience go check it out. I've used it for random medical queries that have come up in my personal life. My standard disclaimer is I wouldn't recommend making it your doctor at this point, but I would recommend getting a second opinion. I say the same thing about GPT-4, and I would definitely say here, it can help you surface valuable information. So that's been a really good experience. I've also had some really nice experiences, honestly, just even at the highest level. It's often hard to search for a particular paper that I kind of recall what the main conceptual takeaway was, but maybe I can't remember the name or I'm terrible with the authors' names in many cases. So my options at this point are often, well, I usually found it on Twitter, I can go search for things that I liked on Twitter and put one word in at a time and hope that I actually did like it as I should have and that it pops up. Or I can go to Elicit and type sentence length. It could still be a sentence fragment in many cases, but it's what is that conceptual thing that I'm recalling that is drawing me back to this in the first place? I've had a couple of good examples where I could not find the damn paper. And then with that 10 or 12 word conceptual memory that I was able to finally spit out, boom, I was able to get exactly to the paper that I wanted. So that's obviously not the most advanced use case, but honestly, I would say that does put Elicit ahead of almost all the competition at this point. I've tried some other research-focused tools and, frankly, in many cases, gotten nothing of real value out of it. And even with Bing, it's very hit or miss exactly what you're going to get back and how good it is. So this is not a paid endorsement, although it's starting to sound like one. It actually just has been really useful to me. So go try it out. We're big on telling people to go try out the actual tools. Okay, back to questions. Supervising reasoning processes, not just outcomes. I'm feeling a little bit like the line there is blurring in some ways. I'm feeling a little bit confused because on the one hand, okay, it makes a ton of sense to try to create this, what was the thing thinking at every step of the way? And certainly if I rewind the clock a little bit and you're looking at very finite context window language models, that's all you have available, then yeah, you can only do so much, so you kind of have to chop it up into these micro tasks. It's the only way. But how do you see things? For example, with a 32,000 token context window now coming online and with techniques that are, whatever, least to most or step by step, is there a way in which we're headed toward some sort of equivalence where you could spell out all the steps that you want a GPT-4 32k to take and then watch it take all those steps one by one and then sort of say, "Here's your score on that," with all of that out loud. Are these paradigms converging in some sense with those language model best practices? Or do you still feel like there are fundamental differences that I'm not grokking?

Andreas Stuhlmüller: (28:53) I think there's partial convergence. So I think the step by step is definitely a move in the right direction. I think it's good to see the reasoning spelled out. I think it's good for many reasons. I think it's still the case that there are some results where you replace the steps just with dots. Literally, you make the model type out dot dot dot dot as opposed to writing out steps, and it still leads to its answer improving because there's some latent computation that it can do during those steps that it wouldn't otherwise do. And I think that is maybe a little bit concerning. I think when you write out the reasoning, you definitely want it to be the case that the reasoning is causally responsible for the answer you get in the end. Because if you're like, "Well, yeah, that reasoning all makes sense," and then you get an answer and you're like, "Well, does that really depend on the reasoning?" I think you'd prefer not to be in that situation. Maybe it's not really a big deal for basically all of the use cases we have of language models right now because they're so low stakes. But I think if you think a little bit ahead and in the future, suppose you're running a company, you're like, "Well, should we invest in this thing or hire this person or whatever," decisions people will try to make in the future, I think there it will matter more that you can actually trust that the answer, in fact, faithfully depends on the reasoning. And so I think even though writing out the reasoning steps in the prompt is a good step, I'm very excited for techniques like that. I think you ideally would go a step further and have guarantees that the answer causally depends on the steps, which you don't have for most of those approaches.

Jungwon Byun: (30:34) Yeah, so one example that I give to contrast these two methods is maybe you ask a language model a research question. You're like, "I just want a one-pager on this topic. What should the government do about long COVID or something?" And the language model produces an answer. And if you're actually going to make a high stakes decision about that question, you want to know, why should I trust this answer? And at that point, it's more trustworthy if the language model is able to say to you, "Well, I went out, I searched over these databases, I read these papers, I ranked these papers on these dimensions, I checked to see if they were all trustworthy, I summarized them, and then I compared it against this data." If you can list out all of those processes to you and actually follow that process, you can have way more trust in that answer. And you can then go and you can look at the process itself and say, "Do I trust this process or not?" And you can also inspect each step and did it actually follow that step. But I think with just chain of thought, I don't know, the alternative would be it just kind of rambles about a bunch of things. And as Andreas said, it may not be actually what it did, even though having some more insight into the model state is helpful. I think that interpretability by construction piece comes up again where you want some guarantee that this is what the actual model did.

Nathan Labenz: (31:57) Seems like a big part of the advantage there is also the grounding, right? I mean, in a pure chain of thought, you're just relying on whatever the language model thinks it knows, right? And it's one thing to sort of say, "Okay, we're going to go to sources and evaluate those sources, try to find takeaways and then synthesize that." So if I was going to try to continue to close the gap between the two paradigms where you guys have taken things and made the complex discrete and comparatively simple, ultimately that involves a lot of calls and each one is a small input output, and then you can view that whole history. It seems like maybe much closer to that if I was, let's say I'm an OpenAI Foundry customer and I just dropped $105,000 on my 32,000 token robustly fine-tunable thing. The main thing that I would really be wise to do, it sounds like from your counsel, would be ground in sources. That's at some point, bring in information. Maybe it could also be in one context window, but you're going to want to interrupt at certain points and bring in information. And then maybe at the end, you could sort of have a summary that goes back through all of that history and is like, "Okay, here's what I did. And then I hit some stop token and that executed a search. And then I hit some other stop token and that fetched the abstracts for all those papers or what have you." And you could, in theory, layer that on, but you'd be having layered calls as opposed to compositional structure, I guess, to the overall process. Linear with these sort of outbound calls might be at least much more comparable to the composable structure that you guys have built. Does that sound right? I'm really just trying to understand it, so feel free to tell me if I'm off base here.

Andreas Stuhlmüller: (34:11) I think the thing you described where you make calls to sources, I think that is a pretty common thing that maybe the simplest instantiation is the thing that I think Bing does, which is just do a search. Don't do it even in multiple steps. Just do a search, take the results, then do generation based on that. That's maybe version zero of that. Then I think what you described is one step more complex, which is you can do some generation, halt, do a search, pull in the results, continue. And then I think there's versions after that which are, well, one of the things you could be doing is you generate, halt, decide to farm out some subcomputation to another model that is, roughly speaking, the same situation the current model is in. It does its thing. It comes back. Then you continue generating. So I think there is an incremental path from just do everything in one prompt to more compositional settings. And I think the world where everyone pursues those more compositional paths, I think, is a pretty good world.
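
A rough sketch of the incremental "generate, halt, search, continue" loop Andreas outlines; the SEARCH/ANSWER markers and the `llm` and `web_search` helpers are illustrative assumptions, not any particular product's API. The running transcript is explicit state, so every search and result ends up in an auditable trace.

```python
# Sketch of a generate-halt-search-continue loop with explicit, inspectable state.

def llm(prompt: str) -> str:
    raise NotImplementedError  # placeholder completion call

def web_search(query: str) -> str:
    raise NotImplementedError  # placeholder retrieval call

def answer_with_tools(question: str, max_rounds: int = 5) -> str:
    transcript = (
        f"Question: {question}\n"
        "Respond with 'SEARCH: <query>' to look something up or 'ANSWER: <answer>' to finish.\n"
    )
    for _ in range(max_rounds):
        step = llm(transcript)
        if step.startswith("SEARCH:"):
            query = step.removeprefix("SEARCH:").strip()
            transcript += f"{step}\nRESULTS: {web_search(query)}\n"  # pull results in, then continue
        elif step.startswith("ANSWER:"):
            return step.removeprefix("ANSWER:").strip()
        else:
            transcript += step + "\n"  # intermediate reasoning stays in the trace
    return llm(transcript + "Give your best final answer now.")
```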

Nathan Labenz: (35:19) I guess another big opportunity for difference, and this is probably something that does exist in Elicit. I know it does from, I don't know exactly what degree, but I know from your Twitter presence that it does exist to some degree, is you can have, if you do truly break down into a compositional approach, then you can have specialized models for all the different subtasks, potentially, if you really want to, right? I guess you could also, I'm really contrasting this because I just get the sense that if I had to guess, I would say half of the Fortune 1000 is probably going to try to buy OpenAI's new thing, and they're really going to be striving for reliability in their language model output. They're not going to be generally wanting it for brainstorming. They're going to want to be completing tasks and they're going to want to be completing them reliably. We don't know what fine-tuning they're going to offer, but you could fine-tune a bunch of different models for discrete subtasks. You could maybe try to do 10 tasks in one giant fine-tuning of a giant model as well. And then you're kind of back to maybe a question of, does the super generalization help you? It might in some ways on performance. Certainly, there seems to be this bigger is better. Maybe you could help me unpack a little bit more the worry of when I'm looking at chain of thought reasoning, throw on those halts or whatever, and I get to a final answer, you said you'd really like to know that the answer was actually the result of the reasoning. And on some level it is, right, because it is incremental token prediction. So all the, one by one prediction, right? So all those previously generated tokens in some sense are causal for the tokens that follow through, obviously, a still fairly alien mechanism. So I definitely appreciate the danger of, we have really no idea what's going on in very profound ways. But how is that different? If you then pull things apart into a bunch of different language models and you're like, "Okay, this one did this task and then that task, that output feeds into the next one." How do you see the, where is that extra causality coming in versus just the one token at a time prediction?

Andreas Stuhlmüller: (37:46) Yeah. So I guess it's important to clarify that it doesn't matter whether it's the same model or different models. You definitely don't need to fine-tune a different model. It's fine if all the subtasks are done by the same model. And so the causality mostly comes through removing information. So in the case where someone is doing a task and the only thing they know is they are supposed to figure out, does this website answer this question? They don't know you're trying to make a decision about whether to take some medicine or not. They're literally just trying to answer, "Does this website answer this question or not?" You know they're not influenced in some way by making it seem more or less relevant or otherwise taking into account context that it shouldn't take into account. Whereas if it's all happening in a single prompt, then you're like, "Well, there could be some causal influence that you're not tracking."

Nathan Labenz: (38:45) Right. Everything's just a lot more tangled.

Andreas Stuhlmüller: (38:46) Yeah, everything is more tangled, which for better or worse, sometimes that's really useful because sometimes to do the subtasks, you want to know, "Oh, which are the things that actually matter for the bigger question?" But also, it has a drawback that it gets harder for you to analyze what actually happened and the pieces lose their individual meanings.

Nathan Labenz: (39:07) Yeah, okay. That's really helpful because especially with these really long, super high coherence now with some of the latest models, the ability to attend effectively to these super long windows. It does come, I've thought primarily honestly about the pros, but you're highlighting a pretty good con, which is the fact that now everything is feeding into this one decision. And that just makes for every incremental decision you have or every incremental task, you have a fundamentally noisier situation where you're wondering, well, would it have done the same? You now have the question of, okay, for that paper where we wanted to get this one bit, did that result depend on the original query? Or did it depend on the two papers before it that got analyzed in this linear stream? And I do see how that would make things a lot harder to tease apart. So, yeah, that's really interesting. Thank you for bearing with me as I try to sort out my own confusion. I took a note also from your, I think it was from your big Twitter thread, which I've recommended previously about how you break down the task and approach the composability. So your guideline is once language models do well at small tasks, roll up to bigger tasks. That reminds me of the Drexler comprehensive AI services a little bit. He's very much, let's deploy narrow superhuman AI everywhere before we even attempt to do very general superhuman AI. Have you guys taken inspiration from that, or how would you contrast what you're building to the vision that he's outlined?

Andreas Stuhlmüller: (41:08) Yeah. It's definitely related. Maybe one dimension along which to contrast it is what is the task size. I think Drexler mostly thinks about fairly large tasks. Let's say there's a service for evaluating whether an investment is good or something. And then within that service, as I understand it, it's still fine if everything is more or less end-to-end optimized. And I think we'd probably just try to push harder for even smaller, even more understandable components. And I think Eric also has more recent work. I don't know if you've seen the open agency model, which I think is also a bit more in that direction, trying to get language models to do even smaller tasks, like suggesting different proposals for policies and then different language models evaluate those proposals and such. So I think there's definitely some convergence here.

Nathan Labenz: (42:01) Yeah, that's interesting. How about the question of task size? I've noticed this in my own work. At Waymark, we use an ensemble of different AI systems, some of which, the core ones being language models, but also some visual type stuff as well because we're ultimately trying to create visual content. And I've definitely noticed over the last year and a half that it is getting easier and easier to just throw more into a single call. We used to have a fine-tuned model that would do this, and then we had a different fine-tuned model to write the voiceover script. And now we're kind of, okay, we could just write the main script and the voiceover script in the same call. That got easy enough that we can just go with it, right? Similarly with what image content would complement this text. We used to fine-tune on that separately. Now we can kind of bundle it into one thing. That's all still a pretty narrow task, certainly in the grand scheme of human endeavor. But how are you guys shifting the size of task, if at all? Are you saying, okay, these guys are getting more capable. We can trust them with somewhat bigger bites out of the overall task. Is there kind of a pull in that direction?

Andreas Stuhlmüller: (43:15) I can give the general answer, then Jungwon can maybe address it in the context of the product, if that seems helpful. But in general, I think people think about what are tasks that humans can fairly robustly evaluate. And you want tasks to be as large as possible under that constraint. It's great if models do more work at once for all the usual reasons why it's useful for models to have more context. But it's bad if you get to a stage where a human can't look at it and be like, okay, this is good or bad. I think there's some caveats to this where you're, well, maybe humans can use AI assistance to do that evaluation better, but then you still have that constraint of can a human with AI assistance robustly evaluate whether a subtask was done well or not? And you never want to push beyond the threshold where that's true.

Nathan Labenz: (44:09) Where is that kind of settling? What do you find that people can in fact evaluate effectively? I mean, for context too, for the audience, boy, is evaluation hard. I think most people probably know that. I was really amazed to see from the GPT-4 technical report that the GPT-4 versus 3.5 win rate is only a 2-to-1 ratio. It's a little bit higher than that. That's maybe 70-30. But it's not like people are preferring this 99 times out of 100. It's nowhere close to that. It's like 2 out of 3. So that in and of itself is hard. There's just a ton of noise. Again, at Waymark, we've done a lot of experimentation with which are the best looking images, because our users will have these image libraries. And one of the biggest things we want to do for them is just for convenience, pull the most beautiful images forward. And that is so easy for people. It's like a quarter of a second that they can differentiate between which one they like. But the noise and the inter-rater reliability just gives us insane headaches. The R-squared on two humans sitting next to each other looking at the same images is really tough. And so the models struggle as well, right? Because they're trained on that signal and the signal is just insanely noisy. So in your context, where are you finding that it settles in terms of what people can reliably evaluate?

Jungwon Byun: (45:41) Yeah, I feel like this is another place where breaking tasks into smaller tasks really helps with the evaluation. So we have a bunch of, in Elicit, there are a bunch of abilities to ask questions of the paper. And the paper has lots of text. It is this long context situation that we were talking about earlier. And Elicit using language models can say, oh, this paper found this result. This paper studied these types of people. This paper had these limitations. And if that's all you got and you wanted to evaluate that, you would have to read the entire paper to figure out if that was true or not. Maybe you could control-F it. But in a lot of cases, you'd have to read the entire paper, which is very expensive and time consuming to do. But because the way we build out these tasks in Elicit is compositional, what we do is we first find relevant parts of the paper, then using that part, generate the answer to that question. So when we show you the answer, we can directly link it back to where in the paper we got that. And then all you have to evaluate is the answer and a small chunk of text, a few sentences, which is way more manageable. And it becomes very obvious. Sometimes the models still make mistakes, right? But those mistakes become very obvious because you can check it. And then you can always also scroll around the paper, that area to see, was that section taken out of context? So, yeah, I think that's just another really important feature of doing things compositionally and making that easier to evaluate really quickly. And that size is definitely very manageable. But having to read an entire paper to come up with the answer is way trickier.
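
A minimal sketch of the pattern Jungwon describes: pick the most relevant chunk of the paper, answer only from that chunk, and hand back the chunk alongside the answer so a reviewer checks a few sentences rather than the whole paper. The chunking scheme and the `score_relevance` and `llm` helpers are illustrative placeholders.

```python
# Sketch: answer from the most relevant excerpt and return that excerpt as the source.

def llm(prompt: str) -> str:
    raise NotImplementedError  # placeholder completion call

def score_relevance(question: str, chunk: str) -> float:
    raise NotImplementedError  # e.g. embedding similarity or a cheap model call

def chunk_paper(paper_text: str, size: int = 1000) -> list[str]:
    return [paper_text[i:i + size] for i in range(0, len(paper_text), size)]

def answer_with_source(question: str, paper_text: str) -> dict[str, str]:
    best_chunk = max(chunk_paper(paper_text), key=lambda c: score_relevance(question, c))
    answer = llm(f"Answer using only this excerpt.\nExcerpt: {best_chunk}\nQuestion: {question}")
    # Returning the excerpt with the answer keeps the evaluation task small:
    # the reviewer compares the answer against a short span, not the full paper.
    return {"answer": answer, "source_excerpt": best_chunk}
```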

Nathan Labenz: (47:21) For that type of source linking, I can imagine a couple of ways of doing it. One would be, you have a question like, is this part of the paper suited to answer this question? And you could kind of page through to the degree that your context window allows to try to extract the part of interest. Obviously, you could also pre-chunk it and use embeddings and then kind of do an embedding comparison at runtime. How are you guys approaching that problem of grounding? Thinking with the embeddings too, it could be kind of hard to determine, the retrieval might have failed. You don't have the answer, but the answer was there. But you pulled the wrong section. The embeddings feel not super easy to evaluate. So how are you thinking about that part of the challenge?

Andreas Stuhlmüller: (48:15) Basically, you can think about embeddings as, roughly speaking, cached language model computation. So I think the ideal thing is probably more like the former thing you described, which is you go through the paper maybe paragraph by paragraph, and for each paragraph, you ask the most expensive model, think step by step and make sure you're correct and whatnot, and think really hard, is this relevant? So that's maybe the most expensive solution, and then ideally you would just do that over all papers ever written by anyone and all their paragraphs. And then you rank by probability that says it's relevant or something. So that maybe defines one kind of ideal, and then you're, well, we can't do that. Then I think you're trying to approximate it, and usually the way we go approximating it is in a multistage process where you're, well, first we want to find the thousand papers or paragraphs that might potentially be relevant. For that, we need to use something very cheap, like some embedding-based search. Then you're, okay, now we have a thousand papers. We can use something slightly more expensive, maybe some cross-encoder or something. And then you have maybe one or a couple of papers, then you're, okay, at this point, we can use the expensive model and actually pick out the things. But I think this is a setting where a lot of it is about, well, what are tractable approximations to the expensive ideal? And hopefully over time, people will find better ways to both improve what the ideal is and also better ways to distill the ideal into faster cached computations. I don't think there's something super clean. I think this is something that needs a lot of empirical evaluation and trade-offs related to how much you're willing to spend on any one query.
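
A sketch of the cheap-to-expensive funnel Andreas outlines, with each stage as a placeholder; the specific cutoffs (1,000 candidates narrowed to 20, then a handful judged carefully) are illustrative numbers, not Elicit's actual configuration.

```python
# Sketch of a multistage retrieval funnel: cheap recall first, expensive judgment last.

def embedding_search(query: str, top_k: int) -> list[str]:
    raise NotImplementedError  # cheap: nearest-neighbor lookup over cached embeddings

def cross_encoder_rerank(query: str, passages: list[str], top_k: int) -> list[str]:
    raise NotImplementedError  # mid-cost: scores each (query, passage) pair jointly

def careful_relevance_judgment(query: str, passage: str) -> float:
    raise NotImplementedError  # expensive: "think step by step" call to the strongest model

def retrieve(query: str) -> list[tuple[str, float]]:
    candidates = embedding_search(query, top_k=1000)               # stage 1: recall
    shortlist = cross_encoder_rerank(query, candidates, top_k=20)  # stage 2: precision
    judged = [(p, careful_relevance_judgment(query, p)) for p in shortlist]  # stage 3: careful check
    return sorted(judged, key=lambda pair: pair[1], reverse=True)
```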

Jungwon Byun: (50:04) From a product perspective, we've thought about showing multiple segments. This at least kind of helps bridge some of the gap. Yeah, if you can show maybe, right now, we're just showing the most relevant segment and the answer from that. But if you can let the user tab through a couple of the highly ranked segments, even if you didn't get it right in the top slot, you can compensate for where the model might have missed some things.

Nathan Labenz: (50:27) Cool. That's really fascinating. Again, I think you guys are on, the more I learn about what you're doing, the more I think it honestly really resembles what a lot of corporations are going to be trying to tackle over the next year or two, because they're just now getting to the point where, first of all, they're just aware of all this stuff. And second, the frontline demo is good enough that they're, well, if they could do that, they could probably figure out my stuff. But now they've got these questions of, okay, I've got whatever internal knowledge base that's in whatever form, and then I've got whatever tasks and processes and how do I... It's kind of like the assembly lining of knowledge work is what I sort of expect a lot of corporations are going to try to figure out how to do, for at least a lot of the processes that they have. Anything that's not driving alpha is a good candidate for can you pull it apart and turn it into some sort of knowledge work assembly line. I suspect that you guys will be in high demand as it becomes clear that this paradigm is really what the corporate customers are going to need. They may want some bigger tasks. I think they probably will want bigger tasks than what you guys are doing. But still, taking a five-minute customer service phone call and resolving an issue is one or two levels up from where you are, but it's still pretty small in the grand scheme of things. And they're going to be working really hard to figure out how can we structure all this activity that goes on in our business in ways that we can feed it into an AI assembly line and reliably get the desired behavior out the other end.

Jungwon Byun: (52:22) As we start to see more high-impact, high-precision, high-reliability applications, we'll really need that infrastructure. So for kind of approximate use cases, I think something like ChatGPT will work just fine. Like you said, maybe just a quick second opinion or a quick, what do I need to know? What should I then go and ask more questions about? It'll be very useful. But we're seeing right now, obviously, people know that these models have issues with hallucination. I'm sure a lot of those will get trained out or they'll be better at citing their sources. But I think even in that world, it still won't be clear that what the model is telling you is the best possible answer. So not just not incorrect, but the best possible answer. And I think with these more serious use cases, we'll need systems that can give us the guarantee that this was the best possible answer. So we'll want to apply these models much more systematically and not just take the first thing that they tell us, but have them run through every single option and do really detailed checks and double checks and think about everything that could go wrong. And then, having considered all of that, this is the best possible answer. So I do think that with more serious use cases, the way we relate to these models, the way they're deployed will change.

Andreas Stuhlmüller: (53:37) Yeah, I also agree, Nathan, with your analogy of assembly lining of cognitive tasks. We've been thinking about that a lot as we build the new version of Elicit. So I think Elicit right now still looks a lot like quick searches, but we've been thinking more about what we've been calling large scale industrial cognition, where you're not just trying to get some quick answer to a thing. You're trying to actually define some process where you go through maybe all papers that you have internally at your organization. And then for each paper, you extract the relevant insights and then you want to do some systematic ranking at the end of the insights, maybe connect it to other databases. There's some fairly large scale, probably pretty expensive process that you want to run, but at the end you want to have some guarantees about how good the result is. So I do think there will be a lot more of that in the future.

Nathan Labenz: (54:33) That also gets a little bit to this question. I think it's ultimately both, right? So I'm not looking for an either-or answer because I think that would be silly. But there's this debate right now between the copilot model on the one hand and the assistant model on the other hand. The copilot would be characterized by you doing your work and the thing is helping you out along the way, answering your questions, autocompleting for you. But I think that the vision that people have in their minds there is largely you are doing work in the same way that you do it today and you're getting help, as if you had somebody sitting next to you helping you. Then the assistant model is, I want to tell the system what to do and its job is to do it, come back when it's done, and hopefully have done a good job. And it sounds like you guys are maybe more trying to build an assistant model type of thing where you want the user to be able to delegate real tasks and not just be getting the weaving in and out of AI commentary on the research process, but actually get a quality literature review out and be able to advance to that square on the board in a discrete way.

Jungwon Byun: (56:01) I think there are maybe two dimensions to this. One is what is the most helpful way to build this kind of copilot, helpful voice next to you? And I think our bet is to figure out what the human is actually doing and then to break it down and then train it on those tasks. So instead of just letting the language model talk and hoping it's helpful, being more opinionated about what is helpful in this workflow, teaching the model to do that, and then we know that the thing the model does will be helpful. And then I think that's about how we approach building in the short term. In the long term, I think it remains to be seen. How exactly do we want to integrate these models into more workflows? So in the beginning, maybe they'll do, just how much of that process can be fully automated? And I think our approach will be, do the small tasks, work up to bigger and bigger tasks. It's very unclear if we'll ever be able to automate the whole thing. We'll just keep automating more and more as it makes sense along the way. Some people talk about how maybe there's just a fuzzy component of reasoning that can't be made explicit and easy to automate. So who knows? Maybe we'll encounter that at some point.

Andreas Stuhlmüller: (57:15) There is a fundamental constraint with the copilot model, which is everything needs to be really fast. You want the model to respond immediately to you. You want your code completions to be there in the moment. And I think that's just a fundamental limit on how much cognition it can do. And so if there are tasks where you're, well, I really would like the model to think harder about this thing, then either it's going to be an extremely slow copilot, which I think maybe that just starts being more equivalent to the assistant. But I think I would be very surprised if all use cases that we are interested in just look like the copilot model because I think that would mean accepting that we never do more thinking than some very small amount for any given task.

Jungwon Byun: (57:58) Yeah, I think plausibly there will just be some type of work that machines are better suited for than people, and it makes sense for machines to divide and conquer here. Language models will just be able to read way more text and extract information from it way faster and be way more comprehensive, and they should do that type of work. So we might not exactly be on a spectrum from minimal automation of what the person does to maximal automation of what the person does; it's more about specific tasks and how to split them up.

Nathan Labenz: (58:30) Yeah, that was a big point in the economic analysis report, the GPTs are GPTs paper that accompanied or fast followed on the GPT-4 release. An interesting study there of, okay, if we break down, it's been commonly said, and I think it is true, that jobs as they exist today are not really amenable to AIs showing up and doing the jobs. There's too much context needed. There are physical demands. There's whatever. The report says this: you look at any given job, and it's very rare to find one where you feel like the AI could do the whole thing. But then they break it down into tasks and say, okay, well now what are the tasks that make up this job? A composability paradigm for what is a job. And then they find, well, a lot of those tasks for a lot of jobs are in fact tasks that language models are now capable of doing. So yeah, I think we will certainly see both. But it feels to me right now like there's honestly, I think, a lot of wishful thinking feeding into the copilot model. It seems like a bias, a wishful thinking bias toward the copilot model. And one thing that's been really interesting in doing these interviews for this show has been I always ask, I'll ask you guys too, but I always ask, what AI products are you using regularly? And strikingly, the answers are very few. There's not that many products that are cracking into folks' daily lives at this point. And I don't think it's really because in most cases, there aren't tasks that they could delegate, but rather it's often just not quite helpful enough to go do it totally on the fly in these, oh, let me pop over to here and get my social media content written or whatever. I think what people really want is to be able to delegate whole tasks and have them done. But then there are some people who are either afraid of that, or it just feels uncomfortable to them to think of building that kind of technology. Or it may just be bad marketing, right? I mean, that's the other thing. They may know on some level that things are headed one way, but they may still be telling a copilot story because, what else is GitHub going to do? Right? I mean, it's going to be tough for them to reposition around the AI developer for lots of reasons.

Jungwon Byun: (1:01:11) For me, the barrier is when I'm using some of these language models to actually do a job that I care about, I have very low tolerance and patience. I don't want to read this paragraph. I don't want to hear about how you're this helpful assistant. I don't want to give you feedback on this answer that you didn't quite get right. I just want the best answer right away. And so I wonder how many other people have this. Some people, I think, are enjoying this flow with the language model where it suggests something and they give it feedback. But I'm sure there are more people like me who just want to get this task done. And there's just a little bit too much back and forth right now around interacting with this assistant. And I think that's, yeah, I think we'll probably start to see more products that apply these language models in these higher reliability, well defined, discrete tasks. Because when you're trying to get something done, I don't know that you want to have a long conversation about it. You just want the thing done.

Andreas Stuhlmüller: (1:02:07) I had one other thought on the marketing point you made, which is I think even if you ultimately want to automate whole tasks, maybe it's pretty useful to get feedback from people. The models are still relatively weak in many ways, and so being able to have millions of users use your product and give feedback on the places where they break down, I think, is pretty useful, even if your long game is to cut the people out of the loop. Maybe that's a cynical view, but I think that's also one view that's compatible with what's going on.

Nathan Labenz: (1:02:41) Yeah. There definitely is an element of that. We had Ohn-Bar Nimrod from Google Robotics Research on, and she was talking about a paradigm where, as they envision deploying robots in the future, there are obviously going to be so many edge cases and long tail scenarios. As I understand it, their plan is to train the model to know when to stop and call for help and then have a remote operator in a call center somewhere that will literally use joysticks to do the task for the robot. And then there's your training data patching. So, yeah, there's definitely going to be a lot of that. But, yeah, I do wonder. I mean, Copilot is very good, right? They have a very natural feedback loop there: if you accept a suggestion, it must have been at least decent. They're getting a strong signal from a lot more suggestions probably than almost any other product. Anyway, enough Copilot. Let's talk about Elicit and its evolution. Right now, I go to the site, and it invites me to ask a research question. And what I understand is you're expanding the workflows that it can do. So tell us how you are expanding, what new workflows, and what we can expect next.

Jungwon Byun: (1:04:05) Yeah, so I think the way to think about this is this initial phase of Elicit was about taking unstructured data and information that's trapped in PDFs and structuring it into a much more organized table where you can efficiently skim across a lot of information at once and automate the process of going into each PDF and figuring out what it is that they did and what I should know. And then the next phase of Elicit will be, okay, cool. Now we have this structured data model, this table. How can we run interesting and decision relevant queries on top of that? How do we take this information and then make it even more useful and easy to get good answers out of, so that researchers don't just walk out with a long list of PDFs that they still have to read, but instead understand what those papers are about better than they did before? So specifically, the new workflow that we're launching, we're calling list of entities. And the idea is, can we extract concepts that are discussed across multiple papers and then show you the concepts with links to the papers and the parts of the papers where they're discussed? So this could be a list of datasets that were used in lots of papers, a list of machine learning benchmarks that a bunch of models were evaluated on, a list of effects of a certain medication, a list of interventions tried. I'm trying to get it to work for a list of outstanding research directions, for example. So this list form is really quite general. I think the core thing here is querying this table in a much richer way and organizing the information in a way that's way more decision relevant. So, yeah, that's the big next workflow. And then more generally, to think about moving from one workflow to supporting many workflows. So some of the other ones we have on the docket are running the current version of Elicit, but over a much larger set of papers. So we have a ton of organizations who have their own databases of thousands or tens of thousands of papers and want to basically get an Elicit table for their own papers. So thinking about supporting use cases like that. Yeah. Those are some of the things down the roadmap.

Nathan Labenz: (1:06:21) You guys are a research nonprofit, but you describe yourselves as a product-driven, or product-led, research nonprofit. How do you envision that evolving in the future? It sounds like you're thinking of offering this on a commercial basis to large companies. Is that right?

Jungwon Byun: (1:06:40) In terms of our long-term roadmap, we see this as having three phases. Phase one is discover what is known. Phase two is discover what is unknown. And phase three is decide in the face of the unknowable. We're currently in phase one, discover what is known. The core problem we're trying to solve here is the fact that there's already a tremendous amount of information that exists in the world. It's just trapped in unusable PDFs and books. There's just too much information. It's not structured to be very decision-relevant. It's very hard to query. It's really hard to read an entire book to get a very specific answer to a question you're looking for. So a lot of the capabilities we want to build out in this phase are around information retrieval and synthesis. It'll probably mostly focus on text, and we'll be trying to take this unstructured text that exists in the world in papers and start structuring it into a tidier data format. That's what you see in Elicit today. We're taking a bunch of these papers, we're converting them into nice tables. And then this list of entities workflow is a way to query that table in a richer way, to reorganize it at a higher level of abstraction where you're looking more at concepts discussed across papers rather than the papers themselves. So I think phase one will look a lot like converting text into richer structured data formats and then flexibly querying that underlying data into more decision-relevant insights. Phase two is discover what is unknown. In this phase, we want to help researchers start to contribute to that body of knowledge. Having parsed through all the texts that already exist in the world in phase one, we should be able to start finding gaps in that literature and then helping researchers resolve those gaps or identify cruxes that they can resolve. I expect that we'll relate to language models totally differently in phase two than we did in phase one. Here, I think the core technical capabilities we need to build are around causal modeling. So using these language models to develop good models of the world and how it works. You can think of all the work in phase one as collecting a bunch of observations, observations that people have gathered from running experiments. Then we want to collect those observations to build towards a richer causal model that we can generalize into new contexts. Then researchers can take that causal model and come up with new hypotheses that they can design experiments around. So it'll be very different. Where in phase one we might want to apply language models at scale in batch and go broad, in phase two I think we want to go much deeper. And then in phase three, decide in the face of the unknowable, we'll probably switch user groups a little bit and move from research producers to research consumers. People who want to incorporate research into their workflows are making very high-stakes decisions based off of research, but are not researchers themselves and are doing much more than just understanding the state of science.

Jungwon Byun: (1:09:55) And again here, I suspect that we will relate to language models very differently. I think here the problem we want to address is that for some really high-stakes decisions, or some parts of those decisions, there is no right answer. It really comes down to your values, or your preferences, or just certain choices you want to make. And so at this stage, I think we want to use language models totally differently and use them to elicit the values and maybe unresolvable assumptions or beliefs about the world that the individual has or the decision-maker has. For example, if you want to make a decision about your career, part of that might be informed by research, but a lot of the rest of that might just be around your own values or other hard-to-resolve beliefs about the world that you have. So those are our three phases. I think the important thing to emphasize is that literature reviews are very much a small part of our long-term vision. We don't see ourselves being a lit review tool forever. We don't see ourselves just being a search engine forever, but understanding the state of literature and being able to work with that is just a really important building block to everything else we want to do. So that's why we're starting here. And if that sounds exciting to you, we're hiring for a product manager. This is one of my biggest priorities right now. So if you are interested in really helping us convert this grand master three-phase plan into a concrete roadmap and are excited about all the different things Elicit will become over time, please reach out or recommend people you know that seem like good fits.

Nathan Labenz: (1:11:35) How science-y is the target market for that versus just general use corporate? Because I was thinking, who would be your number one most natural customer? And I don't have a great answer, but the first thing that came to mind was the Broad Institute. They probably have a ton of stuff that they might like to be able to search over. I don't know how much of it would be public versus private, whatever. But that's one. Certainly they have the science pedigree. But then I think, okay, Coca-Cola, would that be a target customer? They're doing stuff with AI, I guess. Is that in scope or is that just too commercial, too consumer-y, not science-y enough to be within what you guys would try to do?

Jungwon Byun: (1:12:21) I think broadly, we want to prioritize researchers in a broad sense. And that will get broader over time. So up to now, most of our users are academic researchers, PhD students, or professors. But there are researchers like that in so many different places, at many different companies. There are researchers who are trying to study the effects of sugar in Coca-Cola or what happens if you reduce the sugar content, probably. Lots of researchers in industry, and I think the activities that they are doing are very similar, even if the domains are slightly different. So we do want to support all of these people who are trying to have good models of how things work in the world and have good science-backed beliefs and are thinking a lot about this information. And that's really a big part of why we started with literature review. When we're thinking about scaling up reasoning and research, this is a workflow that's general to many different domains and generalizes even beyond research. So in the short term, we'll keep focusing on more academic researchers, but over time, we want to scale to everyone who's thinking really hard about evidence and trying to make good decisions in light of that.

Andreas Stuhlmüller: (1:13:34) I think it depends less on what is the flavor of the company, do they sell sodas or not? It's more like, what is the task they're doing? Are they trying to do serious reasoning? I think we probably do not want to just build an internal search engine for someone's documents. I think there are going to be plenty of others who are going to be doing that. But we do want to help if there are people who are like, well, we're trying to figure out how to use AI to make better decisions or figure out what's true in some setting. I think then it becomes more interesting to us.

Nathan Labenz: (1:14:08) Okay, that's really interesting. When you think about going to customers, like any sort of big company, would you imagine allowing them to create their own composable workflows that are riffs on yours? And then I guess the follow-up question would be, as soon as I imagine doing something like that, then I imagine, well, these people are probably going to struggle with this. And then I go to, well, maybe I could get a language model to give them some good suggestions, which might also be in your plans separately. So at some point, you guys can't do all the decomposing, right? What's the future of the decomposing process?

Jungwon Byun: (1:14:52) Yes. This is the dream. We tried, actually, earlier this year to use language models to decompose tasks, and they just weren't quite ready for that, somewhat surprisingly. And so that's why with this list of entities workflow, we're still hard-coding the workflow. We're saying, first, you should search over these papers and you should extract these statements, and then you should cluster them in this way, et cetera. And I do think for a lot of users, they don't want to look at a blank screen and think about how to make this tool do what they want it to do. So I think for most people, they'll want a predefined workflow to start with. And then from our perspective, the evolution will be, we'll let users edit parameters of the workflow to start. So now I have this list of entities workflow. Maybe I can change the search query in the first step. Maybe I can get a few more papers. Maybe I can delete some of the papers. Maybe I can change the way the clustering is done. That'll be the next phase. The phase on top of that is I can make more meaningful changes to the steps of the process. Maybe I'll search over a different corpus. Maybe I'll add another search step at the end to check if this entity came up in Google somewhere or whatever. And then I think once you're there, then you can get to a place where people are just creating their own workflows. Editing each step could lead you to edit the workflow to an unrecognizable state. And then from there, you can just start from scratch and create the workflow from that point. And then once we have many more examples of that, presumably we could have language models also automating those steps or co-piloting, suggesting the steps that someone might want to take. And interestingly enough, we already have some researchers who are basically doing this. So the main workflow in Elicit is lit review, but there's this secret page called Tasks, which is our graveyard of all the things we experimented with before we landed on lit review. And it's a ton of things like brainstorm research questions, suggest search terms, generate counterarguments, all these very well-defined little tasks. And we have some researchers who have basically manually stitched together the results of those tasks and created their own workflow in this janky way because we don't currently support it. They have a question, they generated a bunch of research questions, ran each of those questions through lit review, got a bunch of papers, picked the abstracts of those papers, ran the abstract summarization task, and then generated a bunch of other ideas from that. So they are currently manually stitching together these tasks to do more complex workflows, and that's something we want to automate and make way easier to do natively in the platform going forward.
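
To make the hard-coded workflow concrete, here is a minimal sketch of what a fixed "list of entities" pipeline with user-editable parameters might look like. All names are hypothetical and not taken from Elicit's codebase; the step implementations are passed in as plain functions, so only the knobs in the config would need to be exposed to users at first.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class ListOfEntitiesConfig:
    search_query: str        # user-editable knob
    max_papers: int = 50     # user-editable knob
    num_clusters: int = 8    # user-editable knob

def run_list_of_entities(
    config: ListOfEntitiesConfig,
    search: Callable[[str, int], List[str]],                     # step 1: find papers
    extract: Callable[[str], List[str]],                         # step 2: pull statements out of a paper
    cluster: Callable[[List[str], int], Dict[str, List[str]]],   # step 3: group statements into concepts
) -> Dict[str, List[str]]:
    # The workflow itself is fixed; only the parameters (and, later, the steps) vary.
    papers = search(config.search_query, config.max_papers)
    statements = [s for paper in papers for s in extract(paper)]
    return cluster(statements, config.num_clusters)
```

Letting users edit the config corresponds to the first phase described above; letting them swap or reorder the step functions would be the next.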

Nathan Labenz: (1:17:25) It's like Zapier for reasoning.

Jungwon Byun: (1:17:28) Yeah, exactly.

Nathan Labenz: (1:17:29) Does the codebase get hard to corral? I mean, when you describe all those different tasks and especially if they're all explicitly defined, it sounds like a lot of code under the hood. Does that become a challenge?

Andreas Stuhlmüller: (1:17:45) I mean, as with any fast-moving field, it's an ongoing thing where we're trying to figure out the right abstractions, but I think ultimately there aren't too many of them. So there are a few core reusable components like search, answer given results, summarize, rank, cluster, filter. I often think about it as functional programming with natural language. In functional programming, you have some core higher-order abstractions: you have map, fold, filter, and only a handful of others. And then you can apply those to more atomic tasks, which are either deterministic computation or language model calls. But it's not like there is a huge number of inbuilt components that you need to add. In some sense, that's the beauty of language models: you don't have that many components. It's about a bunch of pieces, and you need to be reasonably smart about how you compose them, and I think that gives you a lot of power. I don't think we'll be in the business of building out a zillion components. I guess I don't know how this relates to the Zapier analogy, but I think we are more in the infrastructure for composing stuff business than in the make lots of small bits business.
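
As a sketch of what "functional programming with natural language" might look like in practice, the snippet below applies a few ordinary higher-order helpers to a single language-model primitive. The `llm` function and all prompts are illustrative placeholders, not Ought's actual components.

```python
from typing import List

def llm(prompt: str) -> str:
    """Placeholder for a single language-model call (API or local model)."""
    raise NotImplementedError("plug in a real model call here")

def lm_map(items: List[str], instruction: str) -> List[str]:
    """Apply the same instruction to every item, e.g. summarize each abstract."""
    return [llm(f"{instruction}\n\n{item}") for item in items]

def lm_filter(items: List[str], criterion: str) -> List[str]:
    """Keep only the items the model judges to satisfy a criterion."""
    return [
        item for item in items
        if llm(f"Does the following satisfy this criterion: {criterion}? "
               f"Answer yes or no.\n\n{item}").strip().lower().startswith("yes")
    ]

def lm_rank(items: List[str], question: str) -> List[str]:
    """Order items by a model-assigned relevance score."""
    def score(item: str) -> float:
        reply = llm(f"On a scale of 0 to 10, how relevant is this to the question "
                    f"'{question}'? Answer with a single number.\n\n{item}")
        try:
            return float(reply.strip().split()[0])
        except ValueError:
            return 0.0
    return sorted(items, key=score, reverse=True)

def answer_from_abstracts(abstracts: List[str], question: str) -> List[str]:
    """A workflow is then just ordinary composition of the pieces above."""
    relevant = lm_filter(abstracts, f"it is relevant to the question '{question}'")
    summaries = lm_map(relevant, f"Summarize with respect to the question: {question}")
    return lm_rank(summaries, question)
```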

Nathan Labenz: (1:19:11) What about when you get to low-level bits where, if it's medicine, then as you're validating the trustworthiness of the result, you might ask, is it a double-blind study or not? But then if I'm doing some different question in AI, then generally, that's not really relevant. So to me, it seems like there'd be this fractal complexity at the low level, and maybe those are different components, but I'm still struggling to imagine how you could handle the breadth that you handle without some sort of dynamic decision-making to figure out where you want to be on those low-level tasks.

Andreas Stuhlmüller: (1:19:57) Yeah, there are different sources for that information. One source is language models themselves, which probably know some of those things. So you could have dynamic task decomposition where you ask the language model what criteria computer science papers are generally evaluated on when reviewers make decisions. That would give you a bunch of relevant criteria. And then the source we talked about just moments ago is people using the platform who have views on what they consider important or less important. In some settings, we might also be the users of our own platform and might also have views that help people come up with things like what's illustrated right now. The risk of bias analysis is a thing that we put in. But I think the trajectory over time is probably more of that stuff comes from some combination of users or language models.

Jungwon Byun: (1:20:50) Yeah, I actually think when you look at it, there's a lot more generalization than the example you gave might suggest, Nathan. In both that computer science question and biomedical question, the activity that the model has to do is the same. It's usually given this segment from a text, what is the answer? So it's question answering over a piece of text, basically. According to this part of the paper, what is the answer to that question? The source of that text can be a computer science paper or it can be a biomedical paper, but the activity of using this text to answer this question actually generalizes to both of them. So far in our experience, those two don't need to be treated differently. And then maybe in some of these higher level workflows, they might be a little bit different. As Andreas was saying, maybe this risk of bias checklist where we ask about sample size and other factors, that checklist looks a little bit different for medicine than it does for machine learning. But we might just want checklists in both cases. And then the way we create those checklists might still be pretty similar, even if the content of them is different.
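
A rough sketch of the dynamic version of this idea: instead of hard-coding a domain-specific checklist (double-blind? sample size?), ask the model to generate the criteria for the domain and then apply each criterion with the same question-answering-over-text primitive. The `llm` stub and the prompts are assumptions for illustration, not how Elicit actually builds its checklists.

```python
from typing import Dict, List

def llm(prompt: str) -> str:
    """Placeholder for a single language-model call."""
    raise NotImplementedError("plug in a real model call here")

def generate_checklist(domain: str, n: int = 5) -> List[str]:
    """Ask the model itself which criteria matter for evaluating papers in a domain."""
    reply = llm(
        f"List {n} criteria that reviewers typically use to assess the rigor of a "
        f"{domain} paper. One criterion per line."
    )
    return [line.strip("- ").strip() for line in reply.splitlines() if line.strip()]

def apply_checklist(paper_text: str, checklist: List[str]) -> Dict[str, str]:
    """The same question-answering-over-text call works for any domain; only the checklist differs."""
    return {
        criterion: llm(f"According to this excerpt, {criterion}?\n\n{paper_text}")
        for criterion in checklist
    }
```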

Nathan Labenz: (1:21:58) So today, it sounds like there's a challenge that you guys have probably made a real art out of. We had something kind of like this with Waymark at one point in time where, especially before language models, we were trying to generate content for people in this automagic way. And so our customer base was all small businesses, so the diversity was insane. We got pretty good, definitely not as good as the language model experience is now. We got pretty good for a while at classifying businesses into one of 20 categories and then having uncanny valley copy for a lot of those categories where we'd be like, all right, well, if this is a law firm, then we're going to classify it as professional services. And then the kind of marketing messages that they would want to talk about would be that they're trustworthy and trusted, versus if you're a restaurant, you're not going to go with messages like that, but you'll go with great taste and what a great experience and how much fun it was. So I imagine you have to work pretty hard to think about what are the right types of questions. Because even in that scenario, it's just question answering over a text. But the question of what question is it answering? I imagine you've thought pretty long and hard about what are the questions that are most generally applicable. I wonder if there are any questions that you could share that are like, yeah, we don't ask is it double blind, because that doesn't generalize super well. But instead we find this is the magic question to ask for evaluating a random piece of text from a random paper in a random field.

Jungwon Byun: (1:23:47) Yeah, generally the things people want to know across all domains, this is maybe for empirical domains, and I do think for theoretical domains it might be a bit different. But within empirical domains, the fundamental questions are what did they do? What did they find? What did they not find? At a very high level. Last year we were focused on helping biomedical researchers, and then this year we started trying to help more people who were working on AI safety or empirical machine learning research. And I think a lot of the structure generalized, even if the specific questions were a little bit different. So when we ask what did they do in medicine, we would ask what was the population, where were they located, what was the intervention. And now we ask what was the data set, what was the model used, what were the techniques used. But yeah, we haven't really run into significant problems with generalization there.

Andreas Stuhlmüller: (1:24:52) Yeah, I think what did they not do is maybe the one magical question to ask about everything. I always find it most useful to ask, what limitations do the authors say their work has? That's extremely true for machine learning papers because if you just look at the abstract often, you're like, wow, that's amazing, this solves everything. It's also true of other work because knowing what they didn't do really helps you understand the boundaries around the thing they did, often better than the literal explanation of what they did. If I had to pick one question, maybe that would be it.

Nathan Labenz: (1:25:25) I know you guys are connoisseurs of language models, and I've seen from some of your published stuff on Twitter that you're using a mix. So I'd love to get a sense for what the ensemble of language models are that you're using. I know you've done some in-house training and also are using some commercially available stuff. To the degree that you're comfortable doing so, I think people would learn a lot from that.

Andreas Stuhlmüller: (1:25:55) So I think we've tried basically everything that exists, including all APIs like Anthropic, OpenAI, Cohere. I guess we haven't tried the Google API. Anyone from Google listening, please give us access. And then we've tried most of the open source models too, I guess Galactica and GPT-J, Flan-T5. The one that we've found most useful for deployment in practice right now is the Flan-T5 XXL model, the 11 billion parameter one. It's a good trade-off between reasonably small, not too hard to deploy, and reasonably powerful. There's been a bit of a switch from mostly using the OpenAI API before, and the reason for the switch is mostly that it does get very expensive. We have something like 250,000 users now, so it gets very expensive and it's just a lot cheaper to deploy your own thing. This space is changing very quickly. I wouldn't be surprised if we talked to you again in three months and we were like, actually, we switched everything to some other model, so I wouldn't make too much of it.

Jungwon Byun: (1:27:07) This is generally something that we think we have to be really good at, and probably a lot of language model applications will want to be really good at, is the ability to very efficiently test and deploy and swap in different models. Partially because there will continue to be so many of them from commercial providers as well as open source providers, but also because the rate at which these model capabilities are changing is moving so quickly that I really want us as an organization to be very good at taking a new model and then figuring out, okay, what new workflows are enabled by this and what is not, what is still missing. This is something that I really hope we get really good at as an organization. I think it's somewhat research-y in nature as well.

Andreas Stuhlmüller: (1:27:47) One of the things we haven't been doing too much of but that we're very excited about is combining, ensembling multiple models. Especially if you have models that are trained on different data, and you run them on a task and they all agree on the answer, I think you can be a lot more confident that the answer is correct than if you just use one model. Often people are like, well, aren't the big labs going to do everything? I think that's maybe one situation where that's not the case. There is room for other organizations to be like, okay, we're trying to combine the results of different models in ways that lead to overall better performance than you could have gotten from any one model.
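
A minimal sketch of that ensembling idea, assuming nothing about which models are involved: run the same prompt through several independently trained models and only accept the answer when enough of them agree; otherwise flag it for more careful handling.

```python
from collections import Counter
from typing import Callable, List, Optional, Tuple

def ensemble_answer(
    prompt: str,
    models: List[Callable[[str], str]],   # placeholders for whatever backends are in use
    min_agreement: float = 0.66,
) -> Tuple[Optional[str], float]:
    """Return the majority answer and the agreement rate, or None if agreement is too low."""
    answers = [m(prompt).strip().lower() for m in models]
    top, count = Counter(answers).most_common(1)[0]
    agreement = count / len(answers)
    # Agreement across models trained on different data is treated as a confidence signal;
    # disagreement routes the question to a human or to a slower, more careful process.
    return (top if agreement >= min_agreement else None, agreement)
```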

Nathan Labenz: (1:28:27) Yeah, okay. That's really interesting as well. That's another good nugget. So I mean, one thing that I would just highlight here is you have an 11 billion parameter model working roughly as well, I guess, as top of the line models once fine-tuned for your particular tasks? How narrowly are you fine-tuning those models? Do you have a whole set of different Flan-T5 XXL 11Bs or whatever exactly it was?

Andreas Stuhlmüller: (1:28:58) No. It's fairly broad, although the tasks still have relatively short answers generally. The tasks that we've used this for are, one, Elicit has question-dependent abstract summaries. So if you ask a question in Elicit, it doesn't just tell you here are the papers and here's the abstract of each paper. And it also doesn't just say here's a TL;DR for each abstract. It says here's a TL;DR that was computed for your question specifically. So that's one task that we fine-tuned the model for. And then another task is just question answering from papers, like the task we talked about a bunch earlier, where given these paragraphs from the paper and this question, tell me concisely what the answer is. In both cases, those tasks are not that specific; I think they apply to lots of papers. But they do share that you want to use a common rubric, where the rubric includes things like: is your answer concise and truthful and useful to the reader? And there are a few other entries in the rubric. So that's the rough shape of the training we've been doing.

Nathan Labenz: (1:30:11) What is the role of, I mean, it started to sound like a constitutional AI approach there for a second. How are you thinking about incorporating AI into the evaluation and feedback cycle? That definitely seems to be one of the biggest trends in the space right now.

Andreas Stuhlmüller: (1:30:30) Yeah, yeah. So that's exactly right. That's the constitutional AI approach where, for context, usually in RLHF, reinforcement learning from human feedback, you have human annotators. They look at two options and ask which of these options is better. And so in constitutional AI, basically what you do is you replace the human annotator with a language model, ideally as large a language model as you can find because the evaluation task is pretty tricky. And then you tell the language model, here's the rubric, here are the options, judge the options according to the rubric and select the one which is better. This gives you pairwise judgments. Then you can train a reward model on those judgments. And then once you have the reward model, you can then train a generation model, in our case, as mentioned, Flan-T5, on getting high rewards from that reward model. And that's a process that our engineer Charlie has been leading and that has been working very well for us.
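
For context, here is a rough sketch of just the labeling step described above: a judge model compares two candidate answers against a rubric and emits a pairwise preference, which would then be used to fit a reward model and, from that, to train the generation model. The rubric text, prompt format, and `judge` callable are illustrative assumptions, not Ought's actual setup.

```python
import random
from typing import Callable, Dict

RUBRIC = ("Prefer the answer that is concise, faithful to the source text, "
          "and useful to the reader.")  # example rubric entries only

def pairwise_label(
    judge: Callable[[str], str],   # ideally the largest model available, since judging is hard
    question: str,
    answer_a: str,
    answer_b: str,
) -> Dict[str, str]:
    """Have the judge model pick the better of two answers, yielding one preference pair."""
    # Randomize presentation order to reduce position bias in the judge.
    first, second = (answer_a, answer_b) if random.random() < 0.5 else (answer_b, answer_a)
    verdict = judge(
        f"Rubric: {RUBRIC}\n\nQuestion: {question}\n\n"
        f"Option 1: {first}\n\nOption 2: {second}\n\n"
        "Which option better satisfies the rubric? Answer '1' or '2'."
    ).strip()
    if verdict.startswith("1"):
        chosen, rejected = first, second
    else:
        chosen, rejected = second, first
    # Many such (chosen, rejected) pairs are what the reward model would be trained on.
    return {"prompt": question, "chosen": chosen, "rejected": rejected}
```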

Nathan Labenz: (1:31:27) You also mentioned you've used the biggest language model you can find in the evaluation. Obviously, new biggest available language model just dropped in the last week and a half. What are you seeing right now in terms of the absolute frontier of language models? In other words, is GPT-4 changing the game for you at all? What are the tasks that are kind of the ones you wish it could do but it can't yet do?

Andreas Stuhlmüller: (1:31:57) My very brief answer is, yeah, it is a significant step up. I think it still is not great at avoiding hallucination. I keep saying this every time we talk to someone from OpenAI or any of these AI research companies, it's like, stop making bigger models, just make them hallucinate less. Tying it back to our earlier discussion about error rates and such, I think if you could just get high robustness, that would often allow you to do more than if you had a bigger model. If you ask the model, for example, who is Andreas Stuhlmuller? It would probably give you a description that kind of sounds like me, but not quite, and would have some made up facts about what universities I went to. That's the thing we wish for that we don't have yet. GPT-4 nominally also supports image inputs, that sort of thing, but that isn't really deployed yet. But I think that will actually be quite transformative because a lot of papers communicate substantial information in their figures and their tables. And I think being able to access those will be very useful. So I'm excited to see how that goes.

Jungwon Byun: (1:33:08) There are just so many workflows left that we would like to be able to use the models for that we can't. I think a lot of it comes down to having good causal reasoning and logical reasoning abilities. One thing that I would really love to be able to use the models for is efficient clustering and cutting up of a space. So if you have a bunch of research ideas, how do you organize them? Or a bunch of ways you can set your product roadmap, how do you cluster the space? What are actually the cruxes and the constraints that segment the space up most efficiently? And another one is—I think this came up earlier in this conversation—the ability to know when to stop and realize that it's hit a dead end. Part of why our earlier experiments this year with trying to use the models to choose the reasoning steps themselves didn't work is because the model was just really bad at being like, wait, I'm confused. This answer is wrong. I need to go somewhere else. And once stuck on a path, it just keeps going down the path. I think that makes sense. It's a next token predictor. It's just going to keep predicting tokens and not be like, stop and go do something else. So that is another capability that I really hope can help unlock a lot of good reasoning.

Nathan Labenz: (1:34:27) Yeah. I've observed that too. I was also thinking about that recent jailbreak, I guess they call it token smuggling, where you can get GPT-4 to generate some flagrantly toxic or whatever kind of content that it ordinarily wouldn't if you sort of embed that generation in a function and it sort of reads the whole thing as code and attempts to execute the function. And it only needs a few tokens. I almost imagine it as pulling on a string, pulling a thread out of a sweater or something. And then once you've just pulled a little bit of toxic content out of it, next thing you know, the rest of it just kind of unspools. There's almost a mechanistic insight there somewhere, I think.

Jungwon Byun: (1:35:20) Yeah, the other thing that I think people have already started to notice is that a lot of the reinforcement learning is causing the models to be incredibly uncalibrated and overconfident in their answers. And so that is preventing us from being able to use these models in particular ways. We would really love to be able to use GPT-4 for ranking, but it's just really not helpful for things like that. So I don't know how we're going to resolve this tradeoff.

Nathan Labenz: (1:35:49) Yeah. They're also not giving the log probs anymore either. So you've got kind of a couple barriers on some of those things. You mentioned earlier a headline number of users, but one thing I'd love to hear if you have any kind of interesting anecdotes or observations on is how people are using your product. You said a little bit about who they are. Is this something that they come to from time to time? Do they sort of sit there all day rifling through lit reviews? Anything that's interesting or surprising about your users' behavior, I'd be really interested to hear about.

Jungwon Byun: (1:36:25) Yeah. I think there's a pretty wide spectrum of use cases. So, maybe something like 60% of our users are academic researchers, but the rest of them work in think tanks and government and finance and consulting. Some of them are just small business owners. A lot of people actually from medicine, a lot of clinicians and medical people. There's definitely a core group of people who are using it very regularly. At least 10,000 people use it more than once a week. When we built Elicit, we took a lot of inspiration from the existing systematic review process. It's a very common process mostly in biomedicine. And it's a pretty structured process for synthesizing a lot of research and trying to understand across many different papers what is the answer and how do we triangulate the results across different papers. So different teams who are working on that are actually using Elicit as part of their process to find papers and extract information that they need to work on those types of projects. And then, yeah, I think the other cool thing is finding researchers like the one that I mentioned earlier who are stitching together many different tasks and hacking together their own workflow automation and all of that.

Nathan Labenz: (1:37:39) When people are hacking your product to get access to things that it can do that you haven't bundled up for them, that's usually a pretty good sign of something. So we mentioned a little bit earlier, just briefly, that you guys are a nonprofit research company, but you do have some business plans. It seems like nonprofit AI orgs developing business plans is also kind of a trend. How are you guys thinking about your future as a revenue generating, self-sustaining organization perhaps?

Jungwon Byun: (1:38:13) We really want to be able to scale Elicit's impact up quite a lot and be able to have it guide and really help with a lot of very high stakes decisions. It's grown a ton over the last year, so we are hitting the limits of what is feasible with philanthropic funding. And also want to just be self-sustaining as an organization. So, yeah, that's a question that's actively on our minds, and we're thinking about what is the right vehicle to house Elicit in going forward.

Nathan Labenz: (1:38:46) I wonder if there's—of all the language model first companies I've talked to, you guys remind me most of Wolfram. There's definitely some shared DNA there of highly explicit reasoning.

Andreas Stuhlmüller: (1:39:02) Yeah, there's definitely some shared aesthetic in terms of having very explicit processes that are robust and trustworthy. I think there's also important differences in aesthetic. I think Wolfram is very much open to just building a gazillion widgets and then putting them together, and I think we're probably a bit less excited about that. Yeah, I don't know. I like Wolfram. It's a cool company.

Nathan Labenz: (1:39:29) Obviously, you guys are motivated to a significant degree by AI safety considerations. And you made one kind of offhand comment earlier about when you talk to the biggest leading labs in the space right now, you say, you don't need to make bigger models, just make them more reliable. How would you characterize the overall state of the AI safety issue landscape and kind of our prospects? Right now it seems like things are changing really quick. We're getting signals from OpenAI that they're going to try to do something to kind of keep things under control, hopefully, which will be very interesting to see what that turns out to be. But what's your kind of outlook on the whole scene?

Jungwon Byun: (1:40:16) I think my honest answer is that I just have a tremendously wide range of what could happen that includes all of the possibilities. And I just have tremendous uncertainty. I feel like it's just an impossible thing to predict. The thing that I do feel more confident about is that I really would prefer for this not to rely on just a technical solution. At least in the community that we grew up in, I think there's a lot of focus on the technical challenges. There are real technical challenges. I think they would be very high impact. And so it's really important. But I also want to think a little bit about how—especially with things like process supervision—one of the things that I think we want to see happen is consumers demanding this type of process supervision and transparency from their AI products. And we don't want to see safety just happening at the large AI labs, but also at these end user applications. I think they will care internally, as they're building their systems, about being able to supervise the process and control and debug, and also about enabling a lot of those features for their users. So I think there can be a lot of helpful pressure coming from that side. At Upstart, my last company, we were trying to use machine learning for credit within finance, which is an incredibly heavily regulated industry. And I think there I got to see both how regulation could really protect consumers and be very good for the world. I was on the marketing side, and I was just not allowed to make any false statements or any statements that were not very heavily supported by evidence and data. And so that seemed really good for consumers, and people lying about their financial products seems really bad. But I also saw how much regulation lags relative to the pace of innovation. So I really hope that's another mobilizing force on the safety side. I'm not the expert here, so I don't know how exactly we bridge that gap, but I think it's so important that the regulatory side moves at pace with the technology and also that people have examples of where regulation can be really good for the world, especially in the tech industry, and we're trying to set a lot of those things up. So I think that the technical challenges are definitely pretty tough, but I think there are a lot of other ways we can be coordinating in different parts of society to make this go as well as possible.

Andreas Stuhlmüller: (1:42:39) Yeah, I think I roughly agree. Things are very uncertain and up in the air, partially because, in the big scheme of things, still not that many people are actually working on these issues, both on alignment and also on making AI go well more broadly construed. I think in our space specifically, which is using AI to increase wisdom, improve decision making, etc., basically no one is working on this, which is maybe kind of surprising. I think one of our best hopes to make things go well is probably to figure out better ways to use AI to support us as it becomes more capable, help us coordinate better, and figure out what kind of good plans look like. I think things are up in the air because it's really a high leverage situation. Individuals can probably still make a surprising amount of difference by pushing in the right directions. I think there are many signs for hope. The big AI labs have all expressed that they consider process supervision among their core alignment bets, which I think is a great sign. But I think we haven't really seen that cash out in real investments yet. For example, I don't think we've seen any kind of scaling up of investment in process supervision that looks at all comparable to how much has been invested in scaling up models. If you did see that—better tools for writing compositional language model programs, better debuggers, visualizers, compilers—I think there's a lot you could do in principle, and people having expressed that they're excited about this should make us somewhat more optimistic. But not having seen the investments on the ground yet should still make us feel like these things could go either way. We don't quite know yet how things will go. But on more optimistic days like today, I feel like, yeah, probably AI will help us better see what the risks are, will help us better coordinate. Probably eventually we'll figure out that we'll have to limit large scale training runs to some extent. And to the extent that that is the right decision, AI will help us see more clearly that it's the right decision. And to the extent it's not, it will also help us see that. So I think there is reason for hope.

Nathan Labenz: (1:45:00) What Jungwon said at the beginning really resonates with me around just very extreme uncertainty. Your comment too about some sort of high end cap—I mean, that seems to be obviously a big idea that's circulating in the space. Could we put some sort of—and this seems like it could be a really good idea—some sort of high end cap on super large models? Beneath that, there doesn't really seem to be any reason to worry too much. Now that could change, obviously, with a new architecture or some sort of conceptual breakthrough. But at least for now, it seems like that limit could be pretty high and not really the kind of thing that inconveniences PhD students or startups or whatever, but still would create some decent visibility into what is going on that might actually have some tail risk associated with it. I'm thinking a lot about that right now as well. Obviously, I'm not any sort of ultimate decision maker, but to the degree that I can shape the discourse or whatever, that does seem to be a pretty attractive scenario right now that, again, doesn't hurt people, doesn't constrain most people too badly, which is nice. You guys can continue to fine tune your FLAN-T5-XXLs all you want, but if somebody's going to go 100x past GPT-4, it seems like society has an interest in getting a tip on that before the whole thing kicks off.

Jungwon Byun: (1:46:32) Yeah. I am encouraged by the fact that I think probably compared to other technological revolutions of the past, a lot of the decision makers in this context have cared about safety from the very beginning. So obviously there's some disagreement about what safety should look like and how much—individual opinions vary. But I think everyone is trying really hard to be as safe as they can and think a lot about the consequences of their actions and trying to be altruistic in that way. So that's very good. And if we think about tobacco or something like that, I just don't think that was the case from the beginning.

Nathan Labenz: (1:47:12) Yeah, or even just if you go back to the Industrial Revolution, I've often said this: I don't think James Watt had any idea what was going to happen. There does seem to be just a lot more emphasis on trying to get ahead of it than there was in previous technological revolutions, right? Or like Gutenberg, I don't think he had any great sense of what he was unleashing in that moment. And we don't have a great sense of necessarily what we're unleashing either, but at least we have a sense that we're unleashing something. I guess that's one step ahead of Gutenberg. So yeah, that's a reason for some optimism anyway. Guys, this has been fantastic. I really appreciate your time. Thank you for going so long with me and educating me so much along the way. I've got three kind of fun questions that I always end with. You can go on as long as you want, or you can just give me quick hitters. But number one is, what are the AI products, services, tools, whatever, that you guys are using, obviously aside from your own, that you would recommend to others that they should check out?

Andreas Stuhlmüller: (1:48:23) This might only apply to a small subset of your audience, but I use Emacs a lot, and I wrote my own GPT Emacs extension, which is open source. You can search for it: GPT.EL. I actually use it a lot, I think, both for coding and for other tasks. I think it's just really nice to have language models accessible in the environment where you do basically all your cognitive work. Yeah, maybe Jungwon has some more scalable suggestions, but that's mine.

Jungwon Byun: (1:48:51) I've honestly been really amazed and delighted by the ChatGPT Deep Mode combination because I don't have much of a programming background, but I've just been able to do way more cool things with Python with that combination. So yeah, I don't know that it counts as a separate AI product... it's just ChatGPT. But I think that's the thing that I use most regularly.

Nathan Labenz: (1:49:13) Those are both good answers. One of the things I'm going to read about later today is the new plugins architecture that OpenAI just launched today, and there may be some really cool new stuff there as well. I'm sure there is. But it is funny, actually. Most people don't have a lot of answers right now. I do keep coming back to that. I think the application layer is maybe kind of a mirage right now. There have been literally thousands of things launched and sent out in newsletters or whatever, but the number of things that I actually hear back about is pretty small... you guys gave more original, novel, interesting answers than most. Honestly, it's been just a lot of ChatGPT. Okay, number two. So let's imagine a future scenario where a million people have a Neuralink implant. If you get one, you are enabled with thought-to-text. In other words, you can control your computer purely with thought. Would you be interested in getting one yourselves?

Jungwon Byun: (1:50:21) 100%. Absolutely. I've already thought about this. I just have this experience all the time where I'm like, why do I have to spend so much time converting my thoughts into text, typing them into my note-taking system, and then transferring them into messages to other people? So yeah, I would definitely want this.

Andreas Stuhlmüller: (1:50:44) I would have a few questions about the details of the implant. But I don't know. It seems like if you're paying for it, Nathan, sure, why not?

Nathan Labenz: (1:51:00) Yeah, maybe that'll be part of our universal basic intelligence stipend at some point. So, okay, last one, just zooming out as far as you can, and you've touched on this throughout the conversation, but zooming out big picture, rest of the decade, what are your biggest hopes for or positive vision for an AI-enabled future? And then what are your biggest fears of what might come as AI permeates society?

Jungwon Byun: (1:51:32) Culturally, I don't know how much this is just me being in my bubble, but I have the sense that a lot of people have a lot of pessimism or there's this kind of sense of wonder that has been lost. And I think it's easy to lose sight of how much about ourselves and our world still remains to be discovered. So I really want tools like Elicit and just AI more broadly to be able to help us not just amplify existing research efforts, but help us to discover entirely new ways of doing research, discover entirely new research methodologies, entirely new research domains that weren't possible before. Now using this human artifact of text as a data source, we can analyze ourselves more intelligently so that we can better understand who we are and how we relate to each other and what kind of systems we can build to help each person and every group of people flourish more. I'm really excited. I've had friends who worked in different layers of government throughout COVID, and they tell us about all of the very impactful decisions that were made on complicated email threads. I would really like that to stop, and for AI systems to be able to generate very decision-relevant but really, really, really simultaneously rigorous kind of one-pagers, policy memos, answers to people in very high-stakes decisions so that we can just become way more intelligent about how we govern this world. And then I think on a somewhat not Elicit-related dimension, I feel like there's so much that AI could do for us. Things like just being able to translate so much more easily—not literally in terms of language, but in terms of style or tone, or helping us communicate with different neuroatypical people or styles or something like that. I feel like there's just a lot it might be able to do that makes us more human. Even though it's a machine, I think there are a lot of ways that it can help us be better humans as well.

Andreas Stuhlmüller: (1:53:37) My perhaps idealistic view is that most disagreements are about beliefs, not values. I think people disagree about how big a deal AI is. Is it an existential risk? If so, is it 90% doom or 10% doom? But I think that's true much more broadly. Is universal basic income good? Is it bad? Probably a lot of these disagreements in terms of policy and what should be done are more about people having seen different kinds of evidence and less about people having fundamentally incompatible values. I think on the optimistic view, AI helps us better see the totality of the evidence and so be able to better coordinate and be less in the situation where it's my faction against your faction, "I think this is the right thing because I have those values." More like, "Well, actually," and this is relating to the translation comment that you made also, maybe actually AI can help me understand why do you think the thing that you think, what evidence have you seen that I haven't seen, and vice versa. And so I think there is a very optimistic vision that could happen this decade or could happen later, where it just becomes much more clear what the right things to do are for a lot of big-picture decisions, and that could lead to things going quite well.

Jungwon Byun: (1:54:57) On the negative side, I think, yeah, I'm quite worried about chaos. I think things can get chaotic for a lot of reasons, perhaps even before an unaligned AI takes off; just the fears around this imagined problem can cause a lot of chaos. I think the classic fear of generative models creating disinformation or just creating a lot of stuff... I feel like we already have so much noise that we need to sort through in our current society. And I worry that the noise will increase without the signal, or the ability to synthesize that information—those capabilities—increasing. So I really think we need to keep up the ability to make sense of the world as it gets more and more complicated.

Andreas Stuhlmüller: (1:55:51) Thanks, Nathan.

Jungwon Byun: (1:55:52) Great to chat, Nathan.

Sponsor: (1:55:54) Omneky uses generative AI to enable you to launch hundreds of thousands of ad iterations that actually work, customized across all platforms with a click of a button. I believe in Omneky so much that I invested in it, and I recommend you use it too. Use CogRev to get a 10% discount.
