Unbounded AI-Assisted Research with Elicit Founders Andreas Stuhlmüller and Jungwon Byun
In this episode, Nathan sits down with Elicit co-founders Andreas Stuhlmüller and Jungwon Byun to discuss their mission to make AI-assisted research more accessible and reliable. Learn about their unique approach to task decomposition, which allows language models to accurately tackle complex research questions. We delve into the company's tech stack, their transition from nonprofit to startup, and their dedication to creating trustworthy AI tools for high-stakes applications. Join us for an exploration of the future of AI in research.
The Cognitive Revolution is part of the Turpentine podcast network. Learn more: www.turpentine.co
HELPFUL LINKS:
Elicit : https://elicit.com/
Andreas Stuhlmüller : https://twitter.com/stuhlmueller
Jungwon Byun : https://twitter.com/jungofthewon
SPONSORS:
Oracle Cloud Infrastructure (OCI) is a single platform for your infrastructure, database, application development, and AI needs. OCI has four to eight times the bandwidth of other clouds; offers one consistent price, instead of...does data better than Oracle. If you want to do more and spend less, take a free test drive of OCI at https://oracle.com/cognitive
ODF is where top founders get their start. Apply to join the next cohort and go from idea to conviction, fast. ODF has helped over 1000 companies like Traba, Levels and Finch get their start. Is it your turn? Go to http://beondeck.com/revolution to learn more.
Omneky is an omnichannel creative generation platform that lets you launch hundreds of thousands of ad iterations that actually work, customized across all platforms, with a click of a button. Omneky combines generative AI and real-time advertising data. Mention "Cog Rev" for 10% off www.omneky.com
The Brave search API can be used to assemble a data set to train your AI models and help with retrieval augmentation at the time of inference, all while remaining affordable with developer-first pricing. Integrating the Brave search API into your workflow translates to more ethical data sourcing and more human-representative data sets. Try the Brave search API for free for up to 2000 queries per month at https://bit.ly/BraveTCR
Plumb is a no-code AI app builder designed for product teams who care about quality and speed. What is taking you weeks to hand-code today can be done confidently in hours. Check out https://bit.ly/PlumbTCR for early access.
Head to Squad to access global engineering without the headache and at a fraction of the cost: head to choosesquad.com and mention “Turpentine” to skip the waitlist.
--
TIMESTAMPS:
(00:00:00) Intro
(00:04:35) What is Elicit?
(00:05:33) Vision for Elicit
(00:09:40) Making research transparent
(00:11:28) How to use it?
(00:14:51) Multi-dimensional exploration of the research space
(00:15:57) Sponsor: Oracle Cloud Infrastructure | ODF | Omneky
(00:17:41) Task Decomposition
(00:21:29) Evaluating AI
(00:23:18) Defining the task
(00:25:59) Eliciting fine-grained evaluations
(00:27:36) Hallucination rates
(00:29:52) Models in play
(00:30:59) Sponsor: Brave | Plumb | Squad
(00:33:56) Shipping a new feature every week
(00:35:39) What was not possible a year ago?
(00:37:46) Chain of thought
(00:40:03) Intuition for chain of thought
(00:43:17) Tactically, how to structure the chain of thought
(00:44:51) Data sets and fine-tuning
(00:50:33) Scaffolding
(00:52:50) The future of infrastructure
(00:55:16) How Elicit works today
(00:58:20) Product philosophy
(01:05:46) Trends for the future
(01:07:18) Systematic reviews and meta-analyses
(01:09:23) Depth of processing
(01:12:42) Habit formation
(01:14:12) AI Bundle
(01:17:25) Nonprofit to for-profit
(01:19:47) Hiring needs
(01:22:27) Wrap
Full Transcript
Transcript
Jungwon Byun (0:00) We have big visions of transforming research and there might not be a lot of time left before we really need to make it useful for very high stakes decisions in times of chaos.
Andreas Stuhlmuller (0:08) People probably still are making the obvious mistake of being like, "Hey model, say yes or no and then justify your answer." And that's obviously terrible because then it has to answer without any reasoning and will just lock itself into a potentially wrong avenue.
Jungwon Byun (0:21) This is where the task decomposition comes into play because it's much easier to iterate on and ship a new predefined column for statistical technique use than it is to ship a new model that's been fine tuned on all those scientific papers and evaluate that for quality. So again, breaking down things into small tasks helps with launches as well.
Andreas Stuhlmuller (0:40) The key question is how do we as a society want to turn compute into more work? One answer is we're going to train larger and larger models and we'll hope for the best. Maybe we'll augment it a little bit with better interpretability methods. We are working towards a different answer, which is we think more compute will in fact lead to great things and more correct answers, etcetera, but we can get there through more transparent architectures if we can build the right infrastructure.
Nathan Labenz (1:05) Hello and welcome to the Cognitive Revolution where we interview visionary researchers, entrepreneurs, and builders working on the frontier of artificial intelligence. Each week, we'll explore their revolutionary ideas and together we'll build a picture of how AI technology will transform work, life, and society in the coming years. I'm Nathan Labenz, joined by my cohost, Erik Torenberg.
Nathan Labenz (1:28) Hello, and welcome back to the Cognitive Revolution. Today, I'm excited to present a conversation with Andreas Stuhlmuller and Jungwon Byun, cofounders of Elicit, a company that helps social and natural scientists analyze research papers at superhuman speed, and in which I am a small dollar but very proud investor. Andreas and Jungwon bring deep expertise and a highly principled approach to the challenge of getting large language models to reliably answer complex research questions. Their product allows users to systematically search large bodies of literature, extract key information into well structured tables, and iteratively refine queries to zero in on the most relevant sources, all while maintaining a transparent log of each step for easy auditing, reproducibility, extensibility, and team collaboration. As the AI research assistant space has become increasingly crowded, Elicit stands out for its meticulous approach to task decomposition. That is the art, which they are very much developing into a science, of breaking big questions down into smaller, more manageable subtasks that language models can reliably execute. This allows the product to provide real value to serious researchers who really need to be able to trust the results. We cover a lot of ground in this conversation, starting with the company's founding vision of using AI to enable knowledge sharing at an unprecedented scale. Andreas and Jungwon explain how they've approached key challenges like minimizing hallucinations, maximizing accuracy, and expanding the scope of analysis. And we dig into the tech stack that makes this possible, including their approaches to retrieval, extraction, summarization, synthesis, and most importantly of all, evaluation. Finally, we touch on the company's evolution from nonprofit research lab to mission driven commercial startup, their plans to enable more scalable compute intensive workflows in the future, and what sorts of talent they are looking to hire today. By focusing on researchers who are working on hard, sometimes literally life and death problems, Elicit has no choice but to treat reliability as a top priority. And their approach, honed over several years of R&D, is a model for anyone who's looking to build high stakes applications with large language models today. I was obviously a big fan of the company coming in, but I came away from this conversation even more convinced that Andreas Stuhlmuller and the team are up to this important challenge. As always, if you find value in this work, please share the episode with your friends. This one would be great for anyone who's struggling to keep up with an exploding body of research literature, whether that's in machine learning as it is for me, biology, or something else. And please also take a moment to share any feedback or topic suggestions that you have via our website, cognitiverevolution.ai, or by messaging me on the social media network of your choice. Now please enjoy this deep dive into Elicit's program for scalable, trustworthy AI assisted research with cofounders Andreas Stuhlmuller and Jungwon Byun.
Nathan Labenz (4:28) Jungwon Byun and Andreas Stuhlmuller of Elicit. Welcome back to the Cognitive Revolution.
Andreas Stuhlmuller (4:34) It's great to be back.
Nathan Labenz (4:36) Guys, it has been an incredible year in the AI space overall, obviously, with an unbelievable amount happening. I can't believe that it's been about that long since our first conversation. And a lot has happened for you guys as well, with a bunch of product launches that you've brought forward in rapid succession, a fundraise. And right now, the latest hotness from Elicit is notebooks. Folks will know, by the way, I've mentioned Elicit in a couple episodes recently as one of my half a dozen go-to information surfacing tools. And specifically for more structured literature reviews, this is my go-to tool. I should also disclaim a little bit, because this is not something I have to do often, but I am a super small time investor in Elicit, which is something I say with great pride. So let's start off by just telling people what's new, and you guys can run through the new release of notebooks.
Jungwon Byun (5:33) Yes. I was just gonna say, for people who don't know, Elicit is an AI research assistant. Our vision is to apply AI to helping the world navigate complex reasoning and to really scale high quality reasoning. We want to help researchers unlock breakthroughs in every domain, everything from climate change to chronic fatigue and institutional decision making. And as you've noted, in the past year since we've last been on, the product has just grown tremendously. We've gotten to $1 million in ARR in just four months after launching subscriptions. We spun Elicit out into its own venture, raised a $9 million seed round, and have really been spending the last few months thinking about how we continue to support people who are doing really complex large scale reasoning, people who are really pushing the frontiers of their domains forward. And notebooks, which should be out by the time that we release this podcast, is something that we've been working on actually for years. This is actually our fourth time trying to make this vision work. It's very core to what we think about with Elicit and the kind of product we want to be. So I think it's maybe helpful to start with what is the vision for Elicit really? What are the problems we're trying to solve, and how does this latest release, which is really much more than a feature launch, it's a reimagining of the product, manifest that? And I would say for us, there are two main challenges that we see in research today. One is that there's just an overwhelming amount of information. There's just a superhuman amount of information. There's no researcher that can possibly stay on top of it all. Publications in most domains are growing exponentially. A researcher can barely stay on top of what's being published this year, not to mention all of the historical context that's obviously accumulating over time. I think researchers feel like they're just drowning in papers and they can never be as rigorous as they want to be. So that's definitely one challenge with research where I think it seems pretty obvious how language models can help with that problem. The other challenge we've seen in research is just that a lot of the processes are ad hoc. I'm sure you've had this experience too. It's very common to look at someone and they just have an explosion of tabs. They start with Google or Google Scholar. They find this paper and that paper and this paper, and they're down this rabbit hole, and like eight hours later they're like, how did I end up here? What was I even researching? And there are certainly very systematic processes in research, but a lot of times the journey is not one of them; people talk a lot about serendipity. When we talk to researchers at multinational industry research labs that have offices across continents and thousands of people working, when they try to figure out what kind of relevant work has been done in their own company, a lot of it is, I talked to someone, and fortunately that someone happened to know someone, and that person happened to be here like ten years ago when we worked on this thing. And I think that's just really not a scalable or optimal process. And so with Elicit, and I think with AI, we can really start to reimagine what research can look like with these tools, what processes we want to bring over and what we really want to transform.
And we really want to make research, we talk about making research, systematic, transparent, and unbounded. Those are the three things that we want to manifest in our product. And so notebooks is the way by which we make that possible. So systematic is taking these processes that are ad hoc, these maybe one-tab-at-a-time research journeys, and instead helping researchers have much more metacognition, putting them more in the seat of thinking about what makes for a good paper, what makes for a relevant result, what makes for high quality research. What does success look like for me in this moment and for this research project? Let me be explicit about those criteria and then apply them over as many inputs, as many papers, as exist. Not gated on my own time or reading potential, but really on all of the relevant data out there. With AI, people are increasingly going to be put in the position of having to evaluate and be really explicit and mindful of what success looks like before then using AI tools to manifest that. So that's systematic, is what we're going for. We care a lot about making research transparent. This has been really important to us since day one. It's really important with language models; we've published research in the past on supervising process, not just outcomes, and notebooks really accomplishes that as well. It allows you to take additional steps, and it logs all of your steps as you go on this much more extensive journey. So you can see: what papers did I start with? What factors did I consider? What papers did I screen out? What other queries did I run? What papers did those result in? And then over time you can see how did I end up at this final output. There's this kind of by-default log that you can audit, that you can share with somebody else. So it's gonna help with that transparency component. And then I think the biggest part that notebooks will push on, that we haven't had before, is this idea of unboundedness. So I think another challenge with research today is people do a ton of work, and then the final artifact is this PDF, which is obviously very static and basically immediately outdated as soon as it's published, and you can't reproduce it, you can't extend it, you can't slightly tweak someone's analysis. And with notebooks, we really want to try and make research much more unbounded in that sense. So you can keep taking additional steps and keep doing data analysis over all the information and papers. Basically, you can follow up with additional queries, summarize more papers, extract more data. And often in research, there's a way we can systematize research processes, which we really want to do, but there's definitely a component of research which is fundamentally iterative and uncertain. Often if you're in a more exploratory state, you don't always know where you're going to end up. And so notebooks accommodates both of those workflows.
Nathan Labenz (11:28) Cool. Let me just give my kind of experiential overview of what it's like to use Elicit, and you can expand on that. And of course, you guys have plenty of demos and materials as well. But the way I think about it is, first of all, it's, as you've said, very structured. Right? So the initial prompt is to ask a research question. Then there are, I guess, I think of it as two main functions. One being identifying the papers that it should show me in the first place, and that kind of creates my little sandbox that I can now play in. And then there's the highly structured addition of columns to the tabular form in which the papers are presented, where I can bring all these sort of incremental auxiliary analyses to bear on all the papers. So there's a ton of these that are preset, where you can say, what's the population size, or was the null hypothesis rejected or not? All these very standard things that you've already cooked up. So I can just go click a button, add a column, and then boom. In the background, the language model is, and you can tell us a little bit more about exactly how it's working under the hood in a minute, digging into each of these papers, asking these very specific questions in these kind of bite sized ways, filling out this table. I personally love the ability to also set up my own, and one I did the other day was, in researching curriculum learning, I just said, what is the largest model used in this paper? And, you know, just boom boom boom. Okay. Pythia 6B, whatever, whatever, whatever. So now I can say, and this is where I think most of that functionality has been there, but now with notebooks I can start to say, okay, I am really only interested in ones that have pretty big models, because of the belief that they're gonna behave qualitatively differently. So I'll look at that column, check the ones that have sufficiently large model size, now pull that down into another table, and now I can start to do more deep dive analysis on just that select group. Then I can also ask additional questions, bring up more papers, consolidate those into lists. And I hadn't really thought too much, because I'm relatively new to it, I've got early preview access, but I'm still relatively new to it, I haven't really thought as much about the sort of implicit log that creates and the journey that is embodied in that history. But I actually, as you described that, also thought, yeah, that's probably quite valuable, to be able to go back and look at my train of thought, because lord knows I can't remember what I was thinking half the time, and also just to audit myself to make sure that I was approaching things in a sensible way. So, yeah, that's cool. What would you add to that?
Jungwon Byun (14:08) Yeah. Now let's say at the end of all this work, you wanna publish a new blog post, or, like, someone wants to update this kind of "AI and compute" essay that was published a long time ago. Ideally, this notebook can serve as this log of how did you figure out which models were relevant, what dimensions did you look at, what papers, what criteria. And then if someone wants to build on that, they're like, actually, I also want to look at smaller models, or I want to look at models trained on this specific architecture. It's much easier for them to pick up where you left off. They don't have to start all the way from scratch finding the relevant papers. Again, they can just make small tweaks. So the ability to take this expertise and externalize it outside of a person's mind, to make it much more collaborative, is definitely something we're excited about. And then this thing that you talked about with columns, I think, is another manifestation of this unboundedness in notebooks. There are just so many places where you can go in multiple directions. Like, from the very beginning, you can ask five different search queries and get papers for all of those search queries. You can go really broad. Then for any one search query, you get a set of papers, you can add all these columns and start to go deeper, all the way down to, I'm gonna go really deep into this one paper. So I think this kind of multidimensional exploration of the research space is really cool, and also one of the challenging product problems that we have to work through.
Nathan Labenz (15:24) Cool. Yeah. I love it. I'm excited to just spend even more time with it now. I I love the idea, and I haven't had it long enough to to have had this happen for me yet. But I do think the notion of going back to an earlier analysis, running an updated search, and then filling in all those cells that I previously went through and just, like, extending that previous analysis and kind of bringing it up to date is, yeah, something that I will definitely keep in mind because that is gonna come up, no doubt.
Erik Torenberg (15:53) Hey. We'll continue our interview in a moment after a word from our sponsors.
Nathan Labenz (15:57) Okay. In terms of how this works, I think you guys have been leaders. And one of the things that was, you know, really exciting to me about the company was the focus on reliability, taking a principled approach, not being overly reliant on language models, which I think is becoming a bigger problem broadly and will only continue to become a bigger and bigger problem. But I wanna get into a little bit of some of the weeds on that, if you will, because I think people broadly don't have a great sense of the best practices on this. Also, by the way, supervising process, and that's become more of a trend. You guys were definitely very early on that trend. So one big aspect of this is task decomposition. I would say, and you can extend my understanding of the literature here for sure, but it seems pretty clear that task decomposition is good for performance, good for reliability, good for getting accurate answers. I think we've seen pretty consistent results in the literature to support that notion. As a practitioner, though, you sometimes have challenges where you're like, okay, just how much do I really wanna break these tasks down? I know I don't wanna just throw everything into one mega prompt and hope for the best. But on the other hand, if I break things down super, super fine grained, then my token count explodes or I, like, have to really pare back on context, and that maybe can hurt performance in other ways, and things can get slow. So it seems like there's a Pareto frontier here, as is often the case, with how much to decompose tasks. Wonder if you could share, like, how Elicit works in that regard right now and just best practices for folks in general? Because I think a lot of folks who listen to this show are builders, and they are probably wrestling in their various contexts with how much task decomposition they should do.
Andreas Stuhlmuller (17:48) Yeah. I think that's a really interesting question. I think in the past, often, the question was, what's the largest chunk that a model can reliably do? I think we started doing task decomposition with GPT-2, and the chunks were really very small. "Does this claim imply this other claim" was maybe already something that was difficult to do. I think now the question is starting to be, what is easy to supervise? How big or small do you need to make the chunks so that you can easily check whether the work was correct or not? As you mentioned, with these, I don't know, million-token context models, that can be quite tricky. If you get an answer, it's easy to verify that it found something good, but you won't easily know if it missed something, because the way you would figure that out is by looking through the million token context. I think this sort of what-is-easy-to-supervise question will really become quite important in the future, not just for humans but also for models. Even if you use models for the supervision, I think you still want to break things down in ways that make the supervision problem easy. One case where we used this, for example, was we wrote this paper on what we called factored verification for the Elicit summaries, where you're trying to make summaries that have really no hallucinations. And there, instead of breaking down the task of generating the summary, we instead broke down the task of verifying the summary. We broke the summary into independent claims, and then for each claim we asked, is this claim very clearly substantiated by the context? And we detected quite a few very subtle hallucinations. Maybe more generally, one way I think about it is as similar to software engineering, where you want to write modular code, and you want to write that code because it will help your future self understand it better. It will help others check and understand it better. And so similar to how humans use clear interfaces for their work, I think the most important decomposition principle is use clear interfaces that make it easy to check the work.
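To make the factored verification idea concrete, here is a minimal sketch of the shape of that pipeline: split a generated summary into independent claims, then check each claim against the source text on its own. The helper names and prompt wording are illustrative assumptions, not Elicit's actual code or the exact prompts from their paper.

```python
from typing import List


def complete(prompt: str) -> str:
    """Hypothetical wrapper around whatever LLM completion API is in use."""
    raise NotImplementedError


def split_into_claims(summary: str) -> List[str]:
    # Ask the model to rewrite the summary as independent, self-contained claims.
    out = complete(
        "Rewrite the following summary as a numbered list of independent, "
        f"self-contained factual claims:\n\n{summary}"
    )
    lines = [line.strip() for line in out.splitlines() if line.strip()]
    # Strip leading numbering like "1." or "2)" from each line.
    return [line.lstrip("0123456789.) -") for line in lines]


def claim_is_supported(claim: str, source_text: str) -> bool:
    # Checking one claim at a time is a much easier task to supervise than
    # judging an entire summary in one shot.
    verdict = complete(
        "Answer strictly 'yes' or 'no'. Is the following claim very clearly "
        f"supported by the source text?\n\nClaim: {claim}\n\nSource:\n{source_text}"
    )
    return verdict.strip().lower().startswith("yes")


def verify_summary(summary: str, source_text: str) -> List[str]:
    """Return the claims that could not be verified (candidate hallucinations)."""
    claims = split_into_claims(summary)
    return [c for c in claims if not claim_is_supported(c, source_text)]
```

The design choice mirrors the point about interfaces: each check is small enough that a human (or another model) can audit it directly.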
Nathan Labenz (19:46) Yeah. I guess that leads to another interesting question around just the nature of the evaluations that you're doing. My general sense is that across the sort of AI application industry, and certainly it has been true at Waymark, where the stakes are quite a bit lower, the results are generally a lot easier to evaluate: people either like it or they don't. We're making creative stuff, so they can watch it. And if it resonates with them, then it's a win. And if not, then it's not. But still, we are iterating faster on models. We're using a fine tuned 3.5, and it's become easier to execute that fine tuning. And it's become easier to patch things and throw in some different data, and we can use GPT-4 to generate data. And so everything is speeding up in terms of the iteration cycle of the model. But that has now created this situation where we used to have a pretty intimate feel for how our model behaved in a lot of different contexts, and now we've started to lose that, and so it's, okay, we really do need to bring some structured evaluations to this. We are largely so far using model powered evaluation for very clear, I call it, like, the ten commandments, the thou shalt nots, the things we definitely don't wanna see. And I'm reluctant still to trust a language model on an overall quality score. I'm like, yeah, we could compare our current script writer to the next one with GPT-4 or whatever, but do I really trust GPT-4's scoring that much? Probably not. So I still feel like we need to have, in our case, humans watch the videos and see if they like them. What is the mix of things that you're doing? Obviously, it's a higher stakes and more subtle challenge with just a lot more different kinds of inputs as well. So how are you finding the balance between when to use models for evaluation and when to use humans? This seems very tricky.
Andreas Stuhlmuller (21:36) I think actually our experience mirrors yours in some ways, in that defining the task properly is really most of the battle. If you can be really clear about what good looks like, then you're most of the way there towards a good evaluation. For us, that means thinking a lot about who's our core user group. Our core user group is serious researchers. What are they trying to do? A lot of people use Elicit to write systematic reviews. For example, the last few days we've been working on improving our search, and so there we're looking at what's the ideal output here. If you look at a systematic review that has been written, what papers did they actually cite, and would Elicit have helped them find exactly those papers, and not some nearby papers that maybe on surface inspection seem relevant but, once you look into it, actually are not as relevant? In terms of metrics, that's the core approach. There are a few more details on then how do you actually measure things. Those are often more standard metrics, I guess for search, NDCG, normalized discounted cumulative gain, but essentially the most interesting part is really deeply understanding the user's problem. I think that was also the thing for the summarization example I gave earlier. Users really care a lot about correctness and don't want to see even subtle hallucinations. That's why we're like, we can use models to help a lot there, and found things like, I don't know, sometimes models like to just slightly exaggerate the findings of a paper, or imply that two independent findings are related because the sentences flow a bit better. For those kinds of things, if you know that people don't like those, we can encode that in models. But then there are other cases like extraction where we use automated evals, but we use automated evals against human gold standards, where human experts have looked into it, have thought about it. These are the things they want to see. I think in the future we will probably look more into using things like AI debate to come up with the gold standards as well, but for the time being, there are some cases where we can't fully encode yet what good looks like, and in those cases we also have to use human judgment still.
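For readers unfamiliar with the metric Andreas mentions, NDCG (normalized discounted cumulative gain) rewards rankings that put the most relevant papers near the top. Below is the textbook formula as a small Python function, assuming binary relevance labels derived from, say, which papers a published systematic review actually cited; this is a generic illustration, not Elicit's evaluation harness.

```python
import math
from typing import Sequence


def dcg(relevances: Sequence[float]) -> float:
    # Discounted cumulative gain: each result's relevance, discounted by rank.
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))


def ndcg(ranked_relevances: Sequence[float], k: int = 10) -> float:
    """NDCG@k: DCG of the actual ranking divided by DCG of the ideal ranking."""
    actual = dcg(ranked_relevances[:k])
    ideal = dcg(sorted(ranked_relevances, reverse=True)[:k])
    return actual / ideal if ideal > 0 else 0.0


# Example: 1.0 = the paper was cited in the gold-standard review, 0.0 = it was not.
print(ndcg([1.0, 0.0, 1.0, 1.0, 0.0], k=5))  # ~0.91
```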
Jungwon Byun (23:41) Yeah, so I think trying to use naturally existing gold standards and high quality data sets, and maybe not every use case has this, but we prioritize that. Where did someone whose work we'd really want to help just try really hard to get to the best possible answer, not trying to create a dataset for AI evaluation or training? Then how can we use that naturally existing human state of the art to evaluate our work is definitely something we try to lean on a lot, and our users have been really great in helping with that. And then I think the kind of factored verification approach that Andreas was mentioning earlier is also an important part of it. So Nathan, you were saying you're wary of just throwing a bunch of things into the prompt and saying, GPT-4, is this good or bad? But if you break down the task into, is this supported by the text, is this accurate, is this hallucinating, does it answer the original question, into each component, it's generally much more constrained, and I think produces better results that way. But yeah, I think part of it for us is also how do we structure the interaction in Elicit so that everyone is doing more fine grained evaluations. Like, we're doing more fine grained evaluations and our users are doing more fine grained evaluations. So when you gave the example earlier of, you're looking for these papers on mixtures of experts and you're specifically focused on large models, I think in other products or in another context you might have just been like, I have this question, please give me relevant papers. And we're like, we don't know what relevant papers are. But if we make it possible for you to be like, actually, what's relevant to me is models above a certain size, then you implicitly break that task down into a separate task of, what was the size of the models used in this paper? That is a much easier task for us to evaluate, as opposed to just, what are generally relevant papers for Nathan in this context?
Nathan Labenz (25:25) This portion of the conversation started with evaluations, which is typically an offline process. Right? But you could also imagine bringing at least some of that process into the runtime. I guess it would depend on multiple variables, including, like, how much of a problem it is after all your effort has gone in, and also, like, how much it's gonna cost, how much it's gonna slow things down, and how much that kind of incremental effort would help at runtime. Is there, like, an easy-to-summarize state of how many hallucinations you've been able to drive the product down to? And are there runtime, like, secondary models checking models, runtime checks happening in the course of a user session?
Andreas Stuhlmuller (26:10) I don't remember the exact numbers, but I think it was something like from 1.5 hallucinations on average using the largest models to maybe 0.5 hallucinations, where we were quite strict about what we count as a hallucination in that context; even slight exaggerations or slight inaccuracies we counted as such. We don't currently use additional checking at runtime, so the primary use of that was to generate training data that we then used to distill a model that had lower hallucination rates. But I think additional runtime checking is really interesting, and we're always looking for ways we can help users turn additional budget into more accurate answers, which is a key criterion at Elicit, and I think additional runtime checking could be one of those.
Jungwon Byun (27:01) Yeah. And then in app, we've done some testing. It really varies depending on the task. So we've done mostly evaluations on some of the predefined extraction columns, and in the kind of direct side by side comparisons we've done with manual extractions, we've pretty consistently outperformed trained research staff who are doing those extractions. So that's been really cool. When you use it yourself, again, it depends on how you're formulating the question, what you're asking for, whether the information is even in there. But certainly in my experience, I'm always amazed by how quickly and how accurately the data is able to be extracted. I think it's definitely at superhuman performance today.
Nathan Labenz (27:37) That's cool. And it's also very important always to keep that kind of alternative in mind. I feel like one of the lessons I've learned over and over again over the last year of talking to people and building things is just how often there was no measurement of what human performance was when they started off in a given domain. And people, in many cases, imagine that it's super awesome, and it's often not quite as awesome as they imagine.
Jungwon Byun (28:02) Yeah. Sometimes people give us a gold standard, and then we're like, oh, we think there are some errors in your gold standard, based on what we've done.
Nathan Labenz (28:08) You touched there for a second on models, and obviously creating datasets is a huge part of this, and you mentioned distilling. I'd love to hear what are the models that are in play today.
Andreas Stuhlmuller (28:21) To take a step back, our principle is we want to always let people use the best models at any one time, so we are pretty pragmatic about what models to use. Therefore, it's also quite important to us to not be tied to any one compute provider, because sometimes combinations of models from different providers might be better than the best single model. That means, I think, right now we use a collection of models that includes fine tuned models, including T5 and fine tuned GPT-3.5. We use GPT-4, and GPT-4 Vision for understanding tables. We're currently working on using Claude 3 for some functionality, but next time we talk, probably the answer will already be different. It's really, at any one point you want to be at the frontier, and so getting good at quickly swapping out models is the most important thing.
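The "don't get tied to any one provider" point is essentially an architectural choice: put a thin routing layer between tasks and models so swapping a model is a configuration change rather than a code change. A hypothetical sketch follows; the task names, provider names, and the `call_model` wrapper are all assumptions for illustration, not Elicit's setup.

```python
from typing import Callable, Dict


def call_model(provider: str, model: str, prompt: str) -> str:
    """Hypothetical wrapper over each provider's SDK; implementations can be swapped freely."""
    raise NotImplementedError


# Route each task to whichever model currently does it best. The identifiers
# below are illustrative placeholders; swapping one is a one-line change here.
MODEL_ROUTES: Dict[str, Callable[[str], str]] = {
    "table_extraction": lambda p: call_model("provider_a", "multimodal-model", p),
    "summarization": lambda p: call_model("provider_b", "finetuned-summarizer", p),
    "column_answering": lambda p: call_model("provider_c", "frontier-chat-model", p),
}


def run_task(task: str, prompt: str) -> str:
    return MODEL_ROUTES[task](prompt)
```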
Nathan Labenz (29:08) Yeah. The speed of shipping is also definitely a key priority for AI apps in general.
Erik Torenberg (29:16) Hey. We'll continue our interview in a moment after a word from our sponsors.
Erik Torenberg (29:24) Hey, all. I'm hearing more and more that founders want to get profitable and do more with less, especially with engineering. Listen. I love your 30 year old FAANG senior software engineer as much as the next guy, but honestly, I can't afford them anymore. Founders everywhere are trying to turn to global talent, but boy, is it a hassle to do at scale, from sourcing to interviewing to on the ground operations and management. That's why I teamed up with Sean Lenehan, who's been building engineering teams in Vietnam at a very high level for over 5 years, to help you access global engineering without the headache. Squad, Sean's new company, takes care of sourcing, legal compliance, and local HR for global talent so you don't have to. With teams across Asia and South America, we can cover you no matter which time zone you operate in. Their engineers follow your process and use your tools. They work with React, Next.js, or your favorite front end frameworks. And on the back end, they're experts at Node, Python, Java, and anything under the sun. Full disclosure, it's going to cost more than the random person you found on Upwork that's doing 2 hours of work per week but billing you for 40. But you'll get premium quality at a fraction of the typical cost. Our engineers are vetted top 1% talent and actually working hard for you every day. Increase your velocity without amping up burn. Head to choosesquad.com and mention Turpentine to skip the waitlist.
Nathan Labenz (30:42) You recently put out a blog post about how you have, I believe it was, shipped a new feature every week. Do I have that right?
Jungwon Byun (30:49) Yeah. I think on average it ended up being 1.4 weeks. So there are a couple of weeks that we missed when we were doing bigger launches.
Nathan Labenz (30:55) Even in a video creation app, where it's, like, relatively lower stakes, that would still be pretty good, I would say, to launch a new thing every week, or just a little less often than that. But given the standard that you have around reliability, and minimizing hallucinations being so mission critical, how are you balancing that? Is it just a matter of having a battery of tests that you can run and trusting them, or is there more to the discipline of the iteration cycle that you can share?
Jungwon Byun (31:28) Yeah. I think we're staggering it as much as possible. So some of the accuracy dependent features take a lot of time to iterate on and research, and, you know, it's not like they're in progress for a week. They're in progress for a while, but then we can stagger the launches across multiple weeks, and then we can offset them with other more infrastructural or just pure product based launches.
Andreas Stuhlmuller (31:49) Yeah. I guess the other thing to mention is, I think, if you want to be at the frontier of accuracy, you need to ship quickly, because every week a new model is coming out. So if you're slow to ship, then your product will probably not be the most accurate it can be.
Jungwon Byun (32:03) Yeah. I guess this is also where the task decomposition comes into play because it's much easier to iterate on and ship a new predefined column for, you know, statistical technique use than it is to ship a new model that's been fine tuned on all the, you know, scientific papers and evaluate that for quality. So I think, again, breaking down things into small tasks helps with launches as well.
Nathan Labenz (32:25) Yeah. That's interesting. So the models obviously have gotten better over the course of the last ten years, but certainly over the last year or so, I would imagine, a lot of thresholds have been passed. I wonder if you could tell a story of what was not possible a year ago, where the language models just couldn't do it, or couldn't do it reliably enough, that now are preset column options, because we've tipped over from not there to there in terms of the frontier models' ability to do it.
Andreas Stuhlmuller (33:00) The most obvious case is answering using tables. So this isn't a specific column; it's across all columns. Now, if you're using high quality mode, we also look into the tables of the papers. For many questions, like the ones I think you mentioned earlier, is the p-value significant, or is the hypothesis refuted, those are often only found in tables. Yeah. That was just a really hard problem that, before multimodal models, was basically not possible. Even at this point, I think there's a lot of engineering involved in getting things to be fast and parsing tables, etcetera, but now it's possible. Otherwise, I think there is a kind of increase in what information you can leverage across the board. For example, I've been working a fair amount with an organization called Epoch. You might be familiar with them. They track progress in the field of AI across compute, parameters used, dataset sizes, just to see how are things progressing. The way they do this is by analyzing the literature, what kinds of models are coming out, actually pretty similar to the example you gave earlier. If they ask a question like, how much compute was used in a paper, often papers don't say we used x FLOPS. They say we trained this architecture on this type of GPU for two days. And if you want to actually answer that question, you need to do a bit of reasoning. You need to be like, this GPU does this many FLOPS, so what's the answer overall? So as models get smarter, they can answer questions that are one or two hops removed from what's literally in the text.
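The compute estimation example is a small multi-hop calculation: the paper states the hardware and duration, and the reader (or model) supplies a peak throughput and an assumed utilization to get total FLOP. A back-of-the-envelope sketch, with all numbers illustrative rather than taken from any real paper:

```python
SECONDS_PER_DAY = 86_400


def estimate_training_flop(
    num_gpus: int,
    days: float,
    peak_flop_per_sec: float,
    utilization: float = 0.35,  # assumed fraction of peak actually achieved
) -> float:
    # Total FLOP ~= devices * wall-clock seconds * peak throughput * utilization.
    return num_gpus * days * SECONDS_PER_DAY * peak_flop_per_sec * utilization


# "Trained on 8 GPUs with ~300 TFLOP/s peak for 2 days" (illustrative numbers):
flop = estimate_training_flop(num_gpus=8, days=2, peak_flop_per_sec=3e14)
print(f"{flop:.2e} FLOP")  # ~1.45e+20 FLOP
```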
Nathan Labenz (34:29) Okay. That's really interesting, and I'm very curious to know how that works, and maybe it varies depending on the context. But I could imagine at least two different approaches, or even a combination. Like, one would be chain of thought, and I wanted to ask you about chain of thought in general. Another would be, like, code generation and execution. And then, of course, you have, like, chain of thought followed by code generation and execution. Let's start with how that kind of reasoning is being implemented. And then I do wanna dig a little bit deeper into chain of thought as well, because I feel like that's another black box voodoo area where people know that it works, but there's probably a lot of caveats and gotchas there that people should be more aware of.
Andreas Stuhlmuller (35:10) I think chain of thought is quite important to us. So we use chain of thought for basically all questions you ask in Elicit, and if you use the CSV export functionality, it will actually include extra columns that say, here's the reasoning for this column, which I recommend people check out. I think often it's quite instructive. The reason we do this is because transformers obviously have a fixed amount of compute per token, and so there are some questions that they just can't answer. You can fine tune as much as you want, but your transformer is not gonna be able to do arbitrary arithmetic. That's just not possible. I think there are different architectures that people are interested in, or modifications like pause tokens or something, but at least given the current architectures, I think chain of thought is often critical. I think the Epoch example I gave is an instance of that. There, I do recommend that people don't just take the answers, but also, and that's what they do, add specific columns. I think ideally you add Elicit columns that ask about what was the computer architecture or what was the data, ask about the ingredients as well, in addition to just trying to get the model to give the full answer, because sometimes the reasoning is still more complex and might require some human intervention.
Nathan Labenz (36:22) When I do an extra column, am I able to, or is it automatically including, the results of earlier columns in that subsequent column? It seemed like you were suggesting that I can incrementally build up columns, but I hadn't realized that I could compound them in that way.
Andreas Stuhlmuller (36:38) Not in the current published version. We have a prototype where that was possible, and, yeah, excited to ship it as one of those one week features, but it's still work in progress.
Nathan Labenz (36:48) I guess, is there just any more kind of intuition you could help people develop for chain of thought? I've seen lots of stuff that shows that it works better. It's obviously just common sense at this point that it's gonna work better. It's also become the default behavior in the models that people are most accustomed to using. It's funny, though. It's been a little while since I've seen the last version of this, but I've seen quite a few examples from published research where benchmark reports were dramatically understating model performance, or capability, let's say, because the prompt structure had prevented chain of thought. I've seen this with BIG-bench. I've seen it with a couple, like, theory of mind questions, where you set up a few shot thing because that was, like, the way to do it at one point in time, and that's, like, maybe how the benchmark was constructed. But you're jumping straight to the answer, and no wonder you're not finding that GPT-4 is any better than GPT-3, to take one memorable example that I really spent some time on. Okay. Whatever. We should be past that now. But then you see this other stuff from the literature where it's not really super faithful, or sometimes the answers don't really seem to actually depend on the chain of thought. There have been these counterfactual chain of thought examinations where you change the chain of thought, but does it actually change the answer? I find that stuff confusing. And to be honest, I'm still in the mode where I'm like, it seems like it's pretty clearly best practice. It's probably the best I can do in many contexts where I'm working. I've also had really good luck for Waymark specifically training a chain of thought into a fine-tuned 3.5, to really teach it exactly how we want it to work through this particular problem. But, yeah, I still feel like my worldview is pretty fuzzy when it comes to what's really going on with chain of thought and how do I make sure I'm getting the best from it, but not trusting it too much. So any more you could share there, even if it's somewhat speculative, I would love to get up to speed with your thinking.
Andreas Stuhlmuller (38:48) I think it's just fundamentally hard, because people, including me, don't understand what's going on inside of these models. And so anyone who claims they know exactly how chain of thought helps or doesn't help the model, I think, is probably just wrong. I also want to agree with you that I think people probably still are making the obvious mistake of being like, hey model, say yes or no and then justify your answer. That's obviously terrible, because then it has to answer without any reasoning and will just lock itself into a potentially wrong avenue.
Nathan Labenz (39:16) I think this is fixed now, but the first Bard was shipped that way, as far as I could tell, and that was like an eye opening moment. It was like, guys, you've got arguably the best research team in the world here, but we lost something in translation. I do think that is now fixed. But, yeah, I remember seeing that. I was like, oh my god, some of these problems run a little deeper.
Andreas Stuhlmuller (39:36) One thing models should be fine tuned on more, and hopefully that might be in progress, but we'll see, is being able to say, oops, I was wrong. I think if models were more okay with saying, yes, actually, wait, no, I meant no, that would help with that behavior, and it would probably generally be a good thing to do to allow for more, I don't know, truth oriented reasoning as opposed to post hoc justification.
Jungwon Byun (40:03) Tactically, I think it depends on how you structure the chain of thought. Like you guys have been saying, not all chains of thought are created equal. Giving the model the option to say, I don't know and here's why, or, here's the closest thing I could find in the text, or, we don't think it's mentioned, but here's what we looked at and where we expected to see it but didn't find it, is actually still really valuable and makes the model more accurate. A lot of times the problem we have is the answer isn't in the text that we're looking at. Forcing the model to come up with an answer is what leads to hallucination. And even if you don't have the answer, you can still give really helpful information to the user. So giving it these back out options and the most relevant next helpful thing, I think, is both a better product experience and reduces hallucinations.
Andreas Stuhlmuller (40:50) This goes back also to the point about clear interfaces we mentioned very early on. You can define an interface that is, you need to say yes or no, or you can define a different interface, which is, you will get data, and you can choose between saying it contains the answer or it doesn't contain the answer. If it does, here are the things you can say. If it doesn't, here are different ways you can be helpful. Thinking about what are the ways to deploy models that make them as useful as possible is, I think, an important part of what we need to do.
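One way to picture the "clear interface with a back-out option" described here is an extraction schema where "not found, and here is where I looked" is a first-class answer rather than a failure. The prompt format, JSON schema, and `complete` helper below are illustrative assumptions, not Elicit's actual interface.

```python
import json


def complete(prompt: str) -> str:
    """Hypothetical LLM completion wrapper."""
    raise NotImplementedError


EXTRACTION_PROMPT = """\
You will be given the text of a paper and a question.
Reason step by step first, then end with a JSON object of the form:
{{"found": true, "answer": "...", "reasoning": "..."}}
or, if the text does not contain the answer:
{{"found": false, "answer": null, "reasoning": "where you looked and what you expected to see"}}

Question: {question}

Paper text:
{paper_text}
"""


def extract_column(question: str, paper_text: str) -> dict:
    raw = complete(EXTRACTION_PROMPT.format(question=question, paper_text=paper_text))
    # Naive parse: assumes the JSON object comes last and contains no nested braces.
    # Everything before it is chain of thought that can be surfaced to the user
    # as a separate "reasoning" column.
    return json.loads(raw[raw.rfind("{"):])
```

The point is the schema, not the parsing: by making "it isn't in the text" an allowed output, the model isn't forced into a hallucination when the answer simply isn't there.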
Nathan Labenz (41:21) Yeah. You mentioned the pause token. I was also reminded, by the notion of the back out, or the ability to recognize a mistake, of the backspace token, which seems like a step in that direction, although certainly a baby step, probably, in the grand scheme of what's ultimately needed. How big are the datasets that you're creating for this? At Waymark, we find that quality is obviously super important. Scale hasn't been that important for us. Like, low hundreds of examples is pretty good. Then we can throw in a few more when we have another use case or random thing that we wanna tack on. Interested to hear how big your datasets are. And then also interested in, like, how you are doing the fine tuning. At this point, we've got quite a menu. You're going obviously beyond base models, it sounds like. But are you doing instruction tuning, or, like, doing the sort of RLHF or RLAIF with, like, an actual reward model? Or, I personally have not been, like, up close and personal with a DPO project, but that seems to be, like, becoming more popular as well. So, yeah, like, datasets, how big do they need to be, and how do you actually turn that dataset into a purpose built fine tuned model?
Andreas Stuhlmuller (42:32) Yeah. Like you, we found that a few hundred to a few thousand data points is generally all we need. Datasets have been more important to us for evaluation than for fine tuning, which is not to say we don't use fine tuning. We do use it. We also use constitutional AI a bit, where we trained a preference model on deciding which of two summaries is better, where the constitution included things like the summary should be accurate, should be concise, etcetera. And there, yeah, again, I think the dataset was maybe a few hundred to a thousand human judgments.
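As a rough sketch of how a constitutional-AI-style preference dataset like this can be assembled, an LLM judge compares two candidate summaries against a short list of principles, and the resulting labels become training data for a preference model. The principles and prompt wording here are illustrative, not Elicit's actual constitution.

```python
CONSTITUTION = [
    "The summary should be accurate and not exaggerate the paper's findings.",
    "The summary should be concise.",
    "The summary should only make claims supported by the source text.",
]


def complete(prompt: str) -> str:
    """Hypothetical LLM completion wrapper."""
    raise NotImplementedError


def prefer(source_text: str, summary_a: str, summary_b: str) -> str:
    # An LLM judge picks the summary that better satisfies the constitution;
    # the collected (source, A, B, label) rows can then train a preference model.
    principles = "\n".join(f"- {p}" for p in CONSTITUTION)
    verdict = complete(
        f"Principles:\n{principles}\n\n"
        f"Source text:\n{source_text}\n\n"
        f"Summary A:\n{summary_a}\n\nSummary B:\n{summary_b}\n\n"
        "Which summary better satisfies the principles? Answer 'A' or 'B'."
    )
    return "A" if verdict.strip().upper().startswith("A") else "B"


# A few hundred to a thousand such comparisons is roughly the dataset scale
# mentioned above.
```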
Nathan Labenz (43:08) Interesting. So it's not all RLAIF, though. It sounds like it's supervised, kind of instruction tuning first, and then sometimes adding on this RLAIF layer?
Andreas Stuhlmuller (43:21) Yeah. That's correct. And I think the ideal case is we don't have to do any fine tuning. I think we're not in the business of, I don't know, doing ML work or something. Ultimately, what we most care about is deploying AI systems in a way that leads to the highest accuracy, most useful answers to users. I think sometimes that will be fine tuning of different sorts, but if you can avoid it, it's actually better, because you can deploy systems even more quickly. I think increasingly there are situations where, if you have the right prompts, you can just use off the shelf foundation models and, by composing them in the right way, do the tasks.
Nathan Labenz (43:57) Yeah. That's interesting. So you've mentioned Epoch, and I spoke to Tamay not too long ago. I'm thinking I might need to get him on here as a guest too. He asked me a really interesting question, which I'm now gonna ask you. This is more of a prompt, maybe, than a question. What does a scaling law for scaffolding look like? He may have asked you that question too. I'm brought to this because you're suggesting that fine tuning and scaffolding, or structure maybe more generically, are substitutes to a degree. You can also imagine they may be complements in some cases, but it sounds like what you're saying is you guys have built out the structure with enough granularity, and you have enough confidence in it, that you don't need to fine tune as much as I would have naively guessed. And then that leads me to think, multiple people have asked me this scaling law for scaffolding question, and I don't really know what to make of it other than just, obviously, scaffolding is useful. Here, we're getting this sense of some substitutability with fine tuning, but I don't know. What else comes to mind when I say scaling laws for scaffolding?
Jungwon Byun (45:05) Do you mean, like, how does scaffolding scale with larger models or, like, what would enable scaffolding to get better with more compute? Or what exactly is the question?
Nathan Labenz (45:15) Yeah. I don't even know. It's kind of an ill formed question. One sort of scenario that I think about a lot is the fact that we are building out the complements to the core language model. Right? What I often say these days is, we now have AIs that can plan, reason, and use tools. Not necessarily super well. A year ago, they could barely do it at all. Two years ago, they literally couldn't do it at all. So that's, like, happening pretty quickly. And because we now have those, people are, like, obviously racing out to build the, like, planning frameworks and the tool harnesses and all that stuff. And they're putting a lot of elbow grease into that to make it work with the current crop of models. And, generally speaking, like, the sort of free form ad hoc delegation agents don't quite work yet, although you do have the occasional flight booked or whatever. But then I think, jeez, with all that stuff built, if all of a sudden there's a significant model upgrade that goes wide at the same time, then, like, a lot of things that have previously just fallen over might actually already have enough scaffolding to work. So this is, like, outside of your core product development domain, because you guys are obviously building something, like, super structured. And it could work better, but it's not gonna be probably qualitatively different from one model release to the next. But I do see these sort of agent systems that other people are building as, like, potentially ripe for a step change in what they can actually do in the world, just because all of the sort of surrounding stuff is already there. Then it's boom, GPT-4.5. You know, is that today? Well, check your watch. All of a sudden, things might start to work a lot better. So I'm taking us a little bit far afield here, but the mention of Epoch and the kind of analysis that they're doing, as well as this notion of some substitutability between fine tuning and surrounding structure, at least caused me to ask the question.
Andreas Stuhlmuller (47:19) Yeah. I think it's a great question. I'm also very interested in it. The most basic example is the thing we mentioned earlier, chain of thought. So, like you said, people underestimated how good GPT-4 is because they didn't use chain of thought, and chain of thought is obviously, like, the very, very basic way of scaffolding. And so then that raises the more general question, which is, how much can you augment language model capabilities just using better scaffolding? Various types of decomposition, debate, and amplification are all ways of scaffolding. Some of these are starting to work better with better models. I think both debate and amplification, kind of decomposition into sub questions, did not work super well with GPT-3 level models, and are starting to work much better now. I don't think it's very well understood what the shape of that landscape is, so I'd be excited to see more work on that. I wanted to address one other thing you said, about maybe these other agent systems, that there will be a step change. A key thing for Elicit is to always think ahead to the future. Don't build just for the present, because the situation where Elicit is going to matter the most is the situation where we are moving closer to AGI and actually need to use these AI systems to make really impactful decisions. We've been working on a more scalable infrastructure, both in the notebook setting, where we can let agents take actions similar to how humans take actions, but also scaling to larger kind of task batches. Internally, we've been calling this the exascale engine. Right now people use models at fairly small scale, run them over tens to hundreds of papers and extract information from those, but that's obviously not where things are going to stop. Ultimately, we want models to do reasoning across tens of thousands, hundreds of thousands of papers, or other kind of similarly shaped large inputs. I think that does require infrastructure that doesn't really exist right now, kind of an infrastructure for taking these unreliable pieces, running them at scale, using models to supervise other models, and eventually getting high quality answers without having everything run in, like, a single kind of huge black box with opaque weights. So we are thinking a lot about that future.
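The internals of the "exascale engine" aren't spelled out here, but the shape of the problem Andreas describes, running a small checkable task over very many papers and then aggregating, can be sketched generically. This is only an illustration of that pattern under assumed names, not Elicit's infrastructure.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List


def map_over_papers(
    papers: Iterable[str],
    task: Callable[[str], dict],
    max_workers: int = 32,
) -> List[dict]:
    # Run one small, independently checkable task per paper in parallel.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(task, papers))


def aggregate(results: List[dict]) -> dict:
    # Placeholder aggregation; in a fuller pipeline this is where models could
    # check and reconcile other models' outputs before producing an answer.
    found = [r for r in results if r.get("found")]
    return {"n_papers": len(results), "n_with_answer": len(found)}
```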
Nathan Labenz (49:39) So it seems like you're creating an equivalence there between scaffolding and just bringing more compute to bear. You're translating all this structure into more compute from more angles. Right? You can go wider and handle more stuff. You can also have models check models. But with a lot of that stuff, it's not just that it's obviously complicated; the fundamental currency there still seems to be compute, and the scaffolding is a way to organize it.
Andreas Stuhlmuller (50:13) Yeah. That's exactly right. I think the key question is: how do we, almost as a society, want to turn compute into more work? One answer is that we're going to train larger and larger models and hope for the best. Maybe we'll augment that a little bit with better interpretability methods. We are working towards a different answer, which is that more compute will in fact lead to great things and more correct answers, etcetera, but we can get there through more transparent architectures if we can build the right infrastructure.
Nathan Labenz (50:42) The obvious next question is what does that infrastructure look like? What is the frontier of that today? Obviously, there's a ton of papers out there in the literature. Even without being fully web scale, it's still a very large database that you have to search against. I'm interested to know what that stack looks like. And then I hadn't really thought too much in the past about this, but I understand it is either maybe already happening or definitely makes sense as an opportunity: the ability to somehow make use of a large corporate internal body of knowledge. Right? A pharma company is obviously gonna run a bazillion experiments and have a billion internal reports. And I guess there are probably a lot of interesting challenges around that, including where they will allow their data to go and live, and probably also different standards of quality. Does something have to be an official internal report before it would be included, or would you go mine notebooks going back years that may just be messy and never really meant for other people to see? So, yeah, the more I talk my way around it, the more I realize there's probably a lot of infrastructure just in that first bit, where you ask your question and get to the results that are then processed in this structured form. So what can you tell us about that first bit of the product experience?
Andreas Stuhlmuller (52:02) Yeah. I guess there's what is happening right now and what we would like to happen in the future. Maybe starting with what's happening right now. I can only cover it at a high level, but roughly: you go to Elicit and you enter your question. What happens next? Let's say your question is, I want to find studies on creatine for cognition after 2020, please give me RCTs, or something like that. The first thing that happens is we take your search and automatically parse it into its different components. There's the search component, I want to find studies on creatine for cognition, but there are also the intentions you expressed about what filters to use, like filter down to RCTs only, published after this year, and we extract those into an API call that has them as separate bits. We pass it off to our search engine, which is a hybrid semantic and, I don't know, lexical or filter-based engine, so that it can both find the relevant documents and also use those additional intentions you expressed to narrow down to the most relevant papers. You'll probably get something on the order of 500 papers initially that are our strongest candidates for what could be relevant to you. Then we go through a sequence of re-ranking stages that are increasingly more expensive, using things like MonoT5, cross-encoders, maybe a few heuristics around recency or citation count, and then get to the final list of papers, which we use to generate a summary for you. Even at that final step, when the model is looking at the papers to generate your summary, it can still decide to ignore some papers if they're not relevant, so there are many stages where filtering and processing happen.
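For readers who think in code, here is a rough sketch of the staged pipeline Andreas describes: parse the query into semantic search text plus structured filters, pull a broad candidate pool, then re-rank with increasingly expensive scorers before summarizing. Every function body here is a hypothetical placeholder; this is not Elicit's actual code.

```python
# Illustrative pipeline sketch: query parsing -> hybrid retrieval -> cascaded re-ranking -> summary.
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Paper:
    title: str
    year: int
    citations: int
    text: str

@dataclass
class ParsedQuery:
    search_text: str                                           # e.g. "creatine for cognition"
    filters: Dict[str, object] = field(default_factory=dict)   # e.g. {"study_type": "RCT", "min_year": 2020}

def parse_query(q: str) -> ParsedQuery: ...        # LLM splits the question into search text + filters
def hybrid_search(pq: ParsedQuery, limit: int) -> List[Paper]: ...   # semantic + lexical/filtered retrieval
def cheap_score(pq: ParsedQuery, p: Paper) -> float: ...             # MonoT5-style pointwise relevance
def expensive_score(pq: ParsedQuery, p: Paper) -> float: ...         # cross-encoder relevance
def summarize(pq: ParsedQuery, papers: List[Paper]) -> str: ...      # final LLM pass; may still drop papers

def run_pipeline(user_query: str) -> str:
    pq = parse_query(user_query)
    candidates = hybrid_search(pq, limit=500)
    # Cheap re-ranker first, keep only the top slice...
    candidates = sorted(candidates, key=lambda p: cheap_score(pq, p), reverse=True)[:100]
    # ...then a more expensive cross-encoder, with simple heuristics like recency mixed in.
    candidates = sorted(
        candidates,
        key=lambda p: expensive_score(pq, p) + 0.1 * (p.year >= 2022),
        reverse=True,
    )[:20]
    return summarize(pq, candidates)
```

The design point is the cascade: spend almost nothing per paper on the first 500 candidates, and reserve the expensive models for the handful that survive each stage.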
Nathan Labenz (53:48) Cool. So that was all today?
Andreas Stuhlmuller (53:49) Yeah, so that's all today. The entire process I described obviously has to happen pretty quickly, because you're there in your browser waiting for the results, but there are many instances where you would actually be fine waiting longer for a better result. If you're going to write a systematic review, people have worked on these things for nine months or something, which is obviously crazy and needs to be sped up, but anyway, on that kind of timeline you would be happy to wait for an hour if you could get really great results, and you would probably also be happy to pay $100 or something of that sort. Right now that's not possible, because Elicit is tied to your browser session. You close the tab, it's going to go away. So there's some infrastructure that we're building right now where we're decoupling the execution of Elicit from your browser session, so that you can tell it: use more compute to investigate all of these papers in more detail, do more filtering, I'm willing to wait longer, I'm willing to spend more money. I'll close the tab, I'll come back later, and maybe I'll have a progress bar or something, but the ML work in the background is happening separately from what you're looking at. So it's a lot more like outsourcing things to a human research assistant that goes off, does things, and comes back to you.
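A minimal sketch of that decoupled, job-style pattern: submit a long-running research task with explicit budget and time limits, close the tab, and poll for progress later. The API shape here is an illustrative assumption, not Elicit's.

```python
# Illustrative background-job pattern: submit a deep-research run, poll for progress later.
import uuid
from dataclasses import dataclass
from typing import Dict, Optional

@dataclass
class Job:
    job_id: str
    status: str = "queued"        # queued -> running -> done
    progress: float = 0.0         # fraction of papers screened so far
    result: Optional[str] = None  # final report, once available

JOBS: Dict[str, Job] = {}

def submit_job(question: str, max_cost_usd: float, max_hours: float) -> str:
    """Enqueue a deep-research run with explicit budget and time limits."""
    job_id = uuid.uuid4().hex
    JOBS[job_id] = Job(job_id=job_id)
    # In a real system a separate worker would pick this up, run the full
    # search/extraction pipeline, and update progress as it goes.
    return job_id

def poll_job(job_id: str) -> Job:
    """Called when the user reopens the tab: render a progress bar or the result."""
    return JOBS[job_id]
```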
Nathan Labenz (55:04) Cool. I strongly feel that people in the AI app space generally put too much emphasis on speed and cost. I've been banging this drum a little bit lately: what matters to me most is success on high-value use cases. So I think that definitely makes a ton of sense, and I'm excited to see that future version start to come to life. I wanna get into the business side and the users a little bit more, and you guys have also done this creation of a for-profit entity related to the nonprofit entity. But one more question just on product philosophy. Obviously, the whole thing is really motivated by the notion of: let's keep these things under control by putting them in very prescribed contexts and making them do very specific things, or transparent architectures, to use your phrase. Is there more that you think about, or do you feel like that is enough? One of the big things that people have worried about a lot over the last year is what happens if a language model can help you build a novel pandemic agent. And at first I was like, yeah, listen, you probably don't have to worry too much about wrongthink or political speech or whatever. But then I was like, oh, but actually, the number one thing for a novel pandemic agent is probably right in Elicit's wheelhouse. Not to give anybody ideas, but you guys have probably built the tool that would be the most helpful for something like that. So how are you thinking about that kind of question? Are you putting filters in place to try to identify when people are asking bad questions? I'm sure you've got more; I won't even try to guess. Just tell me what that looks like right now.
Andreas Stuhlmuller (56:48) In general, when we think about safety, we think about both long-term safety and misuse. I think you're addressing the misuse point here, and from what we've seen so far, it's not a big concern, but obviously as we scale things up and let models do more work, it will become increasingly important. If there are use cases that are on the borderline, could be useful for someone who's trying to defend against a pandemic, but could also be useful for someone who's trying to create one, I think either accounts will have to go through a specific review to use Elicit for those use cases, or otherwise they'll have to use a version of Elicit that just doesn't support them. There is initially a question around what is the best way to make that happen, which I think many people in the industry are going to face, and I don't know that we will reinvent the wheel on our own here. But to just throw out one idea, there's been a recent paper by Dan Hendrycks and others on wiping information about, for example, weapons of mass destruction. I think they called it something like the Weapons of Mass Destruction Proxy benchmark, where you're trying to figure out whether you can make models unlearn the most problematic knowledge. I think it's probably important that in cases where people don't want to go through some account review, you can't just ask the model, hey, make me such a weapon, and it goes, hey, here's your weapon, good luck.
Jungwon Byun (58:12) Yeah. I think in the past we've talked about applying these factored verification or evaluation techniques to monitoring the kinds of behaviors on Elicit as well. Or, if you do have this extensive log of your research progress, having models deployed to evaluate not just whether something is hallucinated, but whether there is some kind of negative impact of this research. I think it's actually quite related to things users want as well. They also care a lot about the social impact of their research and working on things that are impactful. So it's entirely possible that our users pull us towards building capabilities in Elicit that help them evaluate the impact and the different consequences of research.
Nathan Labenz (58:52) The other thing that I've experienced recently that has changed my thinking on this a bit is using Pi from Inflection. I red team everything that I come across, just because it's two birds with one stone: I wanna try new products and see what the technology can do, and I'm also just curious as to the state of play. Have people put any safeguards in here? I would say right now, broadly, it's the super wild west, and most application developers are not putting in really any safeguards. A lot of them are fine-tuning on their use cases, and at least what I interpret as likely happening is that in fine-tuning, they are removing whatever guardrails the Llama 2 base version originally had; those just get wiped out in the fine-tuning process. So I very rarely get refused things. But Pi is the one thing that I've not really ever been able to jailbreak, and it seems like it has a very good model. I don't know that it's a different model; I think it's probably the same model. But it's very good at recognizing when the user is becoming deranged. And so I've started to think that the model outputs are one thing you might wanna filter, but also just asking, does the user seem well, is something that the models are actually pretty good at. And I get these responses from Pi in my attempts to get it to do something out of bounds that are like, whoa, buddy, you sound like you're getting pretty worked up here. But then it's talking me down, and it really does have a pretty impressive EQ. Again, I'm a little bit off the main line of discussion here, but there is something there that is not purely analytical. I've found in many cases, even if there are pretty good filters, you can get around them. But Pi has brought something new to the table with a sort of emotional awareness, where I think it may, in many cases, be easier to detect that the person seems not well, as opposed to detecting that this content is inherently problematic, because so many things are obviously dual use, and it's just confusing.
Jungwon Byun (1:01:00) Yeah. And, actually, I think that's another good example of a place where doing something proactively well that is good for your product can also help you defend against misuse. We often want to understand the intent behind our users' queries and the intent behind their searches in general, because that makes it significantly easier to give them relevant results. So I think that will naturally pull us towards better understanding: what is the user really trying to do here? In 99.99 percent of cases, we'll use that capability to just deliver them a better Elicit experience. But in the rare cases, that same capability might help us be like, wait, this person is maybe trying to do something sketchy and we should do something about that.
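A hedged sketch of that dual-use idea: the same intent-classification step that sharpens relevance can also flag queries that warrant review. The labels, prompt wording, and `llm` helper are assumptions for illustration, not a description of Elicit's system.

```python
# Illustrative query-intent classifier used both for relevance and for misuse gating.
from typing import Dict

def llm(prompt: str) -> str:
    raise NotImplementedError  # hypothetical completion call

LABELS = ["literature_review", "data_extraction", "methods_question", "potentially_harmful"]

def classify_intent(query: str) -> Dict[str, object]:
    label = llm(
        "Classify the research intent of this query as one of: "
        + ", ".join(LABELS)
        + f".\nQuery: {query}\nAnswer with the label only."
    ).strip()
    return {"query": query, "intent": label, "needs_review": label == "potentially_harmful"}

def handle(query: str):
    intent = classify_intent(query)
    if intent["needs_review"]:
        return "This request needs account review before it can run."
    # Otherwise the intent label feeds back into search to sharpen relevance.
    return intent
```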
Nathan Labenz (1:01:39) Yeah. Expanding or transforming that initial question, it seems like you may already be doing that to some degree under the hood, but I see it as an emerging trend too. We just did an episode with Exa, which does that as default behavior. And now Anthropic just put out their sort of meta-prompt creator as well, where you throw in your haphazardly created prompt and it adds structure and turns it into a template for you. But I do think there's a lot of opportunity there, and, as you're saying, that's a make-the-product-better thing, but you can also build into that phase a little bit of a gating mechanism for making sure this is the kind of prompt that we wanna be expanding on. Okay. Cool. So this has been great. I really appreciate how much information you've been willing to share about how this works. I think people will learn a lot from it, and hopefully many will build better apps as a result. Let's just talk about the business a little bit. The last time we talked, I think bio, biomedicine, biotech, pharma was the most intensive use case. Sounds like that's probably still the case. I'd be interested to hear how the user base is evolving, maybe anything you wanna share about patterns of use. As an, again, extremely small-time investor, I definitely took note of the phrase earlier that one of the key things you're looking to do is help people translate more budget into better results. So I'm curious as to how much that's already starting to happen. Lots there, but tell us what sorts of trends you're seeing and how the business is going.
Jungwon Byun (1:03:07) Yeah. I would say biomedicine is still one of the top domains and one that we focus on for a lot of reasons, because it's a place where there's more willingness to convert budget into accuracy. They have fairly well-established and very much batch- and power-user workflows that I think are really good to start automating and generalizing to other domains. But honestly, even from the beginning and still now, it really is quite distributed. I think there's been a lot more growth in engineering and the humanities. Originally it was biomed and then ML or computer science as the two main categories; maybe they made up like 50% of our users. But now I think there is a more even split, and I was surprised to notice a lot more growth on the humanities side and on the engineering side, just general mechanical, chemical, environmental engineering as well. I think those are also going to be really interesting domains for us to keep exploring. And yeah, we are focused on these batch users. We've referenced systematic reviews and meta-analyses a lot. That's been the use case that inspired all the columns and this tabular organization of information from the beginning. Those are the users who are most often trying to process thousands up to millions of papers. I would really love to see more language model products pushing in that direction, into those batch workflows, not just these more casual or shallow chat-based interactions. I think there is definitely a place for that, but a lot of the value of these tools comes from being able to process so much more information than a person can. Usually the status quo is hiring a person to manually go through a lot of this work, or they have a team. Often people don't wanna do this; even the people who are hired to do it find it rote and demoralizing. So we've worked with teams that are willing to spend thousands and tens of thousands of dollars on these projects. And even at an individual level, there are people who are willing to spend hundreds of dollars, just an individual consumer, because there's a very clear value add and objective kind of gain. So, yeah, that's been really cool, and we wanna keep pushing into that.
Nathan Labenz (1:05:12) So the main driver of cost to the user, value to the user, and obviously revenue to the company is depth of processing. Is that the main takeaway? Expanding the number of results is the biggest thing?
Jungwon Byun (1:05:27) It goes in multiple dimensions. I think the largest lever is the number of papers considered and how comprehensive you wanna be. But there's also accuracy: how much are you doing a shallow pass, just a rough 80/20 of how relevant this paper is, because you're gonna read the papers later anyway, versus how much are you going to take the result of this extraction and put it in a database somewhere that's gonna feed into another analysis, where it's actually really important to get it super right. So that's another dimension. And then maybe another component is how much data you extract, or how deep you go into the analysis of any one paper.
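As a back-of-envelope illustration of those dimensions, papers considered, screening care, and extraction depth, here is a toy cost model. The per-call price and the call counts are made-up placeholders, not Elicit's pricing.

```python
# Toy cost model: one call per screening pass plus one per extracted field, per paper.
def estimate_cost(n_papers: int, passes_per_paper: int, fields_extracted: int,
                  cost_per_llm_call: float = 0.01) -> float:
    calls = n_papers * (passes_per_paper + fields_extracted)
    return calls * cost_per_llm_call

# Shallow screening of 500 papers vs. careful extraction across 10,000:
print(estimate_cost(500, passes_per_paper=1, fields_extracted=2))      # 15.0
print(estimate_cost(10_000, passes_per_paper=3, fields_extracted=10))  # 1300.0
```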
Nathan Labenz (1:06:05) I'm gonna have to think about what I wanna spend $100 on in Elicit before long. Maybe talk me through that a little bit more, because I do have some of these questions that I would definitely be willing to spend more on than I have so far on the product. Recent things I've been interested in, I think I mentioned a couple of them. Mixture of experts: I'd love to know everything there is to know about mixture of experts. One thing I've found, at least with the kinds of questions I'm asking, maybe it doesn't cause a problem, but it feels a little bit different from some of the examples you've been highlighting, is that I wanna know everything about this kind of cluster of things. And it seems like a lot of times the machine learning papers answer one aspect of it; they'll study kind of one thing. And in many cases, from what I've seen, there's one or maybe just a couple of papers that have even addressed that sort of thing. I'm also really interested in curriculum learning. And what I see is, in mixture of experts, there's a bunch of different techniques, and over in curriculum learning, there are probably 20 different papers at this point that are like, here's one course in a broader curriculum that we studied, and we found that including this in pre-training, or pre-training on it or whatever, led to better downstream performance. And then I think I've only seen one that was, like, curriculum learning meets mixture of experts. So I don't necessarily know that there is the depth in the way that there might be if somebody's studying, like, weight loss or whatever, where there's obviously gonna be infinite studies published. Here there aren't necessarily so many. But how should I think about translating dollars or compute into the fastest improvement in my worldview that I can get out of the product?
Andreas Stuhlmuller (1:07:54) I think that might partially hinge on things that are still baking in the Elicit office. I think what you want is something like: first find the 20-ish papers, say, on curriculum learning, and then reorganize the information in those papers to make it as easy as possible for you to understand how this field works. That is something we're actively working on. I think you don't necessarily want things organized by paper. You instead care about what the different types of curriculum learning are, and for each of those types, what's the evidence for how well it works, or what models it applies to, or whatever else. That's a pivot on the Elicit table that you're currently seeing. So stay tuned; I think we will soon give you new ways to turn money into knowledge.
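A small sketch of that pivot: instead of one row per paper, group extracted findings by concept so each row is a type of curriculum learning with its supporting papers. The data shapes and example values here are illustrative assumptions.

```python
# Illustrative "pivot by concept": turn per-paper extractions into per-concept rows.
from collections import defaultdict
from typing import Dict, List

# One extraction record per paper, e.g. the output of a column-extraction step.
extractions = [
    {"paper": "Paper A", "concept": "difficulty-ordered curricula", "finding": "faster convergence"},
    {"paper": "Paper B", "concept": "difficulty-ordered curricula", "finding": "better downstream accuracy"},
    {"paper": "Paper C", "concept": "expert-routing curricula", "finding": "only one study so far"},
]

def pivot_by_concept(rows: List[Dict[str, str]]) -> Dict[str, List[Dict[str, str]]]:
    by_concept: Dict[str, List[Dict[str, str]]] = defaultdict(list)
    for row in rows:
        by_concept[row["concept"]].append({"paper": row["paper"], "finding": row["finding"]})
    return dict(by_concept)

# Each concept becomes a row; the papers populate its evidence cells.
for concept, evidence in pivot_by_concept(extractions).items():
    print(concept, "->", evidence)
```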
Nathan Labenz (1:08:45) That is definitely very interesting, and I can see ultimately, this sort of means, like, a row would have a different kind of meaning as opposed to a paper being a row. A different concept could be a row, and papers could populate cells downstream of that. Put me on the early access list. Put me on the waitlist.
Andreas Stuhlmuller (1:09:03) There's a very basic version of this in Elicit right now called list of concepts, which is a workflow that searches through the literature and then tries to organize things by concept. I think people should try it. I'm often surprised by how well it works. It is trying to be more comprehensive than the baseline Elicit search and also takes longer to run, but it's still only a precursor of the deeper concept-based functionality that we're working on.
Nathan Labenz (1:09:28) Cool. How do you think about users? This is not a consumer app, right? It's a professional tool. Do you have any trouble with habit formation? That's a huge problem in apps in general: people try it, they don't come back, and if they don't come back every day, then they don't come back at all. Maybe this is just such a high-value tool without substitutes that it's not really an issue. But how do you think about frequency of use? Again, I'm on the side that cost and latency, and maybe even frequency of use, are overrated. I would wait a week to get the exact report that really helps me understand mixture of experts or curriculum learning or whatever other topic.
Nathan Labenz (1:10:09) Like, it's not that it has to be today. But, yeah, from a business standpoint, I could see that being a problem if use is so infrequent that people maybe forget that you exist from one opportunity to use the product to the next.
Jungwon Byun (1:10:22) I think it is different from a pure consumer app. We certainly have people who just use it for personal use cases in a more periodic way. I think they still are very engaged with us as a company, probably because we launch these features every week; most of them are still aware of us even if they don't have everyday use cases for it. But yeah, from a prioritization perspective, we're much more focused on the high-value, really intense batch use cases where there's basically no alternative, and you're working on a project for a while, you're really scoping it out, and you have clear criteria that you're going to evaluate the different solutions against.
Nathan Labenz (1:10:57) So I've had this idea for a while of the AI bundle, which is kinda like your cable bundle, in theory, where you have a single subscription that gives you access to a lot of stuff. And presumably it would be more than you can get with the free trial, probably less than a full subscription. Obviously, in the unbounded research case, you would have to have some bounds on what you would offer to people who come in as part of the AI bundle subscription. That's my kind of concept there, and I actually just got contacted by somebody who was like, I'm working on this, so there may even be an opportunity to make it real at some point. The idea is that, from the application developer side, you don't have to give away so many tokens and harm your margins or be forced into weird pricing dynamics. And then for the user, it's more frictionless: I get to really use all these things, and if I really wanna be a power user, I can upgrade. So I guess my question is, would you like to be part of my AI bundle?
Jungwon Byun (1:11:55) I think it really depends on the details.
Nathan Labenz (1:11:57) Yeah. Just tell me more. What comes to mind?
Jungwon Byun (1:11:59) It depends: who is the target of this bundle? What else would we be bundled with? How confident do I feel that these users will use Elicit, or is this actually totally outside of our target market? What do the finances look like? What do the pricing and margins look like? What's the operational cost involved for us? How much of the customer relationship are we giving up? I think that's the big question with a lot of aggregators. Aggregators are very common in other industries; I came from fintech, and they're really big in finance, for credit card comparisons, loan comparisons. And I think one of the big questions there was always: who owns the customer relationship? We'd have to think through all those things.
Nathan Labenz (1:12:33) That's a pretty good rundown. I think the vision would be that folks would sign in, almost like an OAuth kind of thing, so you would have a sense of who they are. They wouldn't necessarily be establishing a billing relationship with you immediately; that would be the point, that that friction would be reduced. In terms of how the finances work, I don't know, I'm not actually doing this, but the two models that come to mind would be, one, prorated based on use, where all the revenue gets split up and divvied up based on who's doing what. That could be pretty challenging, just because it's hard to know exactly what's going on and whose tokens are whose; it's hard to equate different things. Another version would be periodic re-establishing of a certain percentage out of the pie, which is how the cable bundle works. Right? Every year, or however often the contracts come up for renewal, ESPN's gonna get however much out of your bundle and Discovery Channel gets however much. And that can go up and down depending on how popular those channels are and relative negotiating position. But my understanding is it doesn't matter how much I actually watch any of those channels on a month-to-month basis; the channels are just gonna get what they're gonna get from each subscription. Yeah, I think that could definitely be interesting. I think Elicit is probably a bit outside of the core target for this. Then again, it also could be a way to reach a lot of people that might otherwise never even come across the product in the first place. So that's obviously a big part of it too: you get in the bundle, you can get discovered, and then a lot of things could happen from there. Again, to be clear, I'm not doing this, but it's become slightly more real recently as somebody reached out to me and said that they actually might want to do it. How about your story of the nonprofit origin to now creating a for-profit? How's your board holding up? How's the governance? I'm not sure there's really all that much interesting there; I suppose the rationale for doing something like this is probably pretty intuitive. But since people are paying attention to that kind of story, I'd be interested to hear your version of it.
Jungwon Byun (1:14:32) No, we don't have, like, plenty of different entities. I think the general nonprofit, or even academic, spin-out to commercial venture is actually quite common, and was popular long before OpenAI. This is common in biotech, and universities have tech transfer offices that help with it. So the idea of doing basic research freely without commercial constraints, and then, if there's a commercial opportunity, separately setting up the best vehicle for scaling that impact, is pretty well established and makes a lot of sense to me. We got a lot of external advice. We did talk to the OpenAIs and the Anthropics early last year, and we've had great nonprofit lawyers who guided us through it. The former nonprofit is run by an independent board, and it was that independent board that oversaw all of those decisions. We were pretty careful to get very independent valuations of everything and overall really ensured that the nonprofit was significantly better off. And the original motivation actually came from our philanthropic donors and our board; they were the ones who provided that initial impetus. So governance-wise, I think that was pretty straightforward. And then I think the other thing that really helps is that Elicit is now a public benefit corporation, which again makes it clear to the world that we care about impact, definitely financially, but also for the world at large. And we're opinionated about how we're going to make an insane amount of money: there are lots of ways to make money, and we're going to do the version that's really good for the world in a lasting way, and I think that's totally possible. So for us, all of the different things we care about actually converge really nicely, which isn't always the case. The mission doesn't trade off with financial success. For us to be successful as a product, we have to be very accurate, we have to be a reliable, trustworthy product, we have to be good at really complex reasoning, we have to add value for things like research. So our stakeholders, our financial success, our mission, all of that is very aligned. It's really nice; it feels very convergent.
Nathan Labenz (1:16:32) Is there still a nonprofit team that's distinct from the core product team, or is the whole team united in the commercial mission at this point?
Jungwon Byun (1:16:45) All of the employees moved over to the commercial side, and the nonprofit is just run by the independent board. So they're the ones who oversee it.
Nathan Labenz (1:16:53) How about your hiring needs? You guys have raised a round, so you've got resources, and you're also scaling revenue at an admirable clip. What are you looking for to continue to build the team?
Andreas Stuhlmuller (1:17:04) Yeah. Hiring is always super important to us. As you mentioned, users, revenue, everything is growing. The most important function right now to me, and I'll let Jungwon speak for herself also, is senior software engineers looking to get into AI. Not even necessarily machine learning engineers, although we're also excited about those, but people who have deep expertise building scalable, modular, well-architected systems. Python and TypeScript are super relevant, both front end and back end. We need to build the machinery for the AI factory on the back end and the control structures on the front end, and both of these need people who have been in software engineering for a while, not just done a bootcamp. I think it's a really exciting space: you're trying to build reliable, robust systems out of somewhat unreliable pieces. Elicit has really high talent density; there are people from Google, Stripe, Square, and others. So I think it's really exciting, and I really encourage people to join us.
Jungwon Byun (1:18:00) Yeah. I feel like we did a good job covering some of the technical side, both what the team, and we're a very small team of 12 people, has built today, and also all of the challenges and the really exciting technical opportunities, like the exascale engine, unboundedness, and running language models over millions of papers at really high accuracy. Hopefully, all the things we covered today give a sense of the very exciting technical opportunities that lie ahead. And then on the non-technical side, I'm looking for product designers who are excited about building these evaluative interfaces with language models, interfaces that help the user even if they don't know what they want, and even if the language model doesn't have all the information they need. We're also definitely looking for people in product marketing and go-to-market, as well as on the eval side. We're starting to think about hiring data scientists to marry some of the offline evaluations and analysis we're doing with real-time evaluation and feedback; I think there's a lot of potential there. So, hiring across the board. We have big visions of transforming research, and there might not be a lot of time left before we really need to make it useful for very high-stakes decisions in times of chaos. So, yeah, hiring eagerly.
Nathan Labenz (1:19:13) Cool. That's a compelling pitch, and, yeah, I can't rule out that we could be entering some choppy waters pretty soon. This has been phenomenal, guys. Anything else you wanna touch on before we wrap?
Andreas Stuhlmuller (1:19:24) No. This was great.
Nathan Labenz (1:19:25) Andreas Stuhlmuller and Jungwon Byun, founders of Elicit, thank you for being part of the Cognitive Revolution.
Jungwon Byun (1:19:32) Thanks, Nathan.
Nathan Labenz (1:19:34) It is both energizing and enlightening to hear why people listen and learn what they value about the show. So please don't hesitate to reach out via email at tcr@turpentine.co or you can DM me on the social media platform of your choice.