Keeping the AI Revolution on the Rails with Shreya Rajpal of Guardrails AI
Nathan Labenz interviews Shreya Rajpal about Guardrails AI, a Python library that adds output validation and correction to create a reliable interface between language models and deterministic software systems.
Watch Episode Here
Video Description
Nathan Labenz sits down with Shreya Rajpal, the creator of Guardrails AI, a new Python library that allows developers to add a layer of output validation and correction to their code. Practically, Guardrails can ensure a reliable interface between language models and more traditional deterministic software systems. At the same time, in a more mind-bending and potentially risky use case, frameworks like Guardrails allow developers to ask and answer entirely new kinds of questions.
Talking to Shreya really reinforced just how early we are in LLMs' impact on the software industry. We were introduced to Shreya's work when recording our interview with Matt Welsh, the CEO of Fixie.ai (featured in Ep 19: https://www.youtube.com/watch?v=MHmd8hmpUMA&t=293s).
LINKS:
Guardrails AI: https://shreyar.github.io/guardrails/
RECOMMENDED PODCASTS:
Upstream: @UpstreamwithErikTorenberg
TIMESTAMPS:
(00:00) Episode preview
(05:00) Why Shreya built Guardrails AI
(08:33) Common ways LLMs can “go off the rails” and how Guardrails can correct them
(13:45) Discussion of validators
(15:31) Sponsor: Omneky
(18:48) Business and creative use cases of Guardrails AI
(25:00) What can be achieved by Guardrails AI that cannot be achieved by traditional code
(32:44) How agents work vs how Guardrails works
(35:34) AI as shepherd vs delegating to AI and the role of human understanding
(39:54) Trust deficit and risks
(46:33) Is it realistic to imagine GPT-4 using Guardrails?
(52:00) How Shreya thinks about security
(57:02) How Shreya thinks about embeddings
(1:05:05) How Shreya thinks about problem-solving with LLMs
(1:07:50) Shreya on OpenAI’s Evals Library, Anthropic’s Constitutional AI
(1:12:55) Discussion of determinism
(1:15:00) Recommendations for developers to minimize overhead
(1:26:00) Predictions about providers in the LLM space
(1:29:30) Shreya’s favorite AI tools
(1:30:00) Would Shreya get a Neuralink implant?
(1:31:50) Biggest hopes and fears for AI
TWITTER:
@CogRev_Podcast
@ShreyaR (Shreya)
@labenz (Nathan)
@eriktorenberg (Erik)
Thank you Omneky for sponsoring The Cognitive Revolution. Omneky is an omnichannel creative generation platform that lets you launch hundreds of thousands of ad iterations that actually work, customized across all platforms, with a click of a button. Omneky combines generative AI and real-time advertising data. Mention "Cog Rev" for 10% off.
Music Credit: MusicLM
More show notes and reading material released in our Substack: https://cognitiverevolution.substack.com
Full Transcript
Transcript
Shreya Rajpal: 0:00 It's kind of insane to see just the amount of activity and excitement around the space. There are people training and fine-tuning deep learning models who weren't even in this space a few months ago. And that's really awesome, right? And I had people who were, oh, I really like Guardrails, I really like OpenAI, but it's just too expensive for what I'm trying to build. And so can you make this work with an open source model, as an example? So I do think that we're going to see a lot of that proliferation happening of great-performing models at different price points, different latencies, different providers, etcetera.
Nathan Labenz: 0:34 Hello, and welcome to the Cognitive Revolution, where we interview visionary researchers, entrepreneurs, and builders working on the frontier of artificial intelligence. Each week, we'll explore their revolutionary ideas, and together we'll build a picture of how AI technology will transform work, life, and society in the coming years. I'm Nathan Labenz, joined by my cohost, Erik Torenberg. Hello, and welcome back to the Cognitive Revolution. Today, my guest is Shreya Rajpal, a former machine learning engineer at Apple and founding engineer at Predibase, who is now best known as the creator of Guardrails AI, a new Python library that allows developers to add a layer of output validation and correction to their code. As anyone who has spent time building AI-powered products over the last two years will attest, validation and, even more so, reliability are key challenges. LLMs simply don't always follow instructions and sometimes go entirely off the rails. Better models have helped tremendously. It's true. GPT-4 can follow instructions far more reliably than earlier models, and Claude v1.3 is also very impressive. But with this elevated capability also comes expanded developer ambition. And so it seems that for the foreseeable future, the problem of LLM reliability will remain both critical and ubiquitous. Shreya's work tackles this problem in many ways and at multiple levels. Super practically, Guardrails can ensure a reliable interface between language models and more traditional deterministic software systems. Validations like, did the language model return data with the right type and format? Or did the model choose a value from the list of allowable values that we provided? These are very familiar questions for developers and are still powered in Guardrails by traditional code. But at the same time, and for me this is clearly the more novel, exciting, mind-bending, and potentially risky use case, frameworks like Guardrails allow developers to ask and answer entirely new kinds of questions. Assessments of things like the quality of a summary or translation, or whether a given piece of text contains any redundancies, inconsistencies, or gaps in logic. These are the sort of things that until recently, developers simply had no way to validate. Thus, for product and engineering teams around the world, Guardrails is both a solution to a very practical problem at hand and a sort of introduction or bridge to an emerging paradigm of AI-first software development that goes well beyond the Copilot-style autocomplete or even chat interfaces that we've recently seen and begins to use AI functions not just as development tools, but as components of the production technology stack itself. Talking to Shreya really reinforced for me just how early we are in LLMs' impact on the software industry. The core AI capabilities needed to transform software, as far as I can tell, mostly already exist. What remains, though, is the work of reimagining not only how software is built, but how it functions now that intelligence can be baked in at any point. Personally, I believe that this paradigm shift could ultimately unlock bigger productivity gains and more user value than the current generation of tools that increase developer speed but don't yet attempt to change the kind of software that they're building. At the same time, this is also something to be approached with real care and caution. The delegation of AI output validation to other AI models is not a step to be taken lightly or taken for granted.
I hope you enjoy this thought-provoking conversation with Shreya Rajpal of Guardrails AI. Shreya Rajpal, welcome to the Cognitive Revolution.
Shreya Rajpal: 4:30 Yeah. Really excited to be here. Thanks for inviting me.
Nathan Labenz: 4:33 My pleasure. So this is the first time that I've invited a guest basically as soon as I got off the call recording a previous episode. It was Matt Welsh, CEO of Fixie.ai, who mentioned your new project, Guardrails AI. And I was immediately, okay, I have to learn everything I can about this. So I'm really excited to dive into it with you. I guess let's start with what made you say, a couple of months ago, I need to build a system to help people keep their language models on the rails.
Shreya Rajpal: 5:09 Yeah, yeah. I was really solving my own pain points and my own problems. At the end of last year, I'd kind of been doing some tinkering on my own where I was building some applications. It was nothing very exciting, a lot of what you're seeing on Twitter about chatting over proprietary documents, etcetera. I was building that and I realized, yeah, this is pretty cool, I can see the potential with this. But even as I'm testing it out as a developer, I can tell that it doesn't reliably get me the desired experience that I want to achieve. And so I think it was this big problem where these language models are really potent and they're really powerful, but they're also inherently very stochastic and very hard to control. What makes this very interesting, and how this is different from traditional machine learning, is that as a developer, you haven't really trained the model, so you can't just throw more data at it and make it work really well for your use case. And then separately, the only knob you really have as a developer is, here's this prompt. And if you wanted it to maybe do something or not do something, how developers typically deal with that is just adding a lot of verbiage in the prompt and maybe a lot of exclamation marks, etcetera, to make it listen to you, right? And that just seems woefully inadequate. And so Guardrails is this idea of a specification framework where, as a developer, you know what the right output for an LLM looks like and you're able to decompose and deconstruct that and individually validate and verify each component of that output. And then if any of those components fails, if any of those quality criteria that you impose on it fails, then it gives you a set of tools to address that in a very extensible manner. So I was building this and I was, yeah, I know what responses I want a user to be able to get from this thing that I'm building, and how do I ensure that I'm always able to do that for a wide variety of scenarios? And so that was kind of some of the inspiration. I spent some years in autonomous systems and self-driving, and it's a similar problem there as well, where you have this really powerful deep learning based perception model that often feeds into a more rule-based decision-making system. How do you essentially ensure that the interface between that stochastic system and that deterministic system is robust and not brittle whenever the perception system maybe doesn't do as well? And so the idea was inspired by some techniques you'd kind of see there, but built for language models and built in a very extensible way so that it's not very domain specific. That was some of the inspiration.
Nathan Labenz: 7:57 As much as possible, I love to get super concrete on these things. This is clearly a pain point that a lot of people have. The project has gone on its own little rocket ship ride of GitHub stars, with 1,200 as of last check at the time of this recording. But probably a lot of people listening could also use a little bit more of a concrete example of, okay, what kind of thing are you looking for and how is it failing? Can you tell us maybe a couple of those, and then how does Guardrails come in and save the day in those instances?
Shreya Rajpal: 8:35 Yeah, yeah, absolutely. I love that question. I love digging deep into the details, so I'm happy to do that. I think when I was prototyping Guardrails, my favorite kind of prototype example was that I have this Chase credit card agreement, which was my own credit card agreement, and I want to extract the key terms, etcetera, from that credit card agreement and get a nice JSON out of it. As a user, I know that, okay, if I'm extracting something like an interest rate, it must always be a number, it must maybe have a percentage sign in it, and what is a reasonable range for that? If I'm maybe extracting the name of a specific fee and I want this name to be presented somewhere, I know that the name should be very concise, it should be very robust, etcetera. So the common failure point there is that if I want to extract this information and then it needs to be maybe added to some downstream data sink, it's hard to do that reliably, consistently from an LLM, because an LLM doesn't behave reproducibly, essentially. So in this structured data extraction setting, I could essentially enforce constraints on what I want each extracted entity to look like. So for example, the interest rate must be a number, must be within this range. There must be a description with each interest rate, and the description maybe needs to be some length or needs to be relevant to whatever the entity that it is coming along with. So all of those constraints are what I want to impose on this JSON structure. So that was the use case that I was prototyping it with. I think since then, I have a ton of examples on my documentation, but one of my favorite ones is Text-to-SQL. So Text-to-SQL is a wildly different domain, but a lot of the same ideas apply, which is that you want correctness from any generated SQL query. The idea is that as a user, you want to be able to ask natural language queries over your data and get a SQL query that you can maybe execute, right? So a lot of the ideas are the same, which is: what are the constraints you want to be able to impose a priori on that SQL query so that it actually works for my database, for the environment that I want to execute this query in? Constraints, right? You may not want to return results from any specific tables, or some tables might be private. If you want to say, as a customer, you never want to be able to query those tables, you can filter those out. You can add things like only support these specific SQL predicates. If there are any drop predicates or maybe update or insert predicates, you want to be able to filter those out. So the idea is that as you're setting up this Text-to-SQL task, you can add all of those constraints. So how Guardrails solves this problem is it takes your database schema and sets up a SQL sandbox, so for essentially any SQL query that is generated, it's executed in that sandbox to make sure that it's executable. And if it's not executable, you take all of those errors for why the SQL query fails to execute, wrap those errors up, and send them back to the large language model to correct itself and get something that actually works for your specific database. And then you can add other constraints and other restrictions on top of that, like filtering out specific tables, filtering out specific predicates, etcetera.
So I think it's just this idea of: here's a question, here's a task I want this large language model to solve, but as a developer, I have some domain expertise for what correctness means to me in this task and how much I care about that correctness. If it's incorrect, this query is totally useless to me; or if it's incorrect, I just want to know about it and maybe handle it post hoc. I think that is the main idea. Guardrails takes that and allows you to basically implement that as a developer. Yeah. So hopefully that was a little more grounded in terms of specific examples.
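To ground the extraction example above, here is a minimal sketch of the kind of per-field checks Shreya describes. The schema, field names, ranges, and the validate_terms helper are all hypothetical illustrations rather than the Guardrails library's actual API; the point is simply that each extracted entity gets its own validator and failures become readable error messages that can later be sent back to the model.

```python
# Minimal sketch of per-field validation for structured extraction.
# The schema, field names, and ranges are hypothetical, chosen to mirror
# the credit-card-agreement example; this is not the Guardrails API.

from typing import Any

def validate_terms(extracted: dict[str, Any]) -> list[str]:
    """Check each extracted field against domain constraints and
    return human-readable error messages for anything that fails."""
    errors: list[str] = []

    rate = extracted.get("interest_rate")
    if not isinstance(rate, (int, float)):
        errors.append(f"interest_rate must be a number, got {type(rate).__name__}")
    elif not 0 <= rate <= 50:
        errors.append(f"interest_rate {rate} is outside the plausible 0-50% range")

    fee_name = extracted.get("fee_name", "")
    if not fee_name:
        errors.append("fee_name is missing")
    elif len(fee_name.split()) > 6:
        errors.append("fee_name should be concise (6 words or fewer)")

    description = extracted.get("description", "")
    if len(description) < 10:
        errors.append("description is missing or too short to be useful")

    return errors

if __name__ == "__main__":
    llm_output = {"interest_rate": "24.99%",
                  "fee_name": "Annual Membership Fee",
                  "description": "Charged once per year."}
    print(validate_terms(llm_output))
    # -> ["interest_rate must be a number, got str"]
```

In a real pipeline, these error strings would feed the re-asking step discussed later in the conversation rather than just being printed.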
Nathan Labenz: 12:31 Yeah, it's fascinating. One of the things I want to explore the most in this conversation with you is that seemingly we're opening up a spectrum or a dimension, if you will, where we can go from, on the one end, explicit errors, like the errors that we are used to as developers in code. And that could be starting with a syntax error up to more meaningful errors, but ones that could still be found through kind of traditional software messages, like this variable doesn't exist or all sorts of things like that. But then you've got kind of a whole domain of correctness which code has never really accessed before. And that is: what are you trying to do? Does this appear to be a reasonable approach or output? That's a much more sort of semantic or, dare I say, intelligent point of view. But I find myself a little confused or a little bit lost in that space. And I see that you're kind of covering it in a really interesting mix of ways, right? In the library, there's a mix of validators. Maybe kind of walk us through, tell us how you think about that, I guess. And again, maybe some examples of different validators that sit in different parts of that space would be helpful to help people understand what I'm talking about if it's not already clear.
Shreya Rajpal: 14:05 Yeah, yeah. I think that was kind of one of the very exciting things about the library as I was building this out, which is that the general framework works even for things you can't maybe use an assert statement to verify. It's really extensible as a framework where you can have a mix of large language models and maybe some rule-based heuristics, some programmatic checking, as well as more traditional machine learning models that are maybe very high precision classifiers, etcetera. And you can ensemble all of these techniques together to get something that is greater than the sum of its parts and much more robust and much more reliable compared to just using a pure large language model. As an example of that, last week I added a bunch of guardrails for summarization. So if you're summarizing multiple documents and maybe generating an aggregated summary from that, there's a bunch of requirements that you may have in order to ensure that summary is accurate, that summary is concise, it's not redundant, etcetera. And so how Guardrails kind of thinks about this general problem of correctness, when assert statements aren't sufficient, is to really break it down into either a smaller ML task or into smaller verifiable heuristics, etcetera, and try to get an aggregate assessment of how well this works.
Ads: 15:31 Hey. We'll continue our interview in a moment after a word from our sponsors. I want to tell you about my new interview show, Upstream. Upstream is where I go deeper with some of the world's most interesting thinkers to map the constellation of ideas that matter. On the first season of Upstream, you'll hear from Marc Andreessen, David Sacks, Balaji, Ezra Klein, Joe Lonsdale, and more. Make sure to subscribe and check out the first episode with a16z's Marc Andreessen. The link is in the description.
Shreya Rajpal: 16:00 So in the case of summarization, what that looks like is, if you essentially want to figure out whether the summary is faithful to the original source text, how you can do it is basically look at each sentence in the summary and then do something like similarity matching based on which parts, which passages or sentences, it's most similar to in the source text, which allows you to do much more fine-grained attribution and figure out where the sentence that it generated is coming from. The other cool thing that you can do on top of this is that you can assign thresholds. So as a developer, you can do some experimentation, figure out what your appetite is for how varied you want these sentences to be from the original source text, and then set a threshold. And then any sentence that is below the threshold in terms of similarity score, you can essentially filter out and not include in the summary. So I think it's just this way of thinking about text outputs, which may seem like a single unit, and breaking them down into smaller chunks and independently trying to verify those. Other techniques you have in summarization specifically: I want the summary to be concise, so make sure there's no redundant information. So within the generated summary, make sure that each sentence is diverse enough from the other sentences, and if that's not the case, maybe filter out sentences that are too similar to each other. So yeah, I think the cool thing about LLMs is that we are looking at a lot of the first-order benefits, where you can use them to do a bunch of really insane tasks. But I think the second-order benefit is that you can really use these models as verification systems in and of themselves. You can use a large language model within a validator to verify and validate that whatever you're getting is correct. Or, depending on your access to data and your latency requirements, you can maybe train a smaller ML model that does this for you. But truly bring together a bunch of these different verification strategies and then get something that's more robust. So that's the philosophy behind how Guardrails tries to add guarantees around what traditionally seem like harder ML problems to verify.
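A rough sketch of the summary-faithfulness check described above: split the summary into sentences, score each one against the most similar source passage, and drop anything below a developer-chosen threshold. The embed callable is a placeholder for whatever sentence-embedding model you use, and the 0.7 threshold is an arbitrary illustrative value, not a recommendation.

```python
# Sketch of the faithfulness check: each summary sentence is compared to
# the most similar source passage and dropped if its best similarity
# score falls below a developer-chosen threshold.

import numpy as np
from typing import Callable, Sequence

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def filter_unfaithful_sentences(
    summary_sentences: Sequence[str],
    source_passages: Sequence[str],
    embed: Callable[[str], np.ndarray],  # any sentence-embedding model
    threshold: float = 0.7,              # illustrative, tune per use case
) -> list[str]:
    """Keep only summary sentences whose best match in the source
    clears the similarity threshold."""
    source_vecs = [embed(p) for p in source_passages]
    kept = []
    for sent in summary_sentences:
        v = embed(sent)
        best = max(cosine(v, sv) for sv in source_vecs)
        if best >= threshold:
            kept.append(sent)  # attributable to some source passage
        # else: drop (or flag) the sentence as potentially unsupported
    return kept
```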
Nathan Labenz: 18:29 Yeah, that's really interesting. I'm mapping this onto my own use cases here in real time. So I'm wearing my Waymark swag today. Waymark is in the video creation space. And we have, it's a multimodal problem, right? Where we ultimately take in some minimal information about a business and then ask the user to tell us what they want to create. And then on the other end, provide a ready-to-watch video. We kind of do that with an ensemble of different models working together. But the core one is the language model that writes the script and kind of gives the direction for what the visual assets should look like and all that sort of stuff. And I mean, for one thing, boy, we've come so far in a year. It was just a year and a half ago, basically, that we got the very first fine-tuned GPT-3 to do the task at all. It wasn't good, but it would at least respect the nature of the inputs and outputs. And now today, GPT-4 can basically do it zero shot and respect the outputs. And then 3.5 Turbo can kind of do it, not really reliably, but then I think, geez, it is 20 times cheaper. So I guess I'm interested in kind of what you see the value drivers being for this sort of thing. To some degree, maybe there's no other way to do it. But as I'm kind of running through it in my head, on some of these things GPT-4 is pretty reliable at this point. So people might want to do it for cost savings, or they might want to do it for, I don't know, I guess a lot of different reasons, but cost savings and latency are actually big ones. I would see benefits to that potentially if I could move to a little bit less reliable model, but still know that my stuff is gonna render in the right way for our users. So what are you seeing in terms of the value drivers from your community?
Shreya Rajpal: 20:39 Yeah, yeah, I think that's a really great question. So I think there's two aspects of it. So cost and latency, maybe being able to use open source models or cheaper models with the same level of reliability, I think that is one aspect. But the more interesting aspect for me is what task you're truly using this model for. So with Waymark, there are a lot of use cases where you're using these models as maybe writing assistants or to help you. One of the products I enjoy using is Notion AI Assist, I think is what it's called, where you draft a little bit and it helps you maybe shape up an outline. So I think for those use cases, the creativity of a large language model is actually really good and it's a desirable trait. But I think at the same time, there's this whole other space of use cases that rely on the world model that is ingrained in these large language models, and it uses these large language models as a software abstraction to do more general purpose reasoning. So I think there's a wide variety of those use cases that we're seeing. So for example, one of them is to use an LLM basically as an AI receptionist, where you're a small business owner and instead of needing to hire a receptionist, you can use a large language model, and anytime you get calls, it figures out scheduling, figures out who's available. So that's a very powerful capability of the model, where creativity is maybe not as good a thing, right? You want it to maybe stick to some desired workflow of what it's able to ask. You want to make sure it maybe doesn't ask for private information if a customer is calling. So I think where Guardrails is most useful is within those constraints, where you use the LLM not just as a text generator, but as a software abstraction, really. And when you're using it in that capacity, that is when reliability becomes most useful. I think over the weekend, I'd shared something that a community contributor had built using Guardrails, which was this GitHub action called AutoPR. And what AutoPR does is it takes a GitHub issue and then automatically creates a pull request from that GitHub issue for your code base. And that is one of those use cases that has a bunch of really strict constraints. These files must exist, the diffs that are generated must be valid for those files. And those constraints are pretty hard to enforce without having some validation framework such as Guardrails on top of it. So I think for a lot of those use cases, creativity is not as nice a thing and you want more reliability. And I think it's still such a powerful use case of these models that we would be kind of underutilizing their capability if we didn't build software like that. Yeah, it's a mix of both. I think even with a lot of the creative use cases, I've kind of found that there are still a bunch of constraints that people implicitly have, where right now the way to handle them is via prompt tuning or prompt iteration. But truly being able to encode these constraints and, instead of manually having to do this, have a validation that runs and only step in when you need to is also a pretty nice workflow, where, for example, you might not want any profanity in scripts that are generated. Or, if you're creating video content for someone, you might not want to mention peer products or competitors. And so I think those kinds of constraints are also useful even with free-form text. So yeah, it's a mix of both.
Nathan Labenz: 24:19 It sounds like, though, you are most excited about something that could not be achieved in traditional code. And I still kind of am really trying to find that line, or I guess there's maybe just a lot of overlap. A lot of times when I get to these kind of puzzling moments as I study different aspects of AI, I end up kind of finding that it's sort of both in the end, and there's probably a lot of truth to that here too. I'm thinking, okay, so there's the GitHub automatic pull request, AutoPR it was called, right? I saw this, it is cool, and super technical. To some degree, those responses could be validated presumably by existing libraries. I'm sure the Git package itself has some way to sort of say, you ain't got your shit together, so this is not going to work. And I think that's kind of what a lot of people are naively doing, just kind of implementing that stuff on a case by case basis. At Waymark, there is no standard. Nobody else has our video standard. We totally define it and own it. So it was up to us to figure out how do we represent that in text and then what validation comes back from that. So one thing we could maybe have tried to do is, we did this before you'd launched the project, but I kind of wonder, if you were advising us, and if we were maybe a little earlier, how would you think about what points of validation we ought to do through Guardrails? Should we be thinking about structure, or should we be thinking, does this copy satisfy the user's prompt? Or I guess there's probably a lot of room in between, right? Those kind of stake out the most rigid versus the most semantic desires or requirements from the model. It seems like you're more interested on some level in the semantic side, but that there is this kind of fundamental interface with computing where it gets very syntactic as opposed to semantic as well?
Shreya Rajpal: 26:26 I think it's a good question. Before I answer, I wanna preface this by saying that generating something, getting outputs, getting feedback from the end user about, is this good, does this meet your original criteria, doing a more qualitative assessment, taking that feedback and going back to the drawing board, I think that's a very valid way of doing things. I think for a lot of domains, getting that human feedback and getting that human input and sign-off is essential before you can truly think that this output is correct and is valid for whatever use case you have. But I think in the Guardrails world, there's basically this idea that maybe some parts of that human feedback can be done by a combination of traditional ML, heuristics, and maybe more large language models in the loop. And if you are able to take some of that, encode whatever that qualitative criteria is, codify it into something that is more specific, what you're then able to do is generate specific failure or error messages that help the model correct its output. In terms of how that would apply, I obviously don't know as much about Waymark, but what that might be helpful in is maybe reducing the number of back and forths you have to do with your end customer, because you're just able to take some of their constraints and run those programmatically. If they're wrong, automatically create new prompts that tell the large language model why previous outputs are incorrect and get it to correct itself. So I think this is one of the net new capabilities that we have with large language models that didn't exist previously, which is the ability to get them to self-heal or self-correct if you give them enough context. And a lot of the core functionality of Guardrails really harnesses that. So how it functions under the hood is it's really good at figuring out what the relevant context is, packaging it up nicely, automatically creating a new prompt for you, getting a new response from the large language model, and merging that new response with the old response that you had, because it's very efficient: it only re-asks things that are wrong, not your whole previous output. So it merges all of that together, and then that is your corrected, validated response. So I think that is the world where Guardrails is most useful. With that said, it is very domain specific. There are domains where, let's say, any text that is generated must be funny. Now that is something that is next to impossible to validate. I can't imagine writing a validator that assigns a scoring function to humor. That's just very hard to do, and so you can't do that, and you need to have a human. Maybe there are some other domains which have very high stakes and very high cost to getting something wrong, where even if you can do some pre-scoring, the actual human confirmation that this output is correct is essential before you can do another iteration with the large language model. So there are those domains where this re-asking strategy doesn't work. But I do think just this idea of taking what an aligned output means in your use case and trying to codify some of that is useful, both in terms of prompting the large language model, since it allows Guardrails to construct prompts that are more effective at getting you what you want, and in terms of doing post hoc validation.
And then there's this loop of, I wanna systematically, programmatically handle failures as and when they arise. Maybe that involves re-asking, maybe that involves filtering any incorrect output. I think this is a very powerful framework for requiring less human supervision and taking away a lot of the pain of maybe going into ChatGPT and writing, okay, this doesn't work for this reason, and then trying to write something else. That's the hypothesis behind Guardrails.
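Here is an illustrative version of the re-ask loop in the spirit of what Shreya describes: validate the output, build a new prompt containing only the failed fields plus their error messages, re-ask the model, and merge the corrected fields back into the previous output. The call_llm and validate callables and the prompt wording are stand-ins, not Guardrails' actual internals.

```python
# Illustrative re-ask loop: validate, re-prompt only the failed fields
# with their error messages, and merge corrections into the prior output.

import json
from typing import Any, Callable

def reask_until_valid(
    call_llm: Callable[[str], dict[str, Any]],                # stand-in for your LLM call
    initial_prompt: str,
    validate: Callable[[dict[str, Any]], dict[str, str]],     # returns field -> error message
    max_reasks: int = 2,
) -> dict[str, Any]:
    output = call_llm(initial_prompt)
    for _ in range(max_reasks):
        errors = validate(output)
        if not errors:
            break
        failed = {k: output.get(k) for k in errors}
        reask_prompt = (
            "The following fields were invalid:\n"
            + "\n".join(f"- {k}: {msg}" for k, msg in errors.items())
            + "\nPrevious values:\n" + json.dumps(failed, indent=2)
            + "\nReturn corrected JSON for ONLY these fields."
        )
        corrections = call_llm(reask_prompt)
        output = {**output, **corrections}   # merge fixes into the prior output
    return output
```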
Nathan Labenz: 30:51 Maybe it's just because we all have agents on the brain, but as I'm listening to you describe that, I'm really going to this agent moment that we're in. When you dig into these agent systems, you might think of them more classically as multi-agent systems, but in many cases it's the same language model playing the role of different agents. We're seeing all these examples where it's a simulated town where GPT-4 plays all of the people, or a research agent where there's a planner and a coder and a retriever that all kind of work together and have their own prompts and are all kind of scaffolded together. And then of course, these things fail a lot, because there's some probability of being wrong at any given point in the chain, and, in the naive implementation, you're only as strong as the weakest link in the chain. So in a sense, I guess what I'm learning here is this Guardrails paradigm is connective tissue between the different roles in a multi-agent system. And I should probably accelerate how quickly I think that these agents are gonna start to work.
Shreya Rajpal: 32:15 I think agents are very exciting. I've been thinking about them a bunch and trying to think about how you make them more effective, more reliable. I think the interesting thing about agents versus how Guardrails works is that with Guardrails, the main problem being solved is: I want to add constraints on this output so that it works for my use case, for what I want it to work on. It essentially gives developers a lot of agency to think about the specific problems that they're solving and what correctness means to them. Contrast that with agent frameworks, where a lot of the goals and tasks of agents are configured autonomously by a large language model itself. So you're in this interesting setting with regards to the involvement of a human or a developer, your ability to enter something like an agent framework and add guarantees. With the agent frameworks that exist today, you typically don't have access to that fine-grained level of task execution or goal setting, because you're not the person who's configuring the agents themselves. So I think that is the gap between how agents operate today versus what humans would ideally like to have. Ideally, if you're a person who wants to employ a bunch of these agents to maybe do research for you, you want to be able to say, this is the set of allowable things that you're able to look at. But you also want that to be configured dynamically based on where the agents are operating. So that is the context. And if I were to summarize what that means for agents, I do think constraints are essential and correctness specs are essential. It's the only way to think about how you assess what these large language models are doing. You can stream them and evaluate them at each step to make sure that they're not going off the rails. But how do you do that dynamically when you're not the person who's creating these agents, setting their goals, etc.? I think that is the key problem to solve there. So yeah, it's a problem I'm very excited about, and I think it's a problem that will need to get solved before these agents are employable, before you can truly use them, outside of just seeing how exciting they are. I guess for agents, I'm always very curious to see people's use cases. So for Waymark specifically, where once again you're in this domain where there are constraints and there's a finite scope of what you want the large language model to do, have you looked at some of these agents and thought about what you would want to add, what part of the stack could be aided by agents?
Nathan Labenz: 35:06 Yeah, it's probably not such a great fit, I don't think, for the Waymark product experience, because we do have a lot of structure. I'm still working out this framework, but I talked about this with Matt actually, and this is part of what led to him mentioning your project, so we kind of bring it full circle. I don't know if this is a spectrum or a binary or multiple categories or what exactly, but in terms of how we interact with AI systems, there seems to be a real-time copilot mode, where you are doing stuff as a human and the thing is there to kind of guide you, shepherd you, whatever. And then there's things where you're kind of ultimately delegating more, because you really don't want to do it. You don't want to be the person in the driver's seat, or the entity in the driver's seat. You want to put the AI in the driver's seat, let it do it, and then look at its work once it's done, or have some other sort of way to circle that back into your life. And those workflows right now are largely done with integrations of various kinds, which could be code, no code, Zapier, or what have you. With Waymark, it's code. And it's a pretty guided experience where you are delegating the task of writing a script, choosing all the assets for your video, etc. And then what you get to do is watch the output. So there's so much structure there that it doesn't feel like we really need an agent to come in and mix it up too much. But I do see the agent as sort of the bridge between these two modes, where I'm in real time and I sort of wanna send something off. So if I'm in ChatGPT Plus today, I can use a plugin perhaps to look for flights. But what I really might want is for the thing to go book me the flight for whatever date, subject to my preferences. And ideally it would kind of figure out all the downstream mess of that. So where I think this is actually really relevant for me: I'm also working with a company called Athena, which is in the executive assistant space. I've also talked about this with Matt. And they have those kinds of things all the time, where a human today is responsible for executing a lot of web tasks for a client. And there's kind of different aspects of the cognitive work there. One is understanding what's going on, translating the language from the client. You get a request, right? What does it mean? Understanding what it means. And then the other part is being able to actually hit the right buttons to make it happen. And ideally we'd delegate that whole thing, but the agents, that's where, in our kind of testing, along with everybody else's, we've found that yeah, we may have AI that can understand the request, parse it effectively, ask the right follow-up questions, or at least suggest some pretty good follow-up questions, demonstrate a really robust understanding of what the human wants. But we're falling down very much still on how do you actually hit the right buttons, check out, I mean, God forbid, have to do a payment process or log into something. Two-factor authentication does still work as a deterrent to AI login, for the time being at least.
Shreya Rajpal: 38:50 Yeah, yeah. I think that's kind of been my experience and my exploration as well, which is that there's this interesting wedge where agents can be useful. But in order for that wedge to succeed, you need this notion of grounding, of being able to figure out, okay, here's how it understands what I'm saying. But at each step of that execution, you need validation. So I think this idea or the problem of "book me a flight" is very interesting, because it's book me a flight given my schedule and given my budget and where I'm trying to go, right? That translates into maybe some constraints that are then grounded based on your calendar or based on where you're trying to go and what flights are available. And so I think validation at each decision that the large language model makes and each action that it takes becomes more important for these things to succeed. But at the same time, having worked in the self-driving world a little bit, part of it is, are they able to do it? And part of it is, can a human trust it? There's this trust deficit, right? Where even if these agents were perfect, there'd be this trust deficit in how much people are comfortable delegating. And so even in terms of building that trust, you need this verification system, essentially, that makes sure that each step of the way you're able to have control and have some oversight into how they execute. So I think those frameworks will be kind of essential before we're able to see their adoption outside of demo use cases.
Nathan Labenz: 40:28 Yeah, it's funny that you say that. I would actually take the other side of the bet there when it comes to the human behavior. I mean, if I understand you correctly, you're saying people won't want to trust these systems unless there are good guardrails in place. I think of you as saving people from themselves, because I think people are going to be much quicker to just kind of go ahead and be, yeah, it seems like it works. What's the worst that could happen? And I think we may find out, especially in a future world, that the worst could be potentially quite bad, but I do kind of expect that behavior. I just recorded an episode with a medical school professor at Harvard, Zak Kohane, who has just written this book, The AI Revolution in Medicine. And this is exactly, I think, the kind of system that he's looking to figure out how to implement into clinical practice. Something to kind of help. For now, the human is in the loop. That's the official recommendation. But even then, it's just so easy to get lazy. It's so easy to kind of be overly trusting, especially when the models are getting so good. It's fascinating that you can get better performance just by kind of pointing the model at itself. One of his big findings, and a recurrent theme of ours, is that GPT-4 is better at evaluating text than it is at generating it. And he's starting to develop some of these self-critique things totally on the fly, in the context of just exploration in a clinical setting.
Shreya Rajpal: 42:08 Yeah, I think this is part of what Guardrails, the validation system, is also based on, which is almost that Trust But Verify idea. You get GPT to generate something, and even if you are doing large language model based evaluation, maybe have a separate step that does evaluation separately so that you have an additional layer of security. And ideally, and this is back from machine learning, the more diverse evaluators you have, the stronger the guarantees are gonna be. So ensembling is a technique that I'm pretty bullish on in this space. It's what's gonna get us that confidence. But AI for medicine or medical assist is so fascinating because it reminds me, once again, of self-driving, where with the autonomous vehicles that are out there today, maybe Tesla FSD, etc., the idea is, okay, you're still supposed to be very alert. You're not supposed to take maybe your hands off the wheel or something. I don't have a Tesla, so I don't know. But even within that, there's this expectation that a human should always be aware and present, etc. But as humans, it's so easy to get lazy. And so one of the things I'm really excited about in this space is good interface design, so that we are able to get human involvement when it's most needed, and not, if you're always notifying them, hey, pay attention, pay attention, maybe that ceases to have an effect. So it's this idea of, okay, how do we build this balance between when human involvement is necessary versus when we can maybe offload some of that, do that more programmatically or more with code. I think figuring out that balance, and figuring out the interface to surface that balance or that division of responsibilities to the human, is very exciting, and it's very exciting to see how that evolves.
Nathan Labenz: 44:00 I don't own a Tesla either, by the way, but a neighbor of mine does and has the full self-driving package. So knowing that I'm as obsessed with AI as I am, he was gracious enough to take me for a ride. And first of all, it works way better than people commonly realize, I think. I got into the car in front of my house. He put his finger on the screen, going, go here, and hit the drive button, and the car drove. There was not a lot of fuss between just map, go, and you're riding. But when you talk about the reminders, that's something I think they've really put a lot of engineering into as well. There are three progressions. I haven't spent a ton of time with it, but I think it's three levels. First, there's a little visual indicator that you haven't done anything to the wheel in a while. There's a camera also that watches you from the rear view mirror in some versions. But then after that, it goes to a little sound. And then after that, it goes to a warning, and it's, we're going to pull over if you don't keep your eyes on the road. So they're pretty far along in that. And I thought that was a really remarkable experience, because it is a very delicate balance. So easy to tune out. Obviously they have the hammer there eventually of pulling over. And eventually they also will kick you out of the program if you pull over too many times, he said. If you're truly sleeping at the wheel, they'll retract your access to FSD ultimately. But it is a fine balance, because you can start to tune these things out. I mean, how many warnings do we tune out? Crazy. So the next big thing that's kind of jumping to mind is, okay, you've got this developer spec of what is right to me. But the pattern seems to be, very quickly, well, why can't an AI just do that? So I start to then think, is there a version of this that's a plugin? You've created a spec, and I'm trying to envision what it is like to use a computer a year from now as kind of the tooling and the plumbing all matures. One good candidate would seem to be a chat interface that you can access the world through and delegate stuff to little agents that go do stuff and report back. It sure seems like a GPT-4 central process could use the Guardrails spec or something similar to kind of insulate itself from problems as it delegates things off to the side. I mean, first of all, do you see that as realistic? And do you have a point of view as to kind of how the computing experience might evolve in light of that?
Shreya Rajpal: 46:42 That's a very, very good question. I can go in a bunch of directions in my answer, but my core belief in that whole space is that humans will be very essential in the loop, and so it will be very hard to automate all of this away. Primarily because even correctness means very different things to different people in different contexts. One common example is profanity filtering. Profanity filtering is one of the things that most people can get consensus on: you're generating text, make sure it doesn't have profanity. But I've also chatted with people who are building chatbots where the chatbots are interacting with certain audiences and authenticity is essential, right? And so if the chatbot is trying to imitate someone, or be in the likeness of someone who does use profanity, then filtering out that profanity is actually detrimental to the user experience that they're building. I think it is very hard to figure out what those constraints are on a global level. And so I think having humans and domain experts involved, having developers figure out, ground up from what they're building and what they want for their users, what the desired experience even is, those inputs will continue to be very important, and there will need to be some way to configure those inputs or those criteria or just that experience that a user or a developer wants. So I do think that offloading this entirely to the model or to a provider is going to be hard to achieve, just because of that constraint. But in terms of what the programming experience, what the developer experience could look like, I think one way, and this is purely hypothetical, I'm not making bets on this at all, but one way is to really start thinking almost configuration first. There's a configuration system that allows you to tune the outputs that you want to see from these LLMs. And even if the underlying machine learning model stays the same, how that model is validated, corrected, how the outputs of that model are post-processed, that is all something we can configure. And so when people are working with large language models, it's not just text, it's also this configuration that they kind of pass in every time. I'm an engineer, and so all of my empathy goes to engineers whenever I'm thinking of building systems, etcetera. So especially in engineering, this is a pattern that I do think will be very important, where the prompt isn't sufficient by itself, and just choosing the LLM and choosing temperature or something isn't sufficient by itself, but there's this configuration framework of how I want this output to be, and even how I want my input to be processed or formatted, etcetera. So that is one pattern that I can see emerging.
Nathan Labenz: 49:43 Security also jumps out to me as a big driver here, because as I was thinking about where I would use this in all the things that I've done, I'm kind of increasingly seeing GPT-4 just running pretty well and giving me the format that I want. Certainly it has plenty of errors, but it very rarely goes totally off the rails, so to speak. But there are a lot of vectors that are as yet totally undeveloped that seem like they're going to come online and cause a bunch of problems. So having this sort of SLA guarantee layer, validation layer, seems really smart in light of prompt injection, for one thing. As users get more generally sophisticated and kind of savvy in their adversarial attempts, you gain a lot by having something like this implemented ahead of time. Similarly, we've seen some really interesting things even in Bing where a user will change, or not even a user, this is where it starts to get weird, a site owner. We've begun to see what the SEO people are going to do in the AI search era. So the battle between search and SEO, the arms race there, I think is going to be made to look pretty pale compared to what is going to happen now that you can try to trick the language model at runtime with whatever kind of content. Then there's all the model risk too. As awesome as GPT-4 is, my general strategy right now is use it for everything, get the quality to an acceptable level, and then think about maybe taking out costs or taking out latency opportunistically or as necessary or what have you. But that's going to come in due time too. And then people are going to be, well, what about, I heard Alpaca or whatever was just as good, and we can just sort of drop that in there. But now you're in a world where you have no idea what's going on. I mean, your OpenAIs, your Anthropics, they have a certain SLA. And I think one huge misconception that people have is that that SLA is sort of an inherent property of language models, when in fact it's not at all. They've worked really hard to get to the level that they have, and you definitely cannot take for granted that your bootlegged Llama fine-tuned on whatever is going to be anywhere near as safe or friendly to users. I don't know, I'm going on and on, but that seems really important, and all that stuff is just starting to kind of emerge from the mists. I guess, how much does that motivate you? Do you see other things like that? What do you make of all that kind of emergent security stuff that you're helping people get in front of?
Shreya Rajpal: 52:45 Yeah, yeah. I think security is a big operational risk of working with these models, especially prompt injection. I think we all saw, when Bing Chat was released, with Sydney and the New York Times story, how easy it was to manipulate these models in specific ways. How I think about this is, okay, how do you decompose the problem? I truly think that this isn't a problem that will be solved by machine learning specifically. A lot of what OpenAI has managed to achieve is built on this concept of scaling laws, where as you scale the data and scale the models, you just kind of start to see emergent properties. This is just from all of my experience working in machine learning: these are fundamentally stochastic systems, and there's a big long tail of things these stochastic systems can't have guarantees for. You cannot have guarantees, or you cannot have data points, for all of the different, exciting, and weird ways that people are going to use these models. And so as a consequence, it's very hard to add that validation from the model piece itself. And so security becomes an essential thing, because that isn't something that you can leave up to the stochastic system. You essentially need to have more determinism around it. And so my way of thinking about it is to kind of break it down into, okay, the model is stochastic; what can we do around the model that then adds those watertight guarantees? So for prompt injection, for example, the things that are really exciting to me are both on the input and output side. You sandwich the LLM API call with input validation and output validation to essentially ensure that you have multiple layers of security that make sure that your model isn't behaving in ways that you don't want it to. I'm familiar with some of the more exciting new developments in protecting against prompt injection, and a lot of those protections are input validation. But on the output side of things, if you know that there are behaviors that you don't want the large language model to exhibit, it's possible to enforce that as secondary checks in terms of output validation. And for example, if you have known patterns of prompt injection that people tend to follow, you can check for those in input validation; you're almost gatekeeping the queries and interactions that users can have to the set of things that you can support. So it's back to this problem of decomposing the whole problem of security into specific domains, specific applications. And then for those domains, thinking about what is the end goal that I want my users to have, and adding constraints to filter out everything that is outside of or not related to that goal. If we, for example, chatted with a bunch of teams who have been doing this, I think we'd find that's a common design pattern, where you can't simply rely on the LLM. There's a reason, for example, we don't have end-to-end machine learning systems. You kind of have some subcomponent in machine learning, and then some checkpoint that has either a human or a more deterministic component verifying or validating the output of the machine learning system, and then another downstream thing that may be machine learning.
But decomposing it and adding different layers of security on either end of the ML system is the pattern that most people, when they try to production those things, happen.
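To make that sandwich pattern concrete, here is a minimal sketch of wrapping an LLM call with input and output checks. This is not Guardrails' actual API; the function names, the injection heuristics, and the `call_llm` stub are all illustrative assumptions.

```python
import re

# Hypothetical stub standing in for whatever LLM client you use.
def call_llm(prompt: str) -> str:
    raise NotImplementedError("wire up your LLM provider here")

# Input-side check: gatekeep queries against known prompt-injection patterns.
# These patterns are illustrative, not an exhaustive or recommended list.
INJECTION_PATTERNS = [
    r"ignore (all )?(previous|prior) instructions",
    r"you are now\b",
    r"reveal your system prompt",
]

def validate_input(user_query: str) -> None:
    for pattern in INJECTION_PATTERNS:
        if re.search(pattern, user_query, flags=re.IGNORECASE):
            raise ValueError(f"query rejected: matched injection pattern {pattern!r}")

# Output-side check: reject behaviors you never want to ship to the user.
BANNED_OUTPUT_PATTERNS = [r"(?i)api[_ ]key", r"(?i)here is the hidden system prompt"]

def validate_output(llm_output: str) -> None:
    for pattern in BANNED_OUTPUT_PATTERNS:
        if re.search(pattern, llm_output):
            raise ValueError(f"output rejected: matched banned pattern {pattern!r}")

def answer(user_query: str) -> str:
    validate_input(user_query)           # layer 1: before the stochastic system
    llm_output = call_llm(user_query)    # the stochastic system itself
    validate_output(llm_output)          # layer 2: after the stochastic system
    return llm_output
```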
Nathan Labenz: 56:39 Security in depth, defense in depth, is definitely, I think, going to become really important. Another version of this: models talking directly to other models in vector form seems like it's going to become a huge trend. We had the authors of BLIP-2 on the show a while back, and I think of that as such an emblematic foreshadowing of what's to come. They're able to take a frozen vision model and a frozen language model, and very quickly train a connector model that essentially converts the encoding of the image into something, some literally unspeakable thing, in the language embedding space, in a way that then allows you to have a dialogue with the language model about the image.
Shreya Rajpal: 57:36 I actually love that. Sorry to interrupt, but I love that example, because this is literally research that I did in my master's. I worked with some collaborators and we have a paper that does these kinds of joint embeddings in language and vision space. I think we did this in the fashion domain, where you have some fashion image embeddings and then some text associated with them. How do you project them into a single projection space so that, essentially using language, you're able to figure out which images are similar, etcetera? So it's a body of work that is pretty interesting and opens up so many cool capabilities.
Nathan Labenz: 58:18 The fact that that seems to work across modalities, and I'm still educating myself on this, but I see so many examples now of even just simple linear maps from one space to another, where people are able to bridge these different encoded or latent spaces, embedding spaces, whatever you want to call them. I think that's going to be a huge trend that will bring a lot better performance out of a lot of systems. Because why would you go through the natural language bottleneck if you have a much richer representation of an image, or for that matter a medical scan or a sound or whatever? If you have the sound of a bird in an audio file, you're not going to project that down into language long term as "a sound of birds is heard" or whatever, and then feed that into the language model. You're going to figure out how to represent it in the language model's space, but in a way that is truly unspeakable. I just see so much force going in that direction. But then the obvious worry is, well, now you've opened yourself up to God knows what. The space of possibility there is so vast, it's totally untestable, and it's incomprehensible on the input side. So what can be done about it? I think, again, this is where the answer is: start validating your outputs, people. This is really important, and it's only going to get more important. One thing people are probably underestimating is that they can create a system today that behaves pretty predictably, and the world underneath it is going to change in such a way that it may start to expose vulnerabilities over time. In some sense, software always works that way; we've seen Windows patched way downstream of launch. But this seems a little bit different.
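As a toy illustration of the "simple linear map" idea, here is a sketch that fits a least-squares projection from one embedding space into another using paired examples. The dimensions and synthetic data are arbitrary assumptions; real connector models such as the one in BLIP-2 are learned networks, not a closed-form fit.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins: 1,000 paired items with a 512-d "image" embedding
# and a 768-d "text" embedding (dimensions chosen arbitrarily).
n, d_img, d_txt = 1000, 512, 768
true_map = rng.normal(size=(d_img, d_txt))
img_embeddings = rng.normal(size=(n, d_img))
txt_embeddings = img_embeddings @ true_map + 0.01 * rng.normal(size=(n, d_txt))

# Fit W minimizing ||img @ W - txt||^2 over the paired set.
W, *_ = np.linalg.lstsq(img_embeddings, txt_embeddings, rcond=None)

# Project one image embedding into the text space and check that its
# nearest neighbor among the text embeddings is its own pair.
query = img_embeddings[0] @ W
dists = np.linalg.norm(txt_embeddings - query, axis=1)
print("nearest text embedding index:", dists.argmin())  # expected: 0
```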
Shreya Rajpal: 1:00:12 Yeah, I think it's a very interesting point. There's this fundamental idea that input from more sensors, if you're able to have it and encode it, just leads to better performance. It also goes back to this idea of grounding that I was talking about earlier. You have some ML system that projects an input into some space. How do you make sure that that projection is correct or not? If you have grounding, where you have another sensory input that is also supposed to be projected into the same space, you can use that as a self-correction or self-verification system. We saw this while I was at Drive.ai, which was a self-driving startup that I worked at. The performance with just LiDAR, versus having cameras as part of your inputs as well in addition to LiDAR: with both, you just have a better understanding of the state that the car is in, and you're able to make better decisions overall. So in my opinion it's a net positive thing, because it allows us to enforce these checks across multiple dimensions. It allows us to figure out where things might break down and then add checks and systems there. So yeah, I'm excited to see a world where we can have all of these vectors embedded in the same space and there's a lot of interoperability between them.
Nathan Labenz: 1:01:39 So essentially you're describing a validation step that checks the coherence of the semantic interpretations of different input signals. I'm trying to think of an example of that I've seen in the wild; I guess you're pointing to one in the self-driving car area. Maybe I want to throw a few other safety and control paradigms at you and have you react to them, not in a better-or-worse sort of way. Again, with defense in depth, there's probably a place for all of these different things. But comparing and contrasting some of these other approaches to what you're building can help enlighten people as to how you're thinking about it and what kinds of use cases your project is best suited for. So I've got a handful. One is, most people listening to this will be most familiar with OpenAI, obviously. So their moderation endpoint: briefly, you can take an output, bump it up against the moderation endpoint, and it gives you a flag of, hey, you might have one of these finite number of problematic content types. And then I believe you can still do with that what you will. I don't know that they even enforce any particular action downstream of the moderation endpoint, but essentially you're classifying stuff into one of these categories. That's obviously pretty simple. So where do you think that falls short?
Shreya Rajpal: 1:03:14 I think it's great, but it's very finite in scope. Take something like that and expand it out to any use case that you want to verify or validate, as long as it's at least programmatically verifiable or verifiable with a machine learning model: flag specific things within the output that may be problematic for that use case, and then also allow you to configure how you want all of those invalid outputs to be dealt with. That is how I think about Guardrails. Moderation, specifically profanity filtering, is one of the validators out of a number of validators within the library. You take some LLM output, you give it to that profanity-filtering validator, and it'll tell you whether there is profanity in your LLM output. Depending on how you configure that validator, it'll also correct that for you, for example by regenerating the text without profanity, or maybe just filtering out the sentences that have profanity in them. And you can take that pattern and apply it across a bunch of other use cases. If there's generated code that is incorrect or not executable, you can do the same thing for code. If there are summaries generated from some source text, but those summaries are incorrect or invalid for some reason, you can do that. If you've extracted structured data and specific parts of it are incorrect, you can apply that same paradigm. So it's taking that moderation endpoint and making it a very general, very extensible thing that you can use.
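Here is a minimal sketch of the general pattern described above, with a moderation-style check as just one validator among several. The class, the function names, and the toy word list are illustrative assumptions, not the library's real interface.

```python
import json
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class ValidationResult:
    passed: bool
    message: str = ""
    fixed_output: Optional[str] = None  # a corrected version, if the validator can produce one

# A moderation-style validator: flag (and optionally filter) profanity.
# The word list is a placeholder; a real check would use a proper model or service.
PROFANITY = {"darn", "heck"}

def profanity_free(output: str) -> ValidationResult:
    sentences = [s.strip() for s in output.split(".") if s.strip()]
    clean = [s for s in sentences if not any(w in s.lower() for w in PROFANITY)]
    if len(clean) == len(sentences):
        return ValidationResult(passed=True)
    return ValidationResult(passed=False,
                            message="profanity detected",
                            fixed_output=". ".join(clean) + ".")

# The same shape applied to a different concern: structured output must be
# valid JSON and contain the keys we expect.
def valid_json_with_keys(required_keys):
    def check(output: str) -> ValidationResult:
        try:
            data = json.loads(output)
        except json.JSONDecodeError as exc:
            return ValidationResult(passed=False, message=f"not valid JSON: {exc}")
        missing = [k for k in required_keys if k not in data]
        if missing:
            return ValidationResult(passed=False, message=f"missing keys: {missing}")
        return ValidationResult(passed=True)
    return check

def run_validators(output: str, validators: list[Callable[[str], ValidationResult]]):
    # Run every validator and collect the results for downstream handling.
    return [v(output) for v in validators]
```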
Nathan Labenz: 1:04:46 In practice, what do you see people doing most to fix on fail? You gave a number of different choices there: one could be to pop an outright error, another to rerun the whole call and hope for a better result the second time. But you're also getting pretty granular in between, with maybe just fix a little bit or snip the profanity. How would you advise people to think about which option to take in that moment, and what do you see people actually doing today in the community?
Shreya Rajpal: 1:05:22 I think that's a hard question to answer, because I can't tell people what their pain tolerance is for their applications. Fundamentally, there are some cases where you say, if this doesn't pass, this LLM output is of no use to me, so I want it either corrected or filtered out; I can't use it in this in-between state. Reasking is one of the things that is pretty valuable there. From the user perspective it's a single call, I don't want to say one-shot because that's an overloaded term in ML, but it's a single API call that on the back end might make multiple API calls to the large language model, and it either gives you a corrected output or tells you, hey, this is just incorrect and I can't handle it. For very high-stakes use cases where you can't use the output if a validation check fails, I think reasking is probably the most effective option, and the one I see people use the most. Filtering is another one. Specifically for summarization or for profanity, if there are some sentences that aren't information-dense or that have profanity, you have the granular control to filter out those specific things. And another one, which is the default setting if you're using Guardrails, is a no-op. You still run all of the validations on an LLM output, but if validation fails, you don't do anything; you just return the output as-is to the user, same as if they were using an LLM without Guardrails, but it logs everything that went wrong. You can access that log and figure out, do I want to iterate on my model or my prompt using that? So those are probably some of my favorites, but there's also raise-an-exception, or deterministically fix it if it's possible to deterministically fix. There's a whole suite of options within the framework.
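To illustrate the menu of on-failure behaviors listed here (reask, filter, deterministic fix, no-op with logging, raise an exception), below is a rough sketch of a policy dispatcher. Again, this is the pattern rather than Guardrails' real API, and `reask_llm` is a hypothetical stub.

```python
import logging

logger = logging.getLogger("validation")

def reask_llm(original_prompt: str, failure_message: str) -> str:
    # Hypothetical stub: in practice this would append the aggregated validation
    # errors to the original prompt and call the LLM again, as a single reask.
    raise NotImplementedError("wire up your LLM provider here")

def handle_failure(policy: str,
                   output: str,
                   failure_message: str,
                   fixed_output: str | None = None,
                   original_prompt: str = "") -> str | None:
    if policy == "noop":
        # Return the output as-is, but log what failed so you can iterate later.
        logger.warning("validation failed (%s); returning output unchanged", failure_message)
        return output
    if policy == "filter" and fixed_output is not None:
        # Keep only the pieces that passed (e.g. drop sentences with profanity).
        return fixed_output
    if policy == "fix" and fixed_output is not None:
        # Apply a deterministic correction when one is available.
        return fixed_output
    if policy == "reask":
        # High-stakes path: ask the model to correct itself, then re-validate.
        return reask_llm(original_prompt, failure_message)
    if policy == "exception":
        raise ValueError(f"LLM output failed validation: {failure_message}")
    # Fall through: nothing usable.
    return None
```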
Nathan Labenz: 1:07:34 We'll go with OpenAI first. They also recently launched this Evals library alongside GPT-4. I'm sure you've studied that a little bit. My surface understanding is that it's more of a benchmarking suite than a runtime aid, but what have you learned, or what did you think was smart, in the Evals implementation or its approach to validating language model outputs?
Shreya Rajpal: 1:08:06 Yeah, yeah. I like Evals. My interpretation is the same: it's an offline benchmark that allows them to internally test how well their LLMs are doing on tasks that people care about. One of my favorite things about it, and this is maybe not technical, more from a product or go-to-market standpoint, is that it's almost a way to crowdsource how people want to use their large language models, all of the ways people are using them and all of the ways they're going wrong, and to make that part of your training data set or your evaluation framework so that the models themselves serve those use cases better. That's probably my favorite thing. I've also seen other folks scour through it to find interesting prompts they can borrow, which is a nice way to mine that library. But yeah, I see it as a totally different framework, basically.
Nathan Labenz: 1:09:13 I thought it was a nice touch also that they essentially created a separate lane for GPT-4 API access for folks who contributed to the Evals library. How about Anthropic's constitutional AI approach? That's obviously a training protocol as compared to runtime validation, but do you see commonalities there, or have you taken inspiration from their approach?
Shreya Rajpal: 1:09:42 Yeah, the constitutional AI stuff is interesting. It's back to some of the things I said earlier, which is that it is essential to have deterministic checks, right? Deterministic post-hoc validation, because training by itself doesn't make the models sufficient. So runtime verification, runtime validation, is the most exciting problem to me; that's how I think about it. But it is pretty interesting. The other aspect of it is configurability, a similar thing I touched on earlier: being able to configure what correctness means to you and enforce that specifically, rather than some globally understood, agreed-upon standard of correctness. I really like a world where developers have the agency to configure what that means to them. And being able to do that in this post-hoc setting, where you don't have to go and train a model every time your definition of correctness or your use case changes, opens up access to a much broader audience.
Nathan Labenz: 1:10:50 Yeah, I feel like there's an opportunity, though, to bring the constitutional critique to runtime. I listened to one of the lead authors on the Diplomacy paper, the Cicero model that came out of Meta, his name is escaping me right now, and he was describing a general strategy of: how can we bring more compute forward to runtime? Old systems like Deep Blue in chess were super compute-intensive at runtime, to the point of being largely a tree search more so than an AI as we think of it today. With our language models, it hasn't been quite as obvious how to do that. And it seems like this framework, rather than leaning on all that compute embodied in training, brings some of it forward to runtime and applies all these checks. Essentially, the more compute you can spend, the more likely you are to get really good output; that was his big observation. You talk about deterministic, but some of these things still aren't really deterministic, right? The more I critique, it's still layered non-determinism, which hopefully limits my problems, but in most cases there's still some amount of stochastic stuff that is not reduced.
Shreya Rajpal: 1:12:31 I think that's a very good point, and I should qualify what determinism means. There are essentially different ways to evaluate or validate outputs. One of those ways is taking an LLM output, creating a new prompt that asks GPT-4 again, okay, is this output correct or not given these criteria, and give me a yes or no, some binary response that allows me to assess it. That is one way of doing validation that is not deterministic at all, but it is an additional layer of security. But there's also a bunch of other techniques and validation rules that people can use. Some of them are more rule-based or heuristic-based. Others don't use LLM APIs but use smaller, high-precision models that are trained on subsets of data. Even if those aren't deterministic, they're trained on your own data, and because you have control over the model, you can do things like set random seeds. So even if the output is generated by a machine learning model, it's an output you have control over, and you can control its randomness, either by changing the seed or some of the parameters, or just by generating tons more training data. So one of those frameworks is purely deterministic, and one is more deterministic in the sense of being able to control randomness. When I say deterministic, I'm really referring to an ensemble of all of these techniques, and some of them are all over the map in terms of what their randomness looks like. But that was a good catch, yeah.
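A rough sketch of the three flavors distinguished here: a fully deterministic rule-based check, a smaller purpose-built model whose randomness you control (represented only by a stub), and an LLM-as-judge check (also a stub). Every name and check below is an illustrative assumption.

```python
import json

ALLOWED_STATUSES = {"open", "closed", "pending"}

# 1) Fully deterministic, rule-based: does the output parse as JSON and does
#    it use one of the allowed values?
def rule_based_check(output: str) -> bool:
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return data.get("status") in ALLOWED_STATUSES

# 2) A small, high-precision model trained on your own data. Stubbed here;
#    the point is that you control its weights, its training set, and its
#    random seed, so its behavior is reproducible even if not rule-based.
def small_model_check(output: str) -> bool:
    raise NotImplementedError("plug in a classifier you trained yourself, with a fixed seed")

# 3) LLM-as-judge: not deterministic at all, but an extra layer of review.
def llm_judge_check(output: str, criteria: str) -> bool:
    # Hypothetical stub: prompt a model with the output plus the criteria and
    # ask for a yes/no verdict.
    raise NotImplementedError("wire up your LLM provider here")
```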
Nathan Labenz: 1:14:26 One obvious question from a developer standpoint is what overhead this creates, and you can measure that in a number of ways, right? Token overhead, which is cost overhead, maybe complexity overhead, latency increase. And then there are some tensions with the product experience too. I think about the Bing experience, where, fascinatingly, one of the biggest corporations in the world went with this as their approach: they spit out the response in a streaming, token-by-token way, and then retract it from you if they determine that it went off the rails. I mean, that is crazy to me, that Microsoft launched with that paradigm. I do get why, of course, because people don't want to sit there and wait for the whole thing to be generated before they can start to see what's happening. There's a really powerful draw to the streaming experience, so much so that Microsoft did this, but it's crazy that one of the biggest corporations in the world is willing to knowingly set up a system where they're going to sometimes emit some toxic content or whatever, and then, well, just swipe it.
Shreya Rajpal: 1:15:47 Off, yeah.
Nathan Labenz: 1:15:48 We're in a brave new world for sure. So yeah, I guess all those dimensions, how do you think about overhead? And again, how would you kind of guide developers toward minimizing that overhead? Are there things that could be done in parallel? What's the smart version of this so that the, because a lot of people are going to be, okay, yeah, but the CEO says we can't wait for that. Nathan Labenz: 1:15:48 We're in a brave new world for sure. So yeah, I guess all those dimensions, how do you think about kind of overhead? And again, how would you kind of guide developers toward minimizing that overhead? Are there things that could be done in parallel? What's kind of the smart version of this so that, because a lot of people are going to be okay, yeah, but the CEO says we can't wait for that.
Shreya Rajpal: 1:16:13 Yeah, I think that's a really good question, honestly. Interestingly, before I get into how I think about overhead: you would be surprised, for specific applications, if you're not in the chatbot world, by how comfortable people are with latency, as long as they're getting high-quality output at the end of it. When I was initially building this, latency was a concern, and some of the design decisions I made reflect that. Reasking, for example: there's no piecemeal reasking; you aggregate the failures and then reask, so it's a single request. But people are pretty okay waiting. If they're in a world where correctness matters to them, people are pretty okay taking on some additional latency in order to have that correctness. With that said, I think overhead is a pretty important question. Token overhead is an interesting one, because I've found this interesting balance between how efficiently you can write a prompt using the structured prompting strategy that Guardrails offers, versus how efficient your prompts are when everything is in words. For some use cases, interestingly, I have seen that structuring your prompts, even if you would think it's more tokens, is actually more efficient, because all of that structure and all of those constraints, which are now maybe in symbols or in some domain-specific language, would otherwise have to be represented in words, which is just way more expensive in some cases. I've seen that with complexity as well. If you're trying to get your LLM outputs to be structured in some way or to have some behavior, right now the only way to do it is to prompt it yourself and iterate on a bunch of prompts. But Guardrails abstracts some of that complexity and exploration away from you, right? Because here's this DSL that is tested and works across a bunch of these LLM providers, and as a developer you don't have to think about how to write a prompt that will give you output structured this way; you just write it in this known way. That is almost easier in terms of development. I tweeted this some time ago, and also a plug for my Twitter: follow me, @ShreyaR, if you don't already; I basically talk about Guardrails and AI all the time there. But I tweeted that even if you're in a setup where you don't care about validation, if you just care about getting structured outputs, it's a pain to think about how to do that. So just use Guardrails for getting those structured outputs across LLMs. It's useful in and of itself, because you don't have to write prompts and iterate on prompts.
Nathan Labenz: 1:19:15 Help people understand a little bit more the complexity that you're taking out, and how you're doing it, just on that first question of: can I get the desired format back?
Shreya Rajpal: 1:19:26 Yeah, good question. The entry point for developers in Guardrails is creating this spec, which is a markup language where you're able to specify an output schema: here are all of the different components that I want in that output. If that's a JSON, you're able to configure that. If it's just a string with some additional validation on top of it, maybe a SQL string that you're generating where you want some additional validation, you're able to configure that too. That is totally separate from your prompt. That is just you thinking about what output you want, and writing it down from a schema-definition perspective. So Guardrails has RAIL specs, for Reliable AI Markup Language; that's the specification framework. In a RAIL spec, in addition to the output schema, you can also have a prompt separately, and all you need to add to the prompt is the high-level task description. Everything that says make sure my output looks this way, make sure my output has these constraints, make sure my output is formatted that way, all of that you do in the schema, which is a specific, it's a programming language, right? So if you can write XML or markup, you can write that, and you don't have to convert it into English and then do a bunch of experimentation to make sure your LLM API will understand what that means. So the pain of prompt engineering essentially goes away a little bit, and it allows you to interact with these LLMs in a more programmatic fashion, to some degree. That's the complexity that Guardrails takes on. One of the exciting things about that is that the contract between the user and Guardrails is basically the spec, and then it's Guardrails' job to translate that spec into a prompt. The user gives a high-level prompt and so on, but the output schema definition is translated into a prompt by Guardrails. If you work with these LLMs, you know that the model version internally gets updated all the time, and the behavior of the large language model changes as part of that; I've experienced some of that as I've been building this library. But you would find that even if the model version changes, as a developer you have the same RAIL spec, and Guardrails just compiles that RAIL spec into different prompts, depending on what is most effective for a specific LLM. So from a developer point of view, you don't have to wrangle model version updates and model quality issues, to some degree.
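To make the separation between schema and prompt concrete, here is a rough sketch of what a RAIL-style spec could look like, written as a Python string. It is based only on the description in this conversation; the exact tag and attribute names, and the `email_body` placeholder, are assumptions and may differ from the library's documentation.

```python
# Illustrative only: tag and attribute names are assumptions, not the
# documented RAIL grammar. The point is that the output schema and its
# validation constraints live apart from the task prompt.
RAIL_SPEC = """
<rail version="0.1">

<output>
    <object name="order">
        <string name="customer_name" description="Name of the customer" />
        <integer name="item_count"
                 format="valid-range: 1 100"
                 on-fail-valid-range="reask" />
        <string name="status"
                format="valid-choices: {open, closed, pending}"
                on-fail-valid-choices="filter" />
    </object>
</output>

<prompt>
Extract the order details from the support email below.

{{email_body}}
</prompt>

</rail>
"""
```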
Nathan Labenz: 1:22:19 That sounds really useful. I was just doing some task automation yesterday and I used the format trick. Everybody who uses language models much comes across the format trick: "use this format in the response." As far as I've been able to track it back, it came out of OpenAI, through Riley Goodside, who very much popularized it; he attributes it to Boris Power at OpenAI. Before that, I don't know, maybe Boris just came up with it on his own. But I do run into this stuff where I wonder, how intricate do I really want this format to be? A lot of times I just say: use this format, XML open tag, content, XML close tag. That way I can at least easily use a regular expression to parse out whatever's within those tags, and if it has a prefix or a suffix or an "I hope this was helpful to you," I can get rid of that cruft and have what I want. But I usually haven't gone much farther than that, because then I'm getting into hand-coding XML in the playground or whatever, and that's not great. So right off the bat, I think that is maybe something we underemphasized at the beginning of the conversation: there's a straight convenience factor, at least from a developer standpoint, in being able to write something at a little bit higher level of abstraction and have the tool translate it into a more detailed prompt that will get you the parsable result you want.
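For reference, the "use this format" trick described here reduces to a couple of lines of parsing with the standard library. The tag name and the prompt wording are just one way to do it.

```python
import re

# One way to phrase the instruction: ask the model to wrap the useful part in tags.
PROMPT_TEMPLATE = (
    "Summarize the text below in one sentence.\n"
    "Use this format for your response: <answer>your summary here</answer>\n\n"
    "{text}"
)

def extract_answer(llm_output: str) -> str | None:
    # Grab whatever sits between the tags and ignore any prefix/suffix cruft
    # ("Sure, here you go!", "I hope this was helpful", etc.).
    match = re.search(r"<answer>(.*?)</answer>", llm_output, flags=re.DOTALL)
    return match.group(1).strip() if match else None

print(extract_answer("Sure! <answer>The memo announces a hiring freeze.</answer> Hope that helps!"))
```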
Shreya Rajpal: 1:23:59 No, yeah, totally. I found that as well. When I was doing a lot of my prototyping, I had prototyped it with GPT-3, and then I had a similar experience. Anybody who's worked with GPT-3 and 3.5 knows you get all of these statements, "here's the output you're looking for" and "hope that was helpful," when what you want is, oh, just give me the JSON. Since release, I've had a community contributor actually push the effort to add instruction tags and do a bunch of experimentation to figure out what works well with GPT-3.5. So from a developer standpoint, you write one spec, and then depending on whether it's GPT-3 or 3.5, it compiles a little bit differently so that it just works. Even with GPT-3.5, I've had a bunch of people test it out, and they just get the thing they're looking for without a lot of that extra filler text that might otherwise make it not work as well.
Nathan Labenz: 1:24:59 Yeah, it's interesting that you are also translating that to other models. The angle I was going to mention a second ago is that there seems to be a dynamic here as people try to think about where all this is going and how the market is going to shape up. Are we going to have one AI to rule them all, or an oligopoly of large language model providers? One force that I see pulling everybody into line with OpenAI is the fact that all of these things are getting developed against OpenAI's state-of-the-art behavior at the time, almost across the board. So if you're a different large language model provider, if you're an Anthropic or a Cohere or a Google or an Aleph Alpha out of Europe or whatever, it seems like you have a really strong incentive to be as much like OpenAI as possible when it comes to supporting all these things. But then, how are you going to compete on that, if you've got to expend all this energy just to make sure people can even switch to you without breaking things they currently assume will work? That's not a great position to be in, spending all your time just trying to catch up in that way. But I don't know that there's any way around it for other providers. So do you see that dynamic similarly? And does this experience lead you to think we may see concentration, if not necessarily of providers, at least in terms of how language models behave?
Shreya Rajpal: 1:26:53 Yeah, I think that's a really great question. I personally believe that there will be a big diversity of model providers. I've worked in AI/ML my whole adult life, and it's kind of insane to see the amount of activity and excitement around the space. There are people training and fine-tuning deep learning models who weren't even in this space a few months ago, and that's really awesome. Just that amount of demand means there's more need, and more providers will come up. I had people who were, oh, I really like Guardrails, I really like OpenAI, but it's just too expensive for what I'm trying to build, so can you make this work with an open source model, as an example? So I do think we're going to see a lot of proliferation of great-performing models at different price points, different latencies, from different providers, and I'm pretty excited about that world. The other interesting aspect is that because a lot of this was unlocked by OpenAI, we're in a space where people are building these frameworks around the most performant provider and the most performant model. And because those frameworks are maybe indexing too heavily on OpenAI, there's an incentive for other model providers to offer interoperability for that specific functionality. So the standards are evolving at the same time as the models are getting better, and there's an interesting dynamic between them. At least for the foreseeable future, there's a big incentive for other model providers to provide similar functionality, or at least not have regressions, so that people are incentivized to switch over to them for a variety of reasons. But I do think the pace of innovation means there will be a more level playing field at some point, and then it'll be very interesting to see what specializations different model providers, and even different open source models, end up offering.
Nathan Labenz: 1:29:15 Okay, here are my three quick ones; you can answer these as quickly, or not at all, as you want, but if there's time I usually ask these three final questions. One: favorite AI apps, experiences, or tools that you are loving and would recommend to others?
Shreya Rajpal: 1:29:31 I'm going to be very basic and say Copilot. I think Copilot is awesome. I use it extensively, and I recently had to set up my dev environment from scratch and didn't have Copilot. I was like, what am I missing? Why is my coding rhythm slower? So yeah, definitely Copilot.
Nathan Labenz: 1:29:51 I can totally relate to that. When they went paid and it stopped working, I was in a panic until I realized that I could just pay for it and make it come back. But yeah, it was, I'm not going to be doing this without it now, am I? That's not to be endured. Okay, second one, hypothetical situation: sometime in the future, a million people already have the Neuralink implant. If you get one, you can control your computer and devices with your thoughts. Would you be interested in getting a Neuralink implant at that point?
Shreya Rajpal: 1:30:31 I would not. I can imagine, I don't know, as a weird strawman: what if I'm in a presentation and I'm controlling my laptop with my brain and I get distracted? Does that mean my slide deck is now showing whatever I'm distracted by? Again, having guardrails, having layers of security between thinking something and it being executed on a different system, I think it's important to have some filtration mechanism there. So I would sit it out, but I would be very excited about that world, and I'd enjoy inhabiting it even if I'm not actively participating.
Nathan Labenz: 1:31:14 Great answer, I love it. Final one: zooming out as wide as you can, and thinking as far into the future as you have any sense of what might happen, what are your biggest hopes and fears for society at large as this AI moment continues to unfold?
Shreya Rajpal: 1:31:33 I think it's a great question. My biggest concern, and I don't think I have any special insight here, is job displacement; it's something I think about quite a bit. What is the work that will still be valuable in a future where a lot of knowledge work can at least be assisted, where a lot of knowledge workers can be made more efficient than they are today? What does that mean for the future of work? That is something I think about. And my hope is that I'm able to live a life where I just don't have to do the mundane things that end up taking up so many parts of our lives. In the future, maybe I don't have to do my own taxes and I don't have to book my own flights; I can just say, book this flight for me on this date and find me the best price. Being able to automate away a lot of those parts and just do the fun parts of life, that's something I'm excited about and hopeful for. Go check out the package if you're building with large language models, and if you're facing any issues where you're like, this was just working and now it's not, how do I fix that? Check out Guardrails. And yeah, follow me on Twitter.
Nathan Labenz: 1:32:58 Yeah, it's a beautiful vision. We can all hope for it. Shreya Rajpal, thank you very much for being part of the Cognitive Revolution.
Shreya Rajpal: 1:33:05 Yeah, absolutely. Thank you again for inviting me. I really enjoyed spending my afternoon chatting with you.