Code Context is King: Augment’s AI Assistant for Professional Software Engineers, with Guy Gur-Ari


In this episode of the Cognitive Revolution, Guy Gur-Ari, Co-Founder and Chief Scientist at Augment, explores the transformative impact of AI on the software industry. Highlighting Augment's unique approach, Gur-Ari discusses the challenges and solutions associated with integrating AI into large codebases, the nuances of maintaining context in AI-driven coding tools, and the evolving economics of AI-driven businesses. He shares insights on the company's focus on reinforcement learning from developer behaviors, future trends in software development, and offers advice for junior developers entering an AI-enhanced industry. The conversation also touches upon the vital role of user data, the complexities of vector databases, and the potential of agentic flows to revolutionize coding processes.

SPONSORS:
Box AI: Box AI revolutionizes content management by unlocking the potential of unstructured data. Automate document processing, extract insights, and build custom AI agents using cutting-edge models like OpenAI's GPT-4.5, Google's Gemini 2.0, and Anthropic's Claude 3.7 Sonnet. Trusted by over 115,000 enterprises, Box AI ensures top-tier security and compliance. Visit https://box.com/ai to transform your business with intelligent content management today.

Oracle Cloud Infrastructure (OCI): Oracle's next-generation cloud platform delivers blazing-fast AI and ML performance at up to 50% lower cost for compute and 80% lower cost for outbound networking than other cloud providers. OCI powers industry leaders like Vodafone and Thomson Reuters with secure infrastructure and application development capabilities. New U.S. customers can get their cloud bill cut in half by switching to OCI before March 31, 2024, at https://oracle.com/cognitive.

Shopify: Shopify is revolutionizing online selling with its market-leading checkout system and robust API ecosystem. Its exclusive library of cutting-edge AI apps empowers e-commerce businesses to thrive in a competitive market. Cognitive Revolution listeners can try Shopify for just $1 per month at https://shopify.com/cognitive

NetSuite: Over 41,000 businesses trust NetSuite by Oracle, the #1 cloud ERP, to future-proof their operations. With a unified platform for accounting, financial management, inventory, and HR, NetSuite provides real-time insights and forecasting to help you make quick, informed decisions. Whether you're earning millions or hundreds of millions, NetSuite empowers you to tackle challenges and seize opportunities. Download the free CFO's guide to AI and machine learning at https://netsuite.com/cognitive


PRODUCED BY:
https://aipodcast.ing

CHAPTERS:
(00:00) About the Episode
(04:38) Introduction and Welcome
(04:46) The Software Supernova Series
(05:23) Augment's Unique Approach to AI in Software Development
(06:16) Challenges in Large Code Bases
(07:32) Understanding Augment's Customer Base
(09:19) Context Management in AI
(11:46) Technical Insights and Blog Highlights
(13:16) Context Management and Code Indexing
(19:35) Sponsors: Box AI | Oracle Cloud Infrastructure (OCI)
(22:40) Developer Workflows and AI Integration
(29:04) Vector Databases and Retrieval Systems (Part 1)
(33:25) Sponsors: Shopify | NetSuite
(36:13) Vector Databases and Retrieval Systems (Part 2)
(37:29) Best Practices for Building RAG Applications
(50:36) Establishing a Solid Process for Model Evaluation
(51:01) Optimizing Experimental Iteration Time
(53:10) Exploring Reinforcement Learning from Developer Behaviors
(54:14) Challenges and Benefits of User Data in AI
(01:05:34) The Economics of Running an AI Company
(01:14:16) Future of Software Development and AI Integration
(01:18:51) Advice for Junior Developers in the AI Era
(01:23:58) Conclusion and Final Thoughts
(01:24:50) Outro

SOCIAL LINKS:
Website: https://www.cognitiverevolutio...
Twitter (Podcast): https://x.com/cogrev_podcast
Twitter (Nathan): https://x.com/labenz
LinkedIn: https://linkedin.com/in/nathan...
Youtube: https://youtube.com/@Cognitive...
Apple: https://podcasts.apple.com/de/...
Spotify: https://open.spotify.com/show/...


Full Transcript

Nathan Labenz: (00:00) Hello, and welcome back to the Cognitive Revolution. Today, my guest is Guy Gur-Ari, cofounder and chief scientist at Augment, a company using the full range of AI strategies from autocomplete to RAG to chatbots to autonomous coding agents to transform the practice of software engineering in large enterprise code bases. While our first episodes in the software supernova series looked at vibe coding platforms that allow anyone to prompt their way from 0 to a proof of concept or basic app, Augment, which was founded in 2022, back when OpenAI's Codex models and early autocomplete tools were still mostly just foreshadowing a very different way to code, is tackling a harder but potentially more economically transformative challenge. How do you 10x productivity for professional engineers who bring their considerable human expertise to bear on vast, messy, legacy code bases, which often have millions of lines of code spread across multiple projects that can vary in age, coding style, and underlying technical infrastructure. Unlike personal projects where one can often simply copy an entire code base into Gemini's context window, as Guy explains, the enterprise challenge requires serious technical firepower at all levels of the stack. And so Augment has spent the last 3 years deeply exploring multiple different approaches to code understanding and has ultimately developed a sophisticated retrieval heavy approach from the ground up. Their RAG stack includes a custom built vector database capable of real time updates, proprietary retrieval models designed specifically for large code bases, code search that fires on every single keystroke for every single user, custom code generation models trained with a technique they call reinforcement learning from developer behaviors, and multiple different product paradigms for delivering code to users, all of which is intensively optimized for both accuracy and speed and available across a number of the most popular development environments. The results are quite impressive. As you hear, Guy reports that he personally hasn't written a line of code in months. These days, the coding agent, which I had the chance to use in preview and which will be released to the public very soon, handles all of that, leaving Guy to focus on higher level issues, including how he and the team can continue to improve the agent so that it can eventually run for extended periods, take on larger projects, and even go beyond explicit user instructions to infer and address unstated needs. The economics of the business are fascinating too. Augment's pricing is pretty conventional today with $30 and $60 a month plans. But Guy was quite candid about the fact that some power users already cost them a whole lot more than that to serve. And especially as agentic workflows consume more and more compute, pricing in the AI space in general is very much a live question. It helps, of course, to design pricing that aligns company and customer interests, but it's less clear how best to do that considering that enterprise customers also value stable pricing and predictable costs. The good news for Augment is that having raised some $250,000,000 in investment capital, they do have some time and financial cushion to figure that out. There is a ton of technical depth in this episode, but arguably the most valuable part is Guy's super practical, down to earth advice for AI builders. 
While he and the Augment team have repeatedly invented new technology to solve hard problems, he recommends starting new projects simply by creating small evaluation data sets of just 10 to 20 high quality, hand labeled examples that you understand deeply and can quickly test new solutions against, and then optimizing for the speed of iteration by pursuing the simplest available strategies first and then exhausting what's available in the market before building custom solutions in house. All advice that as regular listeners will know, I wholeheartedly endorse. Toward the end, I asked Guy if Augment is currently hiring junior engineers, and more broadly, what advice he has for today's early career engineers and CS students. His answer, I think you'll agree, reflects the current moment in the software industry. A sense of excitement and opportunity for the foreseeable future, but also a recognition that nobody can see the future more than 2 to 3 years out. As always, if you're finding value in the show, we'd appreciate it if you take a moment to share it with friends or write a review, and we always welcome your feedback and suggestions either via our website, cognitiverevolution.ai, or by DMing me anywhere you like. Now I hope you enjoy this deep dive into the hard tech powering AI coding assistance for enterprise software engineers with Guy Gur-Ari, cofounder and chief scientist at Augment. Guy Gur-Ari, cofounder and chief scientist at Augment. Welcome to the Cognitive Revolution.

Guy Gur-Ari: (04:45) Great to be here. Thanks for having me on.

Nathan Labenz: (04:47) My pleasure. I'm excited about this. So we've been doing a little series that I'm calling the software supernova, which is, you know, just a nod to how much the software industry is changing. And we're coming at that from a bunch of different angles to try to understand it as deeply as possible. I think you're gonna provide a really differentiated and interesting angle because a lot of the stuff that we've looked at previously has been kind of people, you know, wanna create an app out of nothing. And, you know, there's a growing number of products out there now that actually can take you quite far if you show up, you know, with just an idea and you wanna go from kinda prompt to app in seconds, you know, is often the promise. But you guys are coming at the software industry from basically the other end, which is targeting large organizations with big code bases and, you know, things in production, long lived projects. And so I think this will be a really interesting compare and contrast to understand the different challenges that that poses and the different solutions that you're bringing to the market. So maybe for starters, give me a little just kind of introduction to the company and that core challenge. I think, you know, folks that follow this feed are like paying attention to AI. I'm not actually sure how many have been in larger software organizations and would be familiar with the, you know, the particular challenges that those organizations face.

Guy Gur-Ari: (06:04) Yeah, for sure. So Augment was founded with that vision of bringing AI to bear on real software engineering challenges that show up when you're working on a large team, when you're working on a large existing code base, because those are the challenges that the vast majority of professional developers face day to day. Definitely the larger the organization, the larger the code base, the fewer 0 to 1 projects people do and the more ongoing maintenance, still feature development, product development, but it all has to work in the context of a large code base. And our premise was we could see that AI technology was crossing the threshold of becoming useful. So back when Augment was founded, we had autocomplete as a product that was out there, but ChatGPT still did not exist. But we could see how these models were getting rapidly better, and we felt that AI could play a big role. And we also felt that as a startup, if we went after these hard problems of allowing software developers to be productive in their code bases, we could differentiate because it requires a lot of context understanding. And I guess that's something we'll dig into more.

Nathan Labenz: (07:16) Yeah. There's a lot of dimensions to the problem. Do you wanna just give me a little sense of, like, how big a typical Augment customer is? I mean, you could measure that in employees or, you know, number of repos or lines of code. But, yeah, just how big of organizations are you guys targeting?

Guy Gur-Ari: (07:34) Yeah. So we typically target organizations that have hundreds of developers. We do stretch higher. So we have some customers who have thousands of developers. In terms of how many repos, that really varies from customer to customer. So some customers use monorepos like we do internally, and some customers use many different repositories, maybe one per microservice. So that varies a lot. In terms of lines of code, I think it starts probably with millions of lines, and then it goes up from there.

Nathan Labenz: (08:05) So that highlights obviously one immediate challenge right off the bat, which is when I do my own little projects, my default workflow, unless I'm testing something else or whatever, is I'll usually have the AI write a little script to print the entire code base to a single file. And then for a while, at least, I can just copy that entire file, put it into the context window, and ask the AI for help. So I'll take it to, you know, o1 Pro maybe to do some planning or Claude or now I've got Gemini 2.5, which can take me farther. But, you know, it still maxes out at 1,000,000 tokens, which is obviously not gonna handle the whole code base. What do you feel is kind of not done well or sort of missed by, say, Copilot if we wanna pick on one, or, like, other offerings in the market? And maybe you could kinda catch that up to, like, what are the frustration points or the, you know, the places where you see developers just kind of, like, not getting the value that, you know, Augment can in fact deliver? Because I mean, we've all seen these sort of like one shot examples and, oh my god, it wrote this function for me and all that kind of stuff. But where does the conventional approach like break down in practical terms?

Guy Gur-Ari: (09:21) Right. So from what we've seen, again, we're operating inside a code base that doesn't fit into the prompt. And today, even if we have 1,000,000 tokens of context length, the ratio is roughly 10 to 1. So 10 tokens per line of code. And so still that only gets us to 100,000 lines of code, which in industry is still considered a small project. And there are other downsides, as the project grows, to actually putting all of that in the context that we can talk about separately. But I'd say the problem there is when you're working in a large code base, you really have to keep in mind not just the work that you're doing as a developer, what you're focused on, but also the context. And that could be very obvious things like, I need to call a few APIs. I need to call them correctly, and I need to put in the right parameters in there, and I also need to call them in a way that respects the conventions. Maybe there are multiple ways to call them. Maybe there are multiple ways to achieve the task that I'm trying to achieve, and we want to be respectful of the conventions that are in the code base. These are all things that if you're a developer inside an organization, if you've worked there for a while, you're already familiar with the right way to do it. But when you ask a model to do it, an AI model, and you don't provide it with all of that context, it's going to struggle. Basically, it's going to give you bad predictions, whether it's a bad completion or a bad chat answer. And so since at Augment, we've prioritized context from the beginning, we have full code base understanding built in by default to every feature. And so if you're getting a completion, it's going to take the context into account, whether it means looking up the function you're calling or looking up other examples of its usage, things like that. If you're asking chat a question like, where is this function that I used 6 months ago and I can't remember what it was, it's gonna search through your whole repository. And with agents, we see that this actually matters even more because when you're trying to get these models to achieve more and more complicated tasks in a code base, the context becomes ever more important, simply because there's less supervision from the developer as the agent is working.
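
The arithmetic behind that ceiling is simple enough to check directly; the 10-tokens-per-line ratio is Guy's rule of thumb, not an exact figure.

```python
# Quick arithmetic check of the rule of thumb cited above.
context_tokens = 1_000_000
tokens_per_line = 10
print(context_tokens // tokens_per_line)  # 100,000 lines of code, still a small project by enterprise standards
```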

Nathan Labenz: (11:31) Yeah. Context, as Tyler Cowen says, is that which is scarce. He said that before we got into the LLM era, but it feels like it's, you know, 10 times more applicable to the LLM agents that we're all trying to figure out how to make work for us than it ever was for the humans. So I guess I'd love to just dig into sort of how you're making this work because, I mean, one of the things that has really stood out to me, I guess a couple of things have really stood out to me as I've studied the company and used the product a little bit. One is the blog is outstanding. There's a lot of technical information shared on the blog, and that's really an excellent resource for people to get a good sense of what you're doing. And I guess a theme throughout many of those blog posts is just really pushing hard on bringing a lot of resources to bear for an individual user. So, like, one way in which that manifests, I understand, is, if I'm not mistaken, like, literally every keystroke that I make fires off a thing to the server, which then, you know, begins to search the code base to try to figure out, like, what am I doing right now? Where am I? And assemble the useful context. So maybe, you know, let's get into the sort of context management. Like you could take this in many different directions because I know there's a lot to it, but, you know, maybe describe, like, from the second that I sort of open the app. And we can also talk a little bit about how it's an extension of VS Code, not a fork, and there's a whole, you know, sort of debate going on there as to, like, what's the right way to go to market. So let me take that one even first if you want. But then, you know, when I open the thing and I'm like, okay, here's my repository and I'm new. What's happening behind the scenes as it's indexing and, you know, getting me ready to like put me and the app together in a position to like really use a lot of compute at runtime?

Guy Gur-Ari: (13:17) Right. So in the beginning, we actually explored several different approaches to code base understanding. I think the approach we landed on was the third one that we tried, and each one of these was a multi month research project to try to figure out, could we make it work? What we landed on at the end could be described as RAG. And so what happens behind the scenes is we upload the code, we have our own custom trained retriever models that we train for the purpose of code base understanding, and then we index code using these models. That's what happens when you open Augment and it says indexing your code base. Once that's done, yes, on every keystroke and on every chat request, we send a request to the model, and part of processing that request is figuring out which parts of the code base are most relevant to show to the model so that it can make the best possible prediction for the user. And there is quite a bit of speed optimization that then goes into making all of that fast, because it's one thing to index a large code base in the background, but then it's a whole different story to say, okay, this completion request needs to finish within, let's say, on the order of 300 milliseconds. And that needs to account for both retrieving everything that's relevant from the code base and actually doing the language model call to generate the completion. And so we prioritize both quality, so that the retriever is good and, end to end, it actually feels like it understands your code base, and also speed, because to us speed is a super important feature basically of the product.
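
To make that retrieve-then-complete loop concrete, here is a minimal sketch in Python. The toy hashed embedding, the brute-force index, and the 300 millisecond figure taken from Guy's description are stand-ins; Augment's actual retriever models and serving stack are not public, so treat this as an illustration of the request flow rather than their implementation.

```python
import math
import time

def embed(text: str, dim: int = 256) -> list[float]:
    # Toy hashed bag-of-words embedding; a stand-in for a trained retriever model.
    vec = [0.0] * dim
    for token in text.split():
        vec[hash(token) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

class CodebaseIndex:
    """Built once per repository ("indexing your code base"), queried on every request."""
    def __init__(self, chunks: list[str]):
        self.chunks = chunks
        self.vectors = [embed(c) for c in chunks]

    def top_k(self, query: str, k: int = 5) -> list[str]:
        q = embed(query)
        ranked = sorted(zip(self.vectors, self.chunks),
                        key=lambda pair: cosine(q, pair[0]), reverse=True)
        return [chunk for _, chunk in ranked[:k]]

def completion_request(index: CodebaseIndex, cursor_context: str,
                       budget_ms: float = 300.0) -> str:
    start = time.perf_counter()
    context_chunks = index.top_k(cursor_context)  # retrieval fires on every keystroke
    prompt = "\n".join(context_chunks) + "\n" + cursor_context
    # A real system would now call a code-generation model with `prompt`;
    # here we only check that retrieval left room in the latency budget.
    elapsed_ms = (time.perf_counter() - start) * 1000
    assert elapsed_ms < budget_ms, "retrieval blew the latency budget"
    return prompt

if __name__ == "__main__":
    idx = CodebaseIndex(["def connect_db(url): ...", "def parse_config(path): ..."])
    print(completion_request(idx, "cfg = parse_co"))
```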

Nathan Labenz: (14:59) Yeah. Can you you know, you can obviously calibrate exactly how much you wanna share about the details here. Although I recently did an episode also with Andrew Lee of Shortwave, and they have a pretty similar, approach where, you know, you sign up and the first thing they do is, like, ingest your entire Gmail history, which can be a lot. Right? And then that goes into their database and he's like a database guru. And he told me, and I don't know if you'd feel the same way. He's like, well, yeah, we could pretty much tell all our secrets because by the time anyone, like, figures out what we've done and tries to recreate it, we'll have, like, a whole new generation. So I'm not sure if you, you know, feel quite as confident on that dimension and would be willing to share all the current way things work. But to the degree that you can, long preface, I love to understand a little bit better, like, how are you chunking code? Because I think people have, like, broadly kind of come to frustration with RAG where and I think there's a number of different reasons for this, but sometimes, I mean, you can kind of fail at every step, right? So like, how are you chunking when you get a hit on a chunk? Are you then expanding out to make sure you have the surrounding context that's needed? So it's not just that one function out of a broader class kind of loaded in isolation. You know, there's context management and then there's effective context management, I guess is what I'm really trying to get at. So how do you make it not just fast, but actually like good so it has the right information that it needs?

Guy Gur-Ari: (16:25) Right. So I probably can't be as open because I do believe that there is quite a bit of secret sauce in what we do. I mean, it is true. Getting RAG to work well is very challenging. And, at least in my experience so far, getting it to work well on code is even more challenging than in other domains. So just to give an example of why that is. Let's say I'm starting to type something and I'm starting to type a piece of code. And let's say there's enough context there to understand kind of what it is that I'm trying to do. Maybe there's a comment or something, although often there's not even that. And my cursor is sitting there, and, you know, we need to get a prediction out of the model, and we need to know what pieces of code are relevant to make that completion. This is a very different situation from a chat or question answering system where the user is asked to provide the context for the request. Right? You start with an instruction or you start with a question. You have basically a lot of context for understanding what it is that you're gonna be looking for in your knowledge base. With code, at least with chat, you kind of have that. With completions, it's more passive. We kind of are trying to both infer what it is that the developer is trying to do and then try to figure out which code it is. So just to give an example, let's say that we figured out that we need to call a function or the model figured out that it needs to call a function. Then the question becomes, okay, what pieces of code are most relevant to help the model make that function call correct? We could pull up the function signature. We could pull up example usages of that function. We could pull up other pieces of code that maybe serve as counterexamples. Another thing about code bases is they evolve over time, and we kind of see a snapshot of the codebase. And so if we're pulling up examples, those examples could be new or those examples could be obsolete, and they're just left around in the codebase, and actually, the developer doesn't want to call them that way. So it's an extremely challenging problem. What I can say is that we use a mix of different techniques. We use RAG. We use some amount of static analysis on code. There are multiple models at play to provide the best possible context to the model, and we also let the user steer often because these systems are not perfect. And so we need a way for the user, especially in chat, to say, okay. I'm actually pointing at this directory or I'm pointing at this file, and this can also kind of indirectly help steer the retriever. Yeah. There are multiple things at play. I'd say on chunking, yeah, there's definitely better and worse ways that you can do it. And it's true that, yeah, code has more structure that you can hang on to. What I can say is that improvements in chunking are more about solving problems that are in the tail. Maybe I can say it like that. If you have strong retrievers and strong models, chunking shouldn't be a blocker. Yeah. I think I can say that.
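
As a rough illustration of the kind of structure code gives you to "hang on to," here is a simple function-level chunker for Python files. It is a generic sketch, not Augment's chunking strategy, which Guy explicitly keeps under wraps.

```python
import ast

def chunk_python_source(source: str) -> list[str]:
    """One chunk per top-level function or class, plus a chunk for everything else."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks, covered = [], set()
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            start, end = node.lineno - 1, node.end_lineno
            chunks.append("\n".join(lines[start:end]))
            covered.update(range(start, end))
    leftover = [line for i, line in enumerate(lines) if i not in covered]
    if any(line.strip() for line in leftover):
        chunks.append("\n".join(leftover))  # imports, constants, module docstring
    return chunks

if __name__ == "__main__":
    src = "import os\n\ndef load(path):\n    return os.path.exists(path)\n"
    for chunk in chunk_python_source(src):
        print("---")
        print(chunk)
```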

Nathan Labenz: (19:37) Hey. We'll continue our interview in a moment after a word from our sponsors.

Nathan Labenz: (19:41) In business, they say you can have better, cheaper, or faster, but you only get to pick 2. But what if you could have all 3 at the same time? That's exactly what Cohere, Thomson Reuters, and Specialized Bikes have since they upgraded to the next generation of the cloud, Oracle Cloud Infrastructure. OCI is the blazing fast platform for your infrastructure, database, application development, and AI needs, where you can run any workload in a high availability, consistently high performance environment, and spend less than you would with other clouds. How is it faster? OCI's block storage gives you more operations per second. Cheaper? OCI costs up to 50% less for compute, 70% less for storage, and 80% less for networking. And better, in test after test, OCI customers report lower latency and higher bandwidth versus other clouds. This is the cloud built for AI and all of your biggest workloads. Right now, with 0 commitment, try OCI for free. Head to oracle.com/cognitive. That's oracle.com/cognitive.

Nathan Labenz: (20:51) So, yeah, one thing that jumps out to me there is just how you started with the sort of assumption that the user is like typing code into an IDE in the traditional way. And this maybe can also tie back to like the go to market as an extension of VS Code as opposed to a fork. I'm so AI pilled myself and just kind of always trying to do 2 things at once. Usually, I'm, like, trying to accomplish some goal in some project, but then also, like, learn about the latest AI capabilities or, you know, use Augment or whatever. Right? So I'm always kinda looking for these 2 for ones. And I think that probably puts me, I'm realizing, in maybe a very different pattern of behavior than what you typically see. So, like, as I've been using it over the last few days, I've done it entirely through the sort of chat panel and I basically don't really ever, almost never, like, actually get in and start typing, you know, functions myself anymore. Where are people on that today generally, though? Like, you know, what is the sort of balance of approaches that you're seeing? And by the way, I'm, like, sort of a mid programmer, which is maybe why I'm so, you know, drawn to the chat experience. But, you know, for the pros, what's the balance between how many are, like, kind of working the old, you know, traditional way of, like, being file by file and then getting this assistance, you know, sort of proactively served up to them versus those that are actually saying, okay, I wanna, like, interact with an AI and have it help me, you know, but I'm gonna, like, give it an assignment in a sort of chat or agent type of paradigm?

Guy Gur-Ari: (22:29) Yeah. There's definitely a distribution. So I think when we were talking about completions and chat, we did notice that there do seem to be 2 camps of developers. And, of course, there's a lot of overlap. Like, I doubt there are many people who only use completions or only use chat, but there are certainly developers with a preference for being a lot closer to the code, I would say, who don't even use chat much, but really love completions. And now also NextEdit. NextEdit is kind of a way to, okay, you get completions, they might be away from your cursor, they can delete code and edit code, not just add code, but it fits in very nicely with the workflow of developers who kind of wanna keep their focus on the code. And then we see a lot of developers who really only use chat. I mean, that is fairly common. The thing that's changing now is as we're building agent mode, you can take another step away from the code and really let the model edit multiple files, run your tests, and you're kind of taking another step away and supervising everything, and then you can dig into the code when needed. This is something that we've seen. When you work on a large code base, you pretty often have to go back to looking at the code and making some changes yourself. That's pretty frequent, which is quite different from the 0 to 1 experience that I think we talked about before. So I would say for large code bases, as far as I can tell, most developers are comfortable being in chat a lot of the time. And, yes, using completions and NextEdit, but using chat a lot of the time. The switch or the move to a full autonomous agentic flow will take longer. I think agents and models will need to improve before that becomes kind of the default mode for enterprise developers, let's say. It will take longer. But I feel like that's the direction we're going in.

Nathan Labenz: (24:31) Yeah. Certainly. And, you know, this can get into sort of almost ideological territory very quickly, but I'm sure for you, it's much more a practical question around whether to do an extension of VS Code or a fork. Is that a matter of just kind of meeting developers where they're comfortable and, like, not asking them to change too much? Or are there other, you know, big decision drivers there that have you in the extension paradigm?

Guy Gur-Ari: (25:01) Yes. I think it starts from meeting developers where they are. And so we have a VS Code extension. We also have a JetBrains extension, and we have Vim support. This is really because we don't want to force developers to change how they work. I think with the forks, these are all VS Code forks because VS Code is open source. And so you can say that if you switch from VS Code to a fork, you're not changing your workflow that much. But if you're asking a JetBrains developer to switch to a VS Code fork, that's a pretty substantial change to their workflow. So that's one. I would say there are also other considerations with the fork, which is that doing a fork means you need to keep up with updates, especially security patches, which then becomes extra maintenance work that you have to do. And then especially if you're selling to enterprise, these security considerations can matter. Now, the downside of not having a fork is that there are certain UI things that are harder to do or sometimes impossible to do. Although I have to say that with the VS Code API, we've been able to do a lot within VS Code. I don't think this has been, like, a very substantial limitation. Sometimes we've had to work harder because we can't just go and change the VS Code code. This is another place where I suspect that the more we move to agentic flows, the less we have to do kind of inside the text editor. And once you're building an agent inside VS Code, you have a lot of freedom in what to do because you can open panels, you can put web views in there, and you have full control over what's happening. So my sense is that this distinction is gonna become probably less important over time, but, you know, I can't promise we won't do a fork at some point. There's certainly a trade off there.

Nathan Labenz: (26:48) Yeah. That's interesting. I mean, the point about just security and being able to kind of piggyback on, you know, all the hard work that Microsoft has already done to establish trust definitely makes a lot of sense. I've had enough experience with the security review processes at enterprise customers and not nearly as much as you've had, but I've had enough to know that it's not where I wanna be spending my time and, you know, to the degree that you can shorten that process, it certainly has a lot of appeal.

Guy Gur-Ari: (27:16) Yeah. Exactly.

Nathan Labenz: (27:18) Going back to the just retrieval, and again, you can kind of calibrate your answers however you want, but practical guidance, you know, for other people building their own RAG apps. Do you have a favorite vector database?

Guy Gur-Ari: (27:32) So we actually built our own vector database. I can explain why we did that. There was nothing out there that we found at the time that addressed all of our requirements. So, you know, what do we want the user experience to be? Right? We want the user to feel like the model understands their whole code base and we want it to feel like it understands the current state of the code base. Right? So if I just wrote a function in a file or had chat write it for me, and now I ask chat, let's say, okay, implement the tests, or I go to a test file and I start typing a completion or I start typing a test. We want the model to understand that this is something that I recently did and have that all kind of indexed and available. And so that means giving every developer, or giving the model a real time view of every developer's code base. So it has to be real time or feel like real time, and it also has to be different for every developer because if I'm a developer on a team, I work on my feature branch, you work on your feature branch, we cannot have those things mix. That's also a security requirement. So that means that in terms of a vector database, you need something that allows almost real time updates to the index, which is already a significant requirement from a vector database. And it also needs to be able to have queries based on different views. Right? I have a slightly different set of files that I'm retrieving from than you do, but we still wanna deduplicate. We still wanna have like one database that captures our repository and not, like, duplicate that for every user on the team. We did not know of a product that did all of that. And there's a technical reason for it. So typically, the way vector databases work, when you query, it's pretty expensive to do a full query every time. And so you apply some kind of statistical algorithm. Maybe you cluster your embeddings and you search the cluster. I mean, there are all kinds of ways to do that. But taking that kind of approach, or one of these standard approaches, means that it doesn't work well with updating the index, because updating the clusters can be expensive, and also that views, or queries based on views, are hard, because if you're doing a statistical query and you only have a subset of the files that you're retrieving from, you might miss them completely when you're doing the statistical query. And so it was certainly a difficult engineering problem to build a vector database for us, and we still keep iterating on it, especially as we have customers that have larger and larger code bases. We need to keep scaling up solutions. So there's a project ongoing right now addressing those scale requirements for us. But, yeah, we ended up building our own.
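
A minimal sketch of the two requirements Guy describes, near real time index updates and per-developer views over one shared index, might look like the following. The brute-force storage and toy embeddings are assumptions made for illustration; Augment's actual vector database design is not public.

```python
import math

def embed(text: str, dim: int = 128) -> list[float]:
    # Toy hashed bag-of-words embedding, standing in for a trained retriever model.
    vec = [0.0] * dim
    for tok in text.split():
        vec[hash(tok) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

class SharedIndex:
    """One copy of the repository's chunks, shared across the whole team."""
    def __init__(self):
        self.files: dict[str, tuple[str, list[float]]] = {}

    def upsert(self, path: str, content: str) -> None:
        # Near real time update: re-embed only the file that changed.
        self.files[path] = (content, embed(content))

class DeveloperView:
    """A per-developer overlay: local edits shadow the shared copy at query time."""
    def __init__(self, shared: SharedIndex):
        self.shared = shared
        self.local: dict[str, tuple[str, list[float]]] = {}

    def upsert_local(self, path: str, content: str) -> None:
        self.local[path] = (content, embed(content))  # never visible to teammates

    def query(self, text: str, k: int = 3) -> list[str]:
        q = embed(text)
        merged = {**self.shared.files, **self.local}  # local edits win, no duplication
        ranked = sorted(merged.values(),
                        key=lambda cv: sum(a * b for a, b in zip(q, cv[1])),
                        reverse=True)
        return [content for content, _ in ranked[:k]]

if __name__ == "__main__":
    shared = SharedIndex()
    shared.upsert("db.py", "def connect(url): ...")
    me = DeveloperView(shared)
    me.upsert_local("db.py", "def connect(url, timeout=5): ...")  # my feature branch
    print(me.query("call connect with a timeout"))
```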

Nathan Labenz: (30:21) It's fascinating. I mean, I guess you started in 2022. Is that right?

Guy Gur-Ari: (30:26) Yes.

Nathan Labenz: (30:27) Yeah. So I want and it may or may not be different today. I wonder, you know, this sometimes just strikes people as crazy when I float ideas like this, but that almost sounds like a product unto itself. Have you thought about that?

Guy Gur-Ari: (30:42) It's come up. The thing is, it feels like when you build an AI lab and then an AI product on top of that, you run into many things that can become a product on their own. And one of the challenges is to stay focused and have a concrete vision of what we're trying to accomplish. So this comes up in, oh, this could be a product. This comes up in, oh, should we fork? And then, well, who are our users, really? Who are we catering to? And are they gonna wanna fork, or are they gonna prefer their IDE? There are questions like this that come up all the time, especially with something like AI where it's a completely new technology that keeps improving rapidly and you keep kind of having to keep up with what's happening and make the right bets. Short story, yes. I think this could be a product, but we're trying to stay focused on building the best AI assistant we can for developers.

Nathan Labenz: (31:36) Hey. We'll continue our interview in a moment after a word from our sponsors.

Nathan Labenz: (31:42) Being an entrepreneur, I can say from personal experience, can be an intimidating and at times, lonely experience. There are so many jobs to be done and often nobody to turn to when things go wrong. That's just one of many reasons that founders absolutely must choose their technology platforms carefully. Pick the right one and the technology can play important roles for you. Pick the wrong one and you might find yourself fighting fires alone. In the ecommerce space, of course, there's never been a better platform than Shopify. Shopify is the commerce platform behind millions of businesses around the world and 10% of all ecommerce in The United States. From household names like Mattel and Gymshark to brands just getting started. With hundreds of ready to use templates, Shopify helps you build a beautiful online store to match your brand's style, just as if you had your own design studio. With helpful AI tools that write product descriptions, page headlines, and even enhance your product photography, it's like you have your own content team. And with the ability to easily create email and social media campaigns, you can reach your customers wherever they're scrolling or strolling, just as if you had a full marketing department behind you. Best yet, Shopify is your commerce expert with world class expertise in everything from managing inventory to international shipping to processing returns and beyond. If you're ready to sell, you're ready for Shopify. Turn your big business idea into cha ching with Shopify on your side. Sign up for your $1 per month trial and start selling today at shopify.com/cognitive. Visit shopify.com/cognitive. Once more, that's shopify.com/cognitive.
Nathan Labenz: (33:37) It is an interesting time for business. Tariff and trade policies are dynamic, supply chains squeezed, and cash flow tighter than ever. If your business can't adapt in real time, you are in a world of hurt. You need total visibility from global shipments to tariff impacts to real time cash flow, and that's NetSuite by Oracle, your AI powered business management suite trusted by over 42,000 businesses. NetSuite is the number 1 cloud ERP for many reasons. It brings accounting, financial management, inventory, and HR all together into 1 suite. That gives you 1 source of truth, giving you visibility and the control you need to make quick decisions. And with real time forecasting, you're peering into the future with actionable data. Plus with AI embedded throughout, you can automate a lot of those everyday tasks, letting your teams stay strategic. NetSuite helps you know what's stuck, what it's costing you, and how to pivot fast. Because in the AI era, there is nothing more important than speed of execution. It's 1 system, giving you full control and the ability to tame the chaos. That is NetSuite by Oracle. If your revenues are at least in the 7 figures, download the free ebook, Navigating Global Trade, 3 Insights for Leaders at netsuite.com/cognitive. That's netsuite.com/cognitive.

Nathan Labenz: (35:02) Are there any, like, you know, I'm guessing you probably haven't kept up with, you know, the evolution of other vector databases, but for people who are trying to pick one, you know, because so many people right now are at the stage of like, either we're, you know, embarking on a sort of RAG app for our business, you know, probably for internal use or maybe we made one and it's like not quite working well enough, you know, and we wanna like take some, you know, next level step with it. Are there any sort of general guidelines that you would give people for how to make this part of the system work? One that I have in mind, and it sounds like you kind of have a version of it, is that I personally would, I think, almost always insist on some sort of hybrid, like structured query plus vector, as opposed to, you know, at the beginning of this like RAG wave, people were just doing pure vector search and that seemed to be, you know, kind of a mess. So having some ability to do, like, a classic SQL-style WHERE clause along with the vector similarity or whatever seems important to me, but I wonder, you know, what would your sort of lessons or guidance for the masses be based on all this experience?
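
For readers who want to see the hybrid pattern Nathan is describing, here is a small sketch that narrows candidates with a SQL-style WHERE clause over chunk metadata and then ranks the survivors by vector similarity. The schema, metadata fields, and toy embedding are illustrative assumptions, not any particular product's implementation.

```python
import math
import sqlite3

def embed(text: str, dim: int = 128) -> list[float]:
    # Toy hashed bag-of-words embedding, standing in for a real embedding model.
    vec = [0.0] * dim
    for tok in text.split():
        vec[hash(tok) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE chunks (id INTEGER PRIMARY KEY, repo TEXT, lang TEXT, body TEXT)")
rows = [
    ("payments", "python", "def charge_card(token, amount): ..."),
    ("payments", "go",     "func ChargeCard(token string, amount int) error { ... }"),
    ("frontend", "python", "def render_receipt(order): ..."),
]
conn.executemany("INSERT INTO chunks (repo, lang, body) VALUES (?, ?, ?)", rows)
vectors = {rid: embed(body) for rid, body in conn.execute("SELECT id, body FROM chunks")}

def hybrid_search(query: str, repo: str, lang: str, k: int = 2) -> list[str]:
    # 1. Structured filter: a classic WHERE clause over metadata.
    candidates = conn.execute(
        "SELECT id, body FROM chunks WHERE repo = ? AND lang = ?", (repo, lang)
    ).fetchall()
    # 2. Vector ranking among the survivors.
    q = embed(query)
    ranked = sorted(candidates,
                    key=lambda row: sum(a * b for a, b in zip(q, vectors[row[0]])),
                    reverse=True)
    return [body for _, body in ranked[:k]]

print(hybrid_search("charge a credit card", repo="payments", lang="python"))
```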

Guy Gur-Ari: (36:17) Yeah. So I would treat this as a research problem and start simple, I mean, unless there are, like, pretty clear engineering requirements that would preclude that. Like, we have, I'd say, pretty special requirements because it all has to be low latency and so on. I don't think most RAG implementations need all of that. And so I would probably start with some off the shelf vector database, and I would focus more on the quality. And I think for the quality, one thing that's pretty important is to have an evaluation dataset that you trust. It doesn't have to be a huge dataset. You can start with even 10 to 20 samples labeled by hand. That's how we start most projects. Actually, most research projects will start with collecting 10 to 20 samples labeled by hand and then start with some baseline, take an off the shelf retriever, whatever is easiest to use, run it on your evaluation, and get a baseline of how are we doing. Are we solving 20% of samples? Are we solving 80% of samples? Probably it's gonna be somewhere in between. Is that good enough? And start iterating from there and hill climbing on your evaluation dataset. And when the evaluation gets saturated, so basically when you've managed to solve it, expand it. Add more samples, make them more diverse, make them harder. I think coming up with good evaluations and being diligent about running those evaluations is, I would say, in some sense, one of the hardest things to do in research, not because the work is so hard, but because it can be pretty tedious. But this is the way to get to good results. And so starting with that, things become straightforward in terms of, oh, should we just do vector? Should we do structured queries? Well, let's try it. Everything becomes an experiment. Let's try it on the eval set, and the eval set will tell us because we're basically reducing the problem to hill climbing on an eval set. That is the ideal situation. I will say, on the pure vector question, let's say it like this. Real world retrieval systems are almost never a single thing. It's almost never, oh, I'll just do embeddings and I'll work on the embeddings really hard and I'll get the best embeddings and they will solve the problem. That almost never happens. It's usually a mix of different techniques. So vector, structure, it could be any other signal that you can bring to bear on the problem. And the models today are good enough that you can actually throw a lot at them in the context, and they will kind of deal with it. And so, in some sense, recall becomes more important than precision. You wanna make sure that the right chunks are in there in the context. That's really, today, with modern models, the thing to prioritize. Now I described, like, the ideal situation where you can kind of come up with a dataset that you trust, and all you're doing is hill climbing on that dataset. It's very important to start like that, but at some point, necessarily, your evaluation dataset is going to diverge from optimal user experience. And again, this has happened in every one of the projects, I think, that we've done. It's really hard to capture user experience and map that to one number, not just because there are multiple axes, but also because we don't really know how users use AI products. There's a whole distribution of what they put into the prompt box. There's a whole distribution of what they expect to get out. You can't really boil it down to a number.
And so once you have something, dogfooding is crucial for understanding where you are. And then once you have users, user feedback is crucial. You have to take all these things into account. So I would recommend starting with an eval set, but then understanding that you also need these other sources of feedback to iterate. So that, I think, is kind of a quick summary of best practices for how to go about this. Basically, the place you wanna get to is, can you reduce all these questions of what to do and turn them from philosophical questions into experimental questions that you go test? Then that's when you can iterate and really move fast.
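
A minimal version of the evaluation loop Guy describes, a small hand-labeled set, a recall-at-k metric, and a before-versus-after comparison, might look like this. The sample queries and the two stand-in retrievers are placeholders; the point is the shape of the harness, not any specific retrieval technique.

```python
from typing import Callable

# 10 to 20 hand-labeled samples: (query, id of the chunk that should be retrieved)
EVAL_SET = [
    ("where do we validate webhook signatures", "webhooks.verify_signature"),
    ("helper that retries failed http calls",   "http.retry_request"),
    ("function that renders the invoice pdf",   "billing.render_invoice"),
    # ... extend and diversify once this set is saturated
]

def recall_at_k(retriever: Callable[[str, int], list[str]], k: int = 10) -> float:
    hits = 0
    for query, expected_id in EVAL_SET:
        if expected_id in retriever(query, k):
            hits += 1
    return hits / len(EVAL_SET)

# Two stand-in retrievers to compare; in practice these might be an
# off-the-shelf embedding model versus a tuned or hybrid setup.
def baseline_retriever(query: str, k: int) -> list[str]:
    return ["http.retry_request", "billing.render_invoice"][:k]

def candidate_retriever(query: str, k: int) -> list[str]:
    return ["webhooks.verify_signature", "http.retry_request",
            "billing.render_invoice"][:k]

if __name__ == "__main__":
    for name, r in [("baseline", baseline_retriever), ("candidate", candidate_retriever)]:
        print(f"{name}: recall@10 = {recall_at_k(r):.2f}")
```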

Nathan Labenz: (40:50) Yeah. That's great. It is striking to me. And for whatever reason, I find myself doing more projects where there isn't quite a ground truth that's so easy to hill climb on. Like, with my company, Waymark, we do video creation for small businesses, and there's not, like, a single answer that's like, you know, what is the right or best script or voice over script, you know, or whatever, or selection of images for this particular small business. There's definitely better and worse. And, you know, sometimes it's, like, very obvious, and other times it's the subject of disagreement. We've definitely had plenty of cases where we, like, ask 2 different people, and they each see, you know, a different one as better. It's not, like, unanimous in most cases. But from the perspective of having these sort of almost irreducibly vibey tasks, the idea of being able to just, you know, climb a hill is definitely enviable and quite attractive. But in both cases, I do think it's really important for people to keep in mind that you can start, and you should start, with like a pretty modest sized dataset. I have a presentation that's much more about the social side than the technical side of like getting your team on the same page on what 10 instances of a task that are really well done look like. And it's amazing to me how often, like, that ends up becoming the stumbling point. I think it is often because of what you said, that it's just kinda tedious. And, you know, they also, like, don't have any chain of thought, you know, which is sometimes really helpful if you wanna do a, you know, supervised whatever. I'm on the verge of the soapbox. But, yeah, 10 examples, people, will take you far. If it's objective, great. Even if it's just, like, a vibe task and you're just demonstrating what a job well done looks like, 10 examples. You know? It's the first place to get to. And then from there, you know, the world can open up, like, a lot, lot more.

Guy Gur-Ari: (42:42) Yeah. Just to add to that, 100%. So with these few examples, the big advantage of having so few before you go to, like, hundreds, right, if you can, is that you become very familiar with them. You can hold them all in your head, and so the labels, the ground truth, is less important. I'd say you can still hill climb on 10 samples even if the evaluation procedure is completely manual. So I just trained a new model or I have a new RAG setup. I will run it through the 10 samples. I will run the before model and the after model, and I will compare them by hand. I don't have to have a number. I can boil it down to a number, but I can also go based on vibe. So I agree with you that the real minimal thing to start with is the 10 samples. Yeah, the number can come later. I totally agree with that.

Nathan Labenz: (43:34) One other thing that you said that I thought was worth just reemphasizing too is that the key thing is to make sure the model has what it needs. Worry less about other considerations like distracting it with wrong information. You know? And this is, of course, evolving quickly because that used to be, I think, a much bigger problem not too long ago. And as you said, modern models, you know, that word modern is important. Everybody should be using modern models, but like our expectations aren't necessarily always keeping up with what the latest models can do. One way I kind of generalize that for people is just like turn your hyperparameters up, you know, in general. Like you have a choice typically in a lot of these RAG type setups of how many chunks am I going to take, you know, am I going to take the top N chunks, or if I'm going to expand out from a chunk, like, how much should I expand out? And I'm always kind of telling people from what I see developers doing, they're leaving all those sort of settings too low. The right thing to do is usually turn them up. Yes. That might make it slightly slower. It, you know, will make it a little more expensive. But, you know, at least I don't know if I can think of any exception where turning those things up didn't more than pay for itself, even with those marginal cost increases, in the sense of the time savings that you get from getting to something that's, like, working, you know, better. So I don't know if you have any, you know, exceptions you would put on the turn your hyperparameters up rule of thumb. Or, I mean, from the blog, it does seem like you guys are definitely like, how can we sort of jam in and use all these things, you know, to roughly the maximum? But, yeah, interested in your take on, you know, any nuances you would add to my simple rule.

Guy Gur-Ari: (45:18) No. I fully subscribe to that rule. You just have to be aware of the trade off between, yeah, latency, cost, and quality. It's really as simple as that. If you're okay with the extra latency and cost, in a RAG context, it's just better to show more because these days, the models have been trained to deal with it. It didn't use to be the case, like, definitely not 2 years ago, maybe not even 1 year ago, I'm not sure, but roughly around that time is when models got the RAG training to be able to deal with a lot of distracting information. And so that scales really well, and I expect we'll just continue scaling because the attention mechanism in transformers is basically built to do that. It's basically built to sift through all the noise and focus on the relevant parts. And so with sufficient training, it makes sense that it will work. I think I've also seen this bias toward putting less in there, especially if you're coming from a background of using models the way they were 2 years ago or before. There you had to be a lot more careful with your tokens, but that has changed. So if we're talking about RAG, then, yeah, I think that's just the right answer. And it's also certainly way easier on the research side, because improving the recall, so that is improving the ability of the model to find the right chunk within the first, like, 50 or 100 or something like that, is doable. Improving the recall if you have fewer chunks, like, if it needs to land in the top 10 or something, or the top 5, that task becomes exponentially harder, basically, the less context you have. So if you can give it the room to do it, then the research task becomes much easier. I can say the place where adding more context doesn't seem to scale yet is with instructions. So giving the model tons and tons of instructions that you expect it to follow, in my experience so far, this actually doesn't scale that well, and it will start ignoring instructions if there's too much in there, skipping steps you asked it to do, things like that. But that's not a RAG problem. That's just a different kind of prompt scaling problem where models are not yet good enough.
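
As a toy illustration of the "turn your hyperparameters up" trade-off, the sketch below sweeps the number of retrieved chunks and prints how recall rises while prompt size, a rough proxy for latency and cost, grows with it. The retriever, corpus size, and token budget are invented numbers for illustration only.

```python
import random

random.seed(0)
CHUNKS = [f"chunk_{i}" for i in range(500)]
TOKENS_PER_CHUNK = 150  # rough budget assumption

def toy_retriever(query: str, top_k: int) -> list[str]:
    # Pretend the one relevant chunk ("chunk_0") sits somewhere in a noisy ranking.
    ranking = CHUNKS[:]
    random.shuffle(ranking)
    return ranking[:top_k]

def recall(top_k: int, trials: int = 200) -> float:
    hits = sum("chunk_0" in toy_retriever("query", top_k) for _ in range(trials))
    return hits / trials

for top_k in [5, 10, 25, 50, 100]:
    print(f"top_k={top_k:>3}  recall~{recall(top_k):.2f}  prompt_tokens~{top_k * TOKENS_PER_CHUNK}")
```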

Nathan Labenz: (47:35) Yeah. Okay. That's a good point. How about training? You mentioned you've trained your own retriever models. This is something where, I think, I don't wanna bias your answer too much, but my sense is that a lot of software engineers, AI engineers are sort of attracted to the notion of, you know, well, and of course, they're not gonna pretrain from scratch, but, you know, we'll grab some off the shelf thing and we'll, you know, customize it for our own purposes. I wonder what guidance you would give people on when that is in fact a good idea. I once made a meme, you know, the bell curve meme of, like, you know, what's genius and, you know, what's dumb and what's in the middle. And my thing was on the extremes is just like, just use OpenAI embeddings. And in the middle was, you know, we'll do this, you know, complicated thing. We'll fine tune our own blah blah blah blah blah. It's worth it for some. You're in that situation where, you know, you have, like, a lot of resources and a very ambitious project. But where do you think it starts to become worth it to take on that sort of challenge versus just like, you know, using the best off the shelf thing you can find?

Guy Gur-Ari: (48:48) Yeah. That is a very task dependent question. Yeah, I think, I mean, anytime you do research, I think there is just a general human tendency to reach for complicated solutions too quickly. My recommendation would be to try hard to bias towards simplicity. Like really the simpler the better, and rely on your evaluation, either, like, vibes based evaluation or numerical evaluation, to guide you. Yeah, I'd say, again, it's hard to give general answers, right? Because the answers are going to be task specific. So I'd rely more on establishing a solid process for finding the right answers for your particular use case. And that starts with an evaluation set that you're comfortable with and that you mostly trust and that you can run through different iterations of your model or your system to test. The other thing that I think is important to optimize for is iteration time because the faster you can run experiments, the more likely it is you'll find something that's good enough or something that's better than what you currently have. So experimental iteration time is something that's very much worth thinking about. Taking experimental time down from hours to minutes can have a lot of impact on not just how fast you get to a solution, but actually do you even get to a solution. Because cranking through 100 experiments versus cranking through 10 experiments, the chances of you finding the right thing in those 100 experiments is just much higher. It's kind of like the RAG problem. Like, what's your chance of finding the right chunk in the top 100 versus the top 10? It's just much higher. It's also like that with experiments. So if you can afford to run 100 experiments, then, yeah, you're gonna try 100 different simple things. Maybe you're gonna try SQL style queries. Maybe you're gonna try 5 different open source models. Maybe you'll try the OpenAI embeddings plus other things, and you'll find that one of them, for some reason that was really hard to predict, actually works better for your use case. So I'd prioritize experimental iteration time and being able to actually trust the result of an experiment with an eval set to tell you the answer. And once you've tried some simple things and nothing seems to work, I think another thing that process gives you is a kind of feel for, it looks like nothing out there is really doing what I want. Maybe I should start thinking about fine tuning an open source model. Or, oh, okay. This is like it's not exactly there, but it's kind of close. It's not that far. Probably, if I keep going this way, I will be able to make it good enough without fine tuning. That's the sort of information you get by doing a lot of experiments. So my suggestion would be to do that and then let the experiments kind of tell you which way you need to go and when.
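
Here is a hedged sketch of that "many cheap experiments" workflow: every candidate configuration runs against the same small eval set, each run is timed, and a simple leaderboard falls out. The candidate names and their simulated hit rates are placeholders for whatever off-the-shelf retrievers or prompts are actually being compared.

```python
import random
import time

random.seed(1)
EVAL_SET = [(f"query {i}", f"expected {i}") for i in range(20)]

# Placeholder candidates: each "retriever" is simulated by a fixed hit rate.
CANDIDATES = {
    "openai-embeddings":      lambda query: random.random() < 0.55,
    "open-source-model-a":    lambda query: random.random() < 0.60,
    "open-source-model-b":    lambda query: random.random() < 0.45,
    "embeddings+keyword-mix": lambda query: random.random() < 0.70,
}

def run_experiment(predict) -> tuple[float, float]:
    # Score one candidate on the eval set and record how long the run took.
    start = time.perf_counter()
    score = sum(predict(query) for query, _ in EVAL_SET) / len(EVAL_SET)
    return score, time.perf_counter() - start

if __name__ == "__main__":
    results = [(name, *run_experiment(fn)) for name, fn in CANDIDATES.items()]
    for name, score, seconds in sorted(results, key=lambda r: r[1], reverse=True):
        print(f"{name:24s} eval={score:.2f}  wall_time={seconds * 1000:.2f} ms")
```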

Nathan Labenz: (51:47) Okay. Cool. That's great. There is so much that I wanted to cover, and I don't think we're gonna get to all of it. So I'm gonna have to start to pick and choose, and then I'll refer folks to the blog for some deeper dives on stuff we don't get to. But what I definitely wanna cover is reinforcement learning from developer behaviors. Obviously, reinforcement learning on language models in general is having a moment. The floor is yours. Tell us about reinforcement learning from developer behaviors.

Guy Gur-Ari: (52:15) Yeah. So one advantage that we have as a company that both does research and builds a product for users is that we are very close to our users. We get feedback from them on Slack, on Discord, and they also send us their data. Now, for enterprise customers, we do not look at that data. Of course, everything is audited and behind access controls and so on. But we do have a free community tier. That's for anyone who wants to use Augment. It could be on open source, it doesn't have to be on open source, but it's for anyone who's comfortable with us looking at their data and also using it for improving our own models. So there's a very clear separation between those 2 things. But on our free tier, certainly we find value from that data, because one of the things that's just universally challenging about building AI products is that we don't really know the, let's call it, the input distribution or the task distribution. What do users wanna do? How are they trying to use the product? What do they expect to get out of it? And by collecting this data from the free tier, we get a glimpse into that. And in fact, for coding, we get more than a glimpse. That is one of the nice things about coding: it's quite different from a chat interface. Right? In a chat interface, the user asks a question or they assign some task, they give an instruction, they get an answer, they can continue steering, but we don't know what the ground truth answer was. By comparison, if they're working on code in their IDE and we follow what's happening in the IDE, we eventually know what they were trying to do, because this is where they actually work. And the way this connects to reinforcement learning is that the idea with reinforcement learning is that you are not just training the model by showing it examples of what to do. You're actually showing it contrasting examples. Every sample contains an input, and then it contains a better and a worse output. And the model learns from that contrast to do better over time. This is a very powerful paradigm because it means that it's not just that there are correct and incorrect answers; there are better and worse answers. And you see that in coding just like anywhere else. You know, the answer could be correct in the sense that the algorithm is correct, but it could have the wrong style. Maybe that's not what the developer prefers, or maybe the style does not align with the rest of their code base. Right? There are actually multiple axes along which a sample can be better or worse, and that's the kind of signal that reinforcement learning tries to capture. And so we've applied that technique initially to the completions feature. That was kind of our first reinforcement learning project, where we use examples from the model and what we know from the user in order to improve the model and align it better with what users expect, through reinforcement learning. And we ended up with a better completion model because of that. Yeah. And I mentioned that we train retrievers, but we actually also train the generation models that we use for completion and next edit. Yeah.
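
Guy doesn't specify which algorithm Augment uses for this. As one common way to realize "learn from a better and a worse output for the same input," here is a minimal, hypothetical sketch of a DPO-style pairwise preference loss; the choice of DPO and the toy numbers are illustrative assumptions, not a description of Augment's training setup.

```python
# Minimal sketch of a pairwise-preference (DPO-style) objective, shown only to
# illustrate "learn from a better and a worse output"; it is NOT a claim about
# how Augment actually trains its completion models.
import torch
import torch.nn.functional as F

def preference_loss(logp_better: torch.Tensor,
                    logp_worse: torch.Tensor,
                    ref_logp_better: torch.Tensor,
                    ref_logp_worse: torch.Tensor,
                    beta: float = 0.1) -> torch.Tensor:
    """Each tensor holds per-example sequence log-probs for the same prompt:
    the preferred (better) completion and the dispreferred (worse) one,
    under the policy being trained and under a frozen reference model."""
    # How much more the policy prefers "better" over "worse", relative to the reference.
    policy_margin = logp_better - logp_worse
    ref_margin = ref_logp_better - ref_logp_worse
    # Push the policy's preference margin above the reference's.
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()

# Toy usage with made-up log-probabilities for a batch of 3 preference pairs.
loss = preference_loss(
    logp_better=torch.tensor([-12.0, -8.5, -20.1], requires_grad=True),
    logp_worse=torch.tensor([-13.4, -9.0, -19.8], requires_grad=True),
    ref_logp_better=torch.tensor([-12.5, -8.7, -20.0]),
    ref_logp_worse=torch.tensor([-13.0, -9.1, -20.2]),
)
loss.backward()  # gradients flow into whatever produced the policy log-probs
```

The property that matches the description above is that the objective never needs an absolute "correct" answer, only a preference between two completions of the same prompt.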

Nathan Labenz: (55:33) There's a lot that I'm interested in kinda digging into

Guy Gur-Ari: (55:37) Mhmm.

Nathan Labenz: (55:38) More deeply there. I guess, for one thing, have you benefited from DeepSeek and other recent algorithm releases? I mean, I think it's safe to say that GRPO has broadly, like, blown a lot of people's minds. Did it blow your minds, or did you feel like you already had a pretty good read on what was working, such that it wasn't such a revelation for you?

Guy Gur-Ari: (56:03) The algorithm itself, I'd say, I think there were known problems with existing RL algorithms that every subsequent iteration kind of addressed. So, yeah, I don't think GRPO for us was a revelation; it was more of an iterative improvement, I think. The DeepSeek work was remarkable in the sense that, yeah, they actually implemented chain-of-thought reasoning training. It was super impressive work and a very nice paper. I wish they shared more details on how they did it, but I still enjoyed reading the paper. We do benefit a lot, not from that particular work, but I can say we don't train models from scratch. We made a bet very early on that open source models would rapidly become better, and that was at a time when open source models were really not good. This was long before Llama. I previously worked at Google, worked on some of the training of large models, and so it was clear that this is something open source could do, because at this point there is almost a playbook for how to do it, first at the level of training base models and now also beyond base models, at the level of actual instruction-tuned models. If you have the resources and the people who know the basic techniques, which are for the large part in the literature, like, you can read papers and learn how to do it, then you can train very good models. So we made that bet early on. We don't train from scratch. We do a lot of post-training on models for retrieval and generation. And so we definitely benefit from open source models coming out, and we generally try to keep basing our models on the best available open source model that's out there. Yeah.

Nathan Labenz: (57:53) So this particularly caught my interest, this reinforcement learning from developer behaviors, because I've been kinda looking for something like this to emerge for a while. My sense is that the compute requirements for the reinforcement learning aren't, like, so crazy. And the datasets don't even have to be so huge. It seems like a lot of product user bases, or just communities in general if they're passionate about a certain subject, could gather enough feedback or behaviors from people to power this sort of thing. I haven't seen too much of it, and I was wondering why it wasn't happening, and now it seems like it is happening. I wondered, though, if you could shed any light on where you think it's going. In particular, one might think that the fact that there's not, like, a true, absolutely canonically right answer, as there is in a math problem with a numerical answer, would suggest that maybe this process would sort of top out at human level and might not go past human level, because how would it go past human level if it's learning from humans? The flip side of that is that it would seem like this approach would be very extensible to reinforcement learning from lawyer behaviors or reinforcement learning from doctor behaviors, basically anything where you can gather enough data that's pretty trusted, even if it's not absolute bedrock ground truth. So what do you think? Is there a top-out that we should be thinking about? And is there any sort of limit on the breadth of how far these approaches could generalize?

Guy Gur-Ari: (59:36) Yeah, so I guess, well, a few thoughts. So first, if we think about the trends we've seen from the beginning, you know, starting with GPT-2 and then GPT-3 and the scaling laws trend, what was the trend? There's a whole lot of data out there on the Internet; let's get as much of it as we can, process it properly, clean it up, filter it, because there's also a lot of garbage out there. But basically, that was the first resource that large language model training reached for. I think at this point, that data resource is more or less exhausted. And so what can we do? There are 2 things we can do, I think, roughly, or 3 things. Let's say 3 things we can do. One is synthetic data. We know that we can generate more data out of these models to train new models. So that's certainly one approach. Another approach is to pay contractors to give us the data that we need. Right? And that's how most RLHF works. That's how you train something like ChatGPT, essentially, if you don't have any other data sources. And the third is user data. So if you have real user data, if you can figure out how to use data from users who are using your product to do real work, that's in some sense the holy grail, because that is the closest you're ever gonna get to the actual distribution of what users are trying to do, because it is what they're actually trying to do. So there's no distribution gap in that case, if you can do it well, between the data you're training on and the data you're gonna encounter in the wild at test time. Now, I think the reason we haven't seen more of that is that there aren't that many products that are amenable to it. You mentioned going beyond coding, to doctors and lawyers: what do they do where you can actually get the ground truth? If they're editing a document, you can get the ground truth just like you can from an IDE. If they're using a chat interface, it's a lot harder to get the ground truth. Maybe you can guess at the ground truth, because, you know, maybe they tell you, no, no, no, that's not what I meant, do this or that. But there's a lot more work that you have to do to extract the ground truth from something like that. Still, I do believe that as we exhaust the available information on the Internet, user data is just gonna become a lot more valuable, and people will pay more attention to it. In terms of topping out, I would just say that everything we've done so far is train models based on human data. There's nothing really new here. All the data from the Internet is human-generated data. RLHF is human-generated data. We can automate some of it, so we can throw models in there and let humans supervise at a higher level, but human supervision is always there so far. In the future, if we wanna break away from that, we need some other source of signal. Right? We need some other source of reward for these models. That's where I think code is probably the place where it's gonna come first, because the thing that's special about code is that you can execute it and get feedback from that. And so I can see how in the future we'll be able to do more of that, and some of it is already happening. If you look at the way DeepSeek was trained, they don't say a lot about how they did it, but they do get feedback from code execution for RL purposes. It's a very natural fit. So for code specifically, I think we'll be able to do a lot more of that. For other domains, we'd have to find something else.
And so, you know, if you're asking the model to write a story or a poem or an essay, how are we gonna automatically assign a reward to that if we don't already have a better model that can judge what this model did? Right? That's where I can't think of a way to go beyond human capabilities. But when you do have a ground truth that's separate from humans, code execution, or maybe for science experimental validation, things like that, then we'll be able to, yes, at some point, shift away from humans and rely on these other reward signals.
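
How any particular lab wires execution feedback into RL isn't public. As a generic illustration of the idea that you can execute code and turn the result into a reward, here is a minimal, hypothetical sketch that runs a candidate solution against unit tests in a subprocess and maps the outcome to a scalar; the test format and reward values are assumptions, and a real training setup would sandbox the execution.

```python
# Minimal sketch of turning code execution into a scalar RL reward.
# Reward values and the test format are illustrative assumptions only.
# Real setups would sandbox execution rather than running it directly.
import subprocess
import sys
import tempfile
import textwrap

def execution_reward(candidate_code: str, test_code: str, timeout_s: int = 10) -> float:
    """Run the candidate plus its tests; 1.0 if the tests pass, 0.0 on failure,
    and a small penalty for programs that do not terminate in time."""
    program = candidate_code + "\n\n" + test_code
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(program)
        path = f.name
    try:
        result = subprocess.run(
            [sys.executable, path], capture_output=True, text=True, timeout=timeout_s
        )
        return 1.0 if result.returncode == 0 else 0.0
    except subprocess.TimeoutExpired:
        return -0.1  # discourage non-terminating programs

# Toy usage: a model-generated function plus a tiny assertion-based test.
candidate = textwrap.dedent("""
    def add(a, b):
        return a + b
""")
tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0"
print(execution_reward(candidate, tests))  # prints 1.0 for this candidate
```

The appeal of this kind of signal is exactly what's described above: pass/fail comes from running the program, not from a human label, so it doesn't top out at whatever the human annotators could judge.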

Nathan Labenz: (1:03:40) Have you seen any reward hacking in your reinforcement learning from developer behaviors?

Guy Gur-Ari: (1:03:48) Oh, zero, in our reinforcement learning. No. No. I can't think of an example. There was nothing as spicy as that. The most common mistake was, yeah, just not understanding what the user wanted. Yeah.

Nathan Labenz: (1:04:06) Okay. Well, keep an eye out.

Guy Gur-Ari: (1:04:08) Yes.

Nathan Labenz: (1:04:08) We're all looking for reward hacking. We should be these days, I think.

Guy Gur-Ari: (1:04:12) Yeah. Yeah. Yes.

Nathan Labenz: (1:04:15) So let's see. Just prioritizing kind of triaging a little bit.

Guy Gur-Ari: (1:04:20) Maybe a minute. Shorter answers also if that helps.

Nathan Labenz: (1:04:22) No. You're doing great. Maybe a minute on just the economics of businesses like this. So, you know, it's public, and again, you can go into as much depth as you want, but it's public information that you guys have raised 250-ish million dollars. And I looked on LinkedIn and saw 100 employees. Perhaps not everybody's listed there. But if I were to just do, like, kind of traditional SaaS app math and take a multiplier of employees times some Bay Area salary and then try to calculate a runway, I get to, like, a really long runway. And that's before any revenue, and it sounds like there's quite a bit of revenue. So I guess I'm wondering, what are you doing with the money? Are we burning a lot on training models? If you're not pretraining, it doesn't seem like that would be it. Are you, like, subsidizing users? I mean, you are subsidizing open source. So, yeah, to the degree you can, I'd be really interested to hear about the economics, and then a slight extension of that would be: is there a 10x more expensive version of the product that you could dream of, or imagine what that might look like?

Guy Gur-Ari: (1:05:34) Yeah. It's a great question. So, yeah, one thing we've learned is that AI is different. We're all used to thinking of SaaS businesses where you're developing in your basement, you set up a Google Cloud project, you start serving users, and it's all pretty cheap, I would say. Right? Your main cost is salaries and so on. But AI is more capital intensive. There is the training, but inference is also very expensive. So serving all those requests at every keystroke, and then the chat requests, and now the agents, it gets expensive quickly. Yeah, there's some amount of subsidizing users, like, in the free plan. And I think everyone in this space is trying to figure out the economic model right now. Because, on the one hand, usage is exploding, and I think it's going to accelerate dramatically, and we can talk about why that is. Models are getting cheaper, but they're not getting cheap; it's not matching the pace at which usage is growing in this space. And also, a given model is getting cheaper, but people always wanna be on the latest model, and that's not getting cheaper as quickly. So with all these factors combined, running an AI company, whether you're training your own models or not, can get pretty capital intensive. So that's the short answer. Now, on a more expensive product: just look at the shift as we're launching our agent feature. The cost of agents compared to chat, for example, is a substantial jump. Because with agents, you give it one instruction and then it goes, and that will probably generate 10 or more language model calls, including large calls for editing files and running whole commands and then parsing their outputs and doing all those things, all from a single user instruction. And that's at a point where we are giving users a single agent in their IDE. So it's just going from chat to one agent. On the other hand, the value is clearly there. I can say, personally, I've been using our agent, and I have not written a line of code in several months. The agent has written a lot of code; I personally have not had to. The value is very obvious with these things, and it's super early. I expect usage of agents to explode over the next year, and with it the cost. So if you're talking about a large jump in cost going from chat, let's say, to agents, there's going to be, I expect, a jump that's at least as high once we're able to unlock the full value out of agents. And I don't think that cost decreases in models are going to keep up with that. So cost does become a challenge. The whole thing is just very capital intensive, and cost is actually a major factor, unlike traditional SaaS businesses, I think, where it's not as much of a factor.

Nathan Labenz: (1:08:44) Are you managing your own clusters? Are you doing, like, the actual, you know, buy up all the GPUs and manage them in house? You're not, like, leasing or renting from some other...

Guy Gur-Ari: (1:08:58) No. We are leasing. We are leasing. Yes. We're not in the business of managing data centers. We are leasing the GPUs. Yes.

Nathan Labenz: (1:09:04) Yeah. Gotcha. So does that translate to a higher price point at some point in the future? My rule of thumb has been, I think, companies should expect to spend 1000 dollars a month on AI to augment their employees, no pun intended, in the not too distant future. I personally am, like, probably halfway there just with stuff that I've signed up for. And, you know, then your prices are 30 and $60 a month, and it feels like if I'm right, you should be probably, 5 to 10x-ing those prices, but, you know, I don't know. Is that where you think it goes or not?

Guy Gur-Ari: (1:09:42) I don't have definitive answers. To me, these are open questions. There's even a question of: should it be a fixed subscription price or more of a consumption model? So, you know, our current pricing model is actually a somewhat consumption-based model where we sell credits. If a developer uses the product in a given month, they consume a credit. But if they don't, then they don't consume a credit, which is different from the, I think, more common seat-based pricing where you just sell seats and you pay no matter whether users use it or not. So we already took a step in this direction of consumption-based pricing, which was meant to really align our interests with those of the users. Like, you use it, you pay; you don't use it, you don't pay. And because of the cost, my guess is we're probably gonna lean more heavily into that model, but I'm not sure yet. So this question of, is it gonna be 1000 dollars a month? Maybe we'll end up there, or maybe we'll end up with a different model that's more aligned with how users actually use it. What I can tell you is that there's a very wide distribution in how users use these things. There are absolutely users who will justify a 1000-dollar-a-month price point even today, and then there are users who don't; they just don't use it as much. And so I think we and everybody else are kind of trying to figure that out. Let me throw another thing in there. I think right now, we're all just thinking about user-driven agents, let's call them, or interactive agents, where the developer is kind of there. Maybe they go get a coffee and come back because it takes the agent a few minutes, but it's a few minutes, and the developer is staying up to speed with what the agent is doing. I think things are gonna evolve rapidly over the next year, and it's not even clear to me if that will continue to be the dominant use case. I'm pretty sure we will have agents that run for hours or overnight or over days to accomplish tasks. I'm pretty sure we will have agents that work on non-user triggers. Maybe it's API calls, or maybe it's an agent that goes and does code reviews for you, things like that that just run automatically. If you're in that world, then you're not even talking about per-developer pricing exactly anymore. Right? You're putting intelligence into a lot of tasks that are not triggered by the user, or maybe there's a wide variance in the cost of what the user triggered. I think the pricing model is gonna have to adjust to that, at least in the short term, until all this stuff becomes super cheap. So, yeah, it's a complicated question, and I don't have a good answer for it. All I can say is it's a good and complicated question that we're definitely thinking about.

Nathan Labenz: (1:12:17) Yeah. It feels to me like it's hard to go wrong when you're generally keeping alignment of interests with users in mind as a true north. The thing that I am always kind of allergic to is when I feel like the product is not performing as well as it could for me because I've got some fixed price and they're trying to keep my cost to them under that price to maintain a margin. And it sounds like you're not doing that, by basically accepting the fact that you'll have some thousand-dollar-a-month-cost users and figuring you'll sort that all out later. But, yeah, I definitely want to be able to be that thousand-dollar-a-month user, even if I do have to pay for it. What frustrates me is when I can't be, because I'm locked into a more conventional price point. So, okay. Time is short. Maybe 2 more questions if we can fit them both in. One is, you guys have a blog post with multiple predictions, but the one that jumped out most to me is why you think RAG will trump fine-tuning. And here, I wanted to just super quickly sketch an idea that I've been chewing on for what the drop-in knowledge worker of the future might look like and get your reaction to it. And then the last one is just the future of the software industry, like, what should junior developers do. So, the drop-in knowledge worker. We've covered the RAG stuff. What's hard for the models today? They don't have context. And I always feel bad for them in some ways, because when I'm searching through my Gmail or my drive or my code base, one huge advantage I have is I kinda know what's in there and I know when I've found it. In contrast, the models today just get what they get, and they sort of have to do the best with whatever is returned. Right? You can turn up hyperparameters and that helps, but they don't know, in general: have I found the right thing? Should I keep searching? And so I have this sense that continued pre-training, maybe that's one way to describe it, on a company's proprietary data would basically try to get to the point where the model knows the company from an inside perspective as well as the models today know the world at large. And then, you know, continue with your post-training, your behavioral refinements, but try to get to a point where the model knows: yes, I actually found what I'm looking for, this is the ground truth that I needed to go on this task, or not. And therefore, I'm gonna keep searching, maybe using different tools, until I actually get there. What do you think? Does that seem too far fetched, or how would you generally react to that vision of continued pre-training so the model knows, yes, I found it? Or maybe that is not necessary for some other reason that I don't see.

Guy Gur-Ari: (1:15:09) Yeah. So I can say there are a few challenges with that approach. One challenge with continued training is that training is very sample inefficient, in that you need a lot of data for the model to learn something. And typical company knowledge bases are too big to put in the context window, but they are not large in the sense of training datasets; they're actually typically pretty small. If you think of a typical code repository, it's not a lot of data to train on. And so if you want the model to pick up on what's there, you're probably going to need to do multiple epochs, train on it multiple times, but then you quickly overfit, which you also don't wanna do. You don't want it to just memorize what's in there; you want it to actually learn from it. So I'd say one challenge is that the amount of data for this to be effective is typically too small. Another challenge is keeping it up to date. With RAG, you can work hard and make your RAG solution near instantaneous, or at least reduce the delay as much as you want. With training models, there's more friction. If you wanna do it for email, for example, every user has their own email store. So if you're gonna be training a model for every user and keeping it up to date, again, there's not that much data, and the logistics of doing that separately for every user are tricky. I would say it used to be a question of, do you do RAG or do you do this? These days, I would honestly try to solve this problem with an agent that tries several approaches until it thinks it has found the answer. It doesn't have to be just: we'll do retrieval, and did it get it or not. You can do a lot more now. And we've actually built a lot of that into the product. The more advanced versions are not just a one-shot retriever; we do more to give you the best retrieval quality we can. Just in the sense of simplicity, models are so good now that I would reach for solutions like that and try them, definitely, before doing fine-tuning. Yeah.
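
Augment's implementation isn't described here. As a generic sketch of "an agent that tries several approaches until it thinks it has found the answer," the hypothetical loop below retrieves, asks a model to judge whether the results are sufficient, and otherwise reformulates the query and tries again; `search`, `llm_judge`, and `llm_rewrite` are stand-ins for whatever retrieval backend and model calls you actually have.

```python
# Generic sketch of an agentic retrieval loop: retrieve, let the model judge
# whether the results answer the question, and if not, reformulate and retry.
# The callables passed in are hypothetical stand-ins, not a real product API.
from typing import Callable, List, Optional, Tuple

def agentic_retrieve(question: str,
                     search: Callable[[str], List[str]],
                     llm_judge: Callable[[str, List[str]], bool],
                     llm_rewrite: Callable[[str, List[str]], str],
                     max_rounds: int = 3) -> Tuple[Optional[List[str]], str]:
    """Returns (chunks, final_query): the chunks judged sufficient, or None if
    every round came up short."""
    query = question
    for _ in range(max_rounds):
        chunks = search(query)                 # one retrieval attempt
        if llm_judge(question, chunks):        # model decides: "did I find it?"
            return chunks, query
        query = llm_rewrite(question, chunks)  # reformulate using what was seen so far
    return None, query                         # caller can fall back to a broader search
```

Compared to one-shot retrieval, the judge step is what gives the system a rough notion of "have I found the right thing, or should I keep searching," without any per-company fine-tuning.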

Nathan Labenz: (1:17:10) Yeah. Interesting. So, in short, you think you can basically get good enough performance without ever having a model that really knows, in a confident, intuitive sense like I do, that it actually has found the right thing.

Guy Gur-Ari: (1:17:23) Yes, I think so. That's what I'm seeing, at least based on how I interpret the evidence that we're seeing.

Nathan Labenz: (1:17:32) Yeah. Okay. Cool. Okay. Last one, and it's not an easy one necessarily. You know, we've got people talking about superhuman coders, you know, within this calendar year and the next year. I'm referring to, obviously, Dario there who said that repeatedly recently. I guess I wonder, like, do you buy that that soon or, you know, even if you extend the timeline a little bit? And if so, what do you think that means or what advice would you give to people that especially are, like, early in their software career today? From what I see on the Internet forums, it seems like people that are just coming out of school with a CS degree are sort of like, yikes. This is not what I thought I signed up for. And, you know, not everybody can, like, just pivot into being a machine learning all star. That's like a great option if it's open to you. But for the rest who are like, I did this because I thought I was gonna have, like, a stable career with like a, you know, solid income and like, you know, never have to worry about my employment status. Where do you think we're going and what advice would you give them for navigating the challenges that might be coming for them kind of soon?

Guy Gur-Ari: (1:18:41) Yeah. So first, I think it's good to separate the short term from the longer term. In terms of what Dario said, the way I understood it is: in the near term, you look at a new line of code and you ask, what actually generated this line of code I'm looking at? And his statement is that very likely it's going to be a model or an agent that did that rather than a developer. Yeah, I buy into that. Maybe not on his timeline; he said something like 3 to 6 months, and that's probably too quick given the adoption that we're seeing, but not, like, 3 years either, I don't think. Shorter than that. However, that doesn't mean that the model decided on its own what to do. It doesn't mean that it supervised its own output. It doesn't mean that it's fully autonomous. I expect that for a long time, there's still going to be a developer there steering the model. And I expect that because this is how I work, and this is how I see people who are picking up agents work. You look at their code: once they pick it up, the code is almost entirely generated by an agent. But if you took away the human, nothing good would happen. You wouldn't get anything useful out of it, because the models are nowhere near that good. They're not even good enough to say, here are the product requirements, go build this. We're definitely not there yet. So I buy into that statement, but that doesn't mean we don't need software developers, like, in the next year. Now, in terms of advice: I have 2 kids, they're 7 and 14, and we're having discussions with the older one on what to go for. My advice is to go for a career that's more tied to the physical world, could be mechanical engineering or robotics or something like that, where it feels like it will take longer to be disrupted, because it's very hard for me to predict what software development is gonna look like in, let's say, 3, 4, 5, 6 years. That's very hard. With the rate things are changing, I don't know where it's gonna land. I think we'll still need developers who understand the system, because if you're just vibe coding your enterprise software, you will run into trouble. I mean, I can already see it happening with the code that I'm writing. It will get better, but I don't think it's gonna get better at that scale. But then the question is still: how many developers do we need, and how much software do we need to write? And I don't know. So those are the discussions we're having with my older one. With my younger one, we have a bit more time to figure this out. Maybe at that point we have AGI and everybody can just do art. I don't know, but I'm glad we have a bit more time to figure it out with him. Yep.

Nathan Labenz: (1:21:21) Yeah. Well, the local artisanal economy, yeah, could be a beautiful future, as long as everybody has their basic needs met. How does this get operationalized for you guys in terms of your hiring? Like, are you hiring junior developers at all? Is there any on-ramp for somebody out of a CS program to get into a frontier company like yours?

Guy Gur-Ari: (1:21:46) Yes, there is, for sure. We look for excellence. We hire junior and senior developers. I think this is still a time where there's gonna be a learning curve in knowing how to extract the value out of these models. It's not that everyone is immediately productive; even if you're using agents, getting value out of them takes time, especially in an enterprise environment or in a large code base. We're small, I wouldn't call us an enterprise, but even in our code base, which is, let's say, small to medium size, using agents to navigate that code base requires some skill. So I think for a while there's going to be ramp-up time, and, as always, people with less experience are also quick to jump on new technologies. So I think we're gonna see a lot of that in the near future. But, yeah, the short answer is we certainly still hire junior developers.

Nathan Labenz: (1:22:42) Great. Anything else you wanna leave folks with before we break?

Guy Gur-Ari: (1:22:47) So Augment is out. It's really good at understanding your code base. So I encourage you to download it, give it a try, and really feel the power of an excellent AI assistant that fits into how you work.

Nathan Labenz: (1:23:01) And I would definitely encourage people also to check out the blog for a bunch of deep dives. We didn't even get into the inference optimization work and all the detailed analysis of batch sizes, which I did think was super interesting. And there's a great write up of the next edit feature as well. So there's plenty more to be unpacked from the Augment team than we've had time for today. But, nevertheless, this has been a great conversation. I really appreciate it and look forward to continuing to play with the product. For now, Guy Gur-Ari, cofounder and chief scientist at Augment. Thank you for being part of the Cognitive Revolution.

Guy Gur-Ari: (1:23:36) Thank you so much. This was a lot of fun. Thank you.

Nathan Labenz: (1:23:39) It is both energizing and enlightening to hear why people listen and learn what they value about the show. So please don't hesitate to reach out via email at tcr@turpentine.co, or you can DM me on the social media platform of your choice.
