Infinite Code Context: AI Coding at Enterprise Scale w/ Blitzy CEO Brian Elliott & CTO Sid Pardeshi

Blitzy founders Brian Elliott and Sid Pardeshi explain their infinite code context system for autonomously completing most enterprise software projects, detailing their agent architecture, model evaluation and memory design, and implications for pricing and software jobs.

Infinite Code Context: AI Coding at Enterprise Scale w/ Blitzy CEO Brian Elliott & CTO Sid Pardeshi

Watch Episode Here


Listen to Episode Here


Show Notes

Blitzy founders Brian and Sid break down how their “infinite code context” system lets AI autonomously complete over 80% of major enterprise software projects in days. They dive into their dynamic agent architecture, how they choose and cross-check different models, and why they prioritize advances in AI memory over fine-tuning. The conversation also covers their 20¢/line pricing model, the path to 99%+ autonomous project completion, and what this all means for the future software engineering job market.

Sponsors:

Blitzy:

Blitzy is the autonomous code generation platform that ingests millions of lines of code to accelerate enterprise software development by up to 5x with premium, spec-driven output. Schedule a strategy session with their AI solutions consultants at https://blitzy.com

Tasklet:

Tasklet is an AI agent that automates your work 24/7; just describe what you want in plain English and it gets the job done. Try it for free and use code COGREV for 50% off your first month at https://tasklet.ai

Serval:

Serval uses AI-powered automations to cut IT help desk tickets by more than 50%, freeing your team from repetitive tasks like password resets and onboarding. Book your free pilot and guarantee 50% help desk automation by week four at https://serval.com/cognitive

CHAPTERS:

(00:00) About the Episode

(03:02) AGI effects without AGI

(07:07) Domain-specific context engineering

(16:54) Dynamic harness and evals (Part 1)

(17:00) Sponsors: Blitzy | Tasklet

(20:00) Dynamic harness and evals (Part 2)

(30:42) Graphs, RAG, and memory (Part 1)

(30:49) Sponsor: Serval

(32:26) Graphs, RAG, and memory (Part 2)

(41:17) Model zoo and memory

(50:07) Planning, scaling, and parallelism

(56:13) Pricing, onboarding, and autonomy

(01:04:24) Closing the last 20%

(01:12:34) Strange behaviors and judges

(01:22:23) Reasoning budgets and autonomy

(01:33:36) Fine-tuning, benchmarks, and training

(01:42:31) Securing AI-generated code

(01:49:52) Future of software work

(01:57:05) Outro

PRODUCED BY:

https://aipodcast.ing

SOCIAL LINKS:

Website: https://www.cognitiverevolution.ai

Twitter (Podcast): https://x.com/cogrev_podcast

Twitter (Nathan): https://x.com/labenz

LinkedIn: https://linkedin.com/in/nathanlabenz/

Youtube: https://youtube.com/@CognitiveRevolutionPodcast

Apple: https://podcasts.apple.com/de/podcast/the-cognitive-revolution-ai-builders-researchers-and/id1669813431

Spotify: https://open.spotify.com/show/6yHyok3M3BjqzR0VB5MSyk


Transcript

This transcript is automatically generated; we strive for accuracy, but errors in wording or speaker identification may occur. Please verify key details when needed.


Introduction

Hello, and welcome back to the Cognitive Revolution.

Today my guests are Brian Elliott and Sid Pardeshi, CEO and CTO of Blitzy, a company that uses AI, in just about every way you can imagine, to help enterprise software teams implement large scale features and execute modernization plans with unprecedented speed. 

Regular listeners will know that Blitzy has recently come on as a sponsor of the Cognitive Revolution, and while that does make this a sponsored episode, you can rest assured that this conversation absolutely stands on its merits.  

In fact, I've noticed over time that my interviews with sponsors often end up being among my favorite episodes, and I think the reason is that founders who've achieved real product-market fit are often unusually willing to share the nitty-gritty details of their approach, as it's a uniquely effective way to convince prospective customers that they are better off buying from an AI pioneer than attempting to recreate such a sophisticated system in-house, and because it signals that their product is still rapidly improving.  

Over the course of 2 full hours, we go super deep on Blitzy's approach, what they mean by "infinite code context", and what enterprise software development looks like when more than 80% of major projects can be done autonomously in days.  

Highlights include: 

  • the architecture they use to generate agents dynamically, just in time, with prompts written and tools selected by other agents;
  • why they actually run enterprise apps in a parallel environment as part of their process;
  • how they ingest 100-million-line codebases and deliver value in the form of improved documentation, which also improves coding copilot performance, even before the code generation process begins;
  • how they use detailed knowledge graphs to support sophisticated context management strategies and minimize models' "context anxiety" and strange behaviors;
  • the critical role of taste in evaluating new models and framework changes on such large-scale projects;
  • which models they find strongest for which purposes, and why they always use models from different developers to check one another's work;
  • why they are more bullish on advances in AI memory than on fine-tuning; 
  • how they came up with their 20 cents / line of code pricing model, and why they will do anything they can to deliver more value for customers, even if it forces them to raise prices;
  • what it will take to achieve 99% project completion and even full autonomy in enterprise software development;
  • and even their outlook on the software engineering labor market, which favors senior engineers in the short term, but junior engineers who can use AI effectively over time.

Brian and Sid are both high-energy guys, and they were remarkably forthcoming in this conversation – I learned a ton, and I expect any enterprise software leaders who listen will come away thinking about specific projects where they'd love to put Blitzy to the test.

So, without further ado, I hope you enjoy this deep dive into the present and future of autonomous software engineering, with Brian Elliott and Sid Pardeshi of Blitzy.


Main Episode

Nathan Labenz: Brian Elliott, CEO at Blitzy. Welcome to the Cognitive Revolution.

Brian Elliott: Awesome. Let's get into it.

Nathan Labenz: One of my favorite things to do in life is talk to AI maximalists, and I've known Blitzy by reputation for a while as the company that has figured out a way to create infinite code context, and it doesn't get more maximalist than infinite. So I'm excited to unpack what you guys are building, how it all works, and the impact that it's having on the enterprise software industry. We're going to go through all the layers, but first a question just to orient myself and the audience to you. How AGI pilled are you? How AGI pilled is Blitzy? How AGI pilled are your customers?

Brian Elliott: We believe we can get AGI type effects out of non AGI LLMs. Right. And so as folks are thinking about the impact of artificial general intelligence, they're talking about like huge swaths of work being able to be done to provide economic value autonomously across domains, right? That's one amongst many definitions, and it's a moving target for defining AGI. And so the core question is like, how can you achieve that output with the limitations and constraints of LLMs? We might be the most bearish on LLM capabilities as a pure standalone, like single LLM asset, and the most bullish on the orchestration of those in long running complex systems.

Nathan Labenz: Yeah, that really echoes the conversation I recently had with Daniel Miessler, who created this personal AI infrastructure framework. His mantra is harness is more important than model. Obviously one big limitation there is that the context window is finite, and even at a million tokens, relative to the size of an enterprise code base, that's not nearly enough. Any other, you know, kind of limitations of LLMs as standalone creatures that you think are kind of most important to have in mind?

Brian Elliott: Yeah. So there's so many, right? And like being so forward about the limitations is what allows you to build something really powerful and really magical, right? So context is one, but there's a difference between a context window and an effective context window, right? So as you start to eat into, let's say, 20, 30, 40% of a context window, there's a depreciation that occurs. Each model is a little bit different. There's lots of different ways to test this with, you know, in-house benchmarks, but you start to lose intelligence and quality as you start to fill up even the advertised context window. Right. And the depreciation is a little bit different by task type. So what you want to do is really effectively manage the amount of work and the type of work that you are loading into a context window while also pulling out anything that you don't need. So that's a more nuanced view on the limitations of a context window. The other limit is how many tools an individual agent can effectively call. It used to be that they could call zero tools, then one, two, or three, and now like eight or 10, right? And then tool selection in the agent itself is also something that you really need to understand, steer, and give only the correct tool access to. You can think of a tool as a calculator or a compiler or sort of like any outside entity that the agent is calling. Lastly, it's maintaining long-running intent of the human, right? Or like intent of the machine or instruction, right? It's a byproduct of context management, but it sort of has to do with attention in general. And so if you can design a system that says, great, LLMs are a very, very cool probabilistic type of computer, they have all these limitations, at least when leveraged as a single instance, and then if you can accept those limitations and then build the harness or the cognitive architecture, you can really create something that can do AGI type effects.
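To make the effective-context idea concrete, here is a minimal Python sketch (my own illustration, not Blitzy's code) of budgeting context at a fraction of the advertised window and capping the tools an agent is handed; the 40% fill factor and the eight-tool cap are assumptions drawn from the ranges Brian mentions, not measured values.

```python
# Minimal illustrative sketch (not Blitzy's code): treat only a fraction of the
# advertised context window as usable, and cap how many tools an agent gets.
# The 40% fill factor and the eight-tool cap are assumptions, not measured values.
from dataclasses import dataclass


@dataclass
class AgentBudget:
    advertised_context_tokens: int
    effective_fraction: float = 0.4  # assume quality degrades past ~40% fill
    max_tools: int = 8               # assumed cap on tools handed to one agent

    @property
    def effective_context_tokens(self) -> int:
        return int(self.advertised_context_tokens * self.effective_fraction)


def select_context(chunks: list, budget: AgentBudget) -> list:
    """Pick the highest-relevance chunks that fit under the effective window.

    `chunks` are (text, token_count, relevance_score) triples; relevance would come
    from the relational/semantic layer discussed later in the conversation.
    """
    chosen, used = [], 0
    for text, tokens, _score in sorted(chunks, key=lambda c: c[2], reverse=True):
        if used + tokens > budget.effective_context_tokens:
            continue  # pull out anything that doesn't fit rather than dilute quality
        chosen.append(text)
        used += tokens
    return chosen


if __name__ == "__main__":
    budget = AgentBudget(advertised_context_tokens=1_000_000)
    picked = select_context([("diff summary", 2_000, 0.9), ("whole module", 600_000, 0.4)], budget)
    print(len(picked), "chunk(s) within", budget.effective_context_tokens, "effective tokens")
```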

Nathan Labenz: So I can't help but ask for a couple specific tips, 'cause right now I'm doing the work of kind of building out context of my own life, pulling out all my e-mail history, my Slack history, all the transcripts of the podcast, all this stuff into just this kind of big data soup. And now I'm trying to layer on various kinds of summaries, different angles at it. And in some ways, this is probably quite similar to what you guys are doing with code bases, albeit for me, it's just my own stuff. So I was just thinking earlier today, I wonder how much context I really should put into Gemini Flash, or if that is the right model, maybe there is a different model that even though its nominal context window is shorter, I would actually get better results for a given amount of context. How would you advise me, are there any kind of top line heuristics that you would be willing to share that are like, this is what we see as the best, and this is kind of where it drops off?

Brian Elliott: Yeah, well, let's put a pin in the point of not just using one family of models to do this. We'll cover that in a second; let's talk about how you sort of manage this information, right? So context is serial, information is relational, right? And so like that e-mail connects to that thing that you said in the Slack message, right? And those might be on different applications. And so the question is, what are the core relationships that govern this domain? So we put out a paper about domain-specific context engineering, right? But what is core is that context engineering is not general. It is domain-specific, meaning there is a core set of entities that relate in certain ways inside of the domain of, let's say, personal life or work life, right? And so you have to first understand and define those relationships and then pair that with semantic understanding, right? And that is how you get closer to the context that might be important for any task while removing the context that is not important for any task. That was a very broad philosophy. But the idea that semantic clustering is sufficient is really inaccurate.
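As a concrete illustration of "define the domain's relationships first, then layer semantics on top," here is a hedged Python sketch for the personal-archive example; the entity kinds and allowed relations are invented for illustration and are not from the paper Brian mentions.

```python
# Hedged sketch of "domain-specific context engineering" for the personal-archive
# example: define the entities and allowed relationships first, then hang semantic
# content off them. Entity kinds and relations here are invented for illustration.
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    kind: str  # e.g. "email", "slack_message", "project", "person"
    key: str


# The domain schema: which relationships are allowed to exist at all.
ALLOWED_RELATIONS = {
    ("email", "mentions", "project"),
    ("slack_message", "mentions", "project"),
    ("person", "works_on", "project"),
    ("email", "sent_by", "person"),
}


class DomainGraph:
    def __init__(self) -> None:
        self.edges = []  # list of (src, relation, dst)

    def relate(self, src: Entity, rel: str, dst: Entity) -> None:
        if (src.kind, rel, dst.kind) not in ALLOWED_RELATIONS:
            raise ValueError(f"{src.kind}-{rel}->{dst.kind} is not part of this domain")
        self.edges.append((src, rel, dst))

    def neighbors(self, node: Entity) -> list:
        return [d for s, _, d in self.edges if s == node] + \
               [s for s, _, d in self.edges if d == node]


if __name__ == "__main__":
    g = DomainGraph()
    email = Entity("email", "msg-001")
    project = Entity("project", "podcast-pipeline")
    g.relate(email, "mentions", project)
    print([n.key for n in g.neighbors(email)])  # -> ['podcast-pipeline']
```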

Nathan Labenz: Yeah, okay. I like where you're going with this. So what I'm doing right now is, again, like starting with all this raw information, and then I'm kind of trying to build up layers of higher and higher order understanding. First of all, I'm just going, Okay, let's create a timeline. So I grab whatever I have from all sources, sorted by date. So some might be emails, some might be podcast transcripts. Throw all that into an LLM and say, give me a summary of kind of what I was saying, doing, thinking about whatever at this given point in time. Build out a timeline. Then on top of that, it'll be like, who are the relationships that really seem to matter over the course of all this time? Then it'll be like, what are the projects that I was engaged with and which people was I working with those on? And I'm kind of building up through all those levels right now. How does that play out? I'm sure it's, again, an analogous thing. How does that play out in the context of a giant enterprise code base that you guys get your hands on?

Brian Elliott: Yeah, well, the approach that you're taking on that personal project will be okay at first and then get worse over time. So you're at the personal project stage of a lower mid-market software application, I'd say. You can kind of just shove all the stuff in there and you'll get some approximately right results. But Gates had this idea: if you could schematize the world, you can get a computer to do anything. And so in your example, you're trying to schematize your life, right? In the example of code, you really are trying to schematize code and the relationships in code, agnostic of language. So in the case of code, let's say you could throw a 50 or 100 million line code base at it. And we have a deep relational understanding that we built first. It takes a few days of compute to build that, but that deep relational understanding inside of that code base is the base layer that allows us to do large amounts of development work autonomously. In your example, you'd first schematize your life. That might be dates, it might be months as a group, dates as a group. It might be other activities as a group that relate to other things, but you first and foremost need to understand what are the core relationships that govern the domain. And we have done that in a very, very unique way with code. So when an enterprise starts with us, they ingest their code. It takes a few days of compute, and we then have a deep and novel approach on the category of knowledge graphs, but even that's maybe not sufficient to explain how deep the understanding is: for any line across a 100 million line code base, I can tell you exactly what is relationally relevant down to the line level, so that when I generate code, I am injecting and pulling out the correct context just in time.

Nathan Labenz: So obviously, dependencies are one core type of relationship within software. What's the kind of, and I guess a lot of that has been done traditionally with static analysis tools, right? Like there are all sorts of tools that can go through and say, this file imports these other things and they import these other things. And so we can kind of fan out that way. So what's the breakdown between how much you're using those kind of static analysis tools versus LLMs to do this ingestion? And what's the double click on like the nature of relationships that goes beyond dependencies?

Brian Elliott: Totally. And so think about ASTs, for instance, right? These are one of these tools, like version-specific, language-specific abstract syntax trees. These are a pre-LLM worldview of understanding the relationships and meanings within a language and version of a programming language, right? So you can think of what we've invented as, it's not an AST, but it resembles the characteristics of an AST in accuracy, is programming language agnostic, and is designed for AI agent traversal, right? And so that was a lot of words, but think globals, classes, variables, functional relationships inside of an application. And so by having the traditional, I would say, programming language agnostic relationships on top of actually building and running the application, which we do as we create relationships, you're able to create a much deeper understanding. I think one of the powers is you're really not able to get understanding unless you are building and running applications and putting them through the paces to understand everything from what you said on the left side, which is like dependencies, to how things are relating when they're run in production and have actual logs running, and understanding those relationships. So you can imagine the spectrum of compile-time, runtime, and production-under-load items that a software development team might look at. And those ultimately form the base of the relationships that instantly schematize enterprise code.
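Here is a hedged sketch of what such a language-agnostic, AST-like relationship record might look like; the field names, relation types, and phase labels are my own framing of the compile-time, runtime, and production distinction, not Blitzy's actual data model.

```python
# Hedged sketch of a language-agnostic, AST-like relationship record, tagged by the
# phase in which the relationship was observed. Field names, relation types, and the
# phase labels are my own framing of the compile-time/runtime/production distinction.
from dataclasses import dataclass
from enum import Enum


class Phase(Enum):
    COMPILE_TIME = "compile_time"      # imports, type references, dependency edges
    RUNTIME = "runtime"                # call paths observed while building/running the app
    PRODUCTION_LOAD = "production"     # relationships surfaced by logs under real load


@dataclass(frozen=True)
class CodeSymbol:
    path: str   # file path, independent of programming language
    line: int   # line-level granularity, as described above
    kind: str   # "class", "function", "global", "variable", ...
    name: str


@dataclass(frozen=True)
class Relationship:
    src: CodeSymbol
    dst: CodeSymbol
    relation: str  # "imports", "calls", "reads", "writes", ...
    phase: Phase


def relationally_relevant(symbol: CodeSymbol, edges: list) -> set:
    """Everything relationally tied to one symbol, across all observed phases."""
    related = set()
    for e in edges:
        if e.src == symbol:
            related.add(e.dst)
        elif e.dst == symbol:
            related.add(e.src)
    return related
```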

Nathan Labenz: Yeah, okay, that's really interesting. First of all, do I understand correctly that you are literally running enterprise applications in your own, like, parallel universe? Because of course your clients are continuing to run their applications in production, right? So you've got to kind of...

Brian Elliott: We drop it in their own cloud environment. So they're sort of spinning it up again in their cloud environment. But like, let's see, to get started, one of the reasons it takes days, not zero minutes, to get started is getting access to your environment, getting all the necessary keys so that you can spin up these applications and run them. But it's cool because when you get large-scale code outputs from Blitzy, you'll also see the QA that we did and the screenshots of an agent clicking through and running an application in production. And so that happens both upon ingestion to make sure that we can run and build the application, and then at code generation as we go through QA. Running the application is core to getting high-quality code because you need a recursive correction loop, not just for something that doesn't compile and build, but for something that doesn't act in production how you're expecting it.

Nathan Labenz: Yeah, just the feat of managing to actually stand up another parallel instance of the production application is like, I'm sure it's not trivial in many, many cases.

Brian Elliott: Like you need to seed the database, right? Like there's real implementation work, right?

Nathan Labenz: I bet there's a lot of times where people don't even, because they haven't really done it or this thing has been kind of running in the way it's been running for a long time, I bet a lot of times they don't even have a sort of ready plan for how you would even do that, right?

Brian Elliott: A lot of applications, in insurance of all places, for whatever reason, they just really have no way to provide us these instructions. And so what we'll do, we'll go through this iterative approach, which provides value even in itself, where we will take the information in which they think it takes to run the application, and then Blitzy will find a limit case of not being able to do it, and then we'll be like, hey, we don't have access to this package, and they're like, okay, well, I had no idea it depended on that package. And so you're able to kind of go through this process of actually creating the correct build instructions for the application that's essentially been sitting somewhat dormant, that they want to activate or move over into the more modern technology stack, as a part of getting Blitzy to stand it up. So we've provided value just in implementation, I would say, but it does come with, obviously, challenges. And in lots of old enterprises, for instance, to build the application, it's not as if you're just running a script or a package. It requires what would have typically been a human, with dialog boxes popping up and putting information in. But Blitzy is sophisticated enough to spin that up and then put in user creds, all within their IP, to build an application. That's how Windows applications were built back in the day. And so it requires a real build sophistication in the application to get this level of fidelity.

Nathan Labenz: You guys have been at this for a few years, right? The big question I had is obviously the capabilities of models have changed dramatically, right? So in terms of the ability to look at a screen, understand what's going on, I think we saw that kind of demoed for the first time with the GPT-4 launch, but it was still pretty rough around the edges and not really even available for a while after that. On the computer use benchmarks, you know, we're kind of in the steep part of the S-curve right now. I remember fondly, but also with frustration, the experience, you know, of early computer use agents. Like, even if they could see the button, they couldn't necessarily click on the button. They couldn't quite find the right place to click. So that stuff has all improved dramatically. How do you think about turning Blitzy on itself? Like, I recently did an episode with Andrew Lee from Tasklet, and he's another AI maximalist I really enjoy talking to. One of his mantras is, in the AI era, speed is the only moat. And he takes a lot of pride in just how fast they kind of rebuild their stack from the ground up. So I guess what would be the big unlocks that you have seen in terms of like, okay, models couldn't do this before, we had to do all this stuff to compensate. Now they can, we can kind of simplify that or we can aim higher in terms of what we could do. I'd be interested in what those big milestones would be as you look back and just how often do you find yourself kind of having to do major modernization work of your own stack, even if that modernization is only a few months from the last version to the new version?

Brian Elliott: It's such a good question. And so when we started building Blitzy in 2022, we essentially made a bet that the models were going to get way better, faster than anybody in the market expected them to. And so we started building for a future universe that wasn't here when we were doing all of the design and all of the work of this. There's no MVP of Blitzy. It's like an end-to-end platform experience, right? And so the world that we built for over the last three years and the world in 2025 essentially intersected, right? The models had to get really, really good, and we were correct. And so when you are building systems for an ever improving state of LLM intelligence, you want to build the systems dynamically. So when people talk about building harnesses, they're sort of hard coding and codifying actions based on the level of LLM intelligence and capabilities, right? And so those harnesses depreciate as LLMs get better, and, you know, the level of depreciation is tied to how hard-coded your design is, let's say, and the rate of intelligence increase, right? And so everything that we do in Blitzy is dynamic design, meaning Blitzy's agents are generated dynamically just in time, prompts are written by other agents, tool selection is assessed just in time by context injection, right? And then the whole planning process, right, that governs all of this is sort of chunked and revisited iteratively, right? And so as the models get better, it's just great for us, right? Like we can more or less just do more. And it's a config file to toss in a different LLM, but because everything inside of the system is dynamic, we don't feel the depreciation that one would typically feel when they're building harnesses in the classical way that folks build harnesses today. For instance, right, as a new model comes out, new prompting instructions for that model come out, right? Our agents just reference the latest prompting instruction tied to their model, and then that agent writes a prompt for another agent that's injected, right? And so it doesn't matter that the prompt guidance changes for the next Gemini model, right? The agent will just go reference that as it's dynamically writing a prompt for another agent.
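To make the "dynamic harness" idea concrete, here is a minimal, hedged Python sketch of assembling an agent just in time: the model comes from a config file, the prompt is written by another model call, and tools are granted per task. The config format, function names, and keyword-based tool filter are all invented for illustration, not Blitzy internals.

```python
# Minimal, hedged sketch of "dynamic harness" assembly: the model comes from a config
# file, the prompt is written by another model call, and tools are granted per task.
# The config format, function names, and keyword-based tool filter are all invented.
import json
from dataclasses import dataclass
from typing import Callable


@dataclass
class AgentSpec:
    model: str
    prompt: str
    tools: list


def load_model_config(path: str) -> dict:
    # e.g. {"planner": "model-a", "codegen": "model-b", "review": "model-c"};
    # swapping in a newly released LLM is just an edit to this file.
    with open(path) as f:
        return json.load(f)


def build_agent(role: str, task_context: str,
                prompt_writer: Callable[[str, str], str],
                tool_catalog: dict,
                config_path: str = "models.json") -> AgentSpec:
    """Assemble an agent just in time.

    `prompt_writer` stands in for an LLM call that consults the latest prompting
    guidance for the chosen model; `tool_catalog` maps tool names to trigger
    keywords, and only tools relevant to this task are granted (toy heuristic).
    """
    model = load_model_config(config_path)[role]
    prompt = prompt_writer(model, task_context)
    tools = [name for name, keyword in tool_catalog.items()
             if keyword in task_context.lower()]
    return AgentSpec(model=model, prompt=prompt, tools=tools[:8])  # cap tool access
```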

Nathan Labenz: So that sounds awesome. It sounds like you're living the dream in many respects. One thing that I do wonder about there though is like, how do you evaluate that? Because the typical harness, while it does have this depreciation problem, and you know, I can speak to that from when I tried this sort of personal AI infrastructure at various points in time. And I kind of always felt like it wasn't really there to give me tremendous value. I think now it's actually, we've maybe hit that point. But as I look back at some of this old code, I'm like, oh my God, like 8,000 token context windows when I first tried this. That was so limiting and I was doing so many gymnastics to try to make that work. But one benefit of those gymnastics was, or at least one thing that was easier was, I could at least kind of define an eval test set that I could wrap my head around, that I could look at and be like, okay, this makes sense in inputs and outputs, and I can throw a new model at that and get a quick sense of is it better, is it worse, whatever. When so much is dynamic, how do you think about evals? I guess one thing I could imagine would be you might do some fixed evals kind of as prep work, like let's characterize the effective context window of this new model and then tell itself what its effective context window is, give it some sort of like metacognitive information. But you've probably got lots of other insights into how to eval such a dynamic system, so I'd love to hear them.

Brian Elliott: So I think it's important that your evals map onto the real world as closely as possible. And so if you think about most evals in the world, they are designed to be easy, very easy for the human to evaluate, right? Very easy because it's like, all right, well, here's a function and here's a different version of that function, and this other version is more accurate, right? But that is a local optimization on an exponential technology, right? And so our evals are a bunch of applications that we've built over the years that are of larger scale. Some of them, you know, started on open source and we built our own versions as private applications over the years. And so we're testing Blitzy on executing what we ultimately want to be a 100% outcome, and we're seeing how close we get to that outcome with the new configuration of Blitzy, right? And so we might give it a million lines, maybe we'll give it like Apache Spark, it's like 1.3 million lines of code. We have a custom configuration of Apache Spark from previous projects that we've done personally in life. We'll give those instructions and we'll be able to see very quickly, like how close did we get to 100% completion with this adjustment, and it requires an extreme amount of taste. Because if you're not 100% there, right, you don't get to 100% of a result, the 100% being like what you did as a human in a previous life to get that to 100%, you're now saying like, is this 85, 88, 90, 95%, 100%, right? And it's the difference between functionally correct, which, like, Blitzy can guarantee functional correctness, like, hey, we passed every end-to-end test, we passed every integration test, we passed every feature test, but it may not be the final version that you actually intended to put in production, right? And there's always a difference between functional correctness and intent. And so it's that taste that is required to really improve the system and provide that feedback on top of the sort of traditional large scale evaluation. So this is why I think it is really, really hard to build these systems without the right longitudinal experience to understand what great technical design and great implementation from a software perspective is. We always say Blitzy is the instantiation of, if you had Sid, my CTO and co-founder, working at the time of compute, because he's instantiating his technical taste into the outcome in a way that is really, really impressive for the enterprise. Well, of course, they can specify their own taste and their own rules, and the system will respect that. But that is how we do evals, which is like at scale, at a very large scale, with a lot of taste involved.
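Here is a hedged sketch of that eval philosophy: run the full system against a large reference project with a known-good human outcome, report a percent-complete number, and leave the final call to human taste. The project name, check names, and numbers below are placeholders, not real eval data.

```python
# Hedged sketch of the eval philosophy described here: run the full system against a
# large reference project with a known-good human outcome, report percent complete,
# and leave the final call to human taste. Names and numbers are placeholders.
from dataclasses import dataclass


@dataclass
class EvalProject:
    name: str
    lines_of_code: int
    reference_checks: list  # checks the known-good human implementation passed


def completion_rate(passed: set, project: EvalProject) -> float:
    return len(passed & set(project.reference_checks)) / len(project.reference_checks)


def run_eval(project: EvalProject, run_system) -> dict:
    """`run_system(project)` returns the set of checks the candidate configuration passed."""
    passed = run_system(project)
    return {
        "project": project.name,
        "functional_completion": round(completion_rate(passed, project), 3),
        "needs_taste_review": True,  # functional correctness still isn't intent
    }


if __name__ == "__main__":
    spark_like = EvalProject("custom-spark-config", 1_300_000,
                             ["e2e_suite", "integration_suite", "feature_suite"])
    fake_system = lambda p: {"e2e_suite", "integration_suite"}  # stand-in for a real run
    print(run_eval(spark_like, fake_system))
```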

Nathan Labenz: So that kind of final taste step, if I'm looking over Sid's shoulder as he's evaluating the work of a new model thrown into the Blitzy meta harness, what am I seeing him doing?

Brian Elliott: You are looking at the final output, but really you're looking at the logs. We use LangSmith for logs and tracing. Shout out LangSmith, big fan of the LangChain guys. This is their tracing product. As you look at Blitzy logs, if you were to type them out on a piece of paper and put it on a scroll, it would scroll all the way down the block. The amount of agentic interactions that occur at runtime is absolutely massive, where you have somebody injecting context, somebody writing a prompt, somebody writing code, somebody reviewing that code, somebody building the code, somebody doing before and after, local pass or fail, end to end. So that's happening to get a piece of functionality out in the bigger system. So as you see these agents interacting, it's a lot like looking at your engineers having a technical discussion on what correct might look like. And so what you need to do is look at the final output of the meta harness, like the pull request here, and then trace back in the system: I didn't like what happened here. What happened in the system? And how can I steer the system to dynamically be able to address this kind of instance in the future? It's a completely different approach to building software because the outcome is a little bit emergent in a way. And you have to build the system to understand how to dynamically steer and validate to get to the right outcome.

Nathan Labenz: Yeah. What does that steering process look like? Is it just like giving the system text, like free text feedback? Because that sounds like its own...

Brian Elliott: Yeah, really, it's trying to be as algorithmic as possible. Right. And so as you think about like chunks of work being completed, the first step that we'll do after receiving a future state spec from the client, which is like, our system will work with you to get a future state spec of what you want. That's the web application portion. As that's sent off to do work, right, then you start off a planning process, or Blitzy starts off a planning process, and then it executes against that plan. And so each one of those planning steps, any chunk of work, the planning, readying, testing, validating, QA, and doing that recursively, is driven algorithmically to get to an outcome. And so it's tweaking the algorithms that govern the system to get to the right outcome.
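As a rough illustration of that "plan, execute, validate, recurse" loop, here is a hedged Python sketch; the step granularity and the retry budget are assumptions, and the real system's algorithms are obviously far more involved.

```python
# Rough, hedged sketch of the "plan, execute, validate, recurse" loop; the step
# granularity and retry budget are assumptions, not the real planning algorithm.
from typing import Callable


def run_chunk(chunk: str,
              plan: Callable[[str], list],
              execute: Callable[[str], str],
              validate: Callable[[str], bool],
              max_attempts: int = 3) -> dict:
    """Plan a chunk of work, execute each step, and recursively retry until it validates."""
    completed, needs_human = [], []
    for step in plan(chunk):
        done = False
        for _ in range(max_attempts):
            artifact = execute(step)
            if validate(artifact):          # compile, build, tests, QA checks...
                completed.append(artifact)
                done = True
                break
        if not done:
            needs_human.append(step)        # handed to the evaluator / human guide
    return {"completed": completed, "needs_human": needs_human}
```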

Nathan Labenz: Fascinating. Going back to the kind of initial ingestion and sort of the knowledge graph that is created, I'd love to hear your thoughts on knowledge graphs and how they relate to RAG and whether you guys are using embeddings. There's been obviously many different approaches and schools of thought here. I've always been attracted to the idea of knowledge graphs, but certainly for a long time it was like RAG was kind of more in vogue, and then it seemed like a lot of times it was just, dump everything in the context window. That started to become the prevailing approach when possible. Obviously it's not possible for large code bases. Are you able to get to the point where you've mapped things out so well that you don't have need for fuzzy semantic matching? Or do you also avail yourself of that and have like, this is what we were able to find structurally that's relevant, and this is also maybe some other relevant stuff that sort of fuzzy matched that you might want to be aware of?

Brian Elliott: Yeah, you really want to use both as a hybrid source of truth, right? And then when there's conflicts, then you, the system, want to explore much deeper and much further, right? The issue with RAG as a standalone item is sometimes people will rely on the RAG abstraction layer as, like, the independent source of truth, right? And so to answer your first question directly, you want to use both a relational understanding and a semantic understanding, and you want to pair those as agentic tools so that you can arm the agent to use these different tools to go and pull the right information. But you really want to use these tools as an abstraction layer to go search the source of truth, right? And so this is where you don't want to rely on the semantic match to pull out the truth. You want to rely on the semantic match as a map or a legend against the actual source of truth. Go efficiently search, traverse, and find that, and then pull the source of truth into the context window, right? So it's really an efficiency search mechanism more than it is a storage of truth mechanism.
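Here is a hedged sketch of that "semantic match as a map, repository as the source of truth" pattern; the toy word-overlap scorer stands in for a real embedding index, and the function names are invented.

```python
# Hedged sketch of "semantic match as a map, repository as the source of truth":
# a toy word-overlap scorer stands in for a real embedding index, and the actual
# text placed in context is re-read from disk just in time. Names are invented.
from pathlib import Path


def semantic_candidates(query: str, summary_index: dict, top_k: int = 3) -> list:
    """Return the file paths whose stored summaries best match the query (toy scoring)."""
    q = set(query.lower().split())
    scored = sorted(summary_index.items(),
                    key=lambda kv: -len(q & set(kv[1].lower().split())))
    return [path for path, _ in scored[:top_k]]


def relational_expansion(paths: list, graph: dict) -> list:
    """Add files the knowledge graph says are relationally tied to the candidates."""
    expanded = list(paths)
    for p in paths:
        expanded.extend(graph.get(p, []))
    return list(dict.fromkeys(expanded))  # de-duplicate, preserve order


def pull_source_of_truth(paths: list) -> dict:
    """Read the actual files just in time rather than trusting stored summaries."""
    return {p: Path(p).read_text() for p in paths if Path(p).exists()}
```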

Nathan Labenz: So one thing I've observed that I wonder how you address is, So often when I have an agent searching through whatever, right? My Google Drive or my Gmail, one huge disadvantage that it has relative to me is that I have this sense when I have found what I was looking for.

Brian Elliott: Yep.

Nathan Labenz: And it's always clear to me, you know, I'm always like, I've not found it yet until I find it. And then I'm like, that's what I was looking for. That's obviously predicated on my... you know, historical familiarity and the fact that I was involved in creating all this stuff, right? So I kind of know, yes, that's the thing. The model obviously lacks that kind of deep familiarity, historical participation. And so it can't be so confident in general that like, that was the thing that I was looking for. So how do you guide models when they're doing this kind of search to make that judgment call of like when to stop the search. I find that to be a very perplexing thing in my own building.

Brian Elliott: Yeah, so this is all about the mechanism of the request between yourself and the model. And so in the instance of, I have a fuzzy idea somewhere between some mental neurons on what I might want, you might actually be doing the most efficient thing by just going through and searching. But if you think about completing work in a workplace, work follows some sort of structure. Right. And so in software development, it follows a spec, right? And so therefore, you can express, and this is how people will do it, they'll express what they're roughly trying to achieve with Blitzy. It'll go look against the source of truth and it'll come back with a plan in the form of a future state technical specification, like what architects deal with all day, to go do that work. Right. And so until you can provide the system the right structure of output, it is unlikely from a system level to go and sort of do your bidding correctly. And so then the question is, how do you create the right interface experience to enable humans to enter with a fuzzy input, get confirmed on a structurally strong output, and then send that task off to the system, versus the experience that you just described, which is, fuzzy input is sort of all you get, right? And so some people use chat for this, right? Which is like, I'm roughly thinking about this idea, I think it's this thing tied back to this date, and then it can say, oh, is it any of these possible things that you want to go explore further? That's an intermediate abstraction layer ahead of the true deep search. And so it's all about creating an intermediate experience between the system of intelligence, the system of record, and then how you're sort of expressing that ask.

Nathan Labenz: So as much as possible, basically, you are giving, when we actually are getting to the work stage in the process, you hope at that point that you have effectively given the agent everything it really needs to know, or at least the location of everything it really needs to know. And then it can kind of do additional search to like read in the details of that file, that function, that service, whatever. But you've already had a human approve a plan and kind of sanity check at that level. So it should have clarity basically on exactly what it needs to be reading.

Brian Elliott: That's right. And what's super important is the system is capable of doing both steps, meaning I can provide you what I'm trying to do inside of my, like, 30 million line trading system, right? And then you can come back, let's say in about an hour after you give it this, let's say a page of general instructions of what you're trying to achieve on the code base, and it'll come back with a very in-depth implementation plan, because you didn't think about the edge cases or the services that it might touch. The whole point is it's impossible for a human to grok everything that it might affect. That is phase one of system interaction, being like, hey, heads up, human with limited human context window. Here is the plan that you said, expressed against this enterprise code base, and here's a bunch of things that we're going to have to do to implement this that maybe you did or didn't think of. By the way, if you want to do this a different way, that's cool too, but let's assess and make those trade-offs before we go off and write 100,000 or a million lines of code. That experience of leveraging system intelligence to generate a clear version of the work is required as sort of phase one in order to do large volumes of work in phase two.

Nathan Labenz: So earlier you mentioned that my approach is gonna work until it starts to fail. What's gonna cause it to fail, and what should I be mindful of as I approach... you know, how do I know when I'm approaching failure and how should I be prepared for those failures?

Brian Elliott: Yeah, and so maybe I'll start by saying how we recognize failure in our system, and then maybe we can map it onto your own passion project here, which I love. Inside of the Blitzy system at runtime, right, we are doing as much work as we can autonomously. You can think of it as spec and test-driven development at the time of compute. And we'll retry and reloop and recursively go back and self-improve between running the application and getting the feature outcome. But at a certain number of attempts, right, we have to say, okay, we can't do this part, right? And so we have a separate and independent evaluation system that figures out what the desired end state was, what the system was able to do, and then writes the doc for, man, if we could optimally get to this end state, this is the most likely path that we believe a human could do that this system can't, right? So you need to sort of build these mechanisms, these systems, the system of work, the system of QA, the system of evaluation, to operate somewhat independently, right? So that when you get the output, as a part of the output, you also get the report on what the system failed to do. And we always call that the human completion part, right? And so getting these to be really accurate allows you to move with confidence, right? And so for us, that's a project guide that says, hey, these functions or these parts of the application, we need your help. And by the way, we did all of this work. We passed all these tests. Here's the QA, and here's the screenshots. So you can feel good on that. Go review that code, but go spend your time on this part. So if we map that onto yours, you would need to have: this is my intent. My intent gets some work. That work has QA involved recursively ahead of it getting the outcome, and also a separate system to evaluate and grade the effort of that, and both of those artifacts should come to you, and both of those systems within your application should be independent in nature.
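A hedged sketch of that independent-evaluator idea: a separate pass diffs the desired end state against what was actually achieved and emits the "go spend your time here" guide. The report shape and field names below are my own, not Blitzy's format.

```python
# Hedged sketch of the independent-evaluator idea: a separate pass diffs the desired
# end state against what was achieved and emits the "go spend your time here" guide.
# The report shape and field names are my own, not Blitzy's format.
from dataclasses import dataclass, field


@dataclass
class ProjectReport:
    completed: list = field(default_factory=list)    # passed tests, QA evidence, screenshots
    needs_human: list = field(default_factory=list)  # the "human completion part"


def evaluate(desired_end_state: set, achieved: set) -> ProjectReport:
    report = ProjectReport()
    report.completed = sorted(achieved & desired_end_state)
    report.needs_human = sorted(desired_end_state - achieved)
    return report


if __name__ == "__main__":
    r = evaluate({"migrate_auth", "add_audit_log", "port_reports"},
                 {"migrate_auth", "add_audit_log"})
    print("review:", r.completed, "| human completion part:", r.needs_human)
```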

Nathan Labenz: It's like a report card. So how about a kind of model scouting report? You had said that, you know, you don't want to use just one family of models. I mean, that's clear to me, but you know, why? And do you have kind of rules of thumb for which families are kind of better in which ways? How many are you, you know, how many families are you using? Does Grok, you know, crack the list? Do any Chinese models crack the list? Are you fine tuning models for particular purposes? Tell us, give us a tour of the model zoo.

Brian Elliott: Yeah, so we use the three major families of models in Blitzy today: OpenAI, Google, and Anthropic. The other ones are great and may be incorporated in the future for different purposes, right? But it's very clear the researchers' preferences are somehow expressed in these models' intelligences, and that they're very, very smart in sort of different ways. And they're much, much smarter when you compare different families of models and have them review each other's work, right? And so if you took Opus and Sonnet from Anthropic and had them compare each other's work, versus an OpenAI and an Anthropic model, you're going to get demonstrably better results by having a different family of models review, or, you know, different companies review work, at least in all of our experience, right? And so that is super interesting. And it changes every day, but for first pass code gen, Anthropic remains really, really strong. Structured output and code review, great results from OpenAI. By the way, what I say here will probably depreciate by the time the podcast even comes out. And Gemini has been better for long horizon work, task checking, task lists, keeping things progressing, right? And so that's, I don't know, date-time stamp this towards the end of January. And I'm almost certain that it'll probably change by the end of February.

Nathan Labenz: Yeah, the pace is unbelievable and relentless, for sure. So translating that back to kind of the meta structure of the whole thing, I'm kind of imagining that there is like a brief given at the highest level where it's like, for this kind of task, you're probably going to want to use this model. For this kind of other task, you're going to use this model. And are you then allowing the system to dynamically select which model to use as a sub-agent as it unfolds itself?

Brian Elliott: Yeah, and an example of a dynamic algorithm rule would be like: you can pick the one that you think is best for this situation, and also, the reviewing agent must be one of these other options. And so we're not constraining the choice. We're sort of then constraining the selection of choices in the review model. So that's an example of a sequence of steps used in validation that is dynamic in nature, not like, you must use Gemini, then you must use OpenAI, for instance, right? And then you asked about fine-tuning, right? And so fine-tuning is like a last mile optimization, I would say, and not a bet on dramatically improved models. Fine tuning is an expression of, essentially, I can't get enough correct context engineering within the system and I can't get the right results. And so there's a place for it in the ecosystem, but it's like, as soon as you fine tune a model and the next one comes out and it has more raw intelligence, you're basically out of luck, right? And so we are much more bullish long-term on what we call memory, right? And so you see a very shallow instantiation of this in tools like ChatGPT, where it will start to sort of remember your preferences. But there's a lot of memory that occurs in the enterprise environment, right? And memory is another way to express both relational and semantic understanding, but with a lot more signal of truth, right? And so we very much believe that to get to 100% autonomy within an enterprise workflow, you have to sustain memory of the actions of the best people and what they view as correct, and then sort of store that in your instance, in the enterprise's instance of the platform, in this situation the enterprise instance of Blitzy. And that's how, even after the architect retires, the only one that knows, like, that system, the enterprise itself still has that IP in their instance of their AI system.
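Here is an illustrative sketch of that cross-family review rule: the system may pick whichever model it judges best for the task, but the reviewer must come from a different developer. The vendor groupings are real; the placeholder model names and the selection heuristic are invented.

```python
# Illustrative sketch of the cross-family review rule: the system may pick whichever
# model it judges best for the task, but the reviewer must come from a different
# developer. Vendor groupings are real; the model names and heuristic are invented.
import random

FAMILIES = {
    "anthropic": ["claude-codegen"],        # placeholder model identifiers
    "openai": ["gpt-structured-review"],
    "google": ["gemini-long-horizon"],
}


def pick_models(task_kind: str, preference: dict) -> tuple:
    """Return (worker_model, reviewer_model) with the reviewer from another family."""
    worker_family = preference.get(task_kind, random.choice(list(FAMILIES)))
    worker = FAMILIES[worker_family][0]
    other_families = [f for f in FAMILIES if f != worker_family]
    reviewer = FAMILIES[random.choice(other_families)][0]
    return worker, reviewer


if __name__ == "__main__":
    prefs = {"first_pass_codegen": "anthropic",
             "structured_review": "openai",
             "long_horizon_tasks": "google"}
    print(pick_models("first_pass_codegen", prefs))
```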

Nathan Labenz: Memory for the missing middle, as I've sometimes called it, of LLM memory has been an obsession of mine for a long time. I was really taken by the Mamba architecture when that came out, just because, hey, here we have something that's kind of competitive with an attention mechanism transformer, but it has a fixed state space size, so we can kind of potentially run this thing indefinitely. Obviously, there are, you know, still limits to that. You know, there's a spectrum in memory between like pure scratch pad and deeply integrated nested learning, continual learning, futuristic stuff. That sounds awesome, but it also does have some kind of challenges in that, you know, with a nested learning type approach, the model may perform better, but it doesn't necessarily mean that you have a record of what happened or like what the key, you know, lessons were, because it's in the weights, right? So what do you think is, if you were kind of going to put your own spec, let's say, out to the frontier model companies for what you want to see memory look like, what is the shape of memory that would be the biggest difference maker for you guys?

Brian Elliott: Long-term memory, I don't believe, will be solved at the LLM level, right? And so LLMs have so much momentum behind them that another architecture, even if it were to solve for this, will not get the level of intelligence required to execute what these systems need, right? And so memory is a problem to be solved at the system level, and by system I mean sort of the application layer, the system layer. And that memory is sort of application or domain specific to what is important to remember in what instance, right? And so memory, right, you can think of memory as going all the way back to the traces, right? The LangSmith traces, where a series of steps was actioned. The series of steps were driven by decisions that you chose to put in context. The decision to put something in context might change in the future based on what you've learned from the way the enterprise expressed work, right? And so this is tying all the way back to your context management system. That's where you're storing memory and preferences based on actions, not based on model weights.

Nathan Labenz: Interesting. I have some hope that there could be an integrated memory breakthrough that will help.

Brian Elliott: It will certainly make things easier. It will make things so much easier. I hope for it. I really do. And even some expression of memory in the model layer will ease the burden on the system layer, right? But to give you a specific example, if you think about memory at an enterprise code base layer, right, the things which one needs to remember are extremely locally specific, right? And so a memory on an enterprise code base is not universal. So it's not, use this payment provider service over that payment provider service, even though my enterprise has nine. It's, hey, when you interact with this cluster of context, you need to use this service, even though to you they look relatively functionally equivalent. There's some organizational or contract reason in which you need to use this service, right? And so that is so local from a context interaction perspective that impressing global memory at the model layer actually has severe limitations. And so the question is, how do you bifurcate global truths or global memories, which people instantiate with rules today to kind of try to manipulate these models to do what they want? How do you instantiate universally true long-term memory in the weights of the models, because these are more brute force levels of intelligence, while keeping locally contextual, memory-based decisions at the system or at the application layer?
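To illustrate that split between global rules and locally scoped memories, here is a hedged Python sketch; the payment-provider example echoes Brian's, but the data structure and file paths are invented.

```python
# Hedged sketch of splitting memory into global rules versus locally scoped memories
# that only activate when the agent touches a particular cluster of context. The
# payment-provider example echoes the one above; paths and structure are invented.
from dataclasses import dataclass
from typing import Optional


@dataclass
class Memory:
    note: str
    scope: Optional[frozenset] = None  # None = global rule; otherwise context-cluster paths


MEMORIES = [
    Memory("Always use the org's standard logging wrapper."),                        # global
    Memory("Use PaymentServiceB here (contract reasons), even though A looks equivalent.",
           scope=frozenset({"billing/checkout.py", "billing/invoices.py"})),          # local
]


def memories_for(context_paths: set) -> list:
    """Return only the memories that apply to the context the agent is touching."""
    return [m.note for m in MEMORIES if m.scope is None or m.scope & context_paths]


if __name__ == "__main__":
    print(memories_for({"billing/checkout.py"}))  # global rule + the local payment memory
    print(memories_for({"reports/export.py"}))    # only the global rule applies
```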

Nathan Labenz: Yeah. I totally agree that there's-- you can't-- I mean, the nature of compression is like you can't compress everything, right? Something's got to be lost.

Brian Elliott: Almost every problem, I feel like, is a search and compression problem at the end of the day. And you're trying to lose as little as you can in that compression, and you're using search to try to minimize that. But I think about search and compression, like, all day.

Nathan Labenz: Yeah, it reminds me of, I'm sure you've heard this, but it's kind of an old parable or something of the sort: the junior developer gets a problem, you know, gets all excited, starts like ripping off code, just typing a mile a minute. Whereas the, you know, the seasoned vet kind of leans back and says, I think I've seen something like this before. And that's kind of the thing that I can imagine: even with like finite size, you know, finite memory space, I can imagine that getting developed to the point where you could just get tremendously higher reliability on going out and finding the right documentation when it's actually needed, making the right decisions. Not because the model would've memorized every last detail of it, but it would have that sort of intuitive sense that we probably have undervalued in ourselves until we've kind of seen how much we contrast with LLMs, who lack it, that sixth sense of, yeah, there's something here that I kind of know I need to go get, and I kind of know what I need to get.

Brian Elliott: And if you were to look at how Blitzy spends time, as the representation of the best cluster of developers at inference, we spend a huge amount of time in planning and system understanding and impact analysis, meaning let me really methodically think through this, and then let me spend a lot of time figuring out everything else that this is going to affect. The code generation is relatively fast, right? Then a bunch of time on QA and validation and recursively improving the code based on what you're trying to achieve. But writing a million lines of code is as fast as you can stream tokens, right? Our runs are as short as 12 hours, as long as a few weeks if it's a huge, huge refactor, right? And so as you break that up, it is that wise developer motion of, let me sit back, let me plan, let me think, let me think about everything this is going to impact across the system, and then let me implement, as opposed to the junior dev, which is just rocking code on minute zero.

Nathan Labenz: Yeah, okay, that's a great transition to a couple of questions I had around what you might call Blitzy scaling laws, or another way to think about it would be limits to parallelization. Well, you could just sound off on it, but I'm interested in what is the curve. You know, Sam Altman famously tweeted, like, it's gonna be really weird to live in a world where you can pay exponentially more for like marginally better results. So you've clearly got a curve like that. I just kind of want to know how you think about that curve and where you want to be on that curve. How do you know when to stop paying for more inference? And then parallelization, like Kimi K2 just came out, they've got their agent swarm thing. And there's another kind of some sort of logarithmic thing here where like a thousand agents does not make you go a thousand times as fast. It can make you go five times as fast, maybe, maybe 10 times as fast. You could maybe characterize what that looks like, and also what do you think the reasons are for it? Some things, I guess, are just sequential. You got to plan before you can execute and so on. But yeah, that's plenty of prompt. Take it from there.

Brian Elliott: Nice, yeah, good, structured prompt there. So let's talk about parallelism and the limits of parallelism. When you think about the work getting done at the system level, this is a core topic: understanding, within the domain that you operate, and we operate in enterprise software development, what sort of work can be done in parallel versus sequentially. Because trying to do everything at once is a surefire way to get really, really bad results. And so just like in engineering, an engineering team will look at an epic, they'll break it down into tasks, and they'll realize which tasks depend on what. That is a huge part of what is happening at the planning phase within Blitzy. We are deciding based on software development fundamentals, like thing X depends on thing Y, therefore we have to get Y to build, compile, and pass tests before even starting on thing X. So we must do it in that sequence, right? And so that is what is happening for us at the planning stage, which is parallel versus sequential tasks. That is just a software development problem set. Now in other domains, there are other ways to think about what can be done in parallel versus what can be done sequentially. But in engineering, it's very easy to grok what depends on what in a sequence of work. And therefore, we have a system that algorithmically works through and assesses that. So that is the answer on parallelism. We want to do it as high quality as possible, which means in the instance where the system is not entirely sure, it'll assume sequential; in the instance where it is extremely sure that it is parallel, it'll do it in parallel, right? And so it's sort of like a tolerance preference on quality, which will answer your first question, which is paying more to get better results, right? Our thesis is, we will pay any incremental dollar, we will write any incremental algorithm, we will really do anything within the system to improve the quality of the code, right? All the way to fully autonomous enterprise software development as the goal for the company. And so we are not, in our opinion, cost constrained, because the other side of a pull request is human labor, right? And so I would much rather have that human be working on problems that are on the edge, that are truly innovative, that are thinking about absolutely disrupting the way that they're applying technology to their business, than I would have them spend on vanilla application development for a regular organization. They've already expressed their preferences vis-a-vis Blitzy on the technical design that they want implemented, and then they're handing off that work to us. And every ounce of work is split between Blitzy, which typically does like 80 to 90% of the sort of quantum of work, and then what we call out that we need the human developer to do. But in the vision of the company, that remaining work, which is just like traditional configuration, like QA, that's a bug. That's a bug in the system towards the vision of the company, because software developers are problem solvers. They're engineers, they're problem solvers at day zero. And if we can have the world's smartest people working on problems on the edge, not worried about like packaging compatibility or QA, we've done a great service to humanity.
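Here is a minimal, hedged sketch of that dependency-driven scheduling idea: tasks whose dependencies are all met run in parallel as a batch, and anything whose dependencies are unknown falls back to sequential. This is the generic graph-layering pattern, not Blitzy's planner, and the task names are placeholders.

```python
# Minimal sketch of dependency-driven scheduling: tasks whose dependencies are all met
# run in parallel as a batch; anything whose dependencies are unknown falls back to
# sequential. This is the generic graph-layering idea, not Blitzy's planner.
def schedule(tasks: dict) -> list:
    """`tasks` maps task name -> set of dependencies, or None when the system isn't sure."""
    done, remaining, batches = set(), dict(tasks), []
    while remaining:
        ready = [t for t, deps in remaining.items() if deps is not None and deps <= done]
        if ready:
            batches.append(sorted(ready))            # safe to run in parallel
        else:
            batches.append([next(iter(remaining))])  # unsure or cyclic: one at a time
        for t in batches[-1]:
            done.add(t)
            remaining.pop(t)
    return batches


if __name__ == "__main__":
    plan = {"schema": set(), "auth": set(), "api": {"schema", "auth"},
            "ui": {"api"}, "docs": None}
    print(schedule(plan))  # [['auth', 'schema'], ['api'], ['ui'], ['docs']]
```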

Nathan Labenz: I'm going to come back to the developer experience maybe in a few more minutes, I suppose. Let's talk about the economics a little bit more though, because on the website, there's this 20 cents per line component to pricing. And you can kind of complicate that. I think there's like kind of a base buy-in level and then, you know, 20 cents per line beyond a certain level or whatever. But that strikes me as creating possibly an interesting tension for you, where you have now said, okay, this is what we're going to charge you. But then you also just said, I'm willing to spend, you know, every incremental API call necessary to maximize value. So is there just enough headroom under $0.20 that you don't mind kind of bumping up and down? Do you ever have projects where you have to go to the customer and say, hey, actually I need kind of $0.25 a line, but it'll be worth it because we're going to do that much more with Opus here and it's going to make it better or whatever? How did you come to that $0.20 and how safe of a line in the sand has that proven to be for you?

Brian Elliott: Yeah, if we have to increase prices, we will. That's our going-in point, though not in the middle of an active contract engagement. If we have to dramatically increase compute to get to 100% autonomy, we'll do that, and our customers will coast off of that for the duration of their contract, and then we'll have to right-size it. I'm not necessarily worried about the gross margin at day zero versus the value created at day zero. So yes, it's an attractive business today, absolutely, but that doesn't matter. What matters is that the amount of value left to be created is so high that if we close the gap from 80% of the work completed autonomously to 99% of the work completed autonomously a year from now, the net new customers are going to be more than happy to pay more money, because they're able to do so much more with the same amount of people. And as you think about the delta of value creation, you always want to push as hard as you can on value creation, because the market size for software development is something like 1.2 trillion dollars in labor, but it's an infinitely expanding market: software exists to solve problems and drive productivity for its customers. If you're telling me we're out of problems to solve with software, that's where I don't believe you. So our market size is capped only by the problems that can be solved with software. The goal is to get to 100% autonomy and, on the way, get to 80%, 85%, 90%, 95%, which is incredibly and deeply valuable for the enterprise that wants to move fast, and they're thrilled with this level of autonomy today. You can't let a short-term pricing decision dictate the technology decisions when the value creation is so high.

Nathan Labenz: Yeah, that makes sense. You mentioned going from 80 to 99% completion. Maybe taking one step back from there: when you get a new customer, how do you know if this is going to be an easy or hard engagement, and what do you have to do? In the text-to-SQL world, this comes up all the time, even in relatively small-scale settings: it's one thing to look at the schema and be able to write valid queries against it, but it's another thing when there are three different columns in a table that are, whatever, variable 1, 2, 3, and which one am I supposed to be using, and why do these exist, and how do they differ in meaning? So I imagine you must come into a lot of different environments where sometimes there's great documentation and it's reasonably clear what's going on and what you need to do, and other times probably not so much. Do you have a process for identifying what is genuinely ambiguous and potentially only exists in the heads of the employees at the company? Do you have an AI agent interview those people to extract that information? What does the human side of onboarding look like?

Brian Elliott: Yeah, it's a great question. The typical enterprise has very little documentation and very little test coverage, so those are the first things we actually look to address with Blitzy. The awesome part is that by addressing documentation and test coverage with Blitzy, you automatically increase the effectiveness of all the AI codegen tools in your stack, and we highly recommend you have the individual developer productivity tools as part of that stack. So you get super fast time to value as you're getting implemented. As you ingest a code base, there's an opportunity to provide what documentation you do have. What's super helpful here is domain-specific information, domain as in: I'm in finance, and when we say this in our code comments, this is roughly what it means. Then there's an iteration process at ingestion where we provide you a spec: here's everything as we understand it today within your code base. Everything that can be surmised technically will be technically accurate. But for the product portions, where we're expressing what you're trying to achieve, there's an iteration period. We say, here's the blank slate without any additional information, and now let's provide the system that information. Rather than starting at zero, as in "tell us everything we don't know," it's: here's what the system can technically understand, all of your dependency diagrams, all of the variables classified technically correctly; now figure out what additional context we should have from a product perspective, provide us that, and then we're off and running. That is the process to get to truth from a spec perspective. But you also have to remember that the spec is the human-readable abstraction of the truth, and because in context we're always using the actual source of truth, going back to the source code and pulling it into memory just in time, the product spec can be a little bit inaccurate at the end of the day. If you're moving from, say, C to Rust, it doesn't necessarily matter whether this thing in the spec, which is defined to be human-readable, is exactly precise, because we can go and run the application and then mirror the exact effects on the other end in the case of a language translation. Now, when you're doing product development, and half of our business is large-scale modernizations and refactors, half is steady-state product development acceleration, that's when you want to be a little more prescriptive and a little more precise, because the system will be using your product expression, the "hey, we're doing this in finance," to go and make further decisions.

Nathan Labenz: So then when you're going from zero to 80% plus of the work being done, are there moments when the system loops in a human on the Blitzy team and says, hey, I need help with judgment here, or I think this is a question we should be able to get answered? Or is it literally, from that go time to 80% plus, fully autonomous with no...

Brian Elliott: Go to pull request. It's from go to pull request. It would be an impossible task to try to insert a human into this process; the way you would do this with agents at scale just doesn't work. The only thing that stops us, and it's at the beginning of the process, is if we are missing an environment variable or something from a configuration perspective to actually build and run the application. And you could find that out later: you could build, run, and then try to add some net new piece of functionality that calls a service you didn't need to run the initial application. In that instance, it'll notify the customer and Blitzy and say, hey, we need access to this service, and this wasn't part of implementation or setup. But from spec to pull request, it's all agents, and it sort of has to be.

Nathan Labenz: Yeah. Tyler Cowen rings in my ears all the time: you are the bottleneck. So no doubt you get it. Of course you'd have to keep that to a relative minimum, but it's interesting that you have basically zero.

Brian Elliott: I mean, the point is, if the system can't do something, it goes onto the human report for the enterprise. We don't have to be 100% out of the gate. We pass unit tests, integration tests, end-to-end tests, we do all of that, but whatever remaining work there is becomes part of the ultimate report that goes out to be completed by the team. And that's an awesome use of Claude Code and an awesome use of Cursor. People pull that report down, they go deep on whatever edge case Blitzy couldn't solve, they get that ready to go into production, they go to QA, they go to merge, and then they start their next sprint with Blitzy. So the system is designed to account for the fact that we want to accurately do as much work as we can and then say, great, the human pickup is on the back end of this pull request.

Nathan Labenz: So what is that last, up to 20%, today? You've mentioned edge cases just now. Is that the bulk of it? Just unanticipated scenarios that were ambiguous or otherwise problematic that get kicked back to the humans, not so much because of code, but because of missing judgment that wasn't supplied up front?

Brian Elliott: Yeah, so it's typically items that we think were not captured in the testing strategy. Maybe as a double-click in: anytime Blitzy touches any file, we're doing unit tests before and after. As we do clusters of work, we're doing integration tests between services, and then we're doing end-to-end tests at the end. And there will be some instances where, let's say, we passed 73 of 75 tests, and for whatever reason Blitzy would change things to fix item one and it breaks item two, then change things in item two and it breaks item one. So the system will say, great, we're 73 out of 75 from an end-to-end testing perspective, these are the files we keep going back and forth between, and you need to go in as a human and figure out where there is conflict between these two services, because our system has gone back and forth so many times. It's funny, sometimes the task is an impossible one: you're asking for two contradictory things in your spec, and this is one way to prove that you're asking for opposite things. Sometimes it's configuration stuff. Sometimes it's just QA work. So as part of the report, we'll break down the tasks remaining, the estimated hours to complete those tasks for the human teams, and who would be responsible from a functional skillset perspective. So it's some testing strategy we didn't get to 100% on that we can align on, and then a plan for code review and QA; that's really what's included in that final 20%.
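Brian doesn't say how the system decides it is flip-flopping between two fixes, but one simple way to detect that pattern, purely a hypothetical sketch rather than Blitzy's actual logic, is to track which tests toggle between pass and fail across repair attempts and flag the ones that keep oscillating:

```python
def find_oscillating_tests(history, min_flips=3):
    """history: list of per-attempt results, each a dict test_name -> bool (passed).
    Returns tests whose pass/fail state toggled at least `min_flips` times,
    i.e. candidates for the 'fix A breaks B, fix B breaks A' loop described above."""
    flips = {}
    for prev, curr in zip(history, history[1:]):
        for test, passed in curr.items():
            if test in prev and prev[test] != passed:
                flips[test] = flips.get(test, 0) + 1
    return sorted(t for t, n in flips.items() if n >= min_flips)

# Example: two tests keep trading places across four repair attempts.
attempts = [
    {"test_a": True,  "test_b": False},
    {"test_a": False, "test_b": True},
    {"test_a": True,  "test_b": False},
    {"test_a": False, "test_b": True},
]
print(find_oscillating_tests(attempts, min_flips=2))  # ['test_a', 'test_b']
```

A real harness would attach the files touched in each attempt so the report can name the services in conflict, not just the tests.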

Nathan Labenz: So how do you get to 99%? Because when I hear that description, it sounds less like something a new model is going to be able to handle and more like people just aren't that, you know, maximalist, I guess, in terms of really defining what it is they want.

Brian Elliott: We have large customers that will get a Blitzy pull request and still go through a dual-review PR strategy, and I recommend that whatever your QA process is, you should continue to do that for the foreseeable future, if for nothing else then for regulatory purposes. But they won't touch a line of Blitzy code, and they'll press merge. Those customers are unbelievable at expressing intent and doing spec-driven development. For the large majority of customers that are not as far along that curve, what they'll do is express intent, get a spec, get code back, and then at the code step they'll realize, oh, I didn't consider this outcome, even though it was maybe expressed in the spec, because I'm moving so quickly. So we had to build into the product the ability to refine further from there, from the Blitzy platform, because people today aren't used to making two or three months' worth of decisions up front. You used to be able to get to month two and then figure out the nuance between month two and month three. Now, once you get into the code, you realize, oh, I didn't express this implementation the way I would have preferred, and it was hard for me to conceptualize what that would look like between spec and implementation. They can just go back, refine that existing pull request, and provide their updated guidance: hey, actually, on the implementation of this portion, I want to use this approach. Then it'll run for a much shorter amount of time and adjust the existing PR to their preferences. This just has to do with existing patterns of behavior today. What we see is that folks go through this flow as they're getting familiarized with Blitzy and refine that larger amount of work once or twice, and then they naturally start to get really good at expressing their intent, or at identifying it at the spec stage, because they're building the muscle of being a systems-level thinker, a systems-level architect, and getting all of that implemented.

Nathan Labenz: So what room for improvement is there on the models? It seems like what you're describing is that models could get better, but it's really the humans that need to get better at expressing what they want for you to drive that completion number up toward 100%. Would model improvement then just translate to even faster execution and even cheaper total inference cost? Or are there still things you would highlight where, yeah, models are not that good at this, and it actually would be really helpful if they were better at it?

Brian Elliott: Yeah, we ultimately want more intelligence. Cheaper is fine, cheaper is fine. But think about the instance I walked through, with the different end-to-end tests going back and forth, failing, as the code was recursively going back, running the application, trying to fix it. Today our system will just say: those two things happened, go look at it, human. We're stuck. If you had more raw intelligence, it could very prescriptively say, hey, this is exactly why this is happening, here's the trade-off decision you need to express to us, which one of these routes do you want to take, and I can go and implement that. And when the trade-offs themselves, which are complex, can be understood by the model itself, it could come back with two different pull requests, both with the end-to-end tests fully passing, and say, hey, I took trade-off one here and trade-off two here, and those are the only two logical trade-offs you could have made, as opposed to: I couldn't solve this problem, over to the human. So we want more intelligence. It's going to allow us to go further in these situations, and it'll allow you to be less precise at the spec stage, or to not have to be so forward-looking in your technical design.

Nathan Labenz: Yeah. Okay. Interesting. Seems like we're pretty close though.

Brian Elliott: We're closer than people think.

Nathan Labenz: Yeah. Not that many more special requests there.

Brian Elliott: Hey, Sid, we've gone pretty deep. I was going to have you join in if there were some nuggets you wanted to drop in here, but just as an introduction: Sid was a prolific inventor at Nvidia. He's been thinking about building large-scale software systems since he was a little boy, actually; he's got great stories about taking computers apart and building software when he was a little kid. And he's really the inventor of a lot of the core, really all of the core technology here at the company: the large-scale context engineering system that unlocks the ability for us to understand 100-million-plus-line code bases, and the long-running compute orchestration systems.

Nathan Labenz: Sid Pardeshi, welcome to the Cognitive Revolution. So, boy, yeah, we have covered a lot of ground, and Brian has done a great job of explaining a lot. I was just going to go next to strange behaviors from language models. This is a theme of my life, of this feed, whatever: I am extremely enthused about AI, I love what it can do for me, I experience incredible productivity gains all the time, and I also pay reasonably close attention to research that shows all kinds of emergent, surprising, and sometimes, in my view, quite scary bad behavior from language models. One big question, of course, in the big picture, is to what degree we can successfully get AIs to monitor the work of other AIs and get to a point where we can be confident in the system overall, even if some of the models, some of the time, are doing something we wouldn't want them to be doing. So I'm interested in what you've seen there. QA is one dimension of it, just catching bugs, catching mistakes. But then there's also, famously, I think Claude 3.7 was maybe the high watermark of this, writing unit tests that would just return true and always pass even when the core objective obviously had not been met. How would you describe the trends there? I assume it's improving, but how much have you seen that sort of thing improve? And what have you done, and how well has it worked, to get AIs to detect those kinds of problematic behaviors in one another? Because at the end of the day, you want to deliver something to customers that doesn't have these fake unit tests.

Sid Pardeshi: Right. You've really described two patterns there. One is strange behaviors from the LLMs and how to control them, and one is the LLM-as-a-judge philosophy. We've been super early with LLMs as a judge. One interesting bit you described was getting LLMs to correct each other's work. What we've seen is that LLMs definitely have some peculiar behaviors given the conditions. Assume everything's constant: constant temperature, top-p, top-k, whatever parameters you're using to influence behavior, and constant prompts. If you gave two different LLMs the same situation or condition, and let's assume both are following the best prompting guidelines of each vendor, OpenAI and Anthropic for example, you may get different reactions. Take SWE-bench Verified, a very popular leaderboard: you get different scores even though the problems are very similar in nature, and there are different problems that Anthropic fails on versus what OpenAI fails on. But if you go to a real-world situation where you have a lot more ambiguity, what you will see is that if you run the same situation even through Claude multiple different times, you may find that it comes up with a different resolution each time. For example, take an ambiguous situation where there's only one way to solve it correctly. If you run it five times, it may be that Claude is able to solve it correctly one or two times, and the approaches it took are slightly nuanced or different each time. That is because of how the transformer architecture works. These are sequence-to-sequence models; they're generating the next set of tokens to answer the question, and they may end up sampling different parts of the space. That's one way you end up with a difference. Or they may just end up taking a different trajectory: maybe in the correct run it used the tool correctly and wrote a more elegant search query to find what it was looking for. Because these are probabilistic models, at any point in time there's a probability that the LLM lands on the right tool and uses it correctly. That's why you have these differences. And LLM as a judge is definitely effective. The way to make it effective, that we've seen, is to use two dissimilar models to evaluate each other's work, because then you're not just adjusting for these probabilities: because of the inherent architectural differences, not at a very deep level, but, say, GPT 5.2 is built a lot differently, it has a different set of parameters, a different size than Opus 4.5, it may take a different trajectory, it may use tools differently. By doing that, you have increased the chances that collectively they land at the correct answer, which solves the problem. So that's where LLM as a judge is an important part of landing at the correct answer. But let's talk about the strange behavior aspect you mentioned, because that's really interesting. We've been very deep into the Claude family of models and OpenAI. For example, one interesting behavior the o-series of models from OpenAI had was that they were very reluctant to use tools. These were the first, earliest reasoning models, but they did not like to use tools.
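As a rough illustration of the dual-judge pattern Sid describes, two dissimilar models reviewing the same candidate change and escalating when they disagree, here is a minimal sketch. The `call_model` wrapper, provider labels, and PASS/FAIL protocol are assumptions for illustration; real calls would go through the respective vendor SDKs.

```python
JUDGE_PROMPT = """You are reviewing a code change against its requirements.
Requirements:
{spec}

Diff:
{diff}

Answer with PASS or FAIL on the first line, then a short justification."""

def judge(call_model, provider, spec, diff):
    """call_model(provider, prompt) -> str is assumed to wrap a vendor SDK,
    e.g. one wrapper for an OpenAI model and one for an Anthropic model."""
    reply = call_model(provider, JUDGE_PROMPT.format(spec=spec, diff=diff))
    first_line = (reply.strip().splitlines() or [""])[0]
    return first_line.upper().startswith("PASS"), reply

def dual_judge(call_model, spec, diff, providers=("openai", "anthropic")):
    """Cross-check with two dissimilar models; disagreement means escalate
    (regenerate, add context, or hand off) rather than trust either verdict."""
    verdicts = [judge(call_model, p, spec, diff) for p in providers]
    if all(ok for ok, _ in verdicts):
        return "accept", verdicts
    if not any(ok for ok, _ in verdicts):
        return "reject", verdicts
    return "escalate", verdicts
```

The value of the pattern comes from the models' dissimilarity: correlated failure modes within one model family are less likely to slip through when the second judge samples a different trajectory.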

Sid Pardeshi: If you ask the model to search the code base to come up with an answer to something, you'll find it jumping to conclusions without doing thorough research. So that was a problem with the earlier series of models. But look at the latest OpenAI models like Codex, or even GPT-4, which was still active at the time of o1 and o3: GPT-4 was by far the best model when it came to tool calling. We repeatedly provided feedback to Anthropic that GPT-4 outshone Claude 3.5 by a mile. The best thing about Claude 3.5 Sonnet was that it used tools really well, but it was nowhere near as powerful or efficient as GPT-4 and GPT-4o at tool calling. As time went by, that changed quickly: Sonnet 4, Sonnet 4.5, and even 3.7, though not to the same extent, were really good at tool calling. The problem with 3.7 was that it was overeager, so it made a lot of mistakes when calling tools, leading to tool schema errors, and if you didn't validate that correctly, it could cause all kinds of issues in your application. But they quickly fixed that with 4 and 4.5. The most interesting strange behavior with these models, though, is that they tend to give up as soon as they hit what I like to call context anxiety. Anthropic says, and this applies to OpenAI too, that these are much larger context window models. For example, GPT-4.1 introduced 1 million tokens, if I'm not mistaken, but the documentation clearly said that if you exceed 200K tokens, you may experience different behaviors: the request will take longer and the quality may not be as good. For Sonnet, even though it says it has a 1-million-token context window, you will notice marked differences in behavior the moment you exceed about 100K or 200K tokens. It's not just about the price, although Anthropic does charge you differently once you exceed that. What you will see is that if you're working on a complex problem, the model will tend to give up. It will say things like, okay, because I have these time constraints... what time constraints? I never told it that it has to finish in an hour or ten seconds. I just gave it a problem and expected it to solve it, but the model brought in the concept of time and said, because I have these time constraints and I have been working on this for too long, and by the way, "too long" was just ten minutes, I have to wrap up now and give a final response. And it gave an incomplete response. Then there's context pressure: this seems too complicated, let me take a simple approach. That's where you get the behavior you mentioned: let me return true. And let's see, this solves all of the requirements. You said I should not have any bad code, check. I should not have overly verbose code comments, check. I'm just returning true, and the test will always pass, check. So it justifies to itself that its decisions are correct, even though what it's doing is blatantly wrong relative to the user's original instructions. These are due to external factors that the model providers are implementing. So when we experience this, we solve it in our own ways. There are a number of ways to prevent these issues, one of them being the obvious one, which is prompting.
But we also report these to Anthropic, and Anthropic has actually fixed them. Sonnet 4.5 had this issue, but Opus 4.5 does not. It has other kinds of issues, and as an application builder you're constantly solving for these issues in production. Different labs, different model providers, they all have different vectors along which they will effectively fail for any given use case.

Brian Elliott: And from an overarching perspective, in information theory they call this concept entropy. The outcome of a probabilistic system has high entropy, and LLMs are probabilistic systems. So the purpose of the system at the application layer is to reduce entropy to get to reliable outcomes. The techniques we're describing reduce entropy to get closer to a desired truth.

Nathan Labenz: I love that you mentioned entropy, because I was just thinking, Sid had mentioned temperature, and that got me thinking back to my early LLM-based application development days, when that was a huge lever I would mess with depending on what...

Brian Elliott: You were a high temp guy, I can tell.

Nathan Labenz: It depended on the use case, you know, but certainly sometimes. These days, it seems like some of the APIs have even removed temperature, and I certainly don't think about it nearly as much as I used to, so that tool to control entropy has kind of gone away. But I wonder what other strategies you have for perhaps progressively increasing entropy. This is something I talked about with the AI co-scientist team at Google. They said that in their system, searching through the scientific literature is the main source of entropy that they sometimes need to get off of a local maximum, or out of a local minimum, whichever way you want to think of it, and onto the next higher hill they can explore and climb. What do you do? I would imagine you want your first pass to be your best shot. In code applications, I used to turn temperature to zero; I figured I wanted the model's best guess first. But if that didn't work, maybe I'd turn temperature up. And there are a lot of different ways to turn temperature up: context-engineer a little bit differently, swap out to a whole other model, do a web search for some commentary on this problem, whatever, and then hopefully, with different inputs, eventually land on the right output. Long-winded way of saying: how do you ramp up the entropy as needed when the first, default answer isn't working?

Sid Pardeshi: Yeah, I would say the levers have changed, and that's a very helpful background, thanks for setting it. Let me add more color to it. In the beginning, you had temperature, and for code generation or any use case where you didn't need as much creativity, where you wanted to focus on getting the right answer rather than the most creative answer, the best-practice guidance was to bring temperature down to zero, or 0.1, 0.2, depending on the use case; different model providers had different guidance. But then as you introduced tool calling, with Claude 3.6 and GPT-4 and that generation, having temperature on top of tool calling created problems: you already have the ability to land on a different response because the model can take a different trajectory through tool calls, and then you have temperature influencing its behavior, its creativity, and that just created complications. But what really changed everything was the introduction of reasoning. When you had reasoning, starting with the o-series of models and then eventually with Claude, both OpenAI and Anthropic forced you to set temperature to one, which means you don't have any control over the temperature parameter. So the lever has changed from temperature to the thinking budget. You may have a 200K-token context window or a 1-million-token context window, and you have between zero and however many tokens of reasoning the model supports. Typically we've seen 32K for Opus and Sonnet and 64K for some others; for OpenAI models it's about 128K tokens. That's the reasoning budget: how much thinking the model is allowed to do before and in between responses. In the beginning, you only had one batch of reasoning before the model gave you a response, and then that was it, it went into its own trajectory. There were hacks you had to do to get the model to think while it was working, while it was calling tools. But then you got what's now called interleaved thinking, which is Anthropic's term for it, where the model thinks while making every tool call; it automatically thinks before making a call. And then there's a budget you set for the overall amount of thinking, how much of the context window it's allowed to use for thinking.

Sid Pardeshi: And then there are nuances around prompt caching, whether or not thinking invalidates prompt caching, how much of the thinking actually counts against the context window, and all of that differs between providers. But at a high level, the reasoning budget is the lever you have. If you allow the model to think for longer, you get higher-quality answers, because essentially what the model is doing while it's thinking is taking a stab at creating a response. What happens is: okay, the user's asking me to write code to do XYZ, let me take a stab at it, this is how I would write it. It writes the actual code, it reviews its own code, and this is all thinking; it hasn't written a single token of output yet. It's just thinking, and it says, oh, but I shouldn't do this because the user asked for that, and it goes through that process. By the time it has either exhausted its thinking budget or gotten a good enough answer to the user's request, it is ready to write the final response. Essentially, what you were doing earlier by setting temperature to zero and maybe running the response five times, maybe tweaking prompts, the model is now doing by itself, by default, and giving you a higher-quality response. And if you draw parallels to what actually makes code generation work: at a base case, Claude Opus 4.5 is a really good model for code generation that gets responses right in one shot. But that is the thinking model. The moment you turn off thinking, it drops five to ten percentage points even on SWE-bench, which is supposed to be one of the easiest benchmarks, and the responses are no longer that high quality. So the theory, or really the observation we can make, is that the models are getting better at test-time inference. They're getting more efficient at thinking. The system prompts that all the model providers are building into the models, which encourage the models to think before responding, seem to cover a wide spectrum of cases and allow for multiple things: higher-quality responses, depending on the use case, and also more guardrails and ways to safeguard against things like prompt injection or getting the model to say something malicious. There are multiple layers beyond prompting that are applied to achieve this. But we're definitely seeing that the performance gains we get from models are primarily driven by test-time inference along this trajectory of model improvements.
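For concreteness, this is roughly what moving the lever from temperature to a thinking budget looks like with Anthropic's extended-thinking API. The model id is a placeholder and parameter shapes can change between SDK versions, so treat this as a sketch rather than a reference implementation.

```python
# A minimal sketch using the Anthropic Python SDK's extended-thinking feature.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-sonnet-4-5",          # placeholder model id; check current docs
    max_tokens=20000,                   # must exceed the thinking budget below
    # Reasoning models force temperature to 1; the knob you actually turn
    # now is the thinking budget, as described above.
    thinking={"type": "enabled", "budget_tokens": 16000},
    messages=[{"role": "user", "content": "Refactor this function to ..."}],
)

# The response interleaves thinking blocks with the final text blocks.
for block in response.content:
    if block.type == "text":
        print(block.text)
```

Raising `budget_tokens` trades latency and cost for more deliberation; lowering it approximates the old "fast, low-temperature first pass."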

Brian Elliott: And Sid, maybe you could comment, I think it's worth having Sid comment, on the path from where we are today towards fully autonomous enterprise software development.

Sid Pardeshi: Yeah, so when we started and said we were going to bring fully autonomous software development, nobody believed us, because you had, what, tens of thousands of tokens of context window, and models could write 200 to 300 lines of code at a time, maybe a thousand lines, but it wasn't good. The code wouldn't compile, it wouldn't do what the user said, and the context window was still too small to cater to large enterprise code bases. And we're not really seeing that change. We've had 1-million-token context window models for a while; we've even had 10-million-token context window models. But the efficient frontier for the effective context window, if you don't want to deal with issues like context pressure, and if you always want code that compiles, works, runs, or eventually gets to that point, is still less than 100K tokens. So even though we've made a lot of progress on quote-unquote intelligence, models are more intelligent, they produce high-quality responses, you still have the problem of context. We've solved that, and a series of other problems, to make this work. Our perspective is that the folks getting the best results from something like Claude Code today are using tons and tons of techniques to achieve that. You have a CLAUDE.md that contains, let's say, the instructions. You have maybe a series of plugins you're using, MCPs. You have prompt templates and a number of other tricks. You're probably using Claude Code to get one output, then switching to Codex, getting it reviewed, and pasting that back in. The most elite AI users who are getting the 10x gains are doing a lot of hard work to make it happen. So have you really improved productivity? I would argue no, because you're still doing a lot of work to get that. You've changed what you're doing: you're not actually writing the code, but you're spending your time figuring all these tricks out. And every three months the models change and the prompting practices change, so you're relearning all of that, switching between Codex and Claude Code, and there's this constant struggle to make the model work for your codebase. Our vision has always been that you shouldn't need to do all that. Today it matters a lot whether you're using, say, Opus or some other open source model, but we're seeing open source catch up, so it's our theory that LLMs will become commodities. Regardless, the point is that you should be able to take your work, which typically lives in your project management tool like Jira or whatever it is, plan the work, and get a PR back that just works. It follows all the coding practices you outlined, it solves everything in your plan in detail, it takes into account your past, current, and future roadmap, it has the ability to fix merge conflicts if you have a very high-velocity team, it follows the specifications in your Figma, and it just works across your entire code base. It compiles, the unit tests run, there's good code coverage, there's evidence of testing. This is what you'd expect from a really good human development team.
These are the unsaid, or quite often very vocal, criteria for success that are set within the engineering org. That is what we've set out to build with Blitzy: PRs and high-quality code that just works. And we will spare no effort to make sure we get to the highest level, whether it's LLM as a judge, more test-time inference, or, in the future, maybe even test-time training to learn the specific preferences of the user. The goal and the vision, again, is code that just works out of the box without you having to do heroics to get it to satisfy the success criteria.
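Sid's "efficient frontier" point implies a packing problem: given candidate snippets ranked by relevance, fit as much as possible under a conservative token budget and defer the rest to just-in-time retrieval. A toy version, with a deliberately crude stand-in for a tokenizer, might look like this (not Blitzy's actual system):

```python
def pack_context(snippets, budget_tokens=100_000,
                 count_tokens=lambda s: len(s) // 4):
    """snippets: list of (relevance_score, text), higher score = more relevant.
    count_tokens is a rough stand-in; a real system would use the model's
    tokenizer. Returns the packed context plus snippets deferred for
    just-in-time lookup via tool calls."""
    packed, deferred, used = [], [], 0
    for score, text in sorted(snippets, key=lambda x: -x[0]):
        cost = count_tokens(text)
        if used + cost <= budget_tokens:
            packed.append(text)
            used += cost
        else:
            deferred.append(text)
    return "\n\n".join(packed), deferred
```

The key design choice, consistent with the conversation, is to budget well below the advertised context window so the model never operates in the regime where quality degrades.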

Nathan Labenz: That's a funny characterization of how work has changed. It certainly resonates with me. I don't code full time, but I've created many more applications in recent months than I ever used to, so in some sense I'm definitely more productive; I made three AI apps for family members as Christmas presents this year, for example. But it is definitely true that I'm always on, either hands-on or on Twitter looking for the latest tips and tricks. And it is striking that, for all of the labor-saving nature of the technology, the people getting the most from it are probably working as hard as or harder than anyone. Maybe that changes, maybe it just continues this way until the singularity, I don't know. I want to do a quick double-click on test-time training. This is obviously highly related to continual learning, which has been a big part of the discourse recently, and there have been some really interesting advances in that space with respect to much more contained puzzles, like ARC-AGI-type puzzles. We talked a little bit earlier, Brian and I did, about whether there's any point to using open source models, any point to fine-tuning. It sounds like today, basically, the reality is the frontier models are the best, you want to work with the best, you can't really fine-tune the best, and so it's usually not worth it. Kimi K2, or K2.5 I should say, just came out, and the community is obviously still digesting exactly where it stands. It does seem, as with a lot of Chinese models, to be a little bit... well, I don't think it's actually truly the best, which is what their benchmark graphics would have you believe. But I have used it a bit, and others seem to be reporting the same thing: it does seem to be really good, and the gap between it and whatever your favorite model is for your favorite use case seems quite small. So does this change the outlook? Whether fine-tuning is worth it would seem to depend a lot on the gap between what you can fine-tune and what you can't, and that gap seems to have narrowed quite a bit. So I'm wondering if you're thinking, oh, hey, maybe this changes the trade-offs or the analysis, and maybe we do want to get into that sort of thing now.

Sid Pardeshi: My perspective on fine-tuning has always been very classical, in the sense that you should only fine-tune if you have a very narrow use case where you believe that by fine-tuning you will get much better performance, and that the rate of that performance gain is significantly better than waiting another three months until the next series of models comes out. You also lose things when you fine-tune: you lose some of the model's ability to generalize. And it's not always a given that performance will increase when you fine-tune, because you don't necessarily have access to the original data set, and even if you did, you cannot really map specific parts of the data set to their influence on the model's behavior. That's why fine-tuning, especially when you don't have large amounts of data and a very clear niche use case that has hopefully been successful with a previous family of models, is always like drawing from a pack of cards. It's always a risky game. Now, if you talk about models and their ability to get better, there's another challenge, which is that Gemini and even OpenAI, for example, are very close to Anthropic in terms of their SWE-bench scores, and in some cases it has been shown that there are models that beat Anthropic on code generation in very specific use cases. Even then, in the real world, if you compare Gemini, OpenAI, and Anthropic, they are very different in terms of the code generation use cases you would want to apply them to. They're very distinct, even though they're producing similar-ish scores. The point really is that the current leaderboards we have are insufficient. There's a lot of test set leakage, and there's just broad insufficiency from the standpoint of generalizing to a typical use case. For example, a number of leaderboards rely on the opinions of humans: they'll give you A and B, both with code to solve a specific use case, and you're supposed to select which one you feel did a better job. Depending on my mood, I could have chosen either. If you don't define clear success metrics that would apply in an enterprise setting, you are not creating a very effective leaderboard, because the leaderboard then only captures perception. Maybe in someone's perception, writing a lot of comments is very helpful.

Sid Pardeshi: Oh, because I read the comments and then understand the code. In someone else's perspective, this is overwhelming: I cannot read that many comments when I'm trying to understand the code, it's just distracting. So leaderboard design is actually a complicated problem. And you mentioned ARC-AGI: the fun part is that François Chollet, the creator of that leaderboard, talks about how, when LLMs got to 70% plus on it, everyone said, oh, I guess AGI is here. But then he brought in ARC-AGI-2, which didn't really change the difficulty of the problems; it just had different problems of the same kind. If you were to give ARC-AGI to a five-year-old and then ARC-AGI-2 to a five-year-old, they would perform about the same on both. But an LLM that scored 76% on ARC-AGI-1 would not even score 20% on ARC-AGI-2 when it first came out. So even though you have massive gains in intelligence, and gains on paper on the leaderboard, from a real-world standpoint, just because of how LLMs work, you don't really have a change in the LLM's ability to learn something it is seeing for the first time. The highlight of ARC-AGI-2 is that these are problems that are different from what an LLM would have seen in its training set. They are not harder; they're just different. And there are two broad definitions of AGI. The academic definition that François Chollet is alluding to is the ability of the model to learn patterns it has not seen before and adapt to them on the fly, to apply its intelligence to a new problem and be able to solve it. The other definition of AGI, the more popular one I've seen floating around, is just human-level performance on a broad range of tasks. By definition and by real-world results, these are fundamentally different constructs, and the problem I see is that we've gravitated far more towards the latter and ignored the former. That is why I'm bullish on test-time training, because what test-time training promises is that if we detect a pattern the LLM is not familiar with, where it's not going to perform well, we can give it more context about solving that particular problem such that it does better and produces better results. Now, for problems in general, it's very hard to know whether you're getting to the correct answer, because you don't have a metric. With code, you can compile it and know whether you're on the right track, or you can define unit tests that you can execute to learn if you're on the right track. That doesn't apply to general scenarios. So specifically in the case of code, I'm bullish that you can implement test-time training in such a way that you improve the odds of getting to the correct answer. Even then, with many of the techniques, and I've read the papers on test-time training, we're not at the point where it's practical to implement them yet, but I definitely see that becoming a real thing in the next one to two years.

Nathan Labenz: Yeah, that's something I'm watching very closely as well. I think the two last things I want to talk about are security, briefly, because that's obviously a huge concern of enterprise customers broadly; they don't want to be importing a bunch of insecure code into their environment, and of course LLMs have a reputation for writing insecure code. The other thing I want to talk about, maybe in closing, is the labor market in light of all these changes, and that could also include who you are looking to hire and as much information as you're willing to share about your hiring practices. But on the security side: where are we today? What have you found to work, and do you think this problem is going away? I've seen some research suggesting that formal methods can be used both to validate code that LLMs write and as a reward signal that should get them to write far more secure code far more often anyway. So my sense is that, like many other claims of the form "LLMs can't reason, they can't do this, they can't do that," this is probably something we'll leave behind. But I know you've also had to build the best solution you can before the models themselves have been properly trained. So, all that to say: what's your view on the security of LLM-generated code?

Sid Pardeshi: Yeah, the first thing I would say is that it's a shared responsibility. There are many behaviors of the LLM that can be influenced and prevented at the training step itself. If you look at the reports that Anthropic, OpenAI, and Google put out when they launch a new model, they test against these behaviors, which could be getting the model to do something it should not be doing. For example, say I need a recipe to create a weapon. If I put that in as a prompt, hopefully the model does not respond with the actual answer. But what people have typically done to fool the model is frame it as an emergency situation, such that if the model provided the recipe it would save someone's life or make a positive change, trying to game the reward function that may have been defined for the model and get a response. Prompt injection is one of the ways they've been able to do that, and there are several other ways to jailbreak what the LLM can do. But ultimately it comes down to system design. For example, security considerations would be different for something like Claude Code, where you interact directly with the model, as opposed to Blitzy, where you have a plan, then you execute that plan, and Blitzy decides whether to honor the instructions and in what way to deliver the code. When you're not interacting directly with the model, the attack vectors change. That is one part. But specifically for the code generation use case: one aspect of security is causing harm or producing content that is not considered acceptable, and the mitigations there are that the models typically refuse to send you a response, or, if they don't, you have to set different kinds of guardrails depending on the system. In terms of the software itself, it could just be an outdated knowledge reference. Most models right now have, I believe, January 2025 as the knowledge cutoff, and a number of libraries got updated with security fixes after that date. So if your model did not look up the web when using an open source library, or did not realize that something is a bad practice in code because it was newly discovered knowledge and it did not search the web to learn that, it is likely that your LLM-generated code has those security flaws. But thankfully, as all things go in software, you have a number of ways to detect and prevent that as far as the software itself is concerned. One is having defensive tests within the code: if you know some of the attack vectors your application or product is vulnerable to, you can define tests, and you can use AI to create those tests, keep them in the code, and make sure your code does not have those flaws. Every time you run a job, you make sure the tests pass, you add more tests as needed, et cetera. Two is having tools that check against known vulnerabilities. There are a number of such tools; Sentry is one that comes to mind, and there are others that report vulnerabilities, CVEs, in the code, and then you can use AI to address those vulnerabilities. So in Blitzy, we run a pre-check to detect security flaws, and we address them before creating the PR so that you don't have to go through that process.
At a high level, because you have access to such tools, and different languages and frameworks have different sets of tools, you can give Blitzy the ability to check for them, and you can also do that with other tools. Code is just significantly easier to protect from security gaps. And I definitely believe that, from the standpoint of coding, we will have tools, or you have the ability to configure tools, that prevent security issues.
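The pre-check idea can be approximated in any CI pipeline: run a dependency scanner plus the project's defensive tests and block PR creation on findings. Here is a rough sketch for a Python project using pip-audit and pytest; the tool choice and the tests/security path are illustrative conventions, not Blitzy's actual stack.

```python
import subprocess
import sys

def run_precheck():
    """Fail fast if known-vulnerable dependencies or failing defensive
    tests are found, before a PR is ever opened."""
    # 1. Known CVEs in installed dependencies. pip-audit exits non-zero
    #    when it finds vulnerabilities (it must be installed and on PATH).
    audit = subprocess.run(
        ["pip-audit", "--format", "json"], capture_output=True, text=True
    )
    if audit.returncode != 0:
        print("Vulnerable dependencies found:\n", audit.stdout)

    # 2. Defensive tests encoding known attack vectors for this app
    #    (tests/security is a hypothetical directory convention).
    tests = subprocess.run(["pytest", "tests/security", "-q"])

    if audit.returncode != 0 or tests.returncode != 0:
        print("Security pre-check failed; blocking PR creation.")
        sys.exit(1)

if __name__ == "__main__":
    run_precheck()
```

Other ecosystems have equivalents (for example, npm audit for JavaScript), which is the "different languages and frameworks have different sets of tools" point above.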

Nathan Labenz: So this has been outstanding, and I really appreciate how much you guys have been willing to share. I'm going to take the transcript of this episode and turn it into a to-do list for my own personal AI infrastructure project and start implementing it. The last thing I want to talk about, for just a couple minutes in closing, is the effect all this is having on people. There was a paper, you probably saw it, that ended up being fake, but I think it was interesting in that it resonated, which was maybe the most interesting thing about it. Supposedly it was about materials scientists at some big company who had introduced AI and become more productive, but whose job satisfaction had dropped. Again, this turned out to be fake, but I think it was shared so much because it satisfied people's expectations, if nothing else. So I'm interested in how you see the role of the software engineer changing. Do software engineers like how it's changing? And then there's also, of course, this big question around junior developers: is the death of the junior developer much exaggerated? Are you hiring junior developers? What are you looking for in your hiring? If you want to tell us a little bit about what your comp looks like, that would be very interesting, but I understand if that's not something you want to talk about on a podcast. What do you think the impacts are? What's under- and over-hyped when it comes to impacts on the roles people have and the labor market more broadly?

Sid Pardeshi: I think of it from the standpoint of short-term and medium-term versus long-term. In the short term, the immediate term, what happens is that code is now a commodity. In the olden days, if someone had written a script to do something that was a very complicated or boring task, that script was like gold. You would pursue that developer, be friends with them, in the hopes that they would maybe share that script with you, a script they got after scouring through hundreds of pages of documentation and the raw experience of having done the thing numerous times. Now I can just go to Claude, prompt it, get a script back, and do the thing. But if I'm a junior developer, I won't be able to look at the script and know whether it would destroy my production database, or whether it would do what I'm expecting it to do, or whether it would produce unintended effect ABC. And that is the danger; that, to me, is really the difference between using AI well and not, if you can't tell that difference. So in the short term, the market is weighted towards senior developers, because you give a senior developer access to AI and they no longer have to go through the boring mechanical process of writing a lot of code or even copy-pasting a lot of code: just feed it to AI, get code back, review it, and get done with it. But as AI gets better, as the chatbots get better, as the models get better, as the tools get better at preventing unexpected, unintended outcomes, at understanding intent, and at writing code that satisfies the intent, what's going to happen, and this is already happening, is that mid-level developers perform at the level of senior engineers, just because code is a commodity. Mid-level developers have spent some time with the code, they know what a bad action looks like, they know how to take corrective measures, and they're still producing velocity gains. The advantages senior developers had were depth of knowledge, maybe speed, and the ability to understand how the system works. All of that you can now get from AI: you can connect Claude Code or Blitzy or any other tool to your code base, and you have an accurate understanding of what the code is like. There may be hallucinations along the way, but that's changing quickly. Speed: you cannot beat AI on speed. Connect Cerebras to some model and you're going to get very fast tokens, and even the labs at baseline are fast; Claude 4.5 is really fast. So you cannot beat the models on speed. And the knowledge bit: if the model is intelligent enough, like I said, to understand the intent, you're going to solve that problem as well. So because of that, I believe that in the medium to longer term you will have junior developers who are far more valuable, in the sense that they are cheap to hire, and there are a ton of them now doing computer science degrees

Sid Pardeshi: and are not going to be employed just because the rate at which enterprises are hiring has gone down, and in the short term, they're favoring more senior talent. But these developers, assuming they upskill on AI, continue to remain in the industry using the tools, they're now going to be much better at getting work done. So the talent, as it ages out, is going to be replaced by more junior developers. So that's a theory I have. Now, in terms of hiring, we've hired senior, junior, and mid-level developers, and we have a mix of them. They're obviously doing different things. The challenge we have is as a startup, we need to produce a lot of code quantity, right? And it has to have quality. Time is a very critical factor, right? So for us, we have, we obviously have shared the bias to initially hire a lot of senior developers. But what we quickly realized is for tasks that don't really require senior developer input, it's not a large code base, it's not really a cutting edge technology, right? It's something that is well known, for example, running Blitzy on a leaderboard, right? writing scripts that automates that process. We have hired as interns, high schoolers last summer to do this, right? And we have junior developers who are research engineers that are using, you know, Blitzy to run this. They are using AI tools to run all these operations. And they are, we can hire them at a very favorable compensation, right? And that's going to be in assets. So just because the market is really flipping on its head, The expectations in terms of salaries for software developers, unfortunately, is going to go down. The junior developers who know AI, they don't have to unlearn, right? Biggest challenge with some of the more senior folks is that they have to learn to trust AI. And the biggest hesitation for any senior developer who's been around long enough is that I can't trust anything else other than myself, right? If I don't write the code, I can't trust it. And that's like this psychological hurdle, I would say, that the senior developers have to adapt to AI. The ones who do adapt are going to be immensely successful, but then there's going to be that challenge. And that, I believe, is a gap that the mid-level developers, once they've known enough, and the junior developers will fill, especially because of the favorable cost equation. and then you asked about you know the salary ranges so you know we've we have a number of open positions and the salary range is anywhere between 100k to 300k right from a from a cash standpoint and equity is separate in that discussion and there's always room for us to pay more for the right talent and it's it's interesting how the definition of right talent has changed typically you paid more for someone who had many years of experience and has built many systems. But now if you were to run a hackathon, you'd be very surprised as to who is actually winning that hackathon. You have high schoolers who are extremely adept at using tools at prompting and often, you know, a good prompt and a good tool can beat out what a senior engineer can do in the same span of time, right? Especially if you're talking about greenfield development. Hands down, someone with few years of experience can do a lot better just because of the psychological gaps, right? 
But if you're talking about legacy enterprise software, where you have to check a lot of boxes and you need a lot of experience, where you think something is right and only realize it isn't after being bitten by doing it wrong, that's a space where senior engineers will continue to thrive.

Nathan Labenz: I love it. That was a great answer. And again, I appreciate how much you've been willing to share. Outstanding conversation. I'm looking forward to getting under the hood with Blitzy, and this is certainly a space we will continue to watch closely. For now, Brian Elliott and Sid Pardeshi, CEO and CTO at Blitzy, thank you both for being part of the Cognitive Revolution.

Sid Pardeshi: Thank you.

