Flow Engineering and Code Integrity at Scale with Itamar Friedman, CEO of Codium AI
Nathan and Itamar Friedman discuss the future of AI in coding, code integrity, and flow engineering with Codium AI's innovative approaches.
Watch Episode Here
Video Description
In this episode, Nathan sits down with Itamar Friedman, CEO of Codium AI, the company on a mission to create the code integrity paradigm. They discuss harnessing LLMs for code integrity, task decomposition, flow engineering, and much more. Try the Brave search API for free for up to 2000 queries per month at https://brave.com/api
X/SOCIAL:
@labenz (Nathan)
@itamar_mar (Itamar)
@CodiumAI (Codium)
LINKS:
Codium: https://www.codium.ai/
Cognitive Revolution (new feed): https://cognitiverevolution.ai/
SPONSORS:
Oracle Cloud Infrastructure (OCI) is a single platform for your infrastructure, database, application development, and AI needs. OCI has four to eight times the bandwidth of other clouds; offers one consistent price instead of variable regional pricing; and, of course, nobody does data better than Oracle. If you want to do more and spend less, take a free test drive of OCI at https://oracle.com/cognitive
Omneky is an omnichannel creative generation platform that lets you launch hundreds of thousands of ad iterations that actually work customized across all platforms, with a click of a button. Omneky combines generative AI and real-time advertising data. Mention "Cog Rev" for 10% off www.omneky.com
The Brave search API can be used to assemble a data set to train your AI models and help with retrieval augmentation at the time of inference, all while remaining affordable with developer-first pricing. Integrating the Brave search API into your workflow translates to more ethical data sourcing and more human-representative data sets. Try the Brave search API for free for up to 2000 queries per month at https://brave.com/api
ODF is where top founders get their start. Apply to join the next cohort and go from idea to conviction, fast. ODF has helped over 1000 companies like Traba, Levels, and Finch get their start. Is it your turn? Go to http://beondeck.com/revolution to learn more.
TIMESTAMPS:
(00:00:00) Episode Preview
(00:06:32) The evolution of AI in software development: beyond code generation
(00:09:51) The shift towards code integrity and its impact
(00:14:23) Unveiling AlphaCodium: A leap in AI-assisted code integrity
(00:16:53) The broader impact of AI on software development processes
(00:17:11) Sponsor break: Oracle | Omneky
(00:18:52) The future of AI in coding: enhancements, testing, and more
(00:24:41) AI's role in enhancing code quality and developer efficiency
(00:42:31) Personal experiences and the quest for efficient code management
(00:43:38) Exploring the challenges of large codebases
(00:44:03) The impact of LLM context growth on code understanding
(00:45:19) Innovative approaches to data retention and context computation
(00:46:08) Deep dive into repo analysis and future directions
(00:48:26) Dynamic graph strategies and future features in IDEs
(00:49:10) From code testing to advanced flow engineering: a new paradigm
(00:52:37) Unveiling AlphaCodium: A leap in coding challenge solutions
(01:07:31) Reflecting on workflows
(01:23:30) Final thoughts and recommendations for AI enthusiasts
Full Transcript
Transcript
Itamar Friedman (0:00) I'd say that very soon, even for sophisticated programs, we will see, like, higher quality coming from AI. Even when we're in a hurry, you don't just think, here's a problem, here's my code. Usually it's, okay, here's the problem, let's think a bit about how I wanna design it. Even if it's not too much, you do a few steps. It doesn't make sense that we will just ask a model with, like, a prompt, generate this for me, even if it's super intelligent. Let's let it think the same way we do it.
Nathan Labenz (0:22) It doesn't seem intuitive to me that you would ask an AI to do something, then immediately take its solution and ask it to critique its own solution. And yet it works.
Itamar Friedman (0:31) 1 day, we would want the AI to be a team member. A virtual team member cannot just write code without testing it, without understanding the deeper context. A team member needs to be on the entire 3 parts of pre-build, build, and ship.
Nathan Labenz (0:45) Hello, and welcome to Turpentine AI. What's that you say? Turpentine AI? Well, we've got some news for you. After a year and 100 episodes of The Cognitive Revolution, Turpentine is doubling down on AI. We're creating a new dedicated channel for The Cognitive Revolution. Turpentine will be spinning up new AI-focused feeds as well, and this feed, soon to be known as Turpentine AI, will become a shared platform featuring best-of content from multiple shows and voices. We'll be making this transition gradually over the next month or so, and I'll be posting original content to both the new Cognitive Revolution feed and here on the Turpentine AI feed. So if you wanna keep up with all of the latest, definitely make sure to subscribe to the new Cognitive Revolution feed, where I've just posted part 2 of my recent appearance on the 80,000 Hours podcast as the first full episode. You can find the new feed at the same URL, cognitiverevolution.ai. And there, you'll find links to the new YouTube channel and the Apple Podcasts and Spotify feeds. And for those of you like me who use an old-school podcast app, the bare RSS feed as well. Please do take a second to make sure you are subscribed to both feeds, and stay tuned to both channels for lots more original AI content coming soon. Today, my guest is Itamar Friedman, cofounder and CEO of Codium AI, a company that's on a mission to make code integrity simple. Now for context, at this point, it's broadly understood that large language models make useful coding assistants. Microsoft GitHub Copilot was the first commercially successful implementation of a large language model for developers, and it's already been 4 months since Microsoft announced that it had become a $100,000,000 business all on its own. However, code generation and code completion, while they make for amazing and highly inspirational demos, support just 1 part of the software development life cycle.
Before developers can write code, they must work with teammates to figure out what to build in the first place. And after writing code, they have to test it and integrate it into broader systems and production environments. It turns out, especially as products and teams grow, that the coding itself is often a minority of the total work. And that's where Codium AI comes in. Focused on code integrity, Codium is meant to support the rest of the software development process, with a focus on explaining, testing, improving, and integrating code. These are areas where just about every team could stand to improve. And though the company is only 18 months old, it already serves hundreds of thousands of developers. In the first half of this conversation, we discuss Codium's product, starting with the value that I found in it, even as a solo prototype developer who doesn't really have to worry about production issues, and then broadening out to discuss how Codium supports larger teams and their correspondingly larger products. Then in the second half, we dive into their recent publication, AlphaCodium, which shows how careful task decomposition, that is, the process of breaking down larger projects that language models might struggle with into much smaller tasks that they can do far more reliably, ultimately enables successful workflow automation. The Codium team calls this flow engineering, short for workflow. And while they do use some techniques that are specific to software development, what stuck out to me most about this research is how readily generalizable and applicable it is to many other domains as well. The project is open source, and I definitely encourage anyone building AI-powered workflows to take a close look at their framework, prompts, and insights. Codium's work, as well as that of Google DeepMind, shows that large language models are already competitive with human participants in coding competitions.
It seems very likely to me, and, as you'll hear, to Itamar as well, that they will continue to improve and soon achieve elite performance levels. But even as that happens, and in some ways perhaps even more so then than now, it will be critical to ensure that AI-generated code aligns to requirements and performs reliably. And with that in mind, I think Codium is extremely well positioned for success, and I definitely encourage you to check it out. As always, if you're finding value in the show, we appreciate it when you take a moment to share it with friends. This 1, naturally, should go to the software developers in your life. Please don't hesitate to send us any feedback, guest suggestions, or questions. My DMs are open on all of the social media platforms. And remember to visit cognitiverevolution.ai to subscribe to the new Cognitive Revolution feed, where you will find exclusive episodes starting this week. Now I hope you enjoy this conversation about harnessing large language models for code integrity with Itamar Friedman of Codium AI.
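The flow-engineering idea described above, decomposing one large generation task into a pipeline of smaller, more reliable steps, can be sketched roughly as follows. This is an illustrative outline only: the stage names are loosely based on the public AlphaCodium description, and `call_llm` and every prompt below are hypothetical placeholders, not Codium's actual implementation.

```python
# A minimal sketch of "flow engineering": replacing one big prompt with a
# fixed pipeline of small steps, each easy for the model to get right.
# Stage names are loosely inspired by the public AlphaCodium write-up;
# call_llm and all prompts are hypothetical stubs, NOT Codium's code.

def call_llm(prompt: str) -> str:
    """Stand-in for a real LLM API call; returns a canned string here."""
    return f"<model output for: {prompt.splitlines()[0][:40]}>"

def flow_engineering(problem: str) -> str:
    # Stage 1: self-reflection -- restate the problem before any code.
    reflection = call_llm(f"Restate this problem as explicit goals:\n{problem}")
    # Stage 2: derive extra test cases up front (the code-integrity step).
    tests = call_llm(f"List edge-case tests for:\n{reflection}")
    # Stage 3: draft a solution, then iterate it against the tests.
    solution = call_llm(f"Write code satisfying:\n{reflection}")
    for _ in range(3):  # small, bounded repair loop
        verdict = call_llm(f"Do these tests succeed?\n{tests}\n{solution}")
        if "fail" not in verdict.lower():
            break  # tests pass (as judged by the model) -- stop iterating
        solution = call_llm(f"Fix the code for the failing tests:\n{solution}")
    return solution
```

The point of the structure, as discussed in the episode, is that each stage is a far easier ask than "solve the whole problem in one shot," so the overall pipeline is more reliable than any single prompt.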
Nathan Labenz (5:44) Itamar Friedman, founder and CEO of Codium AI. Welcome to the cognitive revolution.
Itamar Friedman (5:49) A pleasure to be here. Thank you for inviting me.
Nathan Labenz (5:52) We've got a lot of ground to cover. You guys have built a very interesting product, which I've been playing around with over the last week, and you also have some cool new research out that is pushing the frontiers of what we can get out of language models, and I'm excited to dive into all of that. I guess for starters, you've started this company that focuses on code integrity as opposed to the somewhat larger space of just coding assistance or code generation. I'd love to hear kind of how you understand the impact of AI and large language models on the software development profession as it has already happened and how you came to focus in on this question of code integrity as the area that you wanted to build in.
Itamar Friedman (6:40) As I see it, and as I saw it also in 2022, which I call B.C., before ChatGPT, code completion and code generation, like code generation per se, give me code according to my request or something like a prompt, is a first use case, a very useful use case. By the way, it's very fitting for the technology of LLMs. It's not by chance that this is, I'll call it, the MDVP, the minimal desirable viable product, that was introduced to us, but it's the first of many of what we're going to see in the intelligent software development landscape, I mean, in, let's say, 3 years; I don't think it will even take 10 years. When I thought about it in 2022, let's say August, when I saw the first signs of Copilot actually working, I thought that what we're actually getting with code completion is getting your search inside your IDE, even a very smart search. Instead of Googling it 5 different times, and each time diving 5 different times into Stack Overflow, I get a transformed result in my code, and that's probably why we see a decline, a continuing decline, of Stack Overflow, etcetera. Eventually, this is just 1 of the tasks that we kind of hate doing, like starting to Google. But what about testing? What about code reviewing? By the way, each 1 of these tasks is also nice sometimes. I like to see what people write in Stack Overflow comments, right? It's also a matter of having a community. But in many cases, I hate testing, or I hate verifying that the code fits a spec, checking that I didn't miss any of the flows or features that exist in my Jira ticket or whatever, Monday ticket, etcetera. Basically, I think we're just seeing the early birds, the first signs, of how AI is gonna impact our software development life cycle, tools, etcetera. Then going back to 2022, again, B.C., before ChatGPT, I thought that code completion and generation were going to be commoditized. We actually, myself and my team, started in 2022.
We used the GPT-3.5 API before ChatGPT, and we knew all of this was coming, and we thought it was going to be commoditized. You probably know there are like 15 tools for code generation and completion. Copilot is the leading 1, but there are others. Someday soon, very soon, you'll even run a model locally and it will be good enough, etcetera. But again, when I look at 100% of my development time, this usually takes like 30% at most, and it's not the most painful part, to be frank again. So that's why we saw that we need to focus on integrity. To just wrap it up, why did we call it integrity and not, like, testing? Testing is the old way to verify your code intent. There are other ways to do that. For example, if I automatically create a specification from your code, and you did write a specification somewhere else, we can try to match them. Here we go: we check if your code is working as expected without going per se via testing. And by the way, in the long term, and you see signs of this in our product already, when I check your code, I also wanna check best practices and even culture stuff. I'm talking not only about culture or best practices of development, but even the culture of the company, and we might get to AI alignment later on this call; I can elaborate. So that's why we call it code integrity and not testing. That's where I thought the most impactful point was.
Nathan Labenz (10:18) 30% you said of software developers time is spent on the actual development of the code, like writing the sort of happy path. Is that what that essentially means?
Itamar Friedman (10:30) Yes. I actually think it's a good number. You know, I'm not talking about the individual freelancer working by yourself on some cool project. I'm talking about, like, a team of 50 developers, 500, etcetera, like, you know, in a real company, even a hyper-growth company that's, like, 4 years old; it doesn't have to be a 20-year-old or 100-year-old company. There are so many other things you're doing. By the way, some of them are in the software development regime: architecture, you know, like thinking about how to design your software, etcetera. But actually writing code, I think there is a lot of evidence that shows it's much less than 50%. And I think a few researchers, or research I don't remember well enough to quote, have it at less than 30, I mean, even much less than 30.
Nathan Labenz (11:16) It definitely does highlight a big distinction between, and this is like a theme in AI application development in general, right? Like, it's easy to create a demo, it's hard to get things to production. That is, you know, when people typically say that, or when I hear it, I typically sort of think of just the, you know, the last mile of the AI optimization of trying to get the, you know, the prompts to be reliable and, you know, the edge cases and whatever. Right? There definitely are challenges just in the AI part of the application, but now you're also kind of bringing to the fore the fact that real software that serves, like, large, you know, customer bases is typically developed by bigger teams with whole, you know, cascading processes, whether it's waterfall or agile. And, you know, there's obviously a whole other podcast that can, you know, dive into the fine-grained distinctions between all the different ways that people try to structure their software development, but it's a highly structured process, and people put a lot into figuring out what exactly they're gonna build, how they're gonna prioritize it, how they're gonna define it. Finally, we get to this, what turns out to be, you know, 30% of the work, that is what people actually, you know, tend to think about the most, of course, is, you know, that you're sitting there writing code. And then once you've got that done, now you've got all these other steps to finally, you know, get that through approval and finally integrate it into a production environment. So that's interesting. I'd never heard that that number is as low as it is.
Itamar Friedman (12:48) By the way, we didn't even talk about the fact that a lot of the code that you write is more like maintaining code, like changing, refactoring code. That's also part of it. Like, a lot of the cool AI demos, especially coding agents, they're all about, yeah, let me create a new amazing website, which is basically 5 pages of register and some dashboard that, anyway, you could, to be frank, download something like a GitHub repo that does all that. And in real software, a lot of times after you wrote something, after a week it's legacy code, right? It's a lot about fixing that, refactoring that, elaborating on it. That's why I think there are a lot of opportunities here for more important use cases and for AI to help, in addition to code completion.
Nathan Labenz (13:30) To understand where this fits, is there more that you wanted to get to on kind of the criticality of, if I were to break it into 3 phases: figuring out what to build, doing the initial build, and then making sure that that build works and fits with everything else that we've already got going on? But you had maybe other points to expand on there.
Itamar Friedman (13:50) I'll make another point really quickly, which is that 1 of the reasons we at Codium AI focused on code integrity is because we believe, and we kind of tried to prove, and I think we did with AlphaCodium, which we're gonna talk about later, it's groundbreaking, I would say. Practically, it beats AlphaCode by DeepMind, which we admire, and that's why we call our work AlphaCodium, and beats OpenAI's submission with GPT-4 by tenfold on CodeContests, a benchmark derived from real code competitions on Codeforces. The entire idea of that work is that we put a lot of our know-how and expertise into the code integrity part. That's what makes the code generation part very reliable and do better than the majority of professional competitors there. By the way, it's open sourced, and it's the only reproducible work out there from the 3 companies that tried it. There are only 3 companies actually trying this challenging benchmark, like, DeepMind at Google, and OpenAI, and us, and ours is the only 1 you can reproduce. But closing the brackets, what I wanted to say is that the reason we focus on code integrity is because we think it's not only directly related to a few obvious use cases, like helping you to test, but we think it's a key factor enabling most, if not all, of the other use cases in the intelligent software development landscape. Even in code generation, which is their game, DeepMind's and mostly OpenAI's and GitHub's game, even there AlphaCodium did better because of the integrity. If you're talking about use cases like refactoring, obviously, when you refactor something, you need to test it, etcetera, and in many other use cases, the integrity of the code, like the testing of the code, is a key factor. That's why we are so focused on that. That's our uniqueness. Yeah, sorry if it's a bit too preachy, but I really believe in it.
We're looking on the future and we want to make 10x impact with AI in addition to code completion.
Nathan Labenz (15:58) It certainly relates to a lot of the things that I think about around just, you know, how are we going to take full advantage of AI while still, you know, making sure that we have it under control. And I mean under control in kind of every respect. You know, of course, there's the big-picture runaway AI scenarios, which I don't dismiss, but there are also a lot, you know, of kind of smaller scenarios where AI's impacts on the world are not immediately obvious but could turn out to be quite important. And so I think that that angle makes a lot of sense. Let's just talk for a minute then about kind of some of the upstream stuff. I use Copilot. I take my code snippets and queries to ChatGPT. New startups are coming into the space, like Cursor, that's creating an AI-first IDE. Those, you know, I think are all pretty powerful. How would you describe what you see as the impact of those tools, in that, you know, again, I'm shocked by the smallness of the 30% number. Maybe I shouldn't be, but how would you describe what you see as changing in, you know, that kind of center activity as these tools are coming online and becoming extremely popular? Hey. We'll continue our interview in a moment after a word from our sponsors.
Itamar Friedman (17:13) By the way, one last thing about the 30%: I think, like, 1 of the reasons that enterprises, you see them buy code completion en masse, is because, first of all, this 30% does matter, and it's also 1 of the first use cases that actually works really well. Again, I really love this use case, but it's, like, the most trivial 1 to apply with this new LLM technology. Like, LLMs were trained to complete the next line of code according to context, and this is exactly what code completion and even generation does. They're less trained and well designed, per se, purely by themselves, to deal with: does this code actually work? And that's why AlphaCodium presented a more agent-like approach, more flow engineering than prompt engineering. That's why, by the way, the enterprises are really interested in the next use cases. So don't be so surprised if, already by the end of 2024, you will use AI in many other cases. We'll probably talk about it later. For example, we have a Git plugin that helps you review pull requests, etcetera. It's a huge pain, and I don't think you've tried it yet, but it solves a major pain, the parts of pull requests that we hate to do, and leaves you with the things that you love to do. Copilot is amazing, as are some other code completion tools. I think ChatGPT is amazing a lot because of the interface of the chat, and by the way, that's why we introduced a chat that is more ChatGPT-style in our IDE plugin, which, by the way, we call Codiumate. So not only do we have many commands there that are specific to code, which ChatGPT doesn't have, but we also enable pure free text like ChatGPT, and we let you mark context more easily than in ChatGPT, where you need to copy it. So I also recommend you try that. And then, I fully agree with you, there are amazing startups; I think you mentioned Cursor, and we really love them.
By the way, it's the first time I'm mentioning it, but we are experimenting with coming up with our own IDE as well, and I see why it's such a big promise, because you can then choose how you want things. It's much more native to integrate some of the LLM features. Having said that, there is a reason that we didn't do it, because, again, the thing that you were, if I could say, slightly shocked about is that coding is 30%, and most of these solutions are focused on code generation. We're focused on code integrity, which runs through the entire software build lifecycle. You want to test your code while you're coding. You want to test your code while you're merging. You want to test your code in your pipeline, like in the CI/CD, etcetera, and there are more points, by the way. If you build a new IDE, you really need to be focused on that, and we didn't want to put that as our main focus, more than the entire software lifecycle. So I definitely think that we will see these tools; they will not disappear. In your IDE while you're coding, they're amazing, and, for example, Codium AI is not coming to replace that. So that's how I see it. By the way, Sequoia did a nice overview. You mentioned at the beginning that it's really pre-build, build, and ship. They did something similar, and I think 1 of the things is that many startups are focused on 1 part, like Cursor, and they give an environment that enables you to code, but 1 day we would want the AI to be a team member, right? With all due respect, a team member, a virtual team member, cannot just write code without testing it, without understanding the deeper context, or even going and reading your Jira ticket or whatever, etcetera. So that's why a team member needs to be on the entire 3 parts of pre-build, build, and ship in order to really be a virtual team member for you that can do some of your work.
Nathan Labenz (21:06) I'll describe my experience with the product in a second, and then you can expand on that, but just 1 more kind of question on the impact of AI on the software development process and profession today. I feel like we hear widely varying reports. I personally am not the world's greatest coder, and I would say I probably get a multiple-x efficiency boost in many cases even from just going to ChatGPT. Regular listeners will have heard me before talk about coding by analogy, where I'll basically take a class or whatever I have. It could even be documentation, but often if it's out of my repository, you know, the way that I implement caching is there and, you know, a couple other things, the patterns that I wanted to follow are there. And I say, you know, do a new version for me that does a different thing, but kind of follow these patterns. I have pretty good luck with that. I genuinely do think I accomplish a lot of things multiple x faster. Then you see these more, like, rigorous studies that come out and say, oh, 30% faster here. Very meaningful numbers there, but mostly, like, those kinds of more sober studies would be more incremental than, like, multiples. And then, more recently too, I've seen some of these reports that say that code quality is declining. And again, I don't think that's the case for me, because probably my bar maybe wasn't so high to begin with. But, you know, the sort of Gary Marcus position, which I do take seriously, at least as a possibility or at least in some contexts, is that, yeah, maybe you're able to get stuff done faster, but perhaps you're doing it at lower quality. You're building up technical debt, and, you know, this is perhaps more of a mirage than, you know, it seems. So do you have a perspective on that? I mean, maybe it's everywhere, you know, maybe it's everything in different contexts, but how would you break that down?
Itamar Friedman (23:02) Okay. Cool. So first, I think both philosophies, both opinions, are right. I think there are some cases where it's, like, magical, and it is really productive without reducing the quality, and in some cases, it does reduce it. Like, I don't think it's a conflict. It's a matter of there being so many setups. For example, when you wanna start, and I have a feeling that this is many of your use cases, you start a project from scratch, and, by the way, you're not trying to build a product that reaches millions of people and needs to be robust. You probably get really pretty good code, good, I'm talking about good, not just reasonable, for these cases much faster. If you're a company with millions of users or hundreds of millions of users, and bugs are really critical, and performance is really critical, and other things, and you're not trying to build something from scratch but rather add a feature, etcetera, and you need a lot of context, then maybe the quality actually very likely will be reduced. That's, by the way, why eventually people in companies will use tools that are more ChatGPT-like in the IDE and integrated with the context, like we offer in Codium AI, but put that aside. Now I'll tell you what I think more specifically. Yeah, I saw those reports lately about a reduction in code quality, and I'll try to explain, I think, why there is a difference. For example, in the GitHub report, they say twice better quality or something like that; the number doesn't matter. It was a controlled report, controlled experiments, where they tried to isolate it: from scratch, build something, and that thing was actually almost an average request that an LLM can easily, like, extract from the internet. In this case, like I mentioned, yes, I do expect even higher quality than a regular developer. But again, when you have complicated code, when we're talking sophisticated software, it's not.
Having said that, I say that very soon, even for sophisticated programs, we will see higher quality coming from AI, and now I want to explain a bit about that. So I think about the way we designed code completion and code generation until today, and by we, I'm talking about the industry, a lot of our competitors, actually, most of whom we see ourselves as complementary to, but you know there's mind-share competition to some extent. I think, like, most of the systems were designed for system-1-like thinking. You know what I'm talking about? I'm referencing Daniel Kahneman's book Thinking, Fast and Slow: system 1 and system 2, thinking fast and thinking slow. Most of the cases are thinking fast, system 1. Hey, here's a bit of context, here is a prompt, give me the solution. That's system 1. I think we will see, and that's, by the way, what we at Codium did in AlphaCodium, more of system 2. Hey, let's think about it step by step, slowly, and not just with a chain of thought, but actually using various tools; it's agent-like. Okay, by the way, opening brackets for a second: you could do agents that are more system 1 than system 2; being an agent doesn't necessarily mean system 2. I would say that system 2 has the properties we see in AlphaCodium: not only reflection and steps, but also including additional information that is relevant in the process. When you're thinking, when you wanna think about something deeply, usually you go read a bit, you go fetch the right information and things like that. And once tools do that, even code generators, by the way, not code completion, it's gonna be hard to implement it for code completion, which needs to be fast, but when code generation does system 2, I think there's a very good chance that even in very complicated code, we'll see higher quality. So even if it's worrying right now, I think it will be solved, or even greatly mitigated, by the end of the year, even the end of this year.
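Itamar's system 1 versus system 2 contrast can be sketched as two functions: one answers in a single shot, while the other plans, fetches relevant context, drafts, and then critiques and revises its own draft. Everything here is a hypothetical sketch: `call_llm` and `search_docs` are stand-in stubs, not any vendor's actual API or Codium's implementation.

```python
# A sketch of the system 1 / system 2 contrast: one-shot generation versus
# plan -> fetch context -> draft -> self-critique -> revise.
# call_llm and search_docs are hypothetical stubs, not a real API.

def call_llm(prompt: str) -> str:
    """Stand-in for a real model call."""
    return f"[answer to: {prompt.splitlines()[0][:50]}]"

def search_docs(query: str) -> str:
    """Stand-in for retrieval: repo search, docs, Jira tickets, etc."""
    return f"[snippets matching: {query[:50]}]"

def system1(task: str) -> str:
    # Thinking fast: context in, code out, one shot.
    return call_llm(f"Write code for: {task}")

def system2(task: str) -> str:
    # Thinking slow, step 1: plan before writing any code.
    plan = call_llm(f"Break this into steps, no code yet:\n{task}")
    # Step 2: fetch relevant information, as a developer would read docs first.
    context = search_docs(task)
    # Step 3: draft with the plan and fetched context in hand.
    draft = call_llm(f"Using this plan and context, write code:\n{plan}\n{context}")
    # Step 4: have the model critique its own draft, then revise accordingly.
    critique = call_llm(f"Find flaws in this draft:\n{draft}")
    return call_llm(f"Revise the draft to address the critique:\n{critique}\n{draft}")
```

The extra steps cost latency, which is why, as Itamar notes, this shape fits code generation better than code completion, where responses must be near-instant.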
Nathan Labenz (26:52) Yeah. Things are certainly moving unbelievably quickly. 1 of the things I am saying over and over again to people right now is we now have AIs that can reason, plan, and use tools. And then, of course, people are always like, well, but not that well. And I'm like, yeah, but 2 years ago, they couldn't do it at all. Right? So we should not assume that they're stopping here. So it's a good reminder. You are right about kind of the sort of work that I do, and I think that's a good diagnosis. You know, I typically do R and D-type projects or prototype-type projects. I, you know, don't have to worry about all the edge cases most of the time, and I don't have to worry about, you know, uptime or SLAs or, you know, all of the demands of a, you know, high-volume production environment. Nevertheless, though, you know, there is this gap that I also have between the kind of stuff that I spin up and, you know, what my partners on the development team, you know, for example, at Waymark, will be happy to, you know, trust and run with. So I do see a lot of potential even just for me with the Codium plugin. So here's what I did. I, you know, downloaded it and installed it into my VS Code as an extension. You've got a quarter million active users, which is not a small number when you're talking about developers.
Itamar Friedman (28:11) In VS Code. By the way, there's also a JetBrains plugin and the Git plugin.
Nathan Labenz (28:13) I assume that VS Code is like the biggest population, but maybe not.
Itamar Friedman (28:17) Yeah. You're right, but JetBrains is, I have to say, surprisingly growing faster for us. I guess I can try to estimate why, but maybe that's less interesting. JetBrains is really growing fast for us. I think it will surpass VS Code, like, in a couple of months.
Nathan Labenz (28:34) Wow. Interesting. I guess that would just be that people who are gonna use an IDE that isn't the industry standard vanilla IDE are more inclined to just embrace new tools generally?
Itamar Friedman (28:46) I think maybe we just invest in it more than others. I have a few assumptions. We care about enterprise. Right? VS Code is definitely also trendy in enterprise, but JetBrains is very dominant there too, and we care about that. So we invested in it. We are going to do a few even bigger releases in the next couple of months, even 1 month, because we're focused on that. Also, generally, getting high scores in JetBrains is hard. I think the average rating in JetBrains is lower, and we have, like, 4.8 stars there or something like that.
Nathan Labenz (29:21) It's a big world out there. These numbers are always a reminder of just how big the world is. We were talking, we both have a tie to Ann Arbor, Michigan, and for me, the visual that I always keep in mind is the University of Michigan Stadium, which seats 110,000 people. I always just imagine that view of being in that stadium when I think of a 100,000 people. So in my mental picture, you've got over 2 Michigan Stadiums full of developers just on VS Code. And for me, okay, I'm, again, not really subject to the rigors of real at-scale enterprise software development, but I could definitely see how this is gonna add a lot of value to a lot of people, because I installed the thing, and now I'm just kinda poking around. Right? I can ask it to explain my code. That's super useful just in general. I had a lot of success with just clicking generate tests. I thought that was a pretty interesting workflow where it first popped up 20 different things that I might wanna test, different behaviors, as well as the first half a dozen tests to actually test those behaviors, and then you've got the ability to run. I was also interested in run and fix. Run, I was like, okay, yeah, totally. Run and fix, I was like, how do we ensure the integrity of that? Hey, we'll continue our interview in a moment after a word from our sponsors.
Itamar Friedman (30:46) First of all, I'm very, very happy that these are the commands that you tried, because we do have plenty. By the way, there are 2 sets of commands for us. 1 set we frame as a code assistant, everything that you're doing right now while you're coding, and another is framed as a PR assistant, everything related to the changes. If I want to analyze my changes only, let's say the entrance point is the changes, of course then we need to have deep context, etcetera. You can recap your changes, describe your changes. Instead of a one-line description like "fix bug 111," you can just call describe and get a really comprehensive description for your pull request, etcetera. But if I need to choose the top commands, I would choose exactly the tests, generate tests, and I'll share why. I also want to point out that we have quick test as well; you might enjoy that, because 1 of the things we learned is that generate tests is an advanced feature. It even opened you an advanced panel, right? It didn't present the results inside your IDE code file or even in the chat; it opened an advanced panel because it's relatively advanced. If you're looking for something more quick and fun, you can try the quick test 1; it's a chat command. So you were right about the main value proposition of generate tests, and by the way, for the 20 behaviors that were generated, notice that you can click and open sub-behaviors. So you basically have like 80 ways to test your code, and in some functions and classes it can be 50 ways, in others 100 ways. So 1 of the main value propositions we hear from developers is that, hey, with Codium AI, I learned how I want to test my code. Because, like I mentioned before, generating the code itself is cool, it's fun, and it does save me time. But thinking about how to test my code, so that I don't forget edge cases, etcetera.
These are the things we sometimes hate to do, but we know they're important, and that's why we focus there on behavior analysis. By the way, you asked me a few times about the future, so I'll allow myself to say this is where we're also gonna focus. Right now, the behavior analyses are pretty much generated from 2 types of content, your code and your comments, and soon, if we connect to your Jira or Google Docs, we can extract behaviors that are not necessarily in the code yet or not implemented correctly in your code. What I talked about in the beginning, spec matching, is definitely an important area for us. It's automatically creating BDD, like Cucumber, etcetera, for you, like a specification. This is also why explain is a very important command for us, because of the way we see coding. We believe in a concept, and now I'm gonna open a bracket, it's really important for us. Really, really. We call it DRY by AI. You know DRY, don't repeat yourself, and until today, developers tried to be DRY by not repeating themselves in code. But what about repeating yourself when you're writing your user story in Jira, and then implementing it, and then writing tests? All 3 of these have a lot in common, so DRY by AI means that AI will help you complete all 3 without needing to work on all of them from scratch. Today you write everything from scratch for the specification, you write everything from scratch for the tests, you write everything from scratch for the code. No. Now, in 2024, you can write your code and get the relevant tests, and only work a bit to complete them, and get the spec, or the other way around: write the spec and get the tests and the code, and keep it DRY and always have them in sync. So that's why I mentioned explain: if you look at it, we're kind of trying to create a readme, right, like a spec for your code, and generate tests.
It's kind of like the tests, and then you have the code, and all of that we want to keep DRY by AI. So right now it's with a push, you need to click it, right? When the accuracy is good enough for us, and we are aiming there, then we'll also do that with pull. We will try to keep it aligned for you and even give you alerts if something is wrong. So that's how we see the future there. The vision of a more kind of passive, you know, assistant that pops up to help you as opposed to
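The DRY-by-AI idea, writing one of spec, tests, or code and deriving the other two instead of starting each from scratch, might be sketched like this. Everything here is an illustrative assumption: `call_llm` is a hypothetical LLM stand-in, and the artifact structure is invented, not Codium AI's design.

```python
# Sketch of "DRY by AI": the developer writes one artifact,
# and drafts of the other two are derived from it.
def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for an LLM API call."""
    return f"<generated from: {prompt[:30]}>"  # placeholder

ARTIFACTS = ("spec", "tests", "code")

def derive_missing(written: str, kind: str) -> dict:
    """Given one artifact the developer wrote, draft the other two."""
    assert kind in ARTIFACTS
    drafts = {kind: written}
    for other in ARTIFACTS:
        if other != kind:
            drafts[other] = call_llm(f"Derive the {other} from this {kind}:\n{written}")
    return drafts

# Write the code; get draft spec and tests to complete, and keep in sync.
result = derive_missing("def add(a, b): return a + b", "code")
```

The "pull" mode Itamar describes would run this same derivation in the background and alert when the three artifacts drift out of sync.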
Nathan Labenz (35:18) the 1 that you engage, which is kind of how it all mostly works today, is definitely really interesting foreshadowing. I also wanted to flag that I got a lot of value from the Enhance function, and it honestly was like a software tutor for me. I've been largely self-taught, largely entrepreneurial, you know, making stuff work. I used to go to Stack Overflow a lot. I go a lot less now, but I would go there a lot of times with pretty beginner questions in a lot of different domains. And so, you know, I've been writing Python code for years, but I just hit Enhance, and it gave me the enhancement, and I was like, wow, okay, it's like typing everything for me. These are all the things that I should be doing that I'm just not even bothering to. It definitely felt like I could learn from this, that my output could be better, that my handoff to the development team would be smoother. And so I do think this is likely to become a part of what I do, even though I'm still in this kind of R and D mode and not really accountable for the code integrity in the way that the bigger teams are. Here's the thing.
Itamar Friedman (36:32) That's interesting about the Enhance. First of all, thank you. I love that you love it. By the way, I want to be clear here: there is no feature that we have that 100% of developers love. Developers have taste, and that actually leads me to the next point. Right now, I don't know if you noticed, when you do Enhance, you can give it a bit of a prompt, like enhance in a certain way. But think about it, and by the way, this is in implementation as we speak, part of it is already with alpha users, design partners, etcetera. Think about it: if you could have a configuration file, maybe for each repo, saying, what are the principles of the enhancement that you'd like to get? In this repo, for me, enhancement means A, B, C, D, and in another repo enhancement means E, F, G, or whatever. And then every time you call Enhance, it gives you not only general best-practice enhancements, but yours. And think about tech leads being able to influence this for their junior developers, etcetera. Now, if you like Enhance, I also suggest you try Improve. You get a code improvement, and think about an Improve that you can impact, that you can configure per repo. For example, 1 of our clients has an old API and a new API, and the old 1 they want to deprecate, so they configured it so that every time a developer mistakenly calls the old API, it gives an improvement, gives an alert: hey, use the new 1. For a dev team, this is almost a game changer. I also wanna use this opportunity to share how I think the future is coming: there's also the option that AI will learn your team's best practices by itself, and we're thinking about it, but the first step is actually letting you adjust it by yourself, and that's how we see it. So what you liked about Enhance, I really love that you loved it. You loved it?
I can tell you that 50% of developers or so hate it, because they think of Enhance in a different way, and that's why we wanna let them configure what Enhance means for them.
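The per-repo configuration idea might look something like this in practice. The file name `.enhance_config.json` and its schema are invented for illustration; this is not Codium AI's actual format.

```python
# Sketch: per-repo configuration of what "Enhance" means.
# File name and schema are hypothetical, invented for this example.
import json
import os
import tempfile

DEFAULT_PRINCIPLES = ["follow general best practices"]

def load_enhance_principles(repo_root: str) -> list:
    """Read a repo's enhancement principles, falling back to defaults."""
    path = os.path.join(repo_root, ".enhance_config.json")
    if not os.path.exists(path):
        return DEFAULT_PRINCIPLES
    with open(path) as f:
        cfg = json.load(f)
    return cfg.get("principles", DEFAULT_PRINCIPLES)

# Example: a tech lead pins down the team's rules for one repo,
# such as the API-deprecation alert Itamar describes.
repo = tempfile.mkdtemp()
with open(os.path.join(repo, ".enhance_config.json"), "w") as f:
    json.dump({"principles": ["prefer new_api() over old_api()",
                              "add type hints"]}, f)
```

A repo with no config file simply gets the general best-practice behavior, so junior developers inherit the tech lead's taste only where it has been written down.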
Nathan Labenz (38:35) Well, you know, again, my low bar is such that even the baseline Enhance is a win for me.
Itamar Friedman (38:41) It's not a low or high bar. It's just taste. It's like best practices. It's not better or worse, or juniors versus advanced. It's like spaces or tabs.
Nathan Labenz (38:53) Right? Certainly, I don't deny the importance of taste among those that have the true professional chops. 1 big question I was wondering about as I was using it, though, is the question of deep context. The things that I was doing were mostly function level, class level, but pretty contained, right? And most of the things that I got back were also pretty contained. Test this function, enhance this function. When you get to the PR agent type workflow, and just in general to continue adding more and more value, it seems like 1 of the big leaps there, if not the biggest, is going to be the challenge of the much bigger context of these bigger projects. So in 1 sense, things are kind of coming to us as implementers of AI, because the context windows have grown dramatically over the last year. Trying to do this sort of thing with a 4,000 token window sounds extremely hard. And at 128,000 tokens, you have a lot of room to put stuff in there. It doesn't mean it's all gonna work, it doesn't mean it's trivial, but at least you have a lot more room to play with. What kind of high level principles can you share about how you guys are thinking about the challenge of the much bigger projects and managing that context?
Itamar Friedman (40:16) This is an amazing question, by the way. I think it's critical for the future of the intelligent software development life cycle, this entire idea of being able to understand big repos and multi-repos. Those sound similar; some of our clients have a huge monorepo of 10,000,000 lines, and how is that different from, for example, 10 repos of 1,000,000 each? There is a difference, but I don't think it's that interesting. But the challenge of dealing with lots of lines of code as context is critical. I do think that as LLM contexts grow, really efficiently grow, that will help, not solve, but it definitely helps. I do want to point out there's research showing that the big windows also have problems with focus, attention, and all that. It's getting better, but I think it's still a problem, because even the way you order things matters. And even if they solve the focus and attention issues, you still need to flow the process. Again, that brings me back to AlphaCodium. If I just give you, as a human, here is all the code, no, you'd probably process it, work on it, and try to organize it. And the same way that you do it in steps, the LLM needs to do it in steps. Yes, there are internal steps inside the LLM, there are stages, there are blocks, but like I said a few times, calling it in an iteration is basically adding more blocks of thinking, and I think that's critical even if you have a better context window. And from our point of view at Codium AI, we do implement deep context, but not like some of the other solutions right now, and here is why we did it differently. We decided at the inception of the product, as high level tactics, I would say on the border of strategy, that we want to be a 0 data retention company. It means that even in our cloud-prem solution, even there, we wanted to start with 0 data retention.
It means that we compute the context on the fly, and we do. We build all the dependencies, and we have kind of a heuristic, algorithmic stop for where to stop the graph of dependencies, upwards and downwards. But we don't do that upfront, and we don't try to look at the entire 10,000,000 line repo at run time; that's impossible, and that's not what we wanted to do. So we do have deep, pushed context, and now, after we've progressed quite a lot with these capabilities, we are looking into ways, and there are quite a few, and it needs to be done right, to digest a repo and multi-repos in a background process that happens all the time, and then at runtime to exploit that. We're working on it, and then it will be even deeper context. By the way, it costs; it costs the environment and the company, etcetera, so it needs to be done right. The cost is going down, yeah, but it's still something that costs. But cost wasn't the main reason we didn't do it. It's more that we wanted to give our customers the confidence that this is a 0 data retention product, and some of them were really scared of their code being indexed and saved somewhere, anywhere. Their GitHub, GitLab, Bitbucket, etcetera, are strongly controlled; those are really mature products, and then suddenly you come along with a vector database saved somewhere, and not a lot of time has passed since the inception of that technology. We decided strategically to wait a bit, and now it's getting mature, the product and this whole field of, let's call it, vector databases, etcetera, and we're getting into it as well.
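Computing context on the fly with a heuristic stop, rather than indexing everything up front, can be sketched as a bounded traversal of a dependency graph. The graph representation and the depth/budget numbers below are invented for illustration, not Codium AI's actual algorithm.

```python
# Sketch: gather context by walking the dependency graph outward
# from the code under a command, stopping at a size/depth heuristic.
from collections import deque

def gather_context(deps: dict, start: str, max_depth: int = 2,
                   budget: int = 5) -> list:
    """BFS over a dependency graph with a heuristic stopping rule."""
    seen, order = {start}, [start]
    queue = deque([(start, 0)])
    while queue and len(order) < budget:
        node, depth = queue.popleft()
        if depth == max_depth:
            continue  # heuristic stop: don't spider the whole repo
        for nbr in deps.get(node, []):
            if nbr not in seen:
                seen.add(nbr)
                order.append(nbr)
                queue.append((nbr, depth + 1))
    return order

# Toy dependency graph: file -> files it depends on.
deps = {"a": ["b", "c"], "b": ["d"], "d": ["e"], "c": []}
```

With `max_depth=2` starting from `"a"`, the traversal collects `a`, `b`, `c`, and `d` but never reaches `e`: the graph is cut off before the whole repository is pulled in, which is what makes the on-the-fly approach tractable.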
Nathan Labenz (44:15) So yeah, this context question, I mean, it's 1 that I have been thinking about a lot personally, because I tried to create a GPT to help me with my relatively modest personal R and D repo for Waymark, and it really has not worked. What I've found is, at least so far, I just have to manage the context manually. Loading the whole repository in there, I tried a few different ways. It kind of sent me off on this path of knowledge graphs, because I realized that what it seems to be doing is matching on a chunk level, loading that stuff into context, but then it's missing all these things that I really need it to take advantage of. And I've tried saying, no, it's your job to go find the pattern; you're the 1 that's supposed to find the caching pattern and use it. If I have to do that, then I don't really benefit from the GPT relative to just using ChatGPT and pasting in whatever context is relevant. Do you have any findings or tips on the graph side? And if I understand correctly, you're doing this all dynamically in the IDE at the moment. So you've got code in the plugin that's kind of spidering out, creating a graph. In the future, I guess you're also saying, yes, this is gonna go toward vector databases as well, and we'll have chunking. But any thoughts about what makes a good graph strategy? Because I personally need some tips.
Itamar Friedman (45:40) So first, I have to say that, man, you're revealing all our near-future features. I'll explain what it's going to be. Actually, what I'm going to tell you already exists, but it's a bit hidden, and some developers noticed it and we got really good feedback about it. I'll share it with you; since I did a spoiler, let me focus on that for a second. This is for 1 of our features, I think you didn't try it. It's not the 1 where you generate tests from scratch. We also have a feature called add me more tests, or officially we call it extend test suite, for when you already have a test suite. And there we did an experiment where, UI and UX wise, we revealed the computed context. Actually, we didn't reveal it as a graph, although that's a good idea and we're thinking about it; sometimes developers don't like too much UI in their IDE, right? So right now we've built a list of the computed, let's call it, sub-context. There is the direct context: the code under test, the code under explain, the code under the command, for the different types of commands you asked about, testing, explain, enhance. But then there's the sub-context, the dependencies and all that. This is being revealed to the users. And the interesting thing is that the user, the developer, can then change it. You pulled this 1, but I don't like it. And hey, what about this? Something you missed. By the way, even if the AI algorithm was correct and did bring all the right minimal part of the graph, you might actually want to add more, if you suddenly want a different test that covers more areas or different areas. It's interesting. That's what we're working on, to provide this as a feature in other cases as well. So 1 thing I want to say is that I think the developer would benefit if the AI assistant reveals the underlying computation. And I think that's true not only for the context, but for other things too.
We're thinking about it. I don't want to overload, but if you do like to see what's going on under the hood, here you go. That's our philosophy looking forward. We did it in some places, but not enough. Specifically about the graph, 1 of the complicated things about it is that it's language-dependent. It's almost unbelievable to me that there isn't sufficient open source around it. There is Tree-sitter and a couple of other open source projects out there, but they don't really cover everything we would want in order to have a terrific graph for each 1 of the leading 30 languages. I can tell you with relatively high confidence that JetBrains has that, they need it for their IDEs, and GitHub has that, for search, etcetera. In VS Code, usually what you do is download an extension that does it for you for each language, but it's not open-sourced in 1 complete framework that gives you all of that. It's not easy to do. That's why you're having a hard time. If it's okay for you not to do it on the backend side, then 1 of the things you can do is actually use the IDE's APIs and exploit their graph building. That leads me to 1 thing I wanted to tell you about how we do it. You said we're doing it in the IDE. Basically we're doing it in 2 steps. We do part of the pre-processing in the IDE, on the client side, and then another part on the backend. That's for efficiency, to try to minimize what we send, but also for deeper algorithms that can run on the backend.
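To make the language-dependence point concrete: for Python specifically, the standard library's `ast` module can extract the raw material for a dependency graph, namely which functions a given function calls. For the other leading languages you would need something like Tree-sitter grammars or the IDE's own APIs, as described above. A minimal sketch:

```python
# Sketch: extract the call edges of one function using Python's ast module.
# This is the per-language groundwork a dependency graph builder needs.
import ast

def called_names(source: str, func_name: str) -> set:
    """Return the names of functions called inside func_name."""
    tree = ast.parse(source)
    calls = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef) and node.name == func_name:
            for sub in ast.walk(node):
                # Only simple name calls; attribute calls would need more work.
                if isinstance(sub, ast.Call) and isinstance(sub.func, ast.Name):
                    calls.add(sub.func.id)
    return calls

code = """
def helper(x):
    return x * 2

def main(x):
    return helper(x) + len(str(x))
"""
```

Here `called_names(code, "main")` finds the edges `main -> helper`, `main -> len`, and `main -> str`; repeating this per function, per file, per language is exactly the effort Itamar says is not yet packaged in one open-source framework.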
Nathan Labenz (49:20) Well, let's get to the research itself then for a little bit. For all this talk of context, we're kind of back to a toy environment with this thing. But I understand it as essentially you and your team flexing on your ability to maximize the performance of language models. Is that a fair characterization? You wanted to show, hey, we are awesome at this, and we're gonna go head to head with some well known publications on an established benchmark to demonstrate that, and teach the community some of what we do. Is that basically the mindset behind it?
Itamar Friedman (50:00) The word toy, I'm not sure if you related it to AlphaCodium or to the benchmark, but I want to relate to both. The reason we chose CodeContests, which is a benchmark designed and gathered by DeepMind, mostly from Codeforces competitions, is that we think this is the most serious coding challenge out there. There is some LeetCode stuff, and HumanEval includes coding problems, but they're either saturated or simple; even LeetCode hard, etcetera, is simpler than Codeforces in most cases. By the way, a lot of them have been shown to be present in the training set. I don't know if you noticed, GPT-4 really advanced on HumanEval, and then somebody presented that it's 100% accurate on problems from before some date and 0% after. That almost proves that a lot of the stuff there is in the training data. With Codeforces, the idea is that currently the CodeContests benchmark is fixed, but we see it as something that should always progress; as Codeforces adds more problems, CodeContests should progress with it. Then there's a higher chance that it's not in the training set. The other good thing is that Codeforces problems have private tests. It's almost like Kaggle: you're not revealing them, so it's hard to really fine tune on it. So that's why we chose it. I think this is 1 of the reasons that only 2 other companies competed on CodeContests until today, DeepMind and OpenAI, and we're the third. All the rest of the LLM builders usually report on the simpler benchmarks, and there's a lot of debate around those, but they're basically getting around the same numbers. Now, about AlphaCodium itself, whether it's a toy or not: I can tell you that on 1 hand, it's doing really well on Codeforces, and these are really hard problems to try out. And at the same time, yes, these are isolated problems.
They're not connected to many other moving parts in a company's huge software stack, etcetera. So in 2024 we are, step by step, taking many of the concepts that we had in AlphaCodium and integrating them into our Git plugin, called PR-Agent, and our IDE plugin, Codiumate, by Codium AI. And you will see not only new capabilities, but new interfaces that are more flow engineering, if we talk about AlphaCodium. So I do think that we will see the impact also in real world software.
Nathan Labenz (52:51) Yeah. A better word than toy would have been self contained, I think. And another way that it's interesting is that it is, I think, a framework that generalizes probably extremely well beyond code too. So let me just set it up a little bit and give a little more foundation for folks, and then you can give me more detail. First of all, this coding contest, as you said, is a standard that humans compete on. So I always like to ask the question, well, how good are humans at this task? That's a great grounding. And it is true that it's pretty easy to create devilishly tricky little problems, and people like to compete on these. So the prior state of the art before your work came out was DeepMind's AlphaCode, and they published a result, I believe this was maybe even in Nature, that said, hey, we have achieved median human performance among coding contest participants, which is really critical. Right? Because people are always like, well, how do we compare to humans? And it's also, which humans are we comparing to? So we are comparing to people who participate in coding contests. And with AlphaCode, I think about a year ago, they got to roughly the 46th percentile, roughly the median participant in these coding contests. And that's just by succeeding on 25% of the challenges. So these are some pretty hard problems if the sort of person who finds it fun to go spend their time doing coding contests is only getting 1 in 4 of them correct. Now, to unpack a little more of the structure, you have essentially a problem statement, right, that's kind of like a story problem from a high school algebra textbook, except expecting a coding solution.
And then, if I understand correctly, you have both the public tests that you can see, that are given to you, like, you must pass these tests, and then also the private tests. I wasn't clear on those: are they so private that they're not published anywhere and you have to just ping somebody's servers? So those tests are on the Codeforces servers, and they're literally nowhere on the Internet. Is that true?
Itamar Friedman (55:15) That's what makes it a really good, Kaggle-like competition. So what you said is perfectly correct, and I want to elaborate a bit. Okay. So first of all, there was AlphaCode, and I think it's even more than 1 year ago, and then there's AlphaCode 2, which was just recently. The first important thing I wanna relate to is that both of them were fine-tuned on this specific problem set, and this, by the way, makes them less general. You talked about whether it's a toy; by the way, I'm fine with calling it a toy, because I do agree it's currently not fitting for real world cases, which is going to come in our products this year. But as opposed to AlphaCode 1 and 2, which fine-tuned their models on this competition specifically, our solution does not. That gives a huge hint that our solution, AlphaCodium, is more generic for other problems, and this is critical. Second, AlphaCode 1 especially, but also AlphaCode 2, makes millions of LLM calls to solve 1 problem. So when they said they reached 25%, I think it's 28% if I remember correctly. What you're referring to is that in AlphaCode 2, they kind of reshuffled the entire competition. They don't say exactly which problems they used. With all respect, I think the AlphaCode 1 report was amazing; AlphaCode 2 looked to me like they were in a rush. They tested AlphaCode 1 on AlphaCode 2's shuffle of CodeContests, which is public, but now with some private part, and there they're at 25%, while the original was at 28%. To get to the 28%, you make 1,000,000 calls for 1 problem. This is impractical. So they fine-tuned, and 1,000,000 calls, forget about it. This is not even a toy; it's a kind of proof of concept, which is amazing, and it inspired us. Yeah, okay.
With AlphaCode 2, they fine-tuned a new model, Gemini Pro. They still fine-tune, but they do much better on how many calls they make. We made about 4 orders of magnitude fewer calls than AlphaCode 1, and something like 30% fewer calls than AlphaCode 2. Having said that, they still fine-tune their model. It means that for every new competition, or every new piece of software, whatever, you'd need to fine-tune your model, and that's almost impractical, while our solution is generic. And 1 of the things I liked about the project is that what we thought was the main contribution was also recognized by others as the main contribution. It's not always like that; you always wait to see how the community is going to look at what you did. When people reviewed our paper and tweeted about it, they actually related to how well we designed the flow. Now, you can call it chain of thought, you can call it an agent; we decided to call it flow engineering, because chain of thought, and you can read the ReAct paper, is 1 type of flow, and we designed a different type of flow, and we didn't want to confuse. And we don't call it an agent per se because we didn't want to fall into that buzzword. So basically, what we did is a code-oriented, iterative, fact-checking flow engineering. Okay? So we start with the problem. Sometimes the problem is defined in such a human way, and we wanted to represent it in an intermediate way, which a human can read, but which is much better for an LLM to digest. It's like taking a product description from a PM and refactoring it into a technical document, and this requires a few steps. But the interesting thing is that each 1 of the steps is relatively reliable.
Sometimes when you try to do the best prompt straight from the problem to the code, practically the LLM needs to do so many things in the process that it makes it unreliable. And even with the best prompt, it still will be unreliable. But if you design a flow, like the first part of our flow I just described, which is a few steps to redefine the problem, then each step is more reliable, and you get a better result. Now, I know what you might be thinking. If you go from the original to the end in 1 shot, maybe it's 40% accurate. Even if each 1 of your steps is 90% accurate, if you take 0.9, multiply by 0.9, multiply by 0.9, and so on, you might get less than 40. But practically, we have a few techniques that we show in AlphaCodium for how we avoid that situation, with what we call knowledge gathering and injection through the process. In the old days, we might call it self supervision or something like that. So basically, the flow starts with a few steps where we redefine the problem, and then we offer a few solutions for this problem, and then AlphaCodium already reflects on it, and then starts generating 1 or 2 solutions, running the public tests if they exist. By the way, some of the problems come with 0 public tests, if I remember correctly, I hope I'm not wrong, some with 1 or 2, some with many. Then it starts to reflect on it and run the code. It's specific; it's not a general chain of thought. It's a flow specific for programming. I want to relate to that: if you take it to healthcare, or you take it to, I don't know, legal, you can anchor your results in different ways.
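The compounding-reliability concern is simple arithmetic: if each of n independent steps succeeds with probability p, naive chaining succeeds with probability p to the power n, which is why the flow injects gathered knowledge between steps rather than relying on pure multiplication. A quick check with illustrative numbers:

```python
# Sketch: how naive step chaining compounds error.
def naive_chain_success(p_step: float, n_steps: int) -> float:
    """Success probability if every step must independently succeed."""
    return p_step ** n_steps

# 90%-reliable steps degrade quickly when chained naively:
three = naive_chain_success(0.9, 3)    # 0.9^3  = 0.729
dozen = naive_chain_success(0.9, 12)   # 0.9^12 is roughly 0.28
```

Three 90% steps still beat a 40% one-shot, but a dozen of them do not, and that gap is what the knowledge gathering and injection between steps is meant to close.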
In this case, in software, we anchor via testing. And not only that: AlphaCodium then tries to generate additional tests that are edge cases, because in machine learning, when you want a good solution, you need the edge cases to draw the boundary, and this is what you also see in Codium AI, what we talked about an hour ago with the behavior analysis giving you the edge cases. And then, although it's AI generated code, we know how to smartly, and it's not so complicated, exploit these edge cases even if they might be wrong, by choosing, a bit like curriculum learning, which 1 is accurate enough to conclude from and use as an anchor. And this is how eventually, after a dozen steps, it reaches the final solution. That's AlphaCodium, designed for solving problems from a spec, from a definition, with a few data points, to a solution. It could be generalized to real world products because you do have some data points you can extract; I can talk about that. Yeah, that's an overview. So the name of the paper is From Prompt Engineering to Flow Engineering. The time we needed to spend on prompt engineering was really reduced, to almost 5%, because once you move step by step and every time ask for something relatively clear, the sensitivity of the prompts is reduced.
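The flow just described can be paraphrased as a sketch. This is a loose rendering of the stages (problem reflection, solution proposals, public and AI-generated tests, iterative fixing), with hypothetical `call_llm` and `run_tests` stand-ins; it is not the actual AlphaCodium code.

```python
# Rough paraphrase of the AlphaCodium-style flow described above.
def call_llm(prompt):
    """Hypothetical LLM stand-in."""
    return f"<output for: {prompt[:25]}>"

def run_tests(code, tests):
    """Hypothetical sandboxed test runner; pretends the code passes."""
    return True

def flow(problem, public_tests, max_iters=3):
    # Pre-processing: redefine the problem into an LLM-friendly form.
    reflection = call_llm(f"Reflect on and restate: {problem}")
    candidates = call_llm(f"Propose possible solutions for: {reflection}")
    ai_tests = call_llm(f"Generate edge-case tests for: {reflection}")
    # Iterative stage: generate code and anchor it against tests.
    code = call_llm(f"Implement the best of: {candidates}")
    for _ in range(max_iters):
        if run_tests(code, public_tests) and run_tests(code, ai_tests):
            return code  # anchored: passes both public and AI-generated tests
        code = call_llm(f"Fix this code against the failing tests: {code}")
    return code
```

Each call does one relatively clear thing, which is the prompt-sensitivity point: the decomposition, not the prompt wording, carries most of the engineering.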
Nathan Labenz (1:02:35) On this notion of the relative simplicity of prompts: I went into the GitHub repo and read through the prompts, and I was pretty impressed by just how relatively straightforward they are: pretty brief, pretty easy to read, crisply written. Clearly some care has gone into articulating very clearly what each task is, but the real work has been breaking down this macro task into subtasks in a way that really embodies the best practices, a successful workflow to get from beginning to end. So I thought maybe I'll just read the steps. And again, this is something that you could bring to almost any other area where you might want to apply AI. You call it flow engineering; I might call it workflow engineering. The first challenge a lot of teams have is that they haven't documented their workflows, or they haven't even really thought about them at all. They just do the work, but they haven't taken the step back to think: okay, what are the steps that I do? Maybe I'm kind of blurring them together, but could I really break them down into discrete and clearly distinguishable subtasks? That is the lesson here, I think. You can borrow from the AlphaCodium pattern, but you can also do this for your own type of work, whatever you're trying to do, and I think you'll find pretty good results from it. Elicit, which is a former guest on the podcast, and where I'm a very small, proud investor, is also a pioneer of doing this kind of thing: rigorous task decomposition, in their case for research. So here's the flow. You kind of divide it into two buckets, but really, it's a pretty linear flow with a couple of nodes that you can iterate on. So first, you provide the coding problem. The first core step is problem reflection.
That's your classic "unpack this, think step by step," but you're not even doing anything yet. You're just unpacking the problem itself, really just thinking about the problem, LLM style. Then, in this context, you have the public tests, so you do the same thing for the public tests. Then generate possible solutions. I might also describe that as generating possible strategies, because if I understand correctly, there's no code generated yet at this point; it's possible approaches, strategies, ways to think about solving the problem. So you've got a few of those. Now rank those solutions. Again, each of these is a distinct LLM call with a pretty basic, straightforward prompt, but really dialed into doing that narrow part of the overall workflow. Once you've ranked those solutions, now generate additional AI tests. This is obviously more specific to a programming environment than other things, and we'll leave it as an exercise for the audience to think about what the version of that might be in their context. But this thought was really interesting too, because the thing it's ultimately going to be evaluated on is: does it pass the private tests? Right? If it passes the private tests, it's deemed successful, and yet you only have the public ones. So I assume these things are generally designed in a way where there will be some failure modes that the private tests are supposed to capture that the public tests would not capture. So you're taking that extra step of saying: okay, even if we were to pass all these tests, what other ways might we fail? So you generate additional AI tests to try to cover those bases.
You've got an interesting diagram that shows the problem and solution space and how you expand the coverage of tests by generating these additional tests. And all that has happened before we get to any code. Right? So this is literally all in the preprocessing bucket, as it's described in the paper. Now, having reflected on the problem, come up with a few different ways to think about solving it, generated new test ideas, and chosen and ranked those various candidate strategies, we're finally ready to generate some code. So generate some code, see if it passes the tests. Actually, there's even a self-improvement step. That's another thing I think is pretty interesting: multiple rounds of improvement even before testing. So generate the code, then feed the generated code back in: can this be improved? Then go on to actually run it. This is the only thing, as far as I understand, that gets outside of the purely programmatic language model calls: we're now actually going to run the code, so we do need access to a runtime for this portion. Right? Now you either succeed or fail. You can feed in the error messages and work on it again and again until you finally pass all the tests, then do the same thing on the AI tests. There's an interesting nuance there around the fact that the tests might be wrong. Can you give a little bit more color on how you think about that conceptually?
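The generate-run-fix loop Nathan just walked through might be sketched like this. Everything here is a hypothetical illustration (the toy `run_tests` and `fix` stand in for a real runtime and a real LLM repair call):

```python
# Hypothetical sketch of the run-fix iteration: generate code, run it against
# tests, feed the failures back, repeat up to a bounded number of rounds.
def run_tests(code: str, tests: list) -> list:
    """Return the subset of tests that fail (toy placeholder for a runtime)."""
    return [t for t in tests if t not in code]  # toy failure criterion

def iterate_until_passing(code: str, tests: list, fix, max_rounds: int = 5) -> str:
    for _ in range(max_rounds):
        failures = run_tests(code, tests)
        if not failures:
            break  # all tests green, stop iterating
        # In the real flow this would be an LLM call given the error messages.
        code = fix(code, failures)
    return code

# Toy "fix" that addresses each failing test by incorporating its name.
fixed = iterate_until_passing("", ["t1", "t2"], lambda c, f: c + " ".join(f))
```

The bounded `max_rounds` matters: it is what keeps the loop practical rather than letting it churn indefinitely on an unfixable test.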
Itamar Friedman (1:07:47) First of all, I want to say that you're spot on in some of the insights. Think about it as a human: this is probably the process we follow as developers. You don't take a problem and go straight to "here is my code," even when we're in a hurry. I'm not saying we don't 10% of the time, et cetera, but usually it's: here's the problem, let's think a bit about how I want to design it. Even if it's not too much, you do a few steps. It doesn't make sense that we would ask a model in one prompt, "generate this for me," even if it's super intelligent. Okay, let's let it think the same way we do. That's a core idea here. So yeah, let's think about it: what could be an edge case? What am I trying to achieve? How do I break it into points? What could be a relevant algorithm here? Think about it: you've been asked to do something, and you'll think, yeah, what's the principal technical engineering aspect that I need here, before you start writing the code? At least that, you do that. So why wouldn't AI do that? When you get to the tests, basically the AI might be wrong. And similar to us: how many times, if you do testing... the developer world divides into two, basically. Either you don't do testing because you hate it, or you do testing and you hate it, right? But when you do testing, how many times has it happened that you made a test and it was red? It didn't pass. And then you're thinking: is it a problem in my test, or is it a problem in my code? It happens a lot. So why wouldn't it happen with AI as well? So how do you deal with it? You kind of reflect on it. But by the way, we can also be a bit wrongly biased towards accepting green tests that pass. It could be that the test passed but the test is wrong and the code is wrong. But, being biased, we would accept a green test.
And then we put most of our time on the red tests: is the problem in the test, or in the code? Basically, what you see here is that we put this developer bias into AlphaCodium. When a test is passing, maybe with a bit of reflection or not, we anchor it: okay, this should not only pass now, but also in the next revision. But if a test didn't pass, let's reflect on whether we need to fix it, or generate another one, or actually fix the code, and act accordingly. Basically, you build a growing anchor group of passing tests, and that concept you can take to other fields. Let's say, on the legal side, I have different items I need to comply with, and I write something and check in some way or another whether I comply with a few of them: okay, from now on I want to keep complying with those, keep that safe, and move on to the next one. Maybe I didn't give the best example, but there could be other fields that can definitely enjoy this. By the way, sorry, I'm so excited about this stuff, maybe I'm tongue-twisting a bit too much, but it's a relatively well-researched area. We didn't quote other papers, et cetera; we didn't want to make it such an academic work. But curriculum learning in general is a very adjacent, or exactly this, field. There's so much work there: do you start with the hard cases, or do you start with the easy cases? So there's a lot of room in AlphaCodium for innovation, or trial and error, to see which other curriculums could work. We chose one according to our intuition and best practices.
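The "growing anchor group" idea can be sketched in a few lines. This is an illustrative simplification, not AlphaCodium's actual code; the function and test names are hypothetical:

```python
# Illustrative sketch (not AlphaCodium's real code) of the growing anchor
# group: a green test is anchored as a constraint that every later code
# revision must keep passing; a red test triggers reflection first, since
# the test itself, not the code, may be what is wrong.
def handle_test_result(anchors: set, test: str, passed: bool) -> str:
    if passed:
        anchors.add(test)  # anchor it: must keep passing in future revisions
        return "anchored"
    return "reflect"       # decide: fix the test, regenerate it, or fix the code

anchors = set()
status_green = handle_test_result(anchors, "test_empty_input", True)
status_red = handle_test_result(anchors, "test_suspicious", False)
```

Note the deliberate asymmetry: green results accumulate monotonically into the anchor set, while red results never modify anything until a reflection step decides where the fault lies.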
Nathan Labenz (1:11:15) Yeah, okay, cool. Interesting. So I'll just summarize the results, and maybe also give at least my understanding of the economic comparison between the human performance and the AI. So you try this with a few different models. At the time of the research, GPT-4 was clearly the best model available for coding. I would be very interested to hear what the recently released Code Llama might look like in this context. But the punchline is: with a total number of API calls per run through this entire workflow averaging in the 15 to 20 range, and then also allowing the AI to run this whole process up to five times, you get up to half of the CodeContests problems solved correctly. That's basically double what AlphaCode reported some time ago, and essentially on par with what AlphaCode 2, also out of DeepMind, has reported more recently. And that gets us up to close to, if not all the way to, the ninetieth percentile of humans. Again, these are humans that are spending their time competing in code contests. All other competitors mostly did pass@10. What does pass@10 mean? They submit ten times
Itamar Friedman (1:12:36) on the code competition, and we submitted five. Why did we choose five? Because we thought about UX and UI. Think about it: if you are a product and you want to suggest a few options, we thought ten options is a lot, but people will be willing to get, "here's the top three, what do you think? Which one would you like?" Or top five. Practically it also means less running time, and it was easier for us to research when we didn't do ten. And by the way, this favors AlphaCode, DeepMind, and GPT in their reports, because they did ten submissions and we did five. If we do ten, by the way, we get better results. And in each one of these five attempts, there are roughly three solutions that AlphaCodium automatically tries, and in each solution there are about 15 calls. So roughly speaking, there are about 100 calls, which is equivalent to AlphaCode 2 when they reach similar results to us, which is 30% solved. Okay, so not 50: with 100 calls, they reach 30% of the tasks being solved. Solved means that all the private tests pass, which puts them in the 51st, 52nd percentile. You beat the majority, but not the top 10 percent. To reach the top 10 percent, they needed to run more than 1,000 calls. AlphaCode, by the way, is more towards a million, and we didn't want to do that because it's not practical. When you want 100 calls, it's like three minutes and it costs you, let's say, a few bucks. When you need to run 5,000 times, for example, to beat the top 10% of competitors, then basically it takes you tens of minutes and more bucks. We practically wanted to show a practical result that people can reproduce. So we aimed at: hey, let's give you a tool that lets you beat the majority, but not necessarily the top 10%. If you scale the parameters, you'll do much, much better. I think the difference
Nathan Labenz (1:14:40) is that I was taking that number off the validation portion as opposed to the test.
Itamar Friedman (1:14:45) Oh, yeah, you're correct. You know why I'm relating to the test set? Sorry. What you said is correct about the validation set, but let me tell you why I relate to the test set: I think it's much more correct, more real-world; it's better to use the test set, because it's not available out there, so it gives a purer result. So I relate to the test set, and you're right that you were relating to validation, but everything I said is about the test set. Why do I relate to the test set? I think you can see interesting insights. For example, GPT-4's results on the validation set open a much bigger gap over other models; then on the test set, the gap actually shrinks. This is really interesting, in my opinion. It shows the power of the flow, and how maybe GPT is trained on so much data that it somehow includes some interpolation of what's going on there, but not on the private set. And by the way, you mentioned we decided to keep things simple, and we presented results on DeepSeek Coder and GPT-3.5 and 4. We didn't, at that time, manage to do the new Code Llama version, or whatever it's called, because it had just been released a few days earlier, and we released about ten days ago. But we did test on more APIs, and I have to tell you that, surprisingly or not, DeepSeek is doing relatively well, especially on the test side; it's very generalizable. So we didn't want to complicate it too much, and we presented that. Later on, in AlphaCodium 2, when we want to do something even more rigorous, not an eight-page report but, like OpenAI and DeepMind sometimes do, 70 pages, though I'm not sure we want to go there, we'll show many more. But DeepSeek is doing relatively well. There are other cool models. Some models are really good for chat but not so good for code. Some of the Bedrock, AWS models, I don't want to name names, are really good for chat but less so for code.
I'll allow myself to say that it was a pleasure, and we can chat again if you want to host me later on. Maybe I'll have some exciting things, like hinting at a new model or things like that.
Nathan Labenz (1:17:01) Sounds great. Let me just run down really quickly a couple of other takeaways that jumped out at me. One: the value of YAML as a format for output. That's definitely one I'm going to be using, because I have been using XML, and I think I've given that tip plenty of times to people. I like XML better than JSON because I can at least read my XML as it's being generated. But YAML is fewer tokens, also easy to read, and apparently works really well. So that's an update for me. A couple of other general patterns that people can follow from your work: starting with easy and moving toward hard, you use the phrase knowledge accumulation, and also deferring the big decisions for as long in the process as you can. Right? So you're doing that reflection and analysis, et cetera, and only finally choosing your strategy and getting into the code. That I definitely think is something that, whether you're drafting legal letters or any number of things, can follow a very similar logic. Asking for small, modular functions: that's a code-specific thing, but again, it makes a ton of sense for a lot of different scenarios. Right? Can you both break the task down and have it break the task down in its response to you? And then finally, immediate self-review. It's not intuitive, necessarily; I think maybe it should be more intuitive. This is maybe an area where more anthropomorphizing would even be better. For whatever reason, it doesn't seem intuitive to me that you would ask an AI to do something, then immediately take its solution and ask it to critique its own solution. It seems like that wouldn't be a way to get a better answer, because it's, you know, whatever, it's an AI. And yet it works.
So, literally just immediately after generating the solution, have the AI critique its own work with a different prompt. Super interesting stuff there.
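On the YAML-output tip above, a quick illustration of why the same structure is usually cheaper as YAML than as JSON. The hand-written YAML string here is an assumption of roughly what a serializer would emit:

```python
# Quick illustration of the YAML-output tip: the same structure is usually
# shorter (fewer characters, and typically fewer tokens) as YAML than as
# pretty-printed JSON, and easier to eyeball mid-generation.
import json

data = {"name": "solve", "steps": ["reflect", "rank", "generate"], "passed": True}
as_json = json.dumps(data, indent=2)
# Hand-written YAML equivalent (a YAML dumper would produce similar output):
as_yaml = "name: solve\nsteps:\n- reflect\n- rank\n- generate\npassed: true\n"
# YAML drops the quotes, braces, brackets, and commas, so it is more compact.
```

The savings come precisely from the punctuation JSON requires, which is also what makes partially generated YAML easier to read as it streams.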
Itamar Friedman (1:18:56) 100%, that's a really good summary. I would say the point to think about a bit is the prompts: because we simplified the steps, the prompts are less sensitive, but the reflection part is worth reading about in our paper, and also in the blog. The paper talks about dos and don'ts of how to do reflection; it can be done wrongly. And about knowledge accumulation, one thing, I said it before, but knowledge accumulation is not only how the data progresses from one step to another. Sometimes it's a skip connection: sometimes you take output that was generated in step number four and inject it again in step number seven, because you want the knowledge to be given almost purely, and not after all the digestion that happened along the way. That's how we try to prevent the accumulation of the probability of mistakes.
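The skip-connection idea can be sketched concretely. The step numbers and key names below are hypothetical illustrations of the pattern, not AlphaCodium's actual prompts:

```python
# Illustrative sketch of the "skip connection": output from an early step is
# re-injected verbatim into a later prompt, bypassing the digestion done by
# intermediate steps, so errors introduced along the way do not accumulate.
def build_step7_prompt(ctx: dict) -> str:
    # ctx["step4_output"] is passed through unchanged rather than being
    # summarized by steps 5-6; step 7 sees the original, undigested knowledge.
    return (
        f"Current draft: {ctx['step6_output']}\n"
        f"Original tests from step 4 (verbatim): {ctx['step4_output']}"
    )

prompt = build_step7_prompt({"step4_output": "edge-case tests",
                             "step6_output": "draft code"})
```

The analogy to skip connections in neural networks is apt: both exist to carry an early signal forward intact rather than only through a chain of lossy transformations.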
Nathan Labenz (1:19:48) Yeah. Knowledge accumulation, hopefully without error accumulation. Well, I'd definitely also refer people to the GitHub repository, where you can check out all the prompts. Everything is there, down to the temperature, all the settings, all that good stuff. And I do think this is a very practical, easy-to-digest project that a lot of people could take inspiration from. So this has been great. I really appreciate it. Any final thoughts or comments before we break?
Itamar Friedman (1:20:15) No, it was really, really enjoyable, and I love your podcast. I think it's a good time, if I may ask, to ask which one of the episodes I must listen to, for someone like me who really likes the details, the code, software development.
Nathan Labenz (1:20:33) One that I think has been really good for a lot of people is the TinyStories episode with two guys from Microsoft Research; in fact, one is also Israeli, Ronen Eldan. They created this kind of three-year-old reading level, if that's a thing, very basic language: short stories, simple concepts. Then they trained language models on that, and they showed some super interesting results, and for me, foundational intuition-building results, on how language models learn and why. You get even to micro reasoning skills at very small scales; at just a few million or a few tens of millions of parameters, you're already starting to see reasoning skills developing. And why is that? Well, we'll leave it to the episode, but I learned a lot from that one. This has been a lot of fun. Itamar Friedman, founder and CEO of Codium AI, thank you for being part of the Cognitive Revolution.
Itamar Friedman (1:21:35) Thank you. Co-founder, it's always important, and my team is the reason for our success, now and in the future. People are everything. Thank you.
Nathan Labenz (1:21:44) It is both energizing and enlightening to hear why people listen and learn what they value about the show. So please don't hesitate to reach out via email at tcr@turpentine.co, or you can DM me on the social media platform of your choice.