Discover how Adam Majmudar embarked on an exceptional journey to create the TinyGPU from scratch, with no experience in GPU design. This insightful podcast follows Adam's process from learning to implementation, highlighting the progressive contributions of countless engineers and the accelerating role of AI in the learning journey. Experience the unfolding of GPU architecture and gain a deeper appreciation for the technology driving today's AI advancements.
SPONSORS:
Oracle Cloud Infrastructure (OCI) is a single platform for your infrastructure, database, application development, and AI needs. OCI has four to eight times the bandwidth of other clouds, offers one consistent price, and nobody does data better than Oracle. If you want to do more and spend less, take a free test drive of OCI at https://oracle.com/cognitive
The Brave search API can be used to assemble a data set to train your AI models and help with retrieval augmentation at the time of inference, all while remaining affordable with developer-first pricing. Integrating the Brave search API into your workflow translates to more ethical data sourcing and more human-representative data sets. Try the Brave search API for free for up to 2000 queries per month at https://bit.ly/BraveTCR
Head to Squad to access global engineering without the headache and at a fraction of the cost: head to https://choosesquad.com/ and mention “Turpentine” to skip the waitlist.
Omneky is an omnichannel creative generation platform that lets you launch hundreds of thousands of ad iterations that actually work, customized across all platforms, with a click of a button. Omneky combines generative AI and real-time advertising data. Mention "Cog Rev" for 10% off https://www.omneky.com/
CHAPTERS:
(00:00:00) Introduction
(00:04:34) Adam Majmudar
(00:07:42) Learning Resources
(00:12:11) What is the process of getting your chip back?
(00:14:38) What is the scope of the project?
(00:17:18) Sponsors: Oracle | Brave
(00:19:25) Prioritization
(00:23:19) Memory management
(00:33:19) What instructions to include?
(00:38:03) Sponsors: Squad | Omneky
(00:40:42) Registers
(00:48:29) Memory Limitations
(00:57:51) Compute Pattern
(01:01:14) Dispatcher
(01:07:50) How does it get translated into hardware?
(01:21:07) Compute Core Execution
(01:24:57) The Fetcher
(01:27:07) Memory controllers
(01:37:49) Simulating the design
(01:41:09) What did you learn?
(01:50:36) Conclusion
Full Transcript
[Adam Majmudar] 0:00 These special registers are registers that you can't write to. The GPU itself handles writing to them. And that means that the GPU itself supplies this code execution environment with, like, hey, you're this block number. There's this many threads in each block, and you're also this thread number in this block. And so the high level job of the GPU now is like, okay. Well, how do we manage how to divide up these threads into the available compute cores, wait for them to complete their job, and then give them more jobs when they're done completing their work. The analogy is, like, if aliens come to Earth and they find the computer, like, how are they gonna understand it? Are they gonna break it down and look at all the connections, the transistors, and see what's happening? No. They're not gonna understand it by doing that. They have to come to some understanding of their own and then test it against the computers that they have. It's a bit of a massively overcomplex analogy for going through a GitHub repo.
[Nathan Labenz] 0:54 Hello, and welcome to the Cognitive Revolution, where we interview visionary researchers, entrepreneurs, and builders working on the frontier of artificial intelligence. Each week, we'll explore their revolutionary ideas, and together, we'll build a picture of how AI technology will transform work, life, and society in the coming years. I'm Nathan Labenz, joined by my cohost, Eric Torenberg. Hello, and welcome back to the Cognitive Revolution. Today, I'm excited to share my conversation with Adam Majmudar, creator of the TinyGPU project, in which Adam set out to speed run the creation of a GPU from scratch in just a few weeks despite having no previous experience designing GPUs. The result is truly remarkable and a great educational resource for anyone who wants a better understanding of the foundational technology on which today's frontier AIs are trained and run. In this episode, Adam walks us through the GPU architecture layer by layer from the tools he used to learn and design to the GPU programming paradigm of single instruction multiple data to the process of hardware implementation and verification in a way that I think will be illuminating for just about anyone who's not already an expert in this critical industry. Along the way, I think this episode has a couple big picture lessons to teach us as well. First, it's a powerful reminder of the multigenerational tech stack that all of today's AI breakthroughs are built upon. Adam was able to do this work thanks to the compounding progress in chip design and design automation software that thousands, if not millions, of engineers have delivered over decades. Second, it shows just how much AI itself is accelerating the learning process for individuals. Adam relied heavily on ChatGPT and Claude to help guide his learning journey. Whenever he got stuck, he would propose hypotheses to the AI and get feedback, enabling him to make much faster progress than learning solely from static resources. I found it particularly interesting to hear him describe today's leading models as having a sort of intuition for science and engineering, which transcends online resources in important ways. I've experienced this myself, most recently while trying to bring myself up to speed on the intersection of AI and biology, and separately when planning a macOS app. And I think it's a truly important capability, which when combined with the agentic capabilities that we expect from next generation models, suggests potential for genuinely transformative impact in the near future. Finally, beyond the technical content itself, I came away incredibly impressed with Adam as a thinker, builder, and communicator. The depth of understanding that he demonstrates reflects a level of curiosity, resourcefulness, and relentless drive to figure things out that will surely enable him to make a real impact in whatever domain he decides to tackle next. Personally, I came away from this conversation with a much stronger intuition for how GPUs actually work under the hood, and I'll bet this episode will scratch a lingering itch for many, many people. 1 note, if you're listening to the audio only podcast feed, we do use a screen share in this episode and refer to a number of diagrams and code samples from the TinyGPU GitHub repository. With that in mind, I would recommend the YouTube version of this episode if that's convenient for you, though we'll also have a link in the show notes in case you want to follow along on your own. 
As always, if you're finding value in the show, I'd appreciate it if you'd take a moment and share it with friends. And if you do have any feedback, you can reach us via our website, cognitiverevolution.ai, or by DMing me on your favorite social network. Now without further ado, please enjoy this whirlwind tour through GPU architecture with TinyGPU creator, Adam Majmudar. Adam Majmudar, creator of the TinyGPU project. Welcome to the Cognitive Revolution.
[Adam Majmudar] 4:39 Thank you for having me.
[Nathan Labenz] 4:40 I'm excited about this. So you've created this project that really caught my eye called TinyGPU, in which you described yourself as having no prior experience creating GPUs, but nevertheless set out to speed run creating 1 from scratch in just a few weeks. And I imagine you have learned an unbelievable amount from doing that. I've learned quite a bit just from following your progress. And hopefully, the audience will be able to learn a lot from following the results of what you did as well. You wanna start off by just introducing yourself a little bit and telling us how you got this crazy idea?
[Adam Majmudar] 5:16 Yeah. Sure. I'm Adam. I've been working on a company called Third Bone for the past 3 years. I started it in college and worked on it full time during college, and then worked on that for a couple years after, in a gap year. And I just stopped working on that recently. And now I've been doing a couple of deep dives, which I've been sharing on Twitter, on a bunch of different industries that sit at the overlap between what I think are gonna be defining technologies within the next 1 to 2 decades and areas that I've had, like, demonstrated interest in for my whole life. And my goal has been: how fast can you get to a point of technical competence and industry competence? And it's not just, like, the technical side of things. It's actually, like, understanding the industry and how things are going right now, understanding tailwinds, also understanding the opportunity landscape, specifically for young people. And TinyGPU happens to be the result of the technical part of that. I just finished my deep dive on chips, so I'm now planning to do other ones. But during that deep dive, my goal was basically to learn the entire engineering stack of chips and computation and the history of it. And part of that process was basically two 2-week sprints. The first 2-week sprint was, can I learn how to make a chip all the way from EDA, which is the electronic design side, and then understand how fabrication works and understand the architectural side, all in 2 weeks, which is traditionally something that people can spend 2-plus years in college to do? And my belief was, basically, you can cut it down 100x time-wise. Of course, you're not gonna be learning with the same thoroughness as what other people are doing. But the question you ask yourself is, is the thing that matters the thoroughness and the quantity of stuff that you learn? Of course, people who spend 3 years on this will be able to talk about lots of different detailed stuff. Or is it that there's some, not really 80/20, it's more like a 99/1, where 1% of the stuff is actually most of the important value. And if you can figure out how to extract that quickly, then you get much more value out of the learning. So that was the context behind the project. And so I did the 2 weeks, spent that time learning how to make chips. And then by the end of that, I formed this intuition, which is like, okay. Now that I know this, what's a really cool project I can build that will teach me a lot? It will also be fun. It also has potential to be interesting and valuable to people. And given the whole wave of stuff around AI, you have GPUs and tensor processing units and all kinds of AI accelerators, and also a whole wave of stuff around ASICs. I thought it would be interesting to dig into the GPU as a foundational way to learn all of that. So that's the context on the project.
[Nathan Labenz] 7:40 Love it. Well, let's get into it. 1 of the challenges that you noted right up front that I hadn't really considered, but definitely makes a lot of sense, is that there aren't a ton of great learning resources out there for things like this because a lot of the technology is proprietary. That includes designs. I've been doing some of my own prep and realizing that, like, NVIDIA has not published every layer of their technology stack. And so there's questions even to this day about how some of the most advanced GPUs work in some critical ways. And also design tools, the software packages, those are not always super easy to come by either. So maybe for starters, just characterize that landscape a little bit and tell us what learning resources and software tools you were able to avail yourself of.
[Adam Majmudar] 8:26 So I'll contrast this to CPUs, which is a much broader landscape, and there's a lot more learning resources. And the reason is at least partly a function of how old it is. The CPU is basically the oldest invention in terms of actual useful computing. Lots of people will disagree with that statement, but take the core concept of it. And because of that, there's so many learning resources. If you take an architecture course in college, you'll learn how to build a CPU. There's basically tons of courses around it. And what it means to have it fully documented and taught is not just how it works at a high level. Oh, here's the different parts. There's a data memory and program memory and registers. It's actually like getting into the control signals at the low level and understanding how machine language translates into stuff happening in the CPU and stuff like that. So it's like the connection between the software and the hardware level. And that's really what it takes to be able to completely replicate it. If you wanna build a CPU from scratch, at the end of the day, you need to understand how you can actually program it and how you can turn that into low level signals in your design. And so in contrast, with GPUs, there's essentially nothing at that level. So what you'll find is if you go look for NVIDIA and AMD GPU designs, you'll see high level architecture, and it'll show you all the different pieces of that. And, of course, it's a little bit ridiculous to even expect something at the control signal level for designs that large. But actually, there's no resources that dumb these things down into the simple elements so that you can at least try to implement something yourself. Now 1 person pointed out that Intel actually does have pretty low level diagrams. Again, you wouldn't really use it for learning because it's an actual production GPU, so it's not really too easy to understand. And that could be 1 interesting approach for people.
[Nathan Labenz] 10:00 Gotcha. Okay. What about on the tool side? I assume you were not able to use all the same software platforms that folks at NVIDIA or at AMD use. Maybe more so at AMD, because I believe their stack leans more open source. But what were you able to use to actually create these things?
[Adam Majmudar] 10:16 Yeah. So certainly my stack is completely different from all the real players. So the thing is, in this space, the EDA tooling, like the electronic design automation, which is what you use to actually convert your architectural designs into an actual chip layout and also do a lot of stuff optimizing your design, which is basically the entire complexity of the process, those things cost, like, hundreds of thousands to millions of dollars, like, per seat and per company. And so an individual can basically never actually use those things. And in terms of shipping an actual production quality layout, you basically need those. Like, they have a complete moat on this industry. The nice thing, though, is there's this project funded by DARPA within the past, I'd say, 10 years, I'm not sure on the exact number, called OpenROAD. And it's an attempt at making an open source EDA software. It's obviously way worse than those real production ones in many ways. Like, it's not feature complete. But for simple cases, like what any individual would do or even small businesses would do, it's completely sufficient. And then the last piece that made all this possible is that, again, in the past decade, there's this company called SkyWater, and they basically made this open source PDK for their own process node, which means that they have their own foundry that produces chips at the 130 nanometer scale, which is just, like, the size of the gates and the transistors they make. And they opened it up with this thing called Efabless, where multiple people can basically pay a much smaller amount to get, like, a 100 chips of this process. And that costs, like, 10 k. And this other group called Tiny Tapeout amortized the cost. So they'll be like, okay, a 100 people can now pay for a tiny slot, and we'll put, like, a 100 projects on each of these chips. And so because of this incremental cheapening of everything, it now becomes possible for someone like me or others to use that, basically the SkyWater process node, to make a chip. And then you use something built on the OpenROAD stack called OpenLane. It's basically just OpenROAD specifically for that SkyWater stack. And now all that's open source. Obviously, it's worse than, like, Synopsys and Cadence, which are the real production tools, but it's completely functional for what I would wanna do or, like, any individual hobbyist.
[Nathan Labenz] 12:20 Cool. So we'll maybe get back to the sort of hypothetical manufacturing side toward the end once we get through all the design considerations. But do I have it right that at the end of this, you have a micro, a tiny, I should say, GPU that could be included alongside, like, a 100 other projects on 1 chip? And then the chip you would actually get, if you ordered through the process you described, would be a chip that has your design plus a 100 other designs all on it. And then everybody just shares in the production cost, but also gets access to 1 another's designs. Is that how that's working?
[Adam Majmudar] 12:58 Yeah, and I'm probably actually gonna do that. So the thing that I mentioned called Tiny Tapeout, created by this guy named Nathan, he's basically, like, the GOAT of making ASICs accessible nowadays. You can find him on Twitter, and you can find his course, which is Zero to ASIC. He has basically made this process where they'll buy up 1 of those Efabless $10,000 slots where you have to get a 100 chips and you have to pay the 10,000. And as I said, they'll allow a 100 people, or, like, a big group of people, to submit their projects. And, for example, yeah, exactly what you said is I could submit this to that, which I will do, and then I'll get the chip. And the cool thing is I'll get the chip and I'll be able to play with my project, but I'll also be able to play with the projects of all the other 99 people who submitted, which is just, like, a cool byproduct of it.
[Nathan Labenz] 13:38 That's awesome. Other than for learning, what is this good for? Are there projects where people are actually building things and can't get something that meets their requirements any other way, or is this really all for the love of the game and the fun of the DIY?
[Adam Majmudar] 13:57 I think now, with these stacks opening up and the tools becoming better, there might be, like, an interesting analog market or a smaller scale ASIC market, where at least you could do the initial prototyping phase of the stack. I think in practice, there's not gonna be huge businesses built on this ever. Like, again, it will always make sense for them to go to the real production stacks. So I do think it's mostly hobbyist. And I think the most interesting thing on the learning side is getting more college students and younger people just getting the intuition of this. Since the industry was created and the engineering was much more popular several decades ago, getting the new generation into it is interesting, given that there is the whole landscape of the CHIPS Act, and there's some geopolitical incentives there.
[Nathan Labenz] 14:36 Yeah. Cool. Makes sense. Okay. So let's get into the project itself. 1 thing I wanted to just try to do upfront is figure out the scope. When you say designing a GPU from scratch, there's, like, a lot of layers to the tech stack. Obviously, you're not going all the way down to mining your own minerals. So what was, like, the foundation where you said, okay, this is deep enough down the tech stack to count as scratch, but I'll still be able to use this kind of foundation to build on top of?
[Adam Majmudar] 15:04 Yeah. So there's a couple of things. First of all, obviously, I'm not, like, building the whole EDA pipeline from scratch or anything like that. So the place where it really starts is Verilog, which is where most designs start, which is basically your register transfer level logic. It's just designing the specific memory and the specific wires that connect up the chip to make all of the logic. And so designing something at that layer is the layer that I was targeting. And then also, more specifically, because that's very generic, the architecture that I wanted to choose is obviously very nitpickable in the sense that, okay, you have a giant GPU. There's, like, millions, thousands of features to this GPU. The question is, what are you gonna include? And the word GPU is an amorphous term nowadays. Like, it started out referring to graphics processors. Now people are still using it for lots of stuff that aren't graphics processors, like tensor processors and other AI accelerators. And so there's a blurry line in terms of what counts as a GPU. And so because of that, I set my goal of what I wanna focus on with this not as, like, graphics hardware or any of the specific things like that. But more: if you wanna understand this kind of giant blob of areas that spawned from GPUs, what are the core concepts you need to understand in order to be able to build on top of them and understand the little details? And then my goal is, can I create a foundation where, using this foundation, you'll be able to learn all of the important stuff? And by choosing that, it allows me to strip away a lot of the complexities of some of the nuances of any little engineering decision you'd face if you're choosing to make, like, graphics hardware or any of the other specific things, and just focus on kind of the fundamentals that I was talking about. And so some people would disagree. Right? Because it's a blurred line. Some people will be like, oh, that's not a GPU. It doesn't include graphics hardware or whatever. There's, like, nitpicks you can make everywhere. But at the end of the day, it's just a design decision for the purpose of the product, which is, like, how can you teach people how this stuff works most effectively?
[Nathan Labenz] 16:55 Yeah. Makes sense to me. And no doubt these are super complicated. It's often said these days that chip making is the most complicated industry in the world and, you know, arguably the most complicated thing that humans have ever devised to manufacture at scale. So Yeah. Not surprising that you'd have to make some simplifying assumptions to pursue a project like this. Hey. We'll continue our interview in a moment after a word from our sponsors. 1 of the things that you mentioned in your Twitter thread that originally caught my attention was that the beginning of the exercise was a real challenge in just prioritization, figuring out what really matters most. You came up with 3 things that you felt mattered the most. We can get into each 1 in turn, but you want to just give a high level overview of what those 3 things are and how you identified them. Maybe a little bit on what you decided to leave out, and then we can go deeper into each of the 3.
[Adam Majmudar] 17:53 Yeah. So the 3 things I chose are architecture, parallelization, and memory. And granted, that whole Twitter thread, although it was written chronologically, I'm not sure if you'll link to it or not, but it's actually the result of hindsight. And so, like, all of the architecture design I chose, for example, the design decisions that I put upfront, are actually the result of me having gone through the entire process. I didn't have that foresight at the beginning. But the reason I chose those things is because, and maybe I'll rephrase them into 2 really. Like, architecture is not really saying much. What I meant by focusing on architecture is, like, what are the key elements of each piece of the GPU that are important to all GPUs? Like, different GPUs have different little specific elements tailored to their need. But what are the things, again, going back to the foundations, that you absolutely need in order to understand how this whole thing works, broadly speaking. And so I broke that down over time. Now I had the architecture diagram in my tweet, and that was a good breakdown of basically the most simple possible explanation of how GPUs work. And we can get into the specifics of that. Now the other 2 terms were actually the more interesting parts. So I'll start with parallelization because that's the 1 that is most obvious on the surface level. And that's, yeah, GPUs obviously are built to do parallel computation. That's their whole utility. Basically, graphics hardware, graphics use cases in general, and machine learning, they benefit a lot from doing matrix math. And a lot of that matrix math happens to be element wise, in other words, computations on individual elements don't depend on other elements, which means you can do them all at once. And that's the whole reason why GPUs are useful. So the whole point of GPUs is, instead of a CPU, where generally things are thought of as being executed sequentially, in practice that isn't exactly true, but, like, generally, you execute operation 1, then 2, then 3, then 4, and now you've done this whole thing in 4 clock cycles or 4 cycles of computation. Instead, with a GPU, you just distribute it out among a bunch of compute resources, and now you're accomplishing this all in a much faster time period. Now the interesting thing about parallelization, on the surface, it seems like it's just, oh, you need to have a bunch of compute resources or, like, compute cores, and then you can distribute these workloads across the compute cores. Actually, it turns out that's a lot more trivial than some of the other difficulties. So there's, 1, the problem of how do you distribute these workloads effectively with your resources and manage the resource utilization effectively to maximize it? So that's, like, 1 interesting problem. And that happens on, like, the hardware level. And then there's also maybe a more interesting thing, which is, what is the software pattern of parallelization? As in, how do you actually program these things? It's 1 thing to make the hardware support this capability, but how do you actually make the software easy to use for developers who aren't usually used to it? And that's arguably 1 of the core problems that NVIDIA solved and what made them win. It was not a given, but their CUDA software design is so good for parallel programming.
Probably by nature of how similar and integratable it was with the previous stack that the developers who needed to use it were used to, like C. And so another interesting element to highlight there is, since the programming side is so important, how does the programming pattern that people are so familiar with actually get implemented in the hardware? So that was an interesting element of parallelization. So just to break that down, the 2 interesting things in parallelization were resource management and resource utilization, and then the software pattern to hardware. Like, how does that get implemented? And, specifically, the software pattern is called same instruction, multiple data, and that just means that you're executing the same code on multiple pieces of data, like elements of the matrix. And in hardware, it's called same instruction, multiple thread, which is just, like, the hardware manifestation of that. And then the last thing, which is the less obvious and more interesting, or maybe most interesting, realization for me. And maybe it's more obvious to people who are actually doing the programming, like, GPU programmers, because it's actually, like, a first party thing in CUDA, for example. But memory management is really interesting to me. And the reason is because I had a completely wrong notion of memory management in GPUs. In fact, I didn't even realize how important it was initially. And this is the thing I was saying in my tweet where George Hotz told me. I, like, randomly ran it by him. I was like, hey, can you give me feedback on my design? And he's like, dude, this doesn't, like, address the whole memory challenge, which is the most important thing. And what that is, is it would seem initially that when you're thinking about parallelization, the bottleneck on how much you can parallelize computation is how much compute resources you have. In other words, let's say I wanna do a 100 matrix additions at a time. That means I need a 100 different cores that support addition. And it seems like that's the bottleneck. In practice, it's actually not that at all. In practice, it's actually memory. Because in order to do a 100 additions, yes, you need a 100 compute units, but you also need to be able to read from a 100 different memory locations at once from the global memory. And global memory is bottlenecked on how much bandwidth it has. And that bottleneck tends to be the limit because there's a pretty big latency in terms of, this compute unit wants to access memory, then it needs to send the request, and then it needs to wait for it. And if all of these units are trying to access memory at the same time, you may actually have more requests for memory than you have bandwidth in global memory, which means that you need to actually manage all these requests in memory. And that causes, like, a huge latency. And to rip a quote directly from what George Hotz sent to me, he said that designing GPUs is largely the task of trying to work around memory latencies and hide them away at different layers. And that's a lot of what the challenge of GPUs really is about, which is why, architecturally, they've had all these things to solve that. Like, the memory controller, for one, and then they also have multiple layers of caches to prevent having to access global memory, and shared memory and all kinds of other stuff. So the memory pattern is another very important part of GPUs.
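To put rough numbers on the bottleneck George Hotz was pointing at, here is a quick back-of-the-envelope sketch. The hardware figures are illustrative placeholders, not measurements of any particular GPU:

```cuda
#include <cstdio>

int main() {
    // Element-wise addition does 1 operation per element but moves 3 values:
    // read a[i], read b[i], write c[i]. At 4 bytes per value, that's 12 bytes
    // of global-memory traffic per useful operation.
    const double bytes_per_op = 3 * 4;

    // Illustrative hardware numbers (placeholders, not any specific GPU):
    const double compute_ops_per_sec = 10e12;  // 10 trillion adds/sec of raw compute
    const double mem_bytes_per_sec   = 1e12;   // 1 TB/s of global-memory bandwidth

    const double compute_ceiling = compute_ops_per_sec;
    const double memory_ceiling  = mem_bytes_per_sec / bytes_per_op;

    printf("compute-bound ceiling: %.2e adds/sec\n", compute_ceiling);
    printf("memory-bound ceiling:  %.2e adds/sec\n", memory_ceiling);
    // The memory ceiling comes out roughly 100x lower: the adders mostly sit
    // idle waiting on memory, which is why GPUs spend so much architecture on
    // hiding and overlapping those latencies.
    return 0;
}
```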
[Nathan Labenz] 23:31 Okay. Cool. That's a great overview. Where do you think it makes the most sense to start in terms of digging into each of these 3? I originally had architecture first, but based on your description there, I might go parallelization first and then come back to architecture. But what do you think makes the most sense?
[Adam Majmudar] 23:50 It actually might be easiest if we go in reverse order because people are probably most familiar with software. Right? So you could actually start with the software side of things, the ISA software side, and then it'll work backwards into how this is actually implemented in the hardware. Do you think that makes sense?
[Nathan Labenz] 24:04 Sure. I'm happy to follow your lead. I mean, define terms for us. ISA is instruction set architecture?
[Adam Majmudar] 24:13 Exactly. Yep. Yeah. And so
[Nathan Labenz] 24:16 I was confident about instruction set. I don't know if I knew the A, actually, there.
[Adam Majmudar] 24:19 So this part might be a bit challenging to do without video. Like, I think almost all of this part is gonna be near impossible to do without video. So that's the caveat. But, like, it's gonna be really challenging to talk about different instructions and the actual code and stuff like that, which I think is important to understand it.
[Nathan Labenz] 24:36 Oh, let's do this. I'll share my screen. I've got these notes up, and I'm right on your GitHub page. So okay. Cool. Let's just try to describe it, talk people through it. If you want to get a clearer view of this, we'll have the video on the YouTube version.
[Adam Majmudar] 24:56 Yeah. So let's scroll down to the kernels section first. Yeah. Perfect. We can do matrix addition. So if you just scroll up a bit. Yeah. That's perfect. Okay. So the best way to start with understanding this whole system is by understanding the software pattern, just because that's generally most familiar to people, easiest to see the layout of what's happening. And so, generally, the way you think about GPU software is it's the SIMD pattern, which is same instruction, multiple data. And so all that means is you're just gonna write 1 instruction. You're not gonna do any loops or anything to be like, okay, execute this instruction on all of this different data. You're just gonna do an instruction that's basically invariant to the data that it receives. And what that means is your code is gonna take in some data, it doesn't know what the data is, and based on that data from, let's say, 2 different matrices, it's gonna perform a computation on the data to get the element of the resulting matrix, and it's gonna put that back in memory. To take a concrete example of this, we can use matrix addition, which is probably 1 of the most simple GPU use cases. So let's say we have 2 1 x 8 matrices, which means each matrix has 8 elements, and let's call them matrix a and b. And all we wanna do is quickly compute the addition of these matrices, which we'll call matrix c. Right? So let's say that matrix a and b are loaded into memory in elements 0 to 7 for matrix a, and then 8 to, what is that, 15 for matrix b. And we just wanna put matrix c, the result of this addition, into addresses 16 onwards in memory. And 1 thing you could do on a CPU is write a program, and that program is gonna, like, in a for loop, be like, okay, for i less than 8, as in, for each of the 8 elements, add up the element for matrix a and b, put that into the element in matrix c, and we're gonna do that. It's gonna take 8 cycles, like, 8 iterations of the for loop. And so that's 1 way to do that sequentially. Now, alternatively, what you could instead do is write this thing called a kernel, which is a GPU program. And this kernel is just gonna handle adding up 1 pair of elements from a and b and putting it into 1 slot in matrix c. And so we'll say this kernel is gonna take the ith element of matrix a and b, add them up, and then put it into the ith element of matrix c. And the actual code for that is pretty simple, everyone can imagine what that's like. So you're gonna basically read some global memory, the base address of a plus the ith element, which is just, like, getting the element from a. You're gonna take the base address of b, get the ith element, read it out, add them up, and then put it into c. Right? So, actually, the program itself is very straightforward, and there's not really any GPU element right there. So the main thing with the GPU is you just wanna tell it how many times you wanna execute this instruction and on how many different values of i you wanna execute it. So in this case, we want this to be from the 0th to the 7th element of each matrix. There's 8 elements, so we want 8 different values of i. And that's just because we want to add up the 8 different elements of the matrix. Each of those times where we need a separate execution of this code, that's gonna be called a thread. And a thread is just executing the same code again on multiple different parts of the matrix.
And for this code, all we have to do is specify what each thread is gonna do. It's just gonna basically read 2 addresses and then write an address. And then we're just gonna run 8 of them. And the question is, when you run 8 of these threads, let's say we now have 8 of these exact same pieces of code running in our GPU, let's just say somehow we accomplished that, the question is, how do you actually get each of these threads to do something slightly different? They're all running the same instruction, but you actually need them to be able to perform it on separate data, like the different elements of the matrix. And so what you need to do is, somehow in the actual hardware of the GPU, you need to make each little area that's executing 1 of these kernel codes, you need each little area to know something about which thread it's executing. Is it executing the 0th thread? Is it executing the second thread or whatever? And if you have a way to do that, then within that hardware, the hardware can know something about its own context. So let's say the hardware knows which thread it's executing. Now the hardware can say, hey, there's this instruction here that's supposed to load this data in from memory. The data that I wanna load in is actually based on which thread is being executed here. So if I'm currently executing thread number 0, I wanna load in the 0th element of a and the 0th element of b. If I'm executing thread number 1, I'll bring in the first element of each, blah blah blah, and I wanna store it in the first element of c. And so that is the most important thing here. So the 2 important things are, 1, we're writing some code that is just meant to be executed by a single thread, and then we're gonna specify how many threads get executed. The second important thing here is these special values, which are gonna be local in the hardware. So each time a thread is getting executed in hardware, it has some broader context about what's going on. And so if you look at those, for people looking at the screen, there's these values called block index, block dimension, and thread index. And this will probably be familiar to people who are familiar with CUDA. Obviously, there's more complexity in the actual pattern. But, basically, the way this works is that when you have tons of threads that need to be executed for something, you're gonna group these threads into batches that are basically executed together on a single compute unit, and those batches are called blocks. And so let's say our block size is 4, which is what it is in my current GPU design, that means that batches of 4 threads are going to be executed together. And so whenever these things are executed, basically, what you wanna know is which block number is currently being executed. And then within that block number, what is the thread index in that block? So that means that the threads are each gonna have an index from 0 to 3 for each block, because there's 4 threads per block. And then let's say we have 10 blocks being executed. So now, if we're on block 3, how do we get the correct thread that's being executed? Very simply, we're just gonna multiply the block ID, 3, by the number of threads per block. So now that's, like, 12. So that means that there's 12 threads that have already been executed in the previous 3 blocks. And then we're just gonna add the thread ID. So now we know we're on thread number 12, 13, 14, or 15. And now those threads can pull in the respective data they need from memory.
And so that is how the SIMD, like, programming pattern is implemented in the software. And then we'll talk about soon how that's implemented in the hardware, and it's very simple. And that's the core understanding you need of the kernel programming pattern. And in terms of, this is an asterisk, how did I decide what instructions to include in my ISA? As you can see, I'm trying to make it as minimal as possible. And so, really, my decision function there was basically, what are the minimal set of things I could implement in order to make some cool, useful use cases? And so matrix addition and matrix multiplication were the most obvious things there. And so the instructions I chose are basically just what's actually needed to support those, which is just basically some basic arithmetic, loading and storing data from memory, because that's how you basically get your inputs and then store the result of your computation, and then constants just for some convenience.
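For readers who know CUDA, the pattern Adam just walked through maps almost one-to-one onto a standard CUDA kernel. The sketch below is an illustrative analog in CUDA, not the TinyGPU ISA or the repo's actual kernel code, and names like `matAdd` are made up for the example:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// One thread handles one pair of elements: c[i] = a[i] + b[i].
// blockIdx, blockDim, and threadIdx are supplied by the hardware,
// playing the same role as the reserved context registers in TinyGPU.
__global__ void matAdd(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // which thread am I?
    if (i < n) {
        c[i] = a[i] + b[i];
    }
}

int main() {
    const int n = 8;                       // two 1 x 8 matrices
    float ha[n], hb[n], hc[n];
    for (int i = 0; i < n; i++) { ha[i] = i; hb[i] = 10 * i; }

    float *da, *db, *dc;
    cudaMalloc(&da, n * sizeof(float));
    cudaMalloc(&db, n * sizeof(float));
    cudaMalloc(&dc, n * sizeof(float));
    cudaMemcpy(da, ha, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(db, hb, n * sizeof(float), cudaMemcpyHostToDevice);

    matAdd<<<2, 4>>>(da, db, dc, n);       // 2 blocks of 4 threads = 8 threads
    cudaMemcpy(hc, dc, n * sizeof(float), cudaMemcpyDeviceToHost);

    for (int i = 0; i < n; i++) printf("%g ", hc[i]);
    printf("\n");
    cudaFree(da); cudaFree(db); cudaFree(dc);
    return 0;
}
```

Launching with `<<<2, 4>>>` mirrors the 2-blocks-of-4-threads setup described above: the hardware fills in the block and thread indices, and each thread computes its own global index `i` from them.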
[Nathan Labenz] 31:54 But yes. Cool. That was all quite well said. So just to try to summarize quickly back, I'll go back to the block code thing, then we'll come back to the instruction set. When you look at this block of code, the key thing to understand is this code only executes the addition of 1 element from within the arrays, or the matrices, that are going to be added together. And so the key trick is you have to write the code in such a way that it can take in as a variable which position you are in, and then handle everything else just based on that 1 variable that the hardware will actually feed in at runtime. So everything is gonna be the same except for what position you're in, and then that will determine the offset of the memory, which will determine what value you load in, which will determine what value you ultimately calculate, and therefore what value you're going to put back into which position in memory. All of that is determined by what position you are in in the broader array of computation. And you can think of that as, in your case, you've got a 1 x 8, but obviously you could get more. I believe the current GPUs go up to, like, 3 dimensions of that. You wanna give us just a little bit of flavor as to how this kind of gets more complicated from here?
[Adam Majmudar] 33:19 Yeah. So in actual GPUs, first of all, they have more than just block index, block dimension, and thread index to identify a thread. There's, like, a couple more dimensions of just, like, different groups of batching, and you basically have to use every single dimension to identify what thread you're in. And then some other interesting things. So I basically just captured the most simple piece of, like, GPU programming here. In reality, there's a couple other very interesting and important things. So first of all, like, when you're loading and storing data, that's something that we'll maybe get to later. But by nature of how GPUs have different levels of memory, they have shared memory, which is shared between an entire block of threads, which means that 4 different threads can actually store stuff and read stuff from the same location. It's called shared memory. And then each thread also has its own dedicated memory called a register file. That's important. And so 1 complexity that I didn't implement here, which is there in real GPUs, and it's very important for actual use cases, is that you can actually choose which memory you're putting stuff in. In my pattern, there's only 2 types of memory you interact with. All of these mul, add, comp, all these instructions, everything that has, like, an r in its parameters, so it'll be like, multiply r0, r2, r3, that's operating on registers for each thread, which means that stuff is all just specific to a single thread, the thread that's currently executing, and it's storing stuff in these local register files that hold local data. And it's basically just performing computations on the register values. And then there's the load and store instructions, which are either loading data from global memory into a register or storing data from a register into global memory. So there's really only 2 levels of memory happening here. In a real GPU, you have separate instructions that are gonna let you interact with registers, global memory, and also shared memory, and there's also layers of caches. So that's 1 thing. Then 1 other interesting thing is, in real GPU programming, you have the ability to synchronize different threads with each other with barriers. Which means that, because of the necessity of shared memory, let's say 4 threads want to store something in this shared area and then access it from each other. But what happens if 1 of the threads is ahead of the others? In other words, 1 thread gets to some step and the other threads are not there. And now what if this thread is going to read some data from the shared memory? Well, if it's gonna do that, it needs to make sure that the other threads are at the same point. Otherwise, it could read corrupt data, or maybe the data is not ready yet to be read by that thread. And so there's these things called barriers where you can basically synchronize threads. So it's like, this thread is just gonna wait until all the other threads get to the same point, and then it can read that thing, maybe from shared memory or something like that. So that's another important thing, which, again, I didn't implement because it's explainable from here, and also, like, the complexity of implementing it is not worth it.
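Since shared memory and barriers come up here but are deliberately left out of TinyGPU, a minimal CUDA sketch of the idea may help. This is an illustrative analog, assuming a block size of 4 to match the discussion; the kernel name and data are made up:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each block of 4 threads stages its value in shared memory, hits a barrier,
// then reads a neighboring thread's value. Without __syncthreads(), a fast
// thread could read buf before a slower thread in the block had written it.
__global__ void neighborSum(const float* in, float* out) {
    __shared__ float buf[4];            // visible to every thread in the block
    int t = threadIdx.x;
    int i = blockIdx.x * blockDim.x + t;

    buf[t] = in[i];                     // each thread writes its own slot
    __syncthreads();                    // barrier: wait for all 4 threads

    out[i] = buf[t] + buf[(t + 1) % 4]; // now safe to read another thread's slot
}

int main() {
    const int n = 8;
    float h_in[n], h_out[n];
    for (int i = 0; i < n; i++) h_in[i] = i;

    float *d_in, *d_out;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);

    neighborSum<<<2, 4>>>(d_in, d_out); // 2 blocks of 4 threads
    cudaMemcpy(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);

    for (int i = 0; i < n; i++) printf("%g ", h_out[i]);
    printf("\n");
    cudaFree(d_in); cudaFree(d_out);
    return 0;
}
```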
[Nathan Labenz] 35:58 Okay. Cool. That's good. Hey. We'll continue our interview in a moment after a word from our sponsors. You wanna next just talk through a little bit the instruction set and maybe give a little flavor of what kind of the next instructions would be if you were to expand from here?
[Adam Majmudar] 36:16 Yeah, so the instruction set, I would say a lot of it is very simple. So you have addition, subtraction, multiplication, and division. Very straightforward. Everything just takes 3 registers. And so the important thing I'll cover here is just the specifics of the register file. So with addition, subtraction, multiplication, division, for context, each thread has 16 registers, which means that it has 16 different places that it can store whatever values it wants to store for any arbitrary computations. And importantly, the last 3 of these registers are restricted. And so we'll call those registers 13, 14, and 15. Those are how, in my GPU, those custom context values I was talking about are supplied. In a real GPU, it doesn't necessarily work like this. But, basically, the important thing is every thread has access to its context. I happen to implement that through these registers. And so these special registers are registers that you can't write to. The GPU itself handles writing to them. And that means that the GPU itself supplies this code execution environment with, hey, you're this block number, there's this many threads in each block, and you're also this thread number in this block. And, again, that's how, in the hardware, the code has access to those values. Now the other 13 registers are read write registers, which means the code is free to access these however it wants. And so these add, subtract, multiply, and divide instructions, what they do is basically take the values of 2 registers, which you specify with 4 bits each, which is what you need to specify an address out of 16 different possibilities. And it will take the values from these 2 different registers, and it will perform a computation on them, like an addition, subtraction, multiplication, or division, and then store that result back into 1 of the registers. All the registers are specified by 4 bits. Now with the load and store instructions, similarly, the load instruction will take the address of 1 register. So it will take the value of 1 of the registers, and that's going to specify a memory address. So for example, if that register has the number 4 in it, it's going to go load in element number 4 from global memory. And it's going to put that into another register. And for the store instruction, it's going to take the value of 1 of the registers and store it into a memory address in global memory, specified by another register. So again, those are register computations. The constant instruction is very straightforward. You just load a specific constant value that you want into a register. And that might just be because you want to use it for addition or something else. Now the interesting instructions that I'll spend a bit more time explaining are the BRnzp, which is the branch instruction, and the comparison instruction. And these 2 work together. It's, like, the simplest, kind of naive, implementation that allows you to do if statements and looping. This is all, like, CPU stuff; of course, it's in GPUs too. But the main thing is that what you wanna do is create some condition, and then if this condition is true, you can jump somewhere else in your code. Which means, like, let's say I wanna do something like, if 1 of my registers is greater than 0, then jump to the end of this, skip this entire segment of code, otherwise go through it. So that's basically like an if statement. Right? Or you could do a loop. So you could do some code. You could put, like, a label at the top of the code.
So let's call it, like, a loop label, which is actually in 1 of my kernels below. And then you have a whole part of code. And then below that code, you're gonna say, if some condition is true, jump all the way back to the top of the loop, which is gonna keep it looping until that condition is no longer true. The way that you implement this is with the comparison and branch instructions. And it's basically always a comparison, then a branch. So what you do is, first, you do the comparison instruction. And that's basically gonna take 2 registers. It's gonna compare them. And then it's gonna check if the result of subtracting the second register from the first is positive, 0, or negative. And it turns out you can actually do a lot with that. So you're just gonna compare 2 registers, and it's gonna store whether the difference between them is positive, negative, or 0 in something called the NZP register, which is stored in the program counter unit of, like, any compute unit, basically. In our case, it happens to be a thread execution unit in a GPU. And so the program counter now has the current line of code being executed. And now it also has this comparison thing, which is storing whether the result of the subtraction of the 2 registers is positive or negative. Now with that, you're gonna have some NZP value, and it's gonna have a 1 specifying whether it's negative, 0, or positive. Now the next instruction, BRnzp, which is a branch, is gonna allow you to specify: branch to some specific line, as in, jump to some specific line of code, if the NZP register holds some specific set of values. So you could say, if the NZP register is saying that this comparison is negative, then jump back to the loop code, which is exactly what we do in the matrix multiplication. So the matrix multiplication needs to loop through some code. So what it does is it compares some counter. Basically, it's just incrementing a counter, and that counter is supposed to do some number of loops through the code. So let's say, in our case, the counter needs to go through basically 2 loops. And so it's just storing the value, and that value is going to start at 0 and keep incrementing up until it's past the max. And so, basically, the way you would use these instructions is, use the comparison instruction. So you're going to say, compare my counter to the value that the counter is supposed to always be less than. And then if the counter is still less than the max value, go back and loop. Otherwise, don't loop. And so that's how you'd use those 2 instructions to create conditional logic, which is exactly what the code does. So those are pretty important. Now, the last 2 instructions are very straightforward. So there's, like, the no op, which is basically not even an instruction. It just moves on to the next line. And there's a return instruction, which tells the GPU that it has reached the end of executing a specific kernel. And that's the entire ISA. And honestly, this part is actually a lot of the hard part. Once you understand this code layer, which is why we started here, then the hardware layer actually becomes a lot more clear. And that's partly a result of the fact that this code layer is the result of a lot of fine tuning on the hardware layer also, in terms of what I actually included here. But yeah.
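To connect the compare-and-branch pair back to the loops programmers actually write, here is a small sketch. The C++ function is ordinary code; the commented pseudo-assembly below it shows, conceptually, how the loop control lowers to a compare plus a conditional branch. The mnemonics, flags, and register assignments are illustrative guesses in the spirit of the ISA being described, not the repo's exact syntax:

```cuda
#include <cstdio>

// A counted loop as you would write it in a kernel (or anywhere else).
// The loop control, compare then conditionally jump, is what the
// CMP + BRnzp pair implements.
float dotProduct(const float* a, const float* b, int k) {
    float acc = 0.0f;
    for (int i = 0; i < k; i++) {
        acc += a[i] * b[i];
    }
    return acc;
}

// Conceptually, the loop above lowers to something like this pseudo-assembly:
//
//   CONST R1, #0          ; i = 0          (assume R2 already holds k)
// LOOP:
//   CMP   R1, R2          ; NZP <- sign of (i - k): negative, zero, or positive
//   BRzp  DONE            ; if i - k >= 0, exit the loop
//   ...                   ; body: load a[i] and b[i], multiply, accumulate
//   CONST R3, #1
//   ADD   R1, R1, R3      ; i = i + 1
//   BRnzp LOOP            ; all three flags set = always branch back to the top
// DONE:
//   RET

int main() {
    float a[2] = {1.0f, 2.0f};
    float b[2] = {3.0f, 4.0f};
    printf("%g\n", dotProduct(a, b, 2));   // prints 11
    return 0;
}
```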
[Nathan Labenz] 42:25 Okay. Cool. Again, maybe just to try to recap a little bit there. I hadn't understood coming in that you had these 3 dedicated registers, which is just a position in the local-most memory. Right? That sort of declares to the kernel, which is the form of parallel programming that says, here's what to do given what position you are in, that indication of where you are is provided in these dedicated registers. And then separately, you have, I guess, how is that, that's not really related to the size of the instruction set. That's kind of independent. Like, how many registers, I was conflating those for a second in my head because you numbered the instruction set with a similar cap. Is there a connection there that I'm inferring correctly, or am I hallucinating this?
[Adam Majmudar] 43:13 So there's a couple connections in the instruction set to the hardware. There's a few important connections, and they're very much demonstrating that this is a dummy project. So first of all, if you look at the instruction set, each instruction is 16 bits, and the first 4 bits are the opcode, meaning that, basically, since there can only be 16 different combinations of that, with this design I've created, there's actually only space for 16 total instructions, which means there's only space for 4 or 5 more instructions if I wanted to add them. Now the second constraint is that I've made it so that each register is specified by 4 bits, which means that there can only be up to 16 registers. If I wanted more registers, I couldn't do that. I would need a fifth bit to specify each register, which means the instruction would have to be longer. And then the third thing is that the constant and, more importantly, the branch instructions both take immediate values, which means they're not taking some value from a register. They're literally taking values that are specified in the code. Those are numbers. And if you look at those, each of the numbers is 8 bits, which means that the branch instruction can only specify a branch target within 8 bits of program memory. Because the branch instruction is telling it, if some condition is true, jump to this line of the program, which means that in order for the branch instruction to cover the entire program, my hardware, as in my program memory, can only be addressed by 8 bits, which means it can only have 256 rows of memory. So those are the constraints created by the instructions here. And in reality, for this reason, instructions are usually a lot longer than 16 bits. Usually they're 32-bit or longer instructions.
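Spelling out the bit budget makes these constraints concrete. The field layout in the comments is a plausible reading of what is described in the conversation, not a verified dump of the actual TinyGPU encoding:

```cuda
#include <cstdio>

int main() {
    // A 16-bit instruction with a 4-bit opcode and 4-bit register fields,
    // as described above. The field order here is illustrative:
    //
    //   [15..12] opcode | [11..8] Rd | [7..4] Rs | [3..0] Rt
    //   or, for CONST / BRnzp:
    //   [15..12] opcode | [11..8] Rd or condition | [7..0] 8-bit immediate

    printf("max distinct opcodes: %d\n", 1 << 4);  // 2^4 = 16 possible instructions
    printf("max addressable regs: %d\n", 1 << 4);  // 2^4 = 16 registers
    printf("max program rows:     %d\n", 1 << 8);  // 2^8 = 256 lines reachable by
                                                   // an absolute 8-bit branch target
    return 0;
}
```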
[Nathan Labenz] 44:49 But yeah. So if you wanted to have more memory, you would have to have longer addresses for that memory. And if you wanted to have more instructions, you would have to have obviously bigger labels for those instructions. And maybe this is where we're headed next, but I'm a little fuzzy on, like, basically, each 1 of these instructions is implemented by hardware at this point. This is, like, the last layer of software. When this gets issued to the GPU, or to a compute core within a GPU, it now says, okay, I'm going to actually fire up a particular circuit that's gonna do this stuff now. Right? There's no further software to...
[Adam Majmudar] 45:41 Exactly. 1 caveat on, so I obviously oversimplified. Pretty much everything I said here in this whole thing is simplified in some way. But, like, 1 example of an oversimplification here is, like, in a branch instruction, in reality, the number of bits in the immediate in a branch instruction, if it is an immediate, is not limiting the program memory. The reason that it limited mine is because your alternative is to make this thing an offset. So instead of it being, like, literally branch to this line, you could be like, branch to some offset from the current line, as in branch plus 50 or branch minus 50, in which case you could do it a couple times and then you're not limited. So you can get infinitely far in memory. There's tons of other ways to approach it. So that's just, like, 1 example of a constraint. Because I, 1, did not support negative numbers here, and 2, I'm not making it an offset. I just wanted to jump to a specific line for simplicity's sake. Now that's, like, a constraint where, in reality, there's tons of ways around it, but, yeah, that's an example. And then in terms of the second part, that's exactly right. The important thing to take away from this part, maybe there's still lots of fuzzy details, especially for people listening. The important thing is, 1, you understand, like, the importance of each instruction and why it's there and what it's supposed to do. You understand the general programming pattern. Again, it's hard to follow, even, just because it's assembly, and assembly itself is not necessarily the easiest thing to pick up. It makes sense when you look into it. But you just need to understand those parts and the important GPU programming patterns, because a lot of the actual implementation will get revealed in the hardware.
[Nathan Labenz] 47:04 Okay. I think it's just about time probably to turn to hardware, but let's just scroll through the multiplication 1 more time, because this is where you have basically brought it all together in terms of the complexity that you can support, where we still gotta be mindful of where we are in the thread. But this time, do I have it right that basically this calculation determines 1 position in the output matrix?
[Adam Majmudar] 47:34 That's exactly right.
[Nathan Labenz] 47:34 And so you have 2 x 2 matrices, you're gonna get a 2 x 2 out, so you're gonna have 4 spots in the output matrix. So that's why this is 4 threads. Each 1 of those threads is going to do a different set of operations. It's still gonna touch all the numbers, basically. Right? But it's gonna do them slightly differently to get to the final spot in the output matrix. And to do the sort of rows and columns, it has to do this loop. And this is where we get to the point where you have to have this comparison and
[Adam Majmudar] 48:09 branching. Yeah. Exactly. The reason is just because of the nature of matrix multiplication being a little more complicated than addition. With addition, you just add up the two elements. With matrix multiplication, unfortunately, it's not just, like, multiplying the elements. It's a dot product between 2 vectors in the matrices, which means, like, you take 2 vectors, which is just a row and a column of elements, and then you multiply their elements by each other and then add up the results. So it's actually a bunch of different computation, not just, like, a single point-wise multiplication or something, which is why there's a loop here and why there's just a bunch more multiplications and additions going on, which is what necessitates comparison and branching in this case. And I'd say that's the most important thing. For people who wanna understand the specifics of the kernel, you can look at it. It's relatively simple, but maybe a little bit beyond the complexity of what makes sense to explain on audio.
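A rough Python model of the per-thread pattern described: each thread maps its ID to a row and column of the output and loops over a dot product. The flat memory layout (A, then B, then C) and the sample values are hypothetical, just to show the shape of the kernel.

```python
# Per-thread view of a 2x2 matrix multiply: each of 4 threads computes one
# element of C as a dot product of a row of A and a column of B.

N = 2
mem = [1, 2, 3, 4,   5, 6, 7, 8] + [0] * 4   # A at 0-3, B at 4-7, empty C at 8-11

def thread_kernel(thread_id: int) -> None:
    row, col = thread_id // N, thread_id % N
    acc = 0
    for k in range(N):                        # the loop that needs CMP and branch
        acc += mem[row * N + k] * mem[4 + k * N + col]
    mem[8 + thread_id] = acc

for tid in range(N * N):                      # 4 threads, one per output cell
    thread_kernel(tid)

print(mem[8:])                                # [19, 22, 43, 50]
```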
[Nathan Labenz] 48:57 Yeah. And I misspoke when I said that it's gonna touch all the numbers. It's gonna touch 1 row or column for each of the 2 Mhmm. Input matrices. Exactly. Yeah.
[Adam Majmudar] 49:07 So it's gonna touch a lot of numbers. In this case, it happens to be 3 quarters of them for each computation.
[Nathan Labenz] 49:11 Okay. Cool. Well, then I guess it's time to get on to the hardware. So how do you think best to approach that?
[Adam Majmudar] 49:21 Yeah. So this part is actually gonna be a lot simpler, thankfully. And the best way to start is to just start with the architecture of the entire GPU, and then we can dive into the thread architecture. And this is pretty straightforward. Okay. So we talked about this program. Right? So now the nice thing is we can explore, okay, how is this program gonna actually get executed on the GPU? And that is gonna take us through basically the entire architecture and also the motivation for why everything exists here. First of all, let's take the example of our matrix addition kernel. So we have this code. Now the question is, how are we gonna actually load this up into the GPU, and how are we actually gonna run it? And so what we'll do here is, first, we'll look through how this works high level in the architecture. Then we'll look into individual threads to see how this code is executed at the individual thread level. And then the last piece of this that might be interesting is we can actually look through the test cases and just briefly explain how, in this repository, it's actually tested and simulated. So it's not just theoretical. Those kernels that are written in there, there's actually a setup for them to actually get run on the GPU design and see the entire output of everything that's going on. So we can touch on that last and maybe, like, touch on some of the interesting things in the code. So where I'll start is we have this matrix addition kernel. And the question is, how do you actually get this to execute on the GPU? On 1 hand, you need to load up the code into there. Then it's like, okay, how do I actually make this thing run and get the result back? So the way this is gonna work is, first, we'll talk about global memory, which is, like, the place where you're gonna load up all of the prerequisite data for this code. First of all, you have the actual kernel code. And that's as we wrote before. It's just a bunch of instructions. And those instructions happen to be written in assembly, which is, like, some nice verbal format for us to visualize them. In reality, we would compile that assembly into the object code specified by the ISA. So it's just gonna be a bunch of ones and zeros, all 16 bits long, and we load that into program memory. So now let's assume we've loaded that into program memory. Let's say it's, like, 12 or 15 lines long. So now we have our kernel code loaded up into program memory, and that's just going to specify exactly what the kernel says in bits. And then the second piece of it is, in data memory, we're going to load up the actual data that we want to perform computations on. And in this case, this would be, like, the 2 1 x 8 matrices. We'll call them A and B. So as I said before, A is gonna be in addresses 0 to 7, and then B is gonna be in addresses 8 to 15. And then we'll say that C is gonna be in addresses 16 to 23. So that's gonna be our result. Now currently, our result matrix is gonna start out with nothing in it. And so what is the goal of our entire program? The goal is to somehow get the GPU to use that code that we just loaded up in program memory, and somehow at the end, it's gonna need to fill out addresses 16 to 23 in memory with the correct results of the additions of A and B. And we're gonna read that out at the end.
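A minimal sketch of the data memory layout just described for the 1 x 8 matrix addition example. The initial contents of A and B are made up; only the addresses follow the conversation.

```python
# Data memory layout for the matrix addition example: A, B, then the result C.

A_BASE, B_BASE, C_BASE, LENGTH = 0, 8, 16, 8

data_mem = list(range(8)) + list(range(8)) + [0] * 8   # A, B, empty C

# What the GPU is expected to leave in addresses 16-23 once it signals done:
expected_c = [data_mem[A_BASE + i] + data_mem[B_BASE + i] for i in range(LENGTH)]
print(expected_c)
```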
So from the host machine's perspective, which is, like, the CPU talking to the GPU, what we need to do is basically get this compiled code, load it into program memory, load up our data into data memory, and somehow get the GPU to start this whole computation. It's gonna do its thing. And then at the end of it, the GPU is gonna tell us when it's done. And then the answer is gonna be in data memory, and the host machine can just read out that answer. And so that's how you interface with this GPU, basically. So in this picture, here we have global memory with program memory and data memory, and that's 1 piece of it. And then, like, another important thing in GPUs is the actual bus where data transfers between the CPU host and the GPU. Again, I left that out for simplicity reasons. But the other important control thing here besides the memory is what's called the device control register. Now on real GPUs, this is a lot more complicated. But in this case, if you remember, what we need to specify is basically the number of threads
[Nathan Labenz] 53:09 to execute.
[Adam Majmudar] 53:10 We need to tell the GPU that, hey, this thing in program memory, which is just a single kernel's code, we need to tell it, you need to run this thing 8 times. And the way that you do that in my design is you store the thread count inside the device control register. And so the device control register is mainly just used to store some, like, high level data about how you want the GPU to execute. In this case, the only piece of high level data we really need to specify is how many threads should you be executing. And so now we have program memory loaded up, data memory loaded up, device control register loaded up. Those are the 3 things you need to do to get this GPU to be prepared to run something. And that's the external interface. Then we're gonna send a start signal, and that's gonna tell the GPU, hey, everything's loaded and ready to go. You need to start performing computations on data. And that's, like, a clean break. So that's, like, the interface to interact with the GPU. Now we're gonna get into the computational parts of it. That's the last piece there. Does that make sense?
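A sketch of the host-side sequence Adam outlines: load program memory, load data memory, write the thread count to the device control register, raise start, poll done, then read back the result. The `gpu` object and its method names are hypothetical stand-ins, not the actual test harness API.

```python
# Host-side flow under the described interface (illustrative only).

def run_kernel(gpu, program, data, num_threads):
    gpu.load_program_memory(program)       # compiled 16-bit instructions
    gpu.load_data_memory(data)             # e.g. the A and B matrices
    gpu.write_device_control_register(thread_count=num_threads)
    gpu.start()                            # raise the start signal
    while not gpu.done():                  # GPU runs; the host just waits
        gpu.tick()
    return gpu.read_data_memory(16, 24)    # result matrix C at addresses 16-23
```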
[Nathan Labenz] 54:08 Yeah. I think so. And probably just worth reiterating that there's, like, even, in some ways, more complexity on the CPU side when you say the host machine. We're sort of assuming here that you have a general purpose CPU that can do the compiling and can handle the, like, outer loop of here's what I'm trying to do, get something back, and show it to you on a screen when it's done. And this sort of sits within that. So it's funny, because the programming paradigm in some ways is more complicated. But then in other ways, it's a lot less complicated, because it can take advantage of all of the universal computing benefits of a CPU host machine. Aside from that, I think it all makes sense.
[Adam Majmudar] 54:51 That's a good point. Because GPUs are not really meant to run alone. In practice, they're always either hooked up to a host, like you'll see them next to a computer in a rack, or you connect to 1 on the cloud or something like that. Running them alone is not really very practical, depending on what you're doing. But, yeah, they usually have a host connected to them that can send them data and interface with them. But, yeah, that's a good point. So now we can get into the computation pattern. Right? And the high level is actually pretty simple. So the main thing to focus on is you have this dispatcher. And the dispatcher is just responsible for, as we said, these threads are executed in blocks. And a block is just, like, a batch of threads that gets executed together on the same compute resources. And so basically, the way to think about this is your GPU has some large number of compute resources that are organized inside cores. And this GPU design happens to have 4 cores, and each core has the capacity to execute a bunch of different threads in parallel. And from a high level, before diving into what's in the cores, let's just assume the core is some black box. And we just know that we can give this black box some group of threads, which is going to be a block-size group of threads. So in this case, 4 threads at a time. We can give each compute core 4 threads at a time, it's gonna do its computations and finish processing the threads somehow, and then it's gonna tell us when it's done. And so the high level job of the GPU now is, like, okay, how do we manage dividing up these threads into the available compute cores, wait for them to complete their job, and then give them more jobs when they're done completing their work? And so basically, all of this is just managing all the compute cores, figuring out when they're available for new work, and then just dispatching new blocks to these cores. Now in practice, this is 1 place where there's a lot of complexity in GPUs. Because you can imagine, when you have a ton of different cores performing stuff at once, and different cores can actually have different blocks of threads at once, there's complex resource management. You can do a lot of different optimizations to make sure that you're getting maximum resource utilization. And so this is a place where there's, like, a ton of space for optimization in terms of using things more efficiently, interleaving different threads at once to make sure that everything's getting used. And then, of course, for simplicity's sake, in this example, we use a very simple version of that, which is just that the dispatcher looks out for when a core is done executing its threads and just gives it more work, in, like, a round robin style. So it just looks at all of the cores sequentially and keeps giving them work if they're ready. That's, like, the simple way to think about it. And so the dispatcher is 1 of the more complex elements of the GPU. Now if we get into the individual compute cores...
[Nathan Labenz] 57:28 Can we stay on the dispatcher for a second?
[Adam Majmudar] 57:29 Fair.
[Nathan Labenz] 57:30 Yeah. 1 thing: what level of design did you have to go to in this project to create a dispatcher? Is that something that you had to design? And if so, at what level? Or was it something that you could drag off a menu of available components?
[Adam Majmudar] 57:48 Yeah. So pretty much nothing here was an available component, to be honest. The nice thing is I had 2 or 3 repositories to reference. So there's this thing called MIAOW GPU, which is, like, an open source Verilog GPU made a while ago by some school. I forget which school it is. And there's this thing called VeriGPU. So 2 open source GPUs. The thing is, these things have, like, very little documentation, and they're not really written to be simple. They're written to be, like, attempts at production GPUs. 1 of them is, like, completely unfinished. MIAOW is actually finished. And first of all, architecturally, they're somewhat different in some ways, because, again, they're optimizing for production usability. And then the second thing is, because of how big they are, they're just not very easy to understand. And so my process for creating things was not really you can't really go through them and try to understand them. The analogy is, like, if aliens come to Earth and find a computer, how are they gonna understand it? Are they gonna break it down and look at all the connections, the transistors, and see what's happening? No. They're not gonna understand it by doing that. They have to come to some understanding of their own and then test it against the computers that they have. It's a bit of a massively overcomplex analogy for going through a GitHub repo. That's actually the analogy for mapping out the brain, which is not a great parallel here. But yeah. So my approach to going through these repos was not, like, going through them and trying to understand them. It was actually, like, talking to Claude and ChatGPT about what I see, like, names of things. Like, I'll see a folder called, I don't know, scheduler or whatever. And I'll try to figure out with Claude, through first principles, where does this have to fall into place in the GPU? And that part's not too hard. So then it's like, okay, I get to something. Then I'll start to propose how I think it works. And the reason is because there's not really anything that teaches you how these things work, like, anywhere online, because it's all proprietary and they have their own algorithms and stuff. And so even to make a simple version, you'd have to be like so for the dispatcher, for example, I come up with a hypothesis. It's like, oh, I think a simple version of this would be x y z. Like, I think a simple version of this would be, like, round robin scheduling while looking at what cores are available, which is, like, super simple. And then I would have to get confirmation. So Claude would be like, actually, you don't need that, or, I think you should think about it like this, or something like that. Which is interesting, because there's nothing publicly available. But, like, Claude and ChatGPT, in some cases, it's things that are purely intuitive, and it just has, like, the benefit of engineering intuition on stuff that I haven't gone into. In some cases, it's stuff that's definitely implemented in proprietary GPUs, and maybe it's still getting there through intuition. But it's stuff that's a little more complicated than what I expected it to know, which is pretty cool. But, yeah, as a long winded answer to your question, there was no, like, off the shelf stuff. I basically had to design everything myself from how I thought it would work and then get confirmation on it.
Maybe look into the repo to see, oh, this repo actually happens to do something a little similar to what I suspected, and now I can actually understand it because I, like, came up with it first. And I had to do that for all the elements, granted some of them are a lot simpler to understand than others. But
[Nathan Labenz] 1:00:40 Yeah. Okay. So this thing implements a round robin where it just says, as long as I'm live, I gotta just keep looping through all of the compute cores and check their status. So that implies then that the compute core is, like, maintaining a status that can be checked. Mhmm. And if it's available, then I'll send it another thread to process. And if it's not, then I'll come back to it on my next loop. And all of that is implemented in hardware. You're beyond the point of programming this at this point. Right?
[Adam Majmudar] 1:01:24 Yeah. This is implemented in hardware. Yeah.
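A tiny sketch of the thread-to-block math and the round-robin idea just summarized. It is simplified: the real dispatcher waits on per-core done signals rather than assigning purely by index, and the block and core counts here are illustrative.

```python
# Threads get grouped into blocks, and blocks get handed out round-robin.

THREADS_PER_BLOCK = 4
NUM_CORES = 4

def blocks_for(thread_count: int) -> int:
    return -(-thread_count // THREADS_PER_BLOCK)    # ceiling division

for block in range(blocks_for(8)):                  # 8 threads -> 2 blocks of 4
    print(f"block {block} -> core {block % NUM_CORES}")
```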
[Nathan Labenz] 1:01:27 So do you have a diagram of that? Can we look at at that?
[Adam Majmudar] 1:01:30 Oh, we can look at the code in the repo. There's no diagram. The diagram is for thread execution, but there's no, like, dispatcher diagram.
[Nathan Labenz] 1:01:37 Gotcha. Okay. Let's look at the code to just see what that looks like.
[Adam Majmudar] 1:01:41 So if you go into source and then dispatcher.sv. Yep. This is it. You can see all the code is documented. At the top, you see all the important points. And you can see that some of the things we were saying must be there are there. First of all, the important thing with the dispatcher is that it has a start signal. And what that means is that thing we were saying at the top level, we're gonna tell the GPU to start, the place where that actually goes is into the dispatcher. So the dispatcher says, oh, I've gotten the signal to start. Now I'm gonna start sending off these blocks to the actual cores, and that's when it starts execution. So that's the start signal you see here. You see the thread count coming in from the device control register, which is again important for the dispatcher, because it needs to know how many threads it needs to execute in total and how many threads it has already executed. So it's gonna say, like, oh, I have this many threads left to execute. Let me go send them off to the compute units. Then you can see the core state. So that's the thing where you were saying, oh, it looks like the cores must have their own internal, like, status of whether they're ready or not. And that's exactly what you see there. You see core done, which is like, oh, this core is done processing. And then you see core done, core reset, and core start. Those are important. This is basically how the dispatcher does its job. Right? It's waiting for a core to be done. It's gonna say, oh, here's a core that's done. Let me reset this core. So that's gonna set all the state back to empty. So it's gonna clear everything out. And then it's gonna say, oh, this core just got reset by me. Alright. I'm gonna go start it up with the next block of threads if there are still more blocks. And if you scroll down a bit, you can see there are some intermediate variables which store, like, how many more blocks do I have left? So that's blocks dispatched and blocks done. And that's just saying, okay, I need to process this many more blocks of threads before this is completely done executing. And then the last thing the dispatcher does, down in this loop, and you can see at the top, there's a done signal exported from the dispatcher. That done signal is sent out to the GPU itself. So not that 1, but in the actual module definition a little bit farther up, there's just something called done, straight up. You can see it in a couple of places: output reg done. And that's gonna tell the whole GPU, hey, everything is done. And then it's gonna stop executing everything. And now it's just gonna wait for the host machine to basically pull out the results from memory, and then the host machine can just reset the GPU. And that's, like, 1 full execution loop. So you can see that the dispatcher, kinda high level, is responsible for managing that whole high level execution flow. So yeah.
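A loose software model of the dispatcher behavior walked through above: track blocks dispatched and blocks done, reset and restart cores that report done, and raise done once every block has finished. The real dispatcher is clocked SystemVerilog; these Python names only mirror the conversation, not the actual module ports.

```python
# Software model of the dispatcher's control flow (illustrative, not the RTL).

class Dispatcher:
    def __init__(self, num_cores, threads_per_block, thread_count):
        self.total_blocks = -(-thread_count // threads_per_block)  # ceiling division
        self.blocks_dispatched = 0
        self.blocks_done = 0
        self.core_block = [None] * num_cores   # which block each core currently holds
        self.done = False

    def cycle(self, core_done):
        """core_done[i] is True when core i reports it has finished its block."""
        for i, finished in enumerate(core_done):
            if self.core_block[i] is not None and finished:
                self.blocks_done += 1
                self.core_block[i] = None                          # core_reset
            if self.core_block[i] is None and self.blocks_dispatched < self.total_blocks:
                self.core_block[i] = self.blocks_dispatched        # core_start
                self.blocks_dispatched += 1
        self.done = self.blocks_done == self.total_blocks          # reported to the GPU
```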
[Nathan Labenz] 1:04:03 Cool. So this is code. How does this thing get translated into hardware? Is that what the package that you're working in does?
[Adam Majmudar] 1:04:15 So the way that this whole thing works is just EDA in general. You design your architecture based on what it's supposed to do, as functional logic in Verilog, or in this case SystemVerilog, which is just, like, a modern version of Verilog. And then there's this entire software stack, which is what I was talking about before, which basically translates that Verilog into an actual design. And in practice, that's really hard, because there's so many different steps that go into it. Like, you need to synthesize the Verilog down into just the logic, like a graph of the logic that's specified in there. And then, because of the way that people program stuff, that logic is inherently not already perfect. It's gonna be way more complicated than it needs to be. So you can do some processing to basically get that logic down into its simplest form that's still equivalent. That process of turning the Verilog into a logic flow is called synthesis. And then once you do that, you can perform a bunch of analysis on whether this design is valid or not. Then you convert that logic that you've created into gates. And those gates are based on a specific process node. So, basically, you have these things called standard cells, which are specific gate designs designed for a specific foundry's process of producing gates. So you convert them into gates, and then you lay out all these gates on a huge chip design. So the gates are just, like, laid out everywhere all over the chip based on a bunch of different optimizations. Then you hook up all the gates with wires, like, the gates that need to be connected together. And then there are several cycles of optimization on this whole process to make sure that everything is valid, all the timings are valid, because, like, electrical signals need to flow through this chip. And if any of the signals flow through the chip slower than the actual clock time of the chip, that's gonna cause errors. So there's, like, tons of different errors there, even just at a logic level. And then on a hardware level, because this is actually electronics, it's not necessarily code, the complexity is way higher. There's these things called parasitics, which is like, oh, you're actually dealing with wires and metals. And if there's wires near each other, they actually have capacitance between each other. And there's, like, tons of other electrical effects that can mess with the chip. And so you need to do computations to prevent that stuff too. And so there's so many complicated things. Then at the very end, your whole design is there. It gets outputted into the layout, and that's in the form of a GDS2 file, which just happens to be the file format specifying all the layers to send to a foundry. And then you do a final comparison called layout versus schematic, which just makes sure that this final layout you got after tons of optimization is actually logically equivalent to your original design, and then you can submit that layout to, like, a foundry. And now in practice, you're actually gonna do a ton of formal verification also on this layout to make sure, like, all of these bad states are impossible to get into. And, again, in hardware design in general, the verification process is so important, because unlike software, you can't ship and iterate. You're just fucked if you mess it up. So there's all that stuff. But that's what happens.
Again, if I had to implement all of that, that's impossible. It's a significant task. But thankfully, all these EDA softwares do it for you for a couple million dollars. And then, thankfully for me, there's OpenLane, which does it for you for free, at a simpler level.
[Nathan Labenz] 1:07:18 But, yeah, even that has
[Adam Majmudar] 1:07:20 some complexity going through it. Because once you do that, you run through a bunch of errors, and you have to fix that in your code.
[Nathan Labenz] 1:07:25 Yeah. Yeah. You can see where the decades have gone in terms of building up the stack. Talk about on the shoulders of giants. What does this feel like on the open version that you were working on? When you go to I'm not even sure what the right word is, but I wanna say compile, but it's almost more like translate. Is the word hardening the design?
[Adam Majmudar] 1:07:46 That's what it's called.
[Nathan Labenz] 1:07:47 Hardening. So how long does that take for something relatively simple like this? And you're running on your local
[Adam Majmudar] 1:07:55 Yeah. So my local laptop, I would say, is pretty powerful for this stuff, being that it's an M2. So that's probably on the better side for, like, a hobbyist doing this stuff. And it can take 20, 25 minutes to go through the whole flow. It does, like, 50 different steps, and that takes a bunch of time. It does, like, tons of optimizations. It's really cool that you can see all the different stuff going on and try to understand it. And then you do run into errors on a lot of stuff. Often you do. And the other thing is, like, obviously, this stuff isn't perfectly documented. Obviously, I think it's a great creation, but there's so many different errors you can run into, and it's not really well documented. Like, almost none of them are well documented, and they're also, like, really random error messages. And so 1 of the challenges is it's a lot less easy to use. Like, you just gotta figure stuff out, and there's so much stuff where you have no idea what's going on. And you just gotta try your best to figure out what's going on and debug it. So I'd say very low transparency on the debugging side, which is the biggest thing, because that's basically the whole flow. Like, the output of the flow is, what bugs do I need to fix before I can get a finalized GDS? But it's doable. And once you push through it, it works, and you get your final GDS. It's magical. Because then you can start visualizing it. It's pretty sick.
[Nathan Labenz] 1:09:03 Yeah. It's fascinating. How many cycles would you say you went through in this project of that 20, 25 minute hardening process before you got 1 that actually didn't error out on you?
[Adam Majmudar] 1:09:15 Oh, a lot. I have all of them saved in here. I'd say it's, like, 30, 35, something like that. A lot of different cycles. Interesting. They're not in the repo. They're just on my local machine. Yes.
[Nathan Labenz] 1:09:28 Yeah. Okay. Cool. So let's come back to the overall GPU architecture. So we've got these modules, components that are defined in code, that are hardened into an actual ultimately, I guess, the classical word would be, like, etching, but now it's a light process. Right? At least at the most advanced nodes. The word for it, I forget.
[Adam Majmudar] 1:09:52 It's lithography.
[Nathan Labenz] 1:09:53 Yeah. Thank you. So that translation is happening with this, like, decades in the making, shoulders of giants software stack. And that gives you the ability to define each of these things and then come back in 20 minutes and see if it worked or not.
[Adam Majmudar] 1:10:10 Yep, exactly.
[Nathan Labenz] 1:10:11 And you have to then also define the interface, right? Like, the compute cores have to maintain some state. I'm interested to see a little bit of the interfaces. I always think about programming a lot in terms of interfaces. Like, where is it that it's declaring whether it's done or not done, so that the dispatcher can reassign or not reassign as appropriate? But take us through the rest of the architecture, however you think best, and we
[Adam Majmudar] 1:10:34 could yeah, check out a couple of them.
[Nathan Labenz] 1:10:35 Let's look.
[Adam Majmudar] 1:10:36 Let's look at the gpu.sv file first. That's the high level interface that's gonna be interesting. And then we can look into the compute cores in detail and get into the thread execution stuff. Yeah. This is the high level file. There's some parameters at the top. Those are mostly unimportant. The interesting ones are the number of cores and the number of threads per block. So here, I guess, it's set to 2. That's a thing you could just simply change around. So you could switch the number of cores to 3 or 4 or whatever, and you can switch the number of threads in each block, which is just gonna change how many threads can get executed on each core at once, which means that there's gonna be more compute resources and register files and everything on each core, and that will make more sense later. But you can see the high level interface here. You have the clock and reset. Those are mostly unimportant. You have those for every design, and they're just gonna allow you to have clock cycles on your GPU and reset stuff.
[Nathan Labenz] 1:11:21 A question on the clock also. Mhmm. It seems like if everything is built perfectly, everything would take the same amount of time. I'm a little bit confused as to why, if I distribute 4 threads across this thing, like, why don't they all finish at the same time?
[Adam Majmudar] 1:11:42 Good question. So the sole reason for this, and this is the reason behind my memory intuition and why George says the whole problem is memory, we'll see it when we dive into the thread execution, is that if there were no reading and writing from memory, they would all finish at the exact same time, because everything is deterministic there. It just takes the same number of cycles for every instruction. The thing is, when you have memory, like loading from global memory, that memory is DRAM. So first of all, it's using capacitors to store memory, which has some latency to it. You don't just get it back in 1 clock cycle instantly, which adds some non determinism. The bigger thing is the bandwidth issue. So, like, if I have 1000 compute cores requesting some values from memory, and memory can only support 100 reads at once, that means this is gonna start to get queued up. So there's gonna be, like, 1000 compute cores requesting data, then 100 of them are gonna get back data, then the next 100, which means that certain cores, even if they're being processed together, might make requests to memory at the same time but not get the data back at the same time, because memory has an unknown latency. It's asynchronous. And that means that if 1 thread gets memory back earlier or later than the other ones, then it may have to wait for the other ones, or there's a bunch of stuff that happens around that. In my design, it waits for the other ones, which means it can't just keep going. And then similarly across the cores, even if they all start executing threads at once, 1 core might finish way before the others because it just got its data back from memory earlier. And so that's why DRAM really is the bottleneck in these things, and why it's all about managing memory latencies. That really is the bottleneck.
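A toy model of why threads drift apart: if global memory can only answer a couple of requests per cycle, simultaneous LSU requests queue up and return on different cycles. The request counts and bandwidth number here are invented purely for illustration.

```python
# Requests queue up when memory bandwidth is smaller than the number of requesters.

from collections import deque

MEM_REQUESTS_PER_CYCLE = 2

pending = deque(f"lsu{i}" for i in range(8))   # 8 LSUs all ask for data at once
cycle = 0
while pending:
    cycle += 1
    served = [pending.popleft() for _ in range(min(MEM_REQUESTS_PER_CYCLE, len(pending)))]
    print(f"cycle {cycle}: data returned to {served}")

# lsu0 and lsu1 get their data on cycle 1, but lsu6 and lsu7 only on cycle 4,
# so their threads sit in a waiting state for extra cycles.
```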
[Nathan Labenz] 1:13:14 And the clock, just to set a foundation on that as well, is essentially an electrical impulse. It's a change in voltage that is getting applied through the whole chip. Is that right? Yep.
[Adam Majmudar] 1:13:27 So the clock is just basically a cycle. It's, like, almost a sine wave of some signal going from high to low over and over at some periodic interval, usually a couple nanoseconds, like 20 nanoseconds or something like that. And, basically, every time the clock goes from 0 to 1, it's just, like, a signal, registers in the chip, which are, like, D flip flops, will take in a new value. And so that creates the way for state. All state transfers on the clock, whenever the clock edge happens, which means that it sets the pulse of things happening in the GPU. So, generally, at the start of each clock cycle the new values are set, then you have all these wires propagating signals, and at the next clock cycle, the values get set again. And so it sets this time interval where state is set on the clock edge, and then stuff happens in between, and that's gonna help determine what the next state is. And the other thing that's important about the clock is you can't just set the clock as fast as you want, because the chip actually has physical constraints. Right? If you need something all the way on the left side of the chip to communicate with something all the way on the right side of the chip in 1 clock cycle, there are actually constraints there, because the electricity takes time to propagate across the chip, which means that your clock cycle needs to be longer than the maximum propagation time of signals that need to communicate in your chip. And that's actually something the EDA stack does. So that's 1 thing, it's, like, static timing analysis, making sure that all the timings in your chip are valid based on the clock cycle time.
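The static timing idea in a few lines: the clock period has to exceed the slowest register-to-register path. The path names and delays below are made up for illustration.

```python
# Clock period must be longer than the worst-case signal propagation path.

path_delays_ns = {"alu_to_regfile": 3.1, "decoder_to_mux": 1.4, "lsu_to_ctrl": 4.8}
clock_period_ns = 20.0   # e.g. a 50 MHz clock

worst = max(path_delays_ns.values())
print("timing met" if worst < clock_period_ns else f"violation: {worst} ns path")
```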
[Nathan Labenz] 1:14:55 Gotcha. Okay. Cool. So if something is waiting again, it sounds like you could implement this in probably a lot of different ways, but you could have your cores sort of saying, I'm in waiting mode. So this cycle, I basically just do nothing, because I'm still waiting for the memory to return. And then next cycle, if I got it back, then I can actually do it. And that's because there's different latencies. They can just diverge in terms of where they are in their particular execution flow. In an actual, like, production environment, with the however many gobs and gobs of gates ultimately that are being included in a single chip now, presumably there's also some difficulty there, like some just, like, inconsistency or whatever that could create divergence. I don't know if that's the kind of thing that would get disclosed, but I imagine there must be some sort of engineering just to try to not allow some local defects to ruin the whole thing. Right?
[Adam Majmudar] 1:16:04 Yeah. I would say that's a fabrication layer thing. Local defects are actually 1 of the biggest problems in the fabrication industry. That's why contamination control is such a big thing, because as you start scaling down the size of individual transistors, smaller and smaller defects that didn't matter before can actually break everything, because they can get in the way of gates or wires basically, there are tons of different places in the chip where a contaminant can just ruin the whole thing. And that's why, as the actual fabrication technology has been advancing to smaller and smaller scales, the contamination control technology has had to get better and better. But that being said, that's more so a concern of the actual fabrication process. The reason is that they actually have a whole process of testing, and it's built around, 1, they have a lot of contamination control, and 2, they have testing to make sure that even if there is some contamination breaking the chip, they're gonna be able to detect that and discard the bad ones. That process is imperfect, but generally speaking, it's definitely the biggest thing that determines chip yields in practice. And generally speaking, that's a concern of the foundries, not really someone like NVIDIA, who's doing the design and will ship off their designs to TSMC. TSMC will often be concerned with that. But generally speaking, NVIDIA can probably assume that most of the chips are working once they get them back.
[Nathan Labenz] 1:17:11 the chips are working once they get Cool. Alrighty. Should we go back to the diagram?
[Adam Majmudar] 1:17:17 Yep. We can do that. So now we can go into what's going on inside each compute core. So we already saw the interface of the dispatcher. Each compute core is basically gonna get told by the dispatcher, you're executing this block ID, and you're gonna get this many threads in that block. So the dispatcher's gonna say, hey, execute 4 threads in block number 2, and tell me when you're done, basically. So what's gonna happen is the dispatcher is first gonna reset the compute core, then it's gonna tell it, here's your block number, here's how many threads to execute, and it's gonna tell it to start. It's gonna set a start signal, similar to how the GPU got a start signal. And then the question is, what's going on inside the compute core as it processes those threads, up until the point where it's reporting back to the dispatcher that it's done? That's, like, the high level interface of the compute core. And so there's a couple key pieces of the core. First, there's the high level control logic that's shared between all the threads. And then there's the actual thread execution, where we split out into a number of different threads within the core. First, we can talk about the high level logic that's shared between all threads, and then we can dive into the threads. So high level, there's basically 3 units that are shared between everything. First, there's the scheduler. And, again, this is a place where there's a ton of complexity, because it's managing however many different groups of resources there are in the compute core. Now in practice, a compute core isn't actually thought of as having a certain number of threads. It just has some finite number of resources, and however many threads it needs to execute, they just get executed on those resources. And typically, a scheduler is responsible for handling the difference between, like, the dispatcher says, hey, execute 8 threads, and you may only have 4 different sets of resources on your compute core. So now the compute core needs to manage the execution of these threads in parallel or in sequence or however it determines is most efficient. So again, that's why there's a lot of complexity in the scheduler. In my case, what I basically said is each compute core is just gonna process 1 block at a time. So it's not gonna be able to say, like, hey, you can give me 8 threads, and I'll do it 4 at a time. It's just gonna be like, just give me however many threads I support. So each compute core supports 4 threads at a time, which means that for each resource a thread needs, it's gonna have 1 resource per thread. So, like, each thread needs a register file, so it's gonna have 4 register files. Each thread needs its own ALU, so it's gonna have 4 ALUs. You get the point there. And, basically, what happens is the scheduler handles the execution of the threads. Let's say there are 15 instructions that each of these threads needs to execute. Right? So now the scheduler is going to constantly execute the control flow of how these instructions get handled. And the way that happens is, let's say we're on instruction number 0. So we've just been issued a group of, like, 4 threads from the dispatcher, and we need to execute these threads. So we're on instruction number 0. First of all, we only know to start with instruction number 0. We don't even know what that instruction is.
So first, we need to somehow get that instruction from where it's stored, and it's stored in program memory, because that's where we put it before at the start. And so the fetcher is the first thing that's gonna execute. So the scheduler is gonna tell the fetcher, hey, go retrieve this instruction from program memory. So that's the whole job of the fetcher, basically: to get instructions from program memory. And in practice, it actually does caching, because instructions are very repetitive. Like, all threads are gonna be executing the same instructions. So 1 nice optimization we can do is that the fetcher, once it gets an instruction once, it can just store it locally, and then all the threads can just use that again instead of going to global memory. So that's the fetcher. And then what the decoder is gonna do, which is gonna basically get us into the thread level, is we now have this instruction, which is a bunch of ones and zeros. It has its opcode, which is gonna tell you what the instruction is supposed to do, and it has all the registers and specific values. Now we need to somehow translate that instruction into something that all these different compute resources can actually use. And so that's the job of the decoder. And as we get into the thread, it'll become a lot more clear what the decoder is actually doing. So let's look at the actual compute resources that are in a thread. This is the last piece of this diagram, and then we'll go off to the next diagram. Within a thread in my design, there's basically 4 key pieces of memory and resources. Each thread has its own register file, which we've already talked about. So each thread has its own ability to perform computations on data. And then importantly, it has its own program counter, which means that each thread can be on its own line of the program. And that's because, based on the data, different threads might have to jump around to different lines of the code. And in practice, that's challenging to implement. That's called branch divergence, which means different threads branch to different lines of the code. Now in our design, even though each thread has its own program counter, for simplicity, all the threads are assumed to be basically continuing on the same instruction. And it just so happens that I wrote programs where that is the case, like, the threads stay on the same instruction. So this design was sufficient for implementing that. And then the last 2 things, which are important: each thread has its own ALU, arithmetic logic unit, which is the actual compute that's performing multiplications or additions. So that's doing all the arithmetic instructions in there, and it's performing computations on the registers. And the last thing is the load store unit, or the LSU. And that's the thing that's responsible for fetching data from memory. So that's the thing that does the load and the store instructions. So as you can see, each thread has its own unit that is separately able to load data from memory and store data into memory. And importantly, before we dive into the specifics of how all these pieces interact together, so I know it's, like, completely unclear at this point, it's just like, oh, I guess there's, like, these high level things, I wonder how they actually work in practice. The important thing is that the fetchers and the LSUs, there's, like, a bunch of them across the GPU. Right? Because each core has a fetcher.
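A tiny Python sketch of the fetcher caching idea just described; the real fetcher is clocked hardware, so this is only the shape of the behavior, and the names are illustrative.

```python
# Fetch an instruction from program memory once, then serve repeats locally,
# since every thread in the core runs the same instruction stream.

class Fetcher:
    def __init__(self, program_memory):
        self.program_memory = program_memory
        self.cache = {}                                 # pc -> instruction

    def fetch(self, pc):
        if pc not in self.cache:
            self.cache[pc] = self.program_memory[pc]    # the slow, global access
        return self.cache[pc]                           # later hits stay local
```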
And that means that if we have 4 cores, there's gonna be 4 fetchers, and they might all be requesting instructions at once. And similarly, there's gonna be, like, 4 LSUs per core and 4 cores, so there's gonna be 16 LSUs in the whole GPU, and those could all be requesting memory at once. So the question is, how do we manage the constraint that memory has a fixed bandwidth? Like, let's say it can only take 2 requests at once or something, and then all of these resources are requesting memory at the same time. So you need something to actually throttle all the memory requests and, on the GPU side, hold them. So it's like, here's all of our requests, and then send them slowly to memory to respect the bandwidth that memory actually supports, and then slowly send those responses back into the individual compute units in the core. So that's what the memory controllers are for. There's 1 for program memory and 1 for data memory. And those controllers are basically responsible for respecting the bandwidth that the global memory actually accepts, taking all of the requests from compute resources, and throttling them to the bandwidth of what the memory accepts. That's what the memory controllers do. And, yeah, with that, I'll pause there, and then after this, we can dive into the actual thread execution.
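A sketch of the memory controller's throttling job as described: queue up however many LSU requests arrive and forward only as many per cycle as global memory accepts. The channel count and method names are illustrative, not the actual controller ports.

```python
# Throttle many simultaneous requesters down to a fixed memory bandwidth.

from collections import deque

class MemoryController:
    def __init__(self, channels=2):
        self.channels = channels            # requests global memory accepts per cycle
        self.queue = deque()

    def request(self, lsu_id, address):
        self.queue.append((lsu_id, address))

    def cycle(self, memory):
        """Forward at most `channels` requests to memory; return (lsu_id, data) pairs."""
        responses = []
        for _ in range(min(self.channels, len(self.queue))):
            lsu_id, address = self.queue.popleft()
            responses.append((lsu_id, memory[address]))
        return responses
```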
[Nathan Labenz] 1:24:04 Yeah. I think I get it. And again, all of these things are, at this point, purely physical. Right? We've left the realm of software some time ago. So the software really stops at the point where the kernel is compiled, if that's the right word in this case, and put into the program memory. And then everything after that is purely a function of the hardware. The hardware is routing everything through just an insane tangle of, ultimately, logic gates.
[Adam Majmudar] 1:24:40 Exactly.
[Nathan Labenz] 1:24:43 Yeah. Does anyone have an intuition for that at this point? It seems like we've gotten so high up this tech stack over the decades that I wonder, is there anyone that can really go to the sort of gate level and understand how something as complicated as a data memory controller works? Does anybody have an intuition for that at this sort of
[Adam Majmudar] 1:25:05 logical gate level? Yeah. Definitely. So I think most people doing architecture and design stuff, they'll probably understand most of these elements at a gate level. Obviously, for more complicated logic, it's futile and there's no point really to try to understand exactly what's happening at a gate level. I think the main intuition you need is to understand the core of how memory works at a gate level, usually, like, static RAM, and that's pretty interesting. You understand how, like, a latch works and then a flip flop works, and that's, like, the core of understanding memory. In practice nowadays, most memory is dynamic and only cache is static, but that's also something people understand at a low level. And then, like, ALUs and, again, we're gonna go into the diagram, and that's gonna show you basically as close to the low level as you get. It's gonna show you multiplexers and stuff, which, again, are pieces that people understand at the hardware level. But, yeah, I would say people do understand it at that level. Now in practice, you don't really need to, as long as you trust the process. It's like, okay, I know that this is gonna get translated into gates in this general way. You can study the gates in a computer architecture class, and you don't really need to go past that. And then I would say, for really cracked architecture people, there's a ton of stuff. The thing that makes them really good is that they're managing a lot of stuff in their head, which most people are not. And usually, that's not really gate level logic, although I'm sure they have that in their head too. It's more like understanding the implications of stuff beyond just the high level. Like, the design's, like, really nice at a high level; in practice, there's actually a lot of complexity. Like, for example, writing code in Verilog, knowing how many gates a piece of code actually translates into is important. Like, for me, it's like, oh, I might just write the divide sign as my division instruction, just like in C. In reality, nobody would ever do that. Like, the divide sign actually turns into a gigantic piece of hardware, which you would never actually wanna use. You wouldn't actually wanna implement the division thing as an instruction. So it's things like that. And then also being able to manage all the parasitics and knowing the actual electronics of it and how the electronics work if you place things in different places. There's, like, tons of things like that. Honestly, the way I would describe it high level is people have a surprising level of understanding of the EDA flow and the things that influence that flow in their head, and they can get at it by intuition. Obviously, not the whole thing, but the pieces that matter most. And from what I've noticed, from my very limited perspective, that's what a lot of the really good architecture people have. So I would say there certainly are people with intuitions on all these things.
[Nathan Labenz] 1:27:17 Fascinating. Okay. Cool. Yeah. So should we go to the thread diagram?
[Adam Majmudar] 1:27:22 Yep. We can do that. This is gonna be nice, especially for people who are familiar with CPUs, because this is gonna get into very familiar territory. So you can see here something that looks very similar to a CPU in many ways. It almost is exactly like a CPU except for the LSU and, like, little details in the register file. But what you see here is, first of all, most importantly, everything that has red text and blue text, that is stuff that's coming from the decoder. So now you can actually see exactly what the decoder is doing, which is the decoder is responsible for converting the instruction into these red and blue signals. And these red and blue signals are gonna be used as control signals to control the execution of an individual thread. So what you see here is a single thread. Obviously, there's multiple of these per core, and all of them are gonna get the same control signals at once, since the decoder is just shared within the core. And now you can see how the execution of a thread happens with the control signals. And you can see the familiar elements from the diagram above. So you have a register file, you have the ALU, you have the LSU, and you have the PC. And this is standard CPU stuff, but the general flow of what you're seeing is: you have the register file, and you have the ability to access 2 registers from the register file, which are the source and target registers. That's why they're called RS and RT. Those are also in our ISA. And using those registers, there's a ton of different stuff you can do. And so the job of the decoder is to tell the compute resources dedicated to a thread, at any given point in time, what stuff am I supposed to do right now? And it's gonna basically do that by turning off and on different parts of the compute unit. For example, if you look in the ALU, let's just focus on that 1 unit. So you see the arithmetic unit, and that can do 4 different things. We know from our instructions that the arithmetic unit in the ALU needs to be able to do addition, multiplication, subtraction, and division. And those are going to be, like, literally 4 separate hardware circuits that do each of those things. And so now the question is, the arithmetic unit is taking in these 2 values, and it needs to perform 1 of those 4 operations on them. So what it's gonna do is it's gonna take in this control signal, the ALU arithmetic mux, and mux is just short for multiplexer. And that's gonna tell it, hey so actually, what it's gonna do is it's gonna do all 4 at the same time. It's gonna multiply, add, subtract, and divide, all of them. And it's gonna feed those 4 outputs into this thing called a multiplexer, which is just, like, a single chooser. And so now you have all these possible values it could choose. Now this multiplexer value set by the decoder is gonna say, hey, which 1 should I actually take? So let's say we did, like, an addition instruction. The decoder is gonna set this mux value to 0, which we'll say is the value associated with addition. And now that multiplexer is gonna take these 4 inputs and just output the 1 specified by addition, and that's gonna come out into the ALU output mux. And that's another multiplexer, just to help the rest of the design work. So now you see, okay, we have the output from the arithmetic unit. Then again, there's also gonna be an output from the comparison unit on every single cycle.
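A compact illustration of the compute-everything-then-mux pattern: every arithmetic circuit produces a result each cycle, and the decoder's mux signal picks which one is kept. The particular mux encoding (0 = add, and so on) is an assumption for the example, not the actual decoder values.

```python
# All four arithmetic results are "computed" every time; the mux selects one.

def alu(rs: int, rt: int, arithmetic_mux: int) -> int:
    results = [rs + rt, rs - rt, rs * rt, rs // rt if rt else 0]  # all circuits fire
    return results[arithmetic_mux]    # decoder picks: 0=ADD, 1=SUB, 2=MUL, 3=DIV

print(alu(6, 3, 0))   # ADD -> 9
print(alu(6, 3, 2))   # MUL -> 18
```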
So it's not like we're only performing the computation specified by the instruction. Actually, all of the computations are happening, because they're just wires. They're all connected. All the wires are always connected. The question is, which computation are we actually taking this time? So the ALU output multiplexer is gonna choose, hey, on this instruction, this should be set to 0, because we're gonna take the arithmetic unit's output this time. And that's gonna get outputted. And then that value, you can see in the rest of the design, most importantly, that's going into the register input multiplexer. And so, just to finish this part of the flow, the register input multiplexer is gonna be 0 this time. It's gonna say, hey, we got some value from the ALU, and we actually wanna use the ALU value this time and store it back in a register. So you can see, okay, we're choosing the ALU output, and then that's going into the destination register, which is RD, and we're gonna send that back into the register file, in the register specified by the instruction. And so now you see, okay, that's actually a full loop, for example, with the add instruction. Like, it's basically completing the addition, choosing to output that addition from the ALU, and then choosing to store that output back in the register file, and that's how you get the addition value. And similarly with the LSU, pretty simple. So you have control signals to enable read or write. And when you set those signals, and everything else is gonna be off at those times, as specified by the decoder, it's going to make a request out to the memory controller. The memory controller is going to make a request out to memory, and it's going to pipe it back. And this LSU is going to be in a waiting state. And the scheduler, which we talked about before, is going to see that this thread's LSU is still waiting, and it's just gonna keep waiting until the LSU gets a response from memory. And then the last thing, which I'll cover, is the PC. So if you go over to the program counter unit in the top right, this is the stuff I was talking about before with the whole branch and compare instructions. So, generally, the next program counter is just gonna be the next line of code, which means we just wanna add 1 to the current program counter. Pretty straightforward. However, if we're on a comparison instruction, this is where we get to the NZP register. So for a comparison instruction, we are going to actually set the NZP register. And that's gonna tell us, hey, the comparison we just did, it has these NZP values. It was either negative, 0, or positive. Then on the branch instruction, it's gonna use that NZP register. You can actually see the NZP signal coming in from the instruction, like we said before. And what that's gonna do is basically check, hey, did the comparison instruction match this NZP, like, the set of NZP values? If it does, then we're gonna jump to that immediate value specified by the instruction. Again, you don't really need to understand exactly what every single control signal is doing. If you want to, you can look at the diagram, and it'll become much more clear, more clear than my explanation.
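A sketch of the program counter logic described: a compare sets negative, zero, and positive flags, and a branch jumps to its immediate only if those flags overlap the instruction's nzp bits. The flag ordering and function shapes here are assumptions for illustration.

```python
# CMP sets NZP flags; BRnzp jumps to the immediate only if the flags match.

def cmp_nzp(rs: int, rt: int) -> tuple:
    diff = rs - rt
    return (diff < 0, diff == 0, diff > 0)          # (negative, zero, positive)

def next_pc(pc: int, branch: bool, nzp_bits: tuple, nzp_reg: tuple, imm: int) -> int:
    taken = branch and any(b and r for b, r in zip(nzp_bits, nzp_reg))
    return imm if taken else pc + 1                 # otherwise just the next line

nzp = cmp_nzp(2, 4)                                 # negative -> (True, False, False)
print(next_pc(7, True, (True, False, False), nzp, 3))   # branch taken -> 3
print(next_pc(7, True, (False, False, True), nzp, 3))   # not taken -> 8
```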
But the important thing is that you understand generally that, oh, this is the layer where the instructions are actually getting literally translated into hardware signals that are determining the flow of how the thread gets executed, how the program counter gets updated, and how register values and global memory data values get updated. And that's basically how the entire thread, like, the kernel, gets executed. And if you remember, in the register file over there, there's those 3 special values which give us context on where we are in thread execution. That's getting set by the core. And, basically, that's how the kernel gets executed at this, like, low level of the individual compute units of the GPU. And it's very similar to a CPU, because at this level, we have basically a lot of similarities to CPUs. It's basically, like, a mini CPU here. Cool.
[Nathan Labenz] 1:33:48 That's really very well done. And if you're listening in audio only mode at this point and a little confused, definitely check out the YouTube or jump straight to the GitHub project and check out the diagram, because it definitely will help if anything is unclear from the verbal description. Okay. You've designed all this. You've got all the code. We've traced the path from what you would do at the programming layer to how that gets propagated through the hardware. Is it time to go on to how you actually get to visualize and verify that this is working?
[Adam Majmudar] 1:34:28 Yep. We can go to that, please. So if you go into the test folder, I'll show you quickly how I, like, simulated everything. And then on the tweet, there's also a video of what's actually happening in the GPU. But yeah. So in the test folder, you can see, for example, the test for the add kernel. So literally, all we have here is I manually compiled all of those kernels I wrote into, like, the valid machine language for my GPU. So here you see the matrix addition kernel. And then in Python, I wrote this external interface, since the memory is external to the GPU design. It's not in the hardware. So I made, like, a simulator right there that basically simulates exactly the behavior of memory. On every cycle, it just checks if there's a request to the memory from the memory controllers, and if there is, it responds with the data. The reason is because memory is not part of the GPU. It's usually, like, an external DRAM or high bandwidth memory or something like that. So you wouldn't really include that as part of the design. So that's what we're simulating here. If I actually taped this out as a proper tape-out, there would be, like, an external DRAM hooked up similarly to this. And so we load up the program memory here, load up the data memory, exactly like I was talking about there, and both of these are simulated memory. And then also right there, I've set the value a little below for the device control register, telling it to execute 8 threads. And then basically, all you do is I have this setup program, and all that does is it's actually gonna take those memory values and load them in, it's gonna take the thread count and load it in, and then it's gonna call start. So it's gonna set the start signal to high, and it's gonna trigger everything. And then we have this, like, little piece of code displaying the data memory, just for visualization purposes, so you can see what it's starting out as. And then now you're seeing the exact way to interface with it. So you're seeing, like, just wait until the GPU's done signal is set to 1. In other words, the GPU says it's done. So it starts out at 0, and we're just gonna keep running cycles until the GPU says it's done. And for each cycle, the GPU is gonna be processing. And I have this nice piece of code here, which is written in 1 of the helper files, which is just gonna display out the state of the GPU. So what that actually does is it's gonna show you, for every single thread and every single core, what are the registers being read, what are the current register values, what is the ALU output, what is the LSU output, what are, like, the states of the core? Is it, like, waiting, or is it executing, or is it decoding or fetching or whatever? Those are really nice for seeing, like, the control flow and for debugging. And I encourage anyone who's really curious to see that. It's a sick learning process. Just go run the matrix addition kernel. There's instructions in the repo, and then go look at the log file that gets generated. You're gonna see, over, like, 500 cycles, every single thing that's happening inside the GPU to execute this code, which is pretty fun. But, yeah, it's pretty simple. And then I did the same thing for matrix multiplication. So you can see the whole execution trace, and then you can see at the end, it prints out the final data memory, and you can actually see the final values that have been correctly computed.
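A rough shape of the simulated external memory described: each cycle, the Python test bench checks whether the memory controller has raised a request and answers it. The controller interface names below are illustrative, not the actual harness API.

```python
# Memory lives outside the hardware design, so the test bench plays the role
# of external DRAM and answers controller requests cycle by cycle.

class SimulatedMemory:
    def __init__(self, contents):
        self.contents = list(contents)

    def cycle(self, controller):
        if controller.read_request_pending():
            addr = controller.read_address()
            controller.return_data(self.contents[addr])     # answer the read
        if controller.write_request_pending():
            self.contents[controller.write_address()] = controller.write_data()
```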
And for anyone curious who doesn't wanna run it themselves, there's a video of it in the tweet thread that'll probably be linked here.
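To make the flow Adam describes a bit more concrete, here is a minimal sketch of the test-harness idea in Python. The names here (SimulatedMemory, set_device_control_register, step, and the gpu object itself) are illustrative stand-ins, not the actual API of the repo; the real tests drive a simulated Verilog design rather than a plain Python object.

```python
# Minimal sketch (illustrative names, not the repo's real API) of the test flow:
# memory is external to the GPU design, so the testbench simulates it, services
# memory-controller read requests each cycle, and spins until the done signal.

class SimulatedMemory:
    def __init__(self, contents):
        self.data = list(contents)

    def cycle(self, read_valid, read_addr):
        # If a memory controller raised a read request this cycle, answer it.
        if read_valid:
            return True, self.data[read_addr]
        return False, 0


def run_kernel(gpu, program, data, thread_count, max_cycles=1000):
    # `gpu` is assumed to expose the signals described in the conversation:
    # a device control register, a start signal, a done signal, and memory
    # request lines. In the real project these live in the simulated hardware.
    program_mem = SimulatedMemory(program)
    data_mem = SimulatedMemory(data)

    gpu.set_device_control_register(thread_count)  # e.g. execute 8 threads
    gpu.start()

    cycles = 0
    while not gpu.done and cycles < max_cycles:
        # Service any outstanding data-memory request, then advance one cycle.
        ready, value = data_mem.cycle(gpu.mem_read_valid, gpu.mem_read_address)
        gpu.step(mem_ready=ready, mem_data=value)
        cycles += 1

    return data_mem.data  # final data memory, e.g. the computed results
```

In the actual repo, the same kind of loop also dumps each core's state per cycle (fetching, decoding, executing, waiting), which is what produces the roughly 500-cycle trace Adam mentions.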
[Nathan Labenz] 1:37:17 Cool. This is awesome. Anything else we wanna talk about at the real low level? Otherwise, maybe I have just a couple high level questions to ask to conclude with.
[Adam Majmudar] 1:37:26 Yeah. This is great on the low level. For people who have questions, just DM me or open an issue on GitHub or whatever. Already a lot of people have been contributing to the GitHub, more than I expected. But yeah.
[Nathan Labenz] 1:37:36 Yeah. There's definitely a lot of curiosity out there about this sort of thing right now, but it's not easy to go as far as you have. So I'm not surprised at all that people want to follow in your footsteps. Certainly, I'm
[Adam Majmudar] 1:37:46 1 of them.
[Nathan Labenz] 1:37:47 Coming out of this project, is there anything that still feels mysterious to you or questions you weren't really able to get a good grasp on, things you wish you understood better? Do you feel like you have it all reasonably well understood?
[Adam Majmudar] 1:38:02 Yeah. So I think I basically hit my goals with it. My goal with anything learning-wise in a technical capacity is usually not to get super niche-deep in every single area, because honestly, I couldn't spend infinite time going deep in any engineering discipline. It's more to get to a level where I understand the entire landscape from an engineering standpoint, I have all the key intuitions, and I'm actually deeper than most people who are just going in with an entrepreneurial engineering goal. So I wanna be able to spar with the engineers on anything and fully understand everything they're talking about. But then I'm mostly entrepreneurially motivated, and obviously, in this case, there's not really much entrepreneurial opportunity. It's such an entrenched industry; I'm not gonna go compete in GPUs or something. But that's just generally how I approach technical things. So I feel like I got to the level of intuition where I can now understand anything I need to. And I put at the bottom of the repo README some of the more interesting, more advanced GPU concepts that I looked into but didn't implement. So that was fun. But, yeah, generally, I think I accomplished my goals. I don't think I would want to spend the time to go super deep into any 1 of these things unless it happened to be relevant to some frontier problem where it's actually useful to go into them. But that's generally how I think
[Nathan Labenz] 1:39:12 about it. What do you think this tells us, if anything, about the future of the chip industry? I feel like there's a lot of debate, obviously, around: is NVIDIA gonna take over the world? Is AMD gonna make a comeback? Why is TSMC stock not up nearly as much as the chip designers? And I feel like somebody might say, jeez, you made it pretty far in 2 weeks. If you were actually to try to go manufacture something, you wouldn't make it nearly as far in 2 weeks. That would seem to suggest there's something a little out of whack: why is all the value accruing to the design side as opposed to the actual manufacturing side? What's your take on that, if any, coming out of this experience?
[Adam Majmudar] 1:40:00 Yeah. I have my own opinions, but I don't think I'm credible or experienced enough for them to be a useful take worth sharing. The 1 thing I'll say is that I don't think there's anything wrong with the value accruing to people with massive moats. It makes sense, especially this late on; it's such a Lindy industry at this point. For people curious about that stuff, all I would say is just read 7 Powers, and it will explain why moats make sense and why, even though the guy did this in 2 weeks, realistically speaking, you can't do anything to challenge these people in most cases. And, again, disruptive innovation is the place where you get to challenge incumbents, but that's probably not gonna happen much here. The incumbents are also very competent. With NVIDIA and TSMC, it's such a wildly incentivized industry that the incumbents can't really afford to become incompetent, given how much is riding on it. They're basically driving the world's semiconductor industry. So it's definitely challenging. And there are a lot of takes out there. Again, I can't claim to know whether you can challenge TSMC or NVIDIA, but there was a viral tweet the other day where a guy who had just left NVIDIA said, now that I've left NVIDIA, I can confirm that you're not catching up to them within the next 10 years. And I would think those perspectives are probably pretty accurate given the nature of the industry and the economics of it.
[Nathan Labenz] 1:41:14 Sounds like it's definitely too big a hill to climb.
[Adam Majmudar] 1:41:16 I think 1 thing is really interesting, and I'll shout out 1 company here. They don't need my shout-out at all, but it's called Atomic Semi. I think that is probably the coolest attempt at doing anything in this industry in the past 20 years. It's this kid, Sam Zeloof, who built a semiconductor fab in his garage, and now he's trying to build cheap, affordable semiconductor fabs. Initially he's targeting the mid market, like the FPGA market, and there's not really an incumbent there. Basically, nobody has ever done it before. The question is not, can he beat the incumbents? It's just, is it possible to create it at all? If he succeeds there, he's gonna create a gigantic mid market. It's kind of like what we were talking about here, where I was like, oh yeah, hobbyists can do this stuff, but realistically, for most people, if you wanna hit production scale, you just have to use the incumbents. Because he's tackling down-market at first, he's actually legitimately creating a new market opportunity if he succeeds. So I think that's the coolest attempt in this market, trying to do something that will actually make an impact.
[Nathan Labenz] 1:42:14 And the value there is custom designs for comparatively low-end use cases?
[Adam Majmudar] 1:42:21 It's not even that. It'll actually probably scale pretty far, but it's more that a current foundry costs something like $10 billion, and if you wanna use one, you have to ship off your design and wait a long time to get back test samples. Then there's the FPGA route, which is a lot faster, but you don't actually get a real chip; it's just an FPGA. Basically, what they're doing is creating fabs that are way cheaper, on the order of hundreds of thousands of dollars instead of billions, and you can get your designs immediately because you just buy a fab and have it with you. That's, first of all, very obviously gonna completely change things for the mid market. But eventually it may actually capture the down-market and mid-market opportunities, which are actually pretty big, because you don't even need to go to the huge foundries; you can just use these fabs in a way more price-efficient way. So that could be huge.
[Nathan Labenz] 1:43:04 Yeah. Fascinating. So the idea there would be enterprises would run their own small fab, in the same way that today they might run their own cluster or go to a cloud provider. In the future, they might run their own fab internally because it's been modularized to that point. (Adam: Exactly.) Yeah. Fascinating. Do you have any thoughts on this: 1 paper that really caught my eye recently was the Microsoft 1.58-bit paper. There's obviously been a ton of work in quantization of the values within neural networks, right? We go from full-length floating point down to smaller and smaller bit resolution. The smallest that seems to have been demonstrated to work well so far is this 1.58-bit scheme, where minus 1, 0, and 1 are the only available values in the network, and they show that you can actually get networks to work with that. 1 of the things they said in that paper was that this potentially opens up a totally different hardware future, because you obviously don't need nearly as much complexity to handle 1, 0, and minus 1 as compared to all the math for longer-precision numbers. Does this give you any intuition for how that might play out in the future?
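As a quick aside on where that 1.58 figure comes from (an editorial gloss, not something spelled out in the conversation): a weight restricted to the three values minus 1, 0, and 1 carries log2(3) bits of information, even though it is typically stored in 2 physical bits.

```python
import math

# A ternary weight has 3 possible states, so its information content is
# log2(3) bits, which is where the "1.58-bit" name comes from.
print(math.log2(3))  # roughly 1.585
```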
[Adam Majmudar] 1:44:26 Yeah, a couple of things. 1 of the things people criticize about the big measurements of compute power nowadays is FLOPS, which is floating point operations per second. That's kind of the benchmark for compute power, especially in GPUs, nowadays. And 1 of the reasons they criticize it is that if you have 32-bit floating point values and you just switch them to 16-bit floating point values, you drastically increase the FLOPS, because it just takes less time to perform computations on the smaller size. And that's actually what NVIDIA has been doing. They just released their Blackwell, and 1 of the things they've been doing, alongside boosting their FLOPS in many other ways, is that Blackwell now supports smaller floating point numbers, because ML doesn't really need gigantic floating point numbers the way graphics does. Just by nature of doing that, it massively boosted the FLOPS, and it is actually meaningfully faster for the use cases it's relevant to. So it's exactly that implication: if you can use 2-bit or 1.58-bit values, the FLOPS is gonna go up like crazy. And the important hardware piece is what that actually means for the register values. In this case, again, this is what I meant about the simplification still letting you explain important things. The register values in my design are 16-bit register values; they store 16 bits. In real GPUs, there's a whole host of registers: registers that can store 16-bit and 32-bit floating point numbers, scalar registers, and also vector registers. So what this means is that, in the hardware, you would actually have registers that store 2-bit values or something like that, 1-bit or 2-bit values. And that's gonna be way faster, and that's how it gets implemented in the hardware. But, yeah, that's pretty interesting.
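A rough way to see the point Adam is making about precision and FLOPS: at a fixed datapath or memory width, narrower operands mean more values move and get operated on per cycle. The sketch below is a back-of-the-envelope illustration using a made-up 256-bit lane width, not figures from any specific GPU.

```python
# Back-of-the-envelope illustration: how many operands fit in one datapath
# word at different precisions. The 256-bit width is a hypothetical example.
DATAPATH_BITS = 256

for name, bits in [("fp32", 32), ("fp16", 16), ("fp8", 8),
                   ("ternary (~1.58 bits, stored as 2)", 2)]:
    lanes = DATAPATH_BITS // bits
    print(f"{name:34s} -> {lanes:3d} values per {DATAPATH_BITS}-bit word")
```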
[Nathan Labenz] 1:46:02 Yeah. I'm curious. Seems like it gives you a big simplification in terms of a lot of the components too, right? Like, you imagine the sort of arithmetic unit you had, where you're saying it's doing addition, subtraction, multiplication, division all the time. If you're down to a much more limited set of logical operations that you need to perform at each given step, that could presumably get a lot smaller too, right?
[Adam Majmudar] 1:46:29 Yeah, definitely. Although I'm sure that contributes to the FLOPS increasing, I'm also sure we're grossly oversimplifying a lot of things, because I bet you wouldn't actually have just 2-bit registers in your whole GPU. So there's probably a lot of complexity to it. But as a high-level picture, I'm sure this is generally valid.
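To illustrate the simplification Nathan is gesturing at, here is a small NumPy sketch (an editorial example, not code from the project or the paper) of why ternary weights shrink the arithmetic: every multiply in a weight-times-activation product collapses to an add, a subtract, or a skip, so in principle the datapath only needs adders.

```python
import numpy as np

# Ternary weights {-1, 0, +1}: the product W @ x needs no multiplier,
# only selective adds and subtracts of the activations.
rng = np.random.default_rng(0)
W = rng.integers(-1, 2, size=(4, 8))   # ternary weight matrix
x = rng.standard_normal(8)             # full-precision activations

reference = W @ x                      # ordinary multiply-accumulate

out = np.zeros(W.shape[0])
for i in range(W.shape[0]):
    out[i] = x[W[i] == 1].sum() - x[W[i] == -1].sum()  # add/subtract only

assert np.allclose(out, reference)
print(out)
```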
[Nathan Labenz] 1:46:46 Okay. Cool. That's all I've got. Anything else you want to talk about? Otherwise, maybe I'll just ask you what you're on to next.
[Adam Majmudar] 1:46:53 Yeah. Honestly, this was cool. It was good talking about it too. I didn't expect to get into this much technical detail. I guess another interesting thing is the role that AI played in this. There's not too much to say there except that it's very clear that AI has a huge place in learning now. Part of my learning philosophy, in many ways, is that people who've seen my stuff now may think of me as, like, a chip person right now or something like that. But really, I'm just trying to learn a lot of stuff really fast, based on a completely different high-level framework I've been using to decide on this stuff, which I'm gonna talk about. And I think what AI has made possible is my learning philosophy, which is extremely aggressive. Like, I believe, because of this and a number of different things, that I can learn stuff, and people can learn stuff, like 100x or 200x faster than they think. Obviously, as I said at the beginning, with less complete depth, but with the same or more practical knowledge. I think AI has enabled that in a pretty crazy way, and this project helped me really feel the truth of that. In many ways, this would not have been possible without AI. So that's 1 interesting takeaway I have.
[Nathan Labenz] 1:47:51 Cool. Love it. This has been a fabulous walkthrough of TinyGPU. Congrats on a successful speedrun. I'm looking forward to what your next 1 will be, and certainly we'll be following you for updates. For now, I will say, Adam Majmudar, thank you for being part of the Cognitive Revolution.
[Adam Majmudar] 1:48:10 Thank you for having me.
[Nathan Labenz] 1:48:12 It is both energizing and enlightening to hear why people listen and learn what they value about the show. So please don't hesitate to reach out via email at tcr@turpentine.co, or you can DM me on the social media platform of your choice.