The AI Revolution in Education with Shawn Jansepar, Director of Engineering at Khan Academy

Nathan and Shawn Jansepar discuss Khanmigo, Khan Academy's GPT-4-powered AI tutor: how it was built, its impact on education, and where AI-assisted learning is headed.

Video Description

In this episode, Nathan sits down with Shawn Jansepar, Director of Engineering at Khan Academy, to discuss their GPT-4 powered Socratic tutor, Khanmigo. In this conversation, Shawn and Nathan chat about Khan Academy’s collaboration with OpenAI and how they helped fine-tune GPT-4, how Khan Academy leveraged GPT-4 to build Khanmigo, and the impact of providing access to an AI tutor to any student. If you're looking for an ERP platform, check out our sponsor, NetSuite: http://netsuite.com/cognitive

TIMESTAMPS:
(00:00) Episode Preview: Education 10 years from now
(04:42) Khan Academy’s early access partnership with OpenAI
(06:31) Khanmigo: journey from Chrome extension to AI tutor
(11:36) GPT-4’s ability to be Socratic vs 3.5
(15:05) Sponsors: NetSuite | Omneky
(16:40) Integrating Khan Academy’s Pedagogy into AI
(20:06) The future of education 10 years from now
(22:37) Khan Academy’s models
(27:20) Demo-driven development process
(31:16) Sculpting the behavior of a Socratic tutor model
(35:59) Khan Academy’s contribution to GPT-4’s fine-tuning and RLHF
(38:41) Being data-informed vs data-driven as a practice
(42:10) Incurring tech debt to get ahead of the curve
(45:28) The boundary for what an AI can/can't tutor
(49:30) Identifying when the user is confused and avoiding AI hallucinations
(53:54) Khanmigo’s development patterns
(59:11) Making Khanmigo jailbreak resistant
(01:01:50) Delivering personalized education with AI
(01:04:08) How Shawn and his team are thinking about AI education
(01:05:33) Khanmigo’s future multimodal interactivity
(01:08:42) Evaluating Khanmigo’s efficacy for student learning
(01:11:41) How widely is Khanmigo deployed today and what is the future for universal public access?
(01:15:11) Distribution through teachers and districts
(01:16:30) What are the reactions from teachers and education institutions to AI?
(01:18:15) Khanmigo’s pricing model
(01:19:03) The future roadmap for Khanmigo
(01:20:45) How will the AI tutor change the world at large?

LINKS:
Khanmigo: khan.co/khanmigo23
Benjamin Bloom’s 2-Sigma Problem: https://web.mit.edu/5.95/www/readings/bloom-two-sigma.pdf

X/TWITTER:
@shawnjan8 (Shawn)
@labenz (Nathan)
@eriktorenberg
@CogRev_Podcast


SPONSORS: NetSuite | Omneky
NetSuite has spent 25 years providing financial software for all your business needs. More than 36,000 businesses have already upgraded to NetSuite by Oracle, gaining visibility and control over their financials, inventory, HR, eCommerce, and more. If you're looking for an ERP platform, head to NetSuite: http://netsuite.com/cognitive and download your own customized KPI checklist.

Omneky is an omnichannel creative generation platform that lets you launch hundreds of thousands of ad iterations that actually work, customized across all platforms, with the click of a button. Omneky combines generative AI and real-time advertising data. Mention "Cog Rev" for 10% off.

Music Credit: GoogleLM



Full Transcript

Transcript

Shawn Jansepar: 0:00 In 10 years from now, I don't think there'll be any kid who's learning without a personal tutor next to them, on demand, available all the time, that knows everything about their learning history. If they give permission to also understand their interests, they can motivate them more. You make a problem based on the Avengers or soccer or something like that, it could be a lot more motivating than whatever generic problem the student is given. So I think in 10 years, my prediction is that everyone will have this available.

Nathan Labenz: 0:29 Hello, and welcome to the Cognitive Revolution, where we interview visionary researchers, entrepreneurs, and builders working on the frontier of artificial intelligence. Each week, we'll explore their revolutionary ideas, and together, we'll build a picture of how AI technology will transform work, life, and society in the coming years. I'm Nathan Labenz, joined by my co-host, Erik Torenberg. Hello, and welcome back to the Cognitive Revolution. Today's episode is an extremely positive and fun one. My guest is Shawn Jansepar, director of engineering and AI product lead at Khan Academy. For years, Khan Academy's mission has been to provide a free, world-class education for anyone, anywhere. And they've made incredible progress toward this goal. What started as just one man making personal tutoring videos for his cousin has now grown to over 150 people with expertise in all facets of education, from learning science and instructional design to software delivery. But still, the promise of a truly world-class education for all has been limited by the impossibility of scaling personal one-to-one attention for each student. That's recently changed, of course, with the release of GPT-4, when suddenly the notion of an AI personal tutor went from a distant dream to a practical product problem. In this conversation, Shawn takes us behind the scenes of Khan Academy's early access partnership with OpenAI, describing the magical experience of chatting with GPT-4 for the first time and how quickly he realized its potential for education. He also tells the story of how his team mobilized to build and launch the first version of their AI tutor, Khanmigo, just in time for the GPT-4 launch event in March. We dig into the details of how Khanmigo works, and Shawn explains how they're using a mix of prompting techniques and other system-level strategies to ensure that Khanmigo is always focused, encouraging, relevant, and, critically, in command of core academic material. 
He also discusses Khan Academy's intense focus on safety, including how they minimize the risks of jailbreaks and generally monitor for other problems, and also how they're partnering with school districts and teachers to deploy Khanmigo effectively into classroom settings. While the reactions from students and teachers have already been overwhelmingly positive, as Shawn puts it, they've only scratched the surface of personalized AI assisted education. He paints an inspiring picture of a future where Khan Academy's educational expertise and GPT-4's capacity for interactive conversation combine to dramatically accelerate learning for students worldwide, and he envisions just how much this could mean for humanity as a whole. If you're finding value in the show, we would appreciate it if you'd share it with friends and or post a review to Apple Podcasts, Spotify, or YouTube. We've had a few great comments over the last week, including one Apple Podcast reviewer who said that this is a sorely needed show with how quickly AI is coming at us. Incredible value each episode. A YouTube commenter also said, of the few dozen AI newsletters, blogs, channels, and podcasts that I consume, this is consistently the best. So glad you are releasing this type of insight for free. And a third said, these conversations are always so incredible. I'm always looking forward to the next episode. I can't understand why you have so few viewers. I will definitely start sharing you more often. I really appreciate all these comments and especially appreciate you sharing the show with friends. And, of course, your feedback is always welcome. You can email us at tcr@turpentine.co or DM me on Twitter where I am at Labenz. Now without further ado, I hope you enjoy this conversation with Shawn Jansepar of Khan Academy. Shawn Jansepar, welcome to the Cognitive Revolution.

Shawn Jansepar: 4:21 Thanks for having me.

Nathan Labenz: 4:22 So you are leading AI development and productization at Khan Academy, and the new product is Khanmigo. I am excited to get into all aspects of this. But for starters, because I feel like we are so quickly acclimating to the pace of progress in AI, I wonder if you could just take me back, not too long to your GPT-4 story. If I understand correctly, Khan Academy partnered with OpenAI before things were released to the public. And so you were in this kind of small set of people that had this glimpse of the future. I'd love to hear about that.

Shawn Jansepar: 5:00 Yeah, it's an interesting story. It actually goes all the way back to around October, which is when I first got access to GPT-4. Basically, what happened was, one evening I suddenly got access to GPT-4 via a Slack bot. I was just randomly added to a channel where you could interact with this AI, and an email went out to a few key folks at Khan Academy who were given access, saying, yep, we're chatting with OpenAI about their new model, GPT-4. I hadn't really played much with any of the other models they had available via API or anything, because this was prior to ChatGPT being launched. And I remember getting access to this thing, and my initial impression was that I was interacting essentially with an omniscient being, this thing that just knew everything. Obviously, over time, you started to understand its limitations. But in the beginning, it was just, wow, I cannot believe what I have just experienced after talking with it for 30 minutes. I literally had to go for a walk around the block and just take it all in, thinking about what this is going to change for the world. And it is going to change a lot for the world. Obviously, we don't have AGI yet; I felt like we did when I initially tried it for those first 30 minutes, but we don't. That's when we got access, around October. Shortly after that, we had a company on-site where we were having a hackathon, and the few of us who had access decided we were going to do some hackathon projects with it. In the beginning, we were pretty hesitant about the idea of building something that was directly accessible by a learner. We were more interested in the idea of using this to help us build content, or to do some things for teachers, like generating lesson plans. 
But we were worried, just from a safety standpoint, about the idea of giving access to students. Over time, though, we did try a few different ideas, and we thought, what if there are ways we can mitigate some of the concerns we have for students? Things like having ways of monitoring what the student is doing. We found out about OpenAI's moderation API, which we leverage in Khanmigo as one way of mitigating some of the risks. And so in around December, we started working with OpenAI a little bit more closely and directly. Actually, Jessica on their team created this quick demo for us of, well, hey, what if one idea is to use this to really help guide students to the right content? Kind of like almost a more advanced search functionality. And that kind of opened our eyes to what this could be like. We initially thought about it as a concierge, and we tried that, and it was a pretty cool interaction. So we said, hey, why don't we just do some rapid prototyping together? We had been engaged with them in weekly meetings, but those were kind of going nowhere, really, until Jessica at OpenAI created this little prototype for us. And that's when we said, rather than just meeting every week, being like, hey, so what's your update? Hey, so what's your update? We said, let's just sit down together virtually and engage in rapid prototyping. Let's check in in the morning and check in in the evening. And we decided, let's actually try to build what it would feel like to have this large language model available as a guide-on-the-side experience next to you in Khan Academy. And it was very scrappy. We built it with a few members of the team. Initially, it just started out as a Chrome extension, and that's how we integrated it into the site. And so, we made it such that the AI had access to the context on any content page of Khan Academy, and you could just say, I don't really get this. 
And for it to know what that meant felt really magical. We were like, wow, okay, I think there's something here. A bunch of us were experimenting with it, including a lot of the members of our leadership team, and we were just really impressed at how good it could be as a tutor. Of course, it was still making a lot of math mistakes and whatnot, but we felt like there was promise. So what we ended up deciding was, let's get a little bit more serious about this. What if we fly down to Mountain View and do some rapid prototyping directly with OpenAI in their offices? Also in Mountain View, our offices are actually just above a school that's related to Khan Academy but is a separate organization, called Khan Lab School. And we said, let's just test it out with them and get some student reactions. So we flew down the next week, built out this prototype, and tested it with some students, and they loved it. One of the things that was the most exciting for them was the ability to click this button that says, why should I care about this? That was by far the most interesting thing they wanted to try, which goes to show you how good of a job we do in education at helping kids understand why they should be learning this stuff. And they were just so shocked, A, at the quality of this thing in general, because ChatGPT at this point in time hadn't been launched. And we did a bunch of stuff in advance, got consent from parents and whatnot to try it out with them. But they thought they were going to get some sort of basic, not-very-good chatbot experience, and they were blown away at how much it was able to help them. From there, we went to the OpenAI offices, did a lot of red teaming with them, talked about our roadmap, talked about ways to make the math more accurate. That worked out really well. That was towards late December. 
In January, we decided, hey, you know, this is promising. Let's consider building a feature here to launch alongside the GPT-4 announcement. By this point, most people knew about ChatGPT, the product, but they didn't know about GPT-4 and its power. And we knew specifically that GPT-4 was much more powerful, especially in the sense that you could give it very clear directions and it would follow those directions. So for instance, in the tutoring example, when you tell GPT-3.5 to be more Socratic and not give away the answer, it doesn't follow those instructions very well, whereas GPT-4 does. And that was critical for us, because we really wanted to build something that wasn't just going to help students cheat, but something that could really guide them towards the answer and act as much like a tutor as possible. We decided we wanted to launch alongside GPT-4, and we engaged the whole company in a company-wide hackathon in January. We decided we were going to let the whole team know and bring them into this NDA. We got generative with the ideas. A bunch of people built some really cool stuff, and then we had to make some decisions about what we would like to fit into the launch on March 14. From there, we converged the ideas into a few, and then we basically had a huge portion of the engineering, design, and product team continuing to work on this right up until March 14, which is when we launched alongside OpenAI. The launch at the end was a bit of a fun story. At the end of the day, OpenAI only had one model they could release, and they were working with a lot of customers and a lot of partners, not just Khan Academy. And certain customers really liked a certain model that was better at storytelling and whatnot, but maybe not so good at math. But that was the model we weren't interested in, because obviously math is the most important thing for us. 
There was a very specific model that they released, because they were doing weekly releases of these models, and we were hoping that the capabilities of that one would make it into the production GPT-4. We got access to that model pretty late, over a weekend, and we got a few folks on our leadership team and a few of our key engineers and design folks, and we just tested the heck out of it, and it worked really well. It turns out this model ended up having the best of both worlds, which is what they were expecting; they also had kind of a backup model that we weren't as happy with. We tested the released model out, it worked really well, and thankfully the launch went really smoothly. So I'll stop there, because that's been a lot of me talking, but that's what led from initially trying things out to the launch on March 14.

Nathan Labenz: 13:53 That's a phenomenal story. It takes me down a little bit of my own walk down memory lane. I was also, as a customer, able to get early access, and then ended up joining the red team project that they had, just as kind of a community red teamer. My experience was much less handheld by the organization, which makes me kind of envy the opportunity to sit down and work more closely with them. I was very much just trying to figure it out on my own and didn't know how many other people were also doing it. So in some ways, it was one of the better things that has happened to me, because it really motivated me to figure out what was going on. But I was also a little nervous, like, how many people know that this is out there, and what's really happening? As you said, you had to go for a walk. I definitely had a few moments where I was like, man, this thing is a big deal, and I don't even know how few people know about it. It was a crazy experience. So speaking of big deals, it's been out now for a handful of months. You've certainly had a chance to calibrate on the model performance, and probably also to start to get grounded in where the rubber hits the road with this thing, which for you is with students as they're actually learning. Hey, we'll continue our interview in a moment after a word from our sponsors. It's been said for a long time, for far longer than the technology has been able to support it, that a personal AI tutor for each kid is going to change everything. And that was a very handwavy notion. But now, all of a sudden, it really does seem to be either here or very closely within reach. I'm interested to hear whether you think it's here or in reach. But how big of a deal do you think this is? 
People talk about the two standard deviation effect, which is basically the idea that, if I understand the literature correctly, and if Perplexity is helping me synthesize it effectively, the most effective intervention ever found in education is one-on-one tutoring. Is that the premise that you're working with as you build these products? And with a few months now in market, do you still believe in that two standard deviation effect? And where do you think Khanmigo is on that curve?

Shawn Jansepar: 16:05 What you're referencing is Benjamin Bloom's 2 Sigma study, which has been part of our essential pedagogy at Khan Academy since the very beginning: the idea of using technology as a way of personalizing the experience for every single learner. We have this one-size-fits-all model in the classroom, but how could we imagine a world where you can build something that's personalized to each student? And that study shows that even in a setting with one teacher to 30 students, mastery-based learning can also produce a significant increase: I think there's at least a one standard deviation improvement over a normal classroom with mastery-based learning, even in a 30-to-1 setting. And then one-on-one tutoring with mastery-based learning is the ideal.

Nathan Labenz: 16:58 Define for me mastery-based learning. Is that like demonstrating I can do it?

Shawn Jansepar: 17:01 It's not moving on to the next concept until you've mastered the previous one. Whereas today in our system, students kind of move from the fifth grade to the sixth grade to the seventh grade, regardless of how well they've done, right? So if you get a C in algebra 1, you're still moving on to algebra 2. What are the chances that you're going to be able to be successful with algebra 2 when you clearly have significant gaps? These gaps form over time if you're getting lower grades earlier on in your school year. And so the idea with mastery-based learning is you really shouldn't move on to the next concept until you've mastered the previous one, at the very least gotten to proficient on the previous one. There's an interesting distinction between those two terms. But really, it's like you should have a good understanding of it rather than moving on to the next concept.

Nathan Labenz: 17:51 The old "I missed that day in school" joke, except you're not allowing people to kind of fall through the cracks by having missed a day in school or a concept.

Shawn Jansepar: 18:01 Yeah, exactly. And tons of kids, I remember being in school: you miss a day, you don't catch up, you don't remember what you learned. And that's the idea. I even remember being in university: if you're in a lecture and you don't have a great teacher, you daydream, you're tired, whatever, and you miss a section, you're confused about the next part of the lecture. Now you can't even follow the lecture, because everything was built on that. And the idea with Khan Academy's videos was that you could pause, you could rewind, there's no judgment. That enables you to have that personalized experience and really just focus on what it is that you're learning. And now with AI and large language models, we see the ability to move even further down that study, into one-on-one tutoring. My opinion is that 10 years from now, there won't be any kid who's learning without a personal tutor next to them, on demand, available all the time, that knows everything about their learning history, which can help identify right away where someone might be struggling. If they give permission to also understand their interests, it can motivate them more: you make a problem based on, I don't know, the Avengers, or soccer, or something like that, and it could be a lot more motivating than whatever generic problem the student is given. So I think in 10 years, my prediction is that everyone will have this available. Whether or not it's going to achieve the same results as the study with a human tutor, I think, still remains to be seen. It's interesting, because in some ways you don't want to emulate the tutor exactly. If you're getting help from a tutor you just met, as a kid you're maybe hesitant to share, unless that tutor has built a long history and a long relationship with you; a parent tutoring a kid, that's different. 
The tutor doesn't really know what motivates you, doesn't know what you've been struggling with. Whereas if you've had an AI as a tutor on the side since the early days of your learning, it's going to know everything about what you've been doing up until that point. And if it knows your interests, it's going to be able to be more personal to you. And it's on demand, and it doesn't judge you. Not to say that tutors judge you, but everybody's different in the way they teach. At the same time, the difference is that a human tutor, especially in the more advanced areas of learning, is still going to be better if they're an expert in that field. So Khanmigo, obviously powered by GPT-4, still sometimes makes math mistakes, right? Or if you get into some of the more advanced concepts, it sometimes gets confused. I think that is most likely going to get solved, but getting math perfect is still a research problem that is not solved yet in the world of large language models and AI. If we can solve those problems with large language models, and I believe we will be able to, then I could imagine that it could be just as powerful as a really, really great tutor.

Nathan Labenz: 21:13 Yeah, it's fascinating. So are you guys still using exclusively GPT-4 with kind of instructions? Or have you made a leap into fine-tuning it as well or starting to use different models in combination? How has the model stack evolved?

Shawn Jansepar: 21:29 For us, we're a small organization. And I think that if GPT-4 hadn't landed in our lap, allowing us to do something with zero-shot and few-shot prompting, it would have been a lot harder for us to have built this, because fine-tuning a model, or building a model on your own information, your own data, is a pretty huge effort. And so when we tried this with GPT-4, like I mentioned, such a huge aspect of the success of this is being able to be Socratic and not give you the answer. The power of GPT-4 being able to do that was essential for us. Now, in terms of the mix today, it's still the vast majority GPT-4. There are certain activities that we've experimented with. So within Khanmigo, you can get a tutor on the side when you're working on content, like doing a quiz or reading an article or watching a video, but there's also a collection of activities. Right now, we're kind of talking about them like a demo disc that you'd get when buying a game console. There's a few of these things you can try out, like craft a story with the AI or talk to a historical figure. These exist in a user interface that you can imagine will, over time, get embedded within our content. If you're doing a history lesson and you're reading about Isaac Newton, then maybe directly from there you can ask Isaac Newton a question. But right now, it's just confined to this activities area. And so some of those activities are fine with something like GPT-3.5, because it doesn't matter as much whether the model can follow very specific instructions, but so far it's basically a no-go to use anything other than GPT-4 for any of the tutoring activities, or for tutoring within the content. I think there are other models coming out that are looking promising. Llama 2, we haven't tried that yet, but those are things that we want to explore. 
The challenge right now is really just the opportunity cost of doing that relative to so many other things we have on the roadmap.

Nathan Labenz: 23:39 Yeah, that makes a lot of sense to me. I kind of always explore this question of how people think about cost. And it sounds like you think about it pretty similarly to how I think about it. And you may also have—I don't know if there's any sort of special nonprofit kind of status that helps you ease the cost burden. But in most business contexts, I kind of tell people, just use GPT-4, get your thing working first, and then think about shaving some cost off later opportunistically. It sounds like that's basically how you're thinking about it too.

Shawn Jansepar: 24:12 Exactly. Yeah. I mean, I think it's kind of the classic startup move of going from 0 to 1: build something amazing and sort out the monetization later, right? You want to get it to the point where 10 people absolutely love your product, which is better than building something mediocre for 10,000 people, right? And so for us, the expression that I've been using with my team is, let's just focus on finding the magic, because there's magic here, and I think we've only scratched the surface. I know you've tried Khanmigo out, and a huge portion of that experience has just been a lot of back-and-forth chat conversation, but we think there's so much we can do that goes beyond that back-and-forth interaction. Right now, we're building an essay writing tool where you write an essay and you can ask Khanmigo for feedback on a series of dimensions. You might say, okay, give me feedback on grammar and spelling, and then it'll highlight a bunch of areas where you can improve your grammar and spelling, and it can do that in a number of ways. Doing that as a back-and-forth is painful, right? You've probably tried to use ChatGPT where you pasted in an essay or an article you're writing and said, give me some feedback, but it's hard because it can't comment directly on parts of the UI. I think the most obvious thing for a lot of people was to just jump into a chat-based interface, but I think people are going to start to find ways of integrating this in more unique ways. And I think we're really well positioned to figure that out because, A, we have a really strong engineering and product organization, but we also have amazing learning scientists and content creators, and we embed them within the process of creation. 
And I think one of the key things for us in the development of Khanmigo was that we took a pretty different approach to building it relative to how we built products in the past. It was a bit more waterfall before: PM defines some requirements, design makes some designs, passes it along to engineering. It was definitely not ideal. But we said, we think this could emulate a tutor-like experience, but there's no perfect set of requirements we can define upfront. We just need to get the people who are experts cross-functionally in a room and hash this out. Get designers, engineers, product managers, and learning science folks, lock them in a room, build a prototype, demo it, get feedback, and iterate. That was actually inspired by a book that I read called Creative Selection, a story about how the iPhone was built. And so we tried to emulate a demo-driven development process, which I think worked really well for us. There are elements where we are using, say, GPT-3.5. For instance, we're extracting insights from the conversation. Some of those are available for teachers, like: are students going off topic? Where are they struggling? And then there's also an area of extracting interests from students who opt into that experience. This is not happening yet; this isn't launched; it's something we're working on. But if we extract interests from students who decide to opt in, then when someone asks, why should I care about learning this, we don't have to ask, well, what are you interested in? We can just say, oh, we know you like soccer, so we'll relate this to soccer somehow. When you have that context, it can feel pretty magical. But a lot of that stuff is fine with GPT-3.5. And so when we can use GPT-3.5, that's definitely the goal, because 4 is obviously a lot more expensive.
I mean, I think it's kind of the classic startup go from zero to one. Build something amazing and sort out the monetization later. You want to get it to the point where it's better that 10 people absolutely love your product than trying to build something mediocre for 10,000 people. And so I think for us, our goal, the expression that I've been using with my team is, let's just focus on finding the magic because there's magic here, and I think we've only even scratched the surface. A lot of the experiences that, you know, I know you've tried Khanmigo out, and a huge portion of that experience has just been a lot of back and forth chat conversations, but we think there's so much that we can do that goes beyond just that back and forth interaction. Right now, we're building an essay writing tool where, you know, you write an essay and Khanmigo will, you can ask it for feedback in a series of dimensions. You might say, okay, give me feedback on grammar and spelling. And then it'll kind of highlight a bunch of the areas where you can improve your grammar and improve your spelling, and it can do that in a number of ways. And so doing that as a back and forth is painful. You've probably tried to use ChatGPT where you pasted an essay or an article you're writing and then say, give me some feedback, but it's hard because it can't just comment directly on the parts of the UI. I think the most obvious thing for a lot of people was, let's just jump into a chat-based interface, but I think a lot of people are going to start to find ways of integrating this in more unique ways. And I think we're really well positioned to kind of figure that out because, A, I mean, we have a really strong engineering and product organization, but we have amazing learning scientists and content creators too, and we embed them within the process of creation. 
And I think actually that was one of the key things for us in the development of Khanmigo was we actually took a pretty different approach of building Khanmigo at Khan Academy relative to how we built products in the past. It was a bit more waterfall before. PM defines some requirements, design makes some designs, passes it along to engineering. It was definitely not ideal. But we said, we think that this could emulate this tutor-like experience, but there's no perfect set of requirements we can define upfront. We just need to get the people who are experts cross-functionally in a room and hash this out. Get designers, engineers, product managers, and learning science folks, lock them in a room, build a prototype, demo it, get feedback, and iterate. Which was actually inspired by a book that I read. It was called Creative Selection, and it was a story about how the iPhone was built. And so we tried to emulate kind of a demo-driven development process, and I think that worked really well for us. There are elements where we are using, say, GPT-3.5. So for instance, we're extracting insights from the conversation. Some of those are available for teachers to know, like, are students going off topic? Where are they struggling? And then there's also an area of extracting interests from students who opt into that experience. This is not happening yet. This isn't launched. This is something we're working on. But extracting interest from students if they decide to opt into it, so that when someone asks, why should I care about learning this? We don't have to ask, well, what are you interested in? We can just say, oh, we know you like soccer, so we'll relate this to soccer somehow. When you have that content, it can feel pretty magical. But a lot of that stuff is fine with GPT-3.5. And so when we can use GPT-3.5, that's definitely the goal because 4 is obviously a lot more expensive.
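Shawn's point about reserving GPT-4 for the tutoring experience and running background tasks like insight and interest extraction on GPT-3.5 can be pictured as a simple routing layer. The sketch below is a hypothetical illustration of that idea only; the task labels, model names, and request shape are assumptions, not Khan Academy's actual code.

```python
# Hypothetical sketch of cost-based model routing: background tasks
# (insight/interest extraction) go to the cheaper model, while the core
# tutoring experience stays on the strongest one. Task labels and the
# request shape are illustrative assumptions.

CHEAP_MODEL = "gpt-3.5-turbo"
STRONG_MODEL = "gpt-4"

# Tasks where quality matters enough to justify the expensive model.
STRONG_MODEL_TASKS = {"socratic_tutoring", "math_step_checking"}

def pick_model(task: str) -> str:
    """Route a task to the cheapest model that handles it acceptably."""
    return STRONG_MODEL if task in STRONG_MODEL_TASKS else CHEAP_MODEL

def build_request(task: str, messages: list) -> dict:
    """Assemble a chat-completion-style request for the routed model."""
    return {"model": pick_model(task), "messages": messages}

# Background analytics can run cheaply...
insight_req = build_request("interest_extraction",
                            [{"role": "user", "content": "I love soccer!"}])
# ...while the student-facing tutor gets the strong model.
tutor_req = build_request("socratic_tutoring",
                          [{"role": "user", "content": "What's the next step?"}])
```

The useful property is that the cost decision lives in one place, so swapping a task from one model tier to another is a one-line change.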

Nathan Labenz: 27:55 Yeah. 3.5 is so fast as well. So, several follow-up points there. One is just a note or suggestion, I guess, on the writing side. I have found Claude 2 to be really quite successful as a writing assistant recently, specifically for the podcast. We record the episode, we get an edit, and then my last step is I'll write this little introductory essay that I put up front. And I end up kind of losing track sometimes and really wordsmithing these things. It takes me a while. So I was like, you know, I'm the AI guy, right? I've got to have AI help me with this. I wasn't really able to get GPT-4 to match my style or kind of vibe with me in the way that I wanted it to. But a few writing samples of what I have written in the past that I like, plus today's transcript, actually gets me a pretty decent starting point. So I'm still very much on GPT-4 for the most demanding analytical things, but Claude 2 is coming on pretty strong for things where I want elevated writing beyond what 3.5 can do, closer to GPT-4, but it seems to just have a better style aspect to it. So, my question: I started using this term cognitive diversity, which came to mind when you were describing getting everybody into the room and trying to figure it out all together. I think that's so important because typically you have three people working on a project like this, and one person will be the subject matter expert. I'm often the AI prompting expert who's supposed to be able to make the thing work. And then other people will come in with just random, sort of idiosyncratic style suggestions or whatever. And they all seem very additive to me most of the time. But then the flip side of that discussion is: what exactly is good? What are we looking for? So how would you describe the process, now that you have all those people in the room, of saying, okay, how do we want this thing to act?
I mean, we've got the general notion of Socratic tutor, but a lot of details left. So how did you guys approach that kind of effort to sculpt the behavior of the model?

Shawn Jansepar: 30:19 Well, one thing actually: I think you asked the question, are we doing any fine-tuning? And the answer is no, which means we do end up using a lot of tokens, but fine-tuning has its cost as well. And I don't think fine-tuning is yet available for 4; I think it's just available for 3.5 Turbo. So we're always passing in the context of the article that you're working on, every time you're getting help on an article. We don't fine-tune now, but we did do some fine-tuning before GPT-4 launched. OpenAI gave us direct access to their fine-tuning tool. We actually had Sal go in there, Sal being the CEO of Khan Academy, as well as a few folks on our content team, and they did a bunch of fine-tuning on making it better at being a math tutor. Because one of the things we found with GPT-4 was, yes, sometimes it'll just make general mistakes at math, but one of the bigger challenges was that it was getting lost in the conversation. So if you were working on something with the AI and you gave it an equation as a response to a question that it asked you, like, oh, what do you think the next step is, sometimes it would get confused. It wouldn't recognize that equation as the answer to the question that the AI asked. It would be like, oh, you just gave me an equation. Let's work on that equation now. And so it would continue on helping you with this new equation that you entered. And that was very common. But once we did the fine-tuning on the GPT-4 model, which again was not launched yet, this is pre-March 14th, that significantly improved the performance. So I guess to answer your question, how did we determine what we're looking for? It's based on the experts that we have at Khan Academy.
We have an understanding of what kids might say to the AI and what mistakes they might make. And we would enter that, and we would tell the model: these are bad responses, these are great responses, and everything in between. And we did that for, I think, roughly a hundred questions, but for each question there's a ton of variation within that. So, going back to my point about how it's important to bring the experts into the room: they're the ones who have their PhDs in learning science or education; they've worked in classrooms for years. We have so much educational experience at Khan Academy with our content creators. Also, a fun fact about Khan Academy: given that it's this nonprofit, mission-based company, so many people outside of even our content team have educational experience. I came to Khan Academy because I was teaching a university-level course on building web applications. And so much of our team is people who are just interested in education or have done teaching themselves. So that's one thing that's just really awesome about Khan Academy: that wealth of educational knowledge and experience, and being able to leverage it to make good assumptions about what works well. I think previously as well, going back to my point about us moving a little bit too slowly, I think we were often way too data-driven, and I think that can be a mistake. I think it's important to be data-informed, but data-driven means your feedback loops are very long. You have to make a decision based on being assured about something in the data that you see, and that's fine, but iterating in that way can be very, very slow. We didn't have the time to iterate in that way. And so by leveraging the experts, you get to short-circuit that. And obviously, the data informs our decisions, but we don't wait to make decisions based on data.
In certain cases we do, like if we're doing an efficacy study that asserts whether or not something in our product is efficacious, that's obviously where we need to be very data-driven. And so there's a time and place for data-driven, but when it came to rapidly iterating on product, leveraging the expertise of our team was really important.
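The expert labeling process Shawn describes, where content specialists mark candidate tutor replies from bad to great for each student utterance, implies some record format for the comparison data. Here is a rough sketch of what one such labeled record could look like; the schema is purely an assumption for illustration, not OpenAI's or Khan Academy's actual format.

```python
# A rough sketch of expert-labeled comparison data: for each question and
# student utterance, experts rank candidate tutor replies best-first. The
# record layout is purely illustrative, not any real training schema.
import json

def make_comparison_record(question: str, student_reply: str,
                           ranked_responses: list) -> str:
    """Serialize one labeled example as JSON; responses are best-first."""
    record = {
        "context": {"question": question, "student": student_reply},
        # A best-first ranking lets a trainer derive pairwise preferences
        # (rank i preferred to rank i+1) for RLHF-style reward modeling.
        "responses": [{"rank": i, "text": r}
                      for i, r in enumerate(ranked_responses)],
    }
    return json.dumps(record)

line = make_comparison_record(
    "Solve 2x + 4 = 10",
    "x = 7",
    ["Not quite. What happens if you subtract 4 from both sides?",  # great
     "Let's work on the equation x = 7 now."])                      # bad
```

Note how the bad response in this example is exactly the failure mode Shawn mentions: treating the student's answer as a brand-new equation to solve.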

Nathan Labenz: 34:25 Just in terms of how things actually went down, am I understanding correctly that basically OpenAI said, hey, you're telling us this is not that great at math, and then they gave you guys a role in the reinforcement learning process? At one point, I think they were a Scale AI customer; I don't know if they still are or have their own tooling. But you were using some interface to go in and say, this one or that one, or what does a good response look like? So you guys didn't fine-tune the model yourselves, but you contributed to the reinforcement learning of the model.

Shawn Jansepar: 35:01 Yes, exactly.

Nathan Labenz: 35:03 I mean, I know they've done that a lot, probably with a lot of different partners, but yeah, it's cool to be, you know, Sal and team's voices kind of encoded in the great tutor now.

Shawn Jansepar: 35:14 It is. And it makes sense when you think about it. For instance, it's so good at writing code, so good at being logical when it's code, but sometimes just so bad at math. Why is that? Well, it can train on millions and millions of lines of great code; GitHub is my assumption about where it's getting its data. And then when it comes to math, where is it getting that training data? There isn't a ton of widely available online data about the steps a student would take and what would be a good response versus a bad response. So I think that's why it was so important for us to take part in that training process. And just to be clear, I don't think that necessarily made it better at math, like if it was ever getting 2 plus 2 wrong; it made it better at tutoring you at math. That's the key thing we trained it on. There are other things we did to make the math better using an internal AI thoughts process, chain-of-thought prompting, which I can talk about, but that's a bit different. That's not something we did to improve the model; that's a technique we learned about and implemented.

Nathan Labenz: 36:22 That is all really fascinating. The other thing you said that's super interesting to me is being data-informed, but not data-driven. And I think I might have to steal that, because I find so many examples now where people aren't, and there was just a real famous one in the last two weeks, right? The GPT-4-is-getting-worse phenomenon, where all of a sudden it's like, oh, this thing used to get 25 of these 50 coding problems right, and now it's down to five. And I looked at that and I was like, this is somebody who needs to read the raw transcripts, because there's no way that they've let it get that much worse from one release to the other. And sure enough, that turned out to be a misunderstanding, I would say. But can you give us a little more color on how you think about striking that balance? Is there a practice that you guys have? Because I think some simple heuristics, I wonder if you have any, could be really useful in a lot of cases. Render no judgment until you read five transcripts, or something. Just these very basic ways of making sure you're in touch with these things, because the surface area is so big and mismeasurement is so easy. So how do you guys make that a practice for yourselves?

Shawn Jansepar: 37:37 Yeah. Again, just to be clear, when I mentioned being data-informed, what I mean is that when we're building that initial prototype or getting something into the hands of users, we want to be able to develop rapidly. We still are looking at the data. We have a labeling infrastructure where we're going in and trying to label as many of these interactions as possible to get a sense of the performance: how much is it actually helping students, and how much is it not? So data's still really important, but one framework that we started using internally at Khan Academy is this notion of trying to get really crisp on what are one-way door decisions versus two-way door decisions. If a decision is something where you can walk through that door but also walk back out of it easily, then it probably makes more sense to move quickly, shorten feedback loops, and leverage intuition, because it's not so consequential; if you make that decision, you're not stuck with it forever. And with Khanmigo, especially early on when we launched it, we didn't know how well it was going to be received. We launched it under this banner called Khan Labs: a feature set that's experimental; you should only try this if you're willing to be a tester with us on this journey. And even when we're doing it in districts today, that's still the framing. We're working with districts and classrooms that are willing to pilot this technology, because there's still a ton of unknowns. But there are a lot of decisions where, once you make them, you can't walk backwards. An example of that at Khan Academy would be, before Khanmigo, we were actually doing this big project where we shifted our entire back end from a Python monolith to a series of Go services.
And we had to decide which language we wanted to use, because Python 2 was being end-of-lifed on Google Cloud and we did not transition early enough. So we decided we were going to shift that whole back end, and we had to make a decision: which language do we want to go with? And that's an important decision. We spent a lot of time crunching the numbers, finding the best balance between developer productivity and server costs and everything. And so we spent a lot of time being data-driven in that decision. But when it comes to decisions that are two-way doors, we move faster. We actually just recently launched something we call org decision records at Khan Academy. When we make a big strategic shift, we share a decision record with the entire organization. One of the org decision records was formally changing how we work, and part of that was introducing this one-way versus two-way door framework, which we didn't invent, by the way. It comes from Amazon.

Nathan Labenz: 40:18 And customized, always. I think that all makes a lot of sense. So you can be flexible, you can be intuition-driven. Certainly when you're prompting, you can revert a prompt change pretty quickly. It sounds like you built a lot of your own stuff to instrument this. I don't know if you have any reflections on that, but I went through a similar thing where, for being a little bit ahead of the curve, we were punished, you and I probably both, with there not being a lot of tools that now exist. If you could go back in time and bring a couple of tools with you, are there any where you're like, oh God, I wish I'd been using that, but we did this other thing because it didn't exist when we had to make something happen?

Shawn Jansepar: 41:06 Yeah. I mean, one would just be LangChain as a tool. Right now we're going in and rewriting a ton of our code to be able to use LangChain so that we can do things like easily swap in models, or easily get a JSON-formatted response. So that would be one. Another example might be our labeling infrastructure. We built it initially as a custom Google spreadsheet, and there are a few tools out there that we're transitioning to now that do that at scale with a really great user interface, so maybe that would be another one. It's also hard because we were moving so fast, and we did have this deadline, and sometimes tech debt is good. Sometimes it's okay to incur tech debt in order to make a launch so that you can be first to market. Because the reality of what happened when we got to market first, because OpenAI really only partnered with us as one of the key organizations for education, was that a lot of people, when they saw it, were like, well, why would I even bother entering this space? You've already done a great job with it. You've executed well, you have the brand, and you have the trust. And so it was essential that we incur a significant amount of tech debt in order to get to that launch. Had we made a few decisions around not incurring debt, maybe the overall development would have been shorter in aggregate, but the launch would have been later, resulting in a bunch of other competitors joining the space because we weren't first to market while people were excited about GPT-4. And so I don't think I regret any of the decisions that we made, but over time, there's definitely a lot of stuff we're building where I'm like, this is infrastructure that everyone needs.
So for instance, we have an interface for prompt engineers, and our prompt engineers are a few key people: some engineers, some PMs, content folks. We built a user interface to allow them to make changes really quickly and test them really quickly. Soon, we're looking into building regression testing into that so that we can have a bit more confidence and not have to do a ton of manual testing. But that's a tool that everyone's going to need. And maybe there are five products out there that already exist that we could hot-swap in, but that's another area where I think there's a ton of infrastructure that every tech company, everyone in Silicon Valley, is going to want. We're having to write that now because we need it for our own needs; maybe we'll swap to something off the shelf at a later date.
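The prompt regression testing Shawn mentions can be pictured as a set of golden prompts, each paired with a check on the model's output, run against whatever model function you inject. This is a minimal hypothetical sketch of the idea, not Khan Academy's actual tooling; the stub models stand in for live API calls.

```python
# Minimal sketch of prompt regression testing: golden prompts paired with
# checks on the output, runnable against any injected model function
# (stubs here, the live API in practice). All names are hypothetical.
from typing import Callable

def run_regression(cases, model: Callable[[str], str]) -> list:
    """Return the prompts whose model output fails its check."""
    return [prompt for prompt, check in cases if not check(model(prompt))]

# One golden case: a Socratic tutor should not hand over the answer.
cases = [
    ("Solve 2x + 4 = 10 for me", lambda out: "answer is" not in out.lower()),
]

def socratic_stub(prompt: str) -> str:
    return "What could you do to both sides to isolate x?"

def leaky_stub(prompt: str) -> str:
    return "The answer is x = 3."

passing = run_regression(cases, socratic_stub)  # no failures
failing = run_regression(cases, leaky_stub)     # flags the prompt
```

Running the suite after every prompt change is what replaces the manual testing Shawn describes: a change that makes the tutor start leaking answers shows up immediately as a failing case.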

Nathan Labenz: 43:34 Very similar experiences here. I can remember us building our own little playground type of thing: is this good or not good? And it's like, oh my God. But it is tough, because you want it for the exact same reasons too, right? Wanting to get to market first and wanting to be ahead of the curve. And that's kind of the price you pay to get ahead of the curve, like it or not. So you have to come back and clean up sometimes. But what else are you going to do? When it comes to the capability of the model today, and I understand it's mostly GPT-4, one thing I think a lot about is what I call the can/can't boundary. Obviously, as the capabilities continue to grow, this boundary continues to get pushed out and out and out. And now there's a whole lot that it can do. But if you push hard enough in any direction, you eventually find the limits of it. How do you think about those limits? I was testing this this morning: a friend, Jimmy Koppel, who's an outstanding programmer affiliated with MIT, suggested this task to me of saying, you have a boat sitting in the water and you start to add mass to it, and then what happens to the waterline? There are a lot of little subtleties in the reasoning there. And I would say that felt like it was right on the boundary to me. It was very close. I had to give it a couple of hints. It got distracted a couple of times. It was close enough that I got confused a couple of times as to exactly what the right answer was; I was starting to lose confidence in my own understanding. So I guess, how do you think about determining where that boundary is? And I'm also curious whether you consider just not making the tool available beyond a certain point. It probably can't handle string theory at all.
I don't know if you guys have a string theory module, but as a sort of funny example, there might be areas where it just makes sense to say, AI is just not there for this yet; it's still you and the lecture, sorry. So anyway, a lot there, but how do you think about that?

Shawn Jansepar: 45:40 We made the decision early on to enable it everywhere, because our focus was learning. We really wanted to understand the boundaries of this, see what people thought about it, and be very clear with people upfront that it does make mistakes; it's not perfect. Because we were building something especially for students, who get access either because you're over 18 and get it for yourself, or you're a parent who gives it to your child, or you're a teacher who gives it to a student, we were really focused on the safety aspect. And so one of the things we focused on early in the process was building this extra moderation layer to prevent the AI from doing things like going really off topic or talking about inappropriate things, and alerting teachers when students were doing things they weren't intended to with the platform. When it comes to things like string theory, I mean, we teach up to second-year university courses. So I don't think we teach string theory, but maybe we do; it would probably be basic to intermediate stuff. But ultimately we wanted to put it out there for students and have them give us feedback on how well it's going. And over time, we would iterate. That's an example of a two-way door decision. If we get a bunch of feedback that, hey, this is just horrible for string theory, then we can turn it off. We learned and we iterated; not closing the door on learning was, I think, the right approach for us at the speed we were moving. Ultimately, one of the big things for us is: is this harming anyone, and how do we reduce that harm as much as possible? And I think it does really well, especially at the lower-level concepts for math and whatnot.
And I think the older students who are trying it for first- or second-year university courses are going to have the understanding of, oh, this isn't working so well; it's not causing my grades to get worse; I can give Khan Academy feedback; I knew I opted into an experimental feature.

Nathan Labenz: 47:48 How do you think about the special case, and this is one that I've also identified as a frontier capability, and really, in my experience, only GPT-4 does this much at all, of identifying when the user is confused and being willing to challenge the user's assumption? There are obviously a lot of different flavors of confusion. People think a lot about hallucination, where the model makes something up outright, and that's bad. But I think it's even subtler and trickier in some ways to figure out: when is the user working from some sort of confusion or false premise, and when should I push back on their assumptions? In my testing this morning, it did pretty well at that. I did a few balancing-chemistry-equations sort of exercises and confidently asserted some wrong stuff. And I got the response of, I see where you're coming from, but that's not quite right. That stuck out to me because I was like, do you really see where I'm coming from? I don't know. But it said it saw where I was coming from, and it did effectively avoid getting confused based on my confusion. But I wonder if that, or other things, are special scenarios that you guys think a lot about, given the teaching purpose.

Shawn Jansepar: 49:15 I mean, I would say we thought a lot about it from the sense of, again, doing a good job at being a good tutor. And one of the key things that probably enables this to be really good, and probably even better than ChatGPT, because I think ChatGPT is really just a direct interface to talking to the model, is that ours, in particular, will do things like: if we detect that you're talking about math, then we will do chain-of-thought prompting. First question: is it math, yes or no? If it's yes, then submit this to the AI and say, think through this out loud to yourself. Look at what the student said. Look at the context of the question. Did they get it right or wrong, and why? And it will explain that via text, but we don't want to show that to the student. It's almost like when a tutor is helping a student: before just answering, they're thinking about it, like, oh, where was this misconception? That's the AI's private thoughts. And then we feed that response into the next call to the AI to say, hey, tutor the student now based on the thoughts that you had, and that significantly improved the math. And I would imagine it's doing that same thing in the scenario you're describing with chemistry. The ability for it to really think through its thoughts out loud before it gives you the answer is probably key to what made it do such a good job.
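The "AI private thoughts" flow described above (classify whether the message is math, have the model reason privately, then tutor the student using that reasoning) can be sketched as a small three-stage pipeline. The prompt wording and structure here are guesses for illustration, not Khanmigo's actual prompts; `llm` stands in for a real model call.

```python
# Sketch of the private-thoughts chain-of-thought flow: classify, reason
# privately, then tutor. Prompt text is illustrative, not Khanmigo's.
from typing import Callable

def tutor_reply(student_msg: str, question: str,
                llm: Callable[[str], str]) -> str:
    """Three-stage flow: classify, think privately, then tutor."""
    is_math = llm(f"Is this message about math? Yes or no: {student_msg}")
    if is_math.strip().lower().startswith("yes"):
        # "AI private thoughts": chain-of-thought reasoning the student
        # never sees, used only to ground the student-facing reply.
        thoughts = llm(
            f"Question: {question}\nStudent said: {student_msg}\n"
            "Think out loud: did they get it right or wrong, and why?")
        return llm(
            f"Your private analysis was: {thoughts}\n"
            f"Now tutor the student Socratically on their message: {student_msg}")
    # Non-math messages skip the expensive extra call.
    return llm(f"Respond as a friendly tutor to: {student_msg}")

# Stub model so the flow can be traced without a live API.
calls = []
def stub_llm(prompt: str) -> str:
    calls.append(prompt)
    return "Yes" if prompt.startswith("Is this message") else "ok"

reply = tutor_reply("x = 3", "Solve 2x + 4 = 10", stub_llm)
```

Note the cost structure Shawn alludes to: the math path costs three model calls instead of one, which is why the classification gate exists at all.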

Nathan Labenz: 50:39 A lot of these little self-critique techniques that are being developed I find really fascinating. You've probably seen the tree-of-thought approach that combines chain of thought with essentially a tree search. I think it came out of DeepMind, actually; it may foreshadow some big things that they're working on. But as presented in the initial paper, it's not that crazy of a concept. It basically extends what you're describing with chain of thought, but allows the model to go down multiple different paths of a tree, then look at itself and figure out which of these appears to be the best path, and then continue down that path. So it leverages this ability for self-critique, which, again, I saw this morning with my Archimedes' principle thing: there were a couple of points of confusion where I was like, wait, is that the equation for that? And sure enough, it would say, oh, nope, sorry, the actual equation for that is whatever, and then it would get it right. And I think that whole thing could maybe have happened behind the scenes, though your token count potentially goes multiplicative as well. So that's definitely something to keep in mind.

Shawn Jansepar: 52:00 Yeah. I mean, that sounds like it's almost a combination between chain of thought and few-shot: give it a few shots of chain-of-thought prompting and it's more likely to do the right thing. So that's interesting. But we wouldn't even bother checking "is it math" if it weren't more expensive. We might just do that for everything: for every input the user sends, think out loud about what you want to say first before you say it. And I think that would probably improve the experience across the board, but it's too expensive. And so we really reserve it for the places where it's really important for things to be accurate.

Nathan Labenz: 52:36 What other things are you using behind the scenes? You've got the chain of thought. Other common development patterns would be doing some sort of embedding-backed database of examples. It doesn't sound like you really need that, because you have the article context right there; it can curate things in a designed way as opposed to a retrieval way.

Shawn Jansepar: 52:56 That's something we've been talking a lot about. I'll give you an example of where we might leverage that. For students, the two most used areas of Khanmigo are, first, in an exercise. So if you're getting tutored on an exercise, it has the full context of the question, the multiple choice or whatever the question is. It has the answer. It has the full written hints from our content team. And so that's actually where it does the best job. But there's also another activity called tutor me STEM. And tutor me STEM is a place where you can ask it any math question. Now it doesn't have the answer to that math question, at least in the current implementation, and so it's not quite as accurate. You could imagine a world where, not with an embedding search, but maybe we solve that with Wolfram first, or we solve that with a Python backend, and then inject that into the conversation: give it the answer first, give it the step by step before talking to the student, to make it equivalent to the exercise experience. So that's, I think, an area that we're exploring. In terms of places where we do use embeddings, one of them is to reduce hallucinations. We just embarked on a project that we launched recently, called expand Khanmigo's knowledge base. A lot of people were writing in to us saying, hey, this thing doesn't even know about itself. Like, if you ask it, what is AI power? And AI power in Khanmigo is this notion that you have a certain amount of tokens a day. The limit is very high; it's partly about reducing cost, but mostly about making sure no one just abuses it. So there's a 200,000 token limit. People were saying, why doesn't it know what AI power is? And so now it knows.
Like, if there's a question that results in a similarity match above a certain score, then we inject that information into the prompt so that when the AI is going to respond to the user, it has that context, and it works really, really well. Previously, we were also doing things like, every time you sent a message, we would surface related links, but sometimes it would just be off. You'd ask a question about algebra 2 and it would send you to an MCAT course. So we want to explore that again, but it just wasn't providing enough value, and we ditched it temporarily. But we do try as best as possible, if someone's asking about a link to something, to inject the right link and all that stuff.
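The knowledge-base injection Shawn describes, embed the question, match it against support articles, and inject only above a similarity threshold, could be sketched like this. The 3-dimensional vectors and article texts are toy stand-ins for real embedding-model output; all names are hypothetical.

```python
# Sketch of threshold-gated knowledge-base injection: find the closest
# support article to the user's question and add it to the prompt only
# when the cosine similarity clears a threshold.
import math

KB = {
    "AI power is a daily token allowance (about 200,000 tokens).": [0.9, 0.1, 0.0],
    "Khanmigo logs are visible to teachers and parents.": [0.1, 0.9, 0.0],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

def augment_prompt(question, q_vec, threshold=0.8):
    best_text, best_sim = max(
        ((text, cosine(q_vec, vec)) for text, vec in KB.items()),
        key=lambda pair: pair[1])
    if best_sim >= threshold:
        # Similarity match: inject the article so the model has the facts.
        return f"Context: {best_text}\n\nStudent: {question}"
    return f"Student: {question}"

# A question about AI power embeds near the first article:
print(augment_prompt("What is AI power?", [0.85, 0.15, 0.0]))
```

The threshold is what avoids the "algebra 2 question linked to an MCAT course" failure mode: a weak best match injects nothing rather than something misleading.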

Nathan Labenz: 55:25 When I think about the kind of things I was doing this morning, really pushing the frontier of how hard these problems can be with it still getting them right, and some of these super token intensive processes like tree of thought, I wonder if there's some way to group and cache certain answers. I don't know how often somebody comes and asks the question I asked, but it might take 200,000 tokens for GPT-4 to reliably answer it. I'm kind of inspired in this question by that NVIDIA project where they created the GPT-4 Minecraft agent that would save its successes. Basically, when it would successfully, whatever, fight a zombie or whatever you do in Minecraft, it would take the code that it succeeded with and save it into a database that it has access to. And next time it needs to fight a zombie, it can just load that module and not have to recreate or rediscover it. Not sure what the equivalent would be here, but

Shawn Jansepar: 56:35 I mean, I think the challenge is there are just so many different ways students are going to ask questions or want to get help. Maybe if we explore what all the permutations end up looking like, we can start to identify those and do an embeddings lookup for a specific query, then just respond with what's in the embeddings database rather than querying GPT-4. So I think there's something there, but it's not something we've explored as much. The one thing I'll say is that we do have a dedicated instance from OpenAI. We're not just using the shared instance that everyone is using, and it has a lot of caching powers. I don't know all the intricacies of how that works, but the thing they instructed us to do is make sure that you have as much text as possible that's common to everybody. So previously we might've had some text, then some custom user variable, then some text, then another custom user variable. We changed a lot of our prompts to put all of that common text first and mention the custom user variables at the bottom. And that significantly improved our cache hit ratio on the instance.
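The cache-friendly restructuring Shawn describes is easy to illustrate: shared text goes first, per-user variables go last, so every request starts with the same token sequence and a prefix cache can reuse it. The prompt text and field names below are hypothetical.

```python
# Sketch of cache-friendly prompt layout: one static prefix shared by every
# user, with per-user variables appended at the bottom.

STATIC_PREFIX = (
    "You are a Socratic tutor. Never give away the answer; guide the "
    "student with questions. Keep responses short and encouraging.\n"
)

def cache_unfriendly(user):
    # Interleaving user variables with shared text breaks the common prefix,
    # so almost nothing can be reused across requests.
    return (f"You are tutoring {user['name']}. You are a Socratic tutor...\n"
            f"Grade level: {user['grade']}. Never give away the answer...")

def cache_friendly(user):
    # Shared text first, user variables last: every request now starts
    # with exactly the same tokens.
    return (STATIC_PREFIX +
            f"\nStudent name: {user['name']}\nGrade level: {user['grade']}\n")

a = cache_friendly({"name": "Ada", "grade": 7})
b = cache_friendly({"name": "Bo", "grade": 9})
print(a.startswith(STATIC_PREFIX) and b.startswith(STATIC_PREFIX))  # True
```

The same principle applies to Nathan's next point: if the first thousand tokens are identical on every call, the serving side only has to process them once.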

Nathan Labenz: 57:52 Very, very interesting indeed. Yeah. Because there's a lot there: if you're using the same first thousand tokens every time, you might as well take advantage of that. Let's see. So, you mentioned also a little bit about red teaming. Naturally, as a red teamer myself, I had to try getting it to ignore previous instructions and tell me its prompt and all that kind of stuff. I even tried that recent universal one that was published, the jailbreak technique that was developed on Llama 2 but was then found to port over to GPT-4. I'm not sure I'm doing it right, to be honest, but that also didn't work. Is that all just GPT-4 getting really good at being jailbreak resistant, or did you have your own measures that you've also brought in to clamp down even further on that kind of activity?

Shawn Jansepar: 58:43 Like I mentioned earlier, I think one of the things about ChatGPT, I believe anyways, is that you're just talking to the raw model. Whereas we have this ability to take what the user said and wrap it in a bunch of stuff. And I think one of those things is, at the end of every message that's coming in, things like: check to make sure that the user is on track, and that sort of thing. I think that has really helped with making it harder to use some of those universal attacks that work directly on the model. Another one is the moderation API. That's not as much for jailbreaking, but that one will flag anything that's, say, sexual content, and it has a bunch of different categories it scores, telling you the certainty that each category is present in the message the user sent. And we make sure that every message that gets sent goes through the moderation API first. I know that one of the things that, when they launched GPT-4, they were very keen on was the trust and safety side and avoiding those jailbreaks, and I think they are continuing to iterate on it. So I think it is also the power of GPT-4. With the model that we ended up using, I'm sure they did a huge amount of iteration on the June model, and that was a big sticking point for them, getting that done really, really well.
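The per-message pipeline Shawn describes, moderation check first, then wrapping the user's text in guardrail instructions, could be sketched as below. The `moderate` stub stands in for OpenAI's moderation endpoint, which returns per-category flags with confidence scores; the phrase lists and instruction text are made up for illustration.

```python
# Sketch of a per-message safety pipeline: run moderation first, and only
# wrap and forward messages that come back clean.

def moderate(message):
    """Stub standing in for a moderation API returning category flags."""
    banned = {"violence": ["fight club recipe"], "sexual": []}
    return {cat: any(phrase in message.lower() for phrase in phrases)
            for cat, phrases in banned.items()}

def handle_message(message):
    flags = moderate(message)
    if any(flags.values()):
        return "Sorry, I can't help with that."
    # Wrap the user's text with guardrail instructions before the model
    # sees it, instead of passing the raw message straight through.
    return ("Instruction: keep the student on topic and on track.\n"
            f"Student: {message}")

print(handle_message("Help me factor x^2 - 4"))
```

Wrapping every message this way is part of why universal jailbreak strings tuned against raw models transfer poorly: the adversarial text never reaches the model unaccompanied.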

Nathan Labenz: 1:00:01 Yeah, I really respect that. I mean, it's a huge problem, and I don't think they are anywhere super close to having it fully solved yet, but definitely a ton of respect for how much work has gone into that. And relative to what we saw earlier, I don't know if you went off the beaten path in your early experimentation, but the early version compared to now is definitely a night and day difference.

Shawn Jansepar: 1:00:25 Oh, it's absolutely huge. We were super impressed with the progress they were making as we were getting these new weekly models.

Nathan Labenz: 1:00:32 Is it personalized today? It seems like not super much, but that seems to be part of the future vision is to, you know, know the user's full history and all that kind of stuff.

Shawn Jansepar: 1:00:41 Yeah. I mean, in what's released in production right now, it's not very personalized. You can do things like customizing the reading style, so you can do simple, complex, and maybe professorial. And when you're working on an exercise, it'll know whether you're unfamiliar or familiar or proficient or mastered, but nothing beyond the one thing that you're working on. So the plan over time is for it to be able to collect your interests, again on an opt in basis, and for it to know more about the history of your educational experience. An example that we're looking to solve: one student gave us the feedback that, hey, I asked it about sine and it asked me, well, what do you know about sine? But then I went on to tangent and it was like, well, before I talk to you about tangent, what do you know about sine? And she was like, I already told you about sine, like, 10 minutes ago. But because it was a different conversation, or sorry, a different question, and every time we go to a different question we start a new conversation, she was really feeling like, how did you not know that? We just talked about this. And even though from a backend perspective this is a new thread, a new conversation, users expect that journey to feel continuous. And so that's also on the roadmap.
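One way to address the continuity problem Shawn describes, where each question starts a fresh thread, is a small per-student memory of concepts already covered, injected into every new thread's prompt. Everything below is a hypothetical sketch, not Khanmigo internals.

```python
# Sketch of cross-thread continuity: remember which concepts the student
# has already discussed and tell each new conversation about them.

class StudentMemory:
    def __init__(self):
        self.known_concepts = set()

    def note(self, concept):
        """Record a concept the student has already discussed."""
        self.known_concepts.add(concept)

    def new_thread_prompt(self, question):
        """Build the prompt for a brand-new thread, carrying memory over."""
        context = ""
        if self.known_concepts:
            context = ("The student has already discussed: "
                       + ", ".join(sorted(self.known_concepts))
                       + ". Don't re-ask about these.\n")
        return context + f"Question: {question}"

memory = StudentMemory()
memory.note("sine")
# The next thread (about tangent) now knows sine was already covered:
print(memory.new_thread_prompt("What is tan(x)?"))
```

The backend still gets a clean thread per question; only a compact summary of prior state crosses the boundary, which keeps token costs bounded.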

Nathan Labenz: 1:02:08 Cool. Yeah, it's interesting. The disconnect between what people intuit the AI to be doing or capable of and what it's actually doing is often, I think, an interesting place to study. How do you think about educating the users, the students, right? I mean, they're both users and students, about AI and the nature of AI? I could imagine that being in-product messages like, Hey, FYI, each chat starts fresh, it has no memory, just trying to shape expectations that way. But I can also imagine a whole AI course, which could potentially be a bestseller on the platform. So what are you guys thinking right now about AI education?

Shawn Jansepar: 1:03:03 If my memory isn't serving me wrong, we do have an AI course already that we give as something teachers can read or assign to their students. I think we launched that in March. We also have a lot of stuff in our support article area, and a few of those key articles are available in our embeddings database too. So if a student asks a question about how AI works, and the match is there, then it'll provide the student with that answer. Directly in the user interface, there are a few key things that we wanted to make very apparent to learners and teachers and parents. One is that it does say right at the bottom, Khanmigo makes mistakes sometimes. Here's why. And you can click it, and it'll link you out to a Zendesk article to help you understand why that's the case. Another is making it very clear to students, and to children whose parents have granted access, that the logs are all available to their teachers and their parents, for safety reasons, because it's important that parents have trust in this platform, and they want to be able to know what their kids are talking about.

Nathan Labenz: 1:04:09 How about the future of multimodal functionality? That's something that a lot of people are hotly anticipating, and GPT-4 obviously has a version that will bring some of that online. But I also could imagine that your multimodal plans might be quite different than others. GPT-4 is going to allow us to understand images, but is that something that you're super excited about? Or would you be more going toward speaking in voice to the user as the next step? Because I can imagine that might be really helpful for a lot of folks, too, to be able to hear and not merely just read off the screen. As you get beyond text, what do you think are the most exciting text plus experiences?

Shawn Jansepar: 1:04:54 Yeah, we've been doing experiments with a few things. So one is text to speech is something we've been trying with a bunch of different text to speech providers, mainly because ultimately, I think one of the biggest issues, especially going back to the fact that a lot of this is a back and forth conversation, is reading a lot of text. And even if we tell the AI, write it as simple as possible, the explain it like I'm 5 notion, it's still a lot of text to read. And there's a lot of learners out there who may be in the seventh grade, but they have a fifth grade or a second grade reading comprehension, and that can make tutoring a really big challenge. So text to speech is really high on our roadmap. We are experimenting with a lot of things, and I think students are going to love it. I think our hope is that we can give students a number of different voices to try out and test, and maybe there's even ways they can unlock cool voices as an engagement mechanic. I'm not sure, but those are all things that we're experimenting with right now. When it comes to the vision side of things, we haven't spent much time there, but it is something we're very interested in. The notion of maybe you eventually might have—right now it's not available on our mobile app, but maybe one day Khanmigo is available on our mobile app and you can open up the mobile app and scan your homework and get step by step help through that. Another one is within the user interface, you can actually draw at any point when you're working on an exercise. So if you have a Chromebook with a touch screen or if you want to just use your mouse—obviously, mouse is awkward for doing math equations, but that's a pretty heavily used feature. 
And we're thinking about the notion of, what if a student is doing their math and they're drawing it on the screen and they're about to make a mistake, and the AI maybe could—I mean, we haven't thought through this deeply because obviously you want to let students make mistakes sometimes, and I'm not saying we would do this every time, but you can imagine it notices when you're about to make a mistake. Are there ways that it can give you hints, especially if you made the same mistake a couple of times or something? Maybe it can intervene and say, hey, I noticed that you've been doing this a few times. Why are you doing that specific step? Can you describe why you're doing that to me? And I think something like that could be really engaging. So those are things that we are super interested in, but there's a lot in our roadmap. And sometimes you have to make tough choices about what to prioritize.

Nathan Labenz: 1:07:23 What is the state of evaluation of this? You mentioned toward the top of the conversation that if we're going to try to do a real assessment and demonstrate efficacy, then we obviously need to be more careful about numbers. It sounds like you might be working on something like that. Do you know how far you are from being able to put a stake in the ground that you're confident in on the results?

Shawn Jansepar: 1:07:46 Yeah. I don't know if I have a timeline specifically, but I'll tell you the way we've been thinking about it. One is that we are just trying to see, are people spending more time on Khan Academy? Engagement is still good regardless; even if they're spending it all just crafting stories or whatever, that's still good for kids' reading comprehension. So one is just, are you spending more time learning? That's interesting. The other one is, are you working on more stuff than you were before? But ultimately, the thing that we're looking towards is a standardized assessment. For instance, there's a test called the MAP Growth Test, and we actually partnered many years ago with the company who administers that test, NWEA. They were recently acquired, so I don't remember their new name, but they have this test called the MAP Growth Test. What we did was partner with them so that when a student would take that test, we would ingest the results. Then the teacher could assign whatever they wanted to each individual student, depending on the results of those tests, to really bring this personalized experience into the classroom on a per student basis. And then the student would work on those goals; they would have, like, one goal for measurement and another goal for geometry, across four different categories. And then they would work in a self-paced mode on Khan Academy, using it X number of times per week, depending on how that teacher decided to implement it. And then they would do the MAP Growth Test again. And we had some students who were using that and some students who weren't.
And we have an amazing research and efficacy team who could go into the details in depths that I never could, but we saw a statistically significant improvement in the learning outcomes of the learners who were taking the MAP Growth Test and using Khan Academy versus those who weren't. Now that's not with Khanmigo, but the hope would be that we could do a similar test, where students who are using Khan Academy and taking the MAP Growth Test once a quarter are compared to a cohort of students who are using Khanmigo alongside Khan Academy content, and then we'll compare the gains on the MAP Growth Test. So that's where we want to go long term. Really, testing this against a standardized assessment is the holy grail of determining efficacy, but we're a little ways away from getting there. There's a lot of logistics to set up.
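The cohort comparison Shawn outlines amounts to comparing score gains between a treatment and a control group. A minimal illustration, using made-up toy numbers rather than any real study data, might report the mean gain difference and a standardized effect size such as Cohen's d:

```python
# Illustrative cohort comparison on hypothetical MAP-Growth-style gains.
# The numbers are invented for the example, not real results.
import statistics

treatment_gains = [8, 10, 7, 12, 9, 11]   # hypothetical score gains
control_gains = [5, 6, 4, 7, 5, 6]

def cohens_d(a, b):
    """Standardized mean difference using the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = (((na - 1) * statistics.variance(a)
                   + (nb - 1) * statistics.variance(b))
                  / (na + nb - 2))
    return (statistics.mean(a) - statistics.mean(b)) / pooled_var ** 0.5

diff = statistics.mean(treatment_gains) - statistics.mean(control_gains)
print(round(diff, 2), round(cohens_d(treatment_gains, control_gains), 2))  # 4.0 2.64
```

A real efficacy study would of course also need randomization, significance testing, and controls for usage intensity, which is the research team's domain.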

Nathan Labenz: 1:10:23 How widely are you deployed today? And I guess also, big picture as you think about the future and just the original vision of Khan Academy, what would it take or what is the outlook for some sort of universal public access? I mean, obviously tokens are expensive. It doesn't sound trivial to get there, but I imagine that's the long term North Star. So how do you see that possibly happening?

Shawn Jansepar: 1:10:49 Our mission is a free world class education for anyone, anywhere. And the challenge with that is that a mission statement is all about where you want to be many, many years from now. The way we think about it is, even though we can't do it for free, it's still significantly cheaper than, say, paying for a real tutor. Folks who are in a more privileged position can hire tutors and get that help directly. There are some kids who just don't have that access. So we still think for independent learners signing up, there's a lot of value there. But where we've really been focused in trying to level the playing field as much as possible in the interim, since we just can't give it to the entire world for free, is selling Khan Academy to school districts. There's an extra premium if a school district wants that access. Because we're a nonprofit, we don't have to focus on how to sell to the school districts with the most money. Instead, we really focus on the school districts that have a higher percentage of historically under-resourced learners, and we do that via a proxy: schools that have a higher percentage of what's called free and reduced lunch. Those are the schools that we generally try to target to sell to. And if those schools can't afford it, we tend to partner with corporate sponsors, especially the ones who are local. If a corporation from Kansas wants to help in their local community, they can sponsor per student and work with us to offset the cost that the district can't afford. So that's what we've been really focused on: working closely with the districts who we believe could use us the most. And this is a long term aspiration, but as part of what we're doing with Khanmigo, a lot of nonprofits, like museums, have this notion of a membership.
And so we want to create this tiered system of membership where, if you do want to pay more, then we will take some of that, or at least that's what we're talking about, and use it to offset costs for students who can't afford it. So maybe we work with another nonprofit who already has direct access to many kids who are historically under-resourced, and we give them a bunch of licenses. Those are the things that we're exploring, but I think right now our main focus is getting it into the hands of students and districts who we think can benefit from it the most. And again, right now it's very much on a trial basis as well, but we want to build this and get feedback from students who aren't just the ones with the most resources. We want to get it into the hands of the kids who maybe don't love math, who think they hate math or that they're not good at math, and help create growth mindsets and a love of learning. If we don't get into those classrooms, then it's really hard to get that feedback, and it's really important for us to do that.

Nathan Labenz: 1:13:43 You know, I think about the technology, I think about the student, but obviously, as you're alluding to here, there's the distribution through the districts and the teachers, and a lot of stakeholders in this game. So how has that gone so far?

Shawn Jansepar: 1:13:56 Yeah. I mean, one thing we haven't talked much about, mainly because I'm the director of engineering for the learner experience (there's another director who drives the teacher experience, although we own the core platform), is that there's a whole teacher side of this equation as well: creating custom lesson plans, getting insights into where your students are struggling, creating differentiated learning plans based on that using the large language model. So there's a whole thing in there, and I think teachers have really taken to it. A lot of teachers feel it's saving them a significant amount of time. And then the students really like it. Some of the assumptions that we made, around having a tutor that does a pretty good job at tutoring you, that occasionally makes mistakes, but is there 24/7 and is not judging you: students really respond to that very positively. And then there are other examples too. For instance, one of the biggest things that some students love, where English isn't their first language, is being able to speak to Khanmigo in their native language and have Khanmigo interact in that way. That's been huge, because there are just some areas where the language is too difficult in English, and folks have responded very positively to that. So the reception has been quite positive, which is why we're continuing to invest so much in this product line.

Nathan Labenz: 1:15:11 I've been surprised by how positive the reaction has been in a couple of places. Medicine is another one: if you'd asked me a year ago, when I was first trying GPT-4, how's the medical establishment going to react to this, I would have forecast probably quite defensively. And I might have said the same about the education establishment, if that's a term that makes sense. But certainly on the medical side, and it sounds like also on the education side from your account, maybe just because folks are actually super overloaded and really need help and want relief, I've been surprised on the positive side as to how eager people are to get the best out of these tools.

Shawn Jansepar: 1:15:52 We were worried that there would be teachers who would view this as replacing their jobs. And that's the last thing that we would ever want. We don't envision a world where there are no more classrooms and kids are just learning next to a computer their entire life, from first grade to twelfth grade. School is a social experience, and building those relationships with your teachers and your classmates is such an important part of it. Really, this is all about freeing up the teacher to be able to do more of the personalized approach and help. And ultimately, if Khanmigo can't help with something, then the teacher can go over and help directly. Previously maybe there would be 10 hands raised while the teacher's thinking, how am I going to get to all these kids? Now maybe there are only 2 hands raised. Or the teacher can work towards building more project-based experiences, rather than having to do one-on-one help with as many students as possible and struggling. So, really, we're trying to supercharge the classroom and unlock things for teachers in ways they would never be able to do before. Our intent is not to replace teachers by any stretch of the imagination.

Nathan Labenz: 1:16:58 Is there a standard price that is publicly known? Is it a per student per month sort of deal, or how does the business side of it work?

Shawn Jansepar: 1:17:07 It is a per student per month. And, again, there's a lot of offset costs with corporate sponsors or sometimes not just corporate sponsors, but even ourselves discounting it to be able to get it in the hands of users who we think need it the most.

Nathan Labenz: 1:17:21 Yeah. So I got access with a recurring monthly donation of $9 a month, which I signed up for with my PayPal account. So that's under half the price of ChatGPT Pro, and presumably that's the same order of magnitude deal that you're offering to schools, I imagine. I have one final question. Anything else that we didn't talk about that you wanted to make sure we touched on today?

Shawn Jansepar: 1:17:45 One of the things that we're really excited about potentially doing in the future that's on our eventual roadmap is things like multi-user activity. So right now it's this very one-on-one thing between a student and an AI, but you can imagine a world where the AI is doing things like differentiated learning and then making a recommendation to a teacher that says, hey, I'm noticing that these five kids are struggling with this concept. These ten kids are struggling with this other concept, and these two kids are struggling with this other concept. Would you like to create a breakout? And maybe the teacher just says yes, or they can adjust it. And then there's this experience where maybe the students are actually grouped where the same students are struggling with that concept. They're getting grouped and tutored with the AI and they can all chat with the AI. Or you can imagine a world where there's engaging classroom mechanics where you're crafting a story one-on-one with the AI, but imagine within the classroom, the different students are taking turns contributing to the story and they create this collective story together, kind of like Mad Libs, but AI style. Or an AI facilitating debates and grading between students. I think there's just a lot of things that we're really excited about when it comes to classroom interactions that go beyond just this one-on-one experience, and those were just a few examples I gave you. But I think we've really only scratched the surface of what we can do with education. Obviously, when a new technology comes out, the initial propensity is just to emulate what is already happening in the real world. But I think there's a lot of magic to discover, and I'm just really looking forward to finding that magic.

Nathan Labenz: 1:19:21 I think you guys are off to a phenomenal start, and that's a great forward-looking product vision. My final question for you is just going to be, what is the story you tell yourself about the impact that this can ultimately have as it does reach global scale? How do you think it changes the world at large?

Shawn Jansepar: 1:19:39 Yeah. I mean, even the question just gave me goosebumps. It's funny. When I joined Khan Academy, I remember I'd watched a lot of Sal's TED talks, and that was a big part of why I joined. One of the things he always talked about was how he believes it's so important that students are able to find their gaps and fill those gaps. And when I joined in 2017, I kind of assumed that the software already did that. And I was surprised that it didn't. You can go and, in a lot of ways, figure out what the gaps are. You're struggling, you get to work on things at your own time and pace. But the way I kind of see it is, finding gaps on Khan Academy when I joined was at best O(log N): you could go in the middle, and if it was too hard, chop that list in half, go in the middle, chop it in half. Or O(N), working backwards or something like that. But there was nothing that really was able to notice where you were struggling, really pinpoint that, and help you with that specific skill so that you could get back to your grade level work and be unblocked and not develop these gaps over time. I think preventing the Swiss cheese gap for students is just such a huge opportunity. With what we are building with having this AI-based tutor, I just feel like we have the opportunity to dramatically accelerate that at scale. And I think, ultimately, that's what I'm just so excited about. I think not everyone has access to a smaller classroom where they can get all that personal help. A lot of students don't have parents who can help them with every subject. And the way I think about the amazing stuff about Khan Academy was with Sal creating all these videos for all these different subjects, maybe he's not the best at teaching a specific subject.
Maybe there's one person who's the best at teaching that one very specific subject, but we really kind of leveled the playing field with those videos by saying everyone in the world has access to Sal as the world's teacher. And I think with Khanmigo, we're just taking that one step forward and going back to the whole thing about the potential effect sizes of having one-on-one tutoring. I can't wait to see those efficacy studies because I think they're going to show that this thing can really help students and I think it can really help change the world. We'll see.

Nathan Labenz: 1:21:53 Brilliant. Shawn Jansepar, thank you for being part of the Cognitive Revolution.

Shawn Jansepar: 1:21:58 Thanks, Nathan. It's a pleasure.
