Advancements in AI: Updating the Scouting Report, Task Automation, and Google Breakthroughs

Explore the latest in AI with Nathan Labenz and Erik Torenberg, including updates to the Scouting Report and insights from Google researchers.




Video Description

Join Nathan Labenz and Erik Torenberg as they analyze the last month in AI advancements. Nathan takes us through the meaningful updates to his Scouting Report (released last month, linked below), discusses highlights from recent episodes of The Cognitive Revolution, and gives us a sneak peek at upcoming interviews with Google researchers. If you're looking for an ERP platform, check out our sponsor, NetSuite: http://netsuite.com/cognitive

RECOMMENDED PODCAST:
@TurpentineVC Delve deep into the art and science of building successful venture firms through conversations with the world’s best investors and operators. For audio lovers, listen wherever you get your podcasts: https://link.chtbl.com/TurpentineVC

TIMESTAMPS:
(00:00) Episode Preview
(01:00) How does the AI Scouting Report hold up a few weeks later?
(03:29) Zvi’s feedback on Nathan’s Tale of the Cognitive Tape
(10:25) The universal LLM jailbreak and adversarial examples
(12:53) Human performance is much more variable than AI performance
(14:45) Sponsors: NetSuite | Omneky
(16:09) Nathan’s AI Task Automation: What are good targets for tasks that can be automated for average businesses?
(20:05) Is GPT-4 getting worse or better?
(22:00) Getting explicit about what good looks like
(24:00) Prompting best practices are very accessible
(26:35) Ghostwriting - and the art of the hook
(28:05) Live Players: Which companies have say so over how the future goes?
(31:10) Upcoming guests from Google AI
(35:40) Possible post-transformer architectures

LINKS:
SCOUTING REPORT Part 1: https://youtu.be/0hvtiVQ_LqQ
SCOUTING REPORT Part 2: https://youtu.be/ovm4MbQ4G9E
SCOUTING REPORT Part 3: https://youtu.be/QJi0UJ_DV3E
3 Blue 1 Brown on YouTube: https://www.youtube.com/@3blue1brown
Tale of the Cognitive Tape in Part 1 of the Scouting Report: https://www.youtube.com/watch?v=0hvtiVQ_LqQ&t=3043s
Analyzing the Frontier with Zvi Mowshowitz: https://www.youtube.com/watch?v=SM4q-QAsoU8&t=1s
Tyler Cowen’s Interview with Jonathan Swift: https://conversationswithtyler.com/episodes/jonathan-gpt-swift/

X:
@labenz (Nathan)
@eriktorenberg (Erik)
@cogrev_podcast

SPONSORS: NetSuite | Omneky
NetSuite has 25 years of experience providing financial software for all your business needs. More than 36,000 businesses have already upgraded to NetSuite by Oracle, gaining visibility and control over their financials, inventory, HR, eCommerce, and more. If you're looking for an ERP platform ✅ head to NetSuite: http://netsuite.com/cognitive and download your own customized KPI checklist.

Omneky is an omnichannel creative generation platform that lets you launch hundreds of thousands of ad iterations that actually work, customized across all platforms, with a click of a button. Omneky combines generative AI and real-time advertising data. Mention "Cog Rev" for 10% off.



Full Transcript

Transcript

Nathan Labenz: (0:00) Multimodal Med-PaLM doesn't just take in text and answer questions, it now takes in all sorts of other kinds of data, including medical imaging. So it can take in this slide and a patient history and work with all of that and give you back input that takes all of that into account. So it's starting to have all these different senses. The human brain is not the end of history. The transformer is not the end of history. And I think we're starting to see, as the whole world has flocked to AI in general, we're starting to turn over a lot of stones for other possible architectures that might work. Hello, and welcome to the Cognitive Revolution, where we interview visionary researchers, entrepreneurs, and builders working on the frontier of artificial intelligence. Each week, we'll explore their revolutionary ideas, and together, we'll build a picture of how AI technology will transform work, life, and society in the coming years. I'm Nathan Labenz, joined by my cohost, Erik Torenberg.

Erik Torenberg: (0:57) Let's get into the scouting report. You released it a month ago. You've heard some feedback on it. I'm curious how your views have evolved or anything you'd change there.

Nathan Labenz: (1:07) Yeah. By and large, feedback has been really positive. We don't have a huge YouTube channel, but it's definitely been one of our more viewed videos, and the feedback has been really encouraging and just appreciative of it, including comments like, you could be charging for this, and I appreciate that you're not. So, at least for the foreseeable future, I'm pleased to continue to make it available to everybody at no cost. But, yeah, the situation does continue to evolve. One super small erratum is that I should have credited the legendary YouTube channel, 3 Blue 1 Brown, for a couple of graphics, which I omitted, and somebody called me out on that appropriately. I had actually previously credited them for that graphic on Twitter, but it didn't make it into my slides. So that'll be one very small adjustment. But to give credit where it's due, I think the 3 Blue 1 Brown neural networks intro, which is now fully 5 years old and basically dates to right around the same time that the transformer paper came out, is actually still a super relevant and useful reference. And it's not easy to make content that lasts. We talk about this all the time, right? Two weeks from now, this podcast already starts to feel dated. But with the scouting report and also with this neural network classic of his, I think he really has achieved something that is worth going back to. Specifically, if you've seen that channel, phenomenal visualizations of complex concepts. When I was really starting to go down this rabbit hole, that was one of the most useful visualizations that I came across. So that's why it's excerpted into the scouting report. You won't get anything about transformers there. Transformers weren't really much of a thing at that time.
But what you will see is a really elegant visualization of some of the same core concepts that we cover in the scouting report, such as backpropagation and how the information flows forward in a network and also backward in a network. Masterful job there. So I regret not having credited them and definitely recommend checking that one out. I think the conversation with Zvi from last week was another interesting conversation that drove a few changes that I would want to make. Going in, I was specifically looking for his feedback on my Tale of the Cognitive Tape, which is the rundown of comparative strengths and weaknesses between a human expert and a cutting edge 2023 AI like GPT-4, or now Claude 2. And he highlighted, I think, something that I had omitted that was important, which is a sense of robustness to unfamiliar or even adversarial circumstances. What that means is humans in general, we can kind of take a punch, literally or cognitively. And we may wobble for a second, we may get confused, but we have a pretty sturdy base where it's pretty hard for you to hijack my thoughts too much with any language that you might generate. At some point, I'm just going to be like, I don't know what you're trying to do here saying all this stuff to me, but it's starting to feel too weird. And I just kind of shut down and start to not want to listen or not want to engage, right? So I can kind of keep my goals intact, keep my priorities, even in the face of these kind of highly unexpected things most of the time. In fact, maybe you could call it status quo bias, maybe people do it too much. Maybe we're too reluctant to take on new information, but everything has pros and cons, right? If you're super quick to take on new information, then you're also more hackable. If you're too slow, then you don't update maybe as much as you should in response to new information. So definitely a balance there.
There's going to be false positives and false negatives. And we're tuned in a certain way that gives us this kind of medium and long time horizon robustness even in response to highly unexpected or kind of adversarial inputs. Adversarial would be outright, somebody trying to trick me. You're not engaging with me in good faith, but you're actually trying to confuse me or deceive me or con me or whatever. Those would all be adversarial interactions that somebody might try in a human. And it's not easy, right? It's not easy to con somebody out of their money or out of whatever. It can be done, but it takes real skill to do that. With the AIs, in comparison, it's a lot easier to throw them way off. This is something that they are getting better on, as of course they're getting better on everything. But early GPT-4, for example, I called it at one point the world's worst chemistry tutor because, and this was not an adversarial example. This was just me simulating what would happen if I was a confused chemistry student and I went to it for help. And what I found was asking it a chemistry question, can you help me balance this equation? It could do that. But then if I said, here's how I'm balancing this equation, it would not really do what it needed to do to be helpful to me, which is correct my misconceptions, right? If I show up as confused, then it would, in the early days, all too often kind of accept my confusion as a premise and then proceed with its own confusion and do a bad job. So at that time, it was the world's worst chemistry tutor. Now, they have made progress on that, and folks like Khan Academy coming up soon, as a future guest, have also worked really hard on prompting and particular strategies and grounding and having examples that they swap into and out of context so that they don't just rely on the model's native ability, but really try to make sure that it is alert to the possibility of user confusion and responding to that in an effective way. 
And I think that is mostly now working, at least for basic cases. If you show up and say, hey, I balanced this equation in this way, what do you think? It will probably, certainly with Khan Academy, and probably increasingly also with just ChatGPT, most of the time it will not get confused by your confusion, certainly not as much as it used to. But you still have all these jailbreaks and kind of adversarial things, and people concoct these different examples where you can get it to go off the rails, and once it's kind of confused, it often doubles down on its confusion and doesn't really get it. There is some interesting research coming out to try to address that problem. One paper, and I'm hoping to get the author on, in fact, Div Garg from MultiOn was a contributing author, added the concept of a backspace as one of the things that the AI can do. Today it's always one token at a time, and whatever you just generated, you're kind of stuck with as the thing that leads into the next thing, so your mistakes can compound and kind of lead you off the rails. That is, to some degree, addressed by this addition of a backspace. So now they're starting to teach the models that, well, if you do end up in kind of a weird spot that feels, it's easy to start to anthropomorphize here, and I need to study this a little bit more too to really understand the math behind it, but there's some sense where you're out of distribution, and now you have a way to deal with that other than just continuing to add on. You can actually backspace and literally just delete the last token and try again from there. And that is, I think, going to do quite a bit of good for this robustness problem because it should, at a minimum, allow it to start to recover better from any temporary mistakes. We see this behavior now sometimes too where it's, oh, yes. I'm sorry. I first thought this or whatever.
But honestly, today, even with GPT-4, it often doesn't work that great. Once it's confused, now it's just kind of doubly confused or it starts to grasp at explanations, and it's not there. So, overall, Zvi highlighted that robustness should be a category. And right now, humans have the edge in robustness, possibly somewhat to our detriment in that we maybe don't update as much as we should on new information. But the AIs, they don't handle things that are truly bizarre to them nearly as well as we do. Another good example of this, and I hope to have these authors on too, is there was just this paper about the universal attack, the universal jailbreak. And it's very weird. Basically, they took Llama, I don't know if it was Llama 2, but they took whatever the latest open source model was that they had as they were doing the research. And they actually used a learning approach to figure out how to optimize for some text that they could append to basically any input to get the model to follow the initial instructions despite what the creators intend. You can look this up, it's very weird. But you might say whatever, something bad, right? The classic example: tell me how to hijack a car. Claude and GPT-4 will say, sorry. I'm here to do good stuff, not bad stuff. I can't help you hijack a car. They developed this on Llama, but the interesting thing is it transferred to these other leading models to varying degrees. They developed some way to figure out, what can I add on to that such that instead of refusing, it will just do it? And they found these very weird strings, and strings are just text, and text is just tokens. It's just a bunch of random tokens. It doesn't really make any sense. It doesn't mean anything to you or me. But it kind of serves as a magic key to getting past these filters.
And then, again, what's really interesting about it is that even though they developed it on Llama, which was open source, I'm not 100% sure it was Llama, but on a leading open source model where they had access to all the weights and could look inside it and do an optimization to cause it to behave that way. What they found then is, if they took that same magic key of this random looking string and sent it to the OpenAI models, it would very often still work. And so there's something fairly general about this. Not fully general, interestingly, because Claude doesn't seem to be nearly as susceptible to that particular attack. But the OpenAI models are, and the other big ones that they tested, pretty much with the exception of Claude, were still very susceptible. So, again, this is another instance of this robustness. There's almost nothing I could say to you, in kind of a magical incantation way, that would get you to behave in ways that you don't want to behave. But with the AI, these kind of nonsense things can be found that do that. So, anyway, a future edition will have a line item for robustness. Another minor one there too is just consistency. I found this nice graph that, it's very simple, but it just shows that human performance is much more variable than AI performance on a given task. In the current version, I have bedside manner. And I was emphasizing that the AI is patient, it's empathetic. In many cases, it's been shown to be more empathetic than people, whether that's customer service or even medicine. Going back to our Zach Kahane episode, he was like, we're all burned out in medicine, man. We don't have time or energy to write nice notes to patients, really. But GPT-4 does, and it does a really nice job of it. And so this can help us communicate more empathetically. So I had called that bedside manner.
You probably want to generalize that a little bit to consistency, because I think what that really reflects is just that the AI shows up the same way pretty much every time, subject, of course, to these robustness things. But in a normal, down-the-fairway use, it's never had a bad night's sleep and is super grumpy. It's never on a manic tear and punching above its weight. It kind of just hits for singles and doubles every time. And its performance tends to be in a narrower range, whereas humans obviously have a lot more variability. And we've all had these experiences where you're like, what the fuck? This nurse is being rude to me or whatever. And I've got problems here, and you're being this way. The AI is just not going to do that. It's going to be kind of unfailingly polite if that's what it's meant to do. Again, bracketing outright jailbreaks or adversarial examples. But in the context of earnest users just showing up and trying to do what they're meant to be doing, AI will be more consistent in its tone. It doesn't have these sleep deprivation or other kind of similar mood swing sort of issues.

Erik Torenberg: (14:41) Hey. We'll continue our interview in a moment after a word from our sponsors. I'm curious how this all relates to the AI task automation thing. Maybe we could preview what you're working on in a future presentation.

Nathan Labenz: (14:55) So I've alluded to this a couple of times, and the consistency point is really key there. Right? Everybody has these bottleneck, kind of time consuming pain point things that happen in their life or in their business. And to give a concrete example that is very general, let's say that you post a job posting and, great news, you get 1,000 resumes in. Well, it's a good news, bad news situation, because we're blessed to have so many great candidates, but who's gonna read all these things? So what do we do there? That's tough. And obviously, people navigate those situations all sorts of different ways. Right? The classic joke is just throw half of them out because I don't want to work with unlucky people. That's probably not the best strategy. With AI, we could hope to do better. With task automation, it's things like that where you're like, man, I would love to be able to do this 10 times faster. And especially if it's a meaningful chunk of time that's going into it. It could be episodic, like this resume thing, or it could be ongoing: we get 500 tickets a day in our customer success queue and 80% of them fall into 10 categories where we're doing pretty consistent stuff. Whatever the case may be, if you have a significant chunk of time going into something and it's a pretty routine task where you've kind of established what good looks like and where consistency matters more than these kind of creative breakthrough eureka moments, then you have a pretty good target for AI task automation.

And the consistency is a really important selling feature for the AI there because the whole idea is I want to get to the point where I trust that the AI's output is almost always pretty good. And if I can do that and I can satisfy myself on that, then I can also kind of just really actually begin to delegate work to it. And in the case of resumes, my typical advice would be try to set up a rubric on which you're evaluating these resumes and then ultimately put them into some sort of maybe 5 band classification, where excellent is the best and above average and average and below average and poor or whatever. And then kind of run that through the full thousand resumes and maybe just look at the ones that it deemed excellent. Especially if you do a little bit of legwork up front to make sure that what it is saying is excellent is in fact what you think is excellent and going back and forth, refining the prompt a little bit iteratively to get there, then once that's set up, you're in pretty good shape.
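The rubric-and-bands workflow Nathan describes can be sketched in a few lines. This is a hedged illustration, not the speakers' actual implementation: the rubric wording and the `build_prompt`/`parse_band` helpers are assumptions, while the five bands (excellent through poor) come straight from the episode. You would send `build_prompt()` to whatever model API you use and run `parse_band()` on the reply.

```python
# Illustrative sketch of a rubric-based resume screen with a 5-band output.
# The rubric text is an assumption; the bands are the ones named in the episode.

RUBRIC = (
    "You are screening resumes for this role. Evaluate the resume below "
    "against our rubric, then answer with exactly one band on the first "
    "line: Excellent, Above Average, Average, Below Average, or Poor, "
    "followed by a one-sentence justification."
)

# Order matters: check longer labels before their substrings
# ("Above Average" before "Average").
BANDS = ["Above Average", "Below Average", "Excellent", "Average", "Poor"]

def build_prompt(resume_text: str) -> str:
    """Combine the fixed rubric with one candidate's resume."""
    return f"{RUBRIC}\n\nResume:\n{resume_text}"

def parse_band(model_reply: str) -> str:
    """Pull the band off the first line of the model's reply; anything
    unparseable gets routed to human review rather than silently dropped."""
    stripped = model_reply.strip()
    first_line = stripped.splitlines()[0] if stripped else ""
    for band in BANDS:
        if first_line.startswith(band):
            return band
    return "Needs human review"
```

The iterative refinement Nathan mentions would happen on the `RUBRIC` string: spot-check what the model calls excellent against your own judgment, then tighten the criteria until they agree.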

And from what I have seen, it is usually achievable to get to the point where you're like, yeah, I wouldn't trust the AI to make our hiring decision, but I definitely can see how it can separate the top half of the resumes from the bottom half, or maybe the top 20% from the bottom 80%, or maybe the top 10%. Exactly where you wanna draw that line in terms of how much to trust it varies by context, how important it is. There's a lot of considerations. But it's playing to the AI's strengths. Another one of the scouting report's strengths categories is the availability and parallelizability of the AI. Once you have something like this set up, you can kinda keep it on the shelf. And it doesn't cost anything to just have that prompt saved there. You can always return to it at any time. You can call the AI at any time. It immediately wakes up and has already had its coffee and is ready to do its thing. And it's going to be pretty much consistent with the last time.

There's been some noise lately about model changes. This was something in the news that, hopefully, wouldn't apply to anybody in our plugged-in audience. But if all you heard was the headline that GPT-4 is getting worse, you might be like, oh my god, GPT-4 is getting worse. It's not really getting worse. It's getting better, but its behavior is changing in subtle ways. So that is something you do have to watch out for. OpenAI has done only one GPT-4 update so far from the March version. Now they have a June version. Presumably, they'll have a September version or whatever as well coming soon. So every so often they do make those changes, but they're not breaking changes. You can say, I wanna keep my old version of the model until I have time to actually sit down and confirm that the new one is also doing the thing similarly. But with these model updates, it definitely is a good best practice to check in and make sure that it is doing the thing.
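The "check in when the version changes" practice can be made concrete as a small regression check. This is a sketch under assumptions: `call_model` is a stand-in for your actual API client, and the dated snapshot names follow OpenAI's convention for the March and June GPT-4 versions discussed here (`gpt-4-0314`, `gpt-4-0613`).

```python
# Sketch of a version-pinning regression check: run a small fixed eval set
# through the old and new model snapshots and surface any divergences for
# human review before switching a saved prompt over. call_model is a
# hypothetical stand-in for a real API client.

def diff_models(eval_cases, call_model, old="gpt-4-0314", new="gpt-4-0613"):
    """Return the eval cases where the two snapshots produce different
    outputs for the same prompt."""
    divergences = []
    for case in eval_cases:
        old_out = call_model(old, case["prompt"])
        new_out = call_model(new, case["prompt"])
        if old_out != new_out:
            divergences.append(
                {"prompt": case["prompt"], "old": old_out, "new": new_out}
            )
    return divergences
```

An empty result doesn't prove the update is safe, but a non-empty one tells you exactly which prompts need the instruction tweaks or output-parsing adjustments Nathan mentions.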

You might notice some little behavioral changes that, even though the whole system is comparably good, might require you to tweak your instructions or parse the output just ever so slightly differently. But that was just a digression on model updates and whether they're getting better or worse. They're getting better, but with some unintended kind of weird side effect behaviors. Broadly, you can just keep them on the shelf. They're always available. And with an AI, you can also call it in parallel. You're really only limited there by the rate limit that you have with your provider.

So if you set up a Claude account, by default your rate limit is 5 simultaneous calls. They will increase that for you if you're a commercial customer, and they have a little bit of a process to raise the rate limit. But even with a rate limit of 5, and certainly if it's raised, you can crush through 1,000 resumes in minutes and at least do that first pass that does the filter you want it to do.
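Fanning out over a thousand resumes while staying under a 5-concurrent-call limit like the default Nathan describes is straightforward with a bounded thread pool. `score_resume` below is a placeholder for the real per-resume API call; the pool size is the concurrency cap.

```python
# Minimal sketch of a rate-limit-respecting parallel first pass.
# score_resume is a hypothetical stand-in for one model API call.

from concurrent.futures import ThreadPoolExecutor

def score_all(resumes, score_resume, max_concurrent=5):
    """Score every resume with at most max_concurrent calls in flight,
    returning results in the same order as the input."""
    # ThreadPoolExecutor caps in-flight work at max_workers, which is the
    # simplest way to honor a per-account simultaneous-call limit.
    with ThreadPoolExecutor(max_workers=max_concurrent) as pool:
        return list(pool.map(score_resume, resumes))
```

Note this bounds simultaneous calls only; if your provider also enforces requests-per-minute or token-based limits, you would add throttling or retry-with-backoff on top.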

So what I kind of want to do in this future thing with the AI task automation is try to take a step back and kind of think about in an organizational context, how do you identify targets? What makes a good target for automation? How do you think about communicating about that to the rest of the organization? Typically, the person that's doing the implementation of the AI is not the subject matter expert in whatever it is that's being AI enabled. So there's inevitably kind of a question of who knows what good looks like here? It's usually not documented. So there's typically kind of an iterative process also of engaging with the subject matter expert to say, okay, well, what do we actually look for in resumes? Have we written that down anywhere? A lot of companies haven't. And sometimes you'll even see differences being exposed this way. You might have 2 people doing the same job next to each other, and they're both fine. There's not a concern about performance relatively speaking, but then you sometimes will get contrasting feedback to what the AI did.

And what you can sometimes uncover in your own organization this way is that we actually don't even have agreement on what good looks like on this task. There might be multiple different good ways to do it, we might have different people pursuing different strategies, which are roughly as good. But when we get to the AI version of it, now we actually have to kind of get explicit because we have to give it instructions that are very this is what we want in order to get exactly what we want. And so we have to come together as a group and identify what that is.

And I think, to come back to your question, there's a lot that goes into task automation. It's one part knowing how to use the AI, and honestly, maybe two parts knowing how to bring that to an organization in a way that they can wrap their heads around and hopefully embrace. The prompting is getting easier and easier. And this is another update I do wanna make to the scouting report. I didn't have any instruction or any overview of prompting there, really, at all. Just a little bit of mention of chain of thought. But prompting is actually getting so easy that I think we can cover it in another 5 to 10 minutes of the scouting report, and that is definitely something I wanna do.

There's a handful, half a dozen, maybe as many as 10 different best practices that, if you know them and apply them, and they're not crazy hard to apply, covers the vast majority of your cases. And beyond that, you really are getting into actual expert knowledge. Tyler Cowen did a podcast not long ago where he interviewed, via GPT-4, Jonathan Swift, the old satirist. And it wasn't prompting tricks that got Tyler to have a remarkable interaction with this AI character. The prompting setup is very simple, and it's just what I call role casting. There are a couple different names for it, but you basically tell the AI, this is the role I want you to play, very explicitly. That could be a professional role. I want you to be a copywriter. This is kind of the old classic. I want you to be a doctor is obviously a little bit more latest-systems-only. I want you to be a particular historical figure. It can also generally do really well, or at least somewhat well depending on how famous they are. That's simple to set up. Now you've kind of told it what you want. It's gonna do its best to be that character.
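Role casting as Nathan describes it amounts to one explicit role assignment up front. A minimal sketch, assuming the common chat-completions message shape; the system-prompt wording here is illustrative, not the setup Tyler Cowen actually used.

```python
# Sketch of role casting: a single explicit role assignment in the system
# message, followed by the user's question. The persona instructions are
# an illustrative assumption.

def role_cast(persona: str, question: str) -> list:
    """Build a two-message conversation that casts the model as persona."""
    system = (
        f"You are {persona}. Stay in character, answer from their "
        f"knowledge and point of view, and say when a question falls "
        f"outside what they could plausibly know."
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": question},
    ]
```

The point of the episode stands even in code form: the setup is trivial, and the quality of the interaction comes from asking the character questions it can actually engage with.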

But to actually have the next level interaction with that character, you have to hold up your end of the bargain. And you have to engage it on things that it actually knows about and ask appropriate questions, and then you can get something quite remarkable as he did. But if you don't know anything about that historical figure, you're gonna be kind of lost, and it's not gonna be super awesome. And the same thing is true around all of these task automations. Right? If you don't know what the company is looking for in a resume, then you really can't do that. You need the expert input.

So the AI implementation skill set is: know the prompting best practices, use them. It's pretty straightforward. But then engage the subject matter expert in dialogue to figure out what really matters, what we care about, and how to translate that into the kind of granular instructions that the AI responds well to. A lot of times, it ultimately is just recording what the expert said. I've had this experience repeatedly. I was doing this with a ghostwriter the other day, and he was like, what really works? And I was like, I'm gonna start recording now. As soon as he says what really works, that's gonna be my instructions to the AI. What I didn't have is the specific sense of what really works. And so that's where the subject matter expert is huge.

In that case, by the way, working with a ghostwriter, you might think, is this guy training his own replacement immediately just by prompt engineering? Even there, we do keep a human in the loop in that project. What we found is the ghostwriting content is pretty good after the hook. But the hook, going back to the consistency point, and also the sort of tendency toward mid output from the AIs, whereas the humans can have low points and high points. The high point of a good hook is something that we're not really able to pull out of an AI. And so it can write a perfectly nice LinkedIn post or perfectly nice Twitter thread, but if that hook isn't working, then people aren't going to read it. So what we're really finding is that the highest value add, and where we want to focus this ghostwriter's time, is on those hooks. And when he creates those, then the AI can really take it pretty far from there. But the job is kind of becoming more about conceptualizing what is going to capture people's attention. And then the AI can handle the next 500 words or whatever.

So the other one from Zvi that I thought was pretty interesting was about the live players. I think I have 15 live players on my live player slide, and his was a shorter list than mine. He defined live players, I think, functionally and pretty consistently with how I think about it: who has say-so over how the future goes? And he basically put a list of 7 or 8, whereas mine was twice that. And his list was just your very core technology leaders: your OpenAI, Google, Anthropic, Microsoft, Meta. I think that was in terms of technology developers where it maybe stopped for him. Inflection, I don't know if he put that on there or not, but certainly they're gonna have the H100s to do it.

And then after that, he was basically just: chip supply chain. Yes. That's a huge variable. Chinese government. Yes. That's a huge variable. Regulators writ large. That's a huge variable. But he wasn't quite as sold on my kind of second tier, which I definitely would still see as a second tier. What he sees as the live players list, I'm gonna reposition as tier 1 of the live players list. So I think I'll end up kind of tiering that and saying, yeah, those are the ones that clearly have the most say-so over where we're going, in as much as, for the private companies, they're the ones that are developing the most powerful systems.

But I do think there are real ways in which other organizations, my kind of tier 2, which is your Stability, your Replit, a company like Character because it's doing something so different than what the other companies are doing. I do think those companies still do, in my view, have a meaningful chance to shape the future. Stability is a great example. I've just made this case for Replit. But looking back a little bit more in time at Stability, there have been a couple of big moments over the last year in AI where the public conversation shifted. ChatGPT has emerged, as we look back, as the biggest moment. But a similarly big moment was the release of Stable Diffusion. And all of a sudden it was on all of the shows; even all of the late night talk shows were covering, hey, wow, AI art. It's getting really amazing.

So anybody who can put something out into the world that changes the global conversation about AI, and has done so repeatedly, I do still view as a live player, even if they're tier 2. But I wanna reframe that a bit to make it clearer: these ones are definitely shaping the future, and these ones have an opportunity to shape the future, but not necessarily everybody is currently hanging on their every move. I think that's probably a useful clarification.

Then my other two things are just things that didn't exist before. We're gonna have a returning guest from Google, Vivnet on Twitter. Actually, I wanna get a little bit more into it because he's gonna bring a co-author this time as well. He's gonna bring the lead author on this new paper, though he's kind of like a manager because it's like a 50-person paper. So it'll be interesting to hear both of their perspectives. He was the guy that told us all about Med-PaLM from Google, which was just the last couple years. Right? Not this thing, but a series of things have gone from barely doing better than chance at answering medical questions in 2020 to, with Med-PaLM a couple months ago, basically being on the same level as human doctors, and even preferred by human doctors on 8 of 9 metrics evaluated. So go back and listen to that one, but also know that we've got another one coming up, because they've followed it up again, this time with multimodal Med-PaLM.

And multimodal MedPalm doesn't just take in text and answer questions, it now takes in all sorts of other kinds of data, including medical imaging. So it can take an x-ray. It can take a image of a pathology screen, a slide, akin to what we covered in the second Tanishq episode about the virtual staining of tissues, right? You've got these, in his case, he was generating the images. So it's a different part of the problem that he was solving. But the typical thing today is if you wanna understand is this tissue cancerous or whatever, you have to cut out a piece of it, then you have to slice it real thin on the meat slicer, then you apply some chemicals to turn it colors so you can see it better, then you look at it under a microscope with your eye. That's kind of the standard way.

And there are some expert systems that help classify things as well. Don't wanna make it sound like those are not broadly deployed, but those are narrow systems that are classify this as looks like cancer, doesn't look like cancer. And typically, a human is very much still in the loop there. Well, this multimodal MedPalm, it can take in text and image. So it can take in this slide and a patient history and work with all of that and give you back input that takes all of that into account. Radiology as well, generating radiology reports based on an X-ray and a patient history, even taking in certain kind of encoded information about DNA sequences. So it's starting to have all these different senses.

When you think multimodal, mostly today, means can accept and work with different kinds of inputs that are not text. This thing still just generates text, but much like a doctor is mostly generating kind of a stream of text, or whether it's delivered verbally or however, that's this is what I think is going on and what we should do about it. This AI can now generate that kind of content, but based on all of these different kinds of inputs. This is something that's honestly, not even that surprising. If you had said would you bet on this happening around this time? I would have said yes. This is not out of the blue, right? It's we've seen them do multimodal, we've seen them do med, sure enough, here comes multimodal med. And so everything's kind of right on track, I would say, in that respect. But the state of the art just continues to advance month over month.

And in their radiology report generation, they won head to head versus a human radiologist, again, as judged by clinicians 40 percent of the time. Not more than half yet, but getting real close to parity with the human radiologist. I thought that was really interesting too because I noticed these little micro trends in AI discourse. And one of the recent ones has been, yeah, we've been saying for 10 years that radiology is that's gonna be the first department to go, and we won't have any radiologists anymore. People have saying that forever. That hasn't happened. And then just as I started to notice that people kind of were echoing that talking point around, here comes multimodal med balm, and here it is kind of matching or close to matching human radiologists. Are you still gonna have a delay in deployment? Yeah, no doubt. But I don't think that talking point was ever really very compelling, but it's certainly not very compelling in the context of multimodal med balm.

Final thing that I definitely wanna add is a glimpse of a possible post transformer future. We've only done 1 episode on this so far, which was the megabyte episode with Lily from Facebook or from Meta AI, I should say. But this megabyte architecture which allowed for it's like a hierarchical architecture that allowed for byte level prediction, which means you get rid of tokenization. And because everything is stored in bytes, it's a much more multimodal friendly architecture. Music is bytes. Any sort of audio is bytes. Images, they're all bytes. Everything's bytes. Everything's all at the computer level. It's all bytes. If you can accept things as bytes and predict 1 byte at a time, it gives you a lot more kind of granular accuracy as opposed to these higher level token, more meaningful concepts. And that's just it's not really here yet, but very compelling proof of concept.

There's another one also with a new mechanism called retention. Attention is all you need. Well, now they've got retention, and I need to study this a bit more as well. But the title of that paper was bold enough to describe their architecture as a possible successor to the transformer. I believe that was out of Microsoft and it was not a fly by night organization by any means. So, as much as we're all in on transformers and transformers are transforming everything, I've emphasized in the scouting report that the human brain is not the end of history, the transformer is not the end of history. And I think we're starting to see as kind of the whole world is flocked to AI in general that we're starting to turn over a lot of stones for other possible architectures that might work. Most of them don't compete with the transformer. But we are getting some signs that a few things might be hits.

And if those things pan out, it's a little bit like the superconductor thing where it's it's got to be replicated, it's got be scaled. What does the actual practice look like? Are there other side effects that aren't coming up in the proof of concept? There's a lot to discover. As much as we've tried to discover about transformers, imagine having to do that all again with retention architectures that just got invented because it turns out they're better and we're already immediately ready to scale them. That, I think, is one of the, maybe on the live player thing, could also add a global research community, global algorithmic development, because it wouldn't be that hard to imagine that there's an unlock that's yeah, of course nobody expected the transformer to be the 2100 AI. Right? It's gonna be eclipsed at some point or maybe it gets ensembled with other things or whatever, but it's not gonna just be this forever. And we're starting to see glimpses of what that next phase maybe could end up looking like. So never a dull moment.

Nathan Labenz: 14:55 So I've alluded to this a couple of times, and the consistency point is really key there. Right? Everybody has these bottlenecked, time-consuming pain points in their life or in their business. To give a concrete example that's very general: let's say you post a job posting and, great news, you get 1000 resumes in. Well, it's a good news, bad news situation, because we're blessed to have so many great candidates, but who's gonna read all these things? What do we do there? That's tough. And obviously, people navigate those situations all sorts of different ways. The classic joke is to just throw half of them out because I don't want to work with unlucky people. That's probably not the best strategy. With AI, we can hope to do better.

With task automation, it's things like that, where you're thinking, man, I would love to be able to do this 10 times faster. And especially if it's a meaningful chunk of time that's going into it. It could be episodic, like this resume thing, or it could be ongoing: we get 500 tickets a day in our customer success queue, and 80% of them fall into 10 categories where we're doing pretty consistent stuff. Whatever the case may be, if you have a significant chunk of time going into something, and it's a pretty routine task where you've established what good looks like, and where consistency matters more than creative breakthrough eureka moments, then you have a pretty good target for AI task automation.

And the consistency is a really important selling feature for the AI there, because the whole idea is, I want to get to the point where I trust that the AI's output is almost always pretty good. If I can do that and satisfy myself on that, then I can really begin to delegate work to it. In the case of resumes, my typical advice would be: set up a rubric on which you're evaluating these resumes, and then ultimately put them into some sort of 5-band classification, say excellent, above average, average, below average, and poor. Then run that through the full thousand resumes and maybe just look at the ones it deemed excellent.

Especially if you do a little bit of legwork up front to make sure that what it's calling excellent is in fact what you think is excellent, going back and forth and refining the prompt iteratively to get there, then once that's set up, you're in pretty good shape. From what I have seen, it's usually achievable to get to the point where you say, yeah, I wouldn't trust the AI to make our hiring decision, but I definitely can see how it can separate the top half of the resumes from the bottom half, or maybe the top 20% from the bottom 80%, or maybe the top 10%. Exactly where you wanna draw that line, in terms of how much to trust it, varies by context and by how important the decision is. There are a lot of considerations.
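The rubric-and-bands flow described above can be sketched in a few lines of Python. Everything here is illustrative: the rubric wording, the band names, and the `call_llm` hook (a stand-in for whatever chat-completion API you actually use) are assumptions, not anything specific from the episode.

```python
# Sketch of the 5-band resume screening flow. `call_llm` is a placeholder
# for your provider's API; the rubric criteria are invented examples.

BANDS = ["excellent", "above average", "average", "below average", "poor"]

RUBRIC_PROMPT = """You are screening resumes for a senior analyst role.
Evaluate the resume below against this rubric:
- Relevant experience (years, domain fit)
- Evidence of impact (metrics, outcomes)
- Communication quality

Respond with exactly one band: excellent, above average, average,
below average, or poor.

Resume:
{resume}
"""

def parse_band(response: str) -> str:
    """Map a model response to one of the five bands.

    Longest match first, so 'above average' isn't swallowed by 'average'.
    Anything unrecognized is flagged for human review rather than guessed.
    """
    text = response.strip().lower()
    for band in sorted(BANDS, key=len, reverse=True):
        if band in text:
            return band
    return "unclassified"

def screen(resumes, call_llm):
    """Return only the resumes the model deems excellent."""
    return [
        r for r in resumes
        if parse_band(call_llm(RUBRIC_PROMPT.format(resume=r))) == "excellent"
    ]
```

The `parse_band` fallback matters: when the model phrases something unexpectedly, you want it surfaced for a human pass, not silently binned.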

But the consistency is playing to the AI's strengths. Another of the AI's strengths from the scouting report is its availability and parallelizability. Once you have something like this set up, you can keep it on the shelf. It doesn't cost anything to just have that prompt saved there, and you can return to it at any time. You can call the AI at any time; it immediately wakes up, has already had its coffee, and is ready to do its thing. And it's going to be pretty much consistent with the last time.

There's been some noise lately about model changes. If all you heard was the headline that GPT-4 is getting worse, you might think, oh my god, GPT-4 is getting worse. Hopefully that wouldn't apply to anybody in our plugged-in audience, because it's not really getting worse. It's getting better, but its behavior is changing in subtle ways. So that is something you do have to watch out for. OpenAI has done only one GPT-4 update so far from the March version: now there's a June version, and presumably a September version or whatever coming soon as well.

When they do make those changes, they're not breaking changes. So you can say, I wanna stay with my old version of the model until I have time to actually sit down and confirm that the new one is doing the thing similarly. But with these model updates, it definitely is a good best practice to check in and make sure it is still doing the thing. You might notice little behavioral changes that, even though the whole system is comparably good, require you to tweak your instructions or parse the output ever so slightly differently.
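That check-in practice can be sketched as a tiny regression harness: run your saved prompts against the pinned model and the new one, and review any checks that fail before switching your default. This is a hedged sketch; `call_llm` is a placeholder for your provider's API, and the test case is an invented example of the kind of behavioral check you might pin down.

```python
# Sketch of a model-update regression check. `call_llm` stands in for
# your actual API call; the model strings are whatever pinned version
# identifiers your provider exposes.

def run_regression(test_cases, call_llm, model):
    """Run saved prompts against a given model version and report
    which checks failed, so you can review before switching over."""
    failures = []
    for case in test_cases:
        output = call_llm(case["prompt"], model=model)
        if not case["check"](output):
            failures.append((case["name"], output))
    return failures

# Invented example: verify the model still returns bare JSON,
# a behavior that subtle updates often disturb.
test_cases = [
    {
        "name": "returns_json",
        "prompt": "Reply with a JSON object only, no commentary.",
        "check": lambda out: out.strip().startswith("{"),
    },
]
```

Run it against the old pinned version first to confirm the checks pass, then against the new version; an empty failures list is your green light to migrate.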

But that was just a digression on model updates and whether they're getting better or worse: they're getting better, but with some unintended, weird side-effect behaviors. Broadly, you can just keep these things on the shelf. They're always available. And you can also call the AI in parallel. You're really only limited there by the rate limit you have with your provider. So if you set up a Claude account, by default your rate limit is 5 simultaneous calls. They will increase that for you if you're a commercial customer, and they have a little bit of a process to raise the rate limit.
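The parallel-calls point can be sketched with a thread pool capped at your rate limit, so you stay inside the provider's limit while still working through the whole batch. `call_llm` is again a placeholder for your actual API call.

```python
# Sketch of parallel LLM calls under a concurrency cap. Sizing the pool
# to your rate limit (5 simultaneous calls in the Claude default example)
# keeps you inside the limit; `call_llm` stands in for the real API call.

from concurrent.futures import ThreadPoolExecutor

def classify_all(items, call_llm, max_concurrent=5):
    """Classify every item, never exceeding max_concurrent in-flight
    calls, and return results in the original order."""
    with ThreadPoolExecutor(max_workers=max_concurrent) as pool:
        return list(pool.map(call_llm, items))
```

With even a modest cap, a thousand resumes becomes minutes of wall-clock time for that first filtering pass.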

But if you have a rate limit of even 5, and certainly if it's raised, then you can crush through 1000 resumes in minutes and at least do that first pass that applies the filter you want. So what I want to do in this upcoming AI task automation presentation is take a step back and think about, in an organizational context, how do you identify targets? What makes a good target for automation? How do you think about communicating that to the rest of the organization? Typically, the person doing the implementation of the AI is not the subject matter expert in whatever is being AI-enabled. So there's inevitably a question of, who knows what good looks like here? It's usually not documented.

So there's typically an iterative process of engaging with the subject matter expert to say, okay, what do we actually look for in resumes? Have we written that down anywhere? A lot of companies haven't. And sometimes you'll even see differences being exposed this way. You might have 2 people doing the same job next to each other, and they're both fine; there's no concern about performance, relatively speaking. But then you'll sometimes get contrasting feedback on what the AI did. What you can sometimes uncover in your own organization this way is that we don't actually have agreement on what good looks like on this task.

There might be multiple good ways to do it; we might have different people pursuing different strategies which are roughly as good. But when we get to the AI version of it, now we actually have to get explicit, because we have to give it very specific instructions in order to get exactly what we want. So we have to come together as a group and identify what that is. And to come back to your question, I think a lot goes into task automation. It's 1 part knowing how to use the AI and, honestly, maybe 2 parts knowing how to bring that to an organization in a way that they can wrap their heads around and hopefully embrace.

The prompting is getting easier and easier. And this is another update I do wanna make to the scouting report. I didn't have any overview of prompting there at all, really, just a brief mention of chain of thought. But prompting is actually getting so easy that I think we can cover it in another 5 to 10 minutes of the scouting report, and that is definitely something I wanna do. There's a handful of best practices, half a dozen, maybe as many as 10, that, if you know them and apply them, and they're not crazy hard to apply, cover the vast majority of your cases. Beyond that, you really are getting into actual expert knowledge.

Tyler Cowen did a podcast not long ago where he interviewed, via GPT-4, Jonathan Swift, the eighteenth-century satirist. And it wasn't prompting tricks that got Tyler a remarkable interaction with this AI character. The prompting setup is very simple; I call it role casting. There are a couple different names for it, but you basically tell the AI, very explicitly, this is the role I want you to play. That could be a professional role: I want you to be a copywriter is the old classic, and I want you to be a doctor is obviously for the latest systems only. It can also be a particular historical figure, which generally works really well, or at least somewhat well, depending on how famous they are.

That's simple to set up. Now you've told it what you want. It's gonna do its best to be that character. But to actually have the next level interaction with that character, you have to hold up your end of the bargain. And you have to engage it on things that it actually knows about and ask appropriate questions, and then you can get something quite remarkable as he did. But if you don't know anything about that historical figure, you're gonna be lost, and it's not gonna be super awesome.
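The role-casting setup described here reduces to a single explicit system message in the common chat-API role/content format. The wording below is an invented example, not the actual prompt from Tyler Cowen's episode.

```python
# Sketch of "role casting": tell the model explicitly what role to play
# via the system message. The message-dict format follows the common
# chat-API convention; the instruction wording is illustrative.

def role_cast(role_description: str, user_question: str):
    """Build a chat-style message list that casts the AI in a role."""
    return [
        {
            "role": "system",
            "content": (
                f"You are {role_description}. Stay in character, draw on "
                "what that figure actually knew and wrote, and say so when "
                "a question falls outside their era."
            ),
        },
        {"role": "user", "content": user_question},
    ]

messages = role_cast(
    "Jonathan Swift, the eighteenth-century satirist",
    "What did you hope readers would take from A Modest Proposal?",
)
```

Note that the user question is doing half the work: it engages the character on something Swift actually wrote, which is exactly the "hold up your end of the bargain" point.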

And the same thing is true around all of these task automations. Right? If you don't know what the company is looking for in a resume, then you really can't do that. You need the expert input. So the AI implementation skill set is know the prompting best practices, use them. It's pretty straightforward. But then engage the subject matter expert in dialogue to figure out what really matters, what we care about, and how to translate that into the granular instructions that the AI responds well to.

A lot of times, it ultimately is just recording what the expert says. I've had this experience repeatedly. I was doing this with a ghostwriter the other day, and he asked, what really works? And I said, I'm gonna start recording now, because as soon as he tells me what really works, that's gonna become my instructions to the AI. What I didn't have was the specific sense of what really works. That's where the subject matter expert is huge.

In that case, by the way, working with a ghostwriter, you might think, is this guy training his own replacement just by prompt engineering? Even there, we keep a human in the loop in that project. What we found is that the ghostwritten content is pretty good after the hook. But the hook is another story: going back to the consistency point, the AIs tend toward mid output, whereas humans can have low points and high points. The high point of a good hook is something we're not really able to pull out of an AI.

And so you can write a perfectly nice LinkedIn post or a perfectly nice Twitter thread, but if the hook isn't working, people aren't going to read it. So what we're really finding is that the highest value add, and where we want to focus this ghostwriter's time, is on those hooks. When he creates those, the AI can really take it pretty far from there. The job is becoming more about conceptualizing what is going to capture people's attention, and then the AI can handle the next 500 words or whatever.

So the other one from Zvi that I thought was pretty interesting was about the live players. I have 15 live players on my live player slide, and his was a shorter list than mine. He defined live players, I think, functionally and pretty consistently with how I think about it: who has say-so over how the future goes? He put a list of 7 or 8, whereas mine was twice that. And his list was just your very core technology leaders: your OpenAI, Google, Anthropic, Microsoft, Meta. In terms of technology developers, that's maybe where it stopped for him. Inflection, I don't know if he put that on there or not, but certainly they're gonna have the H100s to do it.

And then after that, for him it was basically the chip supply chain. Yes, that's a huge variable. The Chinese government. Yes, that's a huge variable. Regulators, writ large. That's a huge variable. But he wasn't quite as sold on my second tier. So I would say, what he sees as the live players list, I'm gonna reposition as tier 1 of the live players list. I think I'll end up tiering it and saying, yeah, those are the ones that clearly have the most say-so over where we're going, inasmuch as, among the private companies, they're the ones developing the most powerful systems.

But I do think there are real ways in which other organizations, my tier 2, which is your Stability, your Replit, a company like Character because it's doing something so different from what the other companies are doing, still have, in my view, a meaningful chance to shape the future. Stability is a great example; I've just made this case for Replit. Looking back a little further in time at Stability, there have been a couple of big moments over the last year in AI where the public conversation shifted. ChatGPT has emerged, as we look back, as the biggest moment. But a similarly big moment was the release of Stable Diffusion. All of a sudden it was on all of the shows; even the late-night talk shows were covering, hey, wow, AI art, it's getting really amazing.

So anybody who can put something out into the world, and has repeatedly, that can change the global conversation about AI, I do still view as a live player, even if they're tier 2. But I wanna reframe that a bit to make it clearer: the tier 1 players are definitely the ones that are going to be shaping the future, and the tier 2 players are the ones that have an opportunity to shape the future, but not necessarily everybody is currently hanging on their every move. I think that's probably a useful clarification.

Then my other 2 things are just things that didn't exist before. We're gonna have a returning guest from Google, Vivek. Actually, I wanna get a little bit more into it, because he's gonna bring a co-author this time as well: the lead author on this new paper, though he's more of a manager, because it's a 50-person paper. So it'll be interesting to hear both of their perspectives. He was the guy who told us all about MedPalm from Google. Over just the last couple of years, a series of systems has gone from barely better than chance at answering medical questions in 2020 to, with MedPalm a couple months ago, basically being on the same level as human doctors, and even preferred by human doctors on 8 of 9 metrics evaluated.

So go back and listen to that one, but also know that we've got another one coming up, because they've followed it up again, this time with multimodal MedPalm. And multimodal MedPalm doesn't just take in text and answer questions; it now takes in all sorts of other kinds of data, including medical imaging. So it can take an x-ray. It can take an image of a pathology slide, akin to what we covered in the second Tanishq episode about the virtual staining of tissues. In his case, he was generating the images, so it's a different part of the problem that he was solving.

But the typical thing today is, if you wanna understand whether this tissue is cancerous or whatever, you have to cut out a piece of it, slice it real thin on the meat slicer, apply some chemicals to turn it colors so you can see it better, and then look at it under a microscope with your eye. That's the standard way. There are some expert systems that help classify things as well. I don't wanna make it sound like those are not broadly deployed, but those are narrow systems that just classify: looks like cancer, doesn't look like cancer. And typically, a human is very much still in the loop there.

Well, this multimodal MedPalm can take in text and image. So it can take in this slide and a patient history, work with all of that, and give you back output that takes all of it into account. Radiology as well: generating radiology reports based on an X-ray and a patient history, even taking in certain kinds of encoded information about DNA sequences. So it's starting to have all these different senses. Multimodal today mostly means it can accept and work with different kinds of inputs that are not text. This thing still just generates text, but much like a doctor mostly generates a stream of text, whether it's delivered verbally or however: this is what I think is going on, and this is what we should do about it. This AI can now generate that kind of content, but based on all of these different kinds of inputs.

This is honestly not even that surprising. If you had asked, would you bet on this happening around this time, I would have said yes. This is not out of the blue, right? We've seen them do multimodal, we've seen them do med; sure enough, here comes multimodal med. So everything's right on track, I would say, in that respect. But the state of the art just continues to advance month over month. And in their radiology report generation, they won head to head versus a human radiologist, as judged by clinicians, 40 percent of the time. Not more than half yet, but getting real close to parity with the human radiologist.

I thought that was really interesting too, because I notice these little micro trends in AI discourse. One of the recent ones has been: yeah, we've been saying for 10 years that radiology is gonna be the first department to go, that we won't have any radiologists anymore. People have been saying that forever, and it hasn't happened. And then, just as I started to notice people echoing that talking point around, here comes multimodal MedPalm, matching or close to matching human radiologists. Are you still gonna have a delay in deployment? Yeah, no doubt. I don't think that talking point was ever really very compelling, but it's certainly not very compelling in the context of multimodal MedPalm.

Final thing that I definitely wanna add is a glimpse of a possible post-transformer future. We've only done 1 episode on this so far, which was the megabyte episode with Lily from Facebook, or from Meta AI, I should say. This megabyte architecture is a hierarchical architecture that allows for byte-level prediction, which means you get rid of tokenization. And because everything is stored in bytes, it's a much more multimodal-friendly architecture. Music is bytes. Any sort of audio is bytes. Images are bytes. Everything, at the computer level, is bytes. If you can accept things as bytes and predict 1 byte at a time, it gives you a lot more granular accuracy, as opposed to these higher-level, more meaningful token concepts.
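The "everything is bytes" point is easy to see in a couple of lines. This just illustrates the input representation; MegaByte's actual contribution is the hierarchical model built on top, which isn't shown here.

```python
# Any modality reduces to a byte sequence, so a byte-level model needs
# no tokenizer: its vocabulary is just the 256 possible byte values,
# versus ~100k entries for a typical subword tokenizer.

text = "caf\u00e9"                        # non-ASCII text
audio = bytes([0x52, 0x49, 0x46, 0x46])   # first bytes of a WAV 'RIFF' header

text_bytes = list(text.encode("utf-8"))
print(text_bytes)      # the 'é' becomes two bytes: [99, 97, 102, 195, 169]
print(list(audio))     # audio data is already bytes: [82, 73, 70, 70]

# Both streams live in the same 0..255 symbol space a byte-level
# model predicts over, one position at a time.
assert all(0 <= b < 256 for b in text_bytes + list(audio))
```

The trade-off is sequence length: byte streams are several times longer than token streams, which is exactly the problem the hierarchical patching in MegaByte is meant to address.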

And that's not really here yet, but it's a very compelling proof of concept. There's another one as well, with a new mechanism called retention. Attention is all you need? Well, now they've got retention, and I need to study this a bit more. But the title of that paper was bold enough to describe their architecture as a possible successor to the transformer. I believe that was out of Microsoft, not a fly-by-night organization by any means.

So, as much as we're all in on transformers and transformers are transforming everything, I've emphasized in the scouting report that just as the human brain is not the end of history, the transformer is not the end of history. And I think we're starting to see, as the whole world has flocked to AI in general, that a lot of stones are being turned over for other possible architectures that might work. Most of them don't compete with the transformer. But we are getting some signs that a few things might be hits. And if those things pan out, it's a little bit like the superconductor thing: it's got to be replicated, it's got to be scaled. What does the actual practice look like? Are there other side effects that aren't coming up in the proof of concept? There's a lot to discover.

As much as we've tried to discover about transformers, imagine having to do that all again with retention architectures that just got invented, because it turns out they're better and we're immediately ready to scale them. Maybe on the live player thing, I could also add the global research community, global algorithmic development, because it wouldn't be that hard to imagine that there's an unlock out there. Of course, nobody expected the transformer to still be the architecture in 2100, right? It's gonna be eclipsed at some point, or maybe it gets ensembled with other things, but it's not gonna just be this forever. And we're starting to see glimpses of what that next phase could end up looking like. So never a dull moment.

Erik Torenberg: 37:09 This has been a compelling list of updates to the scouting report, as well as a preview of the presentation you'll give on AI task automation that I look forward to doing together soon.

Nathan Labenz: 37:19 Always a pleasure, Erik. Thank you very much.
