Episode Description
Nathan Labenz talks to Google Robotics researcher Keerthana Gopalakrishnan (@keerthanpg) about how they train robots at Google, and how robots contextualize their environment and complete tasks through supervised learning. They go deep on integrating language models for commands and how to approach translating them, as well as the importance of robotics safety. Keerthana also shares how we’re using robots to better understand ourselves, personal experiences using AI, and her hopes and fears about AI's impact on society.
This episode provides valuable insights into the current state and potential future of robotics and AI, with Keerthana's expertise offering unique perspectives on these fascinating topics.
(0:00) Preview to episode
(0:51) Sponsor
(1:09) Intro
(5:20) GPT analogy for robotics
(6:51) Domestic robots
(10:53) Description of Google robotics
(15:17) Controlling robots and supervised learning
(18:50) How robots complete tasks
(26:17) Using data to train robots
(30:45) Data operability
(36:02) Integrating language models for commands
(38:58) How to approach translating commands
(42:49) How robots contextualize their environment
(47:02) Approach to robotics safety
(51:02) Tasks that robots can’t/shouldn’t do
(54:08) The hard bounds of robotics
(59:17) Frequency of inference
(1:03:50) Form factor in robotics
(1:08:38) Using robots to better understand ourselves
(1:12:23) Having robot friends and spouses
(1:17:44) Inference and latency
(1:19:17) The Google Robotics team
(1:21:07) AI tools Keerthana uses in her personal life
(1:22:44) Would you get a Neuralink?
(1:24:12) Keerthana's biggest hopes and fears about AI
*Thank you Omneky* for sponsoring The Cognitive Revolution. Omneky is an omnichannel creative generation platform that lets you launch hundreds of thousands of ad iterations that actually work, customized across all platforms, with a click of a button. Omneky combines generative AI and real-time advertising data. Mention "Cog Rev" for 10% off.
Twitter:
@CogRev_Podcast
@keerthanpg (Keerthana)
@labenz (Nathan)
Join thousands of subscribers to our Substack: https://cognitiverevolution.substack
Websites:
cognitiverevolution.ai
omneky.com
https://research.google/research-areas/robotics/
https://research.google/teams/brain/
Full Transcript
Keerthana Gopalakrishnan: (0:00) That's why I'm so excited about robotics, because we are inventing ourselves, right? It is in many ways a quest to understand us and our intelligence, and it's so hard to put down onto paper how we detect a cup or how we are doing these things or how we are planning tasks. You know how software engineers say the best way to learn something is to build it. And I think robotics is basically our quest to understand ourselves and build more of ourselves.
Nathan Labenz: (0:28) Hello, and welcome to the Cognitive Revolution, where we interview visionary researchers, entrepreneurs, and builders working on the frontier of artificial intelligence. Each week, we'll explore their revolutionary ideas, and together, we'll build a picture of how AI technology will transform work, life, and society in the coming years. I'm Nathan Labenz, joined by my cohost, Erik Torenberg.
Nathan Labenz: (0:51) Omneky uses generative AI to enable you to launch hundreds of thousands of ad iterations that actually work, customized across all platforms with a click of a button. I believe in Omneky so much that I invested in it, and I recommend you use it too. Use CogRev to get a 10% discount.
Nathan Labenz: (1:09) Keerthana Gopalakrishnan, mother of robots, is a robotics and machine learning engineer conducting advanced robotics research at Google AI, where she's been an author on several of the most influential papers of the last year, including SayCan, which demonstrated that language models can effectively act as the executive planning function or brain of a robotic system, capable of understanding and carrying out long-horizon, abstract, natural language instructions by assessing situations, decomposing problems into parts, and issuing logical next-step commands for their robot bodies to execute; and also the Robotics Transformer, or RT-1, which demonstrated that the same large-scale pretraining paradigm that has recently delivered so many breakthroughs in so many domains can also work for robotics. RT-1 ultimately delivered a 97% success rate on some 700 different tasks. It's a truism in Silicon Valley startup culture that it's easier to manipulate and control bits than it is to control atoms, and some have argued that the impact of AI will be limited to digital domains as a result. To me, this work strongly suggests that this outlook is wrong. Google's most recent release, PaLM-E, as well as OpenAI's GPT-4, show that computer vision is still improving rapidly. And at this point, the combination of visual and language understanding seems quite clearly rich enough to support general purpose robots, suggesting that relatively few major conceptual problems remain to be solved, chief among them more general robotic control, which we will likely achieve by teaching robots to learn from watching humans. Barring an unexpected slowdown, my takeaway from this conversation is that we should expect many form factors of robots that can see, communicate with us in natural language, and solve basic problems on their own, all in just the next couple of years. At that point, the race to engineer and scale the production of all sorts of robots will be on. And shortly after, we'll begin to encounter them in factories, offices, businesses, and even homes. I enjoyed this conversation with Keerthana tremendously. She is technically deep, and we get into the weeds. But she's also extremely thoughtful about how her work will affect the future, and I very much enjoyed her thoughts on the big picture as well. I hope you enjoy this conversation with Keerthana Gopalakrishnan. Keerthana Gopalakrishnan, welcome to the Cognitive Revolution.
Keerthana Gopalakrishnan: (4:01) Thank you.
Nathan Labenz: (4:02) Really excited to have this conversation. A lot to cover. You have one of the best Twitter bios that I've seen, Mother of Robots. And as soon as I saw that, I was like, all right, we've got to get Keerthana on the show and talk to her about everything that she's doing. You've had quite a tour de force over the last year and change as well in terms of having been an author on some of the biggest papers in robotics. And obviously, you've worked with an outstanding team at Google to create that work. But really looking forward to getting into some of the weeds on that. You recently gave a lecture for the ML Collective, which I have watched and definitely recommend. One of the things that you said in that lecture that really caught my ear was that we're somewhere between GPT-2 and GPT-3 in the world of robotics. So I wanted to just start by asking, what does GPT-4 look like in robotics? Or zooming out even more broadly, how do you think we will integrate robots into our lives as they get smarter and more capable over the course of, say, the rest of the 2020s?
Keerthana Gopalakrishnan: (5:20) I'm very excited because we are seeing the type of learning at scale in robotics that is similar to GPT-3 for language. We are seeing emergent capabilities in reasoning and also in low-level control. I think at Robotics at Google, we have cracked this recipe of transformers for control, language for interface, and then foundation models for reasoning. I feel this recipe is fairly generalizable and extensible and has the potential to scale with data and with compute. So the way that I see robots coming into our lives is that humans interact with robots in natural language, the robots can follow a lot of instructions in the real world, and they can think and plan in visual language space.
Nathan Labenz: (6:20) So without being falsely precise in terms of a specific date for a prediction, does it seem realistic to you that we are headed for a time within the decade where we might have, for example, domestic service robots that can do the dishes for me after dinner or pick up the toys after my kids and put them away when the kids don't do it? What is this going to actually look like in homes or in places of business for people?
Keerthana Gopalakrishnan: (6:51) Oh, absolutely, I do. In fact, the tasks that we train our robots on, the ones we publish, are in our office environment: picking things and cleaning and everything. In robotics, there is the problem that a lot of the demos you see are fairly staged. If you look at robotics technology, there was very traditional control, PID control and search algorithms and stuff. And then there was early deep learning, end-to-end deep learning, which was mostly ConvNets and everything. And now there are transformers in robotics. So we are seeing a curve, and we are seeing that transformers are actually pretty good at doing a lot of tasks with one model, which is quite similar to how you are, right? You can cook, you can clean, you can plan tasks, you can talk to your kids, language generation. So I feel that to build a robot that's very useful, that goes into different people's houses, it's going to be very generalizable and it's going to be one model that's good at a lot of different things. And we are kind of seeing those trends already. The question is, we know it works in multiple Google kitchens. RT-1 can do a lot of tasks, but Google kitchens are still very similar and still a very small subset of the total number of kitchens that people can see in the world. So how do you scale fast enough that you can bring the generality and complexity of the real world into the models, so that when you bring that to my house it doesn't suddenly break down?
Nathan Labenz: (8:21) Yeah, it's unbelievable convergence across all the things. We've talked to guests who are in computer vision and who are just touching all these different modalities, and the same story is underlying the progress in all of them. So no surprise to learn on some level that that is also happening in robotics. I think one of the other things that you talked about in your lecture and you kind of alluded to there is just scale starting to happen in robotics as it has happened in other modalities. But the data doesn't naturally exist, right? That's a huge difference because for language we have web scale data, we just did an episode on generation of human-like voices and obviously there's a ton of voices out there. I cloned my own voice. Took only 10 minutes of audio because there was so much pretraining. But obviously that doesn't exist in robots or for robots because we don't have all the robots. They're not out there doing tasks in today's world and collecting data. So I was really interested. I dug in a little bit to the RT-1 paper and tried to get a sense for the work that you guys have done to assemble the dataset, at least beginning to approach the scale that is needed to power this. And I want to get into that a little bit deeper because I think that's something that people probably have no real intuition for, and you have lived it and put hours into helping to assemble that dataset. So maybe you could just run down some stats and then get very practical in terms of what the data is and how you're going about collecting it. So you've got, this is from the RT-1 paper, 700 tasks, 130,000 episodes, which I take to be an attempt to complete a task. You can definitely correct or refine my understanding. Multiple different robot form factors and a project that took a year and a half to complete. I guess for starters, can you just describe the robots? There are a few videos out there that people can watch to also go see them. What are these robots? What do they look like? How big are they? What are they like to be around? Tell us about the robots.
Keerthana Gopalakrishnan: (10:53) Firstly, a lot of questions here. So let me go part by part. The robots that we currently use are really cute. They are mobile manipulators. They can drive and they have an arm with a gripper at the end. You can also fit different tools. They can do wiping and other things. And they also have self-charging and other capabilities on top of that. I think they're a pretty good and fairly stable stack, and we have a fleet of them and we have been collecting data on them. The 130,000 episodes used for RT-1 are for supervised learning. They are collected by humans teleoperating these robots to do different tasks across the 700. So they will attempt all of these tasks and then we train on that using supervised learning. Now that brings us to your question about data and language and vision. In language and vision, as you said, there's a lot of free data available on the internet that you can just take and use, and then you're good to go. But in robotics, there's reasoning and planning, for which there is a lot of data because it's similar for both humans and robots, but for actions specifically, there is not that much data. So that is one bottleneck in robotics. And it is also very much an engineering bottleneck. Robots are expensive, and you need a lot of engineering effort to orchestrate this large data farming operation, which means that very few people can collect and accumulate datasets of that scale. So that already means that very few people can do robotics. Which is why I feel we at Google, who have the opportunity to do it, really have a unique position in scaling and making a dent in our attempt towards solving robotics. So I'm very excited about that. But also, if you look at language and vision, the acceleration in deep learning was achieved by weakly supervised learning. Collecting data using robots can only scale linearly, and with human demonstrations especially, it can only scale with the number of humans. But we want to scale faster than that. And so one way would be to get transfer from human manipulation data. For example, imagine how you would learn to surf or cook something. You would take a YouTube video and then you would watch it. And there are a lot of YouTube videos of humans doing a lot of things. Imagine that one day you want to train, let's say, an NBA level basketball player, but who's a robot. You would not teleoperate robots to do NBA level basketball or do self play. Maybe you'll do a little bit of reinforcement learning and specific fine-tuning on top, but you would make it watch all of the NBA videos that humans have been playing all these years. And then you would try to get the robot scaled up to that point and then collect a bunch of data to exactly fit the robot's environment. So transferring from human manipulation to robot manipulation is a problem that we haven't yet solved, which I think is going to be very important in really solving robotics. So while humans cannot scale that fast, robots can actually scale. We can build 50,000 robots. It's just a question of money. And then if robots can do autonomous collection, then that's going to scale faster than with supervised learning. So how can robots do autonomous collection? One way is using these foundation models to collect data zero-shot or with a little bit of fine-tuning, something like Code as Policies.
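For readers who want a concrete picture of what a teleoperated demonstration dataset like this might look like, here is a minimal sketch of one episode and how it could be turned into supervised training pairs. The schema and field names are illustrative assumptions, not the actual RT-1 data format.

```python
# Minimal sketch of a teleoperated "episode" used for supervised learning
# (behavior cloning). Field names are illustrative, not the RT-1 schema.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Step:
    image: bytes                  # camera frame at this timestep
    proprioception: List[float]   # joint, gripper, and base state
    action: List[int]             # discretized action chosen by the teleoperator

@dataclass
class Episode:
    instruction: str              # e.g. "pick up the apple"
    steps: List[Step]
    success: bool                 # label assigned after the attempt

def to_training_examples(ep: Episode) -> List[Tuple[dict, List[int]]]:
    """Turn one episode into (input, target) pairs for behavior cloning."""
    if not ep.success:
        return []                 # many pipelines train only on successful demos
    return [
        ({"instruction": ep.instruction,
          "image": s.image,
          "state": s.proprioception},
         s.action)                # the model learns to predict the demonstrated action
        for s in ep.steps
    ]
```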
Nathan Labenz: (14:35) So let's build up those layers in a little bit more detail, starting with the supervised learning, the 130,000 episodes. So if I understand that correctly, you've got essentially a kitchen, which is a lab, on the Google campus, and you've got robots that, as of the time that you're collecting these episodes, are not AI-powered. They are instead remote control powered. So somebody sitting there with a PlayStation controller going around and picking up napkins and whatnot with the robot. Is that actually a reasonable picture of what's happening, a PlayStation controller type of interface?
Keerthana Gopalakrishnan: (15:17) Yeah. We have Oculus controllers for these robots, and then we have multiple mock kitchens, which are called robot classrooms. The robot goes to a classroom and learns some things. And then once they can reasonably do things in the classroom, they're brought to our actual kitchens to do stuff.
Nathan Labenz: (15:35) I just did the most naive math. I divided 130,000 episodes by the number of workdays in the year and a half that the project took, and I got 367 episodes collected per day. So if I'm envisioning this right, it's as if you must have, I don't know, 20 people who have been operating the robots and doing the actual VR to robot housework and collecting all the data. Then I understand also that the robot sensor readings, sort of robot proprioception if you will, are also being recorded at each timestamp along the way. And once all that is done, now you have the reasonably big dataset for supervised learning. So then your inputs would be the task or the command or the instruction plus the imagery that was seen at that given time plus the state of the robot at that given time. And then the model is predicting the next action. How many data points does that translate into? Because each episode presumably has, I know you're running multiple inferences per second, so do you have a sense for what that 130,000 episodes would translate into in terms of example predictions?
Keerthana Gopalakrishnan: (17:07) Yeah. So it depends on the median task length, right? And it also depends on which tasks. For example, picking is something that is fairly fast, and I think it's about 30 steps. So you have 30 actions to pick an object, and something like opening doors is a longer episode because you have to approach the door, grab the handle, then open the door. So let's say we put the median around 40 or 50, say 50 steps per episode, and then each step has around 8 tokens plus a few more, let's say 11 tokens. So that's how many per episode, and then you multiply that by the number of episodes. It's still way, way, way smaller compared to language and image datasets. So we need to scale both by autonomous collection as well as by transfer.
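Taking the numbers above at face value, a rough back-of-the-envelope estimate of the dataset size in tokens looks like this. The per-step and per-episode figures are the approximate ones from the conversation, not exact dataset statistics.

```python
# Rough token count for the RT-1 dataset using the numbers mentioned here:
# 130k episodes, ~50 steps per episode, ~11 tokens per step.
episodes = 130_000
steps_per_episode = 50     # rough median; picking ~30 steps, opening doors longer
tokens_per_step = 11       # action tokens per timestep

action_tokens = episodes * steps_per_episode * tokens_per_step
print(f"{action_tokens:,} action tokens")   # 71,500,000, i.e. roughly 7e7

# Web-scale language corpora are on the order of hundreds of billions of
# tokens, several orders of magnitude larger than this.
```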
Nathan Labenz: (18:10) Okay, cool. So we've got the foundation of all this manually collected data and now tell us kind of the outlook for the self-collection. I do this with language models where I'll try to get it to do a task. It sometimes does it, it sometimes can't. And then I basically siphon off the data that is good, that shows successes, and feed that back into my fine-tuning and I get better that way. Probably a lot of our listeners are familiar with that general cycle. Is that basically what you're doing also in robotics? Or what is the robotics twist on that?
Keerthana Gopalakrishnan: (18:50) So right now, our system has high-level and low-level control. The high-level control runs at a lower frequency than the low-level control, which is much faster. The high-level control is from the paper called SayCan, where a language model is deciding how to plan, what tasks to do in sequence. So imagine something like "bring me a coffee." It'll be: go to the kitchen, find a coffee cup, pick up the coffee cup, place it on the counter, and then go to the coffee machine, press a button. So there's a series of steps in natural language to achieve a small task that you tell the robot to do, which is "go get me a Coke can from the fridge" or something. So there is a language model which is planning how to do the series of tasks. And then there is the Robotics Transformer, which is executing each of the smaller tasks.
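A hedged sketch of that two-level split: a planner chooses the next short skill in natural language, and a low-level policy executes it. The function names and the trivial fixed plan below are stand-ins for illustration, not the actual SayCan or RT-1 interfaces.

```python
# Sketch of hierarchical control: an LLM-style planner picks the next skill,
# a low-level policy (like RT-1) executes it. Both are trivial stand-ins here.
PLAN = [
    "go to the kitchen", "find a coffee cup", "pick up the coffee cup",
    "place it on the counter", "go to the coffee machine", "press the button",
]

def llm_choose_next_skill(instruction: str, history: list) -> str:
    """Stand-in for the planner: in SayCan, an LLM scores each candidate skill
    by usefulness and an affordance model scores feasibility; here we simply
    replay a fixed plan to show the control flow."""
    return PLAN[len(history)] if len(history) < len(PLAN) else "done"

def execute_skill(skill: str) -> bool:
    """Stand-in for the low-level policy running closed-loop control."""
    print(f"executing: {skill}")
    return True

def run(instruction: str, max_steps: int = 20) -> None:
    history = []
    for _ in range(max_steps):
        skill = llm_choose_next_skill(instruction, history)
        if skill == "done":
            break
        execute_skill(skill)
        history.append(skill)

run("bring me a coffee")
```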
Nathan Labenz: (19:46) I'd love to just hear more examples of what the tasks are and maybe your overall kind of impression of where the robots are today in terms of, are they practically helpful? If you had one at home, would it be worth it to have one at home at this point, or is it still more trouble than it's worth? How close are we to actually having something that would be useful in the wild?
Keerthana Gopalakrishnan: (20:13) So it's not in the wild yet, and that's one thing that we are working towards. But the thing is, scaling from 5 to 500 tasks is quite hard, but scaling from 500 to 5,000 tasks is easier, and 5,000 to 50,000 is going to be easier still. Each of them is a different class of problems, right? Towards the end, you come to more scaling and distribution and inference-at-scale problems, but initially you have algorithmic problems. And I feel we more or less have the algorithmic problems down. So the tasks in RT-1 are mostly pick object, place object, knock objects over, open cabinets, take things out of drawers and put them on the counter, also open and close fridges, basically things that you can do in a kitchen. And we also tried it in various Google kitchens. So yeah, if you take it to a new Google kitchen and make it run there, I think it should work. But if you bring it to your house, I doubt it will work, because all the images would look very out of distribution. I don't think it would generalize that much, but Google kitchens are still somewhat similar. Things are at similar heights and stuff, so it should sort of work. So then there is the question of generalization, right? And also how many objects or how many scenarios can we generalize to? With the initial RT-1, we wanted to focus on skills. So we will do a lot of skills, but a few objects. So we only had 17 objects. Now 17 objects is too small, right? You have millions of objects in the real world, and in order to be realistic, you need to be manipulating millions of objects, which was the focus of our recent paper on open vocabulary object manipulation. The idea there is: imagine you can do 17 objects and then you collect a little bit of data on about 100 objects. Generalizing from 17 objects to any object is quite hard, but generalizing from 100-something objects to any object is a slightly easier problem. So what we did was use a visual language model that can do a lot of zero-shot detection, right? Right now, visual language models can differentiate between your faces. They can say, "Hey, this is a bottle, or this is a phone," zero-shot, just from an image. So can we use that information to go manipulate that object? And it sort of works reasonably well.
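To make the open-vocabulary idea concrete, here is a minimal sketch of conditioning a manipulation policy on a zero-shot detection rather than on a fixed, closed set of object classes. The detector, policy class, and conditioning scheme are illustrative assumptions, not the actual system.

```python
# Sketch: a zero-shot detector localizes an object named in free text, and
# the low-level policy is conditioned on that detection instead of a class ID.
from typing import Tuple

def detect(image, query: str) -> Tuple[float, float, float, float]:
    """Placeholder zero-shot detector: returns a normalized bounding box
    (x0, y0, x1, y1) for the queried object, e.g. 'the blue water bottle'."""
    return (0.4, 0.3, 0.6, 0.7)   # dummy box for illustration

class Policy:
    def act(self, image, target_box):
        """Placeholder policy conditioned on where the target is in the frame,
        which is what lets it generalize to objects it never saw in training."""
        cx = (target_box[0] + target_box[2]) / 2
        cy = (target_box[1] + target_box[3]) / 2
        return {"gripper_target": (cx, cy)}

def pick(image, object_description: str):
    box = detect(image, object_description)   # works for novel object names
    return Policy().act(image, box)

print(pick(None, "the green chip bag"))
```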
Nathan Labenz: (22:40) What are the high end tasks? You gave verbs of knock, lift, open, grasp, et cetera. And then you also talked about the longer time horizon planning, which I mentioned is still pretty short. It sounds like the things are tasks that would be 30-second tasks for a human. But how far is that currently going? Could the robot, for example, actually make and deliver a coffee with actual hot coffee in it?
Keerthana Gopalakrishnan: (23:14) So we don't let them handle hot objects yet, and they're also not good at operating machines yet. But that's just because we didn't collect data on it; it's something fairly doable that you can learn. They probably shouldn't be handling hot coffee, though. Imagine you're carrying a hot coffee and there's a kid nearby, something happens and you splash it, that's not good. Or you just pour it over yourself and then you break your electronics. So if you think about tasks in a kitchen, let's say when you're cooking and stuff, you only actually do a finite number of skills, but you are able to combine them in various fashions. Let's say you have 100 skills, which is pick vegetables, cut vegetables, wash them, put them on plates, and so on. It's only a finite number of skills, but then, with around 100 skills, you can combine them in various ways. So it would be like 100 choose X, where X is the number of skills that you want to chain in order to do a high-level sequence. So something like "make an omelet." That's, I don't know, open the fridge, take an egg, break an egg. Even with a finite set of skills, you can still do a lot of tasks in the real world, and a lot of high-level instructions, by combining them. We still have a lot of skills to go. I think with RT-1, the objective was how do we get an algorithm that scales? If you look at the scaling plots in there, task diversity is very important. RT-1 doesn't seem to saturate with the amount of data you throw in. In fact, it actually seems to get much, much better, with emergence at higher scales and higher diversity. And you can also increase the size of the model a lot to fit the data. Imagine that you have to build a very generic robot manipulator. You need data and you need a data absorber. In our office, we call it a sponge and a fire hose. RT-1 is a very good sponge, and we are also building better sponges, which is, I don't know, VLMs, foundation model transformers, right, which can also transfer internet-scale generality into robot manipulation. So visual language models as manipulators. Have you seen the PaLM-E paper? Something like that. So now imagine that you do have a data sponge that you can put a lot of data into, and then it just learns and does those things in the real world. Now the question is, how do you build a fire hose to really pump it with a lot of data? That's one of my main projects at Google: how to scale autonomous collection, but also how to mine a lot of data from the internet.
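A small illustration of the combinatorial point about composing a finite skill library into many tasks. She frames it as "100 choose X"; if the order of skills matters, as it does when cooking, the count is the number of ordered sequences, which is even larger. The numbers below are only illustrative.

```python
# How many distinct ordered sequences of k skills can be drawn from a library
# of n skills (ignoring which sequences are physically sensible).
from math import perm

n_skills = 100
for k in (2, 3, 5):
    print(k, perm(n_skills, k))
# k=2 -> 9,900 sequences; k=3 -> 970,200; k=5 -> roughly 9.0e9
```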
Nathan Labenz: (26:16) Tell us more about that. I'm understanding from your commentary that it sounds like you are kind of trying to base this on video of humans doing stuff, which this is always kind of an interesting pattern as well, where a lot of times the hardest part is figuring out how to cast the problem in such a way that you actually have or can create or can sort of massage existing data into form where it can actually power the training paradigm that you want to power. So obviously, there's a lot of data out there, whether it's NBA basketball games or how-to stuff of all kinds on YouTube. I guess a couple questions that come to mind there. One is there's so many different possible forms of robots that if I was thinking about this naively, I would think, "Okay, maybe I need," does video work? Or maybe I need a first person POV GoPro type of view to make this work because that might be a little bit more analogous. Boy, there's so many different possible forms of robot. Obviously, you'd want this foundation model to be able to adapt to all kinds of different form factors or embodiments, if you will. So I guess those are my two questions. In the pre-training, do you think that the third party point of view, like the NBA basketball camera from the side view, is going to be enough to develop the conceptual semantic understanding that you need? Or is there going to be kind of a need for more specialized point of view type thing or something else? And then second, is this kind of inevitably shaping up to be a two-stage thing where they're sort of first building the representations of what the humans are doing and then later a bunch of adapters to specific forms? What am I getting right and what am I getting wrong as I guess what that might look like?
Keerthana Gopalakrishnan: (28:28) So some of the things you can actually transfer, right? Let's say I see my mom cooking an omelet from third person view, and then I can try to cook an omelet and I'm mostly successful. But if I see Kobe Bryant playing basketball and then I try to play at that level, I'm not going to be successful, even though I have a very good modeling of my own body. So I think it's going to be similar like that. A lot of tasks, whatever's easier, you can learn by third person watching, but there still are tasks that you can only do by actually practicing and improving your own control model. We have some good work coming, I don't want to speak about it before it's published, on transferring between human and robot data and also training larger VLMs for control.
Nathan Labenz: (29:16) Just the one that came out this week, PaLM-E, was pretty amazing. The big thing is that the E stands for embodiment, right? So we're taking Google's kind of signature language model, which has been adapted to seemingly every domain already (I go on and on about MedPaLM in other conversations), but this is PaLM-E. So kind of sitting it at the center of this robot system and having it do the kind of reasoning and planning, pretty amazing. Definitely picks up on another theme too that we've talked about a few times, which is model-to-model communication. If I understand correctly with PaLM-E, the kind of auxiliary models are trained to inject embeddings into the language latent space directly without going through language itself, right? So there's this kind of really interesting connectivity that's starting to happen across these models. That's something that we also talked to the BLIP authors about, and that's been a technique that they have used in their most recent paper as well. Translating these predictions to actual action in space and accomplishing stuff, that's very particular to robotics, obviously. Can you kind of talk us through that half of the equation?
Keerthana Gopalakrishnan: (30:45) I think the key to building a foundation model would be data interoperability. And if you look at PaLM-E, it converts images into tokens, language into tokens, and actions into tokens. And once they are tokens, then a transformer can operate on all of them. And if you really think about it, the foundations of thought are similar whether you are acting in language space or vision space or action space. And to get a globally optimal solution, you need to think in all of these spaces, right? If you are a person who's cooking or cleaning or walking your dog or your kid, you are thinking in language, in vision, and also with your body. So any solution that would be globally optimal will be fully multimodal. And I think eventually it would be one model to rule them all, to do all the modalities together. And specifically, I think one of RT-1's innovations is tokenizing actions. At any given point in time, at least for the EDR robot, we have 11 variables. One to terminate the episode or not and to decide between arm and base control, and then around seven variables for arm control. For the arm, we are basically only deciding where to put the end effector. So, end effector position, end effector rotation, and how much to close the gripper. Three variables for position, three variables for rotation, and one variable for gripper closure. And then for the base, X, Y, and the angle. So three variables for the base, seven variables for the arm, and one variable to decide whether to control the base or the arm and whether to terminate the episode. Once you make your actions tokens, then they're just like vision tokens or language tokens, right? It's very radical. When ViT came out, people were like, oh, this is not gonna work, because how can you convert an image into a sequence of tokens like language? Language is a sequence of characters. But then it works, and it was able to bring us large multimodal VLM models, CLIP, even Stable Diffusion, DALL-E and everything, which are built on top of CLIP-like models. So then the question becomes, can you also add actions in there as tokens? Now actions look exactly like language and it's just a question of predicting 11 tokens. And you can do that for any robot. Maybe another robot has more degrees of freedom. Imagine a humanoid or something, and then you have a lot more variables. So each degree of freedom can be one extra variable for the transformer model to predict. And then it just becomes, can you predict the XYZ sequence? That's how I think actions are going to be tokenized and converted into data for transformers.
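Here is a hedged sketch of action tokenization as described: each of the 11 action dimensions is discretized into 256 bins, so an action becomes 11 integer tokens that a transformer can predict just like text. The dimension ordering and per-dimension ranges below are illustrative assumptions, not the actual RT-1 values.

```python
# Sketch of discretizing an 11-D continuous action into 11 tokens in [0, 255].
import numpy as np

ACTION_DIMS = 11   # [mode/terminate, dx, dy, dz, droll, dpitch, dyaw, gripper,
                   #  base_x, base_y, base_yaw]  (ordering is illustrative)
NUM_BINS = 256

# Illustrative per-dimension bounds that continuous values are clipped to.
LOW  = np.array([0.0, -0.1, -0.1, -0.1, -0.5, -0.5, -0.5, 0.0, -0.3, -0.3, -0.5])
HIGH = np.array([2.0,  0.1,  0.1,  0.1,  0.5,  0.5,  0.5, 1.0,  0.3,  0.3,  0.5])

def tokenize(action: np.ndarray) -> np.ndarray:
    """Map a continuous 11-D action to 11 integer tokens in [0, 255]."""
    a = np.clip(action, LOW, HIGH)
    frac = (a - LOW) / (HIGH - LOW)
    return np.minimum((frac * NUM_BINS).astype(int), NUM_BINS - 1)

def detokenize(tokens: np.ndarray) -> np.ndarray:
    """Map tokens back to the centers of their bins."""
    frac = (tokens + 0.5) / NUM_BINS
    return LOW + frac * (HIGH - LOW)

example = np.array([1.0, 0.02, 0.0, -0.05, 0.0, 0.1, 0.0, 0.8, 0.0, 0.0, 0.0])
print(tokenize(example))
```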
Nathan Labenz: (33:59) So at each step of inference, the kind of core language model, which you're projecting out in the future might just be literally one big model that takes everything in and it's one huge black box. But for now, we kind of have some auxiliary models that do the vision part and feed that into the language. I looked at the PaLM-E paper. There's even a couple different types of vision model, object segmentation, general scene description, depth mapping, all these kind of different things come in together into the language model along with, of course, what are we trying to do here? What did the human ask us to do? And then what comes out the other end is 11 values. And was it either-or? You're either going to do something with the arm or you're going to do something with the base?
Keerthana Gopalakrishnan: (34:59) So RT-1 is like that, but our next models are going to be whole body control, which means that you move both of them together. If you really think about how you do your control, you are not either moving your arm or your head separately, right? At any given point, you move them together. So yeah, that's going to be fixed.
Nathan Labenz: (35:22) So there's kind of a couple, I'm understanding a couple of different paradigms here or ways that this could work. One would be that the language model says, okay, based on what I'm seeing and what I'm trying to accomplish, I want to move the base to position X and Y. Okay, cool. We got there. Now, next time, I want to move the gripper to position X, Y, Z and have a certain angle and have a certain open-close. How does the issuing of that command relate to the cycle time? Because I would assume that you can only accomplish so much before you're going to be running the whole process again.
Keerthana Gopalakrishnan: (36:02) So that's the action bound, right? The language model basically predicts a token, and the token is a value between 0 and 255, one of 256 bins. Each of those numbers corresponds to a certain bound, from no action up to go forward or backward. The language model is still predicting within a bounded space of action that you can go to or not. And in RT-1, the inference time was 100 milliseconds and then the full stack time was 300 milliseconds. So that's 3 hertz control. Humans are a lot faster than that. And yeah, we know that we need to optimize, but these are research robots, which is why the stack is slower; production robots would be a lot faster than that. And also, once the language model says go to XYZ position or whatever, we are doing concurrent or non-blocking control. What that means is that after the language model tells the robot to move to that position, the next cycle of inference does not wait until that action is executed to start the next step of inference. So there is a thinking cycle which is going and then there is an executing cycle which is going. That is also how you do it, right? You don't wait until, let's say, you grab something to think about what the next thing to do is. So thinking and acting are happening simultaneously. And it is also important for robots to be fast, because there are moving objects. Let's say your kid throws you a ball, you need to be dynamic. And if your control is too slow, then you're not gonna catch the ball. So that imposes certain limitations on the latency of the whole system. One of the ways in which we tackle that is by using TokenLearner, which is really compressing the tokens from the pretrained EfficientNet so that the inference times are faster. And that cuts the inference time by a third. And also in language models, everyone is excited about really big models, and scale comes with emergent properties and large reasoning abilities, but that also makes inference slower. So they are sort of competing objectives. We want fast control, but we also wanna fit as much information, as much context as possible. Definitely in the future, our inference hardware and all of that is going to be much faster and optimized for transformer inference. But right now there are still those competing objectives.
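A minimal sketch of the concurrent, non-blocking control described here, where inference for the next action overlaps with execution of the current one. The timings roughly follow the numbers mentioned (about 100 ms of model inference within a roughly 300 ms stack); the robot interface is a stub, not the real control software.

```python
# Sketch of non-blocking control: the "thinking" (inference) cycle runs while
# the "acting" (execution) cycle is still in progress.
import threading
import time

def infer_next_action(observation):
    time.sleep(0.1)                  # ~100 ms of model inference
    return {"arm_delta": [0.01, 0.0, 0.0]}

def execute(action):
    time.sleep(0.3)                  # the motion itself takes longer than inference

def control_loop(get_observation, steps: int = 5):
    action = infer_next_action(get_observation())
    for _ in range(steps):
        # Start executing the current action...
        worker = threading.Thread(target=execute, args=(action,))
        worker.start()
        # ...and, concurrently, think about the next one from a fresh observation.
        action = infer_next_action(get_observation())
        worker.join()                # wait for the motion before issuing the next command

control_loop(lambda: {"image": None, "state": None})
```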
Nathan Labenz: (38:53) Okay, that's really helpful. And I think likely very clarifying for a lot of people. If I understand correctly there too, the command, I guess, the predicted command of go to XYZ, that's explicit enough, I gather, that it can be sent directly to the robot control system for execution. But then with the latest paper, PaLM-E, it sounds like there's a bit of a different architecture where it's starting to, instead of saying go directly to X, Y, Z, it's giving slightly higher level commands that are then received by and translated into actual action or motion by another model within the same system. So can you describe that version as well? And then maybe the pros and cons. Why is that just a strictly better approach or do they have trade-offs?
Keerthana Gopalakrishnan: (39:49) Yeah. So actually, PaLM-E and RT-1 work together. PaLM-E is the updated SayCan, which is converting high-level language into low-level language. So converting between how to go get coffee and then planning that with feedback from the system and everything. And then RT-1 is doing the little things like pick up the coffee cup. I think having two models and optimizing both of them synchronously is suboptimal compared to having one model do planning, language generation, and control. And I think that is the way that we will go in robotics. You'll see future papers from us which go in that direction. So PaLM-E is more at the SayCan level, reasoning about the environment, but it is also doing affordance, it is doing scene feedback. And RT-1 is doing the exact actions.
Nathan Labenz: (40:50) PaLM-E, as the successor to SayCan, will take in everything that's going on, which includes my having told it to get a cup of coffee and its knowledge of its position and state and its visual input. And then it will translate that to a low-level command that is like grab the cup. And then RT-1 takes that input in and translates that to X, Y, Z coordinates for the gripper. What does sound nice about that architecture is you could have RT-2, 3, 4, 5, and 6 for different kinds of robots. But RT-1 already handles multiple different types of robots as well, right? So is that just a variable within the bigger thing? What kind of robot arm you're actually controlling? Is that just another token that's in the input? Or how does it know which embodiment it is controlling at any given time?
Keerthana Gopalakrishnan: (41:53) So it has another input. So when RT-1 was jointly trained with Kuka and with EDR, it was told where the data came from and the action spaces were also sort of mapped together. Yeah, so one way that you could compose one model on multiple robots is using one token to determine which robot you're inferencing on. So let's say, imagine if you're running a robot dog and then you have one token saying this is a robot dog, which means you predict maybe 24 variables versus if it's the everyday robot that we are doing, then just predict 11 tokens. That's one way where you can still mix a lot of different data from a lot of different robots and then also have transfer between them while allowing it to inference on a lot of different robots.
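A hedged sketch of that embodiment-token idea: one token tells a shared model which robot it is controlling and therefore how many action tokens to decode. The dimension counts follow the conversation (11 for the everyday robot, maybe 24 for a hypothetical robot dog); the model interface itself is an illustrative stub.

```python
# Sketch: condition one policy on an embodiment token so it knows which
# action space (and how many action tokens) to predict.
EMBODIMENTS = {
    "everyday_robot": {"token_id": 0, "action_dims": 11},
    "robot_dog":      {"token_id": 1, "action_dims": 24},  # hypothetical example
}

class DummyModel:
    def decode(self, tokens, num_tokens):
        """Stand-in for transformer decoding; returns dummy action tokens."""
        return [0] * num_tokens

def predict_action(model, observation_tokens, embodiment: str):
    spec = EMBODIMENTS[embodiment]
    # Prepend the embodiment token, then decode exactly as many action tokens
    # as that robot's action space requires.
    prefix = [spec["token_id"]]
    return model.decode(prefix + observation_tokens, num_tokens=spec["action_dims"])

print(predict_action(DummyModel(), [42, 7], "robot_dog"))
```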
Nathan Labenz: (42:49) One of the videos that I thought was most striking was an example of where there's these sort of exogenous shocks to the robot's plan. For example, there's one video where the robot is picking up a bag of chips out of a drawer and then the human comes in and knocks the bag out of its hand and puts it back in the drawer. And so it's a demonstration of the fact that the language model, especially as it's informed by the visual inputs, obviously it's not having these conscious thoughts of like, oh, somebody knocked this thing out of my hand, but it is realizing the bag is over there and I need to go back and get it to accomplish my objective. That's a pretty striking video. There's one with pizza being made as well and sauce being kind of smoothed around on the pizza and the person comes and moves the pizza and the robot recognizes that and continues to apply the pizza sauce appropriately. I was like, boy, that's pretty awesome. Then I was also thinking, what is the limit of how much perturbation I would want a robot to sort of push through in the real world. I have a four-year-old kid and a two-year-old kid and another new baby coming in 25 days. It's definitely a very common occurrence in our house that I'm trying to do something and then there's a shock of a kid entering the scene and messing something up or whatever. How are you guys thinking about sort of, okay, this is sufficiently big of a shock that the model did not expect, that maybe it's time to just stop. I should not pursue this goal anymore because something is happening that is out of distribution and I don't know what to do with it.
Keerthana Gopalakrishnan: (44:50) These are ethics questions. It's not a question of can you do it, but should you do it? It's very similar to safety questions. And here, our straightforward approach is to borrow from AI safety and AI alignment research. One thing is we're running a lot of autonomous operation at scale, where the robots run around and propose what tasks to plan and practice, and then autonomous policies attempt those tasks. But the question becomes, imagine the robot is roaming around people's desks. It should not attempt to pick up their personal belongings. Or if people are sitting in the Google microkitchen, it should not come near them and try to disturb them or propose any task that is harmful for them. For example, one day there was a robot that was planning tasks, and it detected that there was a phone in my hand because I was recording it. It proposed, "take the phone and pick it up and put it on the table." But the phone was in my hand as a human, and I would not appreciate it if it just grabbed the phone from my hand without any consideration for how humans behave. Before you take my phone from my hand, firstly, you need to understand that the social norm is to not grab people's phones from their hands. Secondly, you would ask for these things. So there are a lot of HRI (human-robot interaction) components in there, which is, how can robots be nice and polite even in social interactions? We have a group that is working on HRI research, thinking about how to make the robots polite, how to make them reason contextually about these things. For example, in this failure mode, the robot detected a phone in the scene and proposed a task, but it did not detect the relational aspect of the phone being in my hand and the fact that I was holding it. If we have better scene description, I think a language model would be able to reason that the phone is in the hand of a person, so don't do that. Currently our approach to safety is asking a language model, should you do this or not? We also have harder bounds for things like collision. It would not do tasks that actually have collisions, even if the language model plans everything. It's still bounded. It should not be dropping objects or colliding into things. So there is a hard bounding from a control perspective, which is just don't run into anything. Then there is the question of asking a language model, is this a safe task to do? Is this a nice task to do? Or if not, can you explain to me why this is something that you should not attempt to do? Our robots right now have only one hand. When one is given a task that requires two hands, then it should say, "I'm not able to do it because I don't have two hands." Or if it's asked to lift a heavy object that's above its payload, it will say, "this is unsafe for me to do because it's a heavy object." Or if you say pick up the thing with the hot coffee, it'll say, "this is a very hot object that I should not be handling. Maybe you do it." So now we come into the realm of collaboration between a human and a robot, where the robot is closing the loop itself. The robot knows what it should do, what it shouldn't do, what it can do, what it can't do, and also how to be nice while doing these things. Once it can classify these things, the question is, you should still be able to get the coffee to the human, even if it's hot coffee. But if you are not able to handle that, if you think you're not safe enough to handle that, then you need to replan and reason about how to still get that task done.
One way could be that you can chat with the robot. Our robots have a chat interface. You can tell the robot back, "Hey, I have the hot coffee in there. Can you come help me pick it up?" So that is the realm of collaboration or intervention. I think deploying robotics in the wild is not going to be with fully autonomous policies, but with autonomous policies plus interventions. Similar to how self-driving cars are, right? They are good at handling autonomy and can handle 95, even 98% of cases. But for those 2% of cases, or even 1%: if you run control at 10 Hertz and you have 1% mistakes, you are making a mistake every 10 seconds, and that's highly unsafe. But what the self-driving cars are good at is knowing where they're good and where they are bad. And they can ask a human for help when, let's say, the map doesn't make sense, or there's a person there asking the cars to go around. Similarly, I think the way that robots will be deployed is they know their bounds of operation and they can also ask a person for help when they are confused or when they can't do something. And they also have hard bounds around them, common sense things, don't hit a child or don't run into things. And even if a language model somehow plans to hit the child or whatever, then you still have safety at the robot control level, which is not a machine learning model; it's deterministic control that says don't run into objects.
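A hedged sketch of the layered approach described here: a language-model-based check on whether a proposed task is safe and socially appropriate, plus a deterministic hard-bounds check that applies regardless of what any learned model says. The prompt logic and checks below are illustrative, not Google's actual implementation.

```python
# Sketch of layered safety: an LLM-style check for task appropriateness,
# then a deterministic control-level check that cannot be overridden.
def llm_safety_check(task: str, scene_description: str) -> bool:
    """Stand-in for asking a language model: 'Given this scene, is this task
    safe and polite to attempt? If not, explain why.'"""
    if "in a person's hand" in scene_description and "pick up" in task:
        return False            # e.g. don't grab a phone someone is holding
    return True

def hard_bounds_ok(planned_motion: dict) -> bool:
    """Deterministic check, independent of any learned model: reject motions
    that would collide or exceed payload or force limits."""
    return not planned_motion.get("collides", False)

def maybe_execute(task: str, scene_description: str, planned_motion: dict) -> str:
    if not llm_safety_check(task, scene_description):
        return "refuse and explain, or ask the human for help"
    if not hard_bounds_ok(planned_motion):
        return "abort at the control layer"
    return "execute"

print(maybe_execute("pick up the phone",
                    "a phone in a person's hand",
                    {"collides": False}))
```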
Nathan Labenz: (50:37) That sounds like, for one thing, kind of another Herculean labor to collect all that data, because that's another level that doesn't really exist, right? Or maybe you're in the process of collecting it now, but there's not really a lot of examples, certainly out there on YouTube or whatever, of robots running into situations where they determine that they can't do the task.
Keerthana Gopalakrishnan: (51:02) Definitely, I think that we will get a lot of transfer from language models, all the work that people are doing for alignment and safety of language models, and now with GPT-4 or multimodal models: don't generate harmful output, don't do harmful things when asked. So that's definitely going to feed directly into our research, but we also need more information about the specific failure modes of our data, of our problem. And once it is close to productization, it's like how Waymo or self-driving companies have their simulators to test for safety-related things. They also have humans in the loop, constantly testing the hardware. So I think all of those things will come in. But robotics especially has a particular imposition on safety, which is that language models can be wrong on the internet, right? You ask ChatGPT something, it goes wrong, no problem. Or it says something weird, no problem. But if you make a bad plan, if you drop an object, you break things, it costs money, and you might injure a person or destroy an object. So there is a higher penalty of failure on robotics models, which is going to be very important towards their safe deployment. I don't think that this is an impossible problem to solve. Look at self-driving cars, they're already making money in San Francisco. So it's just a question of us doing it carefully with humans in the loop and then rolling it out in phases. Initially everyone was like, "self-driving cars would never work because they can never solve 100% of the problems. There's always going to be a long tail and whatever." But you don't have to solve the long tail. You can solve as much as you can and then you can make humans solve the rest. So you're basically doing intervention. Assume that the robot already works, start making money today, and then keep solving the problem and reduce the intervention rate. So then that becomes your whole problem. I think bringing robots to people's houses will work the same way: deploy today with whatever capability you have, use humans to fill the gap in your algorithms right now, and then as you go, as you make money, as you keep improving, slowly phase the humans out.
Nathan Labenz: (53:23) It's funny you mentioned the self-driving cars too. I just took a ride in my neighbor's Tesla that has the full self-driving enabled, and I came away from that feeling like, first of all, just pure wow. It really does work amazingly well. And also, nobody seems to be acknowledging it. It's this odd situation. As you said, people say it's never going to work. You can still go online today and find recent articles, probably can find one from today that says FSD is never going to work. But I just rode in one and it definitely is a lot further along than some of the naysayers would suggest. So it sounds like robotics from this conversation is probably in a similar position. I wanted to talk a little bit more about the hard bounds because that also is something that seems, as you just spoke about, much more important in the robotics paradigm than for language models or other systems. One article, just for listeners, that I really recommend, because I could go back to it repeatedly just to kind of meditate on how I work as a human, is this Scott Alexander book review of a book called Surfing Uncertainty. We could put the link in the show notes. But basically he talks about how the human biological nervous system just has a ton of layers. You've got the prefrontal cortex layers that are kind of the highest layers that deal with the highest level of abstraction. And then, obviously radically oversimplifying, but as you go down the nervous system all the way out to the periphery, there's just layer after layer. These are probably not discrete layers, but to some extent there are discrete cells. So there is some amount of discreteness to it. Anyway, whatever. It's not meant to be too literal. But as you get out to the periphery, you have just very low level interpretation of what's happening. And you can do things like remove your hand from a hot stove before there's any conscious thought of the hot stove. So all of that, and I highly recommend that article, all of that to just preface, it seems like robots need something similar, right? Where they need to sort of detect that I'm encountering resistance or the thing that I'm grabbing is deforming in my grasp more than I expected it to or more than it seems like it should. And therefore, I need to back off. I need to slow down. Don't necessarily want to wait for the next inference cycle to withdraw pressure or back off of a collision or whatever. So how is that working today? Is that a non-AI system, more of a classical system that's not AI driven, or is that also something that the models are ultimately going to handle in your view? What does the future of that look like?
Keerthana Gopalakrishnan: (56:26) Let's go back to the example of touching the hot object, right? There is the planning in our body, which goes through the brain and the spinal cord and whatever, and then tells you that you should not touch the hot object. But when you touch a hot object and you reflexively act, you are not going all the way up to run the inference step. This is what I was talking about previously with the various layers of safety. There is one layer, which is the language model telling you, should you do it or not? Don't touch hot objects, don't handle whatever. But there is also the low-level control layer. RT-1 also has its own safety. It will not do highly unsafe things because it doesn't have the data, and the data is cleaned and everything. But also on the robot, at the control level, in just the C++ code, there is a way to say don't do collision-related stuff, don't run into objects. And that's very easy to detect, right? You know the occupancy in space, don't run into things. But beyond just collision, there are various other types of feedback, right? Like tactile sensing. As you said, if you try to pick up a full bottle and it deforms a lot, then maybe don't do it. Or if it spills over, don't do it. Right now our robots don't have tactile feedback on them. The gripper has some pressure feedback. So if it's a very hard object and you squeeze it, you won't be able to squeeze it, and that's reflected in the max torque applied onto the thing. But then there's also softer feedback: to cut vegetables or handle soft objects like tomatoes and stuff, you need finer feedback, which is not there yet. But we can still do a lot. We can still deconstruct safety into various layers with just the existing force feedback.
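A minimal sketch of that low-level "reflex" idea: a fast, non-learned check on force and torque readings that can stop a motion without waiting for the next model inference cycle. The thresholds and signal names are illustrative assumptions, not the actual robot limits.

```python
# Sketch of a reflex layer that runs at the control-loop rate, much faster
# than model inference, using only force/torque readings.
MAX_GRIPPER_TORQUE = 2.0     # N*m, hypothetical limit
MAX_CONTACT_FORCE  = 15.0    # N, hypothetical limit

def reflex_check(gripper_torque: float, contact_force: float) -> str:
    """Decide, without any learned model, whether to keep going."""
    if gripper_torque > MAX_GRIPPER_TORQUE:
        return "stop_squeeze"       # object is harder than expected
    if contact_force > MAX_CONTACT_FORCE:
        return "back_off"           # unexpected collision or resistance
    return "continue"

print(reflex_check(gripper_torque=0.5, contact_force=30.0))   # -> back_off
```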
Nathan Labenz: (58:31) I think it's reassuring that it's not fully a black box, that there is this C++ safety layer that is a hard override. Zooming out just a little bit more still and returning to some of our very early discussion around how we'll interact with robots, what kind of role they are going to play, how fast do you ultimately think we will want robots to be? As I watch the videos, it is notable that they run pretty slow today, certainly much slower than humans. The videos are typically shown at either 4x speed or 10x speed. I've seen both. So they're pretty slow in today's world. Obviously, you'll want to speed them up. But I wonder if you have a sense for what the optimal frequency is for a robot to be operating at in a human environment. There might be a maximum at a certain human-like speed that would be sort of acceptable. If we could run them at 1000 Hertz, I'm not sure that we would even really want to.
Keerthana Gopalakrishnan: (59:39) Frequency of inference is not the only variable for speed. Think of frequency as how reactive you are. Even if you're running inference frequently, if each step you can only move a little bit, then to go from here to there, you're going to run 10 inferences. But you could also go from here to there in 1 inference, right? So there are 2 variables to control here. One is how reactive you are, your inference frequency. The second is how much movement you're allowed in each inference step. Time to complete a task is within 2x or 3x of what a human would take for that task. And that's about optimizing both the machine learning and the robot software, and also optimizing for both reactiveness and speed. We didn't play around with speed a lot because it can be a little bit dangerous; you're trusting the inference a lot. But because these are research robots, we haven't really optimized all of these things a lot. When they come into production, that becomes a more important question. But I think it's fairly solvable, still with existing machine learning models and existing robots, to get at least within 2x of a human. Also, you would be freaked out if a robot were as fast as you at doing things and as good as you. I think it's fine to give them a little bit more time until they ease into society and people are okay with them.
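A small worked example of the two knobs she distinguishes: reactiveness (inference frequency) versus how far the robot may move per inference step. The numbers are illustrative, not measured values.

```python
# Two independent variables determine task speed: inference frequency and
# the allowed movement per inference step.
frequency_hz = 3.0          # roughly RT-1's full-stack control rate
step_size_m  = 0.05         # max end-effector travel allowed per step (assumed)
distance_m   = 0.6          # how far the gripper needs to move (assumed)

steps_needed = distance_m / step_size_m        # 12 inference steps
time_seconds = steps_needed / frequency_hz     # 4 seconds at 3 Hz
print(steps_needed, time_seconds)

# Doubling the per-step bound halves the time without changing the frequency;
# doubling the frequency also halves it, while making the motion more reactive.
```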
Nathan Labenz: (1:01:10) Yeah, totally. I agree with that for sure. The most advanced robot in my home right now is the Roomba and it kind of stumbles its way and crawls its way around the house. It's not too smart and doesn't figure things out too quickly, but it kind of feels appropriate. And we run it overnight, so it's one of these things too where it doesn't really matter if it takes an hour or 2 hours or 5 minutes. It wouldn't be much better for me if it was a 5 minute task for the robot instead of an hour. Honestly, probably wouldn't be any better for me. So yeah, I do think you're right. But it is also interesting to hear that perspective that increasing the frequency of inference really just smooths the behavior as opposed to really changing the behavior. Then you have a separate decision to make around, per unit time, whatever that interval is, how much am I going to move in that unit time? So that's something you can kind of control independently, which is helpful because I hadn't really conceived of it in that way. What are they like to work with? Do they break a lot? Do they need maintenance? Are they heavy? How robust are these robots that you're currently working with?
Keerthana Gopalakrishnan: (1:02:25) So I've worked with space robots. When we were in grad school, we sent a little robot to the moon. I've worked with self-driving cars and I've worked with everyday robots. And I think these are the most stable, at least at the fleet scale they have. And they're fairly robust. The navigation stack and everything is fairly reliable. They don't break as much. They don't run into objects. Everything works. And I think it's because Google is really good at engineering and we also have great ops folks who are keeping everything running, oiling the machine. And yeah, they're fairly robust and fairly good robots so far, I think. They don't break easily and it's very convenient to run experiments on them. We still have a long way to go with solving smaller issues and stuff, but I think they're the best robots that I've worked with so far. And another thing is that you see industrial robots, right? They're also fairly robust, but they're in a very constrained environment. It's a fixed base. It's doing picking or it's doing scanning. But these robots also move around. So compared to the complexity of the domain that they operate in, I think their reliability is really good.
Nathan Labenz: (1:03:42) Do you think that they are the right form factor, ultimately, for use in homes?
Keerthana Gopalakrishnan: (1:03:50) I think form factor is a very, very important question and there's a lot of debate among roboticists on what the ideal form factor is. Initially, a lot of robotics was fixed base. So there were just these KUKA arms and stuff. They would just do things from a fixed location, but this is not very general robotics, which is why we now have arms moving around. But the arm is the most costly part of these robots. So if they have 2 arms, then the robot is going to be almost twice as expensive, and then the economic equation would not work out. So cost is one tradeoff. Up until this point, we were limited by the robot software. If you cannot even do one-hand tasks, then it doesn't justify having multiple hands and legs and whatnot on the robot, right? But now the robot capabilities have increased to a point where we're actually limited by the hardware. We've cracked a paradigm where it's like, throw data at it and then just learn it. So then the question is how much data can you throw, and on which robot should you throw the most data? And I think that if you really think about the ideal generalization, a superhuman physical intelligence, it's probably going to look like a humanoid, right? Because our world is designed for humans. Your kitchen is designed for humans. A car is designed for a 6 foot human, and a cup is the dimensions of what fits in my hand. A lot of the data on YouTube is from a human egocentric perspective with human hands operating. So if you look at it from a design-of-the-world perspective, then humanoids are the most optimal form factor. And if you look at it from a data perspective, take NBA basketball: the agents playing it look like humans, so you probably need a human-like form to play very good basketball, right? So both from a data and a utility perspective, I think humanoids are the ultimate form factor for robotics. But if you really think about it, stability is a very, very hard problem to solve. Boston Dynamics has really, really complex robots, all of them legged, and because stability is so hard, all of your problems become stability problems. While you're picking things up, you're also worried about falling. And if you do fall, you're in danger of breaking parts or injuring other people. So I think one of the reasons Boston Dynamics has little stubs for arms is that they invested in legs too soon, and now all their problems are stability problems. So initially it was one arm on fixed-base robots, then one arm on mobile-base robots, and now we have navigation and manipulation together. I think the next step is bimanual manipulation. A humanoid on wheels is basically 2 arms and a camera at human height, right? It sounds so complex, but it's actually just 2 arms and a camera on wheels. So I think that would be the form factor for a lot of indoor and office applications. Most offices nowadays have elevators and stuff, and if robots can get onto a floor, they can do a lot of things. And then the next step would be legged robots, which adds an additional level of complexity, right? Because imagine that you do build a superhuman intelligence and it goes out to deliver something, but then it can't cross a little curb in front of your door. So you need legs to go upstairs or go past curbs or even to do outdoor robotics.
So a humanoid on wheels with 2 arms would be the next version, then legs, and also fingers. 5 fingers would be another upgrade, because a lot of things you can do with just grippers, but fingers are highly complex. We have so many degrees of freedom in our hands and it's highly complex to plan for, but ultimately a lot of tasks would need finer, dexterous manipulation.
Nathan Labenz: (1:08:13) As you're describing that, I have this kind of image in my mind that's analogous to the human evolution, where it starts with the great ape or whatever, and then it gradually sort of becomes more human over 6 steps. You basically just described the 6 steps that go from one arm to the ambulatory robot. It's a pretty similar evolution, so to speak.
Keerthana Gopalakrishnan: (1:08:38) That's why I'm so excited about robotics, because it's like we're inventing ourselves. Right? It is in many ways a quest to understand us and our intelligence. And it's so hard to put down onto paper how we detect a cup or how we're doing these things or how we're planning tasks. High level things you can at least describe in language, but low level manipulation, those low level motions, you really cannot. Why your arm and leg are moving in a certain way, you cannot explain. You know how software engineers say the best way to learn something is to build it. I think robotics is basically our quest to understand ourselves and build more of ourselves. If you think about GPT and language models, they're doing a lot with respect to scaling intelligence, right? And bringing down the cost of knowledge work. If you think about it, the Industrial Revolution automated a lot of mechanical labor, and our lives are much easier. And now with the intelligence revolution, we're automating a lot of knowledge work, but then truck drivers in the United States make $200,000 because no one wants to drive trucks. Even people who sand boards in San Francisco apparently make $95 an hour, which is comparable to a software engineer. So as the availability of labor changes, once more knowledge work gets automated, physical work is going to be the most valuable. When my friends say, oh, software engineers are getting automated, artists are getting automated, it's Moravec's paradox, right? The things we thought were hard are actually easy for computers, but the things we find easy are really hard for them. For us, composing and editing images the way Stable Diffusion does is super hard, but computers apparently can do that. But something as easy as going to my kitchen and making me a coffee, or taking my dog for a walk, which we take for granted and think is so easy, is really, really hard for computers to do. But physical labor still accounts for a large portion of our GDP, and automating it has a lot of economic opportunity. And I think eventually, if we think of our species really going forward and reaching different planets or expanding ourselves, then definitely solving both AGI in the digital world and AGI in the physical world is going to be very important.
Nathan Labenz: (1:11:12) We had Suhail from Playground, which is an image creation service that people are obsessed with. He has a young child as well, and he said, maybe my kid, who knows, maybe some of their best friends will be AIs. And then we also had Eugenia Kuyda from Replika, the virtual friend app that is all digital today, which people are already falling in love with. And that's become a big challenge for her, because people are doing erotic roleplay in the app and she's like, I might have problems there in any number of ways. So they just kind of dialed that back, and they had a bit of a backlash from users, because people really care about that stuff as they become attached to it. Is there anything, in your mind, blocking the extension of this? When I think of this, I just think of picking up the Cheerios; that's as far as it typically goes in my imagination. But then when I think of Suhail and his kid, or Eugenia and all her users, it's like, boy, are we going to have robots that are playmates for our kids? Are we going to have robot spouses?
Keerthana Gopalakrishnan: (1:12:23) In our lab, on days when I'm driving to work, I book two robots to work with the whole day. You get close to these robots. They have numbers on them, this one or that one. And at the end of one day, I cracked a robot's wrist because I moved a chair against it. You spend the whole day with it, and then you have to file a bug to take it to the hospital-like service they have, where they fix the arm. I felt so bad that I broke the robot, but the robot would never understand that I'm actually sorry. The robot never told me it's okay, but if I hurt a human by accident, they would tell me it's okay and I would feel better about it. That's one thing. You eventually start anthropomorphizing these robots, because now they're also getting smarter with language. I can tell it, let's go on a walk, and that language command is enough for me to take it on a walk. Or I can teach it things, I can talk to it, or it says sorry for doing something. Sometimes we had a little bit of creepy programming on it, where you're working in the kitchen and then it would go up to you and be like, hey, are you my creator? And I'm like, dude, what? When I wrote my bio as mother of robots, it's because one day I was sitting in our robot classroom and I'm like, is the robot better at doing these things, or is my dog more consistent at instruction following? Maybe we should have a face-off between them. And then I'm like, wait, but whoever fails, I'm going to feel so sad, because both of them are my babies and I'm training both of them. That's when I realized they are actually an extension of us. If you think about language models, they are basically condensing all of human history, all of our experiences, and with multimodal models, all the videos that we are collecting, all of our information about the world, all of our more private parts of life, all of our feelings and everything. So, in some sense, they are a distillation of the human mind itself. And not just one human mind, but a human mind across history, a human mind that is also omnipresent in space and time. And when you put it on robots, it's a human mind distilled, but also with physical capabilities. I can see that becoming really, really close to being human. And maybe in the future, once we have BCIs or whatever, I really believe that the future of machines is also the future of us. Imagine that 1,000 years later, our bodies have exactly the same skills as we have today. That would be so disappointing. But I think it's very possible that, with Neuralink and everything, we have BCIs, we have more powerful compute on us, and eventually we merge with machines in a way that they are our progeny, which is also one reason why I'm not that worried about AI taking over the world. And this might sound a little bit defeatist or controversial, but if they are in fact smarter than us, then I would rather be merged with them and have a common future with them. Otherwise it would be like monkeys saying that humans should never have come to be. If they are actually superhuman and so much better, then we should just become one with them.
Nathan Labenz: (1:16:16) Love it, fascinating. I assume that inference is going to have to live on the edge for practical robot deployment. It seems like the latency of going back and forth to the cloud is probably not viable if you want to run at 10 hertz or whatever. So first of all, is that right? And then second, what about on-the-edge fine tuning? Is there a paradigm you see in the future where these robots will learn the layout of my home, to take one very simple thing, or our preferences, so that the sort of final finishing training would be done in deployment via feedback from the customer, the actual homeowner or business owner or whatever the case may be? How do you think that shapes up long term?
Keerthana Gopalakrishnan: (1:17:14) So I used to believe that you needed to do onboard inference for robots, but now, with the larger models like CAMI and Poly that we are running on the robots, we are actually doing inference on a different server and pinging back and forth, and so far the latency is still good enough to do that. I think it would be very hard to run bigger models on the robot, but this is still new technology, so we will have to see how it evolves. Maybe in the future we have chips that can run multi-billion-parameter models on the robot, on device. That would be great. Or we find some way of reducing the latency further; at least right now it is still within operational limits, it's still good enough. With language models, you are now seeing more personalized AI, where you can shape the character of the AI. Even with prompt engineering, or with fine tuning on top of an existing language model, you can customize it to your application. It has various advantages. One is privacy: you don't want to give your data to some large company to put in their training set, but you do want to keep it in your house or your office, and then do a little bit of fine tuning on top to personalize it. So once we build a foundation model for robotics, I definitely think these paradigms will start to happen there too. It has to be fine tuned. Imagine a robot that is in your house. It has to learn the preferences of you and your kids, your schedules, how you like certain tasks to be done.
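As a rough illustration of the latency point, here is a minimal sketch. The function, the fixed per-cycle budget model, and every number in it are invented for illustration; they are not measurements from, or a description of, the actual system.

```python
# Minimal sketch (illustrative only): a sanity check on whether off-board
# inference fits a robot's control budget. A 10 Hz control loop leaves 100 ms
# per cycle; round-trip network latency plus server-side inference has to fit
# inside that, with some margin for robot-side overhead.

def fits_control_budget(control_hz: float,
                        network_rtt_ms: float,
                        server_inference_ms: float,
                        robot_overhead_ms: float = 5.0) -> bool:
    budget_ms = 1000.0 / control_hz
    total_ms = network_rtt_ms + server_inference_ms + robot_overhead_ms
    return total_ms <= budget_ms

# Example with made-up numbers: 10 Hz loop, 20 ms round trip, 60 ms forward pass.
print(fits_control_budget(control_hz=10.0, network_rtt_ms=20.0, server_inference_ms=60.0))  # True
print(fits_control_budget(control_hz=30.0, network_rtt_ms=20.0, server_inference_ms=60.0))  # False
```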
Nathan Labenz: (1:18:58) I noticed that there are, I believe, more than 50 authors on the RT-1 paper. And you've given us a couple of different angles on this already, like the hospital-style service for the robots, which I had never thought of. Can you just describe the team that has to come together to make all of this work?
Keerthana Gopalakrishnan: (1:19:17) Firstly, I want to say that I have an incredible team at Google. One thing I'm realizing in life is that to do great things, you need to stand on the shoulders of giants and also work with really smart people. And my colleagues at Google are the smartest in the world. They are the best in the world at what they do and they are still so humble and so curious. You can challenge them, you can ask them questions, and it's really nice to show up to work and be heard for your ideas, be respected, and work with world class people. The way that we organize these large efforts, we have the big papers that we do and we also have splinter papers. Splinter papers are usually smaller collaborations, intern projects and other things, and big papers are more foundational upgrades in robotics technology that we publish as an entire group. The way this works is someone leads the effort, but then everyone's included, because doing robotics takes a lot of people: everyone who's part of the ops and keeping the robots running, who's collecting the data, people who advise, people who actually implement things and run training and evaluation. So the bigger papers are fairly inclusive, and we try to bring everyone on board to publish the large upgrades in robotics technology. For the smaller ones, it's usually just the people involved, and they have different guidelines and stuff. And at Google, we always lean towards being inclusive and including as many people as we can.
Nathan Labenz: (1:20:55) The role of AI in your personal life right now, are you using any particular products, services that you find exciting and would recommend to people?
Keerthana Gopalakrishnan: (1:21:07) So I use Bard, or even ChatGPT sometimes, to ask questions and have it explain code. Whenever I'm writing new code, especially with web programming and stuff, you are writing a lot of templated stuff, and just asking Bard or GPT is very useful for tail queries that search engines are not that good at handling. That's one AI product. I also like the summarization thing. So, these days in meetings, if we have a large meeting that's running over time, we just take the whole thing, summarize it, and then have these bullet points. But mostly I use AI to code and even make fun poems and stuff. And nowadays, whenever people ask me to write emails, I feel so lazy, because a lot of being obligatorily nice in an email is just filling it with text that you can ask GPT to write. So sometimes, when I get an email where I want to give a short response but it would be impolite to be curt, I'll just put my response in there, ask it to make a bigger email, and then paste that in.
Nathan Labenz: (1:22:19) Secret safe with us. So let's imagine that 1 million people already have a Neuralink. And if you get one, it will allow you to type or create text as quickly as you can think. In other words, you have thought to text. Would you be interested in getting one?
Keerthana Gopalakrishnan: (1:22:44) Absolutely, yeah. Why would I not be?
Nathan Labenz: (1:22:48) Well, you know, people are a little squeamish about holes in the skull. So that's one common objection that we've heard.
Keerthana Gopalakrishnan: (1:22:55) But 1 million people have it, right? So it worked for them. In fact, I don't think that I would wait. I'm an early adopter. I don't think I would wait for 1 million people to get it before I got one myself.
Nathan Labenz: (1:23:07) I think I'm with you, and that's part of why I asked the question. But this has been actually one of our more polarizing questions we get. I've had a couple responses just like yours. I'm like, I don't even need to wait for 1 million. And I've had others that are like, the last thing I'm going to stand for is something that can read my thoughts being physically implanted into my body. And I do kind of get that. Although, personally, I think I'm enough of an enthusiast that I would probably be with you on the earlier wave.
Keerthana Gopalakrishnan: (1:23:38) I'm also big on privacy, so I don't use any private company's products that scan my retina or anything. But if it's running inference on a chip in my brain, that's fine. If it's sending my thoughts to a cloud to run inference, then I would not be okay with that. So, yeah, there are some constraints on how the technology works. I don't know. I'm very curious and very excited about it.
Nathan Labenz: (1:24:04) What would you say are your biggest hopes for, and also fears about, the places that AI may take us?
Keerthana Gopalakrishnan: (1:24:11) Firstly, I think AI and AGI have a lot of promise in both automating knowledge work and bringing a lot of utility into our world. Imagine that humanity is pushing forward. We are making inventions in science, in robotics technology, and in a lot of these things, but we are bottlenecked on intelligence in many of these areas. But imagine that there is an AI scientist that can propose hypotheses, run experiments, write papers, and invent new things. Our world and our technology would move so much faster. My hope for AI in this decade is that we can train an AI scientist, at least for a few things, and we can automate a lot of different types of knowledge work. Software engineers could be much faster, doctors could be much better. And then we are not resource constrained on intelligence in our world. That would be amazing, and the same goes for physical intelligence. But of course, AI is a very foundational technology, very impactful, and it comes with a bunch of risks. We need to do it correctly. One thing that I'm worried about is increasing inequality. If AI is built by large companies and then you use less and less labor, it leads to a certain level of centralization. Even now with language models and stuff, there are very few people who can train models that are billions of parameters, and that increases the existing inequality already. A lot of people are left behind. I'm very concerned about that. And I think that the open source movement is probably going to try to equalize the field. So I'm really cheering on Stability AI and other companies for leading that effort, Hugging Face also. The second thing I'm worried about is AI safety. I'm fairly accelerationist; I'm for building this, and building it fast, but also safely. There is a finite probability that things could go wrong, and so we need to be careful about that. We also need to make sure that we solve alignment to a reasonable extent, so that as we phase this technology into products, we are careful about how it is being used. In fact, I think that productization of AI is going to accelerate not just capabilities research, but also safety research, because ultimately only an aligned AI is a useful product. If it's not doing what people want it to do, and instead it's gaslighting you into leaving your wife or threatening you, it's not a good product, right? So I think it's great that Bing's Sydney went first and showed people what a bad chatbot could be, because that made everyone think about how important it is to build a really good chatbot, and it presents an opportunity to do this right. And I'm very excited for Google to take on that challenge and meet people's expectations. I think we are also approaching the development of AI in a responsible fashion. We want to do this well in addition to being one of the first in the market. So I think safety is one concern, and I think productization and acceleration of AI into products will also accelerate safety research. The third thing I'm worried about is bias in AI, in addition to the existential-risk type of safety. I'm a person of color, I'm bisexual, I'm a female, so I'm a minority in various ways. I know that language models and other models have biases. For example, if you say go to Keerthana, it would not understand, but if you say go to David, it will go to David, because Western names or male names are more represented in the dataset.
So I'm concerned about AI ethics and AI bias. I think it's an important problem to solve in order to bring the benefits of AI to everyone and not just a dominant majority of the population. AI is making everything data driven, and the data is mismatched; we have very different distributions of data for different groups of people.
Nathan Labenz: (1:28:42) Keerthana Gopalakrishnan, thank you for being part of the Cognitive Revolution.
Keerthana Gopalakrishnan: (1:28:47) Thank you for inviting me. I had a lot of fun talking about these topics, and great questions as well. I'm really looking forward to it coming out. It was a very technical podcast, more so than any podcast that I've done, so thank you for that, and I hope that your audience will enjoy it.
Nathan Labenz: (1:29:06) Thank you, and I am sure that they will. Omneky uses generative AI to enable you to launch hundreds of thousands of ad iterations that actually work, customized across all platforms with a click of a button. I believe in Omneky so much that I invested in it, and I recommend you use it too. Use CogRev to get a 10% discount.