The Mind-Reading Revolution with Dr. Tanishq Mathew Abraham (Part 1 of 2)

Nathan interviews Tanishq Abraham, a 19-year-old Ph.D. in biomedical engineering, discussing breakthrough fMRI-to-image research and future applications.



Video Description

In this episode, Nathan sits down with Tanishq Mathew Abraham, 19-year-old UC Davis grad and one of the youngest people in the world to receive a Ph.D., with a degree in biomedical engineering. Tanishq is the founder of the Medical AI Research Center (MedARC), and with his teammates, recently published a paper, Reconstructing the Mind's Eye, which presents their breakthrough research on reconstructing visual perceptions from fMRI scans into images. Nathan and Tanishq talk about the technology behind the fMRI-to-image project, developing the model, and future applications for this research.

Part II with Tanishq will be released in the next episode.

The Cognitive Revolution is a part of the Turpentine podcast network. To learn more: Turpentine.co

TIMESTAMPS:
(00:00) Episode Preview
(05:43) The MindEye Project
(09:06) Resemblance between AI reconstruction of the mind's eye and the visuals presented
(10:00) What is a voxel and which regions of the brain were studied?
(10:23) What would the raw data of a voxel be?
(11:44) Is there a time dimension to voxels?
(15:00) Sponsor: Omneky
(17:50) Goals for the MindEye project
(25:57) What is the starting point of the model?
(31:15) Aligning the model: reconstruction vs retrieval
(40:34) Would doing a full end-to-end training be fine for the reconstruction?
(42:15) The role of a limited data set
(43:09) Training separate models per subject
(45:07) Generalizability with a limited dataset
(47:20) Mapping from one high-dimensional space to another
(50:47) Stable Diffusion VAE encoding
(1:00:50) How long does it take to train the model?
(1:03:14) How similar or different are the subjects and their individual models?
(1:05:59) The future of this research: custom models for your brain?
(1:07:34) How much does this research contribute to brain research and wearables?
(1:11:15) Fuzzing data and future research applications


LINKS:
MedARC: medarc.ai
MindEye Paper: https://www.researchgate.net/publication/371136623_Reconstructing_the_Mind's_Eye_fMRI-to-Image_with_Contrastive_Learning_and_Diffusion_Priors

TWITTER:
@iScienceLuvr (Tanishq)
@MedARC_AI (MedARC)
@CogRev_Podcast
@labenz (Nathan)
@eriktorenberg (Erik)

SPONSOR:
Thank you Omneky (www.omneky.com) for sponsoring The Cognitive Revolution. Omneky is an omnichannel creative generation platform that lets you launch hundreds of thousands of ad iterations that actually work, customized across all platforms, with a click of a button. Omneky combines generative AI and real-time advertising data. Mention "Cog Rev" for 10% off.

MUSIC CREDIT:
MusicLM


Full Transcript

Transcript

Dr. Tanishq Mathew Abraham: (0:00) I think this idea of mapping one latent space to another is a very powerful idea. I think it's always best to try to take advantage of that as much as possible. And the real innovation these days is to be able to use these multimodal spaces as well, and being able to map different things to these multimodal spaces. That's a really exciting area.

Nathan Labenz: (0:21) Hello, and welcome to the Cognitive Revolution, where we interview visionary researchers, entrepreneurs, and builders working on the frontier of artificial intelligence. Each week, we'll explore their revolutionary ideas, and together, we'll build a picture of how AI technology will transform work, life, and society in the coming years. I'm Nathan Labenz, joined by my cohost, Erik Torenberg.

Hello, and welcome back to the Cognitive Revolution. Our guest today and also for our next episode is Tanishq Mathew Abraham. Tanishq is a remarkable talent. He started college at 7 years old, gave his first TED Talk shortly thereafter, and has recently completed his PhD just as he turned 20 years old. He is also the CEO of MedARC, an organization he founded with the goal of translating AI progress to real world medical applications.

If you listen to this show, you know that much of the incredible AI progress we've recently seen has been unlocked by a strikingly consistent recipe. A clever, easy to score objective called a loss function allows the transformer architecture to be scaled up and trained with unsupervised, often web scale data. This approach has repeatedly advanced the state of the art in text generation, image understanding, image generation, and so much more.

Against this backdrop, Tanishq has recently published 2 remarkable papers with 2 different sets of coauthors, which show that there is still plenty of opportunity to produce breakthrough results, even with relatively small datasets and modest compute budgets, by using more thoughtfully designed, problem-specific architectures. The first, called Reconstructing the Mind's Eye, which we'll cover in today's episode, shows that AI can quite literally read minds. Leveraging pre-trained CLIP and Stable Diffusion models, Tanishq and his collaborators are able to take fMRI scan data, collected while a person viewed a particular image, and reconstruct that same image with remarkably high fidelity.

Our conversation doesn't depend on visual aids, but to properly appreciate this result, you absolutely must look at the headline graphic of this paper. Follow the link in the show notes, and I promise you will be impressed. For me, this conversation was not only a chance to learn about some of the latest techniques that leverage the power of pre trained foundation models in small data environments, but also a chance to reflect on the big picture. This result, like so many others we've seen recently, would have felt like science fiction just a couple of years ago. Now with the pace of AI progress so relentless, they often seem to come and go with minimal fanfare.

And just as important, I think Tanishq's work demonstrates the potential for AI to help advance neuroscience in general. Decoding the brain's activity in a noninvasive way holds the promise of helping us understand our own human cognition like never before, potentially bringing an entirely new meaning to the phrase cognitive revolution.

As always, if you're finding value in the show, I would really appreciate it if you'd take a moment to share it with your friends. We recently hit a new high as the number 15 rated show on Apple's technology podcast chart. And while I'll always prioritize depth over reach, it is extremely gratifying to know that this project is helping others make sense of the rapidly evolving AI landscape.

So now, without further ado, I hope you enjoy this conversation with Tanishq Mathew Abraham. Tanishq Mathew Abraham, welcome to the Cognitive Revolution.

Dr. Tanishq Mathew Abraham: (4:17) Thank you for having me.

Nathan Labenz: (4:18) Very excited to have you. You are prolific all of a sudden, not always. For folks who don't know you, you are a child prodigy, I think, fair to say, who's now grown into a young adult AI researcher who is hitting on the level of, from what I can tell, a tenured professor. So extremely impressive young career that you have.

And a couple of papers that have come out in just the last week or so, both of which are super interesting for a whole bunch of reasons. One about reading the state of the brain and translating what is detected in the brain into a reconstruction of what the person saw, which is pretty amazing. And there's some incredibly striking visuals there. And then one also that is about taking raw images of tissues and predicting what those would look like if they were treated in various ways to then allow a lab technician to read what information they contain and make diagnoses.

So super fascinating stuff, really where the rubber hits the road kind of research. This is not pie in the sky at all, but very applied practical hands on stuff. And I think it's just fantastic. So great work for starters, and I'm looking forward to getting into all the details.

Dr. Tanishq Mathew Abraham: (5:37) Yeah. Sounds great.

Nathan Labenz: (5:39) So let's talk about the first paper first. This is the one that has, I think, blown up a little bit more on social media. It's called Reconstructing the Mind's Eye, fMRI to Image with Contrastive Learning and Diffusion Priors. Maybe just for starters, I was surprised to learn, if I understand correctly, there is an fMRI dataset that was already out there in the public for you to build on. So maybe just start off by telling us what you began with as you undertook this project.

Dr. Tanishq Mathew Abraham: (6:08) Yeah, it seems like the field of neuroscience also tends to have somewhat of an open approach to research in terms of releasing datasets. And there are these sorts of databases with neuroscience datasets, like EEG datasets, fMRI datasets. So luckily, there has been some of these datasets that have been released where they take a subject, ask them to look at some images, and they measure the fMRI signal at the same time.

And so there have been a few datasets over the years that have done similar sorts of things. The dataset that we used was released, I think, maybe in 2020 or 2021, so it's a pretty recent dataset. It's also a very high resolution dataset. Most of these datasets use an MRI magnet of around 1.5 or 3 Tesla, something in that range, whereas this one uses 7 Tesla. So it's a much more powerful magnet, and it results in a much higher resolution fMRI signal, so you have a better signal that you can measure as well.

So this was created by another team that conducted that research, and they measured the signal for many different subjects. I think about 8 subjects in total. And the subjects went through hours of looking at images. Maybe they look at each image for a few seconds, and then the signal would be measured. And then there'd be a break of another few seconds, and then they measure again for another image. So they have that sort of process. And then they collected all of that data, and they released it publicly.

And so actually, I wasn't originally aware of this dataset. It was only after meeting my collaborator here, who's the first author of the paper, who told me about this dataset. Actually, we had started working on some similar ideas for this project about a year ago or so, and we were actually looking at a different dataset at the time. We were really happy to see this dataset, this new dataset that has much higher resolution and much better signal quality. And yeah, it was really designed for these sorts of questions. So, yeah, it's really a great dataset. And it's great to see that the neuroscience community also has this sort of open research spirit. So I'm really happy about that. And it makes me more excited to work in that field as well.

Nathan Labenz: (8:39) Yeah. Well, the interaction between AI and many, if not all aspects of biology is certainly going to be fascinating. And this is one that people are just so struck by because you can look ultimately at, this is what we showed the person, and this is what we were able to reconstruct based on a reading of their brain. And that first figure in this paper is incredibly striking where it's just like, wow. That is not a, I mean, it is a similar image, but that's not just a somewhat similar image. That is a very similar image that really looks like what you showed the person.

So we can put a link, and we will put a link, into the show notes to the paper. But go check out that first visual: pause right here and go look at it, because it really does ground what we're talking about. The striking resemblance of the reconstructions is just incredible.

Dr. Tanishq Mathew Abraham: (9:33) The reason why it's doing so well is both because of a combination of this very high quality dataset as well as the latest AI advances that we're taking advantage of. And I think because of these 2 factors, that's why we're able to see such great results. And so this wasn't something that was possible before, but only because of these 2 factors now we're able to see such amazing results.

Nathan Labenz: (9:55) Cool. So let's talk just a little bit more about the fMRI side and the nature of that dataset. So reading through the paper, there's 15,000 voxels. A voxel, as I understand, is a 3D space in and of itself. Right? So basically 15,000 little cubes, if you will, each corresponding to a physical region in the brain. Is this all on the back of the head in the visual cortex? Like, how much of the brain is under consideration here?

Dr. Tanishq Mathew Abraham: (10:26) Yeah. That's approximately right. The dataset actually has the fMRI signal of the entire brain, but we used a subset of the voxels that correspond to visual perception. And this was actually something that the original dataset folks prepared. They prepared a subset of their data focused on visual perception, and that is the subset that we used in this study.

Nathan Labenz: (10:50) If we were to look at the sort of raw data of one voxel, like, what would that contain?

Dr. Tanishq Mathew Abraham: (10:58) I think it would just be a value associated with that particular voxel for that particular measurement. What the fMRI signal is actually measuring depends on the blood oxygenation level. Basically, when there's a lot of activity in a particular region of the brain, it's going to be using up a lot of blood oxygen, so you see that sort of change in blood oxygenation level. And so you have a value associated with that particular voxel that is indicative of the blood oxygenation and the usage of blood oxygen in that area of the brain.

Nathan Labenz: (11:32) Gotcha. Okay. So it's essentially activity or energy. It's a raw scalar that says, this is how intense activity was in this particular place. Is there a time dimension to it as well? Or are you, as the subject, staring at an image until there's some sort of steady state or whatever, which is then averaged? How does that work?

Dr. Tanishq Mathew Abraham: (11:55) Yeah. Kind of like that. There isn't much of a time dimension. fMRI doesn't have very high temporal resolution, especially because you do have this blood oxygenation process, and that process is a little bit slower. So it's much harder to get very high accuracy in terms of time resolution. It's more of a single steady-state kind of value. That's why you have the subject look at the image for a few seconds first, and then you take that measurement over the course of a few seconds. And that's the measurement that you're using.

Nathan Labenz: (12:28) So essentially, the input is, when we think about an AI model that you're now going to develop on this dataset as an input output device, I always fall back to that framework. The input then is 15,000 numbers that correspond to intensity of activity in one of 15,000 little regions of the brain that are all in that back visual cortex area. That's strikingly not that much information. You know, 15,000 numbers in the grand scheme of things feels small. So does that feel small to you?

Dr. Tanishq Mathew Abraham: (13:08) Yeah. It definitely is striking in terms of how much information is present in the fMRI. And again, this isn't looking at individual neurons or anything like this. It's about 1 to 2 millimeters on a side, basically a 2 x 2 x 2 millimeter cube of volume that you're looking at for that particular voxel. So it's not very fine grained. It's more fine grained than maybe some of the other technologies like EEG, but it's not like these systems where you have invasive measurements of specific neurons or things like this. It's not looking at specific neurons.

So it's interesting, because you would expect that you would need the actual neuron signal, the actual electrical activity, at that very fine-grained level to be able to accurately predict the reconstruction, but it turns out that's not absolutely necessary. And I think a lot of it has to do with just the way the brain is organized: there are certain regions that respond to certain features in an image, for example. So just being able to know what regions are being activated and responding to the image gives you an overall idea of the sorts of features that are there in an image already.

So I think, yeah, a lot of it has to do with the organization of the brain and the visual perception of the brain. And, yeah, and that's also partly why these sorts of approaches are interesting from a research perspective is to also better understand how that perception works as well. And what's being activated, what kind of signals are there?

Nathan Labenz: (14:59) Hey. We'll continue our interview in a moment after a word from our sponsors.

Dr. Tanishq Mathew Abraham: (15:03) Yeah, overall, when you think about it, it's a little bit surprising. I think it's just surprising overall that you can do this sort of reading of brain activity in the first place, whether it's with fMRI or any other thing. That's still pretty impressive, and doing it in a non-invasive manner is pretty surprising.

But of course, you still are getting very high quality data, and you have subjects that are in this MRI machine sitting there for extended periods of time. So it's still a very involved process anyway. But it's a great first step and still useful for research applications.

Nathan Labenz: 15:49 I mean, the learning is going to go both ways here for sure. I think that's another very key point and probably key theme. The feedback cycles and dynamics here research-wise, we're just scratching the surface of that.

Again, just grounding for a second. So you said a 2mm cube is one voxel. If I were to do a little mental math on that, I'm like, okay, 5 x 5 x 5 is 125. So in a cubic centimeter, there would be 125 of these voxels. And we're on our way to 15,000. So that would be basically 100 cubic centimeters. If I'm thinking about that as the cortex, it's maybe a 10 by 10 by 1 centimeter slab of tissue that we're really looking at here.
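
For readers following the arithmetic, here is a quick sanity check of the numbers Nathan runs through; the 2 mm voxel edge and roughly 15,000 voxels come from the conversation, the rest is simple division:

```python
# Rough sanity check of the voxel arithmetic above (2 mm voxels, ~15,000 of them).
voxel_edge_mm = 2.0
voxels_per_cm3 = (10 / voxel_edge_mm) ** 3   # 5 x 5 x 5 = 125 voxels per cubic centimeter
n_voxels = 15_000

total_volume_cm3 = n_voxels / voxels_per_cm3
print(f"{voxels_per_cm3:.0f} voxels per cm^3")    # 125
print(f"~{total_volume_cm3:.0f} cm^3 of cortex")  # ~120, roughly a 10 x 10 x 1 cm slab
```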

Dr. Tanishq Mathew Abraham: 16:40 Yeah, I think that sounds about right.

Nathan Labenz: 16:42 We've got that kind of slab of tissue. We've divided it up into these 2mm cube voxels. Each one has a number that corresponds to its intensity. That is, again, where we started. And so that's the input now to the things that you are training AIs to do.

I guess there are a couple of big themes here that jump out to me. One, just the volume of data, not that high. You've got only so many images, only a few subjects. And this is true of your other paper as well. It's a low data environment, and that creates some interesting challenges.

Tell me about the outputs. You have multiple different approaches to create different outputs that you use for different purposes. So maybe just run those down for us.

Dr. Tanishq Mathew Abraham: 17:29 The final goal was to reconstruct the image. We want a final image there. The approach that we wanted to take was, how can we leverage the existing image generation models and image representation models that exist already to be able to perform this reconstruction as well as this additional task that we talk about, which is retrieval.

And so in order to do that, the goal is to basically get some sort of CLIP embedding. So this is, again, the CLIP models that OpenAI has developed and released, where it's a sort of joint representation of an image and text. So you have this representation space, and the idea is that a lot of these image generation models already take some sort of CLIP embedding. So if we're able to predict the CLIP embedding from the fMRI signal, then we can use that CLIP embedding for image generation and get a reconstruction that way. So that's at a high level the approach.

There are a few different other aspects that allow this to work. So, the idea is that we have a sort of neural network, a basic MLP, multilayer perceptron. That's a basic neural network architecture. And so we take this MLP to predict the embedding.
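
As a rough illustration of the kind of mapping being described, not the authors' exact architecture (the paper's MLP has its own hidden sizes and residual structure), the core idea is a plain MLP from the flattened voxel vector to a CLIP-sized embedding:

```python
import torch
import torch.nn as nn

# Illustrative voxel-to-embedding backbone; dimensions are assumptions, not the paper's.
N_VOXELS = 15_000                  # flattened voxel activations for one image
CLIP_TOKENS, CLIP_DIM = 257, 768   # 256 patch tokens + 1 global token (discussed later)

voxel_to_clip = nn.Sequential(
    nn.Linear(N_VOXELS, 4096),
    nn.GELU(),
    nn.Linear(4096, 4096),
    nn.GELU(),
    nn.Linear(4096, CLIP_TOKENS * CLIP_DIM),
)

voxels = torch.randn(8, N_VOXELS)  # a toy batch of 8 fMRI samples
fmri_embeddings = voxel_to_clip(voxels).reshape(8, CLIP_TOKENS, CLIP_DIM)
print(fmri_embeddings.shape)       # torch.Size([8, 257, 768])
```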

And the problem is that the embedding that we get at the beginning isn't aligned. Basically, when you have these embeddings, you can get them so that the cosine similarities, the similarities between the embeddings, all match up the way you want them to, but the actual representations aren't aligned with the other representations. It's not just a problem with the fMRI embeddings that we're working with; it's actually an issue that happens even within regular CLIP, with the image and text embeddings.

So for example, in the OpenAI DALL-E work, they wanted to take text and convert that into image generation. They did something similar where they took the text and got the CLIP embedding for the text. But that CLIP text embedding isn't aligned to the image embedding. For example, if the text is a description of a dog and you have the actual image of the dog, those embeddings, the text embedding and the image embedding, have high similarity to each other, but they don't have similar values. So that's the issue. They're not really of similar values. They're not really matching.

So you cannot take, for example, a CLIP text embedding and pass it in as a CLIP image embedding to an image generation model that expects CLIP image embeddings. You need to somehow convert from the CLIP text embedding to a CLIP image embedding. So the DALL-E paper introduced this diffusion prior, which is another model that converts the CLIP text embedding into a CLIP image embedding and aligns them.

And so that's what we did here where we predicted our CLIP embedding from the fMRI signal, and then we also had a diffusion prior that would take that predicted fMRI embedding and align it with the CLIP image embedding. And now we have a CLIP image embedding that's aligned, and that can be passed into a pre-trained image generation model. So that's the reconstruction pipeline.
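
Putting the steps just described in order, the reconstruction path at inference time looks roughly like the sketch below. The function and module names are placeholders for the components discussed, not the released code:

```python
import torch

def reconstruct_image(voxels, mlp, diffusion_prior, image_generator):
    """Schematic reconstruction path: fMRI voxels -> CLIP-space embedding -> image.

    mlp:             maps voxels to an unaligned "fMRI embedding"
    diffusion_prior: aligns that embedding to the CLIP image embedding space
    image_generator: frozen, pretrained model conditioned on CLIP image embeddings
    """
    fmri_embedding = mlp(voxels.unsqueeze(0))                # contrastively trained prediction
    clip_image_embedding = diffusion_prior(fmri_embedding)   # alignment step
    return image_generator(clip_image_embedding)             # e.g. a Versatile Diffusion-style model
```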

And then we also have this retrieval pipeline where, in this case, you don't really need to have the alignment, because here all you're looking at is similarity. So you can just say, okay, if you have some sort of fMRI embedding, which images is it similar to, based on cosine similarity of the CLIP embeddings. You can get the similarity to a bunch of images and say these are the ones it's most similar to, and we did those experiments as well.

And we can see that if we used images from the dataset that we worked with, then the ones that it's most similar to are the actual images that the subject actually saw in that case. Or we can also extend it to a very large dataset like LAION-5B, and you can do, again, some sort of similarity and get similar images, and those similar images would look very similar to the actual image. And you can kind of treat this as a generation pipeline without actually using a generation model because you're getting these similar images that could correspond very closely to the original image.
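
Since retrieval only needs relative similarities, it reduces to a cosine-similarity lookup against a bank of precomputed CLIP image embeddings. A minimal sketch with toy data follows; at LAION-5B scale one would use an approximate nearest-neighbor index rather than a dense matrix:

```python
import torch
import torch.nn.functional as F

def retrieve_top_k(fmri_embedding, image_embedding_bank, k=5):
    """Return indices of the k candidate images whose CLIP embeddings are most
    cosine-similar to the embedding predicted from the brain scan."""
    sims = F.cosine_similarity(fmri_embedding.unsqueeze(0), image_embedding_bank, dim=-1)
    return sims.topk(k).indices

# Toy usage with random vectors, just to show the shapes involved.
bank = F.normalize(torch.randn(10_000, 768), dim=-1)   # stand-in CLIP embeddings of 10k images
query = F.normalize(torch.randn(768), dim=-1)          # stand-in embedding predicted from fMRI
print(retrieve_top_k(query, bank, k=3))
```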

And again, the nice thing about what we show in this paper is that you can actually get very fine-grained information. The example that we have in our paper, which we think is a really nice one, is that there are actually a lot of zebra images in the dataset that was collected from the subjects. And so if we take one zebra image that a subject looked at, predict the embedding from the fMRI signal, and look at the retrieval of the images, you can actually retrieve the exact same zebra image based on the fMRI embedding that we predict with our model. So this indicates that it's not just, oh, this is a zebra image. You can tell this is the exact zebra image that the subject was looking at. So there's actually that fine-grained level of information present as well. So those are the two approaches that we have here. We have the reconstruction, and we have this retrieval.

There's one other aspect that is also worth pointing out. With reconstruction, you have this CLIP embedding, and that's being passed into this image generation model, and you get some nice image. But it would also be nice if the positions of the objects are similar, or maybe the colors are similar, these sorts of more low-level pieces of information. So we also have a low-level pipeline that can help with that low-level information.

And so the idea there is you just have a simple model that takes in the fMRI data and predicts a VAE representation. Stable Diffusion has this VAE, a variational autoencoder, which is just a lower dimensional representation of the image. So we predict that, and then we decode the VAE representation that was predicted from the fMRI signal. That gives a very blurry image, but it has some of the features that are important, like the positions of the objects or the colors, things like this. And we can use that as a starting point for our diffusion process.

Because our standard pre-trained image generation model is usually some sort of diffusion process, we can use the image-to-image processes that people have been using, and use this low-level image as a starting point for the high-level image generation to occur. That way, we can get the high-level image generation along with the low-level information. So that's the combination of the low-level pipeline and the high-level pipeline. And this allows us to score very well on the low-level metrics too.
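
The low-level branch can be sketched the same way: a small model predicts a Stable Diffusion VAE latent from the voxels, the frozen VAE decoder turns it into a blurry layout image, and that image initializes an image-to-image diffusion run instead of pure noise. This is a schematic with placeholder modules and an assumed latent shape, not the paper's implementation:

```python
import torch

def low_level_init(voxels, voxel_to_latent, vae_decoder):
    """Predict a VAE latent from fMRI and decode it into a blurry starting image.

    voxel_to_latent: small model mapping voxels -> a Stable Diffusion-style latent
                     (4 x 64 x 64 is the usual shape for 512 x 512 images; assumed here)
    vae_decoder:     frozen Stable Diffusion VAE decoder
    The blurry output carries rough colors and object positions and is used to
    start the image-to-image diffusion process instead of pure noise.
    """
    latent = voxel_to_latent(voxels.unsqueeze(0)).reshape(1, 4, 64, 64)
    return vae_decoder(latent)
```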

So people will do things like look at the comparison of the original image and the output image. And of course, if you have very similar semantic information in terms of, okay, this is a zebra, but the zebra is not in the same position, if you're going to directly compare those images, it's not going to be very high. The results won't be very good in terms of that sort of low-level comparison. But here we can show that our low-level comparison actually is much better than previous approaches as well.

So yeah, that's a lot of information, I guess, but those are the different pipelines and, of course, feel free to ask questions about the specific pipelines and the specific details. But yeah, hopefully that makes sense.

Nathan Labenz: 25:37 So am I right, first of all, to say that the simplest first step is taking 15,000 voxels, each a number, and mapping that to a CLIP embedding. And that's the starting point regardless of then whether you're going to go to the retrieval or to the reconstruction.

Dr. Tanishq Mathew Abraham: 26:00 Yes. That's correct. Yes.

Nathan Labenz: 26:01 So I guess, first of all, how big is that? How big is a CLIP embedding? How many numbers is that at the end of the prediction?

Dr. Tanishq Mathew Abraham: 26:08 It depends on, because we did try different CLIP models. I don't remember which one we used finally. But it's basically, you have a vector for each, in the case of text, it would be each word or token, for example, and then those would be concatenated together. Or in the case of an image, if you're using the sort of transformer, you have image tokens as well. So you have an array of vectors that you have for each image or each sentence or whatever.

And then typically, what people will do is average across that and just get a single vector that they work with. We don't actually do that, because it turns out that if you average across all the tokens, all the words in the sentence or the entire image, you actually lose information that way. If you keep that information and try to predict it, then we're predicting the CLIP embedding for the whole image, not just that global representation, but actually the different parts of the image as well. Again, that allows you to get some more spatial information, some more low-level accurate information that way too.

But it's basically some sort of vector for each of these tokens, and then you have that. So overall, it's, yeah, I mean, I can quickly look up the dimension. It's probably mentioned in the diagram.

Nathan Labenz: 27:35 257 x 768, if I'm reading the diagram.

Dr. Tanishq Mathew Abraham: 27:39 To clarify what that means: basically, for each token, you have a 768-dimensional vector. CLIP image embeddings are a little bit different because the image is divided into multiple patches, and each patch is treated as a sort of token. So you have a 768 vector for each patch in the image, and there are 256 patches, so there are 256 of those 768 vectors. And there's also one global representation for the entire image, which is another 768 vector. So that's how you get 257 by 768.

And so instead of just predicting a single 1 x 768 vector for the entire image, we actually predict that 257 by 768 for each patch plus the full global representation. So maybe, hopefully, that makes more sense in terms of what's being predicted here for the CLIP image embedding.
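
The shape Nathan reads off the diagram decomposes exactly as described: 256 patch tokens plus one global token, each a 768-dimensional vector.

```python
import torch

patch_tokens = torch.zeros(256, 768)   # one 768-d vector per image patch (a 16 x 16 grid of patches)
global_token = torch.zeros(1, 768)     # one 768-d vector summarizing the whole image
clip_image_embedding = torch.cat([global_token, patch_tokens], dim=0)

print(clip_image_embedding.shape)    # torch.Size([257, 768])
print(clip_image_embedding.numel())  # 197,376 numbers, versus ~15,000 voxel inputs
```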

Nathan Labenz: 28:46 Yeah. Interesting. So that's just, I mean, it's a dramatic increase in the number of numbers. Right? I mean, you're going from 15,000 to something like, whatever, 150,000, maybe 200,000 numbers. So that's always interesting in and of itself. Now, you're predicting there the image embedding?

Dr. Tanishq Mathew Abraham: 29:10 We're trying to predict the image embedding. But when we predict it, we're training with these different losses, like a contrastive loss. So when we predict it, it's not necessarily aligned to the image embedding from the very beginning. It's not going to have the exact same values as the image embedding when we start out. And that's why there's this additional process for the reconstruction.

But you can use that predicted embedding for retrieval because the similarities will work out perfectly fine anyway, without needing to do this alignment process. That's why in the paper we don't necessarily call the direct prediction from the MLP an image embedding. I think we call them fMRI embeddings or something like this. The direct output from the MLP isn't going to match exactly with the CLIP image embeddings, but that's what it's trying to be, what we're trying to predict there. Yeah.

Nathan Labenz: 30:14 And that's enough to do the retrieval out of the LAION-5B dataset, often down to even the single one image out of 5 billion. Correct?

Dr. Tanishq Mathew Abraham: 30:26 Yeah. That's basically correct. We have this contrastive loss that we're training the model, the MLP, with. And that allows the cosine similarities to work out because we're doing contrastive learning similar to how the original CLIP models were trained. So then we can get embeddings that work well with this sort of retrieval task.
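
For concreteness, a CLIP-style symmetric contrastive loss of the kind being described can be written in a few lines. This is the generic form, not the paper's exact objective (MindEye adds further refinements described in the paper):

```python
import torch
import torch.nn.functional as F

def contrastive_loss(fmri_embeddings, clip_image_embeddings, temperature=0.07):
    """Symmetric CLIP-style loss over a batch of matched (brain scan, image) pairs.
    Each matched pair should be more similar to each other than to any other item
    in the batch; only relative similarities matter here, not exact values."""
    a = F.normalize(fmri_embeddings.flatten(1), dim=-1)
    b = F.normalize(clip_image_embeddings.flatten(1), dim=-1)
    logits = a @ b.T / temperature                      # (batch, batch) similarity matrix
    targets = torch.arange(len(logits), device=logits.device)
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2

# Toy usage: 8 matched pairs of random 257 x 768 embeddings.
print(contrastive_loss(torch.randn(8, 257, 768), torch.randn(8, 257, 768)))
```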

Nathan Labenz: 30:49 The next step is doing this sort of aligning. And I'm a little confused there: was there something preventing you from doing that in an end-to-end training sort of way? Or was it just a matter of, we knew we wanted to do this in pieces anyway, so this was kind of an efficiency? Why is that 2 steps instead of 1 end-to-end step?

Dr. Tanishq Mathew Abraham: 31:10 So that turns out to be a sort of trade off between retrieval performance and reconstruction performance. If you do it end to end, it would be hard to get the best performance for both tasks. Dividing it up into these 2 processes, these 2 pipelines, allows us to have something that does well for the retrieval task and something that does well for image reconstruction. So that's the main motivating factor for dividing it up. And we have some ablation studies in the paper that demonstrate this trade off between reconstruction and retrieval.

Nathan Labenz: 31:50 Is there an intuition for that? I'm drawing a bit of a blank as to why I would expect that to be the case, because right now I don't. So you have to help me develop the intuition for why that would be the case.

Dr. Tanishq Mathew Abraham: 32:02 I think it's potentially because getting the nice cosine similarities, making sure these embeddings are similar to some images and different from others, may need different information or a different sort of representation than if you just want to predict the final image. I'm not sure we have the best intuition either for what's going on there, but different information may be needed for these 2 different tasks, I guess. It's an interesting question why reconstruction and retrieval have this sort of trade off.

Nathan Labenz: 32:53 So if I, again, understand correctly, it is a sequential process even at the inference step. Right? Maybe I have this wrong, but what I understood was: voxel data comes in. It's first mapped onto the CLIP embedding space in such a way that it'll have this high similarity, so that you can perform your database search with a vector database. For folks who want to go deeper on that concept, we have a whole episode on vector databases with Anton Troynikov, one of the founders of Chroma. And they have a powerful demonstration of their technology built on this same dataset that really flexes its scalability. So go listen to that if you want to learn more about the nature of doing these vector database searches. But okay, so you've got that. That's enough to power the database search. And then you're doing another transformation sequentially, right, into the form that would then feed into the diffusion model.

Dr. Tanishq Mathew Abraham: 34:06 Yes. That's pretty much the case. Yes.

Nathan Labenz: 34:10 So all the information is there the whole time. That intermediate step contains all the information by definition, right, and then it gets kind of reformulated. So is it maybe the case that the constraints you're working with are just different across these tasks? Maybe it's not a fundamental thing, and it's more like, what the diffusion model needs is just different. You're training not necessarily for informational content, but for a specific representation that is required to use some frozen existing system.

Dr. Tanishq Mathew Abraham: 34:48 Exactly. Yeah. That was one of the things I mentioned, because the idea is that you have a diffusion model that's taking in specific CLIP embeddings. And the problem is that the embeddings we produce are trained with this contrastive loss. Basically, you can either train with a contrastive loss, or you can train with something like an MSE loss, where you directly compare and get the values to match exactly. The contrastive loss is good for the retrieval sort of task, so we're using that for retrieval. But then the problem is that those representations may not necessarily match a regular CLIP image embedding. They're useful for doing cosine similarity, getting similarities to existing embeddings, but they don't match those existing embeddings directly.

We actually have a good example of this alignment process in the appendix of our paper. There's a UMAP depiction of the CLIP image embeddings as well as the fMRI embeddings, and you can see that without our diffusion prior alignment process, the embeddings sit in almost 2 separate clusters. They don't really line up together. But we need them to line up in order to pass them into an image generation model that was trained with those CLIP image embeddings, because that's what the diffusion model knows how to work with.

So the information is there, but we're just trying to put it in a way that the pretrained frozen models are used to. And I guess that also has to do with the different objectives, the contrastive objective that we're training with. You can get good representations for the retrieval task that way, but it doesn't give you representations that match exactly the CLIP image embeddings that we need for passing into the pretrained models.
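
One way to see the distinction being drawn: the contrastive objective only constrains relative similarities, while the prior is trained so its outputs land on the actual CLIP image embedding values. A deliberately simplified stand-in for that alignment stage is sketched below; the paper uses a DALL-E-style diffusion prior here, not a plain MSE regression:

```python
import torch
import torch.nn.functional as F

# Simplified stand-in for the alignment stage: train a network so its output
# matches the ground-truth CLIP image embedding values directly (value-level
# match, not just similarity). The paper's actual prior is a diffusion model.
def alignment_loss(prior_model, fmri_embedding, clip_image_embedding):
    aligned = prior_model(fmri_embedding)
    return F.mse_loss(aligned, clip_image_embedding)

# Toy usage with a linear "prior" on flattened embeddings.
toy_prior = torch.nn.Linear(768, 768)
print(alignment_loss(toy_prior, torch.randn(4, 768), torch.randn(4, 768)))
```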

Nathan Labenz: 37:04 So the final version that actually can be passed into the image generation model, that is the thing that is directly analogous to the image embedding, and I see how that's represented in the figure. So then maybe the surprising thing, or the thing that needs understanding, is that if you want to maximize your retrieval performance, you probably could just use those. If you had trained the whole thing end to end, you could use the genuine aligned image embeddings to power retrieval, but it doesn't work as well. And what's revealed is there is actually a better way to represent these voxels for retrieval purposes that is kind of its own thing. It's kind of a different subspace in this space that somehow it kind of just gradient descents its way into, and we really don't even know what that is. That's not really interpretable as of now.

Dr. Tanishq Mathew Abraham: 38:09 Well, I think this is the same question for CLIP in general, because this is a similar problem with CLIP image embeddings and text embeddings. They are not lined up in the same way, so I think this may just be a general question about what's going on with the CLIP image and text embeddings. It seems like when you train these sorts of models with a contrastive objective, that doesn't really incentivize them to map to the same exact region, the same exact space. They map to kind of different spaces that have the appropriate cosine similarities and appropriate relationships, because the contrastive learning is just trying to maximize similarity to similar things and minimize similarity to things that are not similar. That doesn't necessarily mean they have to be in the exact same space.

So there's that question even in the case of CLIP image and text embeddings. And again, that's why in the original DALL-E work, they did something similar: they took the CLIP text embedding and had to incorporate this additional diffusion prior to convert it into a CLIP image embedding. They couldn't just pass their text embedding into a diffusion model that took image embeddings. So it's a similar sort of problem, and I think it's just kind of a question for the contrastive learning community.

I think it also has to do with working in these high dimensional spaces; that's when this starts to show up. We're working with very high dimensionality data in terms of the CLIP embeddings, and there's something about working with high dimensional data that leads to these sorts of properties.

Nathan Labenz: 40:04 If you had done the full end to end training, that would presumably be fine for the reconstruction.

Dr. Tanishq Mathew Abraham: 40:11 Yeah. I mean, the goal was obviously to get the best reconstructions. We tried different things out, and I think we kind of stumbled upon this retrieval task. The issue is that if you're going from the fMRI signal to the CLIP image embedding, it depends on the sorts of losses that you use. If we train it with a contrastive objective, theoretically it should work, but practically it might be difficult to just go directly to the CLIP image embedding, because the contrastive objective means it won't necessarily be aligned that way. Theoretically, that should be a perfectly fine pipeline, but I think practically there can be some difficulties getting such a model to train well and to get an accurate model at the end. So that would be the caveat I would mention.

Nathan Labenz: 41:09 Because one thing that's really interesting about both of these papers that you've put out in the last short period is, again, the low data, but also kind of the non transformer architecture. Right? I mean, you've got something here that jumped out to me. If I understand correctly, there's no attention mechanism in this architecture, at least in the models you've created.

Dr. Tanishq Mathew Abraham: 41:32 Yes. The model that we create is basically the MLP. And so, yeah, there isn't really. Yeah. It's just the MLP. And then, of course, the CLIP will have, of course, attention in there. And, of course, the diffusion models could have as well, but the MLP, not really. Yeah.

Nathan Labenz: 41:48 How big of a role then does the limited dataset size play? Because you have basically just under 1000 images per individual, and you end up training models for each individual. So you're really just training on 1000 pairs of this is the image that the person saw, and this is the 15,000 numbers for the 15,000 voxels as measured for that image. That's it. Right? Just 1000 of those pairs?

Dr. Tanishq Mathew Abraham: 42:21 The 982, if that's the number you're referring to in terms of under 1000, was specifically a test set used for the retrieval experiments. There are, I think, several thousand training samples used during training. I'm not 100% sure if that's what you were referring to, but that value was just for the retrieval experiments.

Basically, we have multiple subjects, first of all. That's another thing worth mentioning: multiple different people are looking at these images while the fMRI signal is measured. The fMRI signal is different per subject, so we actually train separate models per subject. The way we do that is we have subject 1, and the data for subject 1 is divided into a training set and a test set. There are maybe several thousand samples from subject 1 used for training, and then you have a test set. And then you do the same process for subject 2 and all the remaining subjects.

Another thing to note is that we did all the model development on just subject 1. Any sort of experiments were all done with subject 1, and then for the rest of the subjects, whatever worked best for subject 1, we did the same thing for the rest of the subjects. That's what we reported.

And the other thing, of course, is that the images the subject looks at in the training set are different from the images in the test set, just to make sure we are not overfitting or anything like this. We want to make sure they're 2 separate sets. But yeah, there are several thousand samples for the training set. I think the exact information is in the appendix of the paper. It's still not that much, of course.
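
Schematically, the protocol described here amounts to one model per subject, each with disjoint train and test image sets. The sketch below uses placeholder trial counts and a random split purely to show the structure; the actual split is described in the paper's appendix:

```python
import numpy as np

def split_trials(n_trials, n_test, seed=0):
    """Illustrative disjoint train/test split of one subject's trials."""
    order = np.random.default_rng(seed).permutation(n_trials)
    return order[n_test:], order[:n_test]   # train indices, test indices

per_subject_splits = {}
for i, subject_id in enumerate(["subj01", "subj02", "subj03", "subj04"]):
    # Placeholder counts; in the study, each subject contributes several
    # thousand training samples and a held-out test set (e.g. the 982 images).
    train_idx, test_idx = split_trials(n_trials=9_000, n_test=982, seed=i)
    # A separate model would be trained on train_idx pairs for this subject only.
    per_subject_splits[subject_id] = (train_idx, test_idx)

print({k: (len(v[0]), len(v[1])) for k, v in per_subject_splits.items()})
```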

Nathan Labenz: 44:31 Yeah. It's small, right? I mean, we're looking at 5 billion images in the LAION dataset, and now we're talking basically one millionth of that. So it's surprising that you're able to get this much generalization out of seemingly such a small set. These images are just all kinds of different random scenes and subject matter. It's a pretty diverse set of imagery that you're drawing from.

Dr. Tanishq Mathew Abraham: 45:03 I think part of the reason is that we are leveraging these pretrained models. Of course, we're leveraging the representation space of the CLIP models, and the MLP just needs to learn how to map to something that is similar to the CLIP image embedding space. Even before the diffusion prior alignment step, it's still mapping to some space that should have similar properties to the CLIP image embedding space, including how the images are laid out in that space. So I think that really helps. You're already mapping to a very well established representation.

But again, we don't know how well it's going to generalize outside of natural scenes, which is what we were training with. It's very possible that it doesn't generalize beyond those specific kinds of images. And it also has to do with the image generation models that we work with. They have particular datasets that they were trained with, and those are the sorts of images they're best able to produce. So there are some limitations in terms of generalizability, and I guess we haven't studied that as much. But the fact that we're able to train with so little data and get really good results is partly because, I think, of this really rich representation space that we're working with.

Nathan Labenz: 46:46 One of my prior touchstone episodes that I always keep going back to was with the authors of BLIP, out of Salesforce Research in Singapore. And I use some of their models in my own product development work. And that was kind of my first trip down the rabbit hole of these kind of bridging or mapping from one high dimensional space to another. And since then, it's like, this just kind of comes up everywhere. It seems that to a first approximation, it seems like every space is kind of bridgeable or mappable to another space if you're clever enough to figure out where to put the pylons. Does that seem right to you? Like, it seems like all these things are kind of learning similar world models under the hood?

Dr. Tanishq Mathew Abraham: 47:34 I think so. I think this idea of mapping from one space to another, one latent space to another is a very powerful idea. And yeah, I think it's always best to try to take advantage of that as much as possible and to also take advantage of these preexisting latent spaces, especially in the cases that you may not have enough data. I think this is really a great opportunity. Again, I mean, this has been true, I think for a long time. The more recent thing is the idea of these multimodal spaces. Of course, the general idea of using pretrained latent spaces isn't new. I mean, that's kind of what transfer learning in a nutshell kind of is, using the preexisting latent space of these pretrained models. But a lot of them were focused for specific domains.

I think the real advantage, the real innovation these days, is being able to use these multimodal spaces as well, and being able to map different things to these multimodal spaces. That's a really exciting area, and it's interesting to see lots of developments there. I think there were some papers recently about mapping anything to anything, I think from Facebook or Meta. So again, it's a very powerful idea.

And I guess there is some sort of base representation of something that can be utilized for various tasks. But it also depends. There are representations that are useful for certain things, and that same representation may not be useful for other things. There are different representations for a particular image, for example, and maybe some representations are better than others depending on the task. So there's also a lot to explore in how those representations are trained and what tasks they are useful for. There are different CLIP models, BLIP models, all these different models out there, and they are trained with different objectives and lead to different representations. That's something I think isn't fully explored yet, and there's a lot of opportunity to explore the differences between these representations and how they can be used for different applications. So lots to explore there.

Nathan Labenz: 49:55 Yeah. So we're not even fully done with the architecture. We alluded to kind of the last part earlier. But since you mentioned these different pretrained models and the different spaces that they have, you've got another kind of clever dimension to this as well, right? And I'd love to understand it a little bit, again, trying to develop some intuition. With the CLIP embeddings, that is originally a joint space of images and text. And you point this out in the paper: the image embeddings are kind of inherently more about the captionable part of the image, if you will. This is something I've explored in a lot of different ways too. Notably, things that are not included in the captionable qualities of an image would be like, how beautiful does the image look? Is it appealing? Those things are typically not represented in captions, and so they're not really captured in this CLIP representation all that much. So what you've really communicated at that point is some sort of semantic understanding of what the image would contain.

You also then have this other angle, and they converge at the end. That's the Stable Diffusion VAE encoding, and I want to develop a little bit more intuition for that as well. It has kind of the other half, which is: what does this thing look like compositionally? What colors are where? And you have this kind of intermediate state that is a blurry thing, where I don't necessarily know what it is, but I do kind of know what colors are where, and there are some blobs here. What more would you tell us about that side of the pipeline?

Dr. Tanishq Mathew Abraham: 51:45 I mean, that's pretty much a pretty good summary of it. I think yeah, the idea is just to map the fMRI signal to some sort of VAE latent representation, and that latent representation is coming from the Stable Diffusion autoencoder, and that is able to contain a lot of this more low level information. And we are able to then output some sort of blurry image, and that is used as a starting point for the diffusion models that are then producing the final image given the fMRI CLIP embedding, the CLIP embedding that is produced. So yeah, this is doing a sort of image to image process.

People have explored this for various AI art applications already where you can kind of start out with some image. Often, for example, you can start with a sketch or something like this, and then you take that sketch, and then you can use stable diffusion, for example, to just produce this beautiful image based on the sketch. But again, the idea is that the output image has a lot of that sort of the colors or the structures, the positions very similar to that original sketch. And so that's a similar idea is that we have that starting image, that's almost like a sketch of the final image. And we are then, of course, taking our diffusion model, which is given the semantic information coming from our CLIP embeddings, and it is producing a nice final result that matches the sort of maybe colors or the sort of spatial position structures that was there in that original very blurry kind of sketch of the image that was produced by our low level pipeline.

So I mean, it's again very blurry, so it just helps a bit. It gives you better low-level information, better low-level results. But it's good to have that additional information. You have to kind of look for it, but you'll start to see that it's apparent that, okay, maybe that zebra is in the right place, or that person is kind of in the right place. You see a little bit of that matching up when you compare to the original image. So it just provides an extra factor, that extra improvement in the final image, I'd say. But again, it's not completely necessary. If you just cared about the semantic information, you don't need to have a low-level pipeline. It's not a necessary part of the pipeline.

Nathan Labenz: 54:15 You can just start from noise.

Dr. Tanishq Mathew Abraham: 54:16 Yes. Exactly. You just start your diffusion model. The typical process is it starts from noise, and from noise, it does that sort of iterative process to produce the final image. So instead of starting from the low level image, you could just start from noise, and you can get an image that matches the semantic content of the original but may not necessarily have that spatial information or colors or any of those things correct. So yeah, it's not a completely necessary part, but it just helps provide sometimes it'll just help improve the image a bit.

Nathan Labenz: 54:45 You're definitely standing on the shoulders of a lot of recent giants here where it's like, this work wouldn't even really have been possible in anything approaching this way. Whatever was the most recently released, your kind of frozen model that you're building on top of, I guess would be Stable Diffusion. So that's like 6, 8 months ago that that thing first came out?

Dr. Tanishq Mathew Abraham: 55:06 Yeah, we of course use Stable Diffusion, but we actually tested a few different pretrained image generation models. Because we're taking in CLIP image embeddings, we were looking for image generation models that take CLIP image embeddings. We started with Stable Diffusion Image Variations, then one of these other papers, Lafitte, I don't know how to pronounce that exactly, and then Versatile Diffusion. And that's the one we actually went ahead and used for the final image generation; this Versatile Diffusion model gave the best performance.

And this is one of the models that, like I talked about, takes in the full CLIP embedding, not just the global information but the information for each of the tokens, each of the patches, so that full 257 x 768 tensor. Some of the other models only take in the global vector, which is just a 768-dimensional vector, but this one takes in the full information. I think that really contributes to much improved performance: the model is predicting that full information for the whole image, and you're passing all of it into your pretrained diffusion model. Some diffusion models don't take that, and those ones don't necessarily do well for this task. But when you have a diffusion model that takes all of that full information, it can actually produce really nice results.

And I think Versatile Diffusion is a paper that maybe went a little bit under the radar, but their models are actually quite good for image generation. So that's what we used and got the best performance with. We can use any future pretrained image generation model that takes in these sorts of CLIP image embeddings. So we're really excited that if Stability AI releases other Stable Diffusion models that take in CLIP image embeddings, these sorts of variation models or whatever, that are much better, that should hopefully also give us better reconstructions. But yeah, like you said, we're really building on top of these existing models, and I think it's also worth highlighting that our approach can continue to work well for whatever models may come in the future. That's also really exciting about our approach.
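
The difference between a global CLIP vector and the full token-level embedding is easiest to see as tensor shapes. Here is a minimal sketch, assuming a made-up voxel count and hidden width (the paper's actual mapping network is much larger than these toy MLPs):

```python
import torch
import torch.nn as nn

NUM_VOXELS = 15724        # assumed flattened voxel count for one subject
TOKENS, DIM = 257, 768    # the 257 x 768 CLIP image embedding discussed above

# Head A: predict only a single global 768-d CLIP vector (all that some generators accept).
global_head = nn.Sequential(
    nn.Linear(NUM_VOXELS, 4096), nn.GELU(), nn.Linear(4096, DIM)
)

# Head B: predict the full token-level embedding that Versatile Diffusion can consume.
full_head = nn.Sequential(
    nn.Linear(NUM_VOXELS, 4096), nn.GELU(), nn.Linear(4096, TOKENS * DIM)
)

voxels = torch.randn(1, NUM_VOXELS)
global_embed = global_head(voxels)                      # shape (1, 768)
full_embed = full_head(voxels).view(-1, TOKENS, DIM)    # shape (1, 257, 768)
```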

Nathan Labenz: 57:24 Yeah, that's cool. So, okay, let me try to play this all back in one description. Then I've got a few other questions, and then we can go on to the other paper if you have time. Working backwards, it's almost like: all right, we have this raw data about the brain, and we know on the other end that there are these image generation models that have recently been created, and they can work in various ways. They can take text in, and since that text has to be embedded anyway, we could figure out how to bypass the text step and project directly into that embedding space. Then there are all these variations where it could take an image in, or take an image in and noise it, and then take it in some guidance direction. And I think most of our audience will have at least played with these various tools, right, where you start with text and make an image.

Or, we had Suhail from Playground AI on our very first episode, actually, and they have a really nice command-based image editing tool now. And then, of course, you've got your image-to-image and all these different variations. People have played with all these flavors. So you recognize that, okay, that's out there as something we can tap into. And I know that all the information I need is contained in these 15,000 voxel numbers. So then you're like, all right, I'm going to take a kind of pincer movement at it.

You identify one kind of blurry representation space, where it looks like I've basically just put a Coke bottle in front of the image and I can just make out color blotches and maybe some vague forms. And you figure out how to project the voxels onto that space so that you have a good starting point for what the image should qualitatively look like. Then you separately say, okay, I also know that I can guide that toward something semantically meaningful if I have the right representation to guide the diffusion of that blurry image toward something crisp. So the other arm of the pincer movement is: now we'll project those 15,000 voxel numbers onto this semantic representation, which is used to guide that reconstruction process. You put both of those things into an existing model, and now you're in business with reconstructions coming out.

Dr. Tanishq Mathew Abraham: 1:00:06 Yep. That sounds about right to me.
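
Putting the two arms of that pincer together, the inference path looks roughly like the sketch below. Everything here is schematic: `low_level` and `high_level` stand for the kinds of hypothetical voxel-to-latent and voxel-to-CLIP heads sketched earlier, and `generate_with_versatile_diffusion` is a placeholder for the frozen pretrained generator, not a real API.

```python
import torch

def reconstruct(voxels, low_level, high_level, generate_with_versatile_diffusion):
    """Schematic MindEye-style inference: voxels -> (blurry latent, CLIP embedding) -> image.

    All arguments are stand-ins: the two heads are hypothetical modules like those
    sketched earlier, and the generator callable is a placeholder, not a library call.
    """
    with torch.no_grad():
        blurry_latent = low_level(voxels)     # low-level arm: rough colors and layout
        clip_embed = high_level(voxels)       # high-level arm: semantics (e.g. 257 x 768)
        image = generate_with_versatile_diffusion(
            init_latent=blurry_latent,        # img2img-style starting point
            image_embeds=clip_embed,          # guides the denoising toward the right content
        )
    return image
```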

Nathan Labenz: 1:00:08 Alright, I love it. Well, I had to earn that one. And some really, again, super interesting things here. One thing I really appreciate about the paper is that you include the raw code for the architecture, this is how the model is set up. And the models are not that big, right? So can you tell us a little about the scale, for one thing? I'm also always really interested in how long this takes to train. Can you do it on a machine overnight? What does your cycle look like in terms of actually training these things? Your iteration cycle obviously follows from that.

Dr. Tanishq Mathew Abraham: 1:00:45 In terms of training, I don't know exactly how long it took, because I wasn't the one training it. The main authors, in this case Dr. Paul Scotti and Atmadeep Banerjee, were the ones running a lot of the experiments, so I'd have to confirm the exact timing with them. It's probably on the order of several hours; it's not days and days of training, that's not the case here. In terms of how big the model is, it's mentioned in the paper: the model that we finally used had 940 million parameters, so close to 1 billion. Part of it is also that if you use larger models, they do tend to get better performance. In the paper, we have ablation studies of the different architectures, so there are different things that we tried, model depth, parameter count, all these different aspects, and we found the best model based on that. So it's not too large a model, but it's certainly sizable, 940 million parameters. It's still a decent size.

Nathan Labenz: 1:02:13 GPT-2 scale, in rough terms. So, yeah, trained on, I'm just quoting the paper, one A100 machine for 240 epochs with a batch size of 32. So just one machine, pretty amazing. And how about inference? I assume inference would be quite fast on this, right? It should be sub-second, that type of deal?

Dr. Tanishq Mathew Abraham: 1:02:44 Yeah. It's pretty fast. Yes.
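
For a sense of scale, the training loop itself is nothing exotic. The skeleton below mirrors the setup quoted from the paper (a single GPU, 240 epochs, batch size 32), but everything else is a deliberately shrunken stand-in: random data instead of the real fMRI dataset, a tiny MLP instead of the roughly 940M-parameter model, and a plain MSE loss instead of the paper's actual objectives.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

NUM_VOXELS = 15724                          # assumed voxel count for one subject
voxels = torch.randn(1000, NUM_VOXELS)      # random stand-in for one subject's scans
targets = torch.randn(1000, 768)            # random stand-in for CLIP targets (reduced size)

loader = DataLoader(TensorDataset(voxels, targets), batch_size=32, shuffle=True)

model = torch.nn.Sequential(                # tiny stand-in; the real model is ~940M parameters
    torch.nn.Linear(NUM_VOXELS, 2048), torch.nn.GELU(), torch.nn.Linear(2048, 768)
)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

for epoch in range(240):                    # 240 epochs, as quoted from the paper
    for x, y in loader:
        loss = torch.nn.functional.mse_loss(model(x), y)   # simplified objective
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```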

Nathan Labenz: 1:02:45 And how about the topic of the 4 individuals? You train models specific to the individual. Can we say anything about how similar or different we are as individuals? If I were doing this, and maybe you did this, I would say, all right, I'm going to put subject A's data into subject B's model. If I do that, do I get total garbage noise out? Do I get something that's decent but not as good as it was on the actual person's model? How different are we under the hood?

Dr. Tanishq Mathew Abraham: 1:03:19 It's a good question. I don't know if we actually tried those experiments, to be honest; I'd have to ask the others. But definitely there are some differences in the visual perception that different people have, and that leads to different representations. We are interested in seeing if, for example, there is a way to map them to some sort of shared space that can be used for anybody. For now, though, there are these minor differences in the visual perception of different people, and that's why we're required to train separate models for the time being. It's an open research question, and it would be really nice if you could have one model that just works for everyone. We're trying to think of different approaches, and we have different projects that we're working on right now to address this. So if people listening to this are interested in working on this further, feel free to check us out. For example, maybe some sort of foundation model for fMRI: just train on all kinds of fMRI datasets and then try to use that as something we can map to. Or maybe you have a shared latent space for all these different subjects and any sort of fMRI data, and then some subjects may need to do a bit more data collection to calibrate the model, something like this. So there are different approaches that we are considering to account for that difference in visual perception between people. But again, it's an open research question, and there are lots of interesting avenues of research there. If people are interested in helping out, feel free to join us. I think there's a lot to explore in that direction.
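
One way to picture the shared-latent-space idea is a small per-subject adapter feeding one shared backbone, so most of the model is reused across people and only the adapter is fit per subject. This is just an illustration of the concept being described here, with made-up sizes, not an approach MedARC has settled on:

```python
import torch
import torch.nn as nn

SHARED_DIM = 4096                 # assumed width of the shared latent space
SUBJECT_VOXELS = {                # assumed per-subject voxel counts (they differ in practice)
    "subj01": 15724,
    "subj02": 14278,
}

class SharedSpaceModel(nn.Module):
    def __init__(self, subject_voxels, shared_dim=SHARED_DIM, out_dim=768):
        super().__init__()
        # One small linear adapter per subject maps that subject's voxels into the shared space...
        self.adapters = nn.ModuleDict(
            {s: nn.Linear(n, shared_dim) for s, n in subject_voxels.items()}
        )
        # ...and a single shared backbone maps the shared space to a CLIP-like target
        # (a 768-d vector here for brevity; the full embedding would be 257 x 768).
        self.backbone = nn.Sequential(nn.LayerNorm(shared_dim), nn.Linear(shared_dim, out_dim))

    def forward(self, voxels, subject_id):
        return self.backbone(self.adapters[subject_id](voxels))

model = SharedSpaceModel(SUBJECT_VOXELS)
out = model(torch.randn(1, 15724), "subj01")    # shape (1, 768)
```

Calibrating a new person would then amount to adding one more adapter and fitting only its weights on a small amount of paired data, which is essentially the few-image calibration idea discussed next.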

Nathan Labenz: 1:05:17 Sounds like, if I understand you correctly, your expectation is that we should be able to get to a point where relatively fast calibration is possible. To extend the pattern: originally there were 5 billion images in the datasets that trained the foundation models. It took a few thousand images to train something that could bridge the voxel space into the embedding space. And it sounds like you think there's probably enough similarity across people that maybe you can bring it down another thousandfold and have a 5-image calibration step that fits your personal biology, the shape of your own head, whatever, to the shared fMRI latent space. That's basically where you think this is going?

Dr. Tanishq Mathew Abraham: 1:06:17 Yeah, I think that's the hope we have, that something like that may be possible. And this is something the main author, Dr. Paul Scotti, has really been trying to spearhead, developing these sorts of approaches. He's really excited about this and really thinks something like this would be possible. There are various questions that need to be answered. If it is just 5 images, for example, what are those 5 images going to be? Are they different images per subject? You may have to find different optimal images, and then, how is that going to map appropriately? So there are lots of open questions, but the general idea is that's the kind of stage we're hoping we can get to in terms of image reconstruction. And that would be pretty exciting, I think, if we can get something like that working.

Nathan Labenz: 1:07:12 Yeah, it's amazing. How much do you think this ultimately feeds into better understanding of the brain, where there seems to be tremendous potential, versus the other direction, which would be wearable devices, more of a well-person consumer application? What's your expectation for how this research develops in that respect?

Dr. Tanishq Mathew Abraham: 1:07:38 In terms of wearable devices and things like this: of course, with fMRI you have to go into this MRI machine. It's a very large machine, and it's a very involved and time-consuming process. So to be able to use fMRI for a wearable, I think that would require a lot of development in MRI technology and things like this.

Nathan Labenz: 1:08:00 Yeah. It's a strong magnet to be carried around. No doubt.

Dr. Tanishq Mathew Abraham: 1:08:02 So it's more of a hardware problem, and I don't know if something like that would be solved, and if so, when. There are alternative wearable approaches for measuring brain activity, things like EEG and fNIRS, different approaches. But they have different trade-offs in terms of the signal they provide, the spatial resolution and the temporal resolution. EEG has good temporal resolution, but its spatial resolution is less than fMRI's. And we already talked about how fMRI's spatial resolution isn't actually that great, to be honest; it's this 2 mm x 2 mm x 2 mm voxel. EEG is even more coarse-grained than that. So it seems very unlikely that you can get a decent signal from EEG, for example. There may be some other technologies that could get some signal, but I'm not entirely sure which ones. So at this point there's a need for better hardware that can actually capture a high-quality signal if you want this to work for wearable applications, and I'm not sure that's possible right now. Even then, if you're using some technology other than fMRI, you have to validate whether a similar approach would work well for it. I expect that if you have a high-quality, high-resolution dataset that captures the relevant information, the general approach should work well, but it still needs to be tested and validated. So to summarize, for wearables you need better hardware, and if you are using a different technology, you need to validate the approach for that technology. That's why I'm not really certain this is something that could be used for consumer applications. It's not going to happen within the next couple of years, I'd say. But maybe, if there are some interesting hardware developments, that could become possible in the future.

Nathan Labenz: 1:10:10 Because what I would imagine to be a fundamental challenge of a consumer hardware scenario is what kind of resolution you can get. So I wonder if you could start to anticipate what you might need by taking your current approach and saying: what if we just make it 4 mm cubed instead? Go from 8 voxels to 1 by doubling the edge length, and just take the average of those numbers. Now, instead of 15,000, you have a little less than 2,000 numbers. Did you try anything like that? Do you think you could just say: what if we fuzz our data and see how far we can fuzz it before we can't read it anymore?
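
The degradation experiment Nathan describes is straightforward to simulate on existing data: average each 2 x 2 x 2 block of voxels so the effective edge length doubles and the voxel count drops roughly eightfold, from about 15,000 to a bit under 2,000. Here is a minimal sketch, assuming the voxels sit on a dense 3D grid (in practice they come from a cortical mask, so the bookkeeping is messier):

```python
import torch
import torch.nn.functional as F

# Assume a dense 3D grid of 2 mm voxels for illustration (real data is a masked subset).
grid = torch.randn(1, 1, 24, 26, 26)          # ~16k voxels at 2 mm resolution

# Average pooling with kernel size 2 merges each 2x2x2 block into one value,
# simulating 4 mm voxels.
fuzzed = F.avg_pool3d(grid, kernel_size=2)    # shape (1, 1, 12, 13, 13)

print(grid.numel(), "->", fuzzed.numel())     # 16224 -> 2028, roughly an 8x reduction
```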

Dr. Tanishq Mathew Abraham: 1:11:04 Yeah, we haven't actually tried anything like that. But one thing we may investigate in the future is some of the other datasets that used less powerful magnets; that's what was used in the field until a couple of years ago, when this new dataset got released. So it could be worth trying those other datasets and seeing what kind of performance you get with them. And your idea sounds interesting too, so there are definitely lots of different approaches to test out. It's an important question to think about: how much data, how much of the actual signal is needed, and what signal-to-noise ratio. But even then, something like EEG is still significantly coarser than fMRI, even than some of the less powerful fMRI machines.

And then I guess the other question was about research applications as well, right? We mention this a bit in the paper, and there are some potentially interesting applications. For example, you could imagine studying the image reconstructions of patients with different neurological diseases and seeing how those reconstructions change over time as the disease progresses. Or you could potentially use them for diagnostic applications. One example we were thinking of is that if you had someone with depression, maybe their reconstructions look a little different, maybe a little more dull; there may be some differences in the reconstructions that we'd be able to pick up on. There are various studies that look at these sorts of responses to images, or even just mental imagery, and a lot of those have been very coarse-grained, imagine some sort of object, or imagine an animal or something. So being able to get this more fine-grained information about what a subject is imagining or looking at, and being able to reconstruct it, could enable a lot of really useful neurological studies. I think there's a lot of interesting work there in terms of disease progression and diagnostics. And of course there are more basic applications: if you have locked-in patients who are in a coma or unable to communicate with the outside world, maybe some of these could be useful, again with further development of wearable technology and things like this, for interesting medical applications. So I think there are lots of interesting medical and research applications out there, and we're starting to reach the point where these sorts of image reconstructions can be used to research interesting clinical and neurological problems. I think it's quite exciting. Even if it's not necessarily being used as mind reading for consumers, there are still lots of interesting applications that we'll see in the short term in clinical research.

Nathan Labenz: 1:14:30 Yeah. I think one of my big takeaways from this paper is just how many doors are opening. So many different doors kind of recently opened that enabled you to do this. And I think you're obviously going to be showing the way to others. And it seems like as much as we have seen a lot of awesome stuff in the last couple of years, something tells me we're not anywhere near kind of the end of the fruitful exploration of the current paradigm. And the fact that you're able to do this kind of work with a single A100 is pretty telling in terms of how much value is already kind of embodied in these foundation models and just waiting for a super clever person to come along and figure out how to piece together the right architecture to kind of bridge different spaces and make all these different things talk to each other. It's gonna be wild, I think.

Dr. Tanishq Mathew Abraham: 1:15:31 One of the interesting things about this project is how it was conducted, in terms of the open research environment and organization that we had. This was done as part of MedARC, the research organization that I founded. The projects we work on are done in this collaborative, decentralized, open-source manner. The GitHub repository was always open source for people to look at and contribute to, and there were weekly meetings on Discord and chatting on Discord, sharing research ideas and progress. All of that was happening, and the compute was provided by Stability AI, who were able to support our research. I think this approach was really great, because a lot of interesting, smart, and clever people were able to contribute to this project. So I'm really grateful for the contributions, of course, of Dr. Paul Scotti, who's the lead author, and also Atmadeep Banerjee, who came up with some of these really interesting ideas that pushed this forward and really helped get this working. And then so many other contributors from around the world were able to work on this. I think this project demonstrates the value of this sort of open collaboration, and with this sort of open collaboration we can do all kinds of incredible things. Like you said, there are a lot of other interesting aspects in terms of being able to use foundation models that enable new opportunities, and then of course this sort of collaboration that is happening. These things wouldn't have been possible just a few years ago, so it's quite incredible what's possible now. It can be done by people sitting halfway across the world, working from their laptops. It's kind of incredible what's possible these days.
