Read Episode Description
As recently as January 2021, the challenge of "interpreting what is going on in a photograph" was considered "nowhere near solved." Today's guests Junnan Li and Dongxu Li changed that with their publication and open-sourcing of BLIP, which delivered state-of-the-art performance on image captioning and other vision-language tasks.
BLIP became the #18 most-cited AI paper of 2022, and now Junnan and Dongxu are back with BLIP-2, this time showing how small models can harness the power of existing foundation models to do multi-modal tasks.
We talked to Junnan and Dongxu about their research and how they see the trend toward connector models shaping the future.
(00:00) Preview
(01:17) Sponsor
(01:35) Intro
(05:50) Convergence of AI techniques
(07:33) Evolution of BLIP to BLIP-2
(08:12) How BLIP-2 unlocked multimodal functionality
(12:43) The size, training dynamics, and optimization function of BLIP
(20:15) Practical/Business applications of BLIP
(29:43) Efficiency of BLIP-2 compared to other models
(41:52) Two-stage pre-training
(47:11) Architecture of BLIP-2’s connector model
(58:52) Language models as the executive function of the brain
(01:07:32) Vision for an ultimate multimodal system and democratized pre-training for models
(01:12:59) Useful AI tools in these researchers’ day-to-day
(01:14:56) Upcoming projects
Thank you Omneky for sponsoring The Cognitive Revolution. Omneky is an omnichannel creative generation platform that lets you launch hundreds of thousands of ad iterations that actually work, customized across all platforms, with a click of a button. Omneky combines generative AI and real-time advertising data. Mention "Cog Rev" for 10% off.
Twitter:
@CogRev_Podcast
@LiJunnan0409 (Junnan Li)
@DongxuLi_ (Dongxu Li)
@labenz (Nathan)
@eriktorenberg (Erik)
Join thousands of subscribers to our Substack: https://cognitiverevolution.substack.com/
Websites:
cognitiverevolution.ai
Show Notes:
- Original BLIP demo
huggingface.co/spaces/Salesforce/BLIP
- BLIP 2 demo
huggingface.co/spaces/Salesforce/BLIP2
https://twitter.com/LiJunnan0409/status/1621649677543440384
- BLIP is the #18 most highly-cited paper in AI
https://mobile.twitter.com/LiJunnan0409/status/1631854807505076224
- Image captioning comparison tool
https://huggingface.co/spaces/nielsr/comparing-captioning-models
Full Transcript
Nathan Labenz: (0:00) In a way, it's not that dissimilar from how we see. We have our eyes that take in raw light and turn that into a signal, and that signal goes through the nerve and finally gets back to the brain. By that point, it's not that interpretable either. It doesn't necessarily correspond to language. But then there's some further connector that turns that visual data into something that I can understand as language, or at least understand and then articulate as language. So it feels like there is something analogous taking shape in the AI world.
Junnan Li: (0:36) Imagine you are a human. You grow up learning only language knowledge, and now one day you open your eyes. You don't really know how to interpret what you see. So that's what we are trying to do here: to build this bridge between these two modalities.
Nathan Labenz: (0:54) Hello, and welcome to the Cognitive Revolution, where we interview visionary researchers, entrepreneurs, and builders working on the frontier of artificial intelligence. Each week, we'll explore their revolutionary ideas, and together, we'll build a picture of how AI technology will transform work, life, and society in the coming years. I'm Nathan Labenz, joined by my cohost, Erik Torenberg.
Nathan Labenz: (1:17) Omneky uses generative AI to enable you to launch hundreds of thousands of ad iterations that actually work customized across all platforms with a click of a button. I believe in Omneky so much that I invested in it, and I recommend you use it too. Use CogRev to get a 10% discount.
Nathan Labenz: (1:33) Today's episode was a fun one for me. Researchers Junnan Li and Dongxu Li, both of Salesforce Research's Singapore office, have coauthored some of the most practically useful computer vision papers of the last year. As recently as January 2021, the challenge of using AI to interpret what is going on in a photograph was considered to be nowhere near solved. But just a year later, Junnan and Dongxu changed all that by publishing and open sourcing BLIP, a family of pre-trained models that delivered state-of-the-art performance on image captioning, visual question answering, and image-text matching. For Waymark, my company, BLIP was a godsend. Suddenly, we had a reliable way to understand the contents of users' images, allowing us to make useful image suggestions for the very first time. This was something we had worked toward for years. Unusually in today's AI landscape, BLIP has held the title of best image captioner for over a year, ultimately becoming the eighteenth most cited AI paper of 2022. And more recently, just as worthy rivals to BLIP started to come online, Junnan and Dongxu changed the game again with BLIP-2. As an aside, for a funny moment in Cognitive Revolution history, you can listen back to episode one, in which Suhail tells me about the release of BLIP-2 live on the show, forcing me to clear my calendar for the rest of the afternoon to go check it out. Now, BLIP-2 uses a different approach, which I think may ultimately prove even more influential. Rather than training a large model end to end, this time, they trained a much smaller model that connects a frozen vision model to a frozen language model. This strategy has several benefits. First, because it injects semantic visual information into the language model's latent space, you can now have an open-ended dialogue about an image in which the language model shows remarkably detailed and nuanced understanding. Second, because the connector model is so much smaller, training time and cost are dramatically reduced. BLIP-2 was trained on just a single A100 machine in less than 10 days, making it easy to upgrade the system as new and more powerful language models become available. You can even use your own fine-tuned language models as well. Small connector models like BLIP-2 show just how much potential remains to be drawn out of today's large language models and seem likely to play an important role in the great implementation of multimodal AI across society. One note for listeners: both Junnan and Dongxu are native Chinese speakers. And while both are perfectly fluent in English, audience members listening at 2x speed might benefit from watching this episode on YouTube, where we've also included subtitles for your convenience. Now enjoy our conversation with Junnan Li and Dongxu Li. Junnan Li and Dongxu Li, welcome to the Cognitive Revolution. Thanks.
Junnan Li: (4:40) Excited to be here.
Nathan Labenz: (4:42) Thanks for having us. So you guys have really done some outstanding work in the subdomain of AI known as computer vision. I guess I would even ask you right off the bat: do you think of yourselves as computer vision specialists? Or now with BLIP-2 and your work branching into multimodal models, do you think of yourselves as beyond computer vision? I'd love to hear how you think about where you fit into the broader AI landscape.
Junnan Li: (5:14) Yeah, I think with how AI has been developing, the boundary between different fields is blurry now. Both Dongxu and I started as computer vision people during our PhD, but gradually, it doesn't require too much expertise to move to another domain. That's why we started to explore the language domain and other domains, to see how we can be involved in building better AI models.
Dongxu Li: (5:44) He's just trying to be humble. He actually has a lot of expertise.
Nathan Labenz: (5:50) There is definitely this great convergence phenomenon. Transformers increasingly working for everything. It is amazing to see how quickly people can move from what used to be one subfield to another and just how much the same techniques are starting to work across these domains. I find that super fascinating. You guys have also done something that is pretty rare in today's AI landscape, which is that you put out a model almost a year ago now—the original release of BLIP. And that continued to be state-of-the-art or really neck and neck with one or two other models for state-of-the-art image captioning, all the way up until you released your most recent paper, which is BLIP-2, which obviously supersedes BLIP and I think now is safe to say is the state-of-the-art captioning model, question answering model as it pertains to images, and really starting to unlock longer dialogues and true understanding of images beyond the bite-sized captions that we've seen in the past. So I think that's really an amazing accomplishment. In today's AI world, we don't see too many things that can hold the top of the leaderboard for a year. Typically, it's more like a month. Sometimes it's more like a week. But you guys put something out there that is enduring work in today's AI context. So I want to compliment you on that, but I also want to dig into both the original BLIP and BLIP-2 and understand how you guys got into these projects, how they work, how you train them. We really do want to go pretty deep on that. So maybe let's start with the original BLIP. Tell us about the origin of that project and how it all came together.
Junnan Li: (7:46) Sure. Thanks for the compliment, by the way. Actually, before BLIP, we had a previous paper or two in this vision-language domain. For me, my first paper was ALBEF. I'm not sure if you have heard of it, but it was also accepted at NeurIPS. That is when we started to explore this field and felt like there's something we can do here. In ALBEF, which came out around the same time as CLIP from OpenAI, the idea was to build a multimodal encoder that can understand both image and text. CLIP is a unimodal encoder, right? You have the image encoder and the text encoder, and you can compute the similarity. What we found is that we can build on top of that contrastive learning approach to have another fusion encoder. We encode both image and text together, and we find that achieves quite good performance on tasks that require understanding of both image and text because of this fusion mechanism. So that's where we started. And then we found out that this encoder architecture is good at understanding tasks, but it's not really that good at generation tasks, in particular text generation tasks like captioning or VQA that require open-ended generation. In BLIP, we heavily inherited the architecture from ALBEF. We made some changes such that the encoder can also function as a decoder. We proposed what we call a mixture of encoder-decoder model, which is basically a single model with shared parameters that can switch between being an encoder and a decoder. So that's where the architecture comes from. And then we also found out that in order to train this model, you need a lot of data from the web domain, like image-text pairs. For previous methods like CLIP, you just want to do contrastive learning. You want to learn an encoder. Then it's fine that you have a lot of noise in those datasets, like LAION, those big datasets that have a lot of noise. If you just want to do contrastive learning, to do representation learning, the noise is fine most of the time. But if you want to do image captioning, this noise can really be quite harmful, because the language modeling objective is much finer-grained than the contrastive one. So that's where another contribution of BLIP comes in: we bootstrapped this dataset so that it has synthetic generated captions, and we used a filter to remove the noisy captions. Basically these two pieces, the architecture side and the dataset side, together make BLIP work.
Nathan Labenz: (11:01) How did you remove the noise from the dataset? Some of the captions that actually describe the image will have a high similarity score with the corresponding image encoding, but it's such a big dataset. There's so much noise in there. You've got a lot of things that are just like "what an awesome day," and that doesn't really line up with what's in the image, so that can add noise to the system. It sounds like in your case, to create a captioner, this was working against you. So did you just go through the giant dataset and filter out things with a low similarity score and work from that?
Junnan Li: (11:39) Yeah, exactly like what you described. We just go through all the examples and filter out those that are not aligned with the image. And I think it's a perfect point that in some cases, that noise could be good. Like I mentioned, in the contrastive learning stage, noise could be good because you are just learning one single vector. You are not really trying to generate the text. But in the image captioning space, let's say you give an image and you want to generate text. If the text is totally irrelevant to the image, then the model could learn something that we are not trying to learn. It doesn't capture anything from the image; it tries to hallucinate some stuff out of basically no context. So I think that's something we try to avoid.
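To make the filtering idea concrete, here is a minimal sketch of similarity-based caption filtering using an off-the-shelf CLIP model as the scorer. Note that BLIP's actual filter is a trained image-text matching model rather than raw CLIP similarity, and the threshold below is an illustrative assumption, not a published value.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Off-the-shelf CLIP scores how well a caption matches its image.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def keep_pair(image: Image.Image, caption: str, threshold: float = 0.25) -> bool:
    """Keep a web image-text pair only if caption and image are similar enough.

    The 0.25 threshold is a hypothetical choice for illustration.
    """
    inputs = processor(text=[caption], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    # Cosine similarity of the projected image and text embeddings.
    sim = torch.nn.functional.cosine_similarity(out.image_embeds, out.text_embeds).item()
    return sim >= threshold
```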
Nathan Labenz: (12:25) Remind me, how big is the original BLIP model? It's like one to two gigabytes downloaded, if I recall correctly. It's not huge, but tell me about the size and the training dynamics. I want to contrast that ultimately to the BLIP-2 successor.
Junnan Li: (12:43) Yeah, I think that's actually a beautiful part about BLIP-2: if you compare the number of trainable parameters—so "trainable" means that you are using backpropagation to optimize the gradient during training—actually, BLIP is larger than BLIP-2 in that trainable parameter count. BLIP has a few hundred million, but BLIP-2 only has less than 200 million. The reason is, in BLIP, we are training everything end to end, including the vision encoder, the language model—everything we train end to end. But in BLIP-2, we have this frozen image encoder, which we don't update at all, and we have this frozen large language model, which is very large. It can be a few billion parameters, but because we keep it frozen during pretraining, it actually incurs very little computation cost. So if you just compare how much time and GPU we need during this pretraining, BLIP-2 is actually cheaper than BLIP because we use these already off-the-shelf available models that are built by other amazing research teams.
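For readers who want to see what "frozen" means in code, here is an illustrative sketch of the freezing recipe: mark the big pretrained modules as non-trainable so gradients only flow into the connector. The module names are placeholders, not the actual BLIP-2 classes.

```python
import torch.nn as nn

def freeze(module: nn.Module) -> None:
    """Freeze a pretrained module: no gradient updates, no dropout noise."""
    module.eval()
    for p in module.parameters():
        p.requires_grad = False

def count_trainable(module: nn.Module) -> int:
    return sum(p.numel() for p in module.parameters() if p.requires_grad)

# Hypothetical usage with placeholder builders:
# vision_encoder, connector, llm = build_models()
# freeze(vision_encoder)
# freeze(llm)
# print(f"trainable parameters: {count_trainable(connector):,}")  # under 200M for BLIP-2
```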
Nathan Labenz: (13:52) On the original BLIP, what are you optimizing for there? Are you just optimizing for generating the exact caption, token by token, or is there some more abstracted or semantic loss function that you're optimizing against?
Junnan Li: (14:11) Actually, BLIP is an inheritor of the ALBEF paper. So our optimization function—if you look at ALBEF, BLIP, and BLIP-2, we have the foundational losses, which are very similar. There are basically three losses that we use. One is the contrastive learning. It's the same as what CLIP used to learn better representations and align these image and text encoders.
Nathan Labenz: (14:41) Can you unpack the contrastive learning a little bit better? I think most people will be familiar with CLIP, and they'll understand it largely as the thing that somehow stable diffusion is downstream of, and so it's important. I don't think people have a great sense in general of what exactly the insight there was that created that possibility. So give us a little bit more on that before we go on to the next two papers.
Junnan Li: (15:15) Sure. This term contrastive learning actually originates from the computer vision field. Basically, the idea is that you want to learn representations for your data, such that if your data is similar in semantic meaning, they should have similar representations. When we consider this for the image and text domain, you have these positive pairs, which are what you collect from the web. You have an image and text and they are correlated. You want to train the model such that their representations for this positive pair are more similar to each other compared to the similarity for negative pairs. These negative pairs are basically random sampled image and text pairs that don't correlate to each other. What this loss does is it trains this image and text encoder. They don't interact with each other until the last stage. They extract these image and text features individually. That's why we also call them unimodal encoders, because they don't interact with each other. After you extract the features, the final stage is you compute their similarities. That's where they interact. It's a very simple mechanism. You just use the dot product, which measures the cosine similarity of these normalized embeddings. You try to maximize the similarity of the values from this positive pair. That's how contrastive learning works. It's quite a simple mechanism. In the end, what you get are these very good image and text encoders that can produce good representations that capture their semantic meanings. That's why the vision encoder of CLIP and text encoder of CLIP have been really successful applying to different downstream tasks, because they capture the semantic meanings of the data.
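Here is a minimal sketch of that contrastive objective in code: normalize the two encoders' outputs, take dot-product similarities across the batch, and apply a symmetric cross-entropy so each image is pulled toward its own caption and pushed away from the others. The fixed temperature is a simplification; real implementations typically learn it.

```python
import torch
import torch.nn.functional as F

def image_text_contrastive_loss(image_emb: torch.Tensor,
                                text_emb: torch.Tensor,
                                temperature: float = 0.07) -> torch.Tensor:
    """image_emb, text_emb: (batch, dim) outputs of the two unimodal encoders."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    # (batch, batch) cosine similarities; the diagonal holds the positive pairs.
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    # Symmetric loss: match images to texts and texts to images.
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2
```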
Nathan Labenz: (17:25) Cool. Let's go back to the history then.
Junnan Li: (17:28) Yeah, so the first loss is this contrastive loss that gives you good unimodal encoders. But then if you want to do more fine-grained interaction between the image and text, you'll need more than just a dot product. You'll need some parameterized mechanism to interact. That's where we have this cross attention, where the text encoder can cross attend to the image encoder. It's kind of like a key-value encoder-decoder architecture, where the encoder is your ViT and your decoder is your text model. When we consider image captioning loss, basically this text decoder will cross attend to the image encoder and generate the text tokens. That's basically our second loss, which is the standard image captioning loss. It's just a language model loss, but conditioned on the image. Our third loss is what we call image text matching loss. The purpose of this loss is that we want to learn even finer-grained similarity or matching between the image and text through this cross attention mechanism. You can't really expect a single vector to capture all the fine-grained details of one image, right? Because the image is worth a thousand words, so there are too many ways to describe this image. A single vector is very condensed and a good representation, but if you really want fine grain, you need this cross attention. For this image text matching loss, it's a binary classification task where we give the model an image and text pair, and we ask the model through this cross attention computation to tell me whether this is a matched pair or not a matched pair. By doing this, we can further enhance this alignment between these two modalities. This is basically what the image text matching loss does. In our experiments, we found that these three losses complement each other, meaning that they have this multitask learning objective that will enhance the final performance of each individual loss objective. This works through different experiments. That's why we chose these three losses as the standard losses to use. I think many other papers now will also adopt these three losses or something similar.
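As a rough illustration of the third loss, here is a heavily simplified image-text matching head: text features cross-attend to image patch features, and a binary classifier predicts matched versus not matched. The real multimodal encoder stacks many such layers; this compresses it to one purely for clarity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ITMHead(nn.Module):
    """One cross-attention layer plus a binary matched/not-matched classifier."""
    def __init__(self, dim: int = 768, heads: int = 12):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.classifier = nn.Linear(dim, 2)

    def forward(self, text_feats: torch.Tensor, image_feats: torch.Tensor) -> torch.Tensor:
        # Text tokens act as queries over the image patch features (keys/values).
        fused, _ = self.cross_attn(text_feats, image_feats, image_feats)
        # Classify from the first (e.g. [CLS]) position of the fused sequence.
        return self.classifier(fused[:, 0])

def itm_loss(head: ITMHead, text_feats, image_feats, is_match: torch.Tensor) -> torch.Tensor:
    """is_match: (batch,) with 1 for true pairs and 0 for sampled negatives."""
    return F.cross_entropy(head(text_feats, image_feats), is_match)
```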
Nathan Labenz: (20:16) One of the things that I think BLIP has notably done better than any other model that I've tried is handling logos. It can almost read the logos. A lot of times it kind of fudges the words from the logo. Help us understand that. Just as a user, I see this really interesting behavior. Other captioning models really struggle with logos, and BLIP does quite well. How did you manage to do that? Was that a matter of the training data, or is there a technical reason that that happens? And then also, when I do see something, if it's like the Coca-Cola logo, it'll just know that's the Coca-Cola logo. Or the Salesforce logo, it's going to know that's the Salesforce logo. But when I have these long tail small business logos that it's probably never seen before, I imagine most of them are not in the training set, you'll see these things where it might be like Torenberg Plumbing, and instead it'll say like Torenstein Plumbing. It'll be almost the right name, but it'll kind of just flip it a little bit on the last couple letters and the last token. I've always kind of wondered, what is happening there? Where it clearly can sort of recognize the letters, but it's not doing exactly an OCR type mechanism, obviously. I'm always kind of confused as to how it ends up just being a little bit off in those scenarios, but I'd love to understand that.
Junnan Li: (21:42) Yeah, I think that's quite a fascinating phenomenon. I didn't really observe this before, but I think in terms of why BLIP can understand logos, I would say it's mostly from the pretraining data that we use. I think we were one of the first to scale up to the LAION dataset. Once LAION released this 400 million dataset, we started to use it to train the captioning model. There are a lot of logos in that data, so I would imagine the model can learn such information from web-scale data. Of course, these data are quite noisy, so it's likely that some spelling errors can lead to wrong recognition of the logo text. But I would still say that it is not a perfect OCR model because it's not really designed to do OCR, and the ViT itself, even though powerful, is not really the same as the best OCR architecture. OCR has this detect first mechanism, and then you really zoom in to each individual letter and you recognize. But ViT is more like a holistic view of the image and you're trying to recognize maybe the most salient part of the text. That's why if the text is small, it may fail. Another reason, as you mentioned, is if it's not a common name that appears frequently in the training data, the model may not really know that name, so it lacks the prior information to spell the correct name. I'm not sure, maybe BLIP-2 will be better at this because we are using a larger language model that has more knowledge about words, so maybe it knows more companies, so there's a stronger prior to give you the correct information.
Dongxu Li: (23:36) Yeah, I think I want to add something to that. I think that's an emerging advantage of this contrastive learning in the context of multimodal data. If you look at some of the earlier models before ALBEF and BLIP, many models were actually using some off-the-shelf detectors to provide object-level labels. But because of the usage of these object detectors, they won't be able to necessarily take into account this logo information, because if you look at these detectors, they usually don't tend to detect the logos. They detect boxes and tables, but not the logos. I think one emerging property since contrastive learning was introduced in both BLIP and ALBEF is that you don't need these off-the-shelf detectors anymore. These contrastive losses really align the captions very well. Each text token in the caption aligns to the corresponding region in the image. Because of that, if you see some text in the logo, that really aligns to the text appearing in the caption. So that gives the alignment between the logo image and the text words. I think there is no OCR loss for that, but we also have some other team members who tried to adapt BLIP to the OCR context, which turns out to work pretty well. That really demonstrates these foundation models are critical to serve general purposes for multimodal understanding without too much task-specific design. With just a little bit of adjustment, this pretrained model could be used for different scenarios and applications, which is powerful.
Nathan Labenz: (25:52) So with the original BLIP, you said it's end-to-end training. Is it correct to say that there is no knowledge before that end-to-end training? Like, the only text that BLIP ever has seen is the image captions that are associated with the images. There's no other text pretraining or anything like that that's being built on top of. Do I have that right?
Junnan Li: (26:17) Actually, we do use a pretrained BERT model to initialize this text encoder. Of course, BERT now is not considered to be the best language model, but still if you consider its size, it has some decent text understanding capabilities. So that allows us to initialize with a model that already knows text, then we start to train it on these captions to enhance this capability.
Nathan Labenz: (26:44) What was the total training time like for the original BLIP model?
Junnan Li: (26:51) Yeah, so if I recall correctly, we used 32 A100 GPUs, and the training takes about, for the largest model, it takes about a week to finish.
Nathan Labenz: (27:06) I was under the impression it was even longer, so that's actually pretty reasonably efficient. How many cycles did you guys go through in the process of doing that research? Did you run that week-long training process five times, 50 times, 500 times? We all see the one end product, but how many earlier versions of it were there that ultimately got tossed out?
Junnan Li: (27:31) Yeah, so actually, during research, we really start with not the largest model but a relatively smaller size and smaller training data, so we can iterate faster. Maybe within a few days, we know how the training is going and we can adjust. Because we already had this ALBEF work to give us a solid foundation of pretraining losses, we didn't really make any adjustment to the losses because we were quite certain they would work well. We made some adjustments to the model architecture because we have this one model that can do both decoder and encoder. We did some ablations on that. I think in total, fewer than 100 trials, maybe a few dozen, until we finalized the final model architecture and the pretraining strategy.
Nathan Labenz: (28:32) And what would you say were the biggest things that you learned or adjusted during that process? Is it like the learning rate schedule or other hyperparameters or something else?
Junnan Li: (28:47) Yeah, actually we didn't really have a single thing that gave us a significant boost. I would say mostly the model is robust with different changes. We did observe some instability in training if we increased the learning rate too much, so the loss may go to NaN sometimes for mysterious reasons. That's why we decided to keep the learning rate a bit lower. In terms of other hyperparameters like the batch size, we just fed in the largest batch size we could feed within the 32 GPUs. For the data and the pretraining losses, we didn't really make too many changes. We tweaked the architecture a little bit, but it's more like there is a trade-off between efficiency and performance, and we found a sweet spot that gave us the best performance while still being efficient.
Nathan Labenz: (29:43) So this is 32 GPUs and seven days. That is essentially, whatever, 200-ish GPU days. That's a significant amount of compute.
Dongxu Li: (29:59) Yeah. Just to jump in, I think there are tons of models that can take way longer hours than that. Even some of the earlier models, although they are not as capable, they require way longer hours to train. I think that also shows that this model architecture is quite efficient to capture this multimodal understanding. 200 GPU days is indeed a lot if you just think about it.
Junnan Li: (30:46) If you look at
Dongxu Li: (30:47) CoCa and PaLI, especially these ones from Google, they train on tens of thousands of GPU hours, I would say, at least. And I think at least tens of times more training data than BLIP. While saying that, we're still able to achieve comparable, I would say in most cases, even better performance. I think that really demonstrates that it's important to make good choices on architecture and training strategies in addition to proper scaling in terms of datasets.
Nathan Labenz: (31:32) I don't know if you guys know the answer to this question, but I looked it up just today. If you don't know, I'd love to hear what you would guess is the number of times that the original BLIP model has been downloaded from Hugging Face in the last month, just the last month?
Dongxu Li: (31:50) I'm not
Junnan Li: (31:53) I'm not too sure because BLIP was uploaded to Transformers not long ago, right? The Hugging Face team integrated it maybe half a year ago or less. I would guess 1,000 downloads? Maybe 3,000, I would say.
Nathan Labenz: (32:11) From just the last month, mind you, just under 20,000 downloads of the original.
Junnan Li: (32:19) Which model was that?
Nathan Labenz: (32:20) The captioning model? That's a good question, let me see.
Dongxu Li: (32:24) We sometimes take a look at these statistics and feel quite excited about it. We also learn a lot from how people are using it, and we see a lot of different application scenarios that we weren't actually expecting, but it's amazing. I really appreciate the community effort and feedback. That really helps us develop a better idea of what the model is doing and where we're going in the future.
Nathan Labenz: (33:03) Just to answer your question, it is the BLIP image captioning base model that has been downloaded exactly, according to Hugging Face, 18,976 times in the last month.
Dongxu Li: (33:16) That's amazing.
Nathan Labenz: (33:17) I'd love to hear some of those stories of unexpected use cases. And maybe you could give us a little guidance for anyone who's thinking, "How do I get in on this action?" I would say, by the way, for Waymark, we do not fine-tune the model. We just use the base. With the kind of web-scale data that you trained it on, the base works really well for us. I'd love to help the audience understand what would it take for them to do a fine-tuning in terms of dataset, any gotchas that you've observed, and compute resources that they would need to get into that.
Junnan Li: (33:53) Yeah, that's definitely something I want to say. If you want to fine-tune BLIP, we have support in our LAVIS library. We spent a lot of time and effort last year to build this library that offers you not only very convenient inference using our models, but also training and benchmarking. We set up this framework where you can do training of our pre-trained models and fine-tune them quite easily using your custom dataset. All you need to do is prepare the dataset in a certain format and then write some configuration files following the default ones. If you want to change some hyperparameters, you can also override them. Then you call our train method to train the model. In terms of compute, it depends. Of course, if you have a lot of GPUs, that's great. But if not, it takes longer to train. For the base model, one or two GPUs can still run the training. It just may take longer and you may need to do some gradient accumulation. Any size of dataset would work. Of course, if you have only tens or hundreds of captions, I'm not sure how much effect it will have. But let's say if you have hundreds of captions, I would say that's worth a try to fine-tune. That's the strategy I would suggest for BLIP fine-tuning.
For BLIP-2, we also provide fine-tuning in the same LAVIS library, but that may require a bit more computing resources because we use these large language models. Even if we keep them frozen, for our smallest OPT 2.7 billion model, you still need a decent amount of GPU memory. I think a V100 should be enough to run this fine-tuning using the smallest BLIP-2 model that we provide. But for BLIP-2, most of the time you don't really need fine-tuning because it's quite generalizable to different scenarios.
I've seen several interesting use cases for using this captioning. Last year we saw a quite interesting use case where they generated captions for Pokemon images and fine-tuned a Stable Diffusion to generate the Pokemon based on the text. That's one interesting use case I've seen before. Recently I found another demo that uses captions to do image search. That's also something we've been exploring. These captions are a concentrated representation of your image and they're human interpretable, right? If you translate every image into text, the amount of information is concentrated into these very small text tokens. They occupy very small space compared to the original image. You can use all these existing techniques like sentence embeddings to do very fast similarity search across a wide database. That gives you an alternative way to do image search and also image-to-text search.
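Here is a sketch of that caption-then-search pattern: caption each image with a LAVIS BLIP checkpoint, embed the captions with a sentence-embedding model, and run similarity search purely on the text side. The API calls follow the LAVIS and sentence-transformers documentation as best recalled, and the file names are hypothetical, so verify against the current docs before relying on this.

```python
import torch
from PIL import Image
from lavis.models import load_model_and_preprocess
from sentence_transformers import SentenceTransformer, util

device = "cuda" if torch.cuda.is_available() else "cpu"
captioner, vis_processors, _ = load_model_and_preprocess(
    name="blip_caption", model_type="base_coco", is_eval=True, device=device
)
embedder = SentenceTransformer("all-MiniLM-L6-v2")

def caption(path: str) -> str:
    """Generate one caption for the image at `path` with the BLIP captioner."""
    image = vis_processors["eval"](Image.open(path).convert("RGB"))
    return captioner.generate({"image": image.unsqueeze(0).to(device)})[0]

# Index a small image collection by its captions, then search with text.
paths = ["dog.jpg", "beach.jpg", "office.jpg"]  # hypothetical files
captions = [caption(p) for p in paths]
caption_index = embedder.encode(captions, convert_to_tensor=True)

query = embedder.encode("a dog playing outside", convert_to_tensor=True)
best = util.cos_sim(query, caption_index).argmax().item()
print(paths[best], "->", captions[best])
```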
Dongxu Li: (37:43) If I were to add something about how to use the captioning model, I would say first, I strongly recommend taking a look at the BLIP-2 model. We've tried that one in some of our recent experiments, and we find that the captions from BLIP-2 do significantly better than BLIP. What's happening there also depends on your use case. For BLIP, the captioning model released was fine-tuned on COCO Caption. The effect of that is you observe that descriptions sometimes tend to be a little bit generic in the sense that it loses the variety in specific namings and that kind of stuff.
If you look at BLIP-2, we have multiple versions of the captioning model released, and some of them are just pre-trained on LAION and other web datasets. If you use that model, it actually gives you very concrete and customized namings of the object. It can recognize, for example, the car makes, all these logos, and that kind of stuff, which could provide more useful information if you want to have really fine-grained information there. Really try the web pre-trained BLIP-2 version first before you actually go into the fine-tuning phase, which is more expensive. Usually, I would say the BLIP-2 version is probably strong enough for a lot of these applications.
Nathan Labenz: (39:53) You kind of anticipated another question I had with the car makes. So I'm in Detroit, Michigan. The auto industry here is in our blood. One of the challenges that car makers have a lot of times is identifying their cars down to the model, and then they've got the trim, which is the specific package that they have. They have lots of data, right? Plenty of shots of all these cars. Do you think it's feasible to get to the point, if you had a significant scale dataset, where you could get accurate down to the model of the car and even the details of the trim? Have you seen people push the performance to that very high level in a narrow range with BLIP?
Junnan Li: (40:46) I think it's definitely possible if you have a good enough and large dataset. I don't see any reason that you cannot do it. Currently, from a research perspective, we don't really have access to those fine-grained datasets. But we are trying to improve the model on certain domains that could be more widely applicable. Like Dongxu mentioned, we try to improve our OCR and try to improve some other capabilities by specific tuning on the individual domains. I would say that's definitely a good way to improve the performance on specific downstream tasks.
Nathan Labenz: (41:33) Cool. Well, we've gone back and forth a little bit between BLIP and BLIP-2. It's funny because I'm just such a BLIP stan that I want to talk about it from all these different angles. But your new thing is BLIP-2, and that has superseded, as you said, even the original as more general. I think the most fascinating part about it is the fact that it uses and really connects these pre-trained vision or image models and pre-trained language models and helps them work together. My gut says that this is going to be a big trend, right? Because we're seeing the proliferation of the language models as well as the image models, but especially language models, really going wild right now.
And yet, the 200 million parameters, as much as that's not the biggest thing that's ever been done, is prohibitive for most projects. Most people cannot get to that, if only because the cycle time is just going to be too slow and they just don't have the calendar time to figure all that out. That's a lot of research. But the training time, because it's a connector model with so many fewer parameters, I believe is down to 10 days on a single machine. And I assume that could be parallelized down to even shorter run times. So I really want to unpack this connector model a little bit. Tell us, where did the idea for the connection-style model come from? I really want to get your vision for what that's going to look like over the next couple of years.
Junnan Li: (43:18) Yeah, I think what you mentioned is a very insightful point, and that's something actually we try to push in the paper. Because initially we found that there's all these amazing visual models. You have people just dedicating to pre-training better visual models with self-supervised learning, contrastive learning. And you have another group of NLP researchers trying to push the boundary of these language models, right? You have instruction tuning and all these GPT-style models. But for the vision-language domain, people are still doing pre-training from scratch. That's kind of puzzling for me. Why can't you just bring the available progress together, right? You do some kind of connection that can have a very flexible way to combine different models in efficient ways so that you can harvest the progress from the individual fields. That's our motivation.
In terms of the connector module itself, the architecture, we are heavily inspired by a few previous works because we use this Q-Former technique to extract features, and we have this cross-attention process inherited from the previous BLIP architecture. We were also inspired by Perceiver, which is one of the first models to use queries to extract features. And also Flamingo is one of the previous works that also uses a query mechanism. But I think what we found different is not really the architecture itself, right? This connector module, there are different ways to build it. I think we chose one of the most efficient ways so that you don't need to change anything about the language model. You just plug it in as a kind of prompt tuning.
But I would like to highlight that the reason we make it work is because of our pre-training strategy. This is really something unique from BLIP-2, I think, because we have this two-stage pre-training strategy. What we did is that we first connect this connector to the vision model and do pre-training so that this connector is very well aligned with the vision model. It can understand the visual information very well in terms of how the text can correlate to the image features. And then only after this stage, we plug in the language model and adapt this connector so that it can work as a bridge between this visual model and language model.
What we found in our paper is that if you remove the first stage and just do a connection between these models and you do this kind of image captioning loss, then the performance becomes much, much worse. There will be phenomena like catastrophic forgetting, which was widely observed in previous papers like Flamingo that just use this kind of generation loss from the start. I think the reason is that we need this connector to have a good understanding first before it learns how to teach the language model to generate. Because these language models, they are really large, are prone to overfitting, and they don't really have any understanding about the image.
That's why this pre-training strategy, I would say, is the most useful technique that we propose. And this is also applicable to other multimodal domains, right? You just need this connector, you connect to the first module first, you do some pre-training, and then you connect to the second module, and you do some pre-training. I think that's quite a generic way to do it, and we do hope that this can be applicable to other domains and power other applications.
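To make the two-stage idea concrete, here is a runnable toy schematic: tiny stand-in modules replace the real ViT, Q-Former, and LLM, and single stand-in losses replace the full objective mix. Only the control flow mirrors BLIP-2 (align the connector with the frozen vision encoder first, plug in the frozen language model second); every module and loss here is a simplification.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
vit = nn.Linear(100, 64)       # stand-in for the frozen vision encoder
llm = nn.Linear(64, 1000)      # stand-in for the frozen LLM (1000-token vocab)
connector = nn.Linear(64, 64)  # the only trainable piece
for p in list(vit.parameters()) + list(llm.parameters()):
    p.requires_grad = False

opt = torch.optim.AdamW(connector.parameters(), lr=1e-4)

def fake_batch(n: int = 8):
    """Random stand-ins for (images, text embeddings, text tokens)."""
    return torch.randn(n, 100), torch.randn(n, 64), torch.randint(0, 1000, (n,))

# Stage 1: align the connector with the frozen vision encoder.
# (A contrastive stand-in for the real contrastive/matching/generation mix.)
for _ in range(100):
    images, text_emb, _ = fake_batch()
    q = F.normalize(connector(vit(images)), dim=-1)
    t = F.normalize(text_emb, dim=-1)
    loss = F.cross_entropy(q @ t.t() / 0.07, torch.arange(q.size(0)))
    opt.zero_grad(); loss.backward(); opt.step()

# Stage 2: only now plug in the frozen LLM and train the bridge generatively.
# (A next-token stand-in for the captioning loss through the frozen LLM.)
for _ in range(100):
    images, _, tokens = fake_batch()
    soft_prompt = connector(vit(images))  # plays the role of the query embeddings
    loss = F.cross_entropy(llm(soft_prompt), tokens)
    opt.zero_grad(); loss.backward(); opt.step()
```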
Nathan Labenz: (47:12) One of the things I think is most fascinating about it is your connector model, which really, I mean, BLIP-2 when you use it is an ensemble, right? You have the image part, which is frozen. You have the language part frozen. Then you have the connection in the middle, which is what you've trained. It is predicting embeddings that get injected directly into the language model. Correct? It's bypassing the text encoding and just going straight to the embedded text layer of the language model.
That in and of itself is an eye-opening thing for me. It also creates some discomfort for me in the sense that, obviously, in February 2023, I don't think we're facing imminent danger from models like BLIP-2. But I sort of extrapolate this trend out a little bit, and I start to think, boy, you could really connect a lot of different sensors to language models in this way. And then you could really start to cobble together not just bimodal but truly multimodal systems. And those can start to do all kinds of things. But what is a little bit concerning to me is the lack of understanding of exactly what is being injected into the language model. How is this understanding happening, right? It becomes ultimately pretty inscrutable.
So I wonder what you think about that. And also, I was curious about whether you have any way to figure out if those embeddings that it is predicting are, can you backport those to text? Is there a way to understand, in a human legible way, what exactly is being injected into the language model?
Junnan Li: (49:12) Yeah, that's a great question. This kind of injecting embeddings has been around for a while from the prompt tuning technique. In NLP, you have soft prompts, which are basically embeddings that you learn and prepend to your text input. You give this to the language model and it can guide the language model to predict certain things. You can fine-tune the prompts to guide the language model for certain downstream tasks and get better performance. People have tried to interpret what these prompts learn, and the conclusion so far is that they're not really interpretable. It's kind of a black box in terms of what soft prompts really capture, because the language model is so big. There are so many hidden representations that can guide it towards certain things. In terms of soft prompts, I would say BLIP-2 is similar in that we're trying to provide prompts that can embed vision information. We don't really know exactly what vision information they contain, but they are representations of the image that the language model can make use of. I think why we're sure that there must be representations of the image comes back to our first stage pretraining. Because in our first pretraining, we're using image-text contrastive loss and image-text matching loss. From these pretraining objectives, we can be certain that this connector, which we call Q-Former, is learning the most representative features from the image. It's kind of like a feature extractor—you extract good features from a frozen image encoder. We're certain of that because we know these are good features that represent the image well. That's why we're confident that if you put this to a language model, it's most likely it will teach the language model about the image rather than something else.
Nathan Labenz: (51:25) Yeah, I wonder if we can give folks even a little bit better intuition for this. As you said earlier, a picture's worth a thousand words. It's fascinating to me that, especially with relatively little compute—1 machine for 10 days is the total thing—you can figure out a way to predict these injectable embeddings into this space which was originally created by embedding text and which is interpreted as if it were text by the language model. The loss is ultimately based on what comes out the other end of the language model, and that that all still works. It's like there is this invisible dark space within the language embedding space that language itself cannot access, but which this model can learn to access in such a way that it is still immediately interpretable by the language model itself.
Junnan Li: (52:43) It's definitely worth more research to really find out the working mechanism. I just want to mention some of our previous efforts. We do have a previous paper where we try to directly map the image to interpretable text tokens, and we give this text as input to the language model to see what it can do. We find that it's quite good. Let's say we generate captions and then we give these captions to the language model as context to answer questions or do some other tasks. It can perform well, but there are some limitations. The major limitation is that those captions—it's hard to represent every single image. We need to first find relevant captions, meaning relevant to the task. If I want to ask a specific question, then I need the caption to be relevant to that specific question so that the language model can make use of it. Secondly, we need to generate a lot of captions. One caption is not enough—we need to maybe generate 20, 30, or even more to hopefully capture more information about the image. That's why we changed to this paradigm where we inject embeddings, because each embedding is itself a vector. I think it's a 768-dimensional vector, and we have 32 of these vectors. This can actually capture quite a lot of information. If you consider images themselves—just 224 by 224 pixels—and now we transfer the knowledge into these embedding vectors. How do we make sure these are interpretable by the language model? If we just randomly generate these vectors and give them to the language model and ask the language model to generate text, there are likely a thousand ways the language model can interpret the image and generate some text. That's what most previous approaches are using. They just train with the language model at the end, so the training signal is purely coming from the language model's output and backpropagating to the connector. There are a lot of ways for this connector to basically cheat. It can cheat such that during training it can guide the language model to produce certain output, but it cannot really generalize. It doesn't really understand the image. It's just cheating because it can change its own output and adapt to the language model. Again, that's why we need this first stage to make sure this connector itself has really good understanding of the image. That's why during our 2-stage pretraining, our first stage actually takes a lot of time. We take maybe 6 days to pretrain the first stage, and the second stage, when we plug in the language model, we only need maybe 2 days. That's drastically different from previous approaches. This also means that after we pretrain this connector first, we can plug in different language models and it doesn't take much time to adapt, because the connector itself already has good understanding of the image. It's hard for it to find the shortcut. Deep learning models always try to learn the easiest solutions to a certain problem. If there's a shortcut, they will find it. By making it understand the image first, these obvious shortcuts disappear because it cannot really overfit. Basically, the easier solution for the connector is now to understand the image, because it already knows the image. That becomes the easier solution for it. That's why I think it works well in our case, and we don't really just rely on the language model to teach the connector. We pre-teach it first so that it can teach the language model instead.
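The injection mechanism itself can be shown in a few lines: project the connector's query outputs into the language model's embedding width and prepend them to the token embeddings via inputs_embeds. Here OPT-125m stands in for the frozen LLM, and the projection is untrained random weights, so the output is meaningless; only the wiring is the point.

```python
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("facebook/opt-125m")
llm = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")
llm.requires_grad_(False)  # the language model stays frozen

# 32 query vectors of width 768, as discussed above (random stand-ins here),
# projected into the LLM's embedding space.
queries = torch.randn(1, 32, 768)
proj = nn.Linear(768, llm.config.hidden_size)
soft_prompt = proj(queries)                            # (1, 32, hidden)

prompt_ids = tok("Question: what is shown? Answer:", return_tensors="pt").input_ids
text_embeds = llm.get_input_embeddings()(prompt_ids)   # (1, seq, hidden)

# Prepend the visual "soft prompt" ahead of the real token embeddings.
inputs_embeds = torch.cat([soft_prompt, text_embeds], dim=1)
attention_mask = torch.ones(inputs_embeds.shape[:2], dtype=torch.long)

out = llm(inputs_embeds=inputs_embeds, attention_mask=attention_mask)
print(out.logits.shape)  # (1, 32 + seq_len, vocab): image tokens are treated like text
```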
Nathan Labenz: (56:47) That's really interesting and definitely gives me a better understanding than I had just from reading the paper, so thank you. Have you gone as far as to try to find—you might think, okay, what's the closest I could get if I just had text to the embeddings that are predicted by the connector model? Have you tried to figure that out? And if so, are you able to get at all close, or is it just kind of a totally different universe? Or probably a better analogy would be a totally different language that the connector model is speaking.
Junnan Li: (57:23) Yeah, I would say it's quite difficult to map the output of the connector to certain text tokens because, like you said, the embedding space is so huge. I don't think it's really interpretable by human language, but I do think there's a lot of information there—maybe not in the form of human interpretable text, but definitely interpretable by the language model. I do see a lot of research work that can be done in that space to try to make it less of a black box for us.
Nathan Labenz: (58:00) I think of language models increasingly, especially with the rise of these connector models, as kind of the executive function of these expanding ensembles that are going to be fairly general and can do a lot of different things. In a way, it's not that dissimilar from how we see. We have our eyes—the eyes take in raw light and turn that into a signal, and that signal goes through the nerve and finally gets back to the brain. By that point, it's not that interpretable either. It doesn't necessarily correspond to language. But then there's some further connector that turns that visual data into something that I can understand as language or at least understand and then articulate as language. It feels like there is something kind of analogous taking shape in the AI world right now, where the language model feels like the center. I'm not big on analogies. I honestly am very suspicious of analogies. So tell me, I want to hear why you think this analogy is wrong or where you think it falls short. But it feels intuitively like there's this kind of analogy between myself or my conscious awareness and my narrative center, and then the eyes and the image models are analogous. You've now created the circuits that connect the two. Do you like that analogy? Do you think it's missing the mark? How is it falling short?
Junnan Li: (59:41) I think that's a very good analogy, to be honest. I think the reason why language models are so powerful now is because they are pretrained basically on the entire web. The amount of knowledge and information is so much on the web. In particular, these texts are very concentrated information. I would say with the current kind of deep learning models and transformers, it's easier to learn from text than from images because images are raw pixels. It takes some processing to really understand what's going on. But these texts—one word can encompass a lot of information. I think that's why language models have been very successful in learning this kind of world knowledge and having conversations, answering questions, and doing all these amazing tasks. I do see that as kind of a brain, but in the current status, at least, the language model can be considered as the central piece that holds all the knowledge. What we are trying to do is bridge this knowledge to some other modalities and make it process other modalities' information. This could be hard because these language models have never seen that data. Imagine you are a human—you grow up learning only language knowledge, and now one day you open your eyes. You don't really know how to interpret what you see. That's what we are trying to do here, to build this bridge between these two modalities. I do see there is potential that by adding additional vision or other modalities, the language model itself can also improve. We are giving it more information to learn from. For now, we are keeping the language model frozen because mostly our data—this image and text data—is not so great in the sense that the image is good but the text corresponding to the data is limited. If we have better quality text, even better paired image and text, potentially I do see there is a way we can teach the language model to further improve from this additional data as kind of new knowledge it can learn from.
Nathan Labenz: (1:02:17) I can see a pretty direct path to that, just combining a couple of ideas from your last few papers. Slicing is one thing we're going to try. We haven't done this quite yet because you just sent me the paper the other day, but with Waymark, we're going to at least start to experiment a little bit with slicing images into just pieces. Slice them into 4 rectangles or 9 rectangles or whatever, caption each of those, then maybe use a language model to try to synthesize all those captions into one overarching caption. I've seen some of that stuff with video as well where you take a frame every second or whatever and then caption all of those and then use a language model on top of that, and you can synthesize the narrative. Again, this is all just frozen stuff. I think I did that with text-davinci-002. Here's a bunch of captions. Tell me what must be going on in this video. And that works off the shelf. So it seems like we're now entering kind of a phase 2 where it would probably take a significant amount of compute to do this, but I would expect that if you went back through the LAION dataset, for example, and revisited either some of those noisy images or even some of the better ones that just may have short captions and ran a process like that, you could probably enrich the dataset quite a bit and end up with a dataset that you could then go back to end-to-end training with. Is that kind of the direction that you guys are headed next? Am I picking up where you're going?
Junnan Li: (1:03:54) Exactly. I think actually this has been done. Not exactly like what you described, but it has been done. Like I said, in our previous paper, we already generated synthetic captions on these LAION images. We didn't really slice them into different crops. That's definitely something we could have done, but I think due to efficiency and speed concerns, the LAION dataset is so huge, we just randomly sample captions for each image. But I think after our paper, the LAION team themselves actually released a version called LAION-COCO, a synthetic caption dataset where they use the BLIP model to generate captions. They make sure the captions are higher quality, so they do some random sampling and even do some paraphrasing. That dataset, in my opinion, is quite good in terms of quality. It may lack some diversity because it's generated by just one model, one BLIP model. If you compare it to the web data, it may lack some diversity, but it's definitely much higher quality. The text is more aligned with what the image is about.
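Here is a sketch of the tiling idea from the exchange above: crop the image into a grid, caption each tile with the Hugging Face BLIP checkpoint mentioned earlier, and collect the tile captions into a prompt for a language model to synthesize. The 2x2 grid, generation length, and prompt wording are arbitrary illustrative choices.

```python
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

def tile_captions(path: str, grid: int = 2) -> list[str]:
    """Caption each cell of a grid x grid slicing of the image."""
    image = Image.open(path).convert("RGB")
    w, h = image.size
    captions = []
    for row in range(grid):
        for col in range(grid):
            tile = image.crop((col * w // grid, row * h // grid,
                               (col + 1) * w // grid, (row + 1) * h // grid))
            inputs = processor(tile, return_tensors="pt")
            out = model.generate(**inputs, max_new_tokens=30)
            captions.append(processor.decode(out[0], skip_special_tokens=True))
    return captions

caps = tile_captions("photo.jpg")  # hypothetical file
prompt = ("Here are captions of parts of one image:\n- " + "\n- ".join(caps)
          + "\nDescribe the whole image in one sentence.")
print(prompt)  # this prompt can now go to any language model for synthesis
```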
Nathan Labenz: (1:05:11) Yeah. Do you think that these techniques will work for other modalities? When I think of video, for example, or sound, the one thing that jumps out immediately is there's a time dimension to those that complicates things quite a bit. With video, I can more easily imagine just downsampling and running things that way. With sound, it feels like it would be a bit of a bigger leap. But ultimately, I also imagine that it's going to be smart enough to figure that out. Do you anticipate that similar techniques will work across all these different modalities?
Junnan Li: (1:05:48) I would say so. For videos, it's actually quite straightforward to adapt to video input. You just downsample certain frames, add a timestamp position encoding to each frame, and give this to the encoder so that it can encode all the frames and keep track of their relative positions in the time dimension. For sound, I'm not really an expert, but I would say you can do a similar thing. You have these audio encoders that can encode wave signals, and then you have a language model that can understand language, and you try to bridge those two using some of our techniques. I would say it would definitely work.
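A simplified sketch of that video recipe: sample frames, encode each with an image encoder, and add a learned timestamp embedding so downstream models can keep track of temporal order. The stand-in linear encoder and the shapes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class FrameEncoderWithTime(nn.Module):
    """Encode sampled video frames and tag each with its position in time."""
    def __init__(self, feat_dim: int = 768, max_frames: int = 64):
        super().__init__()
        self.image_encoder = nn.Linear(3 * 224 * 224, feat_dim)  # stand-in frozen ViT
        self.time_embed = nn.Embedding(max_frames, feat_dim)     # timestamp positions

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (num_frames, 3, 224, 224), already downsampled from the video
        n = frames.size(0)
        feats = self.image_encoder(frames.flatten(1))            # (n, feat_dim)
        times = self.time_embed(torch.arange(n))                 # (n, feat_dim)
        return feats + times  # per-frame features carrying temporal order

video = torch.randn(8, 3, 224, 224)  # e.g. one frame per second from an 8s clip
print(FrameEncoderWithTime()(video).shape)  # torch.Size([8, 768])
```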
Nathan Labenz: (1:06:38) This feels like one path to something like AGI. Obviously, people have very different things in their mind when they talk about AGI. But take something like ChatGPT or Claude, maybe a next generation of those, and equip them with all of these connector models where they have the ability now to understand visual data and sound data. And then you can also imagine connector models coming out the other side too. If you strip off that last layer that turns all the logits into a single token prediction and you take that last probability distribution and feed that into another connector model, it doesn't seem like a huge stretch that you could get action or motion predictions out of that. So is that the macro vision that you're working toward? Do you see yourselves as building toward this giant ensemble that ultimately can be a super powerful system?
Junnan Li: (1:07:47) Yeah, our vision is to build this ultimate multimodal system that can do a lot of things: not only images, but maybe also coding and actions and understanding other modalities. I think that's something we're trying to build. I would definitely love to have larger models to train with, but unfortunately, that's not available. So at Salesforce Research, what we're trying to push is, first, to open source all the models we have. I think almost 100% of our research is delivered to the community with open source code and models. We're also trying to democratize AI, pretraining in particular, because it's so costly, especially for language models. Most practitioners won't have the resources to train them. That's why we're trying to propose techniques that are more like a methodology than a model, so that you can make use of these available models in a more convenient and efficient way. On top of our method, you can add other strategies you have, like efficient adapter modules. I think it opens a door for others to really make use of these large models, because we can adapt them while keeping them frozen during training.
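A minimal sketch of the frozen-backbone-plus-adapter pattern Junnan alludes to, assuming PyTorch; the bottleneck size and residual placement are generic adapter conventions for illustration, not BLIP-2's actual Q-Former:

```python
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: a few trainable parameters added around a frozen
    model, so the big pretrained weights never need gradient updates."""
    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, hidden):
        return hidden + self.up(self.act(self.down(hidden)))  # residual update

# Freeze the backbone so only the adapter parameters receive gradients:
# for p in backbone.parameters():
#     p.requires_grad = False
```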
Nathan Labenz: (1:09:26) What should we be on the lookout for from you next?
Junnan Li: (1:09:29) We've got some exciting news. Dongxu will have a new model coming out soon for text-to-image generation. I feel that could be quite exciting. And on BLIP-2, we're working on the next version, which hopefully will be even stronger than the current one.
Nathan Labenz: (1:09:54) It is amazing, the pace of research right now. Do you feel like everything you do is working? Are there a lot of failures that you're not publishing, or is it basically like all the projects are working?
Junnan Li: (1:10:07) There are definitely failures. I think our strategy is to focus first on a topic and direction that we feel will be impactful to the community, and then we work on that. We may meet some failures along the way, but because we're confident it will work in the end, we continue to push for it. And most of the time, we eventually meet our target.
Dongxu Li: (1:10:40) A lot of these failures in research aren't necessarily just failures; they succeed to different degrees. If you want to get to AGI, you don't get it immediately after one grand research project. It takes a lot of iterative effort. So we succeed in the sense that we believe we're making steady progress toward that goal, and we also want to take some risks in the meantime, so that we can always explore something we're curious about, tell people what we discover, and push up the metrics and benchmark results. We really want to share the insights and discoveries so that others can benefit from our findings. To do that, it's also important that our methods and code are all open sourced. We really benefit a lot from the open source community; with the BLIP models especially, we leverage open source vision and language models. Community feedback and community effort are really important to pushing the development of AI forward, and we're quite committed to that.
Nathan Labenz: (1:12:28) What are your first languages and what do you speak in the office environment? Is it all English at work, or is it a mix of things?
Junnan Li: (1:12:38) The team comes from quite diverse countries. Dongxu and I are from China. We also have team members here from Singapore, Vietnam, and India. At the office, we speak English; that's the official language. But between ourselves, we speak Chinese.
Nathan Labenz: (1:12:59) Being able to speak at a high level on technical topics in another language, I think, is impressive. I guess before too long, we might have some AI Babel fish that could sort everything out in real time. But for now, I'm relying on your English ability, obviously. How has AI affected how you work on a day-to-day basis? Are there any tools that have made an impact on your daily workflow?
Junnan Li: (1:13:24) Dongxu maybe can speak to that. He's a heavy Copilot user.
Dongxu Li: (1:13:28) I think there are a lot of things happening behind the scenes that we may not be directly aware of. I'm quite sure that search results, for example, have a lot of AI going on behind them. I Google a lot of things every day and use Stack Overflow; that's the main thing I rely on for coding. So improved search from AI probably already makes me quite a bit more efficient. Recently, I also started giving Microsoft $8 every month for the Copilot subscription. Sometimes it gives me bugs, and it can take me days to figure them out, especially in machine learning experiments. But most of the time, for boilerplate code, it really helps a lot. Code comments, test cases, unit tests: it saves me a lot of time.
Dongxu Li: (1:14:44) Sometimes I'm not feeling it, but if I disable the plugin, I feel like my quality of life has worsened a lot, so I really have to keep paying for the subscription. Another thing I keep an eye on, because right now I'm working on this text-to-image generation project, is the forums and the Hugging Face community. There are a lot of very surprising and impressive image generation results every day. That really blows my mind, and I also learn a lot from there. It's not just entertaining; sometimes I'm also getting to know what's most interesting to people. That's also amazing.
Nathan Labenz: (1:15:51) Cool, thank you. I'm also a big Copilot fan, so I can totally relate to that, and a big Hugging Face fan. We're definitely sharing tools from 12 time zones away.
Dongxu Li: (1:16:03) Yeah, we actually have some in-house code generation models, and hopefully I don't need to pay that subscription soon.
Nathan Labenz: (1:16:11) I'm sure you're aware of how Replit built their Ghostwriter model, which I believe was a distillation from the Salesforce CodeGen model. If you haven't used Replit Ghostwriter, it's also very good, and they now have a chat mode too. Both Copilot and Replit are advancing pretty quickly, so it's hard to say who's ahead. It's amazing that Replit is not that big of a company, and they are, I would say, keeping pace with Copilot; they certainly got a big jump out of the gate by being able to build on the Salesforce CodeGen. So if you haven't checked that out already, I definitely recommend it as well. You're doing, I think, some really useful work with the BLIP set of models. As I said at the top, it's much more enduring than almost anything else we see in AI right now, so that's awesome. You have this understanding of how this work fits into a bigger picture of an ensemble strategy for AGI. But zooming out even a little further, thinking about society, and about the change we're all about to see over the next however many years as AI goes from a research agenda to an applied reality in life: what are your biggest hopes, and also your biggest fears, for what AI could mean for all of us?
Junnan Li: (1:17:42) That's a big question. I think my biggest hope is getting close to AGI. I'm surprised by the advancement every day; the progress is even beyond what I could have hoped for, because it's moving so fast. It has been growing really fast over the past couple of years. I don't really fear it in the near term because, realistically speaking, we are the ones who fundamentally build these models. The public may have a different opinion on AGI, but as a researcher, I would still consider these models quite artificial, and even stupid a lot of the time. I know they're still far from sentient or perfect. Fundamentally, it's just a big neural network, far from what we expect from humans. So I'm looking forward to that day coming in the future.
Dongxu Li: (1:19:06) Yeah, in the meantime, while we're innovating, we also pay attention to related issues with these foundation models. We have an ethics team who help us review our use cases, especially some of the demos we're making. We have seen a couple of examples of language models threatening users in some of the recent chatbots, and we want to make sure our models are ultimately responsible for what they produce, that we understand what they're doing, and that they aren't abused in ways that are fundamentally harmful to humans. We do emphasize that a lot, and our ethics team works very hard to ensure that our development and delivery are safe and production ready when we want to deploy them.
Junnan Li: (1:20:25) Just to add on top of that, explainable AI is also one of our major focuses. We have a library called OmniXAI that is all about how to interpret models' predictions, so you can make really safe, interpretable decisions based on those AI models.
Nathan Labenz: (1:20:47) Awesome. Well, Junnan Li, Dongxu Li, thank you very much for joining us on the Cognitive Revolution.
Junnan Li: (1:20:55) Thank you. It's a pleasure. Thanks a lot.
Nathan Labenz: (1:20:59) Omneky uses generative AI to enable you to launch hundreds of thousands of ad iterations that actually work, customized across all platforms with a click of a button. I believe in Omneky so much that I invested in it, and I recommend you use it too. Use CogRev to get a 10% discount.