Building Brave: Private Search, One AI Layer at a Time with Josep M. Pujol


In this episode of the Cognitive Revolution, join us as we dive into a compelling conversation with Josep M. Pujol, Chief of Search at Brave, about the complexities of developing a privacy-focused search engine. Explore how Brave maintains user data privacy while managing over 1 million searches per hour with an AI-powered system. Gain insights into the significance of human evaluation in AI and learn about the potential of the Brave Search API. Don't miss the shared Google Colab notebook link in the show notes!

Google Colab notebook: https://colab.research.google....

SPONSORS:
Oracle Cloud Infrastructure (OCI) is a single platform for your infrastructure, database, application development, and AI needs. OCI has four to eight times the bandwidth of other clouds, offers one consistent price, and nobody does data better than Oracle. If you want to do more and spend less, take a free test drive of OCI at https://oracle.com/cognitive

The Brave Search API can be used to assemble a data set to train your AI models and help with retrieval augmentation at the time of inference, all while remaining affordable with developer-first pricing. Integrating the Brave Search API into your workflow translates to more ethical data sourcing and more human-representative data sets. Try the Brave Search API for free for up to 2000 queries per month at https://bit.ly/BraveTCR

Squad gives you access to global engineering without the headache and at a fraction of the cost: head to https://choosesquad.com/ and mention “Turpentine” to skip the waitlist.

Omneky is an omnichannel creative generation platform that lets you launch hundreds of thousands of ad iterations that actually work, customized across all platforms, with a click of a button. Omneky combines generative AI and real-time advertising data. Mention "Cog Rev" for 10% off: https://www.omneky.com/

Recommended Podcast:
Byrne Hobart, the writer of The Diff, is revered in Silicon Valley. You can get an hour with him each week. See for yourself how his thinking can upgrade yours.
Spotify: https://open.spotify.com/show/...
Apple: https://podcasts.apple.com/us/...

CHAPTERS:
(00:00:00) Introduction
(00:05:46) Brave
(00:07:52) How Brave collects data
(00:10:45) The problem is not the data
(00:14:02) How Brave builds its index
(00:16:36) PageRank
(00:21:07) Sponsors: Oracle | Brave
(00:23:14) Brave Search
(00:26:26) Hardware improvements
(00:28:33) Language models
(00:34:17) Noise reduction in vertical search
(00:36:50) Fine-tuning models
(00:43:09) Sponsors: Squad | Omneky
(00:44:55) Ensemble approach
(00:58:00) The future of the web
(01:03:30) The business side of this
(01:09:22) Brave Search users
(01:14:21) Brave Search API vs. Google
(01:20:00) Brave Search API Pricing
(01:23:26) Brave Search API Use Cases


Full Transcript

Nathan Labenz: (0:00) Hello and welcome to the Cognitive Revolution, where we interview visionary researchers, entrepreneurs, and builders working on the frontier of artificial intelligence. Each week, we'll explore their revolutionary ideas and together we'll build a picture of how AI technology will transform work, life, and society in the coming years. I'm Nathan Labenz, joined by my cohost, Eric Torenberg. Hello, and welcome back to the Cognitive Revolution. Today, I'm excited to share my conversation with Josep M. Pujol, chief of search at Brave. As regular listeners will know, Brave has been a sponsor of the Cognitive Revolution for the last several months, and that makes this our first sponsored episode. I think it's important to mention that right at the top for the sake of transparency, though I did approach this episode in exactly the same way that I always do, with a deep dive into the company's products and an earnest search for insights that will be worth your precious time. And I can sincerely say that I believe this episode maintains the quality standard that we always strive for. As you'll hear, Josep was very open and shared a number of fascinating details about how Brave has built their search product and infrastructure over time, including the trade-offs that they accept to maintain the highest standard for user data privacy. Some of the most interesting details center around the challenges of iteratively improving an AI powered product while operating at real scale. Brave handles over 1 million searches per hour, or roughly 300 searches per second. They do not want to take any steps backward, and so their approach is generally to add to their system, making their ensemble of models and techniques more effective, but also more complicated, with each generation. I found this approach really fascinating to learn about, as it's a perspective not often found in younger startups. I also appreciated Josep's candid perspective on the challenges they face in evaluating search results. Brave has developed some scalable systems to assist in measuring quality, but Josep emphasized that there still is no fully automated substitute for hands-on human evaluation, good taste, and judgment. I think this too is an important lesson for all AI builders to keep in mind. Coming away from this conversation, I genuinely believe that the Brave Search API is an excellent product. I've spent more time testing it since the recording. And for the Waymark Small Business Profile Builder, which today relies primarily on scraping small business websites, I did find it to be notably better than both Bing and Google. As such, I have recommended it to the Waymark product team. My expectation is that, like Brave, we will augment rather than replace our existing system. We'll use the Brave Search API to help us return results faster, to cover URLs that our scraper fails on, and to surface valuable bonus context from discussion and review sites that we simply aren't tapping into today at all. Overall, I think it will be a really nice product win. And given the low cost and latency, I think it's something that more AI builders should consider using. If you'd like to check it out for yourself, I've created a simple Google Colab notebook, which makes search API calls to Brave, Google, and Bing. It's nothing fancy, but this is exactly what I've used to evaluate it for Waymark. We'll put the link to that notebook in the show notes.
Finally, before we get started, if your company is interested in sponsoring the Cognitive Revolution and potentially being featured in an episode, I invite you to reach out. We'll always be fully transparent about sponsorships, and we will only produce sponsored episodes with companies that I personally judge to be worthwhile on their merits. So my hope is that this will be a win-win for the show, for the sponsor, and most importantly, for you, the audience. As always, we hope you'll share the show with friends, and we invite your feedback. And with that, I hope you enjoy this inside look at Brave Search and the Brave Search API with Brave's chief of search, Josep M. Pujol.
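
(A quick practical aside on the notebook mentioned above: it simply makes web-search API calls to Brave, Google, and Bing and prints the results side by side. The sketch below shows roughly what a Brave Search API call looks like in Python. The endpoint, header, and response fields follow Brave's public documentation as best I understand it, and the API key is a placeholder, so treat the details as assumptions and check the current docs before relying on them.)

```python
# Minimal sketch of a Brave Search API call. Endpoint, header name, and the
# response structure are taken from Brave's public docs as I understand them;
# verify against the current documentation before use.
import requests

BRAVE_API_KEY = "YOUR_API_KEY"  # placeholder; free tier allows up to 2,000 queries/month

def brave_web_search(query: str, count: int = 10) -> dict:
    """Return the raw JSON response for a web search query."""
    resp = requests.get(
        "https://api.search.brave.com/res/v1/web/search",
        headers={
            "Accept": "application/json",
            "X-Subscription-Token": BRAVE_API_KEY,
        },
        params={"q": query, "count": count},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()

if __name__ == "__main__":
    results = brave_web_search("brave search api pricing")
    # Print the top results' titles and URLs.
    for item in results.get("web", {}).get("results", [])[:5]:
        print(item.get("title"), "-", item.get("url"))
```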

Nathan Labenz: (3:37) Josep M. Pujol, chief of search at Brave. Welcome to the Cognitive Revolution.

Josep M. Pujol: (3:42) Nice being here, and thank you for having me.

Nathan Labenz: (3:46) My pleasure. I'm excited to get deep into all the exciting developments and AI technologies that you have assembled into the Brave Search experience. I guess just as a quick prelude, regular listeners to the show will know that Brave is a sponsor of the Cognitive Revolution, so we thank you for your patronage in supporting us. I've obviously listened to the ad copy a few times, and I know the audience members have as well. Do you want to start off by just giving us a little bit of a general background on Brave, now my default mobile browser, by the way? Just tell us a little bit about the company at a super high level, what the motivations are, the ethos of the company, and then we'll dive after that more deeply into the technology.

Josep M. Pujol: (4:27) Yeah. Brave. Brave, first of all, is a great company founded by Brian Bondy and Brendan Eich. And they started with a browser, especially with a focus on privacy. And myself, I joined Brave, like, 3 and a half years ago out of a company that was called Cliqz, or rather its Tailcat spin-off, and we were basically developing a search engine from scratch. And when the previous company went belly up because of lack of funding, COVID, and whatnot, well, we had multiple potential places to go. And the one that was most appealing was Brave, because we share, like, the same values, privacy preserving, and also, like, the aim to be an alternative to big tech. It's not so much that what we try to achieve is to do something totally different, right? Because at the end of the day, it's a browser, it's a search engine, it's talk software, like video conferencing. But the whole point is that, well, you can actually have the full technology suite, right, without having to rely on big tech with all the compromises that that entails. So that's very appealing. It's very appealing to me personally and to the team. That's why we joined Brave. I think that's the best definition there is. Of course, there's, like, a bunch of features that are pretty unique, but to me, the biggest value proposition of Brave is: why use Google when you can use Brave?

Nathan Labenz: (5:56) I've been impressed actually by just how big the product line is. When I started to get into it more, the mobile browser experience is really good, and the farther I got into it, I was like, oh, there is a VPN that you can enable here, and there's a chatbot, a language model powered question answering experience, all that stuff. It really is a pretty broad suite, and there's an ad network as well. There's a lot going on. How many people at the company?

Josep M. Pujol: (6:20) It's not that

Nathan Labenz: (6:21) big of a company. I'm I'm impressed by how many different products.

Josep M. Pujol: (6:25) But people Brave as a whole and people working on search, engineering and quality raters, like the core of search is about 20.

Nathan Labenz: (6:35) 20. Wow.

Josep M. Pujol: (6:36) On search. Right? There's a of not just that search is around 20 people, right? There is marketing, communications, HR. There's a lot of shared resources. Right? But but anyway, like the search is like 20 people, Brave as a whole, 200. So it's a very small company, and yes, we do have a lot of reach. But anyway, that's the proof that you can I mean, you don't have to be big to provide value?

Nathan Labenz: (7:01) Yeah. It's impressive. We'll get into some of the head to head comparisons as we get a little bit deeper into this. So as head of search, we're going to spend most of our time doing a deep dive into the search technology, how it works. As I've said, our audience is very interested in technical details. Many people that are listening are building stuff, and they are looking for shortcuts to the lessons that you've learned the hard way. Maybe just for starters, you want to give a general overview of how search works? 1 thing that definitely stood out to me again from the ad copy is that the data is at least to some degree sourced based on pages that people are actually visiting. And that was, for starters, Yes. An interesting data

Josep M. Pujol: (7:53) Actually, that project, the use of browsing data, if you want, starts before Brave; it starts from the previous company. That's how we actually gathered enough data to start developing the search engine. But it's a very controversial topic, right, because there is a certain tendency to have this dichotomy: either data or privacy. This dichotomy is actually false. Right? It's as simple as that. I wrote extensively about it. People can check it out. But, basically, the whole point is not so much that data is a problem, it's how you collect that data. Right? I'm not talking about opt-in or policies or, like, a legal issue, right? Those aside. It's more like what kind of purpose the data will have and how the data is collected, which makes it dangerous or not. For instance, at Brave, people who want to contribute can opt in to something that's called the Web Discovery Project. Right? And in there, they will send us browsing data. Right? That sounds dangerous, right? But it's not that we collect your browsing history or your search history. No. Right? That would be dangerous. That would be, like, a privacy problem. What you will contribute is, like, individual data elements that are anonymous. Right? So someone has visited this site and has engaged so much with it. Right? Then for any of those data elements that people send us, we have absolutely no way to know whether they come from the same person or not. So we do not learn anything about you or about me. Right? What we learn is about someone anonymous. There's a lot of technical detail here. It's not just that we do not use user IDs, neither implicit nor explicit. It's also that the data is, like, basically sent using a mixing network to remove, like, network fingerprinting. The messages are homogenized so that they actually have to pass a quorum. So more than X people need to send not just the same amount of data but the same data in order for it to actually be received by us. So there's, like, a lot of technical machinery there. But at the end of the day, what is important is that the data that we receive is anonymous. It's not anonymized, right, or pseudonymous. No. Those are dangerous. Right? Those are the things that can later be de-anonymized. The data that we've got is just anonymous, and it serves a single purpose. Right? In this case, to know that this particular page is popular or not. Right? Or that this particular page was clicked after a particular query, to build query logs. So that's the kind of data that we collect. And without that, it would be actually very difficult to build a search engine. Google and Microsoft do the same thing, whether they admit it or not. Actually, there was, like, a nice release of some documents that happened yesterday or, like, a couple of days ago, where there are, like, some private APIs at Google. That's natural. This data is collected. And again, the problem is not the data. The same way that the problem is not the advertisement. Right? The problem is if, by doing advertisement, you actually put your privacy at risk because of the tracking that comes with it. Right? Knowing your browsing history or your search history is extremely problematic, and that's something that we do not want to know and we do not collect. One of the attack vectors that we consider is: if we actually had to release our data, either to a hacker or to a government agency of some sort, we could release it knowing that no particular person could be identified or profiled.
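
(To make the quorum idea concrete, here is a toy sketch, purely illustrative and not Brave's actual Web Discovery Project code, of accepting a reported record only once at least K identical submissions have been seen. The threshold and the plain in-memory counter are assumptions for illustration; the real system relies on a mixing network and cryptographic machinery so that individual reports stay unlinkable.)

```python
# Toy illustration of quorum-based acceptance: a record (e.g. a bare URL) is
# only admitted once it has been reported identically at least K times.
# Conceptual sketch only, not Brave's implementation; the real system uses
# a mixing network and threshold cryptography so reports are unlinkable.
import hashlib
from collections import Counter

K_QUORUM = 100  # hypothetical threshold: "more than N people saw this URL"

class QuorumCollector:
    def __init__(self, k: int = K_QUORUM):
        self.k = k
        self.counts = Counter()   # hash of record -> number of identical reports
        self.accepted = set()     # records that have passed the quorum

    def report(self, record: str) -> bool:
        """Register one anonymous report; return True once quorum is reached."""
        h = hashlib.sha256(record.encode()).hexdigest()
        self.counts[h] += 1
        if self.counts[h] >= self.k and h not in self.accepted:
            self.accepted.add(h)
            return True
        return False

# A popular URL eventually crosses the threshold and can be fetched; a private,
# single-user URL never accumulates enough identical reports to be accepted.
```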

Nathan Labenz: (11:42) Yeah. That's really interesting. So if you're in a country where, you have reason to be concerned about this sort of stuff, then this becomes a an extremely valuable option.

Josep M. Pujol: (11:52) Yeah. First, it's it's an opt in. Even though it's privacy preserving by design, still we do it on opt in because it's something that is, like, people have to be aware of. But they could enable it, we would not be able to it's something that's basically technically, it's called, like, record unlinkable unlinkable. Right? So any of the element that we receive, we have no way to know if those 2 elements come in the same person or not. That means that we only have individual data elements. Because even if the individual data elements, if they if we had a way to link them, we could actually create a profile. And based on that profile, it can always be de anonymized. Right? Because they will always be 1 of the data elements that somehow probabilistically or or optimistically can link to you. And then, basically, the whole session is compromised. The whole point is that we have no technical means to actually build a session so that, you know, like, there is, like, yes, somebody visited this page, but we couldn't know what happened afterwards. But, of course, that data is much less useful than profile data. Because profile data it's not that profile data is is evil. Let's put names on it just for the fun of it. Right? It's not that the Google engineer is evil, and say, I want to collect data. I'm not. But, basically, what you want is that once you collect data, you want to be reusable. You want the same data that you collect to be able to count how popular the page is, but also, like, how big how pages relate to each other or, like, how popular a particular restaurant is at what time. And you want the same dataset to be able to answer all those questions. And that's very powerful, but that's very problematic. Right? Because it can answer all these legitimate questions, but it can also answer illegitimate ones. Like, give me the search history of whoever was here at this particular time. So what we cannot do is answer these generic questions. Right? So for every question that we have, we collect 1 particular data element that only answers this particular question. So that's the difference. Like, the how is that is important. It is not so much about data or no data. It's like the purpose of the data.

Nathan Labenz: (14:02) So how does this compare or maybe it's maybe you combine. I guess you you have a crawler too. Right? So there's Yeah. It's not that the index is entirely built from the page visits, but more that it is

Josep M. Pujol: (14:14) No. No.

Nathan Labenz: (14:14) Of course. Informed by the page visits?

Josep M. Pujol: (14:16) Yeah. I'd like to give specifics on that, but we may learn about our URL through web discovery project. Right? And we know this URL is popular, right, because it has to pass a form. So more than n people have to see that URL in order for us to receive that URL. So that means that URL does not belong to you. It's not your login URL, right, because because, like, you know, a 100 people across the world that access that URL. So that when we receive that URL, we only receive the URL. Right? We don't receive the content because the content, there is, like, no guarantees that the content does not contain private information that belongs to you. So then we receive the URL, which is public information, and then we actually go and fetch it. So that allows us to basically, that our index to be, like, small enough, but still because it's small because we do not crawl blindly. We don't do, a blind crawl of everything. What we do is that we crawl what people basically visit. And from there, I'm from trusted sites, so we actually do some additional crawling. But most of the crawling we do is not crawling, it's called fetching. Right? Because we actually fetched the content. It's not that we proactively crawl the whole web, because the whole web is full of noise. Right? There is like at least 10 different sites that are clones of GitHub. As odd as it sounds, it is. It's very like sites that are clones of GitHub. Right? So if you crawl crawl blindly, you just like increase by 10 your size only by adding noise. Anyway, that's that's like 1 of the 1 of the things that that we use people for, to become more efficient and to reduce the noise, which is like the ultimate goal of any machine learning or AI. I was just going

Nathan Labenz: (16:08) to say the cleaning the dataset is often where the majority of the work goes.

Josep M. Pujol: (16:13) Absolutely. That's like the main realization. Noise reduction is the single most important thing that you have to do ever. Because noise does not only affect the quality of the output of your models, but it does affect the scalability of those models and also the cost of running those models. Noise needs to be avoided as as much as possible. So do you have

Nathan Labenz: (16:36) a sense for the relative size of the index compared to what either Google is doing or the whole web?

Josep M. Pujol: (16:44) It's a it's a tenth of what Google has. It's a tenth of what Google has, and we can achieve, like, 99.9% of the recall that Google has.

Nathan Labenz: (16:54) Yeah. Interesting. Yep. Power laws in action. So a little bit about the signals that you're looking at beyond the behavioral. Obviously, we've got from traditional search projects, we've got like keyword based indexing, page rank. I assume you're still getting some value from like that sort of link hierarchy. No?

Josep M. Pujol: (17:21) No. How polemic can it be?

Nathan Labenz: (17:24) As polemic as you want.

Josep M. Pujol: (17:25) The drink is a myth. You know that.

Nathan Labenz: (17:28) Yeah. Let me tell you what it means to think about what I know, and then you can correct me. I thought it worked well at the beginning and then it became subject to a lot of adversarial attacks with SEO of 1000000 kinds. And now I would still assume that it has some power, but that it has had to be updated a lot of times in a lot of ways to counteract all these SEO Okay. You will. Optimizations.

Josep M. Pujol: (17:54) Okay. I already was an adult at the university. I was just starting to do a PhD when Google came out. Right? I started using Google when it was not even google.com. It was a Stanford. They released 2 papers. 1 was the PageRank, which was like a proper scientific paper, and the other was like a technical report, like the anatomy of a large search engine, where it had the meat of why Google was so good. Google was not good because of Patreon. Patreon was a gimmick. It was a gimmick that worked, of course, had a contribution, but was good because it had a, it was an algorithm and it had a paper on it that was published. And it was something that journalists love to, to stress. Right? The algorithm that does that changes everything. That's what people want to hear. Right? People want to hear a story that there's, like, an invention that does that there's, like, a straight line. There is, something happened and we went from here to here. That's not what happened. In reality, things are, like, a a little bit more the the reason why Google was so good from the get go is because of anchor text. Larry and Sergey through Stanford, were the first ones who, using Stanford resources, were able to crawl the whole web crawl the whole web brute force. And that's why you will see, like, there are some pictures having, like, hard drives on Lego structures. Right? Because they had to buy so many hard drives to be able to store the whole web that they did not have, like, iron to put it. So they actually had to create, like, lego stands for it. Right? So the project was like that where they were able to, like, crawl the whole web. What does it mean to crawl the whole web? Then suddenly, unlike AltaVista, right, you had a much better representation of what the page is. Right? Because you have the backlinks. And the backlinks, what does it say? It basically has the anchor text. It's basically like everybody else's description of your site as opposed to what we had before, which was like your own description of what your site does. Remember, noise is everything. Noise reduction is everything. What is less noisy? The content of the web page or the summary that thousand different people do about your page? Right? It's it's much less noisy than anchor text. And that's why they were able to actually have much better recall with much better precision. And then, you know, like, what okay. Once you have the anchor text, you have the backlinks, and the backlinks have adjacency matrix. With adjacency matrix you can calculate the first eigenvector to which is like a stationary state of random walk, yada yada yada, very nice paper, out. But that's like cherry on top of the cake. The real myth was their ability to brute force the, something that wasn't possible couple of years before to actually brute force and crawl the whole web in a single file system. And then the realization, very smart, hey. We have a much cleaner representation of what the page is than the page content. Right? That's what makes Google.

Nathan Labenz: (21:06) That's fascinating. Hey. We'll continue our interview in a moment after a word from our sponsors.

Nathan Labenz: (21:12) How does that play into the Brave Search today? My sense is still that it still matters if a reputable site links to you versus an unreputable site. It seems like there's some sort of authority notion.

Josep M. Pujol: (21:25) There there is authority. Right? But it's not done on on page rank. Right? We can actually write it out of the authority of the pages based on popularity.

Nathan Labenz: (21:33) Just on traffic.

Josep M. Pujol: (21:34) Right? Just on just on traffic, that's that's that's enough. Then, yeah, you might do a 1 or 2 levels deep of, like, calculations. In any way, that's, again, that's a very small contribution to the end goal. What the takeaway message here is not so much the history. It's a little bit more like that. Very often, like, the real innovation happens on something that is not as fashionable as an algorithm. It's more like a it's on the methodology. Right? And actually, that happens on Brave Search too. Brave Search is as good as it can get. It's because for for the we rely very heavily on query logs. Right? With a a query log is, in a way, is like a is a representation of what the page is about. That is not done on the anchor text, but on the queries of of another person. In a way, like Brave Search started being a recommender system engine. And that allows us to, like, to keep building query logs, which are empirically seen. And as you grow, you have more query logs, and then you can start to do, like, semantic queries. You have you've never seen, like, what is the age of Lady Gaga. Right? But you have seen Lady Gaga's age. And then you see a lot of it. Then if you've seen the other 1 and you know that it is good for this 1, it's gonna be good for the for the other query because semantically it's very it's very so you keep expanding the query semantically to do a semantic search on queries, not on content, but on queries. Right? And then you start to be able to use, like, learning models to generate some type of queries, not to try to index the content as a whole, but to to create what would be queries that would be answered to this page. Right? And then you do this, and then you add it, and then you start you become a little better. Then you start to index content of the page. So you start to use engrams. Right? Which is more conventional approach and more expensive than the other ones. So like the cost to benefit ratio is worse, but now it makes sense, right? Because it's the next step that you have to take. So then you keep adding and what you end up is having something that actually works. Does it make a nice history? No, because the story is boring. It's like your story is not, you just keep building and keep iterating until there's a you keep like doing 1% increments until you have something that works. But that's like the only way that things typically work again. And that's why I put so much effort in the example of the page rank, because that's like the story. And you know, like every step you keep adding and every step there are things that happen that were not possible before. For instance, back before before Brave, the, like, we like, this the same concept that we started was not possible 5 years before we start And why not? Because they were like not because doing recommended system means that you have to have everything on a big cache, you know, a big memory map. So at a time where, like, the servers had 16 gigs of of RAM, max, to host something that has 2 terabytes, it takes a lot of machines. A lot of machines means a lot of latency is not feasible. However, at that time, at clicks, Amazon came. Amazon released a machine that had 1 terabyte of RAM. Suddenly, it was possible to create a cluster of 4 machines to actually put everything on memory, and everything would became very easy. Right? Where before 1 year before that, it would have required, like, 2,000 smaller servers unfeasible from the engineering perspective. 
And those kind of things happen all the time. Right? Another big hardware improvement that without the research wouldn't exist, NVMe hard drives. But without NVMe's hard drive, Brave Search wouldn't exist. It would not be cost effective. But NVMe's, I guess that you know what it is. Right? It's like hard drives that are super low latency on lookups. It's just faster. Is that the from

Nathan Labenz: (25:35) solid state disk as opposed to a spinning disk, or is it even a a more fine distinction than that?

Josep M. Pujol: (25:40) No. It's actually faster than solid state. There is spinning disk. Impossible to do what we do. Flash, still not fast enough, but at least it's like a another magnitude slower than RAM. Gotcha. Interesting. So suddenly, you have, what, 100 terabytes, a petabyte of NVMe's. Now you can actually do what you want to do. Before, it was even the basically, like, the way of doing it would have to be done differently. That would actually, like, increase cost by an order of magnitude, increase everything by an order of magnitude. Right? And it's not doable. So again, that's the the thing like I think that's how Brave Search is was built with Brave. Incremental, always taking advantage of what the world had to offer in a way. Right?

Nathan Labenz: (26:26) So that history of the hardware enablement is really interesting. Can we do the same, an equivalent history for the progression of language models or semantically enabling models? Because going back to like early days, you had literally just keywords and then n grams. Then now obviously we've got pretty amazing language models, I would imagine that like the best language models are maybe an overkill for some of the use cases, if only because of cost and latency. So what's that history like and where are you guys today in terms of the models that you use for semantic purposes?

Josep M. Pujol: (27:02) Yeah. Remember, like, the example I told you before about Lady Gaga, age versus whole world is Lady Gaga? That's a semantic metric. Right? We started to use embeddings, semantic embeddings based on on a model that was called the star space. I don't know if it's familiar sounds familiar.

Nathan Labenz: (27:20) I didn't know it specifically.

Josep M. Pujol: (27:22) 2019. Right? So a long time ago. And and basically, this relationship of being able to go from Lady Gaga's age to how old is Lady Gaga, It was done via embeddings, 200 dimensions, 2 bytes, vector search. But back at that time, it was not called vector search. Was called like nearest neighbors approximations. And that's what actually what we built. So we actually took advantage of these recent technologies. Again, there was like no only semantic embeddings. Actually, there was also like based on queries. Remember, like, we were basically very heavily based on queries. Just to give you a number, a nice number, our query similarity system host like, has more than 9,000,000,000 unique queries. Not so, like, the quiddies, Hotmail is only 1, so 9,000,000,000. So whenever you actually get a new quiddies, we're able to return a set of like similar quiddies that we have seen in the past out of this 9,000,000,000 in 20 milliseconds. Right? And that is done both using embeddings, but also using angrioms based on queries. Right? Because we need both since you cannot be semantic 100. Right? And that's that was true back then and it's true now.

Nathan Labenz: (28:40) Why not? What's the limitation there? Or where does the purely semantic approach fall short?

Josep M. Pujol: (28:44) Because purely semantic is is purely semantic. Sometimes the queries that people do are not doesn't have a semantic meaning. It's like a particular model or a particular product, period. And that's what they want or like an error on a stack trace or like a name of someone. My name is Josep Pujol. Right? It's not Josep Pujol. Semantically, it's very similar. Spanish and Catalan will give you a very close, very close distance in any semantic model, but it's not the same. It's a different person. You need both. The takeaway message, if there's any takeaway message that I could give people, that there's like no single bullet, that there is no single anything that solves a problem that is worth solving. Right? It's always a combination of multiple techniques and models, and then you do an ensemble on top of it. So can you do only embeddings, semantic embeddings? No. You need to have both. So we do have everything on the ground. So on on page content. Right? So we have page content. We do have embeddings based on transformers, DERT like transformers. Yes, we have those. Right? But at the same time, those are like what we call dense embeddings. Right? Because the dimensionality that we have is 3 84 dimensions with 2 bytes per dimension. And each page have about 1 to 5 embeddings, right, of those using different part of the content, etcetera. But that's like this on the semantic matching, but we still have what we call a sparse embeddings, which is 16,000 dimensions, 1 bit per dimension, which are for little matching. Right? And when we search, we use both. Right? And and it's good to have both approaches because you want to sometimes when you search for, I don't know, Trump Tower, you just want Trump Tower. Right? You don't want Trump something else. A lot of noise can come out of the something else. Right? So you you want to combine both approaches and that's what we have. So this mentality, I guess it is shared by many people, including any person who comes to work for us out of the university, that, oh yeah, just like to do embeddings out of, I don't know, OpenAI. Put them on a vector database and do a search and that's it. That doesn't work. It it only works for very narrow use cases or, like, for vertical searches when there is like 1 only 1 domain. But it will not work on a general purpose because it will bring up noise. Statistically, you can think of the following way. The embedding is not free of errors. Right? So let's say that once every 10,000, the embedding has an issue. Right? Doesn't provide the right embedding for some for whatever reason. Right? If you are working on the database of 100,000 arguments, that will not affect you. You will see that it's working fantastically well. Right? But if you have 20,000,000,000, the chances that those errors collate to your initial recall set are very large. Right? And then you're blind. Right? Because you have a recall set of 2,000 documents and 10% of them are noise. What else do I need? I need other things. I need other techniques. I need other models to be able to take out the noise. Right? That's always there. So there's no single thing that works. It's always like a combination.

Nathan Labenz: (32:15) Yeah. That's probably a really good takeaway for a lot of listeners who are building all manner of RAG apps in all kinds of different contexts. I've got a project that I'm involved with that's trying to take old plumbing catalog parts manuals and make that searchable. This is for a company that has, like, a proprietary database of all these old manuals. They literally have them, like, in paper form in a giant bookshelf, and they've embarked on this.

Josep M. Pujol: (32:46) But this actually is I I I can tell you another story that we that might actually work because in a way, like the the level of noise that you have on vertical search is artificially reduced. I don't know if it makes sense with that. It makes sense. Yeah. Because because the problem is the the the problem is always the noise. Right? The uncertainty. So when you are doing a vertical something, vertical search, or you're working a particular vertical, noise is artificially reduced. Right? Because in a particular domain, a particular word has only 1 particular meaning, or particular token has only 1 particular meaning. And so when you work only on a vertical, it's very easy to work on. And, actually, we 1 of the things that when we first started to tackle problem search, the initial thing to to try was like, hey, my search can be split into verticals. Right? So if we are like 20 people, why don't we just do vertical? You do like weather search. You I do city search. You do famous people search, car search. Like, you come up with, like, 100 different verticals. So what ended up happening is that all the 100 verticals were perfectly. Right? But there was, like, no way to put them together in a in a way that made sense. Right? Because what we actually did intentionally is, like, we pushed the problem, the real problem. We pushed it under the rack. Right? On the module that the mixing module that will come in the future. On the mixing module is a problem. Right? Because on the movie vertical, Ted has only 1 meaning. Right? The movie Ted. Could be Ted 1, Ted 2, Ted 3. Right? But now forget about that you are on the movie vertical. Think of the whole web. But that what does it mean? The movie, name of someone, the conference, a part, you see. Right? Which 1 it is? That's like the that's why search is complicated because it's it's general purpose. And I I have the same issue. Right? The narrower the domain, the easier it is because there is less noise.

Nathan Labenz: (34:48) Does that assume though that you are working with enough data? Going back to this plumbing example, and I think trying to generalize from that to what I think a lot of people who listen to this show are working on, I think in a lot of scenarios, there's like a proprietary dataset that is like too big for people to like page through. Right? So they do need some search to access it effectively. But then at the same time, it's not so much data that they could train the model from scratch on it. Maybe they could fine tune. And so I feel like a lot of people find themselves in that in between zone where they're like, I need to use some sort of pre trained open source model at least as a base. And those models, like in this plumbing scenario, everything is in a very narrow section of the, like, broad general purpose embeddings models. And then things like parts numbers, as you said, are are not even really semantic at all in a lot of cases. So we are finding that we do need a hybrid search if only to handle these things like

Josep M. Pujol: (35:53) That that

Nathan Labenz: (35:53) parts models.

Josep M. Pujol: (35:54) That that's correct. And actually, that's why in a way, it links back to the point. Right? And then you're like on this particular use case, and the first temptation of if you want to be like pure AI, let's say, is now I I need to find you. Right? Really? Perhaps you just can do the design model of the shelf and then have another system that does exact matching and then you combine it on. So that is the the right approach. Perhaps you need to find to find filling. Right? It depend on on the use case. But at the but at the end of the day, what is important on engineering is is not so much the tool. The important thing is the problem. And you use use the tool that is most convenient and you combine different tools. That's the most important thing, right, not fall in love on a particular framework or a particular methodology. Right? What you have to fall in love is with the problem.

Nathan Labenz: (36:44) Yeah. Totally. With that in mind, how do you guys approach upgrading technology? I think this is another thing that I haven't I'm like only starting to get to the point where I'm encountering this in live apps. I have a company called Waymark, does video generation. And there we've been through like several rounds of upgrade of generative models. I haven't gotten to this point yet on a search or a RAG type product where, oh, look. There's a new embedding or there's a new vector database that could support that embedding. How I'm sure you've been through several rounds of this stuff over time. How do you approach that challenge and try to tame it?

Josep M. Pujol: (37:23) It depends. Right? For instance, on the summarizer. Right? We have this AI summarizer that people can check where we try to provide an answer to a query. Figures like about like for 25, 30% of the equities. Those basically that's of course, it is like we do rack against ourselves as anybody would. We do put a lot of emphasis, we can discuss later on that, on the data selection, which which snippets we pick and we run narrower QA models on the snippets to like to fit. But ultimately, we do have a prompt that we send to a large language model. Right? The large language model has changed many times. Right? We had some version v 1 that was released 1 year ago. I think that we started with plan t 5, and then we upgraded to another model after a while, then we did some fine tuning version of it. So we did 3 changes in the course of like a few months. And in some ITV2 was even more extreme because when we released, the final model was Mixtral 7 b, 7 8000000000. LLM 3 came out 2 weeks after. We tested. We saw that it's actually we had tests. Tests were like compatible quality using Mixtral resources. We switched it. It took us a week to switch. Right? So on those components, it's very easy to switch because we have a test set and you basically change 1 component from another 1. But it doesn't mean that it's generic because, for instance, like embeddings. How we change the embeddings? No. Not at all. The embeddings is is like forever. We have embeddings that the Starspace embeddings were trained in 2019, and they are still in place. And they will never be replaced. Wow. What we could what we can do is to actually have more embeddings added to it. Right? And we can have the old system and the new system using new embeddings, and yet another system using new embeddings, and then do an example of all. But something that's fundamental of embedding, you cannot change it. First of all, because embedding depends on the data and the data is massive. Right? To to give a concrete number. For example, the content embedding, the dense vectors, which are based on on transformers. Take us we can do about, like, 500 pages per second. It means that we pass a page and we get the embedding in 2 milliseconds. But then if you if you work with embeddings or with reformers, you know that actually that's pretty impressive. Right?

Nathan Labenz: (39:57) Yeah. That's fast.

Josep M. Pujol: (39:58) That's fast. So in 2 milliseconds, like, fully optimized single GPU 8 10 g, I guess, that takes 1 year and a half to do it for the whole dataset. So even being extremely fast is is linked to the data, so you cannot really do it. Right? You have you do it once, and then you build narrower models that are based on those embeddings, on the output of those embeddings, and you build heuristics that combine, like, your part numbers of your plumbing things. You have narrower models plus heuristics, is you cannot change that. Right? What you end up doing is you create, like, versions of it. So you you have embeddings v 1, embeddings v 2, embeddings v 2, and you keep adding. And adding is never a problem. Right? Because at the end of the day, adding more embeddings is adding more feature, adding more distances for the final GBDT that will do the final ranking. So anything that is linked to the data, at least at the scale of a search engine, you don't touch it. Anything that is stateless, like the LLM that does a summary or that does a paraphrasing, then change it then 1 day for the other.

Nathan Labenz: (41:07) Hey. We'll continue our interview in a moment after a word from our sponsors.

Nathan Labenz: (41:15) So I got a couple couple different follow ups I wanna chase down here. When you say, like, it can never change, do you mean that, like, outright literally? It seems like if nothing else, as the world changes, you would need, like, a new embedding model to understand new developments, Yes.

Josep M. Pujol: (41:30) But let's say that you never change it doesn't mean that you don't add. You can add new things. What would then what I'm trying to say is that why do you need to replace 1 embedding for another 1? You actually want to have both, okay, because most likely they are uncorrelated. Right? And you want to actually build an ensemble of it. Why do you want to end up having like a query document distance? Why do you want to have a distance when you can have 2 And then use both of them to actually do the final blanking. And that's features is basically is like an additive you don't replace, you just add. Right? And then the cost of adding is just a matter of resources, right? Because as I told you before, like it might take us like a month with like many GPUs and actually paying a lot of money to run 1 embedding second dataset. That's not something that we might actually want to do it every 3 months, right? Perhaps it will take another year until we have yet another embedding type into the mix. Different products have different prospects. That's very important too. In your case or like the case of the audience, if they have 1000 documents or like 10,000 or a half or 1000000 documents, that's different constraints require different solutions.

Nathan Labenz: (42:46) Yeah. Definitely. In the plumbing case, I can rerun the whole thing without too much trouble.

Josep M. Pujol: (42:51) But but in any case, for, let me reiterate. It's better to have to try 10 embeddings, different models, and use an ensemble than to try to find only 1 embedding that that does it perfectly.

Nathan Labenz: (43:08) And your ensemble approach is heuristic. Right? It's not or is that also a learned approach?

Josep M. Pujol: (43:15) I'm not gonna learn that. Right? Because at the end the day, you have it's like features. At the end of the day

Nathan Labenz: (43:21) So it's essentially a mixture of experts question Of

Josep M. Pujol: (43:23) course. A mixture of experts is like another way. It's like another like it's a really bad mixture of the expert is because an example is already like it's already used. But the mixture of experts is just an ensemble of 8 different models where 2 are selected at a time. It's exactly same. The the same way on your case. Which is the best embedding that work for my use case? I would use, like, 5 of them. And if they are uncorrelated, which is, like, the the critical factor here, because if they are correlated, they are useless. I mean, they just they don't add anything. But if they are uncorrelated and if some of them are very semantic, others are more literal, others are like multilingual, others are English, all combined probably will give you a better ranking than using other 1.

Nathan Labenz: (44:08) So if I'm understanding the architecture you have today accurately, you have a number of different embeddings that have been built up over time plus a learned combination of those that either does it like choose which ones to do or does it weight them?

Josep M. Pujol: (44:26) No. Weight weights them. Because weight weights

Nathan Labenz: (44:28) Each 1 gets a scalar.

Josep M. Pujol: (44:30) We end up, like, having it depends on which search engine because we have multiple search engines depending on what stage you are in, but you cannot not have features, 200 features, 300, and then you learn boosted Gradient Edition 3, and off you go.

Nathan Labenz: (44:46) Cool. Okay. That's definitely interesting and good food for thought for mine and probably a lot of other projects too. On the going back to the topic of upgrading the language model, you said that's like easy in the sense that you can easily swap 1 in for another. I imagine this is true for both search and the summarizer. It seems much harder to evaluate and actually determine which is the better result. So we've got obviously things like LMSYS doing head to head, Elo style, Scale AI just came out with a new private thing that they're offering. How do you guys determine which if you're gonna make a change to search, how do you determine if it's an improvement or not? And similarly for the summarizer, how do you go about assessing which is better?

Josep M. Pujol: (45:34) We do have a team of human assessors that their work is basically assessed manually. It's not the majority of the search engine, but it's a significant chunk of it. And their job is basically like to have a homogeneous methodology and homogeneous criteria to assess. But what we what 1 thing that we don't want to have is intelligence assessments. Right? So this team work is basically assess of partially either summary outputs or rankings or accuracy. So that's correct. The our, in a way, our ground truth. Right? Of course, like the productivity is what it is. I mean, they cannot read a lot of assessments, but they are high very high quality. And then on the other hand, we have query logs, which are our query logs or what we do collect from our discovery project. So it allows us to compare what Bing or what Google or what Yandex does. Right? So we use comparisons with our competitors. Right? In the sense of NDCG or like just, like, comparing cognitive intersection of the result set. Right? So if for certain type of queries, we do not match what Google does, then what we do, like, we check those results, and then we manually assess those results, a sample of it, right, to see what happened. Are we screwing up or we just have a different criteria for this particular set of queries? So this is like the process. It's a combination of a lot of data from ourselves and competitors, not LO type, but in a way watching what the others do. That's why also Jenkins comes number each other in a way because we we copy each other. It's not that we copy literally, but we end up using

Nathan Labenz: (47:22) You're cross referencing at least.

Josep M. Pujol: (47:23) Yeah. Oh, yes. We are cross referencing. Right? And if if we do not return the same as, for example, if we have seen a query empirical in Google on the web discovery project. We also compare with ours. If we see like the intersection, everything is in flag. It's odd that this particular query has to know. Right? And then those queries are analyzed, and it's added to the test set. And you keep building over time with a lot of patients and with a lot of effort. You end up, like, building datasets that have, like, thousands, tens of thousands of, like, high quality human assessment and hundreds of thousands, millions of automatic assessments. Right? And those is what you use as as ground truth, like, to evaluate, to see if we are going on the right direction or not. And last but not least, what we also have is, like, some of these queries, we kinda keep keep them aside from the engineering team in a way. Only the quality team knows them. They have to avoid overfitting because it's like natural thing. Even the fitting is actually a very big problem on on machine learning because you can actually have overfitting even without knowing just by, like, by fine tuning the the meta parameters of what you have. Yeah. The cross validation set and the follow-up. We do have also reset and at the end of the day is best effort. We are not that many people and we try to do this like this. Clearly in the sense. If the results are not worse and performance is not worse, we just move ahead. And if the performance is worse, then we do we have to do a giant call. Whether, like, we actually go for it and optimize later or we we just cancel the whole release.

Nathan Labenz: (49:06) How does that differ if at all for the summarizer type experience that comes out of the and just to set up the context there, the the summarizer is and this is something that's both you can experience it in the UI of search, but then also we're gonna shift in a second and talk a little bit about the search API, and it's available via both. So this is basically, as you said, your RAG style experience where, okay, there's all these results, but now can we deliver a a synthesis of these results that kinda gets delivered in paragraph explanatory form. The another an additional challenge I would imagine there is that there's probably a lot of, like, taste elements to what would even be considered good. Do you have a sort of principle I mean, it's interesting. We've seen, like, the OpenAI models spec recently where they, for the first time, have really started to articulate this is what we want the model to do. So now we can at least know if it's, like, doing what it's supposed to be doing or if this is by design or not. But I imagine you have to make a bunch of similar choices. Right? Do we want it to be short? Do we want it to be long? That probably varies on context. So do you have a constitution for the summarizer?

Josep M. Pujol: (50:15) Not yes. But it's a constitution of 1. It's like the person who actually does this and he tries to do the best. Of course, there's some government that are like of good taste. Right? We should not be like biased. We should be try to be as concise as possible. There are, like, certain things that are, like, common sense. But other than that, we do not have time enough to define guardrails or guidelines. We believe that it's more also an iterative process. The fact that we do certain fine tuning from time to time, what helps us that there are certain results that we think that they are very good. The results will be used to like to fine tune the model and the model should automatically learn what is the standard that we want. Right? And who decides what is good or not? The quality assessment team. Again, we don't have any editorial criteria. That goes without saying, right? We don't play politics. We do not do any censorship or anything like this. But at the end of the day, it's a matter of personal taste of whoever does the rating. But that's it goes for us, goes for anyone else. It goes for you, right? Do you prefer results from Perplexity or from us or from Google? The information is actually the same. It's just a style. Right?

Nathan Labenz: (51:28) Yeah. I as a power user of all of these things, I have to say I I do find myself going to different services depending on the query. It seems like I've noticed and I would say I would I think Brave is in the middle, I would say, right now between like, perplexity gives me a lot of times the shortest, most to the point answer. You.com research mode is on the other extreme where it will give me, like, the 2 page thousand word. Obviously, that is not suitable for every query. And I would I know don't how you would have qualitatively described where Brave sits in relation to those. But for me, it seems like it would be in the certainly in the middle of those 2.

Josep M. Pujol: (52:07) But it's on the it's on the middle because because Brave has an extra complication compared to proficiency, and and we can discuss about that later. But it's a little bit like it's all it might sound very, like, unthoughtful, what I'm gonna say, but it's a little bit by chance. LLM 3 is much more verbose than Mixtral. It comes out of being more verbose, period. When we switched to LLM 3, we became more verbose than the Mixtral. We, of course, we can turn it down a little bit x y z, but it's certain inherent things that come out of the tools that you use. And in a way, we do not have the time or the will or know what is right in order for us to actually act upon. And then on those cases, the best thing to do is, like, try to do nothing. Right? But it's not stupid, but it's true. Right? If you do not know what is the right thing to do, you just I believe it as a model. Believe it as simple as that. Once I do have more data or I have more resources to do proper assessments and proper judgment, I will try to influence the model. But influencing just because someone on Twitter says something, which is very tempting. So that's 1 thing. And then on the it's very interesting what you said about that you use a different system for different queries because that's not what normal people do. Search is 1 stop shop. Right? And Brave Search is is search. And when people search, they don't want to discover. People search because they have a problem. They want to go to Hotmail or they want to go to port or they want to know, like, a recipe of mojito because they are doing the mojito right now. They don't want to discover the 2 pages history of mojitos or, like, a nice AI summary for mojito. They just want to know the ingredients and how to put it together quickly because you are in the middle of something. Right? And that's important to realize that whenever we think of search, we always like think of fancy questions or like complex questions that are thoughtful. But most of the queries are not like this. And because we are, like, the default search engine on many countries on the browser, we need to cater to the common denominator in a way. And that's actually something that is tricky. Right? Because but we could be much fancier if we had no traffic. Right?

Nathan Labenz: (54:24) That mojito example reminds me of something that's almost a meme at this point: all the recipe sites have, like, a long story at the top of the page now. I'm sure you're familiar with this phenomenon. Right? Why did that happen? Because I would agree: if I'm looking for a mojito recipe, I'm not really looking to read your diary entry about the time that you made these mojitos and how much fun you had. And everybody seems to agree. So where did that come from?

Josep M. Pujol: (54:56) SEO. By random walk, something actually worked better. And once somebody learns that, others copy it. It's mostly SEO.

Nathan Labenz: (55:05) But why did it work for SEO? Is that sort of a misgeneralization on Google's part? There must be something where, like, more content of this sort is useful, and then maybe they overdid it and started to return recipes based on that criterion, even though it's worse?

Josep M. Pujol: (55:22) How to say this? Never underestimate how much things can actually happen by accident. If you make a page that has very good information, but you do it in a format that the parser of Google doesn't like, your content is not going to be indexed the same way as if you wrote it as free text, where the parser for sure gets it. Right? So the writers of content that wants to be SEO-optimized end up taking an ensemble approach: they try many different things to see what works better against the algorithm. And the algorithm might not have a strong rationale behind it. It could be that a certain type of structure is penalized by accident, just by accident, because that particular thing is not properly parsed, or because everything between this tag and that tag made it onto the list of things that gets an extra boost. And according to the leak, Google had 14,000 features per page. I find that a little bit too much, but we have hundreds, so I wouldn't be surprised if they have thousands.

Nathan Labenz: (56:36) So how do you think this will evolve as we go into a future of language-model-generated content everywhere? There's a lot of talk right now; the hyperbolic take would be that the web is over, it's all going to be overrun by LLM spam. And also, people don't necessarily want to be included in training data, so walls are going up in some places. How do you expect the web to change in general, and specifically due to those factors?

Josep M. Pujol: (57:09) I mean, the web has been changing for a long time. Right? The arrival of advertising was a change. Ad blocking was also a change. Everything like that is a change of incentives on the web. Right? And with AI, not only the economics, which are already under pressure, but even the reputational aspect of the web might actually go away. Because if nobody visits your site... You have your podcast. I guess you make money out of it, but there's also some reputational factor to it. Right? If you lose both money and reputation, why should I make content at all? It's worrisome, to say the least. I don't know what will happen, but certainly the web will change. And I'm a little pessimistic about it, because I like the web. I'm old enough that I prefer YouTube over Netflix. YouTube is full of crap, but it's also full of gems. Right? And I like YouTube more than Netflix. I liked Myspace more than Facebook, because I like this variety. And AI is not the one that's going to kill it, but it is yet another turn towards a version of the web that is not what I had when I was younger. That doesn't mean we are worse off collectively. Not necessarily. But AI answers are, in a way, very convenient. And that's why Google is forced to implement them. In a way, that's why we are forced to implement them. Remember that, for us, it's not cheap to run the summarizer. Right? We serve about 27 million queries per day, 30 million. Having 10 to 30% of that run through an LLM, that's actually a lot of requests. Right? So we are not exactly forced, but there is pressure, because people find it very convenient. But anything that is convenient has a price. You sacrifice something for convenience. That's a given. On the other extreme, I don't want to go to the river to fetch water. Right? I want water to come out of my tap. Where exactly the threshold is, I do not know. But anyway, convenience is king, and answers with AI are here to stay. They are actually very useful and very powerful and very convenient too, but of course there will be a price. Not only for the ecosystem, but also for the variety of answers that we get. Because some companies say that answers are knowledge. No. Answers are not knowledge. Questions are knowledge. Answers are just answers. Right? So if you have an Oracle-type thing where, for anything you ask, it will give you one answer, that basically means that anything else effectively does not exist. And that's it.

Nathan Labenz: (1:00:20) Yeah. It does seem like it's going to be a tough time for a lot of the content creators. I'll continue to do the podcast just so I get to talk to leading thinkers like yourself, regardless of whether anyone listens. But it does seem like there are a lot of people out there who are

Josep M. Pujol: (1:00:36) And I don't know; OpenAI, because they have big pockets, try to pay, but I do not know if the economics actually work, because of the current economics of advertising. Yes, Google makes a lot of money. Right? But if you have to split that money equally among all the parties, everybody gets something that is not enough to live on. By searching for endless convenience, we commoditized ourselves out of it. I don't know what will happen, honestly.

Nathan Labenz: (1:01:08) How does the business side of this work for Brave? I'm just doing a little back-of-the-envelope math, and I'm imagining, okay, jeez, you're doing 30 million queries a day, and I don't know how many tokens you're putting in per query, but we can get into that more specifically, like how much data is returned. But it's not an insignificant amount. It's got to be thousands of tokens per query.

Josep M. Pujol: (1:01:32) It's about 4,000 tokens per query usually. We usually max out our context window, which is about 4,000. But it's also all self-hosted. We do not use any third party. Not only because of the independence part, which matters because we value being independent, but also because of cost, right? We would not be able to use OpenAI or anything like that cost-wise. So everything we do is self-hosted, which means we have to pay for machines. I think nobody will complain if I say that, for instance, right now we have 28 H100s on Amazon. That's our infrastructure for the main LLM, plus a lot of GPUs for the smaller models, the Q&A and classifiers. So it's a big infrastructure, but it's not as massive as many people think.

Nathan Labenz: (1:02:26) So I guess for now, it's not so expensive that it forces a business model change. This is something Google's obviously going to be facing in a big way too. Right? Even if you're doing pay-per-click advertising within search, it's very unclear how that fits in with these AI answers. And then more broadly, how do you guys think about, like, a subscription model in the future for something like this? Do you have any idea of how this evolves business-wise?

Josep M. Pujol: (1:02:57) It depends. If you enter into the chatbot domain, where you have a lot of back and forth being transferred, then the economics might work differently. Right? But in the context of search, where there may be query reformulation but there is only one round trip, the cost, even in the worst-case scenario, comes down to the cost of infrastructure. Right? People tend to overestimate the cost of the new things relative to the old ones. Yes, running an LLM is very expensive. No question.

Nathan Labenz: (1:03:27) But it's not like a search engine is super cheap to run either.

Josep M. Pujol: (1:03:30) That's my point. Right? If Google doubled its infrastructure, I'm sure they could run any LLM they want without any issue. And they certainly have the margins to double infrastructure. So I think the problem is less about cost and more about whether people actually want to have that. It's very unclear that the majority of people want it. With search, again, let me go back to the point I made before: once you have traffic, you serve all use cases. Right? When people search, they have a problem. They don't want to be stopped, because they are in the middle of something. Right? And people do not usually remember good cases. They only remember bad ones. Right? You don't go back to a particular shop because they treated you wrong once. It's not that you go to the other one because on average they treat you well; you don't reward the place you like, you just punish the one that you don't. Right? So if you start to do certain experiments and you have a lot of users, you risk losing a lot of users because they don't like what you're offering, because it's in the way. You are annoying them. And that's why, again, without traffic you can do many things. Or with very small traffic, or with traffic from a very particular segment of the population, you can do many things. But once you have the general public and your service is like running water, you open the tap and it just has to work, you don't have the luxury of innovation there. And Google is actually suffering. I pity them, honestly, in the sense that it's a very big thing they have to do. We have a much easier position than Google from which to react to LLMs. And You.com or Perplexity or Kagi have a much better position than where we are, right, because they have even less traffic. A blank slate for the win. It's always the case, right?

Nathan Labenz: (1:05:39) What does your user base look like? I know that obviously we're speaking in English, but you're from Spain, and I gather the company's headquartered in Europe. Correct?

Josep M. Pujol: (1:05:50) No. Brave is headquartered in the US. Search is 100% Europe: there are only people in Europe and some people in Africa, all in the same time zone. Search is Europe-based, not necessarily EU. But Brave is an American company, and it's heavily remote.

Nathan Labenz: (1:06:13) Maybe a more general-purpose question would be, again, where are Brave users? And I wonder if you see differences cross-culturally in how people are responding to the AI moment. Like, how would you characterize how consumers are reacting to AI experiences in Europe as opposed to in the United States?

Josep M. Pujol: (1:06:36) No, we have not done this analysis. Brave Search users are actually a good sample of the global population. About 30% of the people are in the US. The next big market is in Europe. So we are pretty well spread across all continents. There are no hot spots, no odd countries or odd behaviors. It's large enough to be statistically well distributed.

Nathan Labenz: (1:07:03) My intuition would be that on mobile I'd maybe be more interested in the summary, just because fewer of the sites are mobile-optimized and it's just more hassle to browse around.

Josep M. Pujol: (1:07:14) Could be. But at the same time, when you're on desktop, you would also expect more work-related, difficult queries. Mobile queries, as I said, are usually very instrumental. They are very much things like the phone number of a restaurant. Or fun facts, like the age of some celebrity; it could be that they are playing trivia of some sort with friends.

Nathan Labenz: (1:07:38) Let's maybe turn to the search API. This is an area of the business where there is a very natural business model, and it has been the main subject of the ad that we've been running. I went ahead and created a little Google Colab notebook that allows a user to easily show up and try the Brave API as compared to the Google custom search API and also the Bing search API. I have a few observations that I would be happy to share, but maybe you should start by just telling us: how do you see the product as being differentiated from the other options that probably a lot of people are more familiar with?

Josep M. Pujol: (1:08:20) The reason we released the search API was to provide an alternative to Big Tech. As far as we know, there are only two search engine APIs out there, Google's and Microsoft's. And Google's is not public. I think you used it, the custom search from Google; that's not meant to actually be used in production. I think they have a cap of 10,000 queries. So if you want to use the Google API for something real, you actually have to go talk to them, and they might give you access or not. It's their own decision. Bing, on the other hand, is a little better in that respect because they have a public API. Of course you pay, but it's public. You go there, you accept the terms of service, which are a bit draconian, but that's a different topic. You pay and then you use it. That's it. At Brave Search, what we try to do is the same as the Bing API: it's a public API. You pay, you can use it, no questions asked. We try to be less draconian on the terms of service. We do not require attribution, and we allow you to mix the results the way you want, things that Bing would not allow. But other than that, before, you only had Microsoft; now you have Brave and Microsoft. That's not a small thing. And we did that the very moment we actually could launch the API, because we were not 100% independent all the way. When we launched Brave Search, we were able to answer about 87% of the queries; for the remaining 13%, we actually used Bing. And we whittled that down over the months. We went from 87, three months later it was 90, 92, 93, 94, and a little more than a year ago we went from 94 through 96, 97, all the way to being fully independent from Bing. At that moment, we could offer the API, because before it was not possible under the terms of service of Bing.
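
For readers who want to try this, here is a minimal sketch of a single web search call against the Brave Search API, in the spirit of the Colab notebook linked in the show notes. The endpoint path, the X-Subscription-Token header, and the response fields reflect my reading of Brave's public API docs and should be treated as assumptions to verify; the BRAVE_API_KEY environment variable is just a placeholder for your own key.

```python
import os

import requests

# Minimal sketch of a Brave Search API call (endpoint, header, and response
# field names are assumptions based on the public docs; verify before use).
BRAVE_ENDPOINT = "https://api.search.brave.com/res/v1/web/search"


def brave_web_search(query: str, count: int = 10) -> dict:
    """Run one web search and return the parsed JSON response."""
    headers = {
        "Accept": "application/json",
        "X-Subscription-Token": os.environ["BRAVE_API_KEY"],  # placeholder key variable
    }
    params = {"q": query, "count": count}
    resp = requests.get(BRAVE_ENDPOINT, headers=headers, params=params, timeout=10)
    resp.raise_for_status()
    return resp.json()


if __name__ == "__main__":
    data = brave_web_search("mojito recipe")
    # Web results are assumed to live under data["web"]["results"].
    for result in data.get("web", {}).get("results", []):
        print(result.get("title"), "->", result.get("url"))
```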

Nathan Labenz: (1:10:38) Yeah. That's interesting. I would definitely agree that the Google custom search API is not really a competitor. It's

Josep M. Pujol: (1:10:47) It's meant to do site search.

Nathan Labenz: (1:10:51) Like your own site?

Josep M. Pujol: (1:10:51) Yeah. Actually, I'm pretty sure you can do the trick of not passing domains, so that it doesn't do site search but global search, but I'm pretty sure that goes against the terms of service. In any case, it's limited to 10,000 queries per day, which is

Nathan Labenz: (1:11:07) The thing that stood out to me the most, and folks can go judge this for themselves with our Colab notebook, which I'll put in the show notes, is really, more than anything else, just more information being returned. In my case with Waymark, for example, we have this flow where a new user shows up. And I'm always amazed that so few products implement something like this, because it's worked really well for us. A new user shows up, they want to make a video for their business. They typically don't have a ready-to-go folder of all their assets, let alone a good profile of their business. Right? So now we're going to use generative AI to put things together for them, but we need to know: who are you? Tell us about your business. What do you offer? What makes you special? Do you have any images that we can feature? And people just don't have that stuff. So most products just require you to make a profile and upload your images and whatever. We have worked pretty hard over the years to create a good experience where we go grab that content off the web for you. And as you'd imagine, the generative AI era has really improved how well that works. We have used Google custom search for that. Our volume is not super high, and we're mostly just trying to get their homepage, and then we'll maybe bounce a few links off of that. But I am going to look more deeply into using Brave for this use case in particular, because "just more information" really jumps out at me. We get a single snippet from the Google custom search. And with the Brave results, we often get a much deeper cut, with multiple snippets and even some structured data for some results. I looked at this a few different ways, but in just raw characters returned for a ten-result query, the volume of data was often a multiple higher than what I was getting with Bing and Google. You could pad out characters with anything, right? But a lot of that seems to come from these additional snippets, which for us are really useful. Right now we first run that query, then we go crawl ourselves, pull all that stuff in, and then run our own language model to process it and turn it into a profile that we can feed downstream into the actual generation task. But I could definitely see us simplifying a lot of things, and it's always nice to take multiple steps and condense them into one. Getting those multiple snippets alone, I think, would be a pretty notable enhancement. That's before the summarizer too, which is something I could also see us using, because again, we're doing that ourselves today and you've implemented that whole thing. I'll have to check on coverage for our small businesses, because some of our customers are pretty long tail. They're generally not super famous businesses.
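
To make the "more information returned" comparison concrete, here is a rough sketch of the kind of measurement described above: summing the characters of every snippet one Brave response contains. It reuses the hypothetical brave_web_search helper from the earlier sketch and assumes each result carries a description plus an optional extra_snippets list; check your plan's documentation for the actual field names.

```python
def total_snippet_chars(brave_response: dict) -> int:
    """Rough proxy for how much text came back from one Brave web search.

    Assumes results sit under brave_response["web"]["results"], each with a
    "description" and possibly an "extra_snippets" list (assumed field names).
    """
    total = 0
    for result in brave_response.get("web", {}).get("results", []):
        total += len(result.get("description", ""))
        for snippet in result.get("extra_snippets", []):
            total += len(snippet)
    return total


# Example usage with the hypothetical helper from the earlier sketch:
# for q in ["family bakery portland", "plumber ann arbor"]:
#     print(q, total_snippet_chars(brave_web_search(q)))
```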

Josep M. Pujol: (1:14:11) Yeah. You have to check that you get the right business at the end.

Nathan Labenz: (1:14:15) But for me, the volume of information was the biggest thing that jumped out. Would you expand on that any further?

Josep M. Pujol: (1:14:23) Basically, providing more volume per se was not the motivation. Because, remember, the amount of data is very important only as long as there is no noise; the risk of adding noise outweighs any amount of data that you can provide. Our rationale was the following. Because we do dynamic snippet generation, we end up having multiple candidate snippets that we run through a Q&A model. That's a cost we have already paid. So as long as a candidate is above a certain threshold, which means in principle there should not be noise, why should we not send it to the user, so that they can choose which snippet they want to show, or, if they want to do LLM inference, have a little bit more data to fit the prompt or to cherry-pick? Because, remember, cherry-picking is actually very important, and this is on the topic of the summarizer. The secret sauce of our summarizer is not the model, the LLM; again, we changed it from Mixtral to Llama in a heartbeat. It's the cherry-picking of which data we feed into the prompt. That's what we spend the time on: feeding only non-repeated, non-contradictory information, so we do a lot of preprocessing there. Anyway, with the API, what we try to offer is all the elements so that you can innovate on top of it. That's our goal. Of course, we want to make money out of it too, right? So if it has more value than Bing, that's also good, because you will have more users. But it's not one or the other. You can have both reasons at the same time, right?
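
The point that the selection step matters more than the model is easy to illustrate. The toy sketch below is not Brave's actual pipeline: it simply drops near-duplicate snippets before assembling a summarizer prompt, using an arbitrary word-overlap threshold and a made-up prompt template.

```python
def near_duplicate(a: str, b: str, threshold: float = 0.8) -> bool:
    """Crude Jaccard word-overlap check; a real system would use something stronger."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    if not wa or not wb:
        return False
    return len(wa & wb) / len(wa | wb) >= threshold


def select_snippets(snippets: list[str], max_snippets: int = 8) -> list[str]:
    """Keep snippets in ranked order, skipping near-duplicates of anything already kept."""
    kept: list[str] = []
    for s in snippets:
        if not any(near_duplicate(s, k) for k in kept):
            kept.append(s)
        if len(kept) == max_snippets:
            break
    return kept


def build_summarizer_prompt(query: str, snippets: list[str]) -> str:
    """Assemble a simple grounded-summary prompt from the selected snippets."""
    context = "\n".join(f"- {s}" for s in select_snippets(snippets))
    return (
        "Answer the question using only the sources below.\n\n"
        f"Sources:\n{context}\n\nQuestion: {query}\nAnswer:"
    )
```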

Nathan Labenz: (1:16:16) It is also notably, significantly cheaper than Bing. You have multiple tiers of service for different amounts of data that get returned, and somewhat different price points for different licensing as well, including whether you're going to be able to store the data or not, so folks can get into the different price points. But Bing also has a lot of different price points. And at least for me, when I was looking at what I would naturally want to use, including the summarizer, that is a $9 per thousand queries price point with Brave. With Bing, the most comparable thing, which doesn't even really have that summarizer feature at all, but the most comparable price point that returns the varied results, where you might get video results or different kinds of discussion-type pages, is a $25 per thousand price point. So that's saving more than 60%. And I would say you're getting more, both in terms of the additional snippets, the extra information, and the ability to tap into that summarizer. Anything else that I should be conscious of as I think about the right option for my app?

Josep M. Pujol: (1:17:27) No. The right option for your app is whatever works; if you can afford it, get all the data combined and do an evaluation out of it. But we are very happy with the search API, and it's actually a satisfaction for us when we talk to customers and they say, well, it's about time there was an alternative to Bing. For us, it's very rewarding to be able to achieve that goal. You might use Bing, you might use Google on a private deal, you might use whatever, right? But the fact that you have more and more to choose from, that's a big deal for us.

Nathan Labenz: (1:18:11) Do you want to shout out any customers that are using the search API that folks can go check out if they want to see a good implementation?

Josep M. Pujol: (1:18:21) Oh, I wouldn't be able to play favorites. There are two things here. The vast majority of our customers do not allow us to mention them; it's a very competitive space, and many of the big players do not want to admit or let others know that they use our search or any other system. And for the ones who do acknowledge it, it would be kind of unfair to single one out. But I can tell you some nice use cases. Of course, there are plenty of RAG use cases being done by names that have been mentioned before. It has also been used to train significant foundation models. One use case that I find very interesting is that it is also being used to do grounding of LLM outputs: you take the LLM output and run it against the search engine to try to verify whether the output is legitimate or not. And there are many other use cases that we do not know about, because at the end of the day, unless they ping us, we do not ask questions. They use it, and hopefully it's beneficial for both of us.
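
The grounding use case is straightforward to prototype. Below is a deliberately naive sketch, assuming the hypothetical brave_web_search helper from earlier: take each claim an LLM produced, search for it, and see whether any returned snippet shares most of its words. A production system would use a proper entailment or claim-verification model rather than word overlap.

```python
def snippet_support(claim: str, brave_response: dict, min_overlap: float = 0.5) -> bool:
    """Naive check: does any returned snippet contain most of the claim's words?"""
    claim_words = set(claim.lower().split())
    if not claim_words:
        return False
    for result in brave_response.get("web", {}).get("results", []):
        snippet_words = set(result.get("description", "").lower().split())
        if len(claim_words & snippet_words) / len(claim_words) >= min_overlap:
            return True
    return False


# claims = split_into_claims(llm_output)   # hypothetical claim-splitting step
# for claim in claims:
#     verdict = "supported" if snippet_support(claim, brave_web_search(claim)) else "unverified"
#     print(verdict, "-", claim)
```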

Nathan Labenz: (1:19:43) Are there any favorite technology complements that you would share, almost in a cookbook or demo-showcase sort of way? Are there patterns of development that you've seen where people have combined the Brave Search API with some other framework, technology, or user experience that you think is worth sharing, just to inspire people?

Josep M. Pujol: (1:20:05) Yeah, there is one. I don't know how old it is. There was one guy who did a YouTube video, and I think it's on GitHub too. He built a Perplexity clone using LangChain, Brave, and Claude. He did a whole combination of APIs and actually built a very nice prototype. I think it's good for getting the idea of how the pieces fit together, and then people can decide how to go deeper.

Nathan Labenz: (1:20:36) Cool. You've been very generous with your time. Anything else that we haven't touched on that you wanna make sure we cover?

Josep M. Pujol: (1:20:43) No, I think I really talked too much. So hopefully I did not bore your audience with my ramblings. But it's been a pleasure, Nathan. Very nice host. I appreciate the chat.

Nathan Labenz: (1:20:56) Cool. The feeling is mutual. Josep M. Pujol, thank you for being part of the Cognitive Revolution.

Josep M. Pujol: (1:21:03) Take care. Bye bye.

Nathan Labenz: (1:21:04) It is both energizing and enlightening to hear why people listen and learn what they value about the show. So please don't hesitate to reach out via email at tcr@turpentine.co, or you can DM me on the social media platform of your choice.
