Watch Episode Here
Listen to Episode Here
Show Notes
Anton Troynikov, cofounder of Chroma, joins Nathan Labenz to discuss the importance of keeping the retrieval-augmented generation (RAG) loop in house, what it means for Chroma to be in “wartime” mode right now, and semantic storage and retrieval. If you need an ERP platform, check out our sponsor NetSuite: http://netsuite.com/cognitive.
SPONSORS: NetSuite | Omneky
NetSuite has 25 years of experience providing financial software for all your business needs. More than 36,000 businesses have already upgraded to NetSuite by Oracle, gaining visibility and control over their financials, inventory, HR, eCommerce, and more. If you're looking for an ERP platform ✅ head to NetSuite: http://netsuite.com/cognitive and download your own customized KPI checklist.
Omneky is an omnichannel creative generation platform that lets you launch hundreds of thousands of ad iterations that actually work, customized across all platforms, with a click of a button. Omneky combines generative AI and real-time advertising data. Mention "Cog Rev" for 10% off.
LINKS: Part 1 with Anton: https://youtu.be/ogy37CdIljg
X/SOCIAL:
@labenz (Nathan)
@atroyn (Anton)
@eriktorenberg (Erik)
@CogRev_Podcast
TIMESTAMPS:
(00:00:00) - Introduction by Nathan, setting up the conversation with Anton
(00:02:16) - Anton articulates Chroma's mission to build a horizontally scalable system and deliver it as a cloud service
(00:03:06) - Rise in popularity of retrieval-augmented generation (RAG)
(00:05:00) - Anton explains what it means for Chroma to be in "wartime" mode right now
(00:06:03) - Chroma's focus on delivering a horizontally scalable cloud service for vector search and storage
(00:08:07) - Nathan describes his experience building a RAG application for a client profiling use case
(00:10:27) - Anton advises measuring retrieval quality and maximizing relevant information returned
(00:15:05) - Sponsors: Netsuite | Omneky
(00:17:02) - Popular use of open source vs. proprietary embedding models like OpenAI's Ada
(00:19:30) - The importance of keeping the RAG loop in house and not relying solely on external APIs
(00:23:31) - Approaches for adapting the embedding space based on user feedback
(00:27:41) - The huge amount of unstructured data that can now be processed by AI
(00:30:40) - Providing a unified interface to structured and unstructured data
(00:31:21) - Chroma's plans to bring more intelligence into the data layer
(00:32:13) - Analogies to Salesforce and Oracle in enterprise software partnerships
(00:33:15) - Much of the data going into Chroma has never been in a database before
(00:38:47) - Categories of organizations adapting to AI: legacy, AI-native, and AI-first
(00:40:55) - Where Chroma is seeing most of its growth right now
(00:42:48) - Anton sees retrieval as an important component for developing good agents
(00:46:20) - Interpretability work like Anthropic's circuit evaluation
(00:48:15) - Keeping representations grounded in human-interpretable data
(00:52:23) - Anton believes new tooling can make latent spaces accessible without AI expertise
(01:02:21) - Fine-tuned models bringing more of the RAG loop in house
(01:03:32) - Thinking of data as a control loop rather than static
(01:04:30) - Offering continuously improving public data sets like Wikipedia
(01:06:08) - Scaling constraints between search indexes vs. application databases
(01:09:10) - Potential for time as a dimension in embedding spaces
(01:10:55) - Language models discovering implicit representations of time and space
(01:13:46) - Likelihood of missing results due to representational issues vs. approximate nearest neighbor
(01:15:22) - Automatically handling small data sets without needing elaborate indexing
(01:16:11) - Partnerships with AI labs to mutually reinforce RAG applications
(01:17:20) - Anton's perspective on whether OpenAI will build its own database
(01:19:43) - Partnering with OpenAI and other labs to increase use of their models
(01:21:19) - Anton's experiments probing GPT's reasoning abilities with Game of Life
(01:25:41) - Closing thoughts on the conversation
The Cognitive Revolution is brought to you by the Turpentine Media network.
Producer: Vivian Meng
Executive Producers: Amelia Salyers and Erik Torenberg
Editor: Graham Bessellieu
For inquiries about guests or sponsoring the podcast, please email vivian@turpentine.co
Music license:
VS9VAZULYVID8VNL
Full Transcript
Anton Troynikov: (0:00) We are conditioned to think about data as this static thing, right? It's sitting somewhere and it has a particular instance in time and then we access that instance in time and then the next time it might be different, but it's still essentially, mentally we think of it as static. I really think of these things more as a loop, right? It's almost a control system loop, where data is actually an engine. It's something that is interacting with the outside world mediated through computation and then it's constantly adapting and it's improving itself.
Nathan Labenz: (0:32) Hello and welcome to the Cognitive Revolution where we interview visionary researchers, entrepreneurs, and builders working on the frontier of artificial intelligence. Each week we'll explore their revolutionary ideas and together, we'll build a picture of how AI technology will transform work, life, and society in the coming years. I'm Nathan Labenz, joined by my co-host, Erik Torenberg. Hello, and welcome back to the Cognitive Revolution. Anton Troynikov is the founding CTO of Chroma. We first had Anton on the show back on episode 5, which was released in early March before GPT-4's release, and he impressed me then and since as a real intellectual force. Anton is someone for whom the fundamentals seem to come easily, who codes all the time, and for whom it's natural to think in higher dimensions and abstractions. He's also a super quick wit, a prolific and at times provocative commenter, and amazingly, at the end of the day, still a bit of a doubter, certainly more of a doubter than I am when it comes to LLM reasoning abilities. In March, I treated my conversation with Anton as an opportunity to speak to an expert tutor on core concepts of embeddings and retrieval, and I tried to explore how he thinks about navigating the latent space as a way to develop my own intuitions and to help you develop yours. If you haven't heard that episode and want something more foundational from Anton, definitely check that one out. This time around, I wanted to talk about everything that's happened since, including the fact that RAG or retrieval augmented generation has been LLMs' first big commercial application hit, with tons of companies spinning up all manner of document backed bots for all sorts of labor saving purposes, and in the process, nearly as many companies spinning up and deploying a vector database for the very first time. We talked about what's working today in retrieval and also what's next, including for Chroma as a business. Now coming in, I was honestly wondering if Chroma and other vector database startups might struggle to grow businesses given how incumbent database providers are now racing to implement their own vector stores. But Anton did a good job of reminding me of one of my own maxims, that there's a good chance that we're all still thinking too small. His observation that most of the data stored in Chroma has never been in a database before suggests that there will be plenty of growth to go around, at least at the infrastructure layer for a while yet. And his plan to bring more and more value into the database so that retrieval becomes as simple as dumping text into a text box, not unlike how people use chatbots today, seems like one that can drive Chroma to a great outcome even as lots of incumbents get into the game. I just wanted to take one moment to say that I'm really quite proud of this content. This is a focused but fast moving conversation conducted at an expert level with a genuine intellectual leader currently working as a wartime CTO. I've learned an incredible amount from the process of making this show, and I'm grateful that thousands of you from an incredibly diverse range of backgrounds have become regular listeners. I also want to thank the team at Turpentine, including our producers, Vivian Meng and Natalie Toren, our editor, Graham Bessellieu, and, of course, Erik for putting the whole thing together.
Working with Turpentine does make for an unbelievably convenient operation in which I am 100% focused on understanding, communicating, and interviewing as well as I possibly can. With that, as always, we'd appreciate it if you'd share the show with a friend. I'll suggest that you send this one to a software or an AI engineer in your life. And we always welcome your feedback or other outreach at tcr@turpentine.co or by DMing me on the social media platform of your choice. Now, here's my conversation with Anton Troynikov, wartime CTO of embedding database company, Chroma. Anton Troynikov, welcome back to the Cognitive Revolution.
Anton Troynikov: (4:21) Thanks for having me.
Nathan Labenz: (4:22) Lot to cover. Last time you were here 8 months ago, it was before GPT-4. It was before RAG, broadly speaking, overtook agents as the hottest trend. And it was before, I believe, you had "wartime" and a pirate flag in your Twitter handle. So...
Anton Troynikov: (4:44) I believe I had the pirate flag. I'm not sure that I had the wartime yet. Wartime is a more recent one.
Nathan Labenz: (4:50) So tell me, what it means to be a wartime startup founder right now in the incredibly fast moving AI space, then I'll dig into lots more questions beyond.
Anton Troynikov: (5:00) Yeah. I think the thing that caused a switch for me at least mentally, and of course part of it is just a branding exercise, but what caused the shift for me mentally was I think that Chroma as a company has a very clear objective and a very clear mandate right now which we need to achieve and we are taking on what I think of as the right level of risk for a startup right now, which is a lot of risk in order to achieve that. So having that concrete objective, knowing what we're risking is a very different place to be in than when you're still in this exploratory mode and you're figuring out what people want. I think we actually have a very clear idea of where we sit in the market, what we're able to achieve. And so for that reason, I think I made the change.
Nathan Labenz: (5:44) Gotcha. Interesting. So can you articulate that straight away? Is that something you can lay out for us at this point?
Anton Troynikov: (5:50) Yeah. Absolutely. I mean, our roadmap is fairly public, so I think it's pretty clear. But the main thing right now is to build a horizontally scalable system, on the basis of Chroma's vector search and storage engine and then deliver that as a cloud service to people. It's the number one thing that people are asking us for. We have a very clear shot on getting that done, especially as the Chroma team has been growing over those months since we last spoke. We know what the shape of it has to be. We know the quality that is expected of us and that we want to deliver to the market, we know our key differentiation, we understand what we're building for and so all those things are just in place and now it's purely execution risk. It's how well can we achieve this mission that we've set out for.
Nathan Labenz: (6:31) Okay, cool. I have a lot of questions. Just for context, and this is not meant to be a customer role play, but I think it naturally could go a little bit in that direction because I reached out to you with a general feeling that I need to do some sort of rundown of all things RAG because it has been the most talked about application development trend of the summer. And it seems to be the thing that is really working and driving a lot of value for every business that tries it, at least to some degree. I think our audience is pretty deep in the weeds, so I usually don't sugarcoat it too much on the terminology, but just for anyone very briefly, RAG, retrieval augmented generation. This is the loop where a user asks a question or has some sort of input at runtime. A database is searched using that query to bring additional context into the context window, and then the language model can use both the query and the retrieved information to generate stuff. Obviously, a lot of variations on that, but that's kind of been the big trend. And so, Swix had an episode on this recently. Thought it was quite good. I know there's a lot of content about it at the AI Engineer Summit, which I understand you made it to in person. I unfortunately was traveling elsewhere for a wedding at the time. But I'm also doing this, right? So I'm advising this company Athena. I've mentioned this many times, and we have recently gotten to a good v1 that we're at least comfortable putting in the company's hands. And the company's in the executive assistant space. They have 1,000 executive assistants with 1,000 executive clients, highly idiosyncratic. What we developed in a RAG vein is an individual client profile backed application that could hopefully help the assistant get access to useful content about the client.
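A minimal sketch of the RAG loop Nathan describes, using Chroma's Python client as the store; the `generate_answer` helper and the example documents are purely illustrative stand-ins for a real LLM call and a real client profile.

```python
import chromadb

# In-memory Chroma client; a persistent or hosted client would be used in a real app.
client = chromadb.Client()
collection = client.create_collection(name="client_profile")

# Index some unstructured notes (Chroma embeds them with its default embedding function).
collection.add(
    ids=["note-1", "note-2"],
    documents=[
        "Prefers morning meetings and never schedules calls on Fridays.",
        "Travels to London every March for the annual board offsite.",
    ],
)

def generate_answer(question: str, context: list) -> str:
    # Stand-in for the LLM call (GPT-4, Claude, a local model, etc.).
    return "Context:\n" + "\n".join(context) + f"\n\nQuestion: {question}"

# The loop: embed the query, retrieve nearby chunks, and generate with them in context.
question = "When does the client usually travel?"
retrieved = collection.query(query_texts=[question], n_results=2)
answer = generate_answer(question, retrieved["documents"][0])
```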
Anton Troynikov: (8:24) Yeah. Perform the work more efficiently and faster with all the available information.
Nathan Labenz: (8:30) So I don't know if I'm a typical customer or not a typical customer, but we're definitely looking at this. Boy, wouldn't it be great if you could have this kind of semantically sophisticated access to all the things that a client has accumulated or declared or the preferences, the email history, could go on and on and on. Now we're in the early days of this, but a big part of that obviously is the backing. Where does information get chunked? How does it get stored? How do we access it? How do we assess whether we're accessing effectively or not? So I've got a product quote unquote in market with an internal group there. And I think what kind of stood out to me most as a developer based on what you're talking about is how do people choose in the first place? Feels to me like a very black box. You said people are coming to you and saying, "Hey, I want a hosted service." I honestly can't necessarily yet differentiate between the vector database providers. I'm not even sure how we should think about differentiating.
Anton Troynikov: (9:30) Yeah. I think that's a very reasonable question and I think that's an artifact of the position that we find ourselves in on the adoption curve of these technologies overall, right? So many individuals and enterprise teams and just people building with AI in general are very early to this technology. The way that we're thinking about it is 99% of people who are going to one day build an AI in the loop application haven't touched that at all yet. And so because we're very early in the adoption curve, differentiation isn't as clear, right? The differentiation comes into play more as you get further along the adoption curve and we have users all the way from people building their very first AI project at a hackathon where they're finding out about RAG for the first time where I have to really educate them about, "Hey, this is actually possible. You can inject information into the model. Here's how you do it." All the way down to pretty sophisticated production deployments where those are enterprises who have the core RAG loop up and running and they're asking, "Well, how do I improve my retrieval quality? How do I actually incorporate all the human feedback and evaluations that we've done back into the data layer? What can I do to make sure that the output is improving over time?" And there are far fewer teams towards that later end of the curve than there are in the earlier one and there's an intermediate step here as well where it's, "Okay, my RAG application experiment has succeeded, now we're scaling," right? You have to scale in the right way as well. So Chroma has differentiation at each point in this curve actually. The first part where we really think of ourselves as enabling experimentation in this space. If we make the retrieval piece of retrieval augmented generation loops really easy to use, and Chroma is definitely the easiest thing out there to get up and running with. You just pip install ChromaDB and actually these days with our CLI, you can literally just do `chroma run` and have a Chroma server up and running. It's the easiest thing to start experimenting with, right? It's why we're the default in many frameworks, it's why it's so easy to get demos up and running with us is it requires no setup at all and it gives you sensible defaults and it just works out of the box. I mean, with Chroma, you never even need to think about an embedding function. You literally just throw text at it and it's going to work. So we're differentiated at that starting point and then the thing that we're working on right now of course is horizontally scaling this in the right way. Today, you can easily scale a Chroma instance and many people do actually. They just put it on a machine with a lot of RAM and then they run tens of thousands to hundreds of thousands of collections with millions of vectors each out of the box with a single node Chroma. But obviously for really enterprise grade data, when you need your entire company to run on one thing, even though your engineers are building different applications with it, you need something that scales horizontally across multiple nodes. And it's important to actually get that right. A lot of other products in the market were built as search indexes, right, in the last 2 or 3 years. We've talked about this before but most products that are doing vector based retrieval were built for use cases that existed before AI became practical. 
Web scale recommender systems, web scale semantic search, the sort of thing that runs Pinterest similar image results, right? That was the application of embedding space retrieval. However, it doesn't scale the same way if you're using this as a component in an application database for a variety of reasons. The first of these of course is a search index is mostly data that doesn't change very much over time. You have your billion existing entries, maybe each day if you're very rich, you add a couple million entries, so it's a tiny percentage of the overall data. Most of the changes that you're making are additions, deletions and mutations are actually very rare, right? And then the other piece to this of course is that in a search index, the entire point is that every user has access to the entire index, right? You want everyone to see all the available images in Pinterest for example, which is fine. But that imposes very different architectural constraints both for the core vector retrieval algorithms but also for the way that such a system scales. Scaling and sharding and distributing a single monolithic index which can then be replicated and written in a particular way is very, very different to an application stack where you have an index maybe per user space and all of these are scaling independently and being updated independently, right? Scaling that horizontally and then providing that as a service again is another more difficult challenge even beyond that. The way that this will express itself to our users, especially when Chroma Cloud is up and running, we're going to provide completely elastic scaling. Instead of thinking about pods or deployment or managing a server, which by the way, makes total sense if you were building a search index. The way these things are architected is fine if you wanted a search index. We will provide this in a completely elastic way. We'll charge you per query essentially and some rate for storage, and then you just never have to think about the scaling part of it at all. But to deliver that requires us to build a quite differentiated architecture from the one that's traditionally been made for these kind of search indexes. Now, that puts us on the scaling point in the adoption curve. So as I mentioned, there's different points. The final point, and this is something that we're working on in parallel, is basically taking all of those problems around retrieval quality, around how do I chunk my data, around how do I select my embedding model, and bringing those down into the Chroma product. Because the way that I think about this often is AI application developers right now are putting a lot of effort into making sure that the retrieval component is working very well, and that's effort that they're not putting into experimenting and actually building products, right? So you often see at the enterprise level who are further along this adoption curve, you often see data science teams tasked with figuring out how to make their retrieval system better, which to my mind is really the responsibility of the product itself. It's not the responsibility of the application developer to do that and so Chroma intends to solve all those problems at the top end too. That's basically our starting point. That's our roadmap over the next few months.
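For reference, here is roughly what the two entry points Anton mentions look like with the current Python client; the collection name is arbitrary and the port is Chroma's usual default, so check your own version's docs for exact flags.

```python
# Local, in-process mode: `pip install chromadb`, no server to manage.
import chromadb

local_client = chromadb.Client()

# Client-server mode: start a server in another terminal with `chroma run`,
# then connect to it over HTTP (8000 is the usual default port).
remote_client = chromadb.HttpClient(host="localhost", port=8000)
collection = remote_client.get_or_create_collection(name="docs")
```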
Nathan Labenz: (15:01) Hey, we'll continue our interview in a moment after a word from our sponsors. Got it. I think the last part there is certainly extremely compelling immediately because I agree. There is a lot of effort that is still going into "Am I getting the right things back?" And it's not always even super obvious whether you are or are not. Maybe we could just spend a little bit of time there. I've kind of gleaned from the literature and validated for myself, is probably a more accurate way to say, a couple of things that I think are pretty good best practices. But I want to hear, even just with current Chroma, what you would advise people to do, what you'd advise me to do to improve on what I've got. And then also want to hear how you're planning to bring that in. I would assume today, OpenAI Ada embeddings, I assume, have a huge percentage of the market. Is that true of the usage that you're seeing?
Anton Troynikov: (15:57) It's interesting. So actually, what we're seeing is most people are happy with open source embedding models as well. A lot of people are deploying their own local embedding models because it's actually fairly straightforward to fine tune a sentence transformer to get more performance out of it than you want. I think Ada embeddings are fine but I think people are finding that there's not that significant a trade off between these two. There are other reasons though to use Ada embeddings. One of them is that there's an Azure endpoint for them. So if your data is already sitting around in Azure, then it's easy for you to do that, right? You don't have to run your own model because you might be running this data at large scale. You don't have to pay egress costs or the network cost penalties of using something else.
Nathan Labenz: (16:37) Yeah. I was going to ask how those preferences relate to scale because I think I'm maybe atypical in some sense or I'm in some maybe middle ground because you're kind of saying, "Yeah, Chroma is the default because you can just hit run and it goes." That is certainly great for the very starting point. I often jump one layer ahead, and I'll definitely count me among the customers that are, "Please give me the hosted version." Because the cost is often so elastic on these kinds of products these days that the entry point is super low, even free. So I can hack as freely with the hosted version as I can with pip install version. And then if I do actually start to take it to any sort of mid scale where I'm going to have users or whatever, then I'm happy I did that in many cases because I'm like, "Sweet, now I can pay $8 a month for as long as this thing that I've built lasts." Cool. And then real production scale comes later. And then I honestly haven't taken any embedding, any RAG project to that level of scale yet to where I'm weighing the cost there. I guess the main trade offs, I guess there could be performance trade offs, but at the high scale, I'm curious to know if you see people actually outperforming Ada with their own stuff or trying to save money on it because, damn, it's cheap. Right? So...
Anton Troynikov: (18:02) I think it's not just a question of saving money. I think it's a question of keeping as much of the data and RAG loop in house as you can as well. Going out and sending over your data over a network is not necessarily cheap or desirable in many cases, right? We're seeing actually very much an increased interest in bringing the entire RAG loop in house, not just calling out to say GPT-4, but looking at Mistral and Llama 2 and fine tuning those for particular tasks. And I have these conversations pretty often with analysts and other founders and VCs in this space where there's an open question right now, which is, okay, OpenAI is pushing for AGI, they want to develop the one model to rule them all, but what utility, if any, do the smaller fine tuned open source models have? And I think that there's actually an interesting inflection point in the market right now where people are saying, "Okay, well actually we want to run this entire RAG loop in house and optimize it for our particular use case, but also not give other people our data and not have to worry about that at all." And I think besides the scaling question, because actually using GPT at scale is expensive. The embeddings are cheap, but using the LLM itself is very expensive. People are asking themselves, can they keep that entire loop in house? Keeping the embeddings in house might be an early step in that direction.
Nathan Labenz: (19:19) So we'll be going back to just the strategies I've been using. So main one, typical, straightforward, just throw some tokens at it. I do think if you're doing something like this and you're early and you're trying to get to proof of concept, don't skimp on the number of records that you retrieve is one obvious but definitely useful tip. Use the context window.
Anton Troynikov: (19:42) Yes and no. So here's the thing, right? This is one of the things where we need to make retrieval work better for it to actually work well in production. It is a fact that has been folklore for some time in this space, however, the recent research has demonstrated this more empirically: distracting information in the model's context window does tend to measurably destroy the performance of the overall application and it destroys that performance in actually a very difficult to measure way, right? So the question is, try to return all the relevant results but make sure inside those relevant results, you haven't accidentally returned irrelevant information. And this is actually a fairly complicated system to isolate where the performance gains are coming from because obviously, the results that you return are dependent on your chunking strategy. You need to make sure that the chunks that you're creating are semantically meaningful, not just in some general sense but in the task specific sense that you care about, right, and your application needs to reason about. So I don't think it's always the case that you always want more tokens, that you always want more retrieved results. You have to find a way to retrieve only relevant sections. But of course, that depends then on the embedding model that you're using, the chunking strategy that you're using. All of these things are interconnected.
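One blunt way to act on this advice, sketched below under the assumption that embedding distances roughly track relevance for your data, is to over-fetch and then drop results past a cutoff; the 0.5 threshold is arbitrary and would need tuning against a real evaluation set.

```python
import chromadb

collection = chromadb.Client().get_or_create_collection(name="client_profile")
# Assumes documents have already been added, as in the earlier sketch.

# Over-fetch, then keep only results that land close enough to the query.
results = collection.query(
    query_texts=["What are the client's travel preferences?"], n_results=10
)

DISTANCE_CUTOFF = 0.5  # illustrative only; tune against measured retrieval quality
relevant_chunks = [
    doc
    for doc, dist in zip(results["documents"][0], results["distances"][0])
    if dist < DISTANCE_CUTOFF
]
```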
Nathan Labenz: (20:58) Yeah, certainly at the very high end, I've experienced Claude 2 if you really push it to the full 100K, it seems to kind of still go off the rails for me, whereas at 50K, it performs as expected and seems to have reasonable command of the full 50K. So certainly at the very high end, I've experienced some of the "too much information makes things go crazy." Do you have a rule of thumb? We've started with, for token conservation reasons, 2 to 3 results in our first implementation, and then we've now boosted that to 10. And I think at least on that margin, strictly for us, more is better because the best models can handle the 10 chunks. And our retrieval, I trust less than I trust GPT-4 or Claude 2 to find the best thing in the 10 chunks if it's there.
Anton Troynikov: (21:51) So what I would actually do is take a step further back here and ensure that you have a good way of measuring performance. And that can come from two places, that can come from offline evaluation of results and it can come from human feedback directly in the application itself. I would make sure that I was measuring effectiveness first. I had a really good measure of that. And then, of course, these are knobs to tune, right? But the first thing is to have a reliable measure of how well things are actually working in the first place. And again, my advice is straightforwardly as much relevant information as you can fit, as little irrelevant information, try to get as much irrelevant information out of there as you can.
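A toy version of the offline evaluation Anton recommends: label a few question/expected-chunk pairs by hand and measure how often the expected chunk shows up in the top k. The labeled pairs, ids, and k value here are placeholders.

```python
import chromadb

collection = chromadb.Client().get_or_create_collection(name="client_profile")
# Assumes "note-1" and "note-2" were added as in the earlier sketch.

eval_set = [
    ("Does the client take Friday calls?", "note-1"),
    ("When is the board offsite?", "note-2"),
]

def hit_rate_at_k(collection, eval_set, k=5):
    # Fraction of labeled questions whose expected chunk id appears in the top k results.
    hits = 0
    for question, expected_id in eval_set:
        result = collection.query(query_texts=[question], n_results=k)
        if expected_id in result["ids"][0]:
            hits += 1
    return hits / len(eval_set)

print(hit_rate_at_k(collection, eval_set, k=2))
```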
Nathan Labenz: (22:25) Another thing that I've experimented with, and I think you have really good ideas here as well, is trying to create some sort of adapter layer between the user's original query and inner product type math that the vector database is doing.
Anton Troynikov: (22:45) Yeah. So this is an idea that, again, has been around for a while. It's something that I've pointed out many times that to fine tune your embedding space, first of all, it's been demonstrated that it's sufficient to fit an affine transform to actually transform one embedding space to another embedding space, right? An affine transform, if you don't know, is just squeezing, stretching and rotating space, that's all it does, right? And it tends to be enough. Intuitively, what it's really doing is expanding the importance of certain dimensions of the embedding vector and constricting the importance of other dimensions of the embedding vector under the certain distance space. Now, one way to do that of course is to recompute all your embeddings and apply that transform but instead of computing the forward transform, you can just apply that to the query instead. And what's interesting is that means that you have great flexibility to even apply this transform per user, and you can learn the individual user's preferences because the same user may be using the same data for different reasons, right? And you can imagine different applications operating on top of the same vector store, which are using the data in different ways. Without modifying the data in the vector store itself, you can just apply these affine transforms to fine tune them per application or per user, which is a really exciting thing. And it's great that the math works out. Of course, the other approach is to fine tune the embedding model overall, and I think that you can actually probably backpropagate information, feedback information from all the applications or all the users sitting on your database to just overall get an overall improvement by performing that fine tuning. And these are features that Chroma intends to provide as well out of the box.
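To make the affine-transform idea concrete: rather than re-embedding the corpus, you fit a matrix W and bias b from feedback and apply them to the query embedding alone, then search with the transformed vector. Everything below - the embedding function, the dimension, and the identity transform - is a placeholder for values you would actually learn.

```python
import numpy as np
import chromadb

collection = chromadb.Client().get_or_create_collection(name="client_profile")

DIM = 384  # must match the collection's embedding dimension

def embed(text: str) -> np.ndarray:
    # Placeholder for whatever embedding model the collection was built with.
    return np.random.rand(DIM)

W = np.eye(DIM)     # a learned stretch/rotation of the space would go here
b = np.zeros(DIM)   # a learned translation would go here

query_vec = embed("what are the client's kids' birthdays?")
adapted = W @ query_vec + b  # affine transform applied to the query only

results = collection.query(query_embeddings=[adapted.tolist()], n_results=5)
```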
Nathan Labenz: (24:18) Okay. Cool. Just in case that sounds daunting to folks, to be honest, where we are in actual practice is not yet to any fine tuning of anything, but just using the language model to transform text to text. And our instruction for that, which is actually quite effective, is to essentially hallucinate, or we do it in a little bit more of a generic, abstract way. We tell the language model, "First generate what you think the answer might look like," and then it will do that. We put variables, just kind of placeholders, in. And then that has a much higher chance of hitting the right thing in the database if it exists, of course.
Anton Troynikov: (25:00) Yeah. That technique has been around for a little while. It's called hypothetical document embeddings. It's interesting. The open question there, of course, is that typically general purpose embedding models are already trained in that forward way. In other words, they are trained to land queries about information near where that information may be located. I would say that should be a property of the embedding model itself, but if you're finding that HyDE works for you and it's working in a cost effective way - you have that extra model call there per query - if it's useful, then use it. Otherwise, it may actually be in the long run cheaper to fine tune the embedding space itself.
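A minimal sketch of the hypothetical-document-embedding (HyDE) pattern being described here: ask the language model to draft what an answer might look like, with placeholders for the facts it doesn't know, then query the vector store with that draft instead of the raw question. The `llm` function is a stand-in for whatever model call your application makes.

```python
import chromadb

collection = chromadb.Client().get_or_create_collection(name="client_profile")

def llm(prompt: str) -> str:
    # Stand-in: in practice this would call GPT-4, Claude, or a local model.
    return "Kid <NAME> was born on <DATE>. Kid <NAME> was born on <DATE>."

question = "What are the client's kids' birthdays?"

# Step 1: have the model draft a plausible-looking answer, using placeholders for unknowns.
hypothetical_answer = llm(
    "Write a short passage that could plausibly answer the question below. "
    "Use placeholders like <NAME> and <DATE> for anything you don't know.\n\n"
    f"Question: {question}"
)

# Step 2: retrieve with the hypothetical answer, which tends to land nearer the stored facts
# than the bare question does.
results = collection.query(query_texts=[hypothetical_answer], n_results=5)
```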
Nathan Labenz: (25:37) I think at scale, it certainly would be. That is, yeah, interesting. It's a very interesting note on the forward nature of the training and the objective function. The whole thing is working in that direction. Very concretely, things that we've seen are just like, "what are the client's kids' birthdays?" It should work, but it's definitely way more likely to hit if we query with something like "kid A is born on whatever date" - it just seems to pop the results up to the top a little more reliably.
Anton Troynikov: (26:15) That speaks to a different possibility here, right, which is some data is better stored in a structured way, and some data is better stored in an unstructured way. Now numerical information about individual people is probably better stored in just a SQL table, right? It's easy. If you have everyone's names, birthdays, everything in a SQL table, you can just go look it up. The interesting part is, if the model is equipped with knowledge of where to look for that information and has information about the structure of the database that it's in, it can of course generate SQL queries against that table without having to go through what is potentially this lossy embedding search. Embeddings are great for unstructured data. Embeddings are great for finding information about unstructured data. But if the question is about things like birthdays or numerical data or events, typically you can also get the model to directly query a structured data store as well. That's another approach to take here.
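A rough sketch of the structured/unstructured split Anton describes: facts like birthdays live in an ordinary SQL table the model can query directly, while everything else falls back to vector search. The schema, the keyword-based routing, and the hard-coded SQL are all illustrative simplifications; in practice the model would generate the query from a schema description.

```python
import sqlite3
import chromadb

# Structured side: a plain relational table for facts like birthdays.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE kids (client TEXT, name TEXT, birthday TEXT)")
conn.execute("INSERT INTO kids VALUES ('acme-ceo', 'Ada', '2015-06-01')")

# Unstructured side: the vector store for free-form notes and documents.
collection = chromadb.Client().get_or_create_collection(name="client_profile")

def answer(question: str):
    if "birthday" in question.lower():
        # Illustrative routing: structured questions go to SQL.
        # In practice an LLM would write this query from a description of the schema.
        sql = "SELECT name, birthday FROM kids WHERE client = 'acme-ceo'"
        return conn.execute(sql).fetchall()
    # Everything else goes through embedding search.
    return collection.query(query_texts=[question], n_results=5)["documents"][0]

print(answer("What are the client's kids' birthdays?"))
```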
Nathan Labenz: (27:04) We've had database splits, conceptual splits, split personalities in the database realm before with things like SQL and NoSQL. Now we have structured and unstructured - traditional tabular versus vector style database. And I definitely do find myself wanting both at the same time. Right? So this seems like you have developers of applications on both sides of this divide running toward one another. Do you think that these things end up being served by the same solution, or do you envision people having distinct solutions for different kinds of databases and somehow making them talk to each other?
Anton Troynikov: (27:45) I think it's worth zooming out a little bit here and talking about what problem we're actually trying to solve. And the problem we're trying to solve is not to store our data in one particular representation or another. The problem we're trying to solve is to make sure that the model has all the information that it needs to complete a task or a query that the user has asked to do, right? So whether or not the actual relational database is a part of your storage product, the interface that you provide to the application developer has to take that into account. That's what matters, right? And so the way that we'll be solving that of course is to just have the appropriate adapters. If it turns out that there's advantage in expanding into, say, having a relational store - which we already do, by the way, have inside Chroma; the document store and the metadata store are a relational database. We use that as an augment to the vector DB and that's how we support, for example, keyword search out of the box, right? You can filter by keywords in Chroma out of the box because we have that relational backing. It's an open question about how much the actual database component should live inside something like Chroma, but what's very clear is it needs to be behind the same interface, so that the application developer, again, isn't thinking about all these plumbing components, isn't thinking about, oh, how do I wire together this framework? How do I get these techniques like hypothetical document embeddings and when should I know how to use them? And writing all of this stuff that belongs to the database. I mean, the metaphor that I like to use is we're in an era today where instead of Postgres, the database management system, giving you everything out of the box, you have the Postgres storage engine. But it's up to you as the application developer to write the query planner, to write the SQL interpreter, to write the block storage algorithms, right? If you're spending all your time doing all of that, you're not spending your time actually building the thing you want to do, which is the AI application. So I guess what this comes down to is, Chroma will provide a unified interface for data for your AI application. That's the way that we think about it. And what we're doing behind that depends on how we interface with either other data sources or how we're going to plug those data sources directly into Chroma.
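As a concrete example of the relational backing Anton mentions, Chroma's query API lets you combine vector search with a metadata filter and a keyword filter in a single call; the field name and values below are made up for illustration and assume matching metadata was stored at add time.

```python
import chromadb

collection = chromadb.Client().get_or_create_collection(name="client_profile")

results = collection.query(
    query_texts=["policy for international travel bookings"],
    n_results=5,
    where={"client": "acme-ceo"},             # metadata filter, backed by the relational store
    where_document={"$contains": "travel"},   # keyword filter on the stored documents
)
```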
Nathan Labenz: (29:42) Yeah. That certainly makes a lot of sense, I think, from a developer, customer profile standpoint. If you can unify an interface and take away a lot of those problems.
Anton Troynikov: (29:53) And the thing is the models can help us a lot with that. As a general principle, something we're definitely going to do in the future is to bring more of the intelligence down into the data layer itself. So we imagine, for example, having a local model whose job it is to essentially figure out how to connect to the data that is responsible for a particular query. When you model this, for example, as a decision problem, as you would model a decision problem in reinforcement learning or other AI paradigms, what you're looking for is conditional on the user's query, find the right set of data. That's what you're actually doing and that's readily modeled already in this language model or RL model way. We have ways to do this, it's just we need to execute on them. In the meantime, I think that there's very obvious things that we can do. This kind of SQL query generation stuff is an obvious step that we can take. And then all the user has to do is, yeah, here's some SQL tables that I have and Chroma will just talk to all of that.
Nathan Labenz: (30:55) Well, I'm just thinking about the business side of this a little bit more as well. I think the unified interface makes a ton of sense if you are hitting somebody at the beginning of their life cycle, and that is where most of these AI apps are today. Somebody's starting one. I do wonder...
Anton Troynikov: (31:13) How do you think this takes shape?
Nathan Labenz: (31:14) Because obviously, this is a debate that is raging across every sub part of the AI space. How much value can these startups capture? How much accrues to the incumbents? Nobody's going to rebuild Salesforce before Salesforce implements their AI layer. Nobody's going to switch from Salesforce in the window when that happens, even if somebody kind of does. I tend to believe those sorts of things. I wonder how you see that dynamic playing out in just the enterprise software space. Right? Because there is so much deployed database infrastructure and you're going to pry that out of a lot of mid career developers' hands with great difficulty, I would think. Right?
Anton Troynikov: (31:53) So here's something very interesting that we've noticed. Data that goes into Chroma has never been in a database before in many cases. That's actually a big wave coming. Right? We are giving access to data that has never been accessed before in a computable way, essentially, right? It's been sitting there, sure, in document stores or whatever for humans to read, but it's never been - and the reason that this is happening now is because that data has become processable by computers for the first time. The AIs can reason about large quantities of documents. They can reason about natural language, whereas before they could only reason about structured data. So all the data that was available to computers was sitting in these structured data stores. So it's not that I need to convince people who already have a database deployment that they ought to be using Chroma instead. It's more, no, listen, now you can actually work with all this other data that you have available that you weren't using before. It's great that you bring up Salesforce because I've been reading Softwar. Do you know this book? It's a Larry Ellison Oracle biography and of course Marc Benioff was an Oracle executive and Ellison was an early investor in Salesforce. And what Salesforce did of course is unify an organization's data around the customer funnel. The earliest you can get information about a customer to your organization is when they're in your sales pipeline. So it makes complete sense. This is where the data is coming in, that's where you unify it and then you build this platform of applications around it. I see Chroma actually following a similar model. This is all speculative. Honestly, right now, if I'm completely honest with what we're doing, Chroma's current stage is about getting us a seat at that table, right? This is what the mission of the company is currently. It's demonstrating that we can deliver a product that the market wants in a way that the market wants it and then building out from that. So I can speculate about the future in many different ways. I actually don't necessarily think that we are in competition with a lot of these incumbent data stores. I think that we're actually a great complement for them so long as we can interface with them in a way that it's possible to build applications on top of it. Right? And the same way that Salesforce and Oracle work together.
Nathan Labenz: (33:59) When you say the data that's coming in, I think that's a fascinating analogy and definitely, I would think, compelling to your investors and potential future investors as well. That's the kind of thing that the VCs are looking for, right, is the, "We're not competing with the existing database. We're 10x-ing how much data goes into databases." When you say that, I have an episode coming up with a historian, so the first thing that comes to my mind is handwritten letters, but I know it's not that that you're dealing with. So it's stored digitally already, right?
Anton Troynikov: (34:30) It's stored digitally. A lot of it is documents. Today it's documents and the reason that it's documents today is because we have text processing models, general purpose text processing models, that's what GPT does today. Tomorrow, and I mean this almost literally, it's more like a week from now, we will have models that can interpret images and sound, right? And so that's data that's just lying there. We've heard really interesting ideas that people have about this stuff. They have data lying around as PowerPoint presentations, and they want the model to be able to understand the slide and distill it into actionable information. All of this stuff is lying around in a way that is not machine interpretable until we have these general purpose models that work in the space of human inputs, which we have now. There's tons and tons of that data. And in terms of volume, in terms of sheer megabytes, I think it rivals what lives in the relational tables too. So that sort of stuff, right? Just an example. I mean, frankly, the ability to interact with general purpose textual data in a conversational way, completely nonlinearly, that technology is already enough to create enormous business value. But the fact is we're far from done with what you can actually do.
Nathan Labenz: (35:36) Yeah. No doubt. Honestly, the first use case, we've got this, again, this Athena chat, and that was predated by Athena GPT, which is just an internal tool that simply has 400 documents and allows the thousand executive assistants to query policy, best practice, whatever, over these 400 documents. And I think it's non-trivial...
Nathan Labenz: (36:02) In fact, that was created with chatbase.co very simply. Just drop in the things. Boom. They handle literally everything else. And yes, I agree. I mean, that's already quite valuable. Apparently, they used to get hundreds of questions a day and some significant share of those can now just be answered immediately by the AI, which is...
Anton Troynikov: (36:24) If that's all we were getting, this is already a significant business. Right? But it's not. There's so much more to do and it's a function of what information can the models process and how well can they process it. That's what matters, right? I wouldn't say we're downstream of it, we're actually complementary to it because having retrieval makes the models more useful and then the models being more useful gives you more reason to do retrieval. It's this very virtuous cycle between us and the model developers. We're very, very complementary to each other, which is another reason why I'm actually glad that we're focused on this component instead of the model improvement component. You touched on a really interesting point here. You touched on an internal AI tool, right? And again, if you look at the history of computing, most of what is built on unified enterprise data today is internal tooling to make your business processes more efficient and AI has this latent potential to make so many of those processes more efficient. And again, reading Softwar, which I highly recommend for anyone interested in the history of this space, the whole point of Oracle, one of the things that they really pointed out to their customers, especially during the dawn of the web, was you can't just stay on your old processes and expect to get the most mileage out of this. You have to adapt your processes to get the real returns out of a lot of this technology. But if you do adapt them, the returns are huge and I think that we haven't seen this yet in AI, but I think that's the potential. If this stuff works, that's the potential. And I think of this - there's actually 3 kinds of organizations around this. The first kind is an organization with existing business processes that is now adapting to AI, right? So for want of a better word, that's a legacy business, that's a business that exists today. Every business that exists today is a legacy business which is adapting its internal processes to use AI in one way or another, right? And some of those will succeed, some of those will fail. The ones that succeed are going to outcompete the ones that fail. The second category is of course companies that are building AI tooling, right? Tooling that is enabled only by AI, couldn't exist before. Brand new stuff, right? Totally new automation, stuff you couldn't ever do without the model being there. New businesses entirely. There's a very interesting third category though, which I think is yet to emerge but I think will, which is businesses serving the same customers as legacy businesses but which are built from the ground up around AI first processes. And those I think are actually going to be incredibly wildly successful, the first few that manage to pull it off and I'm very excited for those. And of course, Chroma is going to power all of that, so.
Nathan Labenz: (39:04) What do you think is a good category to try that in if I was going to go out and steal somebody's business?
Anton Troynikov: (39:11) I have spoken to a lot of people in this sector. There's so much complexity in the business processes in real estate and so much of it is just to do with processing documents because they come in in heterogeneous formats, they're filled out differently, they all have different requirements, all these things. It's a vast human powered document processing operation. If you can verticalize around AI processes in that domain, I think you have a real chance of doing things well. But of course, there are actually tech forward incumbents in that space too, which I think have a crack at it if they modernize fast enough. That's the one that comes to mind immediately. But the reality is you have to think about this as where can I get distribution fast enough such that my AI efficiencies actually matter? That's where to look for those businesses. I wouldn't try to compete with Coca Cola today for example on the basis that you're using AI and they're not.
Nathan Labenz: (40:08) Do you have a sense for where most of your growth is coming from today and how you think that will evolve across those categories?
Anton Troynikov: (40:14) Absolutely. I mean, look, we have, as I mentioned, users up and down the adoption curve. Most users are at the earlier part of the adoption curve, so that's where most of our growth is coming from. But of course, users graduate up that adoption curve, so the longer that Chroma is around, the more of our users get further up the adoption curve. We have enterprises with Chroma deployed in production today. We have fast growing startups which use Chroma. We have people, as I mentioned, literally at hackathons trying RAG for the first time using Chroma. And I'm actually very proud that we've built a product that serves all those points along the curve.
Nathan Labenz: (40:44) The curve.
Anton Troynikov: (40:44) I think we've succeeded. I'm actually even proud of the fact that we're even being used in ML research, right? The Voyager paper that came out, the Minecraft playing bot, Dr. Jim Fan's group at NVIDIA put that out a little while ago. I actually didn't know this when that paper came out at first, but I went and looked in their code base and saw, oh, Chroma is the memory engine for this agent system. So it's coming from everywhere and you have to look at this - I hate to use these '90s enterprise sales metaphors. I catch myself doing it more and more often, but it really is a wave that's coming, right? And it is about riding that wave to the right point and I feel like that's what we're doing. One thing I wanted to mention actually, you talked about how for a second there it seemed like agents were the main thing and then RAG became the main thing. I don't think that these are orthogonal things. I think that agents use retrieval and I think making retrieval work better will make agents work better out of the box. And we're talking with a lot of the agent developer companies about that right now. I actually think that good retrieval is a significant fraction of what we need to make good agents work.
Nathan Labenz: (41:47) Yeah, certainly that Voyager paper got people talking and it was a very impressive accomplishment. And that one was no training, right? That was just gripping around developing policy, saving it, and retrieving it.
Anton Troynikov: (41:59) It does, in fact, have some human examples. And this is actually part of the beauty of retrieval as a complete AI system, in that you can implant things in it and you can delete things from it when you don't want it to behave a certain way or when you want it to learn a skill it hasn't learned yet. It's actually very different to - in the notes that you sent me ahead of time, you talked about how fine tuning doesn't seem to work to implant facts. And this is a very rough mental model, I wouldn't rely on this 100%, but I think of fine tuning as the style or manner in which you want a task accomplished. So in other words, fine tuning is about what information you should pay attention to and how you should then synthesize a conclusion. It changes that, not the facts. And then retrieval is the actual data that you operate on. So these things are actually complementary again. There's so many things here that are not at odds with each other at all. If you look at the latest crop of retrieval augmented papers where they do retrieval augmented pre training or they do retrieval augmented fine tuning, you'll find that fine tuning allows the model to better use data that it's getting from retrieval as well. So these things are just very complementary.
Nathan Labenz: (43:04) Yeah, certainly the bitter lesson - that any end to end training you can manage to do tends to be the top performer - continues to hold true. That's definitely been a theme of my study of the RAG literature.
Anton Troynikov: (43:18) I would also note though, of course, that the scaling curves are real. If you don't have sufficient tokens to overfit your model onto the facts that are available to you, it's unlikely that you're going to get much out of tuning at all. You're likely to get more out of a general purpose system with retrieval, right? Which comes again back to this debate about is it one giant Death Star model to rule everybody or are we going to have these little task specific guys running around? Because that also really defines the capital requirements of the market too, because it tells you where the compute's going to live. Is it going to live in an Azure data center or is it going to live on your network? Nobody knows. It's tremendously exciting to be a part of this.
Nathan Labenz: (43:57) Yeah. Do you have a sense - I mean, it's obviously a huge question, but do you have a sense for how much knowledge can be squeezed out of the model while still retaining the raw g, if you will?
Anton Troynikov: (44:14) Yeah. Isn't that the question? Right? The way that I've started thinking about this is you arrive at a machine that at least emulates reasoning to some degree by showing it examples of reasoning - concrete examples of reasoning, right? There's no abstract corpus of reasoning that we can show it. So it's, okay, you need all these facts and you need to see the different ways that they can be combined so that you can infer something that can emulate a reasoning system. But then it's, okay, how do you preserve just the abstract reasoning part of this without it remembering the facts part? I don't think anybody knows the answer to that yet but I think we are on the way to discovering and working with it. There's been incredible work in interpretability lately. I'm not sure if you saw the big Anthropic circuit evaluation. I think that is a tremendous breakthrough. It's on a very small scale. I'm not entirely convinced as they are that it's just an engineering problem from here on out but I think it is a very important breakthrough. There's other work in terms of finding latent knowledge vectors in the latent space of the model, which is by the way exactly the type of interpretability work I've been advocating for quite a long time now. It's actually, let's just look into the space of what the model is reasoning about. I think that as we understand more of that and as we look at different training regimes, I mean if you look at the caveats in a lot of the retrieval augmented generation papers today, it's, yeah, the next step is obviously to train this with more retrieval in the loop. If we had better examples during the instruction fine tuning step for retrieval, we would have done that too. Stuff like that I think is yet to be tried and I think that in the ideal case, you have this 140 IQ machine which knows literally nothing about the world until you tell it something and then it's able to reason about it. That would be ideal because that would make the system also completely controllable and interpretable too. You know exactly what information it's using and exactly how it's combining it down to the individual neuron level.
Nathan Labenz: (46:05) Yeah. I found both that Anthropic paper and also the representation engineering paper from Dan Hendrycks, Zico Kolter, and Andy Zou. Yeah. Those are both very compelling. And I do think they sort of lead into another question I wanted to ask you, which is to what degree do you expect that that which is stored and that which is retrieved may start to slip from the space of that which is human interpretable or even represented in any sort of familiar format?
Anton Troynikov: (46:41) But that's not interesting. That's the thing. So this is a nice little control, right, because actually this is where AI safety and commercial interest completely align. A commercial deployment has no interest in the function of a system it cannot predict. It's just not valuable. So the data that the model is working with is necessarily human interpretable. Now of course there are other potentials here, right? There's the potential that the model learns steganography. There's the potential that multiple models learn to collaborate invisibly because they're reading and writing from the same data store. But at least that is a much tougher ask than the model storing knowledge in its weights in a way that we can't interpret. It is a much tougher ask than, okay, this has to be human interpretable text and if I don't understand it, I'm going to delete it.
Nathan Labenz: (47:33) Yeah, yeah. So let me take one step back on that, because I wasn't meaning to suggest - although I do think that's an interesting question in its own right, it's not quite the one I meant to pose. If I understand your response, you're referring to the fear that the models will come to deceive us and slip our detection abilities. But taking one step in the more modest direction, what I'm wondering about is: instead of storing text in some embedded form and then using the embeddings to retrieve the text and put the text in the context, I think we might cut text out of that loop before long and just end up saying, I'm going to store embeddings and I'm just going to inject numbers right into the model wherever.
Anton Troynikov: (48:14) Yeah, yeah, yeah. 100%. I actually think that this is a natural progression of where we are now. I think having this extra lossy pathway of returning documents and then putting those documents in a context window is silly, because we have an embedding, we essentially decode that embedding and then re-encode it into another embedding space. We'll just pass it into that embedding space. That's the way to really do things in the future. But I don't think that's actually a problem for interpretability. We know where all of those vectors came from. We know what inputs and outputs they correspond to. I don't think that's a problem.
Nathan Labenz: (48:43) Yeah, maybe. It seems like it could easily start to detach from the text, though. I mean, when I read these representation engineering papers or even the Anthropic paper, I'm like, okay, in these middle layers, we have this kind of high-concept representation that we're starting to be able to tease apart. And once we tease it apart, it's certainly not one-to-one with text. But it does seem to be the kind of thing that we can store and then load in, inject into the process in any future forward pass, as we may wish to. I sort of see that being really effective, probably, especially if you do some end-to-end fine tuning and you relax your design constraints and say, I don't really care if I can see what's going into or out of the database. I just want to save stuff and get it and use it and get the right answer. It seems like that will really work.
Anton Troynikov: (49:40) I think it will. But I think, again, you want controllability. You want interpretability. It's completely aligned with commercial interests to have an interpretable internal representation. And there's a few ways to impose that, like making sure that the embeddings that you're working with are decodable, and if they pass outside the realm of what's decodable, then you're like, okay, this is garbage. I want to delete this from the database. Or the model is misbehaving, let's find out why it's writing this weird representation and fix that. Or you can even say, okay, well, I'm not going to allow you to go into this part of latent space, because I know that there's not any stuff there that I want you to process. I think it's actually not that clear cut. I think we need to get away from this idea that just because a representation isn't directly human interpretable, because it's not in language but a vector of numbers, it means we don't understand it. I think we have to get away from that idea a little bit. Sure, you and I can't sight-read it, although I think in the future, people working with this type of data especially will start to develop intuitions about how things are laid out even if they can't explain them. But I think that doesn't mean that we can't develop tools that allow us to work with this data. I think that that's actually completely normal. Most physical systems are encoded as sets of numbers that we operate on. When we develop an airliner, we're not writing the instructions for the airliner in English, but we know how it's going to work because we have tools to interpret that.
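One way to operationalize that "is this still decodable?" check, sketched here as a simple out-of-distribution test: flag stored vectors that sit too far from anything embedded from text a human has actually read. The cosine-distance metric and the threshold are illustrative assumptions, not a Chroma feature.

```python
import numpy as np

def flag_undecodable(stored: np.ndarray, reference: np.ndarray, threshold: float = 0.35):
    """Flag stored vectors whose nearest reference (known human-readable) vector
    is farther than `threshold` in cosine distance.

    stored:    (n, d) candidate vectors sitting in the database
    reference: (m, d) vectors embedded from text we can actually read
    Returns indices of stored vectors to review or delete.
    """
    s = stored / np.linalg.norm(stored, axis=1, keepdims=True)
    r = reference / np.linalg.norm(reference, axis=1, keepdims=True)
    # cosine distance to the closest human-interpretable neighbor
    nearest = 1.0 - (s @ r.T).max(axis=1)
    return np.where(nearest > threshold)[0]
```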
Nathan Labenz: (50:59) Yeah. I mean, does that relegate this stuff to the realm of real specialists? Because certainly the layperson can't fly the jet. And I sort of imagine a lot of the AI engineers that are in the space, kind of new to the space, could very easily end up developing systems they don't really have visibility into.
Anton Troynikov: (51:19) It's just a tooling question. Right? It's a question of developing the right tools to work with this type of modality. And that's something we're also thinking about as well, starting with the right visualization tooling - in fact, in the history of Chroma's life, the first thing we built was actually a latent space visualizer. And now, coming full circle, we need to build it again into the product, because in a traditional, say, SQL database, you can just get the top N rows of your database and be like, oh, that's what my data looks like. Here, you need different tooling, specialized tooling, so that a developer can be like, oh, that's what's in my database. And I think it's just a question of tooling. I don't think that only experts will understand this stuff in the future. I think actually what we're seeing in general in AI is the barrier falling and falling and falling, to the point where, because you can interact with these models in natural language, an average person, just by talking to one, can really perform what would have been considered advanced AI research in the past, just by understanding what the models are doing.
Nathan Labenz: (52:15) Yeah. It is crazy that just literally chatting with a bot is legitimate research in today's world. I didn't have that on my 2020s bingo card by any means.
Anton Troynikov: (52:25) It reminds me so much of the early web, right? Because in the early web, everybody was experimenting in the same way because it was so accessible. Anybody could put a website up, right? Anyone with an ISP could put up a website. Nobody knew what was going on. Nobody knew how to build these things. Certainly there were no UI frameworks, let alone anything else. And people just kind of talked to each other and experimented. The fact that it's powerful but accessible is what produces Cambrian explosions of innovation and makes me very bullish on the potential here.
Nathan Labenz: (52:58) Yeah, it is certainly going exponential on multiple dimensions at the same time. So let's go back to how you're going to take more of the stack and take away my practical RAG problems. How are you going to do that, and how should I be thinking about doing that right now while you're still building the hosted version?
Anton Troynikov: (53:17) Yeah. The first step to this, of course, is to have good evaluations. Our friends and partners at other companies are building great evaluation tooling, and our first step there is to plug Chroma into those evaluation tools, right? Evaluations are great, but you need to be able to do something with them. So what we're doing there is we're developing a feedback endpoint, which basically means you plug into your favorite evaluation tool of choice, LangSmith or any of the others that are coming out right now, press a button, and it will automatically adapt your embedding space based on the evaluation that you're getting, right? So that's one of our very first steps. The next step that we're taking, of course, is we are exploring ways to sub-sample our users' data and select the right embedding model for them automatically, based on a sub-sample of your data, so that before we have to embed your entire dataset, we already know what works. You don't have to think about this or decide. We say, this is what we think works the best for you, and here's how we think fine tuning might help. All of this will be output automatically. And that'll come with the Chroma Cloud platform for the most part. It'll be a button you can press and be like, give me the best embedding function. What sort of performance improvement can I get?
Nathan Labenz: (54:27) How do you measure that sort of thing? Do you look for variation in the embedding space and try to see whether you can tell the difference between the data that's being input, or what sort of heuristics?
Anton Troynikov: (54:39) So the answer here is complex, right? There are several different attacks on this. Some of them are from the sort of, let's say, pure reasoning perspective, where we're dealing with densities and distributions of data in latent spaces, which is kind of Chroma's bread and butter. And the other side is, of course, signals from evals that we have around, for example, similar tasks and data: okay, well, your data resembles this distribution that we've seen before. It's probably this particular task. We know what performs well from evaluation. This is the embedding model we think you should use. Approaches like these, right? And then we can do a much more holistic and in-depth evaluation as your data is actually being used, and sort of be able to do that for you in the background, switch that on.
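A rough sketch of what selecting an embedding model from a sub-sample might look like, assuming you already have a handful of labeled (query, relevant document) pairs; the candidate-model interface here is a placeholder, not Chroma's actual mechanism.

```python
import numpy as np

def recall_at_k(query_vecs, doc_vecs, relevant_idx, k=10):
    """Fraction of queries whose known-relevant document lands in the top-k
    by cosine similarity."""
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    topk = np.argsort(-(q @ d.T), axis=1)[:, :k]
    return float(np.mean([rel in row for rel, row in zip(relevant_idx, topk)]))

def pick_embedding_model(candidates, queries, docs, relevant_idx, k=10):
    """candidates: dict of name -> embed(texts) -> np.ndarray (hypothetical interface).
    Embeds a sub-sample with each candidate model and returns the best by recall@k."""
    scores = {}
    for name, embed in candidates.items():
        scores[name] = recall_at_k(embed(queries), embed(docs), relevant_idx, k)
    return max(scores, key=scores.get), scores
```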
Nathan Labenz: (55:20) How much data do I need? How many of these, specifically, "was this a good retrieval?" signals? So kind of envisioning: I'm in my Athena chat, I ask for something, I've got my function call, it shows me what's been retrieved, and I just have a thumbs up, thumbs down to say irrelevant, relevant.
Anton Troynikov: (55:36) That's exactly what we're working with. That's exactly right. And so the number of data points that you need is, at minimum, enough to construct that affine transform densely, which is a function of the dimensionality of your embedding space, which makes this a really interesting trade-off. Right? Because it means a higher dimensional space may be sort of natively more representative, but a lower dimensional space is more amenable to this kind of fine tuning, because you need less data to do it. Now, of course, there's a third point here, which is even though you're using, say, 8,192-dimensional embeddings, you aren't using the entire dimensionality of those 8,192 dimensions. Your data is probably on some manifold embedded in that vector space. We can project down to that manifold and then do our fine tuning on that. So again, unfortunately, the only answer I have for you is that these are open questions, but we have a lot of lines of attack. We are hiring a head of research, for anybody who's deeply interested in sort of the applications of embeddings in this retrieval context, especially if they're interested in a lot of those advanced retrieval techniques around, for example, just pushing embeddings directly into the inference layer. So that's a little shout out to our hiring process.
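To make the "affine transform from feedback" idea concrete, here is a minimal sketch that ridge-regresses query embeddings onto the embeddings of documents users gave a thumbs up, then applies that learned transform at query time. Treating only positive feedback and using a closed-form ridge fit are simplifying assumptions for illustration.

```python
import numpy as np

def fit_affine_adapter(q_vecs: np.ndarray, d_vecs: np.ndarray, lam: float = 1e-2):
    """Fit W, b so that W @ q + b approximates d for (query, thumbs-up document) pairs.

    q_vecs, d_vecs: (n, dim) embeddings of queries and the documents users marked
    relevant. Roughly `dim` well-spread pairs are needed for the fit to be
    determined, which is why embedding dimensionality matters here.
    """
    n, dim = q_vecs.shape
    Q = np.hstack([q_vecs, np.ones((n, 1))])          # append bias column
    A = np.linalg.solve(Q.T @ Q + lam * np.eye(dim + 1), Q.T @ d_vecs)
    W, b = A[:dim].T, A[dim]
    return W, b

def adapt_query(q: np.ndarray, W: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Apply the learned transform to a raw query embedding before searching."""
    return q @ W.T + b
```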
Nathan Labenz: (56:39) So, if I understood this correctly, you're actually doing dimensionality reduction on a commercial embedding, like an Ada embedding, to then be able to fine tune it more conveniently.
Anton Troynikov: (56:51) Correct. Yes. We could do that. Because again, if we realize that your embedding set occupies a flatter manifold, then we can throw away tons of dimensions, project down, and see. And of course, that's not fixed. We don't necessarily lose any information there either, because we can still maintain the full dimensionality of the dataset, and as your data adapts, we can also adapt that lower dimensional manifold too, because that projection itself is just another transform in embedding space, a learned transform. All of this stuff is very malleable.
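A sketch of the projection idea using plain PCA as a stand-in for "find the manifold your data actually occupies." The real structure may be curved and the variance cutoff is arbitrary; this is just the linear version of throwing away unused dimensions.

```python
import numpy as np

def fit_projection(embeddings: np.ndarray, var_keep: float = 0.99):
    """PCA as a stand-in for finding the lower-dimensional subspace the data
    actually occupies. Returns (mean, components), components shaped (k, dim)."""
    mean = embeddings.mean(axis=0)
    _, s, vt = np.linalg.svd(embeddings - mean, full_matrices=False)
    var = np.cumsum(s**2) / np.sum(s**2)
    k = int(np.searchsorted(var, var_keep)) + 1   # keep enough components for var_keep
    return mean, vt[:k]

def project(x: np.ndarray, mean: np.ndarray, components: np.ndarray) -> np.ndarray:
    """Map a high-dimensional embedding down to the k-dimensional subspace."""
    return (x - mean) @ components.T
```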
Nathan Labenz: (57:22) Everything is one linear projection into something else, maybe a couple of hops at most. That's been one of the most profound things I've learned over the last two years: just how bridgeable all these different spaces ultimately seem to be.
Anton Troynikov: (57:38) Yeah. And it's about just finding the right space a lot of the time, right, to maximize the performance of your retriever. And of course, there's plenty of other little things we can do. Right? There's a million little convenience things that we can build, like automatic deduplication and summarization for your data. If there's a bunch of documents inside your collection that all lie near each other, your retrieval results will return all of them. So this is actually the other part, right? We're going all the way back to one of the first questions that you asked me in this recording: should I try to put as much information in as possible? Well, if 12 of your documents are redundant, you've actually managed to capture nothing new. And there's this idea of maximal marginal relevance as well, which is a heuristic for not returning the same information more than once. But we have intelligence. We can summarize that set of documents into a single document whose summary highlights the information you're actually interested in, because remember, a summary, like everything else, is conditional on what the user is actually trying to accomplish. But the models can reason about that. So we can get these conditional summaries which can collapse all these redundant documents into a single data point. And stuff like that is very good. It's just convenience. It just makes your stuff work better. It makes your costs cheaper because you have fewer data points. All of this is stuff we will build, or are currently building, I should say.
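Maximal marginal relevance is a standard greedy re-ranking loop: at each step, pick the candidate that is similar to the query but dissimilar to what has already been selected. A compact version:

```python
import numpy as np

def mmr(query_vec, doc_vecs, k=5, lambda_=0.7):
    """Greedy maximal marginal relevance over embeddings.
    lambda_ trades relevance to the query against diversity among results.
    Returns indices of the k selected documents."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    sim_to_query = d @ q
    selected = [int(np.argmax(sim_to_query))]
    while len(selected) < min(k, len(d)):
        # penalize similarity to anything already picked
        sim_to_selected = (d @ d[selected].T).max(axis=1)
        score = lambda_ * sim_to_query - (1 - lambda_) * sim_to_selected
        score[selected] = -np.inf          # never pick the same document twice
        selected.append(int(np.argmax(score)))
    return selected
```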
Nathan Labenz: (59:01) Yeah, I was just thinking about doing something like that for this Athena chat project as well, where, for probably both cost and latency and maybe also overall quality, a sort of Claude 2 Instant filter that sits between my retrieve-10 step and my insertion into context might actually help, because I'm using a bigger model right now, GPT-4, and that's a bunch of tokens that I'm feeding in there. If I could feed in fewer and feed in the right ones, I would save money and latency. Although that obviously comes with complexity. But yeah, those are the kinds of things I can certainly see. If the database just handles it, that could be very convenient.
Anton Troynikov: (59:43) You never need to see it. You shouldn't ever be thinking about these things. Right? You never think about what query plan Postgres generated for this query in your web app. You've never thought about that in your life, unless you're doing infra work. It's a similar set of problems here, and this is actually one of those spaces where I think the lightweight open source models that we talked about earlier are really applicable. If you are running an embedding function locally, then you already have inference hardware at that point, and you might as well run one of these tiny models on it, right? And you can definitely get to a higher utilization on that inference hardware that way, and everything just connects together in a way that creates a lot of value.
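For the "run a tiny open source model locally" point, this is roughly what it looks like with the chromadb Python client and a small sentence-transformers model. The class and method names reflect recent client versions as I understand them; check the current docs before relying on them.

```python
import chromadb
from chromadb.utils import embedding_functions

# Small open-source embedding model, run locally on whatever inference
# hardware you already have in the loop.
local_ef = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-MiniLM-L6-v2"
)

client = chromadb.Client()
collection = client.create_collection(name="notes", embedding_function=local_ef)

collection.add(
    ids=["1", "2"],
    documents=["Quarterly revenue grew 12%.", "The client prefers email follow-ups."],
)
print(collection.query(query_texts=["how should I contact them?"], n_results=1))
```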
Nathan Labenz: (1:00:26) Yeah. It's funny. I've had that happen, I think, maybe a couple times for me now, where I've started from some commercial option, or in some cases nothing available at all, and had some need to actually scale something up into an AWS, kind of Lambda function type situation. And that is a pain in the butt, but once you get it working for a particular task, we do enjoy the cost savings and the super easy horizontal scalability.
Anton Troynikov: (1:00:57) But also you have the data flywheel. You own the data flywheel at that point. Right? In the same way that you own the customer, you own the data flywheel, and that means that you have a model that adapts continuously to your specific business use case, not some general idea of a general purpose model. To be fair, I don't think general purpose models are going anywhere. I just think it's an open question right now whether these specialized, fine tuned, off-the-shelf open source models, which are smaller and in many ways more efficient in terms of compute and cost, could actually be very useful.
Nathan Labenz: (1:01:31) There are kind of two related ideas there: owning the data flywheel, and the ongoing improvement of the model. And if I was advising people, obviously context is everything, but for most people I would say it's really important that you start capturing that information and gathering that data. It's probably less important in the short term that you be doing ongoing fine tuning.
Anton Troynikov: (1:01:57) Correct. But you should be capturing that data, right? This is - again, I may be revealing too much of the master plan here, but consider how similar that is to just capturing information about retrieval quality. We are conditioned to think about data as this static thing, right? It's sitting somewhere, and it has a particular instance in time, and then we access that instance in time, and the next time it might be different, but essentially we still mentally think of it as static. I really think of these things more like a loop, right? It's like a control system loop almost, right? Where data is actually an engine. It's something that is interacting with the outside world, mediated through computation, and then it's constantly adapting and it's improving itself, if you have the right affordances to do that. And so that feeds into this system of continuous improvement in many, many different ways. I agree with you, I think that most people shouldn't be fine tuning online. I think that most people don't have enough data to ever be able to do that, but I think that there's a lot of possible approaches for this in future. Here's another example, right? One of the things that we're going to be deploying sometime around when we launch cloud is Chroma datasets, and this comes from the fact that a lot of people want to be able to do, say, retrieval against English-language Wikipedia, right? Because it's useful to have facts about the world that are not in the training set of the model yet. Or if not Wikipedia, then something from Bloomberg or whatever real time news sources. There is no reason for everybody in the world to go around embedding that on their own. They could just hit a Chroma endpoint and get it, right? We charge them a few cents and they just get the document or the embedding. We just do the retrieval for you, and obviously once we have the cloud service up and running, that's almost as simple as running a public collection. The open question about that, though, of course: now you have all these applications using the same data, right? So Chroma is now a data platform powering a bunch of applications, as well as the database that people use for their applications. Now you have a lot of information about how all those different applications are using the same data. And so now you're like, okay, we know which model this is being fed to, we know what the task is, this is how it's being used. We can fine tune it to that particular task. We can give you that data in exactly the way that your particular application and use case needs it, instead of this generic embedded way. So that's a path. That's a way for us to provide almost this public service of continuously improving representations of these datasets, right? Because, for example, it's very unlikely that people would use the information from a Bloomberg terminal for a dating app, unless it's a very high end dating app.
Nathan Labenz: (1:04:29) Depends on your definition of high end as well. We may have to put that out to the audience.
Anton Troynikov: (1:04:35) If the marital status of your CEO can meaningfully move markets, that's a different dating app. But basically, the embedding space that you would use for that kind of information is very different to the embedding space that you would use for, say, demographic information or information about
Nathan Labenz: (1:04:48) Right.
Anton Troynikov: (1:04:49) I don't know, basically news. They might be related, they might actually even be used the same way in the same tasks, but the information that people care about is actually specific to that particular type of information and the tasks that people are doing with it. And so over time, by having these publicly available datasets, we can converge on those things. You can have these living sets of data that are constantly improving toward what people actually want to be doing with them.
Nathan Labenz: (1:05:10) I think that's also a super big topic, the sort of evolution of datasets; there are a lot of angles on it. So you're kind of evolving there specifically around how do I serve this particular customer use case, or even individual, better over time, based on what they're doing or how they relate to the data and so on. I'm also really just interested in maintaining the actual data in the database, and I'm having some interesting challenges with that, where you said earlier that updates and inserts are relatively rare.
Anton Troynikov: (1:05:52) Not in AI land. What I was saying is they're rare in search indexes. They're not rare in applications at all. This is happening all the time. The data is being updated interactively, and so you need to build the data layer in a way that's robust to that. Now, in terms of actually keeping the data up to date, we've been speaking to quite a few folks around this. I think a lot of people are developing tooling to essentially let you stream data into your database, keep it up to date, and keep it synchronized with events in the outside world, and we're happy to support those projects too. We've spoken with the Unstructured folks. We've spoken to a few other folks that are doing stuff like this. I think that those are very valuable projects that need to get built.
Nathan Labenz: (1:06:29) That kind of speaks to it again, and it makes sense from a database provider standpoint that you're just saying, stream it on in, capture everything. It'll naturally find its first and hopefully eternal home here in this database. I guess I'm a little slow, and I'm not sure how many people you've actually seen doing this kind of stuff. But what I'm still kind of trying to figure out is just to what degree do I want to store everything that happens in my little Athena chat app as history? If I do start to do that, it probably needs to be some sort of different collection than my kind of canonical reference material or my initial customer profile. Then I also do want to update my customer profile. But then do I want to have a record of the fact that I updated my customer profile? I feel like everywhere I look, I'm like, boy, I'm appreciating my own memory. And while it's certainly got shortcomings, its relative elegance is apparent in contrast to what I'm kind of hacking together. So what advice or solutions do you have for people that are struggling with that right now, namely me?
Anton Troynikov: (1:07:30) Yeah, that's very interesting. I mean, this sort of journaling almost is how to think about it. One of our hackers in residence, Savant, actually ran into this as an interesting set of problems to tackle: how do you represent the fact that more recent information is what we want to reference, unless we explicitly want to reference information about the past, right? And there's a few approaches. Of course, as with everything in this space, there's kind of two ways to go about it. One is a heuristic approach: something like Chroma allows you to do date-based filtering, so you store the date in your metadata and say, okay, I only want information that came from this date onwards, right? I don't want to know about the history. You can store the history in a collection, but you just filter by dates that are further in the future or further in the past, depending on what you want to see. That's kind of a heuristic way of doing it. Frankly, I think that there's no reason why we can't have a time dimension in the embedding space itself and be like, oh, these are statements about the past, and then you know to decay that value over time. And then the model itself, the retriever piece of the model, can be like, okay, well actually I need to weight the time dimension more heavily, because I need to look at the history of one person over time, or I need to look at the history of the space over time, and land in the right place and launch that query. One way to think about it is if you attach time as an additional dimension on these embedding vectors, then if you want all the information, you can project out the time dimension, or if you want it from a particular time, you can just slice it there. I think that that's the long term solution. I think that getting the models to actually understand - getting the embedding model itself to actually understand time - is the right way to go about it. Again, though, don't forget that not every piece of data necessarily belongs in the vector store. Some of it is just better off in a relational data structure like a SQL database.
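The heuristic approach described first, a timestamp in metadata plus a filter, looks roughly like this with the chromadb client; the field name `ts` and the cutoff are placeholders. The recency-decay re-rank at the end illustrates the "decay that value over time" idea and is not a built-in Chroma feature.

```python
import time
import chromadb

client = chromadb.Client()
journal = client.create_collection(name="athena_journal")

journal.add(
    ids=["evt-1"],
    documents=["Updated the customer profile: new billing contact."],
    metadatas=[{"ts": int(time.time())}],   # store when the event happened
)

# Heuristic: only retrieve entries newer than a cutoff (here, 30 days).
cutoff = int(time.time()) - 30 * 24 * 3600
recent = journal.query(
    query_texts=["who is the billing contact?"],
    n_results=5,
    where={"ts": {"$gte": cutoff}},
)

# Illustration only: re-rank hits by an exponential recency decay.
def recency_weight(ts, half_life_days=30):
    age_days = (time.time() - ts) / 86400
    return 0.5 ** (age_days / half_life_days)

reranked = sorted(
    zip(recent["documents"][0], recent["metadatas"][0], recent["distances"][0]),
    # combining recency with inverse distance; the exact combination is arbitrary
    key=lambda hit: recency_weight(hit[1]["ts"]) / (1.0 + hit[2]),
    reverse=True,
)
```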
Nathan Labenz: (1:09:19) Yeah. You're certainly making me recall also the recent DeepMind paper, I think it's Language Models Represent Space and Time. And I don't know if you've seen that one in-depth yet, but
Anton Troynikov: (1:09:30) The one where they found out it has an implicit representation of latitude and longitude?
Nathan Labenz: (1:09:34) Yeah. And there's a timeline component to that as well.
Anton Troynikov: (1:09:36) Yeah. I'm yet to read that paper. It's unsurprising to me that they develop these sorts of representations. Right? Information about place and time is encoded in the training set. I don't see why they wouldn't represent that. Again, everything is trained on the same contrastive triplet loss. If you have signal that things are different in different places and times, and they're different in a specific way, then it will capture that signal. That result was actually not that surprising to me. I thought, okay, that's a very natural consequence of this kind of training.
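For reference, the contrastive triplet objective invoked here is just: pull an anchor toward a positive example and push it away from a negative by at least a margin, so only relative similarity carries signal. A minimal sketch:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Contrastive triplet loss over embeddings. Only relative similarity
    matters, which is why the top-level layout of the space can be arbitrary
    as long as clusters stay separated."""
    def cos(a, b):
        return np.sum(a * b, axis=-1) / (
            np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1)
        )
    # zero loss once the positive is closer than the negative by the margin
    return np.maximum(0.0, margin - cos(anchor, positive) + cos(anchor, negative)).mean()
```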
Nathan Labenz: (1:10:04) Yeah, I would agree. What I was blown away by was the visualization, which I think is becoming the signature of the group, because they've done a couple of really nice versions of this where they'll have, here's 12 seconds of whatever, and you immediately get it. In this case, I really felt like it solidified a model that I kind of already had, but wasn't sure how confident to be in: that the middle layers are where sort of the highest-order concepts live. And they seem quite decoupled there in many ways. The amount of associative stuff that's loaded in, even just when we give it a place name.
Anton Troynikov: (1:10:46) If you think about that, it makes sense because of the way the contrastive pre-training works. Contrastive pre-training is only about similarities and differences. When you are at the highest level of difference, it doesn't matter in which direction the difference is, because you don't have a third thing to make reference to, right? All difference is relative. So you can be completely floaty at the top level. At the top level of the hierarchy, you can be very ambiguous, without any meaning in the space itself, because whether this is left, right, up, or down doesn't really matter, as long as they're separated in some way. And then as you get inside these clusters, which is the lower levels, the structure of those differences matters more, because more of those points are nearer to each other and you still need to separate them. So structure gets imposed in that way. For me, it makes complete sense actually.
Nathan Labenz: (1:11:32) You should apologize for your privilege for saying that. Increasingly, I feel like I'm right there with you, although I think still a step behind. But I wanted to also ask, and this is a very tactical question. I heard you loud and clear on the importance of evals as a way to have some sanity around whether I am or am not missing stuff. But if I am missing stuff, how much more likely is that to be happening due to representational issues versus the nearest neighbor search being approximate and maybe missing something at that step?
Anton Troynikov: (1:12:16) Almost all of it is to do with the representation.
Nathan Labenz: (1:12:18) Is there a scaling law there? Can I quote something? If anybody ever tells me, oh, your nearest neighbor could be missed at the nearest neighbor step, how do I say, oh, no, that's a one in a billion? Or how do I quantify that?
Anton Troynikov: (1:12:28) So here's an easy way to figure this out. For the nearest neighbor benchmarks, we understand the trade-off very well between recall and speed. That's a completely tunable parameter. You can just go look at the graphs for various algorithmic implementations of approximate nearest neighbor. When your results are differing from the graphs, that's the representation's fault.
Nathan Labenz: (1:12:46) Gotcha. Okay. Is there one number I should keep in mind for if I just use the default setting, what do I miss? One in a thousand, one in a million, one in a billion?
Anton Troynikov: (1:12:55) Things are quite tunable. I think one in a million is probably about right. I would have to check our internal benchmarks on this, but I reckon on our default setting something... And again, the thing is this is never a concrete answer because it strongly depends on your data distribution. If you have some pathological data distribution, it's going to be much worse than that. It's kind of easy to cause this. If you build an HNSW graph where you just repeat the same entry over and over and over again as unique entries and then you add some other entries around that, then it'll be completely corrupted. You'll never get good results. So my point in saying that is it depends on your data distribution, but I would say one in a million is pretty reasonable. But you can tune that. You can completely tune that depending on the scale of your dataset as well if you're willing to trade some latency. And I think, of course, right now, you should probably be willing to trade a bit of latency given that the latency is completely dominated by the model inference itself.
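On "you can completely tune that": Chroma's approximate-nearest-neighbor index is HNSW, and the recall/latency trade-off is exposed through collection metadata keys along these lines in recent client versions; treat the exact key names and values as something to confirm against current documentation.

```python
import chromadb

client = chromadb.Client()

# Higher ef / M generally means better recall at the cost of memory and latency.
collection = client.create_collection(
    name="profiles",
    metadata={
        "hnsw:space": "cosine",        # distance metric for the index
        "hnsw:construction_ef": 200,   # candidate list size while building the graph
        "hnsw:search_ef": 200,         # candidates examined per query
        "hnsw:M": 32,                  # links per node in the graph
    },
)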
Nathan Labenz: (1:13:46) Yeah, definitely. So I'm actually thinking for an application like what I'm building, I could eliminate that fear entirely and just search the whole database, I think. I don't need the approximation.
Anton Troynikov: (1:13:57) Yeah. You could. I mean, you could get away with a matrix multiplication in NumPy depending on the scale of your data.
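At small scale, exact search really is one normalized matrix multiplication, along these lines:

```python
import numpy as np

def exact_search(query_vec, doc_vecs, k=10):
    """Brute-force cosine search: no index, no approximation. Fine for
    thousands to low millions of vectors."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    sims = d @ q                      # one matrix-vector multiplication
    top = np.argsort(-sims)[:k]
    return top, sims[top]
```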
Nathan Labenz: (1:14:01) Well, that sounds like more work than using Chroma.
Anton Troynikov: (1:14:04) It is. It's certainly more work than using Chroma, but you could do it. The other thing of course is we're not stupid. We know that that is faster and so we will actually, if the data is small, we'll put you in a small thing that just does that without constructing all the index and everything. We automatically switch over to the right data structure when we need to, when the scale reaches it.
Nathan Labenz: (1:14:23) How much of the future do you think is partnership with other technology companies for you? It seems like Chroma can be the piece that, for example, any number of other dev platform type things add, or you can... I mean, this is another classic... You're going to go back to software on this one because how much do you partner into existing platforms versus try to get them to partner into yours?
Anton Troynikov: (1:14:48) Yeah. I think so. Look, we like to work with the AI labs, with the LLM API providers because, again, when their API gets better, we get better. And when we get better, more people have reason to use their API. Right? So it's this very virtuous cycle of working with all of them together to make sure we're shipping the right thing to our users and customers. And again, we've got great relationships with the application development frameworks for LLM applications too because it just makes sense. If retrieval works well, then the rest of your application is going to work better for the same reason. We are building, as I mentioned, our sort of evaluation endpoint where we can consume evaluations which are produced by some of these eval platforms which are coming online. We have spoken with the people doing ETL for this kind of unstructured data and feeding it into Chroma from other places. I think it makes sense for us to work with a wide fraction of the ecosystem. I think that Chroma's remit is well scoped. We understand what we're building and what we're building for. Certainly, we're out of the first part of the exploration phase of our products. We know what we have to do. And so we feel pretty comfortable working with a lot of people to do that.
Nathan Labenz: (1:15:51) Do you think OpenAI will build their own database at some point?
Anton Troynikov: (1:15:55) Yeah, I've been thinking about this. Certainly, they've hired retrieval people. If you know where to look in the industry, you can see it. I think that in general, they are unlikely to release something for developers like we've built. I think that it's possible that they will want their own retrieval system internally. Certainly, the things that have been teased or quote unquote leaked about the developer day implies the existence of at least a lightweight retrieval system. Personally, I hope that they run Chroma because we're building for enterprise use case at their scale. I don't think that they have a great reason to build their own, to be honest. I think that this is a pretty significant undertaking on its own. It's truly... And I don't see it as something that is necessarily in the main line of what they have to build. I think that first of all, I do believe that retrieval is a really important component of getting to AGI, but I think building the nuts and bolts of the retrieval system itself is probably not something you need to worry about too much. So I think it's possible. OpenAI has greatly expanded its team recently. I know that they've been working with retrieval, but I don't think it's in their best interest to build something. And certainly, I doubt that they'll build something they're going to hand to developers. They may have something that's a service where you upload your data or plug in your data here. But if they have that, then it's also good for us because the thing that they plug into should be Chroma.
Nathan Labenz: (1:17:14) Yeah. It seems like GPT Enterprise is the first place where this might show up as something that is added for them as a feature, but not necessarily a service.
Anton Troynikov: (1:17:25) Behind the scenes, OpenAI has been doing enterprise deals pretty early in sort of enterprise integrations, and we know that they go in with retrieval vendors on some of those deals. We'd like to be in on those too. I think that that's a pretty clear sign to us that they don't intend to actually build these systems for their customers. They just want their customers to pay for GPT, which makes total sense, right? If we enable people to pay more for GPT, we're happy to do it. And again, you can look at our relationship with Google as an example here. Google have highlighted us as a launch partner for PaLM 2, the PaLM 2 public API as an end-to-end thing and one of the reasons for that is just how developer friendly we are. We have great rapport with the developer community. Our tooling is the easiest to get up and running with and people do use this in production. I think that for us, it's about staying friendly, it's about delivering the best product, and then given that the best version of our product makes more people use OpenAI's actual product as opposed to some ancillary piece of software, it makes sense for us to build this and them to build models.
Nathan Labenz: (1:18:25) How much do you think about Anthropic that has not come up today?
Anton Troynikov: (1:18:29) Yeah, quite a bit. I mean, look, I've been playing with Claude 2 quite a bit. I know that they have the very large context window, they do the thing where they do PDF ingestion. I've been playing with it. I think that Anthropic are on a slightly different path to a lot of the other labs, because their objectives as a research group are pretty substantially different. I think that things like Claude are intended to be test beds as well as commercial products, test beds for their approaches to a lot of these things. We haven't had much cause to speak with them yet. We did talk to them about using a retrieval system in the context of training data, but that use case looks more like the search index use case, where you want a scalable search index, and a lot of people use FAISS for that, and I think that that's fine. Although FAISS is kind of hard to scale, so I expect that we'll be deploying something there. Naturally, over time, the sorts of tasks and the sorts of applications that the different general purpose labs serve will converge to a similar set, and people will be choosing among those vendors. Given that they'll be choosing among those vendors, and given that we're providing certain capabilities to the people using OpenAI and Google, I don't see why we wouldn't provide the capability to Anthropic too.
Nathan Labenz: (1:19:32) I was going to ask if you have any updates to your p(doom) in light of the recent AI engineer slide that's made the rounds and...
Anton Troynikov: (1:19:38) Oh, that's so funny. I think there's absolutely no consequences for saying whatever number you want when you get asked that question. So people will just say fun stuff. I'm actually not really worried about figuring out doom. I'm much more worried or interested in, I would say, the question of can we actually get a reasoning system out of this approach or not? I don't know if you saw my fairly recent experiment with Game of Life. Did you see me doing that with GPT?
Nathan Labenz: (1:20:07) Tiny bit, but I'll say no.
Anton Troynikov: (1:20:09) I'll send you a link, right? So I was like, okay, can GPT apply its own stated rules in a deterministic fashion even when the rules are extremely complex? Because people were a while ago talking about, oh, it can play chess and then I was like, okay, well how come it plays chess badly when you give it random positions? And then people were like, well, it's actually emulating bad chess players. I'm like, damn, that is a very strong statement of what it's able to represent internally, right? It's not only knows the rules of chess, not only knows what optimal chess strategy is, but it also knows how a bad player would play so it's able to condition on that. That seems insane. That is a very strong statement to me. And so I came up with instead the weakest possible version of that statement that I could, which was, can... first of all, does GPT know the rules of Conway's Game of Life? And if you ask it, it will tell you. And then does it apply those rules consistently or correctly? So I actually cared less about can GPT play Game of Life? I 100% believe it's possible to get a transformer to play Conway's Game of Life, 100%. I'm sure it's possible, right? You just feed it enough data about the state transitions, it'll be fine. There's only 512 of them. It's really not that hard. What I wanted to find out though was there's plenty of information in GPT's training set about the rules of Game of Life and I'm sure it even has lots of examples of the interesting states of Game of Life like a glider. So it probably has plenty of information of interesting state transitions. Oh, look, Game of Life can do all these general purpose stuff, but what I'm interested in is can it just apply the rules of Game of Life consistently, right? And then perhaps correctly. And I gave it the rules as a word problem which was like N cells are alive, K cells are dead, the center cell is alive. What is the next state of the center cell? Got that 100%. Cool. That's confirmation that it has at least one representation internally of Conway's Game of Life rules. And then I presented the same information to it in a grid, the 512 possible state transitions. And then I said, okay, well, it actually does pretty badly at that. It repeats the rules to me and then it does it wrong even when I tell it explicitly what to do and even when I'm very careful to represent the grid as individual tokens for each cell, it doesn't do it right. And then I said, okay, well, it doesn't matter if it does it right or wrong, what I want to know is does it have an internal set of rules that it's applying, right? Because you would expect that if it has rules for Game of Life, it will apply them across representations and in different settings. And so I ran this experiment where I fed it 10 by 10 random boards, random binary boards representing Game of Life states and I asked it to iterate just once on those states. And what I wanted to measure was not does it get it right because we already know that it didn't. What I'm trying to measure is does it do it consistently? So first of all, does it do it consistently for that representation and then does it do it consistently with the rules that it actually executed when you give it all the state transitions individually? And the answer is no. It's consistent inside the 10 by 10 grid representation, so it tends to do the same things with the same state transitions. 
Although you can look at a very interesting graph that I'll send you, which is probably the most interesting part of my result. But it does not apply its own rules stated in one representation to this other representation, which to me points to no world model. But that doesn't answer the question, it's just a piece of evidence, it's just an angle of attack. So these are the questions that I'm actually much more interested in right now. Can we arrive at a general purpose reasoning system? Because I think arriving at a real reasoning system is the thing that we require for this technology to really work. I think we can already get value out of it today, don't get me wrong. We can get value out of it in domains which are subject to human interpretation as opposed to exact results. But I think the real power of this is if we can use it as a general purpose computing machine, and I don't think we're there yet. I stopped arguing about doom and I started thinking about much more practical questions.
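For concreteness, the "512 state transitions" are the 2^9 possible 3x3 neighborhoods. Enumerating them with the standard Life rule gives the ground truth table against which a model's answers, gathered however you prompt for them, can be checked for consistency; the `model_answers` dict here is hypothetical.

```python
from itertools import product

def next_center(cells):
    """cells: 9 values (0/1) for a 3x3 neighborhood, row-major; index 4 is the center.
    Standard Conway rule: a live cell survives with 2 or 3 live neighbors,
    a dead cell becomes alive with exactly 3."""
    center = cells[4]
    neighbors = sum(cells) - center
    return 1 if (neighbors == 3 or (center == 1 and neighbors == 2)) else 0

# All 512 possible neighborhoods and their ground-truth next center state.
truth = {cells: next_center(cells) for cells in product((0, 1), repeat=9)}

def consistency(model_answers):
    """model_answers: dict mapping a 9-tuple neighborhood to the model's predicted
    next state (hypothetical, collected by prompting). Returns the fraction that
    matches the actual rule."""
    return sum(model_answers[c] == truth[c] for c in model_answers) / len(model_answers)
```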
Nathan Labenz: (1:23:34) Yeah, I will have to return to this in greater depth in a future episode. This has been a ton of fun. My quick reactions are: it doesn't sound like no world model, it sounds like an incomplete world model. And I guess I would also ask, how general is general?
Anton Troynikov: (1:23:51) Take a look at my thread, see what you think. I'm very curious about what you think. There's one particular piece of the result that I found tremendously interesting, which is that how random the model is is conditional on certain decisions that it makes, and that was very interesting. And you know what's interesting? How random it is is conditional on how often those states appear. So the less often those states appear in general, statistically, the more random the model's state transition actually is, which is again evidence against it for me, but you can argue otherwise. Let me send you the thread. This has been a lot of fun. I really enjoyed this conversation.
Nathan Labenz: (1:24:24) This has been a really good expert level discussion, which is what we try to deliver. So thank you for being a part of the Cognitive Revolution, Anton Troynikov. It is both energizing and enlightening to hear why people listen and learn what they value about the show. So please don't hesitate to reach out via email at tcr@turpentine.co, or you can DM me on the social media platform of your choice.