Unlocking Enterprise Data with Knowledge Graphs | Juan Sequeda, Head of AI Lab at data.world

Nathan and Juan Sequeda explore how knowledge graphs serve as an AI 'brain' for organizations, integrating data for enhanced enterprise AI systems.

Video Description

In this episode, Nathan sits down with Juan Sequeda, Principal Scientist and Head of AI Lab at data.world. They discuss how knowledge graphs can be your organization's "brain" for AI, integrating structured and unstructured data, benchmarking enterprise AI systems, and more.
If you need an ecommerce platform, check out our sponsor Shopify: https://shopify.com/cognitive for a $1/month trial period.

We're hiring across the board at Turpentine and for Erik's personal team on other projects he's incubating. He's hiring a Chief of Staff, EA, Head of Special Projects, Investment Associate, and more. For a list of JDs, check out: eriktorenberg.com.

---

LINKS:
Data.world: https://data.world/

SPONSORS:
Shopify is the global commerce platform that helps you sell at every stage of your business. Shopify powers 10% of ALL eCommerce in the US. And Shopify's the global force behind Allbirds, Rothy's, and Brooklinen, and 1,000,000s of other entrepreneurs across 175 countries. From their all-in-one e-commerce platform, to their in-person POS system – wherever and whatever you're selling, Shopify's got you covered. With free Shopify Magic, sell more with less effort by whipping up captivating content that converts – from blog posts to product descriptions using AI. Sign up for $1/month trial period: https://shopify.com/cognitive

Omneky is an omnichannel creative generation platform that lets you launch hundreds of thousands of ad iterations that actually work, customized across all platforms, with a click of a button. Omneky combines generative AI and real-time advertising data. Mention "Cog Rev" for 10% off: www.omneky.com

NetSuite has been providing financial software for all your business needs for 25 years. More than 36,000 businesses have already upgraded to NetSuite by Oracle, gaining visibility and control over their financials, inventory, HR, eCommerce, and more. If you're looking for an ERP platform ✅ head to NetSuite: http://netsuite.com/cognitive and download your own customized KPI checklist.

X/SOCIALS:
@labenz (Nathan)
@juansequeda (Juan)
@datadotworld (data.world)
@CogRev_Podcast

TIMESTAMPS:
(00:00:00) - Introduction to Juan Sequeda and data.world
(00:01:11) - Discussion on data and generative AI
(00:06:15) - Data.world's origins as an open data catalog platform ("Github for data")
(00:09:35) - Using knowledge graphs and semantics to integrate and query data
(00:12:52) - Main use cases for data catalogs: search/discovery, governance, data operations
(00:15:00) - The process of building knowledge graphs automatically from data
(00:16:07) - Sponsor: Shopify
(00:24:29) - AI for unlocking and capturing tribal business knowledge
(00:31:00) - The importance of semantics
(00:32:59) - Sponsors: NetSuite | Omneky
(00:34:24) - Understanding the data landscape in enterprises
(00:38:32) - The emergence of knowledge engineers and data product managers
(00:40:44) - The consumer experience in data.world
(00:45:36) - The importance of context in data analysis
(00:46:58) - The role of AI in improving data analysis
(00:48:08) - The importance of accuracy and explainability in data analysis
(00:50:08) - Question cataloging in data analysis
(00:51:44) - The future potential of "chat with your data" interfaces
(01:18:02) - Finetuning with data vs metadata
(01:29:24) - Future of enterprise data teams



Full Transcript


Juan Sequeda: (0:00)

This is the essence of what my organization is. This is the brain. Everything here is accurate. I can use this to explain things. The LLMs, these foundational models, don't have that accuracy, don't have that explainability, don't know my organization. Do we expect these foundational models to know every single organization? No, because this information is private. I don't want them to know, but I want to use them. So that's why this combination of these foundational models, large language models, with your internal brain of the organization, which is your knowledge graph, that's where I see the future. And I think what we need to work on is understanding that integration point between the knowledge graph, the brain of the organization, and these foundational models.

Nathan Labenz: (0:44)

Hello, and welcome to the Cognitive Revolution, where we interview visionary researchers, entrepreneurs, and builders working on the frontier of artificial intelligence. Each week, we'll explore their revolutionary ideas, and together we'll build a picture of how AI technology will transform work, life, and society in the coming years. I'm Nathan Labenz, joined by my co-host, Erik Torenberg.

Hello, and welcome back to the Cognitive Revolution. Today, we're talking data and how generative AI interacts with enterprise data with my guest, Juan Sequeda, Principal Scientist and Head of the AI Lab at Data.World.

Data.World, like many established companies that we've featured on the show over time, built much of its platform, products, team, and business before the current generative AI moment, and is now working to make sure that it's taking full advantage of this new technology paradigm for its customers. This is part of an economy-wide trend. Frontier generative AI models are now making their way into all sorts of high-value, often very challenging contexts, from process chemistry labs to novel protein design to US federal courtrooms to the practice of medical diagnosis to enterprise data lakes. GPT-4 demonstrated enough raw power that we now have experts in literally every field working day in and day out to figure out how to make generative AI work for their organizations.

Now, it's not all instant success. From Juan's work, for example, it's clear that early benchmarks understate the complexity of real-world enterprise data and that naive chat-with-your-data type implementations are not up to enterprise challenges. Even the more advanced work that Juan and his team are doing with knowledge graphs, while it does deliver major improvement, is still at best a partial solution. So while Data.World, again, like most established technology businesses I've talked to, is not particularly worried about competition from fast-moving AI startups, they do see the transformative potential, and they are realizing enough practical utility today that they are rolling up their sleeves and settling in to the process of AI implementation and optimization.

Interestingly, zooming out, I think this creates the potential for another major phase change in the history of AI. While GPT-4 takes meaningful effort to implement and often still falls short of our dreams, we are nevertheless building foundational capacity for both last-mile distribution and customization, such that the next big model release will have the opportunity to almost immediately plug into many millions of live business processes and systems. The electrification of America took 60 years and included significant public works projects, and all the appliances were designed with a clear understanding of exactly how they'd be supplied with electrical power. Today, in contrast, we live in an internet-mediated, software-enabled world in which updates can quickly be pushed to everyone, everywhere, all at once.

Today's software application developers are building somewhat ahead of AI capability, both so that they can deliver frontier features to their users today, and more importantly, so that they're ready to flip the switch when the next big advance comes online. How many months we'll need to wait before GPT-4.5 or GPT-5, I don't know. And it might be just long enough that folks start to wonder whether AI generally is underpowered and overhyped. But my expectation remains very clear that we will continue to see additional jumps in capability and that with each future leap, considering the foundation now being laid, the deployment cycles will naturally get shorter and more disruptive.

One note for listeners: enterprise data is a complicated space, and we spend some time in the first half of this conversation discussing the general state of enterprise data and data science teams today. If you're already well-versed in enterprise data science, you might find some of this a bit basic and introductory. But for most, I think the additional context will be very useful. If you have any suggestions for how I can better handle the introduction of such advanced topics, where listeners inevitably have very different levels of prior knowledge (and I expect this to be more and more common throughout 2024), I would love to hear them. As always, you can email us at tcr@turpentine.co, or you can DM me on the social media platform of your choice. And of course, we always appreciate it when you share the show with friends, starting in this case with the data scientists in your life.

Now, I hope you enjoy this conversation about generative AI and enterprise data with Juan Sequeda, Principal Scientist and Head of the AI Lab at Data.World.

Juan Sequeda, Principal Scientist and Head of the AI Lab at Data.World, welcome to the Cognitive Revolution.

Juan Sequeda: (5:37)

Well, thank you so much for having me. Really excited about having this conversation.

Nathan Labenz: (5:40)

Yeah, me too. I've been interested in knowledge graphs recently. I've been feeling from a few different projects that I'm working on that more structured data, something between your classic SQL database and the totally amorphous data that people are just increasingly vectorizing and hoping for the best from, is probably going to be important. And I went on a little quest to see who's working on this and what's the state of the art, and came across a paper that you had published. So I'm excited to get into that with you. But I thought maybe we should just start with an introduction to you and the company, because I don't know how many listeners will be familiar. Do you want to just give us a quick intro to Data.World? How did the company get started? Was it an AI company at the beginning? We can take it from there.

Juan Sequeda: (6:32)

Yeah, happy to. So Data.World is an enterprise data catalog platform. A data catalog platform is essentially an inventory for all the data and the metadata within your organization. Basically, your library card catalog. That's the high-level concept. Managing metadata has been around since the beginning of data management, but I would argue that it didn't really become a focus in the enterprise until the last five or six years.

If I zoom out and we look at what the data management world has been, it's always been about moving data. So ETL and ELT, storage and compute of data, so your data warehouses, your data lakes and such, and then you're using your data for analytics, dashboards, machine learning, and all that stuff. And then there's always been this thing on top that we haven't really focused on, which is the metadata, the data about the data. And that is what data catalogs are focused on, being able to understand what's going across all the stuff that's being moved, all the different data sources, tables and columns that you have, keeping track of all your dashboards and how they're connected across all the different sources, and then also keeping track of your business terminology, your business glossaries and everything. So that's what a data catalog is.

Data.World started in February 2016, so very early on. And the focus was to be a GitHub for data: the first phase of the company was to be a catalog for open datasets. That was the focus for a couple of years, and we continue to be the world's largest open data catalog. We have over 2 million users. I think two-thirds of Fortune 500 companies are on Data.World's open data platform. There are half a million datasets. During COVID, all the COVID data was housed on Data.World. It was completely open, and it continues to be open.

And we're actually a public benefit corporation. So that means that in addition to being a C corp and maximizing shareholder value, we have a public benefit mission which we're evaluated on. It's to be the world's largest and most abundant data resource, advocate for open data and linked data standards, and to be an archive for the world's data. So those are our public benefit missions.

And during that first phase of the company, the goal from a technology perspective was to create a platform that can scale at web scale. And that's what we have with over 2 million users. We really built a platform that we know could have high scale and be able to share data. So around 2019, the next phase of the company was, hey, we built this whole platform that the open data community is using. Within organizations, people want to share their data and find data and so forth. So that's how we entered into what's now being called the data catalog. People are now doing data marketplaces and such. That's where that fits in. So that's the phase of the company that we're in right now.

Now, from a technology perspective, from day zero, basically, the entire platform has been architected on top of a knowledge graph architecture. First of all, we're all about open standards; that's part of being a public benefit corporation. So we use the open web standards: RDF, which is the metadata graph standard; OWL, which is for ontologies and schemas; and SPARQL, which is the graph query language. That's the architecture we use. So basically everything that we bring into Data.World is turned into a graph.
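
To make those standards concrete, here is a minimal, illustrative sketch in Python using the rdflib library. The catalog vocabulary and triples are assumptions invented for the example, not data.world's actual schema; the point is just that facts become edges in an RDF graph, and questions become SPARQL queries over it.

```python
# Minimal sketch of the RDF/SPARQL stack described above, using rdflib.
# The "cat" vocabulary is a hypothetical stand-in, not data.world's schema.
from rdflib import Graph, Literal, Namespace, RDF

CAT = Namespace("http://example.org/catalog#")

g = Graph()
g.bind("cat", CAT)

# Everything brought into the platform becomes triples in one graph:
g.add((CAT.sales_db, RDF.type, CAT.Database))
g.add((CAT.orders, RDF.type, CAT.Table))
g.add((CAT.orders, CAT.partOf, CAT.sales_db))
g.add((CAT.revenue_dash, RDF.type, CAT.Dashboard))
g.add((CAT.revenue_dash, CAT.builtFrom, CAT.orders))
g.add((CAT.revenue_dash, CAT.createdBy, Literal("alice")))

# SPARQL is the query language over that graph:
results = g.query("""
    PREFIX cat: <http://example.org/catalog#>
    SELECT ?dash ?table WHERE {
        ?dash a cat:Dashboard ;
              cat:builtFrom ?table .
        ?table cat:partOf cat:sales_db .
    }
""")
for dash, table in results:
    print(dash, "is built from", table)
```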

Why graph? Remember what I was saying: you're moving data, storing data, using data. There are all these pieces that get connected across all those pipelines. You want to keep track of things and how they're being connected, and that's a graph. I always say that your first application over a knowledge graph is really the management of all your metadata to understand basically what is that infrastructure that I have within my organization. To understand, hey, this data comes from this source and then it gets moved to this thing, loads to this data warehouse or this application. There are people who are using it and so forth. So you're keeping track of all of that.

And now with AI, what happens is that you need data. Data is the foundation of this. I'd argue that we've always been a data company and data is the lifeblood for AI. So the data catalog and being able to manage your metadata, manage your semantics, manage your context, that is critical for generative AI, because generative AI doesn't have the semantics, the context, the knowledge of your organization. The LLM is trained on whatever it's been trained on, but it doesn't have your own internal context.

So we've kind of seen this coming over the last year and a half, since LLMs came out. For this to be applicable in the organization at a scale where you need accuracy and explainability, knowledge graphs are critical, because they provide all that context that the LLMs don't have. And that's what we're seeing and what we've been talking about, and the paper that you mentioned originally was the evidence that we wanted to put together to say, yes, knowledge graphs are critical for this, assuming you want accurate answers to your questions.

Nathan Labenz: (11:56)

Let me just ask you to get a little bit more concrete on a couple of things. The 2 million users and half a million datasets, that's awesome. The public benefit aspect of that is really cool. I'd love to hear a little bit more about those datasets. Are there big attractors? What are people coming to you for? I imagine that, again, most people probably just don't know. And then I'm also really curious to get a little bit more concrete on, okay, but what is a knowledge graph? How do they typically get created? And what role does Data.World play in helping enterprise customers do that work?

Juan Sequeda: (12:39)

So there's the open data catalog, which is open for folks, and then people want the data catalog architecture to create their catalog internally. So I see there are usually three main applications that our customers will use Data.World for. One is search and discovery. I don't know what data I have. It's a typical problem. You have data scientists looking for data and they don't know where to find it. Your data lake has turned into a data swamp, that kind of situation. So they need to have a way to search and find the data they have. So that's usually one of the applications. Search is something that we're very focused on.

The same way you search things on Google today, and if you search on Google for Austin, you get that panel on the side. Oh, Austin is the capital of Texas, and here are all these events going on and here's the weather and such. That is what Google calls a knowledge panel. And that is results coming from their graph that they've built. So we do the exact same thing.

So if you think about the knowledge graph that we build automatically when we bring your data, think about it from the concepts that show up. Hey, there's a database. A database has tables. Tables have columns. There are dashboards. There are different types of dashboards, Tableau dashboards, Power BI dashboards. This dashboard was built using data from this table or this query over these tables. You start to see how things get related and connected. There's a person who created this dashboard. And then there are policies that we're defining. Who can... There's PII. There's personal information. This table has PII data. So all of these things get connected.

So that's the graph that happens underneath. And when we connect and extract the metadata from all these sources, we build that graph automatically for you for all that technical metadata that we're seeing. So that's the digital landscape of all your data and how it's connected within your organization. So search is one of those applications over that big graph of your metadata.

Another one is around governance. I want to know, again, keep track of all the policies around our data. Who can use what data for what reasons? Who can access this data? Where is all the sensitive data and so forth? So that's the governance side of things.

And then the third one we see a lot is for DataOps, the operations of data, making sure that data's getting moved from different places. Again, keeping track of the movement is what we call the lineage. Oh, this dashboard extracts data from this table. This table comes from this source and so forth. So you have that whole graph and lineage of where data is coming from, how it's living. And then you want to make sure that, hey, if I'm going to make a change somewhere, what could that affect? Or if somebody is saying I have an issue in this particular feed of data, where could it be? I can follow that. So that's all the operations of data. And then you can be in your dashboard in Tableau and then you can actually get notifications saying, hey, there was an issue just reported. So just be very careful with the data that's being shown here, and it's going to be fixed and so forth. So that's the third application.

So to summarize, data catalogs and our users, they use it for three main things, which is for search and discovery, for data governance, and for DataOps. And at the end of the day, all of this is connected and we build that graph of that metadata all automatically once we have our collectors up and running.
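
As a rough illustration of the lineage and impact-analysis questions Juan describes, here is a small follow-on sketch, again with rdflib. The cat:feeds predicate is a hypothetical lineage edge invented for the example; SPARQL property paths (feeds+) then make "what is downstream of this table?" a one-line query.

```python
# Sketch of impact analysis over a lineage graph: if raw_orders changes,
# which downstream tables and dashboards are affected?
from rdflib import Graph, Namespace

CAT = Namespace("http://example.org/catalog#")
g = Graph()

# Hypothetical lineage edges: source -> staging -> warehouse -> dashboard
g.add((CAT.raw_orders, CAT.feeds, CAT.stg_orders))
g.add((CAT.stg_orders, CAT.feeds, CAT.wh_orders))
g.add((CAT.wh_orders, CAT.feeds, CAT.revenue_dash))

# The + property path walks cat:feeds edges transitively:
impacted = g.query("""
    PREFIX cat: <http://example.org/catalog#>
    SELECT ?downstream WHERE {
        cat:raw_orders cat:feeds+ ?downstream .
    }
""")
for (node,) in impacted:
    print("Changing raw_orders affects:", node)
```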

Nathan Labenz: (16:04)

Hey, we'll continue our interview in a moment after a word from our sponsors.

I'm really struck by how many of those things seem like they may already be undergoing significant transformation in the AI era. 2016, obviously, not pre-AI, but pre-transformer, certainly pre-large language models, certainly pre any kind of general conversational interface with a database. I guess I'd love to hear a little bit more about how you build a knowledge graph automatically in a pre-AI era and how that might be evolving, and then also the way the data is stored.

I had Anton, who's the CTO at Chroma, one of the high-flying new vector databases, on as a guest on the show not too long ago, and he made this really interesting point that the vast majority of data that goes into their databases has never been in a database before. When you were first building knowledge graphs automatically, how did you approach that? How is that changing? And how is the actual nature of the data beginning to change in light of what AI can now do with it?

Juan Sequeda: (17:21)

First of all, we need to think about, when it comes to knowledge graphs, there is the notion of the schema, of the ontology. Let's define the semantics and meaning of a particular domain. So everything I've been talking about up to now, when it comes to data cataloging, the domain there is this technical metadata. Now just think about the map, or let's draw the bubbles and lines on the whiteboard. You have a table as part of a database. You have a dashboard. A dashboard takes data from a table. That's the schema that we're creating.

That's one knowledge graph. We can create knowledge graphs about any type of topic. So for example, let's create a knowledge graph around e-commerce. Well, you have customers. You have orders. A customer places an order. An order consists of a set of order lines. An order line can have products. And the orders are shipped to an address; you could have a shipping address and a billing address. That's basically the schema you've started defining. Then you populate that with data that may be coming from so many different types of sources. It can come from structured sources, relational databases. Some of it can come from unstructured sources, from text and so forth. It can come in from JSON feeds or from APIs or whatever.
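
A minimal sketch of that whiteboard exercise in code, with the classes ("bubbles") and properties ("lines") expressed in RDFS/OWL terms via rdflib. All of the names are illustrative assumptions, not a standard e-commerce ontology.

```python
# Sketch of the e-commerce ontology: classes and the properties linking them.
from rdflib import Graph, Namespace, RDF, RDFS
from rdflib.namespace import OWL

EC = Namespace("http://example.org/ecommerce#")
onto = Graph()
onto.bind("ec", EC)

# Classes: the bubbles on the whiteboard
for cls in (EC.Customer, EC.Order, EC.OrderLine, EC.Product, EC.Address):
    onto.add((cls, RDF.type, OWL.Class))

# Properties: the lines between the bubbles
def add_property(p, domain, range_):
    onto.add((p, RDF.type, OWL.ObjectProperty))
    onto.add((p, RDFS.domain, domain))
    onto.add((p, RDFS.range, range_))

add_property(EC.placesOrder, EC.Customer, EC.Order)
add_property(EC.hasOrderLine, EC.Order, EC.OrderLine)
add_property(EC.hasProduct, EC.OrderLine, EC.Product)
add_property(EC.shippedTo, EC.Order, EC.Address)
add_property(EC.billedTo, EC.Order, EC.Address)

print(onto.serialize(format="turtle"))
```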

So the knowledge graphs are a means of integrating data coming from so many different types of sources, and you can model anything that you want. Now, what takes time always is what I call the knowledge engineering process. It's understanding: what do you mean by a customer? What do you mean by an order, or what is an active order, and so forth. And I think this is the human nature of the problem, because you go ask multiple people and they'll probably have different interpretations. So what is the correct version? What is the correct answer? I've always argued that if even humans won't agree, how will the machines know? The machines will definitely be able to generate suggestions: oh, it could be this, it could be this, it could be this. But at the end, the humans need to say, no, our decision is going to be this definition. And that's, by the way, where governance comes in, because you want to have an agreement on what things mean. Because otherwise, depending on the type of business you're in, you can have different types of chaos.

So creating these ontologies means people understanding what this data actually means. Now, what's interesting with LLMs today is they're helping a lot. Experiments we're doing use LLMs to help accelerate that process: hey, create candidate ways of modeling these things. And how about, through a chatbot, going to talk to the end users to extract the stuff that's in their heads, to say, how do we unify what a customer is? How do we unify what an order is? So I think that's something that we're seeing right now.

So then when it comes to the vector databases, I think one thing that we're seeing today is that the low-hanging fruit, being able to show some really cool AI applications, specifically when it comes to chatting with your data, is usually about unstructured data. So it's the text, as you said, the stuff that has never been inside of a database. It's your documents, your PDF files, and in other places you have images and so forth. And that's where vector databases come in, and you have to do all of this chunking and all the bits and pieces for the unstructured side.
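
For that unstructured path, here is a minimal sketch of the chunk-embed-retrieve loop using Chroma, the vector database whose CTO Nathan mentions above. The document text, collection name, and naive fixed-size chunking are stand-ins; real pipelines split on document structure and use tuned embedding models.

```python
# Sketch of chunking a document and retrieving relevant chunks with Chroma.
# Uses Chroma's default embedding model; all content here is made up.
import chromadb

client = chromadb.Client()  # in-memory instance
collection = client.create_collection("policy_docs")

doc = ("Customers may return unopened products within 30 days. "
       "Refunds are issued to the original payment method.")

# Naive fixed-size chunking, purely for illustration:
chunks = [doc[i:i + 60] for i in range(0, len(doc), 60)]
collection.add(
    documents=chunks,
    ids=[f"chunk-{i}" for i in range(len(chunks))],
)

# Retrieval: embed the question and find the nearest chunks.
hits = collection.query(query_texts=["What is the return window?"], n_results=2)
print(hits["documents"])
```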

When it comes to structured data, the data that's in your relational databases and your data lakes, coming in with thousands of tables and tens of thousands of columns, that structure has meaning. A user, a human being, will look into that table and go, oh, this is what this means. You want to add these two columns together, but not these other ones. There's knowledge behind that. And I think that's harder to do, based on what I was talking about before. It's like, what does this mean? You've got to go talk to too many people. That's not a scalable thing to do. Again, LLMs today are helping us, and they're going to help us, because now we can have these chatbots that help us acquire knowledge.

So the low-hanging fruit has been focusing on the unstructured. But I would argue that the untapped potential here is to be able to focus on the structured data, because this is the data that goes into your reports, into your dashboards. That's the stuff that executives and the boards are looking at those graphs to make decisions. And that's the data that's coming from your CRMs or your ERP system. This is all very structured data. And those types of questions that people are asking, they expect accurate answers and they expect to be able to explain where those answers come from because they need to trust them.

Well, if I'm asking questions over text, I'm like, oh, I mean, you have a little bit more freedom over there. And then I can point you to this larger document where a human would go in. So I think that's why I'm arguing the low-hanging fruit is focusing on unstructured and there's just so much of it. So we can be very busy for a long time and provide a lot of value. But I think where there is tremendous value, one could argue and I would argue here that probably even much higher value, is if I can chat with my structured SQL data, because that's where your critical decision-making is happening for the organization.

Nathan Labenz: (22:41)

What you're describing there, I think, connects to a couple of big themes that I notice across a lot of different areas of AI today. One is just how much knowledge is kind of undocumented. A friend of mine used to call this tribal knowledge. It's the know-how that people share by watching each other do tasks over their shoulder perhaps, or on a screen share perhaps more likely today, or in a quick chat where people ask each other questions. And that knowledge, because it's not localized anywhere, is very hard to make available to anyone. But we're starting to really feel that acutely now in the AI moment where people are like, well, why aren't these agents working? And often the answer is, because they're very generic and they don't have any of the context, or the sort of... what feels familiar to you is just totally alien to them.

So it's interesting that this problem, I guess, has been felt acutely enough in the data world even before AI that you guys have built a whole platform about it. Could you sketch a little bit more there? Typical enterprise customer? I have a little bit of a sense for this. I did a project once. I was a very junior consultant at a firm that was working with Washington Mutual, which was once the country's biggest thrift, as they were called. And it was in that mortgage crisis moment that I was doing this project, and I do remember just the absolute total gnarliness of the data, just how many different columns, and, you know, all these columns derived from other columns. It was kind of the data swamp situation. I only saw a corner of it in my brief engagement before the bank, or the thrift, went bust.

But maybe you could give us a little bit more color on typical enterprise. Where are they today? They obviously have a lot of data, but are we still in a moment where it's spread across tons of systems and people are super siloed and nobody knows where things are? In my case, I remember vividly completing an analysis and then having somebody look at it and be like, this is totally wrong. And me being like, I would have had no way of knowing this. I literally could not... all these suffixes on these columns, they didn't mean anything to me.

Juan Sequeda: (25:23)

This is exactly it. And here's the thing: the problems that we deal with today are the same problems that people have been describing 10 years ago, 20 years ago, 30 years ago, even more, since the beginning of the modern digital enterprise, since going into warehouses and so forth, back in the nineties or even before that. So this is my annoyance with the entire world. Can you imagine that the problems we're complaining about today are literally the same problems we've been complaining about 30 years ago? We're kind of messed up if we're still trying to solve the same problem. Now, things are getting easier. Cloud has made things a little bit easier to do. We have self-service that lets people do more things, but essentially the problems are the same. So let's ask ourselves, what have we been doing for the last 30-plus years? And then I'm thinking, wait, right now AI is coming. Is this magically going to solve this? No, we've got to understand our history. This is so important. So let me go through a couple of examples that I've seen across my career.

At the end of the day, with data, your goal is to answer some questions. You want to be able to answer some questions to understand how the business is doing, to make sure that you're going to make a decision that is going to help us make more money or save money. You make money, you save money, you want to mitigate risk. Three main things you want to be able to achieve in your organization. So you have a question you want to ask. You got asked a question and you need data to answer this question. Where's that data?

One approach, I call it the spreadsheet approach. You basically go ask somebody who has the data. "I need data about X, Y, and Z." And that person will go off and send you a spreadsheet of X, Y, and Z. Then you get in, and you know Excel pretty well, and you do lookups, or maybe you have an Access database on your laptop and then you'll be able to do some data munging or whatever. And then you answer your question. There it is. Okay. So that's how we've always been doing things. Is that a scalable approach? Probably not, because then tomorrow I say, "I need data about X again because I need it for the new month," and they send it again and so forth. But I'm asking about X. I'm asking about orders. I'm asking about our latest customers. But how do we know we're talking about the same thing? How do we know you have that same interpretation and you're actually delivering the correct thing that I'm expecting? I don't know. But this is how we've been dealing with data for so long.

Another approach is what I call the query approach. You're like, "Look, you're asking me for this spreadsheet of data every month. What I'm going to do is give you access to the database directly. You can just query the database, you can connect your dashboard directly to the database, and here's the query that I use." Perfect. Now I can query in real time and do things. But depending on how SQL-savvy you are, you can go edit the queries, you can add things, and suddenly this query that was 15 lines long gets joined with something else and it's 30 lines and it's 100 lines. I've seen queries that are 10, 15 pages long and take 20 minutes to run, and the result of that query is what I'm going to use in my report to present to the CEO. That query has so much knowledge embedded into it. Do we actually know what's happening in there? And we're making billion-dollar decisions based on stuff like this. And everybody is doing this themselves. If I make a change in my query, how do I know that that's not a change that you should have made too? Or if somebody else made a change, how do I know it isn't one I should have picked up too? So that's the query approach.

And then you can go off and say, "Well, no, we need to have this one standard database or data warehouse where everything is well-aligned and we're all going to query this thing." And we go off and say, "Yes, let's go build the data warehouse." But what happens? That takes millions of dollars and many years to do. You get requirements, and then suddenly, you're like, "Oh, now you can go answer your question on this data warehouse that we've all invested so much time and money in." You go off and answer that. You now run the question, run your query again, answer that question. But what happens? You compare that answer to the way you've been trying to answer that question before. Do you think it's going to be the same? Probably not. So then you're like, "Wait, I have control of this way. I get one answer, and then I get the answer from this other system which I don't control. What am I going to trust?" You basically trust your process that you control. So this is why data warehouses fail, and they don't fail for technical reasons, they fail for social reasons because people don't trust them and so forth. And then the problem with that is I'm doing ETL and I've got to structure my data beforehand, and then we went off and said, "No, let's go ELT. Let's go dump our data into a lake." And then the lakes get all messy and they turn into swamps and then, you know, on and on.

So you can imagine, this is the story that we've been going through over and over again. And a theme throughout all of this is keeping track of what something means. This is the meaning. This is the semantics. This is the knowledge. This is the metadata. And I think what has happened is that throughout the last 30-plus years, we've never focused on keeping track of the meaning, of managing the meaning. And now I'll argue that LLMs and all of this AI is making us realize, oh, if I want to trust that, I need to know what it means. And now my hope, and I'm starting to see this, is there is now a new focus on we need to invest in semantics. We need to invest in meaning and knowledge, and that's where knowledge graphs come through.

Nathan Labenz: (31:24)

Hey, we'll continue our interview in a moment after a word from our sponsors. One kind of extra little detour before we get into the paper and how it all works. I mean, obviously, there are a lot of data professionals these days, and they fall into different kinds of roles. I wonder if you could characterize in rough terms who are the people that you work with? What jobs do they have? What are the activities in those jobs? You kind of gave a sense of the different approaches, but if you were to actually just observe, do a time study of these folks, how much of their time are they spending on different kinds of activities? How much of it is that routine query running, the kind of process that they've done month after month? How much of it is data migrations or these kinds of mega projects that may or may not ever come to fruition? How much is ad hoc analysis, where leadership or whoever has one-off questions that haven't been asked before? I have no idea really how people are spending their time. It probably varies a lot by context, but I always find that to be an interesting foundational question to then ask, okay, well, which parts of that is AI going to impact? But for starters, maybe you could just characterize: what do they do? I never know what they're doing.

Juan Sequeda: (32:48)

So there's what I call the data producers and the data consumers. So there's the two types of folks in here. The data producers are going to be more in the backend, the technical side. They're the ones who are moving the data, migrating the data, creating, managing the data, making sure it exists and so forth. Then you have the consumers who are the ones who are searching for data, who want the data, have a particular task that they want to accomplish. They have a question that they need to answer. So those are the two types.

If we look at, let's focus on consumers. On the consumer side, you can have folks who are going to be your data analysts or your BI analysts. People who build their reports and dashboards. You have your data scientists or your machine learning engineers. They're saying, "I need to go find this data so I can go do some work to create a model. I create a dashboard because I'm trying to answer a question or a recommendation or something." So I'm consuming data to be able to accomplish that task. Now on that side, you always hear the 80/20 rule: 80% of the time I spend cleaning my data, and that's a problem.

I think, actually, another annoyance I have is when people say, "I have to clean my data. It's data janitorial work." I'm like, no, that's understanding the critical meaning of your data, and you're just sweeping it under the rug like, "Oh, those are annoying pieces." That is the essence of what the meaning of your business is in that data. And you're like, "Oh, just write some quick code, it's just annoying to do." That's the problem: we don't treat it as a first-class citizen.

And then on the producer side, they're the folks who are saying, "We got data coming from these sources and we get the requirements. So we hear that you want this data. So we're going to go move the data over here and now we need it to be more scalable. So we're going to the cloud. So we're going to use Snowflake or Databricks." And they know that the consumers want to be able to run these types of queries, so they're going to have the data set up in this way so they can do that. But they're the ones who are actually putting the data together. How do we know that the requirements that they get are actually being fulfilled correctly? Especially because the requirements from the consumer side are connected to the business, how the business thinks. They're talking about customers. They're talking about average order values. They're talking about all these metrics. Are the producers, the more technical side of folks, do they understand that? Is it very clear to them, "Oh, I can present the data this way or that way. How do I know which is the correct way?"

And I think that's always been a gap, what I call the producer-consumer gap. And that producer-consumer gap is there are roles that are missing. One of those types of roles is, I'm seeing this evolve now, I'm calling it the knowledge engineer or also the knowledge scientist. It's really the person or role who can work on both sides of the aisle. "Hey, I can be a non-technical people person, go understand what you're trying to do. I can draw the models on the whiteboard. Then I can go talk to the technical folks and say, okay, it's this data for this reason and so forth." And we're starting to also see this called the data product manager. You're bringing product management into data and stuff.

So I think we have this gap. There's the producers, the consumers, and I think one of the things, to try to get at your question, I can't give you numbers because I don't know these numbers, but I think there's a lot of this repetitiveness happening. We're wasting our time a lot. We don't know if this is correct. And a way that we should be able to kind of address that is by having a role which is focused on being that bridge. And it's a role where you're bringing in more of the people, the process side, doing product management, doing this knowledge acquisition. It's things that people are doing today but they're not doing it in a first-class way. They're just like, "Oh, that's a second-class thing. That's the annoying thing I have to do." I'm like, it's annoying to you, but it is critical, and somebody should take the ownership and accountability for that.

Nathan Labenz: (37:04)

Interesting. So that role, if I understand correctly, that would be essentially the power user of the data.world platform. I'm thinking almost like knowledge librarian, knowledge historian.

Juan Sequeda: (37:15)

So that's a good point. Yeah, it's like a librarian. I would argue that for data.world, you have all three of those. So for example, you're bringing in all your different sources. I need to deal with all that technical data stuff that's coming in. You want data.world to say, "Hey, I just got all this data and I just want to know that I'm keeping track of these 400, 500, 600 tables. These other thousands, I'm going to ignore." So I'm going to try to keep my inventory of stuff that's going on. We're moving our pipelines of data. All of that gets managed inside of data.world. So you have that for your data engineers, for the more technical folks.

And then you have your consumers of data who are searching for data. Those are the ones who will also come into data.world, but they're looking at it from a more consumer perspective. "Oh, I'm looking for data about customers." They can now search for that. They find those datasets. And then what you would want is the data product manager, these people in the middle saying, "Hey, I want to expose to you, I want to expose to the consumers this beautiful data product." Something that, just like your shopping experience in Amazon. If you go to Amazon and you're buying a water bottle, for example, well, there's not only one water bottle. There's probably hundreds of water bottles. You go as a consumer, you search for something and you're like, "Oh, these are ranked. I have reviews. I have pictures. There's metadata. There's a lot of descriptions." And which one are you going to trust? The ones that have great reviews, that have more stars, that have great pictures, that have more metadata, more descriptions. There's other ones that are like, "Wow, that doesn't have any pictures. I'm not going to trust that." So that's the experience of a consumer.

Now somebody had to put all that inside of Amazon, in this case, put inside of data.world, and that's kind of on the producer side. And how to decide who goes into what, I think that's where these data product management roles are coming in. And at the end, I think when you talk about the lifecycle of data and stuff, we're spending money to keep this running, keep this data. You should tell me what is the ROI on this. I'm like, "Look, we've invested all this stuff. We've created all these datasets. Look at all the people who are coming to consume this data. I can actually go survey the people and say, if I turn this off, how is this going to affect you?" I should be able to know these types of things. And you want to be able to keep track of all these metrics on how people are using the data and how they're accessing and searching also inside of your catalog and data.world. So I think data.world is something that just connects the entire landscape of data within an organization, from the technical all the way to the consumer and everything in the middle.

Nathan Labenz: (39:57)

That's kind of another angle on one of these big themes that I keep finding as AI impacts all sorts of different knowledge work. I have a thesis that the platforms in which people do their work are extremely well positioned to start to kind of create long-form narratives of what people are actually doing. And from that, perhaps begin to train future AI systems on this data, this sort of data. In this case, it's the data of how we work on data, but in other platforms, it's the data of how we work on other things. It could be creative. It could be, you name it, a million things.

Are you capturing these arcs? Because in a catalog, I could imagine different modes of interaction, where somebody maybe comes and says, "Oh, I have a question about what this particular thing means." They come, they get an answer, and then they go somewhere else and execute a query. And so you only see a glimpse of their work process. But I'm sure you have all these data connectors and things too. So I wonder how often people are actually going on a journey within the platform versus flipping in and out of the platform at different parts of their journey.

Juan Sequeda: (41:17)

This is a great observation you're making. So I think when you look at tools in general, the data catalog market, it is very separated. You manage all the metadata to search and to find things, and then I go to another place to actually access the data. One of the things that makes data.world different from the entire market is that we have both of those things. You can actually search and find the data, then you can actually access the data within data.world because we have virtualization, federation. I mean, your data continues to stay where it is. So you got your data in Snowflake or Databricks or whatever, your data stays there, but you can actually find it and access it through data.world. So we actually have your full view of everything.

Now you brought up something really interesting, which is something that I'm pushing people to do: we need to track and catalog not just all this technical metadata, but also keep track of all the questions people are asking, who's asking those questions, and what are the answers that they're getting back, the actual answers. Not just, "Oh, I looked for this data and here's this dataset, go do something with it." Because what we're seeing, and what we're actually working on at data.world, is: I want to ask a question. "Show me the amount of orders that have happened in the Eastern Region." I should be able to get, "Oh, here are your numbers by region, by date, over time." You get the actual answers to your question. So that's where the whole chatting with your data comes from. We're working on this stuff.

Now, what happens is that we should be keeping track of the questions and answers, because we should govern those also. We should keep track of these questions and what is the actual stamped answer for some of them. And what happens is that if you're asking a question, you should be able to say, "Hey, this other group of people have been asking very similar questions, and here are the answers that have already been stamped, because these are official things that we have to know for, I don't know, regulatory purposes, whatever." So you should know that type of information.

So I think that's something that we're going to start evolving. And then because we'll be able to learn from patterns that people are having. That's why, by the way, all of that just continues to get connected inside of your big graph. So remember I was saying, my original graph was you have databases and tables and columns and dashboards. Well, then I can extend that when I have questions. I have answers. I have people. This person asked this question. This question was executed. This question used an LLM to generate this query that was executed over this database and here's the answer. And Nathan is the steward who said, this is the official answer for this type of question. And so next time somebody asks a question like that, they can get a really similar suggested question and they have a governed answer to it. I think that's kind of where we're going, especially for enterprise scenarios where we're going to have accuracy and explainability.
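
A sketch of what that extended graph might look like, once more in rdflib. The question, query, and stewardship predicates are hypothetical names chosen to mirror Juan's description, not an actual data.world vocabulary.

```python
# Sketch of cataloging governed questions and answers in the metadata graph.
from rdflib import Graph, Literal, Namespace, RDF

CAT = Namespace("http://example.org/catalog#")
g = Graph()

g.add((CAT.q42, RDF.type, CAT.Question))
g.add((CAT.q42, CAT.text, Literal("Orders in the Eastern Region?")))
g.add((CAT.q42, CAT.askedBy, CAT.nathan))
g.add((CAT.q42, CAT.generatedQuery, CAT.sql_q42))        # produced by an LLM
g.add((CAT.sql_q42, CAT.executedOver, CAT.sales_db))
g.add((CAT.q42, CAT.hasAnswer, CAT.ans_q42))
g.add((CAT.ans_q42, CAT.stampedBy, CAT.steward_nathan))  # governed sign-off

# Later, a similar incoming question can be matched to the governed answer:
for answer in g.objects(CAT.q42, CAT.hasAnswer):
    print("Governed answer node:", answer)
```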

And then obviously you're going to have questions where you're just discovering things, where you're trying to be creative. So when it comes to question answering, we should also think about what types of questions there are. I think this is where a lot of the agents come in. If you think about it from a question-answering perspective, it's not just give me a question and I can answer it. It's like, okay, you're asking a question. What type of question is this? Is this a factual question? Is this a subjective question? Two different approaches. Is this a factual question about the knowledge? Is this a factual question about the data that I have? Do I even have the data to be able to answer this question? Or do you even have access, authorization, to be able to get answers to that? Maybe not for all of it, but for this subset you can. So then the agent should be able to understand what you're trying to do, understand what context you're coming from, and say, "Hey, I can't answer that question, but here's a subset of that question I can answer, and the rest I can't answer because of these other reasons." I think that's the type of stuff that we're heading towards. That's where I get really, really excited about AI right now.

Nathan Labenz: (45:18)

Yeah. Accuracy and explainability, those are two key things. I mean, it's not surprising, in the sense that AI models are all trained on data and very much directly derived from their data, but accuracy and explainability come up in all the AI systems work that I see. It's interesting that the exact same framing clearly predates AI in the data management work that now underlies the AI getting built on top.

When it comes to accuracy, do we have any good benchmarks, rules of thumb, reference points for how accurate humans are? This is one thing I often find everywhere I go, that people have no idea how accurate the humans are.

Juan Sequeda: (46:08)

I love that you're asking this question, and this is really how we need to have fierce conversations with each other and push back. So with all the research that we've been doing about accuracy, the first set of critics, what do they say? "Oh, but 60% isn't accurate enough." Yeah, of course I know it's not accurate enough. So what is accurate enough? 100%? 99%? What is it? Do you actually know? You got data today to answer a question, and you based a decision on it, moving millions of dollars. Do you know how accurate that was? Are you going to bet your life on the accuracy of that? Go ask that question. I don't think you know. You got that dashboard or that spreadsheet from Alice and Bob. Do you know how that work was done? You trust it exactly? Really? Do you?

People aren't asking those questions. They make the assumption that the machine needs to be accurate in everything, but then they don't even ask, they're not being critical about their own process to do that. I think this is the opportunity right now to be very critical about how do we evaluate ourselves. How are we doing this? And going back to how AI is impacting all the tasks that we're doing. It's like, let's go figure out how we're solving that problem today and say, "Wow, what is that process and can we improve that process?" And we figure out, "Oh, wow, there's a big bottleneck or that's something that is pretty sketchy and it's not working well." And you want to be able to bring in the machines, the AI to help with those things.

So I think it's really important to ask ourselves how we are solving these problems today, and be honest about it. It's like, "Oh, crap, that's something that may not be working." I remember once working with a customer who had been getting this report for years and making decisions based on it. And I went off and analyzed how they got that report, because they hadn't done it. And I'm like, "Do you know that to get this value over here, you've got some hard-coded values that are multiplying by 0.02 or whatever? That's happening." And they're like, "Why is that?" And then somebody said, "Oh, I remember two years ago we were doing some special discount." Well, guess what? The person who wrote that query hard-coded that discount, and it's been there for the last two years. So all of the decisions you've been making have that embedded in there. All your numbers have been wrong for the last two years, and you've been making decisions on them.

Now, the honest, no-BS take here is that the company may still thrive anyway. How big of a deal is that? So I think that's also a question: how critical are these things? Yeah, sometimes you're like, "Oh, that's good enough." I think it's also part of the culture that you have within your organization.

Nathan Labenz: (48:58)

I've experienced very similar things in a non... you know, at the company that I started, we did a lot of digital marketing for a while, and we both had our clients do it and then we did it ourselves. And we found just so many instances over time where you'd come to some initial conclusion and, you know, we really had to challenge ourselves. Digital marketing especially will punish you for being wrong on these things. We really had to challenge ourselves: Is this really right? Does it check out across everything else we know? Does it violate our world model? You know, dig in, dig in, dig in, as many layers as we could, to finally often find that there was some problem.

That happened at the Washington Mutual thing too. I remember one time where I was summing something and it was greater than the limit that it was supposed to be governed by. And it was like, "Well, what's happening here? Your query must be wrong." And it turned out, actually, no. Some other process in the bank had changed some underlying data and all of a sudden the sands had shifted under us.

But yeah, broadly, I find that I don't know, I mean, if I had to guess, I would estimate that a significant fraction of the data analysis that people are basing business decisions on is just fundamentally flawed to the point where, you know, they're kind of detached from reality. I don't know if we could get any more rigorous in our estimation of that, but for me, it has been a recurring theme that just, you know, somebody gives you the results of a query or a spreadsheet or whatever and tells you it's X, and I do not take that stuff at face value anymore.

We were just talking a little bit too about podcast metrics. Oh my god. We had one, we were looking at the success of the podcast, and it was like, "Oh, you know, we did really well in this time, and we did really poorly in this time by comparison. Look, there's a spike and there's a trough." And it turned out that it was the way that the month fell with respect to the weeks and the days on which we published the podcast. We actually had 9 episodes released in this month and 7 in the next month. And that's why we got way more downloads in the first month than the next month. Really, nothing had changed; the per-episode downloads were the same. But even at that simple a level, I've seen people just get so confused by aggregate measures.

Do you have any, I mean, this is a little bit of a digression from our core topic, but do you have any tips or sort of habits of mind that you encourage people to adopt? Mantras maybe even that you encourage people to adopt? I always say there's no substitute for reading the raw data. If you have not looked at actual raw records at some point in the process, obviously you can't read all the raw records, but if you have not familiarized yourself with at least some raw records, my guess is there's a pretty good chance you're going to be wrong in the aggregate analysis. Any things like that that you preach?

Juan Sequeda: (52:16)

Ask the why, right? Ask the why five times. What is the goal? What is the objective? What is the objective above the objective that you're trying to accomplish? Because it goes back to having all these ad hoc questions that we're trying to deal with. I'm like, why are you doing that? How is that valuable for the business? The person you give that to, what are they going to do with it? For me, it's always about: are you tying the work that you're doing to specific business value? And the way to tie that is you need to know what are the objectives, corporate strategies of the quarter, of the year, whatever. You need to know what those are because you need to be able to push back saying, "I'm not going to spend time on that because that is not related at all to what our quarterly goals are, unless you explain it to me." So I think that's how, from where I am, you need to get connected to the business to make sure that we're not wasting time.

Another approach or technique that I do is what I call the iron thread, which is: how do I know if something is working correctly? Don't boil the ocean. That's another trick. We love to get ambitious, but start from something very specific, one small thing, and figure out what is the path all the way to the end goal. Do it really small. You basically get that one thread through, and then you're like, "Okay, I understand all the things I need to go through." Then you say, "Let's do it again." You add another thread to it, and that thread gets bigger and bigger. At some point, you're like, "Well, how about let's do it for another area, something different?" You can have somebody else in a distributed way independently go off and do another thread somewhere else. Now we start getting these threads all straight. I think that's the approach I always do, because otherwise we get focused too much on the bottom stuff and then we're not connecting it to the value that people are seeing. I need to understand not just the output, but what is the outcome that needs to happen. You want to be able to drive all that through from top to bottom, from the executive all the way down to getting into the nitty-gritty, rolling up your sleeves. All that needs to be connected. You don't have to boil the ocean and do it for everything. Just do it for one very specific thing, go through it. Then you understand how things flow, what's working, what's not working, where's the struggle, and so forth.

Nathan Labenz: (54:51)

Yeah, interesting. Software development can sometimes be like that more generally too, where a vertical spike through all the layers of the stack, to get everything connected and make one successful round trip, is a good early milestone in a lot of things.

Juan Sequeda: (55:08)

Tie this to AI: those are all the steps you need to do, and then you figure out, well, where can I put AI in here to automate things and make them faster, better, more productive? So, perfect transition then to the paper.

Nathan Labenz: (55:21)

We've got a lot of preamble before the actual paper that you've published, but high level, it seems like the big demonstration of the paper is that you take GPT-4 and you say, "Hey, I want you to write some queries for me to answer some questions," and you give it either the pure schema of the table definitions. There's a lot of meaning often implicit there, or to some degree explicit in the names of the tables and the names of the columns and the way that database keys and whatnot relate to each other, where there's an implicit graph, again, somewhat explicit just in the schema structure, and you get a certain level of performance. And then you compare that to, "Okay, now let's also give GPT-4 a higher-level semantic knowledge graph representation of what does this data actually mean." That, perhaps not surprisingly, given all this preamble, yields quite a bit better results. What additional details about that experimental setup or about the findings do you think are most important for people to understand?

Juan Sequeda: (56:37)

Everybody was excited about chatting with their data. This is also called text-to-SQL, which has been an area in computer science we've been looking into for, again, probably 30-plus years. There's been all these techniques happening. Now with LLMs, people are like, "Wow, this should be much easier to create, to do text-to-SQL." From an academic standpoint, there are these benchmarks and techniques people have been developing, and they show 95% accuracy. I'm skeptical about this stuff. When you look into these benchmarks, they're all very simple. There's a couple of tables and the semantics are very explicit. That's really disconnected from the enterprise because the enterprise doesn't always have that clear and small setup. The other part is the questions people were asking were like, "Oh, here's a laundry list of questions." Why these questions? How did you come up with these?

So there are really two research questions. One is we want to understand to what extent large language models can generate accurate SQL queries. To what extent can that be done? And then the follow-up question is, to what extent does that accuracy improve if you actually put the knowledge graph in the middle? The hypothesis here is that if you use the knowledge graph, it's going to improve the accuracy. We didn't know to what extent. We didn't know how much. The thing is, just talking to folks, everybody's like, "Yeah, adding semantics, context, knowledge, stuff like that, that should help." But when, how would we do it, and how much better is it? People didn't know. We were at Snowflake Summit in June and we were having this conversation with a bunch of product people, and they were the ones who frankly were just challenging us, saying, "You guys should just do a little bit of a benchmark around it." And I'm like, "Yeah." Literally, I got on the plane, came back, and I started working on this stuff. Kudos to them for challenging and pushing us to it.

So what we did was we have an enterprise schema in the insurance domain. We're using an open standard from OMG, they have a property and casualty schema. This is representative of an enterprise domain, enterprise schema right there. The other thing was the questions we're asking. You have an enterprise schema and we have a set of questions, but these questions, we created this quadrant about two kind of spectrums of complexity. One spectrum of complexity is on the types of easy questions to harder questions. Easy questions being, "Give me a list of things. Show me all the claims that are open." Those are easy questions. Harder questions are things that are more about strategy, which involves questions where I need to do some aggregations, I need to do math throughout that stuff. Then from a technical perspective, you also have easier questions and harder questions. Easier meaning I only need a couple of tables to answer this. A harder question is I need a lot more tables, eight, nine tables that need to be joined. So then you put these two spectrums together, you get this quadrant of easy questions over easy schema, and then you get harder questions over easy schema and so forth. You get all these different quadrants. So that gave us a perspective of understanding the types of questions that people can ask.

Then the third thing that we looked into was: let's invest in coming up with the semantics. Invest in creating the context. That's what we call the context layer. So it's, here is the ontology, the semantic layer, and here is how these things map. Very simple. There is a claim that has a claim number. Oh, there is a table called claim. Okay, perfect. That matches. But that table actually has, I don't know, 15 or 20 columns. One column is called claim identifier. That is not the claim number. There is a column called the company claim number. That is the column that has the claim number. So then I make these mappings. And then you have things at the value level. Say I want to know who my policyholders are: it's every row where the role column equals PH. Agents? Where role equals AG. That's the semantics. That's the context that you have in there. So we made that very explicit. That's what we came up with as an input to the benchmark.
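To make that mapping idea concrete, here is a minimal sketch of what such a context layer might look like as plain data. The claim table, the company claim number column, and the role = 'PH' / 'AG' values come straight from the example above; the customer table name and the overall shape are illustrative assumptions, not data.world's actual mapping format.

```python
# A hypothetical context layer: business concepts mapped to the physical
# schema, including the value-level semantics from the spoken example.
CONTEXT_LAYER = {
    "Claim": {
        "table": "claim",
        "properties": {
            # The business concept "claim number" maps to company_claim_number,
            # NOT to the similarly named claim_identifier column.
            "claimNumber": "company_claim_number",
        },
    },
    # Value-level mappings: the same physical table, disambiguated by a filter.
    "PolicyHolder": {"table": "customer", "filter": "role = 'PH'"},  # table name assumed
    "Agent":        {"table": "customer", "filter": "role = 'AG'"},
}
```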

And the evaluation was a very simple setup, on purpose, because I wanted the bare minimum, the simplest prompt that you can imagine. "Here is a SQL DDL schema." Copy and paste the SQL DDL and say, "Write a SQL query for the following question." And then for the knowledge graph, we said, "Here is the OWL ontology," the ontology that's the semantic layer, which is an open standard that we gave it. We understood that GPT-4 knows these open standards. You copy and paste it and say, "Write a SPARQL query for the following question." SPARQL is the open standard for querying knowledge graphs. One caveat is that we put the schemas and the ontology, the semantic layer, as part of the prompt, so it has to fit within the token limit. You just give that to GPT-4 and it generates the query.
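As a rough illustration of that setup (not the paper's exact prompts or harness), the two variants might look like this with the OpenAI Python SDK; the file names are placeholders:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def generate_query(schema_text: str, question: str, query_language: str) -> str:
    # Bare-minimum prompt: paste the schema, ask for a query. query_language
    # is "SQL" (paired with a DDL schema) or "SPARQL" (paired with an OWL ontology).
    prompt = (f"{schema_text}\n\n"
              f"Write a {query_language} query for the following question:\n{question}")
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

# sql    = generate_query(open("schema.ddl").read(), question, "SQL")
# sparql = generate_query(open("ontology.ttl").read(), question, "SPARQL")
```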

When we started comparing all the queries that were generated, we got a 3x difference: three times more accurate queries if you're doing the knowledge graph versus if you only do SQL. That's comparing across all the questions. If you look inside the quadrants, for questions that required high schema complexity, basically questions that require joining more than four tables, the SQL queries generated by GPT-4 would always fail. I think that was an interesting point. If you have queries that require more than four tables, that's not going to work very well. And that was for this very simple prompt.

The work that we're doing right now is testing it with other models too. And what we want, and what the community can do too, and we're seeing this, is: let's improve the prompting. Let's figure out more context. Is there a way I can pass the context to SQL in some other way? This is where we need to start working as a community to figure this out. We presented an initial baseline. Going back to our original research question, to what extent? We found an extent. We know this. And now this is how we advance science. Let's have other people build on that work, and let's see if we can improve the extent.

Nathan Labenz: (1:03:26)

It definitely is, again, another major theme, just how many of the benchmarks that largely have come out of academia over the last 10 years have just been kind of blown away by the latest models. I think that's one of the really interesting indicators of just not only how fast things are moving, but how fast they're moving relative to people's expectations. You see 2019, 2020, even some 2021 benchmarks that are just obsolete because they're totally broken by the latest language models. They were built in the way that they were because people didn't even, you know, couldn't even really conceive of at the time that an AI would get good enough to solve all of it. It was like, "Gee, if we solved all this, we'd have AGI." Turns out there's a little bit more room between solving these benchmarks and AGI, at least.

So now you've got this kind of first stake in the ground, basically, right? Here's, okay, all these old simple schema benchmarks are increasingly obsolete. Now we've got to increase the challenge. Let's bring a real enterprise-grade challenging problem to the models here and see how they do. They don't do so well. Here's one initial technique of layering on semantics that makes a huge difference. And, of course, we're fully expecting that people are going to continue to improve on those techniques and get better and better performance. Do you have any tips for how people, and especially for enterprises that want to do this for themselves for actual practical purposes internally, what sort of guidance would you give them on constructing their own internal benchmarks or eval suites?

Juan Sequeda: (1:05:13)

The contribution of our work, one, is not just the results of the benchmark, but it's the benchmark framework itself. So what I'm telling everybody, and I'm actually working with our own customers doing this too, because they're like, "We're working together to build these chat systems, these agent systems," and they want to evaluate how good this is.

So one, they have their data, so first check. Number two is the questions. I think it's really important to understand what questions people are asking. It goes back to what we were talking about earlier. You should catalog and keep track of the questions. What are the questions people are asking today? Who's asking them? Why are they asking them? How are they getting those answers? Are they getting those answers today through a spreadsheet that somebody gives to them? Or do they write a query? Do they get it through a dashboard? Let's figure that out today. Then, put those questions into those quadrants. This is an easy question, this is a harder question; this is a question that is just a list of things versus a question that actually involves aggregating stuff; this is a question that requires a small number of tables or a large number. So start putting that into those quadrants. That is my main recommendation for folks. And then, to avoid boiling the ocean, make sure that you're focusing on the questions that are actually going to provide some value. That's why it's important to know who's asking those questions and why. So I think that's a very important thing.
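A lightweight way to start that catalog, sketched under the assumption that a simple two-axis tag is enough to begin with (the four-table cutoff echoes the benchmark finding above):

```python
from dataclasses import dataclass

@dataclass
class CatalogedQuestion:
    text: str
    asked_by: str       # who asks it, and how they answer it today
    kind: str           # "list" vs "aggregate" (business complexity)
    tables_needed: int  # schema complexity

def quadrant(q: CatalogedQuestion) -> str:
    question_axis = "easy" if q.kind == "list" else "hard"
    schema_axis = "hard" if q.tables_needed > 4 else "easy"
    return f"{question_axis}-question / {schema_axis}-schema"

q = CatalogedQuestion("Show me all open claims", "claims ops", "list", 2)
print(quadrant(q))  # -> easy-question / easy-schema
```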

Then the third one is the context. The whole point of the work that we're presenting is that you should invest in the context. You need to invest in the semantics of what this stuff means, because frankly that's what the LLMs don't have. So you need to start investing in that. At least our customers are already invested in the context because they already have data.world. But for others who are thinking about this: if you want to chat with your data on your relational SQL databases and you don't have a catalog, you've got to solve that. You've got to bring in a catalog, but the catalog is for the sake of understanding, keeping track of all that metadata, to start building that semantic layer. And then don't boil the ocean. That's where I would say have that iron thread approach: bring in the data you need to answer those questions and keep track of all that metadata. Treat the metadata, the semantics, as a first-class citizen.

So again, just to summarize: one, you already have your data, so great, check. Second, make a list of all the questions, put them into quadrants, and prioritize them by the why behind them, and understand how they're answering those questions today. And third, if you don't have a catalog, then that's where you can start from a technology perspective.

Nathan Labenz: (1:08:13)

So what do you think the prospects are for this kind of chat-with-your-data paradigm? The dream would be that the AIs get really good, and the decision makers can ask their questions whenever, whether it's the middle of the night when they have their brainstorm or not. AI has a lot of advantages, right? It's always instantly available, and you can kind of pick up where you left off. But there are also big challenges, presumably accuracy first among them. It seems like we're probably not there yet. You said around 60%, which I would assume is not good enough. Do you have a sense for what is good enough and what the steps are likely to be to get there?

Juan Sequeda: (1:08:58)

In the benchmark, it was 60% for all the questions, right? Now, what we really need to start thinking about is, given a question, can I even answer this question? This is what we're working on right now: given a question, either know that I can answer it or give some confidence measure of whether I can answer it or not. Then you reject the questions that you know you can't answer, and you focus on answering the questions that you know you can answer. And I highly suspect, and this is something our research has shown us, that from there we can get to 90%. So if I zoom out and imagine the user going into your chat: you ask a question, and you'll either get an answer with an explanation, and I'll dive into that in a second, or an explanation saying, "I'm not going to answer this question for the following reason." That already gives you confidence.

Now, when I do know that I can answer a question, what happens is, and this is where the knowledge graphs come in, it's tied to accuracy and explainability. If I ask a question, I can use the LLMs to extract the concepts, all the things that are being asked in that question, and I can look them up in the graph. If all the things that I'm asking about show up in the knowledge graph, I'm like, "Oh, I know about all these things. I can then formulate a query." But if I ask about something and I can't find it in the graph, I'm like, "I don't know." So basically, you know what you know and you know what you don't know. That's how I can explain that I don't know this stuff, so I'm not going to answer that question. And if I do know, I can create the query and then execute it.
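A toy sketch of that gating logic, with the concept extraction assumed to happen upstream (in practice via an LLM extraction prompt) and the graph reduced to a bare set of concept names:

```python
GRAPH_CONCEPTS = {"claim", "policyholder", "agent", "policy"}  # toy stand-in for the KG

def route_question(question: str, asked_concepts: set[str]) -> dict:
    # asked_concepts would come from an upstream LLM extraction call.
    missing = asked_concepts - GRAPH_CONCEPTS
    if missing:
        # Know what you don't know: reject, with an explanation.
        return {"answer": None,
                "why": f"Not answering {question!r}: {sorted(missing)} not in the graph."}
    # Everything is known: safe to formulate and execute a query here.
    return {"answer": "<generate and execute a query over the graph>",
            "why": "All requested concepts exist in the graph."}

print(route_question("Average claim amount per broker?", {"claim", "broker"}))
# -> rejected, because 'broker' is not a concept the graph knows about
```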

And here's the big distinction: LLMs are not answering the question. Because if the LLM is answering the question, if you're training an LLM to do that, then it's always going to be probabilistic. I can't have certainty behind it. LLMs are going to be an assistant that creates the code which is then executed over the system that answers the question. And when it's code, that is something I can reason upon. That is something I can govern. That is something I can manage, and something I can use for an explanation. So I match it to the graph: okay, here it is, I generate this query, a query over the graph. And then this is where all the technical metadata in the catalog comes in, saying, "Oh, this exists in the graph for this reason. It comes from this source and goes to this table, and this person was involved in creating it, and it was authorized by this person." That's why you're getting that information. So I can not only give you the answer, I can give you an explanation, and I can go as deep and granular and nitty-gritty and technical as I want, or I can keep it high level.

So one of the things that we're working on, another kind of annoyance I have, is when we talk about explanations: explanations to whom, right? If you're giving a non-technical person an explanation, they probably just want an explanation in English, and also to know who they should go talk to if they have a question. Don't give them code; they're not going to understand it. If you're giving an explanation to a technical person, they probably want to see the code. If you give them just something high level and fluffy, they're like, "That's BS." So you also need to understand who the personas are that you're giving explanations to. And again, with the graph, I can traverse it as far as I need to get to whatever level of granularity I need. So I think that's why knowledge graphs are critical for the accuracy and for the explainability.

Nathan Labenz: (1:12:46)

I'm mapping a bunch of familiar AI application building techniques onto this problem, and one is decomposition. It sounds like you could imagine just taking the semantic layer and the schema and dumping that all into context and saying, "Hey, have at it." Or you could imagine kind of breaking that up and saying, "Okay, first thing, your first job is to just look at the semantic layer and determine if you have the information to answer the question." Then you could have a peek at the schema and actually generate the query. That decomposition takes more work. It definitely kind of makes things a little bit more brittle depending on how quick your system is to change or whatever, but it really can drive accuracy. I imagine that's a part of it.

Fine-tuning models, obviously, also a big trend. GPT-4 fine-tuning is in limited release. I don't know if you've had a chance to fine-tune that, or if you've had experience fine-tuning other models to get better at this, but that seems like another opportunity for improvement. Then you mentioned earlier too, established questions and canonical ways to answer them. I think of these skill-database type paradigms, like the Eureka paper or the Voyager paper out of NVIDIA, where when they find something that works, it's like, "Okay, cache that to a database." And next time we get a similar question or similar challenge, in their case it was Minecraft-type skill building, but here it's obviously database-type work, let's see if we've done this before and have an established answer. I imagine all three of those are directions you're pushing on?

Juan Sequeda: (1:14:28)

Yeah, the answer is always: it's hybrid. It depends, right? Yeah, for sure. From the science perspective, I start with one thing to see how well it works. This one was: let's start with bare-metal prompt engineering. And then you have other tools that you want to figure out, like what vector databases can I use as another tool there? Also note that for the first experiment, it fit in the context window. But maybe my ontology, my semantic layer, is going to be so big that I can't always pass it in, so I need to be able to pull parts of it out.

Nathan Labenz: (1:15:00)

Were you working with the 8K version in this work or the 32K?

Juan Sequeda: (1:15:05)

The 8K. Yeah. Because the DDL wasn't that big anyway. And then when it comes to fine-tuning, I think there's two things to look into. One is, are you fine-tuning with the data or fine-tuning with the metadata? And I would think fine-tuning with the data is probably not going to be valuable, because first of all, you've got millions and billions of rows in your Snowflake or Databricks, your data lakehouse. Are you really going to extract that and put it in there? It's so expensive, and then it gets updated all the time. And by the way, in that case your LLM is trying to answer the question itself, so it's going to be probabilistic and non-deterministic. If you're going to ask something and it's going to get your balance for your bank account, it better be freaking accurate all the time. So there are particular scenarios where the accuracy is critical, critical.

But when it comes to fine-tuning on the metadata, I think there are going to be huge opportunities. As you said, it's like, "Oh, here are these patterns of questions. Here are these types of questions, and they fit these patterns of queries." That's the stuff you want to fine-tune on, that metadata, because then it should increase the possibility of generating the right query. Now, another thing is that when you generate a query, you're generating code, and I can reason upon that in a deterministic manner. I can actually check it. I can do static analysis and check whether the code is going to be correct or not, with a bunch of heuristics. So there's another set of techniques that you can use as a post-process.
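For example, a generated query can be parsed and checked against catalog metadata before it ever runs. This sketch uses the open-source sqlglot parser, which is my choice for illustration, not a tool named in the episode:

```python
import sqlglot
from sqlglot import exp
from sqlglot.errors import ParseError

KNOWN_TABLES = {"claim", "policy"}                      # from catalog metadata
KNOWN_COLUMNS = {"company_claim_number", "role", "id"}  # illustrative

def lint_generated_sql(sql: str) -> list[str]:
    """Deterministic post-processing: flag problems before execution."""
    try:
        tree = sqlglot.parse_one(sql)
    except ParseError as e:
        return [f"does not parse: {e}"]
    problems = [f"unknown table: {t.name}" for t in tree.find_all(exp.Table)
                if t.name not in KNOWN_TABLES]
    problems += [f"unknown column: {c.name}" for c in tree.find_all(exp.Column)
                 if c.name not in KNOWN_COLUMNS]
    return problems

print(lint_generated_sql("SELECT claim_identifier FROM claims"))
# -> ['unknown table: claims', 'unknown column: claim_identifier']
```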

So it's prompt engineering, vector databases, fine-tuning on the metadata, not the data, the metadata, and some post-processing. These are all basically just a bunch of tools that you'll have. And what we'll see is an agent framework where, for each state in that agent, it will use a particular tool: for this state, this tool; for that state, that one. It has a lot of tools at its disposal. And I'd argue that a lot of the IP that will come out is in how people structure their agent frameworks. At the end of the day, you want to have an intelligent agent that can really understand its environment, perceive its environment, and make decisions autonomously based on that environment, and how sophisticated your agent framework, your state model, is will really determine that.

The current RAG architecture that we all talk about, I think, is the simplest, most naive type of agent system. Get a question, send something to the vector database, get context back, and send it all in. That's kind of a one-stop shop for everything. That's going to get broken down into so many different pieces, and that's how we're going to start developing these agents. Which, by the way, this agent stuff is another thing I'd call out. We've been working on agents for a long time. This is good old-fashioned AI. People have been working on this stuff for 50-plus years. There's just so much work on agents, on planning. So hopefully we don't all spend time reinventing the wheel, and instead build on the shoulders of giants so we can advance very quickly here.

Nathan Labenz: (1:18:23)

A tip that I've found, and then I have a project I want to ask your advice on, is about fine-tuning. This is definitely proven true for GPT-3.5. Unclear, you know, everything has to be revalidated when you move up an order of magnitude in scale, so I'm not sure yet how this will apply to GPT-4 fine-tuning. But for 3.5 at least, fine-tuning on chain-of-thought type data has proven extremely valuable for me in terms of getting the fine-tuned model to behave how I want it to behave. It has not been good at learning facts that way, so I certainly wouldn't fine-tune on raw data. Even fine-tuning on the definitions of tables and stuff, I wouldn't expect that to work very well either. You're still going to need it in context to be accurate, in my experience. But where I've seen the most boost is in explaining the way in which we want the model to go about the problem, and it doesn't have to be huge in my experience. As few as a hundred examples can get you a huge boost at the 3.5 level. But it's really critical to have that explicit step-by-step demonstration, because obviously we know that chain of thought, "think step by step," is really helpful. You have to demonstrate what kind of chain of thought you want or need to solve the problems in your particular context.
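For concreteness, a chain-of-thought fine-tuning example in OpenAI's chat-format JSONL might look like the following. The schema names and reasoning steps are illustrative, and a real example would end with the full query rather than a placeholder:

```python
import json

example = {
    "messages": [
        {"role": "system",
         "content": "You translate questions into SQL over the claims schema."},
        {"role": "user",
         "content": "How many open claims does each policyholder have?"},
        {"role": "assistant",
         # Demonstrate the reasoning you want, not just the final answer.
         "content": ("Step 1: 'claims' maps to table claim; the business claim "
                     "number is company_claim_number, not claim_identifier.\n"
                     "Step 2: policyholders are rows where role = 'PH'.\n"
                     "Step 3: join, filter on open status, group by policyholder.\n"
                     "SQL: <full query would go here>")},
    ]
}

with open("cot_train.jsonl", "a") as f:
    f.write(json.dumps(example) + "\n")  # one example per line; ~100 can suffice
```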

So here's my challenge, and this is really why I kind of got down this rabbit hole of knowledge graphs a little bit in the first place. I'm working with this company Athena. We're in the executive assistant space, and we are trying to deploy AI in a million different ways to enhance our service. One of the things that we know our clients would really like is if we had some way for the EAs to have always-on access to everything about them, right? All their preferences, all their history, whatever, anything in the background. A lot of times right now, the EAs have to ask questions, and it'd be better if we could somehow answer those questions without them having to ask. So the idea is, can we create a long-lived client profile that we can sort of maintain and update over time?

A huge challenge though is almost all the data is unstructured. We interview clients when they come on board and record that call and transcribe that call and then summarize that call, but it's all text. The whole thing is text. There's no schema whatsoever. So we can then throw that stuff into a vector database and query against it and get, you know, kind of what did they answer at the time of the onboarding call. But then, of course, things change, right? So how do we add more to that database over time? And how do we sort of manage what supersedes what? I wonder, you know, this is pretty early stuff. I think we're on the absolute edge of tackling these problems, but there's not too much work out there on this kind of thing yet. I wonder if you have any kind of thoughts or intuitions for us as we think about something where, you know, the core challenge here is there is no schema. It's all just text, and we're trying to sort of be able to dump new stuff in there, but then have it maintained in a way where we get the right information at any given point in time. We don't have great solutions for this yet, so any pointers or suggestions would be valuable.

Juan Sequeda: (1:21:59)

I will argue that there's always a schema. There's always an ontology. There's always semantics, and it's implicit, because you're talking about the same concepts over and over again. You're maybe using different words that mean the same thing. So an idea here is: take all that text that you have, all that unstructured stuff, and just ask GPT, "generate an ontology or a taxonomy or a business glossary from these things." And then identify what those concepts are. For your particular space, think about the domain. Model the domain. What do people do? What are the roles involved? What are the tasks? Some of those things are going to be really generic across all types of companies, and I'm sure some are going to be very specific to you, but figure out what the high-level generic ones are. It's there. I mean, you said there's no schema. There is a schema, right? There's meaning behind everything we're talking about.

So just automate it. Go get all that text and say, literally, the prompt, and I do this all the time, "generate a business glossary" or "generate a semantic layer, generate an ontology based on the following texts or questions." It's going to come up with a bunch of stuff, and then you're like, oh yeah, that thing is related to this thing. And now you have a consistent, controlled vocabulary, and that's something that you can use later on in your chain of thought, or even in how you're presenting things in the user interface, or how people are chatting or conversing with it.
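A minimal sketch of that extraction step, assuming the OpenAI SDK; the prompt wording is mine, and the output is a draft to curate, not a finished ontology:

```python
from openai import OpenAI

client = OpenAI()

def draft_ontology(raw_texts: list[str]) -> str:
    # Concatenate the unstructured sources and ask for a first-pass glossary.
    prompt = ("Generate a business glossary / ontology (concepts, definitions, "
              "relationships) based on the following texts:\n\n"
              + "\n---\n".join(raw_texts))
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content  # review and curate by hand
```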

This is anecdotal, but when we created these semantic layers, these ontologies, what I've seen is that people start changing their behavior a little bit and using the words that were defined inside of that ontology, inside of that semantic layer. That's a human behavioral thing: oh, if I use these words, I actually get my answers. So they start changing. And then, as an organization, we all start using a common vocabulary. We have that lingua franca. And at the end of the day, we should; it would be great if we start doing that. So I think this is also a way to encourage that human behavior. Anyway, long story short: there is a schema in there. Use GPT to help you extract what that schema can look like based on all the text that you have, and then reuse that in any other techniques that you're using.

Nathan Labenz: (1:24:31)

Yeah, interesting. So would you recommend, right now we're kind of just chunking and dumping stuff into a vector database and querying against that. We're using the HyDE technique, hypothetical document embeddings, which is basically a translation. When a user comes to our chat and asks, "what is X about the client?", we translate that into something that we think the answer is likely to look like. That's hypothetical document embeddings; I'm not sure if you're familiar with it. That seems to improve the accuracy of our retrieval, but we're purely using this vector database. We're not really using much structured data. Do you have a recommendation for what database to use? Should we go with Postgres with their vector extension, or should we be thinking graph database?
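For readers unfamiliar with it, here is a compressed sketch of HyDE as Nathan describes using it: embed a guessed answer rather than the question itself. The embedding model choice and the vector_search helper are assumptions, not specific products:

```python
from openai import OpenAI

client = OpenAI()

def vector_search(vec: list[float], k: int) -> list[str]:
    # Stand-in for your vector store's query call (pgvector, Chroma, etc.).
    raise NotImplementedError

def hyde_retrieve(question: str, k: int = 5) -> list[str]:
    # 1. Ask the LLM to write what a plausible answer passage would look like.
    hypo = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content":
                   f"Write a short passage that would plausibly answer: {question}"}],
    ).choices[0].message.content
    # 2. Embed the hypothetical passage, then search with that vector.
    vec = client.embeddings.create(model="text-embedding-ada-002",
                                   input=hypo).data[0].embedding
    return vector_search(vec, k)
```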

Because we do have a lot of that. I think a lot about that AI Town paper. I don't know if you saw that one, but it made the rounds. It was these little AI agents running around, making plans, and interacting with each other. For me, one of the most interesting aspects of that paper was the way they handled memory: through a periodic sweep of all these raw observational memories into higher-level synthetic, thematic, or more episodic memories. The agents would write down, "I talked to Juan at this time, we said this and this." But pretty quickly there's too much there, especially with limited context, which is of course expanding. To manage that, they had to synthesize it into this summarized layer. And I thought that was really interesting. We haven't implemented anything quite like that. That would seem like a very natural thing for a graph database: this was synthesized from this, or, if I'm looking at some abstracted, summarized memory or description, what are the raw observations it came from? Where do you think I should put that data?
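A toy sketch of that synthesis-with-provenance shape, which is indeed graph-like: a summary node that supersedes raw observations and keeps pointers back to them. The structure is entirely illustrative:

```python
from dataclasses import dataclass, field

@dataclass
class Memory:
    id: str
    text: str
    derived_from: list[str] = field(default_factory=list)  # provenance edges

raw = [
    Memory("m1", "Client prefers morning meetings (onboarding call)"),
    Memory("m2", "Client moved recurring meetings to afternoons in March"),
]

# A periodic sweep would have an LLM reconcile and summarize; shown by hand here:
summary = Memory(
    "s1",
    "Meeting preference: afternoons (supersedes earlier morning preference).",
    derived_from=["m1", "m2"],
)
# Serving s1 answers the question; following derived_from explains where it came from.
```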

Juan Sequeda: (1:26:36)

I think this is going to be a combination of using whatever vector database, vector similarity indices, whatever, and then some sort of graph. That's my perception of where things are going to head. And in the end, just use whatever graph database you want; things can change. I'm not a big proponent of any particular database. You're going to have virtual knowledge graphs for things where you have all your enterprise data. You're not going to move all your data from Snowflake or Databricks into a graph database and keep copies of it. No way. That's not going to happen. You're going to have a virtualization layer, so you can query it as a graph and it gets translated to SQL. That's exactly what we did for our benchmark work. So if you already have existing data in a relational database or your data lake, you're going to have a virtualization layer. But if you're starting from scratch, then I think you're going to have a combination of those two.

Guess what's going to happen? Every single structured database is going to have vector features. If they didn't have them yesterday, they have them today; if not today, tomorrow. All the graph databases are adding this right now, and so are the SQL databases out there. Are vector databases going to add SQL interfaces and graph interfaces tomorrow? I've always been explicit about this: vector databases are a feature, not a category. There will only be one winner among the pure vector databases. For everything else, people will just use some other database that has a vector feature, and it will be enough. That's what's going to happen in the next year or so. You will only need a dedicated vector database if you have a real scale need. Otherwise, all databases will be adding vector features.

Nathan Labenz: (1:28:27)

Well, folks can juxtapose that against Anton's answer in an earlier episode. His big thing, which I did think was compelling, was that the data coming into Chroma has never been in a database before, and that the overall amount of data going into databases is headed for 10x growth, because all this unstructured stuff is moving into these new structures. So everyone can win, because the whole market is growing 10x.

Juan Sequeda: (1:28:59)

Well, my whole point is that if you look at the Snowflakes and the Databricks and the Azures of the world, who already have all the SQL stuff, they'll be able to say, we should be able to support more unstructured data, and they will. Now these vector databases are like, oh, we do that too. But they don't offer SQL. So are they going to add SQL to their stuff? No, they're not. And for scenarios where you need the best scalability for vectors, you will want a dedicated one, but that market is not going to be that big. There's going to be one big player and everybody else is going to be behind. I mean, it's like MongoDB, right? MongoDB is the winner, and the other NoSQL databases are all small behind it. There's one winner there, it's MongoDB, period. Go look at their stock. That's what's going to happen.

Nathan Labenz: (1:29:52)

So what other uses for AI in data are you most excited about? I'm thinking about things like, you know, as an application developer, I find it still pretty tough in a lot of cases to figure out what are people doing as they go through my application. It's easy to log stuff, you know, log clicks and log whatever, but to really synthesize that into a story, you know, a lot of products try to do that, and it's been a big struggle in my experience to get them to work well. But I feel like maybe AI can kind of come along and narrativize. I also think about things like anomaly detection or determining what's relevant or not relevant when it's not obvious by definition. But I'm very curious to hear, aside from this chat with your data question answering use case, what other roles do you see AI playing in the data world in the near future?

Juan Sequeda: (1:30:53)

Suggest descriptions for things. People have to go write these descriptions, and now at least they don't start from a blank slate anymore, right? That applies to me too: when I write an email about something, I use it so I don't start from scratch. So think about applying that to that type of metadata, to any type of documentation. That's a big lift.

You can also use it, here's another one, talking about anomaly detection, for PII. Just pass the schema to GPT and say, "out of this schema, which columns do you think may have PII data?" Personal information, right? It will give you things that are actually pretty good candidates. And guess what? You don't have to look at the data at all. That's already a big lift. So a lot of the documentation and this kind of augmentation of metadata, I think that's one thing.
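A sketch of that schema-only triage, again assuming the OpenAI SDK; the DDL and prompt wording are illustrative:

```python
from openai import OpenAI

client = OpenAI()

# Only the DDL is sent; no row data leaves the database.
ddl = "CREATE TABLE customer (id INT, full_nm VARCHAR(80), ssn CHAR(11), role CHAR(2));"

resp = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content":
               "Out of this schema, which columns do you think may have PII data? "
               "Explain briefly.\n\n" + ddl}],
)
print(resp.choices[0].message.content)  # candidate columns to review by hand
```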

Second, for search: just more natural language search in general. That's another one. Again, that's a productivity lift. One that I find really exciting is explanations. It explains code very well, which is another type of documentation. So a lot of this documentation work you can do this way.

And one that I'm really finding interesting, working with one of our customers, WPP, who's the world's largest ad agency: they wanted ideation. They want to use the LLMs as a way to inspire them. So basically it's like: data.world, you are a catalog of all our data, you know all our data sets, tell me what questions I could be asking with the data I have. And it starts generating candidate questions: you could be asking this question, and so forth. And people are like, oh, that's interesting. I didn't think about that. You can keep track of all these questions and start connecting them. And then we've been doing things like: I have this problem I want to go solve, what data do you recommend I use for that? So you give it all this context: here's all the context you have, here's the problem; GPT, LLM, come up with some interesting solutions and make suggestions of what we can do. So it's about ideation. But again, that's more of a productivity lift. Those are things that we're seeing.

Another one is around what I'd call knowledge acquisition. It's like, let's go talk to people. Wait, so what do you mean? When you say customer here, what do you mean by active customer? Right now we have to have those conversations human to human. Now I can automate and scale that out: hey, can you please spend a couple of minutes chatting with my bot here? It's going to ask you questions. We've done this, set up our own GPT: be a psychologist, be very Socratic, and ask somebody what questions they're trying to ask of the data and why that's important. It keeps going: why, why, trying to extract the knowledge. In one day, I could talk to 20 or 30 different people, and then I could use GPT again to summarize all those discussions: hey, look, when we talk about active customers, here are all the different definitions of what that could be.

So that's really helping me, again, in a productivity way, to understand faster what we mean by things. And that's the one I'm most excited about, and I hope people get excited about it too, because that's the one that has always been the bottleneck. Acquiring knowledge from people's heads has always been a bottleneck: you've got to go talk to somebody. Now we can actually do that at scale. That means we can learn so much about how the business operates, map out business operations. We want operational excellence, to understand all our business processes and how we can improve them. Well, we often don't even know what our business processes are. Now I can go interview somebody and say, how do you do your job? And I can map that out and say, wow, look at all these manual steps that happen here. That's a bottleneck. We need to go improve that. That gives us more ways to improve our operations.

Nathan Labenz: (1:34:51)

How far do you think this goes in terms of productivity lift in the short term? Meaning, let's say we just apply the technology we have well. Are we talking a multiple-X increase in productivity? I certainly personally feel that in the coding use case. And what does that ultimately mean for data analysis? The old bank teller thing is sort of the happy story: we introduced ATMs, people said that's the end of the bank teller, but it wasn't; a lot more bank branches opened. Classic example. Is that where we're headed in data work as well?

Juan Sequeda: (1:35:35)

In all work. Everywhere. On our podcast yesterday, we had Jeremiah, and he and I were discussing this: okay, who is not going to be affected? We came to the conclusion that it's only somebody who literally lives in the middle of nowhere, doesn't talk to anybody, and doesn't depend on anybody else, because everyone else is part of some supply chain. And that supply chain is going to get affected by AI. So I think everybody is going to get affected by AI, getting more productive and so on.

And then, the honest, no-BS take is that yes, we're going to have to deal with the job change that's going to happen. There are going to be fewer jobs, and we're probably going to have to figure out UBI. As we were just saying, this is probably going to be at the top of presidential campaigns eight years from now; it's going to be top of the conversation when we're electing a president. There are going to be premiums for dealing with humans, right? Because in the end, so much is going to be automated. We were playing around with Perplexity yesterday for our podcast, and with Pi AI, and this stuff is replacing a lot of what humans are doing. So what we'll see is people paying a premium to know they're interacting with a human and not with an AI.

So I think everything is going to get touched little by little. Another thing I'll say publicly, and I hope people won't take this the wrong way, is that I feel bad for people who are doing coding boot camps, because that's just going to be completely commoditized. Writing code is going to be fully automated. What we're going to need are people who are true algorithmicists, computer scientists, people who are thinking about solving the problem, because you're going to come up with the algorithm, the step-by-step to solve a problem, and break it down into small pieces. The implementation, the code for that, is going to be commoditized; it's going to be 90% correct, right? So knowing how to code without knowing how to think about problem solving isn't going to be enough. A lot of these jobs are going to go away.

I mean, we've seen this before, over a generation. When the typewriter came out, people were against it: if I write by pen, I know who wrote it, I can see their handwriting. They thought typing was offensive because you couldn't tell who had written something. That was the argument back at the beginning. I don't think that's an argument anymore. So all these arguments will come in and then they'll go. And I think it's going to move so fast that we're going to see so many changes within our generation.

Nathan Labenz: (1:38:36)

How would you anticipate that playing out for an enterprise data team? Obviously it can be generic, but say you're a big enterprise with all these systems and however many people. I don't even really have an intuition for what a typical enterprise data team size is and what the role breakdowns are. But how would you expect that to evolve in terms of headcount, which roles change or go away? How much are people then spending on AI? Are they saving an order of magnitude on their costs, or does the AI end up being expensive? What's the before-and-after picture as this transformation happens?

Juan Sequeda: (1:39:16)

The roles that are going to stick around or be created are the ones that are very people-heavy, the ones where you need a lot of thought. Think about a data analyst. If you're doing generic analysis, reporting on metrics that are all very well known in a particular industry, that's going to go away, because all the large language models, all the AI systems, will know the foundational metrics in every single industry. But people who are creative thinkers, who ask, "what about this thing nobody's thought about?", something not even AI comes up with, or that they come up with using AI. People who are really critical thinkers, who can connect the dots from so many different places: oh, look, this is probably something we should go figure out. I think those are the types of roles that will really grow in value.

You can see that in data analysis. The other one that I'm very focused on is managing the knowledge of your business. LLMs and agents are going to assist in collecting all that knowledge, but then actually deciding, okay, we're making the decision that an active customer means this, for these reasons, there should be people behind all that. So when it's all about critical thinking, those are the roles. Other than that, I think a lot of this stuff is going to get pretty commoditized.

Nathan Labenz: (1:40:58)

You kind of alluded to creativity, insight, eureka moments, if you will, being one of the things that the AIs don't deliver and that we rely on humans for. I very much focus on that as well. I think about AI a lot in terms of threshold effects: once AIs can do something, we can suddenly be in a very different regime from the one we're in for as long as they can't. And having these insight, breakthrough, eureka moments feels like a huge threshold that we have not yet passed, certainly not with any reliability. I would say it's very rare.

Juan Sequeda: (1:41:37)

This is on my to-do, my to-read list. I've seen a couple of papers where I was like, whoa, this scientific discovery only happened because we're using AI. That is freaking amazing if it's actually happening, because it means we're actually going to advance science, taking unknowns and making them known.

Nathan Labenz: (1:41:58)

It didn't do it by itself, right? A lot of scaffolding, a lot of framework around it, is definitely still necessary. But yes, I think we're just starting to scrape at the bottom of that threshold right now. DeepMind recently had an interesting one where they advanced the state of the art on some challenging math problems, FunSearch. It's funny: I don't even really understand the problems, and the AI is advancing the state of the art. I also think of Eureka, the actual name of a paper out of NVIDIA, where they used GPT-4 to write reward functions for reinforcement learning control of a robot hand. And Gabe Gomes, who did autonomous chemical reaction optimization with GPT-4. So we're starting to see these; you might call them sparks of AGI if you're so inclined.

What else, though, aside from eureka moments? That's a huge one. Is there anything else where you would say, if I'm an OpenAI or a Google DeepMind or an Anthropic, here's where they're falling short? What are the persistent weaknesses that you would love to see addressed? Do you have a thought on that?

Juan Sequeda: (1:43:13)

These foundational models don't know anything about my organization. These foundational models, large language models, they're experts in language. They know general knowledge, right? They don't know anything about my organization because they've never looked at it, they've never trained on it. They lack the context, the brain of my organization. And that brain is the knowledge graph. Now, the thing about the knowledge graph is that it's not a language thing; it's knowledge. It knows everything about my organization because I keep track of it, I govern it, and I use AI to govern it, I use AI to create it. There is this thing that I have, and this is my precious, the essence of what my organization is. This is the brain, and everything here is accurate. I can use this to explain things. These foundational models don't have that accuracy, don't have that explainability, don't know my organization. Do we expect these foundational models to know every single organization? No, because these things are private. I don't want them to know. But I want to go use them.

So that's why this combination of the foundational models, large language models, with your internal brain of the organization, the knowledge graph, is where I see the future. What we need to work on is understanding the best integration point between the knowledge graph, the brain of the organization, and these foundational models. The naive one is prompt engineering, using some RAG architecture. We're going to expand on this; that's what we're going to be working on over the next couple of years, figuring out what that integration is. And then it's going to be plug and play. You're going to have all these foundational models saying, I can work with your knowledge graph, I can work with your brain. Just plug in your brain here and boom, you're powered up.

And also, as a company that has invested so much in my brain, I don't want to get tied to just one model. I want to be able to move around. It's going to be like the cloud, right? Maybe you get all bought in: we're an AWS shop, we're an Azure shop, I do everything there. But people are still multi-cloud. So I want to work with different foundational models, and that means we're going to have these integration points. So I do think organizations are going to be investing in creating their brain, which is their knowledge graph. We have all these foundational models, and then we're just going to keep finding better ways of plugging and playing. Those are the integration points that are going to happen, and that's what we're focused on. That's how I think about it. I don't know. What do you think?

Nathan Labenz: (1:45:51)

A project that I'm maybe starting to embark on is a kind of taxonomy of weaknesses of current systems. And I do think you're hitting on a big one there that ultimately boils down to the fact that in today's systems, there are the weights, the model itself, and then there's the stuff that you're putting into it at runtime. That is the input, the context, the prompt, but it's limited, and there's nothing between those two. I've been really obsessed recently with state space models as something that may enable a much longer and perhaps more coherent, perhaps more agentic, perhaps more human-like system. Although, I heavily caveat the human-like because I always try to remember that these things are pretty alien in critical ways.

You can fine-tune a transformer model for behavior, but you end up narrowing that behavior typically, and you lose a lot of the generality when you do that. So you're dialing it into a specific task. That's been my finding, usually with the fine-tuning that I've done so far. You can't really do something where you're training a transformer model like you onboard a new employee. And so I'm looking for something where I want to be able to give an AI system a big body of knowledge and say, "Here's a bunch of history and a bunch of things that we've tried in the past that have worked and some that have failed." And I want to download that knowledge into a system, not necessarily update the weights, but have that encoded in some way where it can be then used to inform the runtime tasks that I want to give it. And I think state space models are shaping up to be a good solution there.

Juan Sequeda: (1:47:50)

If I abstract what you just said, you're giving your body of knowledge, things that you know that work, that don't work. Right now, how are you representing that? Well, as text or whatever. I think all these things should be structured in a graph, in a knowledge graph. But you're giving your knowledge, something you have, and you're going to pass that on to these other foundation models and you're saying, "Hey, let's go do something together." Right? So there's that. You have these foundation models, you have your knowledge, you want to be able to integrate, combine that. And how are you going to go do that? Where are the integration points going to be? That's where we're going to find out what's the best one, what does "best" even mean? I don't know. We're figuring this out. And then you'll be able to combine that. And that's how you're going to deal with the privacy issues. That's what I find really exciting. You're combining external and internal things together, and that's what's going to make it super powerful.

Nathan Labenz: (1:48:43)

You said a couple times along the way, the "no BS" take on it, and I know that's a theme of your own podcast. Do you want to tell us where people can hear more from you if they want to go deeper down the data rabbit hole with you?

Juan Sequeda: (1:48:58)

Yeah. So in addition to the work I do at data.world, together with my cohost, Tim Gasper, who's our chief customer officer, we host Catalog and Cocktails, the honest, no-BS, non-salesy data podcast. We've been at it for, I think, our fourth year now. We just started season seven, and we're at, I don't know, maybe episode 160 or something like that. We talk about all things enterprise data management, analytics, governance, all this stuff. Honest, no BS, because life is too short for BS and drama.

Nathan Labenz: (1:49:30)

Cool. I love it. Well, I'll check that out. I'm sure others will too. For now, Juan Sequeda, principal scientist, head of the AI Lab at Data.World, thank you for being part of the Cognitive Revolution.

It is both energizing and enlightening to hear why people listen and learn what they value about the show. So please don't hesitate to reach out via email at tcr@turpentine.co, or you can DM me on the social media platform of your choice.
