Synthetic Data with Alex Watson, Founder of Gretel AI

Nathan interviews Alex Watson on Gretel AI's synthetic data and privacy techniques, discussing the need for synthetic data and its applications.


Watch Episode Here


Video Description

In this episode, Nathan interviews Alex Watson, founder and CPO of Gretel AI, about the company's work in synthetic data. They discuss why we need synthetic data, Gretel's new pre-trained tabular LLM that creates synthetic data on a zero-shot basis, privacy techniques to prevent LLM memorization, and more. If you need an e-commerce platform, check out our sponsor Shopify: https://shopify.com/cognitive for a $1/month trial period.

SPONSORS:
Shopify is the global commerce platform that helps you sell at every stage of your business. Shopify powers 10% of ALL eCommerce in the US. And Shopify's the global force behind Allbirds, Rothy's, and Brooklinen, and millions of other entrepreneurs across 175 countries. From their all-in-one e-commerce platform to their in-person POS system – wherever and whatever you're selling, Shopify's got you covered. With free Shopify Magic, sell more with less effort by whipping up captivating content that converts – from blog posts to product descriptions – using AI. Sign up for a $1/month trial period: https://shopify.com/cognitive

With the onset of AI, it’s time to upgrade to the next generation of the cloud: Oracle Cloud Infrastructure. OCI is a single platform for your infrastructure, database, application development, and AI needs. Train ML models on the cloud’s highest performing NVIDIA GPU clusters.
Do more and spend less, like Uber, 8x8, and Databricks Mosaic. Take a FREE test drive of OCI at https://oracle.com/cognitive

NetSuite has 25 years of providing financial software for all your business needs. More than 36,000 businesses have already upgraded to NetSuite by Oracle, gaining visibility and control over their financials, inventory, HR, eCommerce, and more. If you're looking for an ERP platform ✅ head to NetSuite: http://netsuite.com/cognitive and download your own customized KPI checklist.

Omneky is an omnichannel creative generation platform that lets you launch hundreds of thousands of ad iterations that actually work, customized across all platforms, with the click of a button. Omneky combines generative AI and real-time advertising data. Mention "Cog Rev" for 10% off.

X/SOCIAL:
@labenz (Nathan)
@AlexWatson405 (Alex)
@Gretel_AI
@CogRev_Podcast

TIMESTAMPS:
(00:00:00) - Intro to Alex Watson and Gretel AI's focus on helping build better data
(00:03:02) - Origins of the company name Gretel AI and initial vision around enabling data sharing while protecting privacy
(00:05:16) - Alex's background in data privacy and compliance from his previous startup Harvest AI, acquired by AWS
(00:06:37) - Early experimentation with language models in 2020 to recreate data distributions and improve ML model performance
(00:07:24) - Using synthetic data to create additional examples and improve detection of rare diseases
(00:12:50) - Why use synthetic data?
(00:17:02) - Sponsors: Shopify | Omneky
(00:19:00) - Training models to recreate real-world data distributions and using validators to detect unrealistic outputs
(00:21:30) - Generating tabular data row-by-row with transformers vs token-by-token with language models
(00:24:40) - Pre-training the Gretel tabular LLM on diverse internet data sets to learn good data
(00:26:27) - Challenges of limited context window size relative to full tabular data sets
(00:30:40) - Sponsors: Oracle | NetSuite
(00:34:00) - Using an agent planning architecture to break down large data generation requests
(00:37:40) - Having the agent determine when to use code vs the LLM for different parts of the data
(00:39:41) - Example use case of adapting models with synthetic data samples for long-tail cases
(00:43:00) - Using reinforcement learning to intentionally generate more diverse and representative synthetic data
(00:46:24) - AI Regulation: Biden’s Executive Order
(00:48:20) - The importance of alignment checks and controls while still providing model openness and flexibility
(00:51:16) - The potential of efficient, lightweight models compared to massive LLMs like GPT-4
(00:56:00) - Analogizing model specialization to specialized parts of the brain rather than ever-larger general models
(00:59:22) - Stochastic parroting vs reasoning
(01:02:18) - Focusing on solving data problems for users and iterating based on their feedback
(01:06:04) - Using differential privacy techniques to prevent memorization and exposure of private data
(01:14:37) - Adding noise during training to blur memorization while still allowing model convergence
(01:18:42) - Optimism that synthetic data quality issues reflect details not fully understood yet vs inherent problems


The Cognitive Revolution is brought to you by the Turpentine Media network.
Producer: Vivian Meng
Executive Producers: Amelia Salyers, and Erik Torenberg
Editor: Graham Bessellieu
For inquiries about guests or sponsoring the podcast, please email vivian@turpentine.co



Full Transcript


Alex Watson: 0:00 Your data is messy, it has gaps in it. I can't create new additional examples. It's too expensive, or there's no way to go back to it. So we really focused our efforts on, first and foremost, helping you build better data. That's been the guiding light. That's what we're really aiming for. No LLM today can generate a 100,000 or 1,000,000 row dataset. So the first purpose of the agent was interpreting that user query that's coming in, and then figuring out how to divide it up into a set of smaller problems that the LLM can work on one problem at a time. The promise of a really lightweight, really fast model shows the power you can have by taking a domain-specific dataset or task you have and doing something meaningful without having to do something at GPT-4 scale.

Nathan Labenz: 0:46 Hello, and welcome to the Cognitive Revolution, where we interview visionary researchers, entrepreneurs, and builders working on the frontier of artificial intelligence. Each week, we'll explore their revolutionary ideas and together we'll build a picture of how AI technology will transform work, life, and society in the coming years. I'm Nathan Labenz, joined by my co-host, Erik Torenberg. Hello, and welcome back to the Cognitive Revolution. Today, my guest is Alex Watson, founder and chief product officer at Gretel AI, the synthetic data platform for developers. Synthetic data is a fascinating topic. Since the early days of deep learning, it's been well known that training computer vision models on a mix of original and programmatically altered and degraded images ultimately improves model performance. It seems that learning the concepts through the noise boosts robustness to the random unseen oddities that models inevitably encounter in the wild. And more recently, dozens, maybe even hundreds of papers have explored how LLM-generated data can be used to improve training sets and ultimately model performance on a wide range of problems. Yet at the same time, some research results and many observers of the evolution of the Internet in general have cast doubt on just how much synthetic data the system can absorb before models begin to lose touch with their real world origins or otherwise degrade. With these questions in mind, I reached out to Alex, who's been building a business on synthetic tabular data generation since 2020 and who proved to be an amazing guide to this domain. While synthetic data might sound like a niche topic, I think this conversation will be of general interest. We started with a discussion of why we need synthetic data, how Gretel has trained specialist models to maintain realism while also preserving privacy in creating it, and how we can be confident that we can trust this data for analysis, testing, and, yes, AI model training. Along the way, we also explored the trade-offs between statistical realism and social manners, the impact of LLMs on Gretel's business, and the new pre-trained tabular LLM that they've recently introduced to help create synthetic data on a zero-shot basis for a wide range of data types and scenarios. We even took a detour into AI regulation in the wake of the recent Biden White House executive order and the UK AI Safety Summit. This episode is a great example of why I love making this show. I learned a ton in the preparation and had a lot of fun with the conversation, and I think you will too. If so, I always appreciate it when listeners share the show with their friends. And, of course, we invite your feedback via our email at tcr@turpentine.co or via your favorite social network. For now, I hope you enjoy this conversation with Alex Watson of the synthetic data company Gretel AI. Alex Watson, welcome to the Cognitive Revolution.

Alex Watson: 3:38 Appreciate it. Thanks, Nathan. Excited to be here.

Nathan Labenz: 3:40 Yeah. This is gonna be great. So you are the founder and now chief product officer at this company, Gretel AI. I'd love to hear how you came up with that name, by the way. But what you guys do is synthetic data, and I'm just so interested to learn so much more about it. It's been really eye opening to explore the product a little bit. You do some of the best live product demos that I've seen. Your recent YouTube short, I thought, was really good.

Alex Watson: 4:08 I appreciate that. Thanks.

Nathan Labenz: 4:09 Yeah, I think this is going to be a ton of fun. So tell me, where'd Gretel come from? Give me just a quick backstory, and then let's talk about why we need synthetic data.

Alex Watson: 4:16 The original vision for Gretel was around a better way to make data that we can't make accessible, accessible. And it's evolved quite a bit; synthetic data has so many more capabilities and promises, which we've discovered over the past 3 or so years of running our business. But it was a reference to that, the digital breadcrumbs that we leave behind, and really an effort from our company to use synthetic data to enable data sharing at a scale that hasn't been possible before. Imagine hospitals sharing medical records, research institutions, financial companies sharing data in a way that doesn't compromise consumer privacy. And that's really where we started. So as we go through the technology we built and talk about differentially private training and things like that, you'll see some of that come through in our product. We've expanded that vision and that scope quite a bit, but we really started around privacy, around the idea of protecting individual privacy while enabling learning and data sharing at scale.

Nathan Labenz: 5:13 Yeah, it's awesome. So there are 2 big value drivers today. Obviously, the founding premise is privacy. And now there's this massive takeoff in AI, with so many people training things and trying to figure it out, and so the other big use case that we're seeing is improving the data that people are feeding into their training processes. So tell us a little bit about that one as well.

Alex Watson: 5:39 Yeah. Maybe I'll start with the history of how it happened, and it actually happened incredibly early in our company. For a brief history, prior to Gretel, I was a co-founder of a security startup called Harvest AI. We built products that helped customers scan for and detect important data in the cloud. We ended up getting acquired by AWS in 2016; we had gone out for a Series A raise and got approached about launching that service as an AWS service. So I was a GM there for about 4 years for Amazon Macie, which people used to scan the AWS cloud for important data, and saw that even the biggest, most cloud native, incredible data companies struggled with enabling internal access to data. The Pinterests of the world, the Airbnbs of the world, and things like that. So you saw what a problem this was at scale, and also the power of when you can make data accessible. At AWS, we had, at the time, a 500-person compliance team that could work wonders for making data accessible. So we started out with that privacy thing. Our first open source example we released in 2020, actually, I think about a week before the pandemic hit. It was an open source capability where, essentially, we used a language model. This is 2020, so we weren't using transformers at the time. We were using an LSTM, and we had started to partner with the Google TensorFlow team around a technology called DP-SGD, which enables you to train models with differential privacy so you can make sure they won't memorize secrets. But one of the early features that we had was the ability, just like we all do today, to prompt the machine learning model and ask it to create something new. So our first real experiment was saying, can a language model like an LSTM, instead of learning a language and text, learn to recreate the distributions inside of a dataset? So we really started focusing on tabular data around 2020, and that can be mixed numeric, categorical, text data, anything in between. And then we had the ability to prompt the model where you could give it a subset of those features. Like, given a zip code and an ethnicity and a date, generate the rest of this record for me. Very early in our journey, I think the first time we had this was working with some researchers at UCI, the University of California, Irvine. They were working with a rare disease dataset that was highly imbalanced. So you have thousands of patients, but the people inside that dataset who had the really rare disease were in the tens to twenties. So the question was, can we address some of the representation bias here, essentially boost that minority class? And if we do that, can we improve the detection for this disease? So essentially, the idea is using synthetic data to create additional labeled examples when they weren't able to go back and recreate their experiment or their collection, and asking whether that dataset can be used to improve downstream machine learning training. The idea is that it introduces new examples that weren't in the training data, and that will help the machine learning model. And we had a lot of success there. Since that point, I think we've seen more and more focus. Fast forwarding to today, and I'm happy to talk about where Gretel is today and what we're seeing, but it's about fifty-fifty. One value driver is safe sharing of synthetic data, where we can create data that has up to mathematically provable privacy guarantees. 
And the other is where we're saying, hey, how do we improve machine learning datasets? This can be tabular data for fraud detection or for ad recommendation systems. It can even be text data. And there's such cool research coming out recently to support that, where an LLM is essentially used to create additional diverse examples, as in the Microsoft Phi paper.
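To make the conditional-completion idea concrete, here is a minimal editorial sketch, not Gretel's actual API: seed a few fields (here just the label), let a generator fill in the rest, and use that to boost a rare class. The stand-in generator simply resamples fields from the existing minority rows; a real pipeline would call a trained tabular model instead.

```python
# Hedged sketch: boost a rare class by conditioning a "generator" on seed fields.
# The generator below is a naive stand-in that borrows remaining fields from the
# minority-class rows themselves; a real setup would prompt a trained tabular model.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

df = pd.DataFrame({
    "zip_code":  ["92617", "92617", "92614", "90210", "92617", "92614"],
    "age":       [34, 61, 47, 29, 58, 52],
    "diagnosis": ["healthy", "rare_disease", "healthy", "healthy", "healthy", "rare_disease"],
})

def complete_record(seed: dict, pool: pd.DataFrame) -> dict:
    """Stand-in generator: fill every field not in `seed` by borrowing from `pool`."""
    donor = pool.sample(1, random_state=rng).iloc[0]
    return {col: seed.get(col, donor[col]) for col in pool.columns}

minority = df[df["diagnosis"] == "rare_disease"]
n_needed = (df["diagnosis"] == "healthy").sum() - len(minority)

synthetic = pd.DataFrame([
    complete_record({"diagnosis": "rare_disease"}, minority)  # condition on the label
    for _ in range(n_needed)
])

balanced = pd.concat([df, synthetic], ignore_index=True)
print(balanced["diagnosis"].value_counts())  # classes are now evenly represented
```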

Nathan Labenz: 9:24 Yeah. There are a lot of connections here. Right off the bat, I'm thinking curriculum learning. That's such a huge theme in my mind these days: the ability to get smarter about what data you feed into even the pretraining stage, filtering, enhancing, curating, boosting, so many different manipulations there. But this one is probably one of the most intuitively obvious. Especially when you think rare diseases, it's just not in there that much, and that makes it hard for gradient descent to reinforce, to reward the learning of it. So boost it up a little bit, and next thing you know, you're getting better performance. So many opportunities like that. And those were the light bulbs going off in my mind as an application builder today too. I was like, boy, I see so many quick patches in my future for rare cases that I want to handle better. So I think that's super interesting. Just going back to the Amazon thing as well for a second, because I do love to contrast recent-history approaches. You're scanning for important data. I imagine the 5-years-ago version of that was just a whole Swiss army knife of different explicit techniques, regular expressions and a handful of classifiers. What did that thing look like? And now today, I'm like, maybe I'd use Claude Instant and clean out a lot of that old code. What do you think? Is that a reasonable intuition?

Alex Watson: 11:05 Yeah, that's one of the reasons I am so immensely grateful for the LLM technologies and transformers that are out: there is a light at the end of the tunnel for people doing traditional NLP and NER, a better, more general way to do it. So I'm really excited about that. But you're exactly right. Going back to Macie, it uses a combination of traditional named entity recognition technologies, as well as, as you were saying, regular expressions and things like that to help identify any type of personal data that might exist in the cloud and label it so you knew where it was. And it would really take a look at it and say, is this exposed to the internet? Is this shared with outside organizations? And give you the visibility that you needed across your organization. The real goal was to enable developers to make decisions about what tools to use and use the best available tools, but also get the enterprise visibility necessary for that to happen. I think the big challenge we faced doing this at Amazon scale was that we went from a startup that had a couple of amazing customers to, the first week that we launched Macie, 6,000 customers. And we were doing named entity recognition at up to petabyte scale, so much of our time was focused on how you make even traditional ML technologies work at that scale. Part of the reason I'm so excited about technologies today is just the amount of specialization, or tuning anytime your data characteristics changed, that was required. And now, as much as all of us probably get annoyed with the need to prompt tune and do things like that, the promise of an LLM that can understand your natural language question and make that change for you automatically is really, really cool.

Nathan Labenz: 12:45 It's certainly a game changer in so many different respects. Coming back to the present and the synthetic data that's unfolding today, there are a number of use cases that you guys highlight in your product and your demos. I'd love to hear you talk through a few more beyond the boosting of the underrepresented set. One that jumped out to me, and I think really highlights the challenge, is insights. The idea that, I can just imagine, I've done a lot of data analysis in my time. I certainly hear why, at the corporate level, you don't want to be passing around the crown jewel dataset. I did some work with Rocket Mortgage, for example, and the care with which they maintain their customer data access to all that stuff is a serious effort, so you can't just be passing stuff around. That makes total sense. But then when you say, okay, well, super creative concept, instead of having to deal with all that, we'll just make fake stuff and use that instead. But insights, I was like, okay, boy, insights. It's—I'm going to need some real theory to start to trust that you can make fake data that is enough like whatever, and that's obviously something that probably most people are going to struggle to wrap their heads around. Well, how do you define that, prove that, whatever, such that I can actually do my pivot tables on this and trust that what I'm getting is making any sense? I've been thinking about that a lot, and I'm really, I've got some guesses, but I'm really interested to hear more about the provability of how this stuff works.

Alex Watson: 14:24 Yeah. Our approach, and I think the one that seems to be gathering a fair amount of steam in the synthetic data world, is to train a model. Of course, we're minimizing the loss function as we're training it and doing the best we can. But that doesn't tell you how that model is going to capture the real world distributions that you care about and the ability to apply it. So for us, regardless of the modality, if it's text, if it's tabular, if it's time series, it really starts with having the model master the ability to recreate data matching the same distribution as the real world data it was trained on. And if you can have confidence in that, you can start to alter the distribution for whatever your task is. So how do we do that? We train the model at each iteration, and really at the end, we sample a bunch of data from the model, about a one-to-one equivalent of the real world data. And then we essentially, from a statistical perspective, throw the kitchen sink at it. We have two ways of measuring. One, I would say, is meant to be as objective as possible, and the other is meant to be task specific. So we have something called our synthetic quality score. What it's doing is easiest to walk through from a tabular perspective; we actually have similar scores for text and time series as well. We sample a bunch of data from the model, we look at pairwise correlations, and that creates part of a composite score. We look at the per field distribution. We even do PCA distributions for each field and then do a distance metric between the real world data and the synthetic data. And the idea is to give you a one through 100 score that you can look at and you can reason about and say, if this is above 80, we expect it to work well for the types of machine learning use cases that most people use synthetic data for. If it's below that, maybe that works for your use case. Maybe your use case is just testing or something like that. But as you were saying earlier, you don't want to create pivot tables on that. So really, we start with trying to give you that sense of confidence. We've added in the ability, really just after seeing a lot of customers do this, to automatically test a downstream task for your data as well. So after the model's done training, we can run a regression or a classification task or things like that automatically within our platform. We have a lot of customers that use Vertex or SageMaker or things like that to run this as well. So we just built it into the product, so not everyone has to write code. But I think a mixture of that completely objective, non-task-specific score, which is a good general indicator, and then also that understanding of your task, what you want to do with the data, and making sure it conforms to those expectations feels like the way to get that sense of confidence you need.
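As an illustration of the kind of composite scoring Alex describes, here is a rough sketch of our own, not Gretel's actual Synthetic Quality Score: it compares per-field distributions and pairwise correlations between real and synthetic tables and maps the result to a 0-100 scale. The PCA-based distance he mentions is omitted for brevity, and the equal weighting is an assumption.

```python
# Hedged sketch of a composite "synthetic quality score" (illustrative only).
import numpy as np
import pandas as pd

def field_distribution_score(real: pd.Series, synth: pd.Series, bins: int = 20) -> float:
    """1 - total-variation distance between binned distributions (1.0 = identical)."""
    lo, hi = min(real.min(), synth.min()), max(real.max(), synth.max())
    edges = np.linspace(lo, hi, bins + 1)
    p, _ = np.histogram(real, bins=edges)
    q, _ = np.histogram(synth, bins=edges)
    p, q = p / p.sum(), q / q.sum()
    return 1.0 - 0.5 * np.abs(p - q).sum()

def correlation_score(real: pd.DataFrame, synth: pd.DataFrame) -> float:
    """1 - mean absolute difference between pairwise Pearson correlations."""
    diff = (real.corr() - synth.corr()).abs()
    return 1.0 - diff.values[np.triu_indices(len(diff), k=1)].mean()

def synthetic_quality_score(real: pd.DataFrame, synth: pd.DataFrame) -> float:
    per_field = np.mean([field_distribution_score(real[c], synth[c]) for c in real.columns])
    return 100.0 * (0.5 * per_field + 0.5 * correlation_score(real, synth))

rng = np.random.default_rng(0)
real = pd.DataFrame({"income": rng.lognormal(10, 0.5, 5000), "age": rng.normal(45, 12, 5000)})
synth = pd.DataFrame({"income": rng.lognormal(10, 0.55, 5000), "age": rng.normal(44, 13, 5000)})
print(f"quality score: {synthetic_quality_score(real, synth):.1f} / 100")
```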

Nathan Labenz: 17:01 Hey, we'll continue our interview in a moment after a word from our sponsors. Yeah. Interesting. So could we unpack the loss function a little bit more? Because I'm wondering about the relationship. That's all pretty quantitative stuff. It's a code base, ultimately, a test suite that you can execute on any dataset that comes through and say, we're going to characterize what you gave us, and then we're going to characterize what we generated and show you that that hopefully lines up distribution-wise. On the generation side, and it's probably important to keep in mind for folks because we're all used to one-token-at-a-time language models, I'm very much thinking of your latest TabLLM that you demoed, but it might be worth distinguishing too between that latest thing and the more purpose-specific models that you have. I'm imagining for the new big one, it seems like there is a really natural and insightful thing here, maybe, for people, where there's a decoupling of the prediction, which is the distribution, and then the sampling from that distribution. I think this is something that people maybe don't conceptualize super rigorously, but the task that you have helps me, I think, at least crystallize it a little bit. So most people have a general sense: at the end of the language model, you're putting a prediction onto every token. And then with your temperature setting, and this is the experience that people are most familiar with if they're an AI engineer or whatever, you can turn that temperature down to zero and get the most likely token, or you can turn that temperature up and randomly select from those probability distributions. But in the practical experience of it, we really only see one token. And in the training, also, there is a ground truth text document that is firing one on the actual token and zero on all the other tokens. So it strikes me that your situation is a bit different here, where you can potentially define the target as the distribution and just directly optimize to conform the distribution of the predictions to the distribution that you've characterized from the data. And then the sampling from that, I mean, it's the same fundamental thing. But the difference between that one token being right and wanting to generate the actual distribution seems like something that was really helpful for me to wrap my head around over the last couple of days.
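A tiny numerical sketch of that decoupling (an editorial illustration, not anything from Gretel): the model's job is to output a distribution over outcomes; temperature only reshapes that distribution before you sample from it.

```python
# Hedged sketch: distribution prediction vs. temperature-controlled sampling.
import numpy as np

def sample_with_temperature(logits: np.ndarray, temperature: float, rng) -> int:
    """Turn logits into a distribution (optionally reshaped by temperature) and sample."""
    if temperature == 0:                       # greedy decoding: the modal prediction
        return int(np.argmax(logits))
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())      # numerically stable softmax
    probs /= probs.sum()
    return int(rng.choice(len(logits), p=probs))

rng = np.random.default_rng(0)
logits = np.log(np.array([0.70, 0.25, 0.04, 0.01]))   # the model's learned distribution

print("T=0  ->", sample_with_temperature(logits, 0.0, rng), "(always the modal token)")
for t in (1.0, 2.0):
    draws = [sample_with_temperature(logits, t, rng) for _ in range(10_000)]
    print(f"T={t} ->", np.bincount(draws, minlength=4) / 10_000)
# At T=1 the empirical frequencies approximate [0.70, 0.25, 0.04, 0.01];
# at T=2 the distribution flattens, boosting the rarer outcomes.
```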

Alex Watson: 19:39 Yeah. That's probably a byproduct of starting out working with tabular data. We were using a variety of models when we started: we started with LSTMs, we used GANs, we used diffusion models, and now, as you mentioned, with our TabLLM model, we use transformers. A byproduct of how we built our product is that we end up looking at the row level. So every time a row of data is generated, we examine everything. Similarly, if you're generating a sequence of new LLM instructions, for example, rather than looking at the per-token level, what we're going to look at is the per-line or per-record distribution. So essentially, we let the model generate everything. The first step during training is we're sampling and we're looking at it, but the same happens when you're using the model for inference. When you're asking the model for data to come out, there is the risk that the model is going to hallucinate or invent something new that no one wants to have happen. So we have a secondary level of validation; we call them, not very creatively, validators. Essentially, what a validator is doing is looking at all the outputs of the model and asking how different they are versus the original data that it was trained on. And you have the ability to filter out things that are too far outside of the distribution. The idea there with the tabular data was to make sure that we didn't invent anybody's age as 135 inside of a dataset. But it works really well for text data as well, for when the model goes off on a rant or invents something that's way outside of what it should be working with. You have the ability to filter that type of data out, and it helps you have more confidence that a generative model is going to give a usable response. Another cool thing is that with so much of the focus for synthetic data really being on creating machine learning training sets, you can't have someone looking at every record in a row and saying, yes, this is fine, this isn't, this is fine. So we've really focused on making sure that when we generate data at 1,000 or 100,000 records, or even a million records, you have confidence those records match your expectation. So I think that's another really neat thing that I see happening. To go back six months ago, there were so many questions about, I want to use this model, this LLM, for summarizing content on Reddit or things like that. And the risk was that it would summarize something that was off base and would be an inaccurate summarization. And I think technologies like what we built for text scoring, and there have been a few open source metrics released recently, really help you quickly check in and reason about a generative model's output in a way that would allow you to serve the results to customers without necessarily having to have a human look at it. So a nice quick AI check on data makes these models so much more usable.
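A minimal sketch of that kind of validator (illustrative only; Gretel's validators are surely more involved): learn plausible ranges from the training data and filter generated rows that fall outside them, such as an invented age of 135.

```python
# Hedged sketch of a range-based row validator for generated tabular data.
import pandas as pd

class RangeValidator:
    def __init__(self, train: pd.DataFrame, slack: float = 0.05):
        numeric = train.select_dtypes("number")
        span = numeric.max() - numeric.min()
        self.lo = numeric.min() - slack * span     # allow a little headroom beyond
        self.hi = numeric.max() + slack * span     # the observed training range

    def is_valid(self, row: pd.Series) -> bool:
        cols = self.lo.index
        return bool(((row[cols] >= self.lo) & (row[cols] <= self.hi)).all())

train = pd.DataFrame({"age": [23, 41, 67, 35, 88], "income": [40e3, 72e3, 55e3, 61e3, 30e3]})
generated = pd.DataFrame({"age": [29, 135, 50], "income": [48e3, 52e3, 2e7]})

validator = RangeValidator(train)
kept = generated[generated.apply(validator.is_valid, axis=1)]
print(kept)   # the age-135 row and the $20M-income row are filtered out
```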

Nathan Labenz: 22:28 How do you do the pre-training? And how big of a foundation model is this? Again, I'm so fascinated with the TabLLM. I'm imagining that you've gone out and just assembled every public dataset you can and in a sense taught a statistical world model to this thing. So it's supposed to have all the right priors, basically. How do you go about creating and validating that strong baseline?

Alex Watson: 22:57 Yeah. So for some background, for listeners here, we are about to release a model. We're calling it TabLLM, Tabular LLM. It's an agent planning and execution architecture built to help people work with tabular data using natural language queries. Really at the core of that is both the agent, which is making a decision about whether to use an LLM to generate data or whether to use one of our tools and write code to generate data to serve your response, and the actual LLM model that we have fine-tuned on datasets from across the internet. So it's one of the first examples that you'll see of an LLM that's meant to work with tabular data. Tabular data can be text, time series, numerical, categorical, or any combination of those. The initial approach we took, and I think this will be a constant evolution for us, was exactly like you mentioned, Nathan: crawl the internet, specifically crawl GitHub, find any accessible datasets there, Kaggle, things like that, anything with an open source license. One area we were particularly lucky with: I was noticing that a lot of times machine learning papers will reference datasets right in the README. So there's really great data linked inside READMEs there, and we could pull down the license and really understand if it was usable or not. But the idea was to train an LLM that would be used for a data generation task on what good data looks like. And something interesting is that while we all feel that LLMs today are trained, and it's mostly accurate, on almost all of the content that is on the internet, if you're working with an OpenAI model or a PaLM or even a Llama model, these models really aren't trained much yet on tabular data. And tabular data also introduces some interesting challenges in the sense that when you look at the context windows that are available to LLMs today, which on a great LLM is, let's say, 16k tokens, it doesn't translate into a lot of rows in a typical tabular dataset. So 16,000 tokens, assuming 50 tokens per row, is going to give you about 320 rows, and I think most of us work with datasets much bigger than that. One of the things that we noticed as we started working with LLMs and asking them to generate tabular data is that the power of asking an LLM to generate tabular data is, one, they are, just as a byproduct of how they work, really good with time series type data. There's been some cool research about that recently. Second, it allows you to apply a global level of knowledge to your dataset. So one thing I think that's really resonated with our users on the platform is realizing that your dataset is awesome. Everybody's dataset is unique and really cool, but it's also in some way limited. You don't have enough data. You never have enough of the examples. Nathan, you were mentioning the long tail of data that you deal with and finding a more systematic way to work with it. So the idea of applying a model that has seen most of the datasets on the internet to that problem and saying, can you help me create some new meaningful variations in the data to help a downstream model generalize, is really powerful. So that's where we started. For the initial TabLLM model, we haven't done anything super clever with how to encode or model numeric distributions. Rather, we just treat everything as text, and it goes through there. 
As I mentioned earlier, our first approach was crawl the internet and train it on everything. And I think very similar to other research and academic work we see right now, I think a more curated, highly diverse set of high quality examples is the way to go. So we're seeing our team really work on that. And some of the opportunities here is that even the GPT-4s of the world, when they've seen tabular datasets, it's usually a table on Wikipedia or something like that. So it's a couple hundred rows at most. The LLMs have not learned that it's important. Sometimes the relationships across the dataset might be thousands of rows or hundreds of thousands of rows. So that's a real neat application we're looking at right now: what if we train LLMs on a much larger context length and much more data? How good of a job do they do learning the subtle insights and distributions of the data that'll help improve ML generation when you're using the model?
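The arithmetic behind that context-window constraint, using the figures from the conversation (a 16k-token context and an assumed 50 tokens per row), also shows why the agent in the next section has to batch its work:

```python
# Back-of-the-envelope version of the context-window limit discussed above.
context_tokens = 16_000          # e.g. a 16k-context model
tokens_per_row = 50              # assumption: a typical mixed-type row
rows_per_prompt = context_tokens // tokens_per_row
print(rows_per_prompt)           # 320 rows -- far smaller than most real tables

target_rows = 1_000_000
batches_needed = -(-target_rows // rows_per_prompt)   # ceiling division
print(batches_needed)            # 3125 generation calls for a million-row dataset
```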

Nathan Labenz: 27:26 Yeah. Quite a bit. I'd love to hear a little bit more about the agent structure, because I'm imagining you said, okay, you generate one row at a time. For one thing, the order really matters there. I wonder if you have a systematic approach to reordering fields, because there's been some interesting research lately that, from the language model's perspective, A implies B does not imply that B implies A. And then I guess there's a sequential probabilistic evaluation, where you'd be saying, okay, once at least some amount of pre-training has been done, if I were to give you the ZIP code as the first field, then you would expect to see reasonable demographics back just based on that ZIP code. But then, depending on the first variable that you predicted, you would have a very different conditional distribution for the subsequent variables, which are correlated in all sorts of varying ways. So you're doing an almost Markov process randomly down the Plinko board of possibility and then going back and evaluating each token for its conditional accuracy, right, or conditional real representation. Is that conceptually right? Hey, we'll continue our interview in a moment after a word from our sponsors.

Alex Watson: 28:51 One small modification I would make to that is that we have found you can sample quite a bit of data at a time from a transformer LLM based model for tabular data, up to the level that the LLM is capable of working with. Let me give an example there. Let's say you're working with Llama 2, or you're working with OpenAI's 16k context window model. It might be capable of generating all that data, but if it's never learned that more than a couple thousand tokens are relevant to a dataset, you're going to start to lose some efficiency as it generates more and more data. So what we do is we sample from our trained model, up to as many tokens as we can at a time, and then we evaluate it row by row. And with current LLM technologies, the agent serves a couple of purposes. The first one, the most obvious one, is that no LLM today can generate a 100,000 or million row dataset, or go in and edit your dataset, which is a really popular use case for us right now. If I want to add new fields, if I want to summarize product reviews, if I even want to just search for anomalies across my data, we've got to be able to process data way bigger than what an LLM can handle in a single batch. So the first purpose of the agent really is to take a complex user query, for example, create a demo dataset with a spike in sales activity in November, I want a million rows. Or if you're editing data, convert this unit from Celsius to Fahrenheit across my entire data warehouse, or things like that. The agent's first goal is interpreting that user query that's coming in, and then figuring out how to divide it up into a set of smaller problems that the LLM can work on one problem at a time. A good analogy in the NLP world would be if you asked GPT-4 to write a book for you: you would probably get a really short book, and you want a novel that's got several hundred pages. If you were able to take that problem, what someone's asking for, and divide it up into smaller problems, write a paragraph or a chapter at a time, you could see how an agent planning and execution based architecture would say, okay, first step, I need to write the introduction. Next step, I've got to have character growth and start to work on the character arc. And finally, I need the conclusion, and things like that. It can divide those up into smaller problems. That's the approach we're taking with datasets, either editing or creation, where we've got something that is breaking the request down into a step by step problem that a smaller model, in this case our data generation LLM, can work with, and start generating high quality data for that particular window.
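Here is an illustrative sketch of that planning step (our own toy version, not Gretel's agent): split a request that no single LLM call can satisfy into per-batch sub-tasks that a data-generation model can handle one at a time.

```python
# Hedged sketch: a planner that divides a large generation request into LLM-sized chunks.
from dataclasses import dataclass

@dataclass
class SubTask:
    batch_index: int
    n_rows: int
    instruction: str

def plan_generation(user_query: str, total_rows: int, rows_per_call: int = 320) -> list[SubTask]:
    """Break one big request into sub-tasks sized for a single model call."""
    tasks, remaining, i = [], total_rows, 0
    while remaining > 0:
        n = min(rows_per_call, remaining)
        tasks.append(SubTask(i, n, f"{user_query} (batch {i}: generate {n} rows, "
                                   f"continuing seamlessly from prior batches)"))
        remaining -= n
        i += 1
    return tasks

plan = plan_generation(
    "Create a demo e-commerce sales dataset with a spike in November activity",
    total_rows=1_000_000,
)
print(len(plan), "sub-tasks; first:", plan[0])
```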

Nathan Labenz: 31:33 So is that more of an instruct type model that is creating code as policy outputs and then a dedicated actual data point generation model that is receiving those commands and doing the cell by cell?

Alex Watson: 31:52 Exactly.

Nathan Labenz: 31:53 That allows you to put language models in too, right? I mean, I saw one of the demos was reviews of the product, and obviously, that's a pretty different situation from the tabular data. I assume that's a little more random somehow, or it seems like it would be harder to give a representativeness guarantee on customer reviews?

Alex Watson: 32:14 We've got some research, which I'll link over to you, on how we assess the quality of text based on what you're looking for. But so often, datasets are mixed. Imagine EHR data, where you've got doctor's notes mixed with initial observations from patients as they come in. I think that happens quite a bit, so we try to learn across all of them. And one of the interesting things is that you don't necessarily want your LLM to do everything. That's maybe the other part of the agent planning based architecture. If you were asking for an incrementing ID or a Fahrenheit conversion, we've got a neat example where we're doing maybe a high school physics level problem. The LLM will approximate, but you don't want it to approximate your answers; you want the real answer. So the other part of making synthetic data with this Tabular LLM work at scale, I think, is having the LLM just recognize which areas are best to calculate or compute, and doing that for you automatically. I think we all see that a little bit if you're experimenting with GPT-4 or ChatGPT and you ask it to help you work on a dataset: sometimes it'll give you a dataset back, sometimes it will give you back code that you could use to solve the problem. And really, that's the type of stuff that we are trying to streamline. We're essentially applying the agent to realize when something should be a Fahrenheit to Celsius conversion, where you subtract 32 and multiply by 5/9 and get the exact right answer. You don't have to have an LLM figure that out. So the first step is to look at that user query and figure out, given the available tools that I have, can I solve this problem with code? If so, execute that code and get it into the dataset so you have high confidence in the answer. But other things require the level of knowledge or intuition that an LLM would have, like summarizing a review as positive or negative, things that require you to look across fields and understand natural language text, and there we use the LLM to fill in that data.
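A sketch of that "use code when you can, the LLM only when you must" routing (hypothetical names and column plan; not Gretel's implementation): deterministic transforms get exact code, while judgment calls are routed to a language model.

```python
# Hedged sketch: per-column dispatch between exact code and an LLM call.
def fahrenheit_to_celsius(f: float) -> float:
    return (f - 32) * 5.0 / 9.0          # exact, no approximation needed

def classify_sentiment_with_llm(text: str) -> str:
    # Placeholder: a real system would prompt a language model here.
    raise NotImplementedError("route to LLM")

COLUMN_PLAN = {
    "order_id":    ("code", lambda i, row: i + 1),                                  # incrementing ID
    "temp_c":      ("code", lambda i, row: fahrenheit_to_celsius(row["temp_f"])),   # unit conversion
    "review_tone": ("llm",  lambda i, row: classify_sentiment_with_llm(row["review"])),
}

def fill_row(i: int, row: dict) -> dict:
    out = dict(row)
    for col, (tool, fn) in COLUMN_PLAN.items():
        if tool == "code":               # cheap, exact, verifiable
            out[col] = fn(i, row)
        # "llm" columns would be batched and sent to the generation model instead
    return out

print(fill_row(0, {"temp_f": 98.6, "review": "Arrived late but works great."}))
# -> order_id 1, temp_c 37.0, while review_tone is left for the LLM pass
```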

Nathan Labenz: 34:09 It comes together very nicely, I think, in the product demo. I'm definitely excited to spend a little more time with it. I do think it'll be really helpful. And it's probably a good idea also to contrast this, as you started to do a little bit, with trying to use GPT-4 or certainly any of the RLHF'd models. I think you have just very fundamental problems there, and that matters even for a project like mine. Where I'm thinking of applying this immediately is we have the script writing model, and its job is to write a video script for a given user who comes to us often naively. We grab some content off their website or whatever and figure out who they are. So it's extremely diverse. And you might say extremely sparse. We have healthy usage, but we're not that big. It's a big world out there. So especially internationally now, different languages, just all sorts of long tail stuff that we have not previously put into our dataset, but that can at any time come our way. And then I think, okay. If I want to do some patching of my fine-tuned 3.5, which is currently the state of the art thing that can best nail our task, then 1 example is probably not quite enough to get it to really learn the pattern I want it to learn. 5 to 10 in my broader mix of a few hundred samples probably is. But I want to create something, and if it's an unfamiliar area, it's very hard to even know what to do. You show some examples where it's like France. And I'm like, oh god. Excuse my French, I don't even know the structure of the postal system in France, let alone how to make semi-realistic examples that I would want to throw into 3.5 fine-tuning. So if I'm making this up totally on my own, I'm just destined for underperformance from garbage-in problems. And then if I ask GPT-4, it's going to be so RLHF'd, it mode collapses on things like this so often, it answers 42 or whatever, or 97, way too high a percentage of the time when you ask for a random number. Here, I just don't trust it at all for that sort of representativeness. And I think OpenAI would readily agree that, yeah, you should not use it for that. Obviously, it's been trained for a very different purpose. So that to me is exciting. I think it gives you guys a real different position in the market that is so distinct from the mainline AI assistants. I think that's pretty cool.

Alex Watson: 37:01 The flip side of RLHF, which I think is so interesting: we've done some initial work that we published on using RLHF to reduce a model's propensity to talk about stuff that it shouldn't, particularly to not return PII in the data that comes out of the model, which is a big enterprise use case. But I think what you talked about also is that sometimes you want that, sometimes you need it. We have customers, for example, that are generating, no joke here, spear phishing emails to test their own spear phishing detection system. So there are times that you need to go against what the model has been trained to do. Another interesting piece that came to mind when you were talking about RLHF: I think the enterprise use cases are a little different than consumer, where in some cases you need the ability to turn off some of the guardrails, because you need to create a level of diversity or talk about things that an RLHF'd model probably just doesn't want to talk about. A big use case for us, I'll give you another example, would be healthcare companies that are looking to create synthetic versions of patient medical records. A lot of models, when you're trying to augment your examples, will just refuse to do it, because they think you're talking about something that could be potentially harmful, but it's really being used for a good use case. So I think there's definitely a case for times where you want to take off some of the guardrails, and the set of expectations that organizations might have is just much different than what developer teams might have, which is much different than what a consumer might have if they're using ChatGPT or something like that. One that I think is so cool, that you really start to notice with tabular data, a big application and the most basic thing we see with synthetic data: people always start by generating that mock dataset they've had to generate for a demo, or a UI, or something like that. And inside that dataset, you're going to have names, you're going to have addresses, genders, a lot of protected class type stuff. And what you'll notice is the models have a tendency to return one type of data. So you've got names that seem very consistent, that probably represent the training set, demographics that are across the United States, or things like that. One cool application of RLHF that we've been experimenting with is actually training the model to be more diverse in the results that it gives back. So if I ask for a set of demographics for a particular zip code or city or things like that, having the model return a more diverse and aligned set of demographics than what a model might do off the shelf is, I think, pretty powerful. But you want the ability to control that. Sometimes you want real data. Sometimes you want ethically aligned data. Those are both really important. I think the irony of the whole thing is RLHF can be really good for both of them. It's a tool, right? It's just what direction you point the algorithm and the loss function it's solving.

Nathan Labenz: 39:55 This goes back to my decoupling of the distribution prediction and then the sampling from it. It seems like you could achieve that largely with just temperature. Like, if you said we're going to make our core model, and its logits or percentage-weight outputs, as true to real world data as we can, then you could slide your temperature from 0 and be like: at 0, you get the modal prediction; at 1 or whatever, you get the normally represented real world distribution; and at 2, you get the version with minorities overrepresented on any and all dimensions. But it sounds like you're approaching that in a different way. So is what I'm saying not workable for some reason? And why is there more complexity?

Alex Watson: 40:44 Even with temperature, when you turn it up relatively high, you start to get crazy data, and for a really imbalanced dataset, even then the temperature isn't going to introduce something that is like 1% of the data very often, or at least not to the level that you're wanting it to. So this is a technique we can use to force that to happen.
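A quick numerical check of that point (editorial illustration): temperature gives only blunt control over how often a 1% class shows up, and you can't dial it to a target ratio without distorting everything else at the same time.

```python
# Hedged sketch: how temperature reshapes a heavily imbalanced two-class distribution.
import numpy as np

p = np.array([0.99, 0.01])          # majority class vs. a 1% rare class
logits = np.log(p)

for t in (1.0, 2.0, 5.0):
    scaled = np.exp(logits / t)
    probs = scaled / scaled.sum()
    print(f"T={t}: rare class sampled {probs[1]:.1%} of the time")
# T=1.0 -> 1.0%, T=2.0 -> ~9.1%, T=5.0 -> ~28.5%: there's no knob that says
# "give me exactly the ratio my downstream task needs," and every other field
# gets noisier as the temperature rises.
```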

Nathan Labenz: 41:06 So this reinforcement learning is really a tool for saying, we really want you to create a more diverse, representative dataset. This is like the stock photography thing, where it's like, we're going to be intentional about this. But to do that, you have to apply even more than a standard technique.

Alex Watson: 41:22 I love the engineering blog at Pinterest; they had a really neat example of how the search results, if you're searching for pictures of wedding rings, for example, would bring back pictures of wedding rings with very diverse skin colors and things like that, which I think is a nice feature to have. Once again, you don't always want that. Often, what I've found, particularly in the machine learning or the tabular data space, is that the classes are incredibly imbalanced. Let me give you another example, even for things at scale. We work with a major social media company; they were impacted just like everybody else by the changes to third party ad tracking and things like that. So really, they're trying to make the best possible use out of the data that they have. And when you look at the ad recommendation problem, it's massively imbalanced. For every thousand people that you present an ad to, maybe 1 or 2 click, right? Hopefully better than that, but often that's the case that you're looking at. So you're trying to make the absolute most out of that data, but that data represents 0.01 something, in that range, I'm making these numbers up, but it's meant to be illustrative. So you need the ability to tell the model, this is really important, I want to learn from these particular features, this very imbalanced class, and then generate meaningful new variations to improve detection with this. And so that's where I'm coming from with the different, very strong techniques outside of altering temperature, where you want very explicit control over the model output, to make sure it meets your expectation because you've got a task at hand.

Nathan Labenz: 42:51 Well, control over models is definitely something we should all be striving for, in all aspects of AI. Definitely, in the big picture of worries about whether we're going to keep this whole AI technology wave under control in any number of ways. I think your situation here is also one of the more compelling cases I've ever heard for the need for the raw model that has the more accurate world model, even if it is sometimes not so pretty to look at. I'm not super bent out of shape about these issues, but I guess a lot of times I view the model developer's ability to control things as a canary in the coal mine. Like, if they can't prevent it from being offensive today, are they going to prevent it from following a build-a-bioweapon command tomorrow, in the most alarming scenario? But I certainly see your point that, hey, we want to have all these different dimensions of control, and even just to build stuff to test our detection systems, we've got to have data that's going to set off alarms. So that is all super interesting. I wonder, what role does conventional AI ethics have in the company, given all these use cases that you want to enable?

Alex Watson: 44:20 I have a very optimistic view of AI. And a lot of times I view synthetic data as a tool that could be used to improve alignment or ethics by someone who wanted to do that. To your point, it could also be used to do stuff that's harmful. So I think that's a real question we're going to be wrestling with as a community over the next couple of years. I'm a big fan of having the alignment checks and the warning flags everywhere, but, to the extent possible, giving people control over what the model does. And generally speaking, and I'm curious to hear your opinion on this one as well, I think I'm more of a fan of the open model, where it can be adapted for whatever particular use case you might have, versus moving to a more closed space where a very small group of very powerful companies have control over what the models can and can't return. There is no perfect answer there. But I would say I want to believe that people want to do the right thing. AI is advancing so incredibly fast, and from what we're seeing, I think the White House executive order that came out this week is a sign that people are paying attention to the right things. I'm also, and maybe this is a byproduct of being one of the smaller companies out there, we've got about 65 people at Gretel right now, curious to see how regulation will play into this world, where smaller companies that are innovating may not have 100 people on a regulatory or compliance team to help work on this yet. So I love the direction. I think something really important as we move forward is enabling competition and innovation while protecting people's privacy and protecting the use of AI across our ecosystem.

Nathan Labenz: 46:13 Yeah, no simple answers on all that. And the 100-plus-page order that seems to mostly be ordering another 10,000-plus pages of reports is certainly reflective of that. I also thought it was a pretty good first step. And at the highest level, I've been saying a lot lately that, as someone who does take big-picture AI risks pretty seriously, it's hard for me to imagine a much better situation for the overall game board to be in today than the one we actually have. At a minimum, we can say all the people at the big companies that are developing the most powerful systems are pretty serious-minded. And the most rogue one is Meta, and they're still more responsible than you could easily imagine people being if they just didn't care or thought the whole thing was totally ridiculous. So I think that's all a pretty good start. I like the FLOP threshold thing pretty well. I mean, you had said 500 billion tokens, right, as the pretraining base. So, if my intel is correct, GPT-4 is 10 trillion tokens, so 20 times as many tokens, and however many more parameters. It feels like you probably have 3 orders of magnitude between what your compute budget was for this and where even just the reporting threshold would kick in. It seems like you have plenty of room to run as a small company before you would hit any onerous regulation.

Alex Watson: 47:53 And that's one of the neat things: we don't have to, and we're not trying to, compete with GPT-4 to create a tabular data model at that scale. I think the promise, for other people building AI-powered applications right now, of a really lightweight, really fast model, like Microsoft's Textbooks Are All You Need paper using a billion-parameter model that is super fast on inference, super low on training cost, trained on a relatively diverse but small set of examples, shows the power of taking a domain-specific dataset or task you have and doing something meaningful without having to work at GPT-4 scale. So personally, I'm excited about that, because I think it's going to enable innovation from life sciences companies, fintech companies, AI video content creation companies, you name it, that can create small, efficient, fast models that do something really cool and really unique that the big models haven't done or can't do at the same level. And then there's the combination of the two: still leverage the big model where you need it. We use it for intent parsing, really understanding what type of query a user wants, and then we use our small model for speed. I'm excited about that because I think it enables people to experiment without needing, as you were saying, to train on 10 trillion tokens or something just so big that it becomes a barrier to entry.

Nathan Labenz: 49:30 Yeah. I think if we give it a little time, there are some really positive natural trends, because there are some ways where everybody's interests can be aligned. Generally speaking, the systems that worry me the most are the super general ones. Things that are designed and engineered for a narrow purpose seem inherently just a lot easier to keep under control. AlphaFold may be a world changer, but up until at least this week, I think it does a couple of things now, it previously did one main thing, and that's predicting protein structure, and you've got to fit that into a broader system. Lots of awesome examples like that. AlphaGo can play Go better than any human, but that's all it does. So I think that's all really good. And there is a vision for long-term AI safety as an ecology of small models; Eric Drexler has a manuscript on this, which he calls Comprehensive AI Services, that is a good early articulation of it.

Nathan Labenz: 50:38 Pretty prescient, actually, given it's like 5 years old already, I think. And his idea is just that: let's have narrow superhuman AI in everything, and then we don't necessarily need superhuman general AI, which might be hard for us to control. But right now, we're still figuring out all these techniques and how to make things work and what the curriculum is supposed to be and what the learning dynamics are. And scaling is the one thing that is working without question, I mean, a lot of things are working without question, but it's so tempting in the meantime to be like, whoa, why don't we go see what happens at 10 to the 27, 10 to the 28, 10 to the 29? And there, I'm like, yeah, I actually would like to see us be a little bit more cautious before we just race through however many more orders of magnitude, because I have no idea what comes out the other end of 10 to the 30 at this point. All bets feel off. Does that feel safe to you? I mean, if somebody were to come today and be like, hey, great news, everybody, all my H100s just warmed up and we're going to 10 to the 30 right now, we'll see in 100 days. With my 50,000 H100 cluster, it should take 100 days, or probably a little more than that, to get to 10 to the 30, but whatever.

Alex Watson: 51:53 I don't think there's any way to stop someone from doing that.

Nathan Labenz: 51:56 Well, you are operating at small-city-scale electricity consumption at that point. So that is the kind of thing that the state can currently intervene on. Now, there may be algorithmic breakthroughs in the future that make that impossible to stop. But

Alex Watson: 52:10 Well, getting a little bit meta here, maybe, and talking a little bit of theory: it feels like the advancements we've made have really been about modeling how the human brain works; that's neural networks, right, at their core. And at some point, nature stopped saying we should have a bigger and bigger brain, and started to say, we're going to have parts of your brain that are specialized for certain things. So I don't think there's anything we can do to stop someone from training something on every token that can be found across the entire internet. But I actually think there's a quantity of smart enough, where for general reasoning you've got good stuff there, and then it's the task-specific models, the code generation LLMs, or the synthetic data generation LLMs, or something like AlphaFold, a task-specific model that is really great at what it does, available as a tool to the others. That both helps us reason a little bit more about what's going on, in a way we couldn't if the model just got bigger, and, I believe, is actually probably a smarter and more efficient way to build out that AGI. So I would see the future, hopefully, because I like the auditability and the understandability of these small expert models, as a world where you've got a lot of models trained on small amounts of domain-specific, really special data, and then orchestrated by a larger, smart-enough LLM, without creating the uber intelligence that no one understands how it works. Curious how you've thought about this as well.

Nathan Labenz: 53:50 Yeah, I think largely similarly, maybe with just a little more tinge of fear in my affect. But yeah, safety and narrowness, again, I think is super important. I guess if I were to try to summarize the case there, it would be that beyond a certain point, scaling isn't necessarily economical anymore, because you're good enough to do a good job at the tasks that need doing. Now I want to revise maybe my earlier statement about being pumped about the state of the game board. Because I do think we look at some of the leading developers, and in some of them, one maybe in particular, there is a borderline ideological sense that we're going to keep scaling and we're going to make something that's the most powerful thing we can make. And "we're going to try to do it safely, but we are going to make the most powerful thing we can make" seems to be the prevailing notion. And I'm like, that is the part that doesn't seem super wise to me. And it does seem like the kind of thing that the state can do at least something to control for a while. So I would, again, like to see a little caution there. But it's funny, I just did an episode today with my friend, the CEO of Lindy, and we were running down all the places and ways in which we are both e/acc, which are many. And then there's just this one little corner of the world where we're like, yeah, maybe let's not rush to 10 to the 30, not knowing what kind of alien pops out the other end of that. But I really appreciate your perspective. It is so interesting; you have such a different angle on so many of these core fundamentals. I'd love to hear how that plays out for you in terms of your sense of understanding on the part of the language models. You've got the stochastic parrot paradigm, obviously. You've got the reasoning engine characterization. What do you make of that, as somebody who's focused much more on the representativeness side of the challenge?

Alex Watson: 56:08 Our number one focus as we built out our service, and I think it's helped keep us grounded, is helping data scientists and developers with the problems they have with data today. Your data is messy, it has gaps in it, I can't create additional examples, it's too expensive, or there's no way to go back and get more of it. So we really focused our efforts, first and foremost, on helping you build better data, where that better data is either more accurate or more private than the existing data. That's been the guiding light. That's what we're really aiming for out of the gate, and we're learning as we go. We're about to release a very early version of our service, to see and really learn from users, from what they're able to share and feed back to us about how they use it, and use that to guide development. So instead of starting with a set of assumptions that prove to be incorrect, one of the areas where we've been successful as a startup is getting code out there really fast, getting samples out there people can iterate with, asking for feedback, and iterating on that feedback. So I'm super curious to see where this use of generative AI goes for working with, and our big focus here is, tabular or mixed-modality data: tabular, text, and time series. We'll use that to drive our own investments, like how much time we spend working on better agent tooling, for example. If you wanted to create a time series or something like that at the scale of a million rows, how do you take our knowledge around building time series that works and combine it with the other technologies? Where we see it being successful, that's where we're going to double down. One of the things I find so neat, and de-risking for this space right now, is that there are so many potential tools you can bring to the problem, whether it's retrieval augmented generation, bringing example datasets into the LLM's memory just to help it, whether it's ReAct or the agent approaches for breaking things down into smaller problems, or the LLM training itself. So we've got a bunch of different dials we can use to solve the problem. We're hoping to learn from how people use the service and see which areas we really need to double down on. But I'm psyched for that, in the sense that I have some ideas on where it's going to go, but I don't really know where things will end up. We think about tabular data as a resource, and about how much of the data we work with every day in organizations is in some sort of tabular format; it's a pretty unique space to be in. I'd say for most organizations, something like 85% of data is at some level in a structured or semi-structured format. So being able to work with that and leverage it is a niche, but a really cool space to be playing in right now. To your point earlier, I'm sure it will be just a matter of time until we've got competition from the OpenAIs and Anthropics of the world and things like that. But right now, we've got a great set of users we've been building with. And I think this combined approach, advancing LLMs to the point we need them to be at without trying to build the uber LLM, and also combining other cool technologies that are happening in our space to solve a problem, is working out pretty well.

Nathan Labenz: 59:27 Yeah. It strikes me that there's maybe another Pareto curve between these two modalities. I'm trying to find a synthesis for the stochastic parrot reasoning engine debate. And in your architecture, I'm seeing maybe they can be both. You very much are training the core LLM here that generates the data to be like a stochastic parrot in a highly principled way. But nevertheless, you want that randomness. Right? That's a big part of the value driver. And then you also need this planner that has to be much more reliable. And probably a lot of the models we use today are at some sort of outer part of the curve on the production possibility frontier. But maybe there's a bifurcation that happens there too, where you're really pioneering the sort of high integrity stochastic parrot side, and then other people are really pushing the reasoning side.

Alex Watson: 1:00:32 That's such an interesting idea. It reminds me of something: we went to a conference. There's a major health standards organization called HL7, and they have a format called FHIR, which is the most popular medical data record format in existence today. And they ran a whole conference on synthetic data. The feedback we heard at that conference was the exact opposite of probably every customer conversation I've ever had, and it was so interesting. In the synthetic data world for healthcare, there are a few projects; there's an open source project from MITRE called Synthea that allows you to generate medical-format record data that you can use for testing systems and things like that in the healthcare space. It's been under development for 4 or 5 years, a purely statistics-based approach. And what they called out was that for many of the use cases they want, particularly for AI or machine learning, the data from Synthea is just too clean. Which I'd never heard in my entire professional career up until that point. But what they were saying is, you do want a little bit of that variability. You want that slight variation, that stochasticity that gets introduced, but you don't want crazy. So it really is about finding that balance, just enough within the scope that you need. And then also thinking about it at scale. You can't evaluate these things one at a time. You need to be able to reason about 50,000 examples you create for an LLM training set, or a million examples you create to boost an ad recommendation dataset, something like that. So you really have to think about it at scale. And starting with tabular data, where it's so easy to look at it and say this is right or this is wrong, I think maybe had us thinking about this as a company a little bit earlier than the rest of the industry, which is now like, wow, we can generate really amazing text or images to train a machine learning pipeline, but how do we know that all thousand images I created meet my expectations? So I personally really like this idea of letting an LLM be an LLM, letting a machine learning neural network generate whatever it wants to, but then examining the outputs at each step and building some controls so that if it goes too far off the rails, if you turn the temperature up too high and something doesn't make sense anymore, then against real-world data you can detect it and filter it out.
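To illustrate the "examine the outputs and filter" idea at scale, here is a minimal, hedged sketch of per-column validators learned from real data; the columns, rules, and pandas approach are illustrative assumptions rather than Gretel's actual validators:

```python
# Minimal sketch: learn simple per-column constraints from real data, then
# reject synthetic rows that fall outside them.
import pandas as pd

def fit_validators(real: pd.DataFrame) -> dict:
    """Record a simple constraint per column from the real data."""
    validators = {}
    for col in real.columns:
        if pd.api.types.is_numeric_dtype(real[col]):
            validators[col] = ("range", real[col].min(), real[col].max())
        else:
            validators[col] = ("categories", set(real[col].unique()))
    return validators

def is_valid(row: pd.Series, validators: dict) -> bool:
    """Return True only if every field satisfies its learned constraint."""
    for col, rule in validators.items():
        if rule[0] == "range":
            _, lo, hi = rule
            if not (lo <= row[col] <= hi):
                return False
        elif row[col] not in rule[1]:
            return False
    return True

# Usage: keep only synthetic rows that look plausible against the real data.
real = pd.DataFrame({"age": [23, 47, 65], "state": ["CA", "NY", "TX"]})
synthetic = pd.DataFrame({"age": [31, 212], "state": ["NY", "ZZ"]})
rules = fit_validators(real)
filtered = synthetic[synthetic.apply(is_valid, axis=1, validators=rules)]
print(filtered)  # the implausible age=212 / state="ZZ" row is dropped
```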

Nathan Labenz: 1:02:46

That recalls how, in a lot of image tasks, there's training on systematic corruptions of the image as well. You want to make your stuff robust, so you add a little noise here, distort this way, change the aspect ratio. And if it can work across all those different things, then you're going to be much better off in a real-world situation. And there's a similar problem, I'm sure, for a lot of medical things, where stuff is anything from illegible to incomplete to contradictory. I just saw a funny story about a person who had the same name and birth date as another person in the same hospital and spent their whole life struggling to be disambiguated. So, yeah, so many crazy things out there. We don't have too much time left, and I've really enjoyed this conversation, but I did want to ask a little bit more about how you go about training for privacy protection specifically. We've talked a lot about how you train for representation and super-representation, but I understand there's probably a whole different technique for making sure that you don't spit out somebody's real email address or whatever. That is interesting. When I fine-tuned, one of the experiments I ran, and maybe you can help me understand this a little bit better, since this has been a topic of discussion lately too, was an experiment on OpenAI fine-tuning with a bunch of my writing, my resume, my data, whatever. And I know I need to do it again with 3.5; this was a little while ago, it was the 002 generation of fine-tuning, which they never launched publicly, but I had an opportunity to test a version of it. Anyway, it moved in my direction, but it did not know who I was. I was turning up the epochs a bunch, and it still never learned that it was supposed to answer as "I am Nathan Labenz." It was Nathan something-else sometimes, and it was vaguely similar to me, but definitely not memorizing those facts. So I'm very confused about memorization in general. Jeremy Howard recently had a thing where it's like, LLMs can memorize from one example. That definitely hasn't been my experience. So maybe for background, what do you observe about this LLM memorization? In my case, I was trying to get it to do it; you're trying to prevent it. What is happening there? And then what's the technique that you're using to really make sure that it's not happening for your product?

Alex Watson: 1:05:15

Where we started was training language models from random weights on a dataset from a customer. In that setting, the model learns only from the data that it sees, and it has a very high propensity to memorize and replay secrets in the data. There was a great paper that came out, and this was towards the beginning of our company in 2020, out of UC Berkeley, The Secret Sharer paper, with Dawn Song's team and several others working on it. What they were highlighting was, when you train a language model on data, how quickly it starts to memorize even rare occurrences in the data, and the chance it'll play them back. It's an interesting example you gave where you're fine-tuning GPT-3.5 on your own examples, because I haven't seen written up exactly how their fine-tuning works, whether it's actually updating all model weights, or using a PEFT-based approach or something like that, just adapting the model on top. It gets a little harder to detect when you have this massive pre-trained corpus and you're making very small changes to only a percentage of the model weights across the entire model. But it still happens. One of the things we see customers doing a lot is fine-tuning a model and then running a series of tests, we call them canaries, essentially trying to get the model to auto-complete a credit card number or things like that. Here's what I've seen work. Starting with the removal of PII or personal data is the first thing. You can use an LLM to do it, you can use NER to do it, whatever you want. The first step is really removing the data you never want to show up inside of your model. The second risk, and this is really the risk that trips people up, particularly when you're working with patient medical data, is that some combination of attributes becomes identifying. It's really easy to imagine in a tabular use case: you might get rid of a name, but you have a height, and you've got a zip code, and you've got some disease or something like that, and just that combination of attributes can very quickly become identifying. None of them identify by themselves, but put a few attributes together and you have a real problem from a privacy perspective. Same thing with text: the styles that people have for writing, as well as the data you're training on, that combination of attributes can become identifying. I suspect that the OpenAI approach, as you trained on more and more data, would become more likely to have things like that, where the combination of attributes, writing styles, anything like that, can become identifying. The answer to that, across both tabular and text, is actually the same type of approach. There's a technique called differential privacy. Everyone's heard of it; no one really seems to know how it works, and I always try to find a simpler and simpler way to describe it. What differential privacy does is insert a quantitative level of noise into your data. So when you're training a machine learning model, an LLM in this case, with differentially private fine-tuning, it's inserting noise into the optimizer and clipping gradients on the way out. And what that's making sure is that some rare combination of words inside of your data, like, hey, my name's Alex, I'm 6 feet, I live in Southern California, something like that, doesn't become memorized and replayed by the model.
So essentially, per training example, or per entity inside the data, which could be a set of examples about an individual user, you can guarantee that none of the tokens inside that dataset will be replayed directly by the model. And that's so important when you are training on compliance-controlled data. We've got any number of engagements with different healthcare organizations that are trying to train on doctors' notes, or customer support records, or things like that, where you need to make sure the model does not memorize a customer name and replay it, or a combination of attributes. So things like differential privacy give you a tool where it's no longer "I think the model didn't memorize it, and I haven't been able to extract it." You can actually say, with a level of confidence, that given the way we handle each individual training example or record, I can guarantee that the model will not have memorized it in a way that lets it replay that example. In the tabular world, this has really opened the doors for us. We've got a couple of national-level healthcare organizations that have been able to get approvals to share data between hospitals by training not just on de-identified data, so not just removing names from patient medical records, but in this case creating a synthetic version of those patient medical records, where you know that the model did not memorize my combination of zip code and height and gender or things like that, that could become identifying. With those actual mathematical guarantees, it becomes possible. So I'm super excited about differentially private fine-tuning, particularly in the LLM space. When you look at small companies that are trying to train models on their domain-specific data but hit compliance or privacy issues, it gives you a tool where it's not just a best guess or "we think it's going to be fine." You can actually convince yourself that the model's not going to return something that it shouldn't.
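For readers who want to see the mechanism behind differentially private fine-tuning, here is a minimal DP-SGD sketch in plain PyTorch. The tiny linear model, clipping bound, and noise multiplier are illustrative assumptions, and a production setup would typically use a library such as Opacus plus a privacy accountant rather than this hand-rolled loop:

```python
# Minimal DP-SGD sketch: clip each example's gradient, add Gaussian noise to
# the sum, and update with the noisy average. Illustrative only.
import torch
import torch.nn as nn

model = nn.Linear(16, 2)                 # stand-in for the model being fine-tuned
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

clip_norm = 1.0                          # per-example gradient clipping bound
noise_multiplier = 1.0                   # noise scale relative to clip_norm

def dp_sgd_step(batch_x, batch_y):
    summed = [torch.zeros_like(p) for p in model.parameters()]
    for x, y in zip(batch_x, batch_y):   # per-example gradients
        model.zero_grad()
        loss = loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0))
        loss.backward()
        grads = [p.grad.detach().clone() for p in model.parameters()]
        total_norm = torch.sqrt(sum(g.pow(2).sum() for g in grads)).item()
        scale = min(1.0, clip_norm / (total_norm + 1e-6))   # clip to the bound
        for s, g in zip(summed, grads):
            s += g * scale
    model.zero_grad()
    n = len(batch_x)
    for p, s in zip(model.parameters(), summed):
        noise = torch.randn_like(s) * noise_multiplier * clip_norm
        p.grad = (s + noise) / n         # noisy average gradient drives the update
    optimizer.step()

# One step on a toy batch of 32 examples
x = torch.randn(32, 16)
y = torch.randint(0, 2, (32,))
dp_sgd_step(x, y)
```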

Nathan Labenz: 1:10:32

As you do the training, you've taken the gradients and you're working your way through back propagation. You are literally adding a noise factor to the updates to the weights.

Alex Watson: 1:10:48

For each subsequent token generation.

Nathan Labenz: 1:10:51

And that basically allows you to say, we've essentially blurred the picture in aggregate. There's probably a trade-off there, where the model converges more slowly, I guess almost by definition, but without learning this stuff.

Alex Watson: 1:11:07

Yeah. That's such an interesting thing, because research has been coming out on this recently. We ran a conference on synthetic data, and we had some folks from Google come in and talk about some of the research they're doing. And that's exactly right: when you're introducing a level of noise into the data, it requires more training time to get down to the same level of accuracy. One of the things that's really interesting with this approach, and it increases compute requirements, is the theory that by really increasing the batch size you're sending into the model at any given point, which is going to increase your computational complexity, you can use differentially private techniques and approach the same level of accuracy as with real-world data. So in this sense, you're getting privacy without a real hit on the utility of the data. Essentially, with more compute budget and more data, it becomes possible to reach the same level of accuracy that you would have if you had just trained on the data itself, which is pretty exciting. That's pretty new. From what we've seen using Gretel today, you're going to have a utility hit using differential privacy, especially on small datasets. It makes a lot more sense when you get into a dataset where you've got 100,000 or more examples; essentially, the level of noise it has to introduce to blur out somebody in a particular zip code is a lot lower. That's why you've seen differential privacy where the US Census Bureau uses it, and Google and Apple use it for next-word prediction or emoji prediction when you're typing a text. At that scale, differential privacy really starts to work. But I am personally really excited about this: a public LLM trained on public data, you're fine-tuning it on a private dataset, and you're introducing differential privacy as you do that. Large batch sizes, plus being able to interleave public examples, will help a model converge really quickly. We got into the weeds there a little bit, but I think in a lot of cases it is the key to unlocking AI for regulated industries that are going to have to convince a regulator that there's no way the identity of a patient who is part of this dataset can be compromised. I always love this framing: you want the model to learn about a disease, but not about a patient. This is a really great technique to make sure that you have that separation.
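A quick back-of-the-envelope sketch of that batch-size intuition, with made-up numbers: the noise added per update is fixed by the clipping norm and noise multiplier, so its share of the summed (and then averaged) gradient shrinks roughly in proportion to the batch size.

```python
# Illustrative arithmetic only: larger batches shrink the relative impact of
# the fixed DP noise on each update.
clip_norm = 1.0
noise_multiplier = 1.0
noise_std = noise_multiplier * clip_norm      # std of noise added to the gradient sum

for batch_size in [64, 1024, 16384]:
    signal = batch_size * clip_norm           # worst-case magnitude of the clipped-gradient sum
    print(f"batch={batch_size:6d}: noise-to-signal ratio ~ {noise_std / signal:.5f}")
```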

Nathan Labenz: 1:13:30

That's cool. Really, I've learned a lot by going down this rabbit hole, so I'm always excited for a journey into the weeds. One last thing I wanted to get your take on: there's obviously a ton of activity going on in synthetic data. I would flag Anthropic's constitutional AI as an interesting version of this, where they're constantly iterating on this HHH basis to make things more helpful, honest, and harmless. And that seems to work; Claude is really good. So that's great. Then you see, I think you even mentioned earlier, the synthetic textbooks project out of Microsoft, which also seems to be a great proof point for the value of synthetic data. And then you see these weird stories, like Self-Consuming Generative Models Go MAD, which most people who listen to this show probably at least saw the blurb for when it came out not too long ago. There, they say, if you do this generation after generation, things get weird. Do you think there's anything inherent about synthetic data that is a long-term problem? Or do you think these weirdnesses are just reflections of not having figured out some of the details yet?

Alex Watson: 1:14:53

I'm pretty strongly in the "haven't figured out the details yet" category.

Nathan Labenz: 1:14:57

I could have guessed that.

Alex Watson: 1:14:58 I've also heard the story that if GPT-4 and Anthropic and other LLMs are creating so much content on the internet, the next cycle of LLMs might regress because it's just training on data that was created by the previous generation of LLMs. I think that's an interesting question, and we're going to see how it plays out over time. But I would posit that in a lot of cases, LLMs where we are today can generate, and often do generate, which is why we do it, a higher quality version of the data than what they were fed to begin with. So many people use this today: we use Grammarly to improve our text, sometimes we run an email through an LLM and ask it to help us make some improvements, things like that. So I think the signal in there, and that's kind of what came out of that Textbooks Are All You Need paper, is very promising. I don't think this is fully understood yet, but the idea that synthetic data can be a cleaner, more diverse version of the limited data you might be starting with is a really powerful idea that I think we're going to see play through. So I'm optimistic about these models, and I would say that maybe the MAD example is just an example of an opportunity to configure things or work with them better, and that we aren't moving towards some sort of mode collapse or anything like that with synthetic data feeding synthetic data. As long as the data we're generating is high quality and ideally improving on the data you have, then I think we'll be in a good spot. That's going to be playing out, so I'm really curious to see how it works out.

Nathan Labenz: 1:16:40 Yeah, the dynamics of the future of the internet, and a changing mix of content being published there, are definitely going to be another fascinating society-scale story. So, anything else you wanted to touch on that we didn't get to?

Alex Watson: 1:16:56 No, I think it's been an awesome conversation. I was just kind of laughing about the last topic. And as long as every LLM generation doesn't start with, "I'm a helpful AI assistant, how can I help you?" or "Let me explain this for you," the things that we see coming out of LLMs all the time, I think that we'll be moving in the right direction. So definitely enjoyed the conversation today, and thanks for inviting me on.

Nathan Labenz: 1:17:17 Alex Watson, founder and chief product officer at Gretel AI, thank you for being part of the Cognitive Revolution. It is both energizing and enlightening to hear why people listen and learn what they value about the show. So please don't hesitate to reach out via email at tcr@turpentine.co, or you can DM me on the social media platform of your choice.
