In this episode, Inna Tokarev Sela, CEO of illumex, delves into the latest advancements in AI and data management. She discusses the foundational vision behind illumex, a company she started in 2021 to revolutionize the way businesses interact with data. Inna explains illumex's approach to creating application-free futures for knowledge workers by utilizing metadata to automate context and reasoning, ensuring efficient and secure data handling. The conversation covers illumex’s techniques for integrating business logic without directly accessing sensitive data, challenges faced in AI implementation, and the future of data analytics. Inna also offers insights into preparing individuals for evolving job roles in the tech industry, touching on academic research and real-world applications.
Check out illumex at: https://illumex.ai
SPONSORS:
Oracle Cloud Infrastructure (OCI): Oracle's next-generation cloud platform delivers blazing-fast AI and ML performance, at up to 50% less cost for compute and 80% less for outbound networking compared to other cloud providers. OCI powers industry leaders like Vodafone and Thomson Reuters with secure infrastructure and application development capabilities. New U.S. customers can get their cloud bill cut in half by switching to OCI before March 31, 2024 at https://oracle.com/cognitive
NetSuite: Over 41,000 businesses trust NetSuite by Oracle, the #1 cloud ERP, to future-proof their operations. With a unified platform for accounting, financial management, inventory, and HR, NetSuite provides real-time insights and forecasting to help you make quick, informed decisions. Whether you're earning millions or hundreds of millions, NetSuite empowers you to tackle challenges and seize opportunities. Download the free CFO's guide to AI and machine learning at https://netsuite.com/cognitive
Shopify: Shopify is revolutionizing online selling with its market-leading checkout system and robust API ecosystem. Its exclusive library of cutting-edge AI apps empowers e-commerce businesses to thrive in a competitive market. Cognitive Revolution listeners can try Shopify for just $1 per month at https://shopify.com/cognitive
CHAPTERS:
(00:00) Teaser
(00:38) About the Episode
(03:54) Introduction to the Cognitive Revolution
(04:42) The Vision Behind illumex
(05:33) Application-Free Future
(10:47) Demonstration of illumex Capabilities
(13:39) Understanding Enterprise Data
(15:35) Automated Ontology Creation (Part 1)
(19:28) Sponsors: Oracle Cloud Infrastructure (OCI) | NetSuite
(22:08) Automated Ontology Creation (Part 2)
(23:39) Challenges and Solutions in Data Mapping (Part 1)
(29:18) Sponsors: Shopify
(30:38) Challenges and Solutions in Data Mapping (Part 2)
(40:18) Evolution of AI Models in Business Context
(51:13) Illumex: The Perfect Platform for Control Freaks
(52:32) Building Trust and Governance in Data Analytics
(55:21) The Role of Analysts in Data-Driven Decision Making
(55:55) Addressing Underutilized Data and BI Dashboards
(57:59) Challenges in Data Accuracy and Anomalies
(59:05) Managing Data Volume and Context Windows
(01:01:02) Semantic Entities and Workflow Validation
(01:03:24) Metadata-Based Business Models
(01:09:04) The Future of Data Management and Integration
(01:11:55) User Personas and Self-Service Evolution
(01:13:58) Pricing Models for Enterprise SaaS Products
(01:17:11) The Future of Data Analytics Careers
(01:21:24) Academic Research and Industry Inspiration
(01:26:39) AI Education for the Next Generation
(01:28:39) Outro
PRODUCED BY:
https://aipodcast.ing
SOCIAL LINKS:
Website: https://www.cognitiverevolutio...
Twitter (Podcast): https://x.com/cogrev_podcast
Twitter (Nathan): https://x.com/labenz
LinkedIn: https://linkedin.com/in/nathan...
Youtube: https://youtube.com/@Cognitive...
Apple: https://podcasts.apple.com/de/...
Spotify: https://open.spotify.com/show/...
Transcript
Inna Tokarev Sela: (0:00) It's actually about augmentation of jobs. All of us want to make smarter decisions, and this is what allows you to actually, you know, be factual about that. Illumex is definitely a playground for machines, humans, data, and applications. So it's a playground where everyone can collaborate. You absolutely have to have the shared context. If each of your models has a separate context, they're never aligned. We're actually able to allow our users, our customers, to build workflows from different agentic models by different providers and keep them aligned together over the same context.
Nathan Labenz: (0:39) Hello, and welcome back to the Cognitive Revolution. Today my guest is Inna Tokarev Sela, CEO of Illumex, a startup that helps enterprises get their data speaking the way their employees do and aims to create an application-free future for knowledge workers. Illumex's approach is super interesting and, at least to me, novel. They began by first creating a foundation of canonical, almost platonic data models that represent how different types of enterprises work, from ecommerce to pharmaceuticals to manufacturing, with idealized implementations of all the little details that are common across such companies. With this foundation, they can then use an automated system that combines knowledge graphs, semantic embeddings, and large language models to automatically analyze companies' metadata, including query logs, API signatures, schema relationships, and so on, so that they can automatically map a specific company's idiosyncratic data environment, complete with all of its inconsistencies, ambiguities, and redundancies, onto their idealized templates, all without requiring any manual data labeling or movement. Once that process is complete and validated by human domain experts from within the company, Illumex can begin to reliably translate natural language questions into database queries and other system calls, without Illumex ever seeing the underlying data itself. The value for enterprises can be tremendous: data analysts can be more productive, and leaders can get instant answers 24/7. And the implications for the future of software and work more broadly are potentially profound. While so much of the AI world focuses on creating more and more niche and even personalized applications, Illumex envisions a future in which we interact with our data and other resources through a few core natural language interfaces such as Slack or Teams. Two things can be true at once, of course, but this vision does strike me as a bit more in line with how I tend to imagine my own future AI-enabled life. With one or a few interfaces to rule them all, and the ability to get questions answered and work done from anywhere via chat or voice interactions, I look forward to one day untethering myself from my desk and spending more time outdoors, all without sacrificing intellectual stimulation or productivity. This conversation has a lot to offer: a fresh take on data architecture, a reminder of the importance of metadata, a visionary approach to the challenge of reasoning over messy enterprise data that strikes me as both effective and perhaps even defensible, and even a bit of insight into the psychology of enterprise decision makers, who need help building trust in AI systems even as they can clearly see their incredible potential. If app-making platforms like Bolt and Lovable represent the beginning of a software supernova, Illumex seems to me to foreshadow the software black hole that might follow, in which interfaces and applications collapse to a single point. As always, if you're finding value in the show, we'd appreciate it if you take a moment to share it with friends or rate and review it. And we always welcome your feedback via our website, cognitiverevolution.ai, or by DMing me on your favorite social network. For now, I hope you enjoy this thought-provoking conversation on AI-enabled enterprise data analysis and the possibility of an application-free future with Inna Tokarev Sela, CEO of Illumex. Inna Tokarev Sela, CEO at Illumex, welcome to the Cognitive Revolution.
Inna Tokarev Sela: (4:01) Thank you for having me over, Nathan. I'm happy to be here.
Nathan Labenz: (4:04) Yeah. I'm looking forward to the conversation. So right off the bat, on your homepage, the call to action is: get your data speaking the way your employees do. And I think this is a really interesting space. It's kind of a subset of AI-assisted coding, leaning, sort of, into the paradigm of tool use. And it seems like you're on the verge, like so many different companies are, of transforming a job role, I guess I would say. And I'm really looking forward to unpacking it, both from kind of the technical underpinning standpoint and also the way that you think it's going to impact, and perhaps already is impacting, the way that your customers are conducting their businesses. So for starters: you started the company in 2021, right? That's what I saw in my research.
Inna Tokarev Sela: (4:51) Yeah, the golden age of startups for sure. Yeah.
Nathan Labenz: (4:54) Was, so was this a generative AI play from the start? Like at that time we sort of had seen like glimpses of GPT-three, but you know, I don't know how many people realized how far it was gonna go. So what was your mindset at the time and maybe what was the original founding vision? How, if at all, that transformed over the last few years?
Inna Tokarev Sela: (5:15) Yeah, it's a good question. So imagine me going to deep tech investors in 2021 and explaining what automated context and reasoning for agentic AI is. So after, kind of, two or three pitches, it was clear to me that I needed to refine that. So I presented Illumex as we do it today: we are on a mission to enable an application-free future for knowledge workers, and our mission is to augment people in their daily jobs with self-service access to data analytics, for structured data sources. To me, it was always fascinating, you know, from my early days at SAP and then Sisense, to understand why, after such heavy investment in data practice and analytics practice, the majority of our business decision making is still based on guesstimation. And it's not to blame anyone; it's just that we, as humans, have so many questions, and we cannot really have this helper answering all our questions with data automatically, or at least we didn't have this ability. And then when 2017 arrived with semantic models, and then Hugging Face and all of that, it was clear to me that here we are, the time is now. And I have been in love with graphs from my first degree. I wouldn't mention how long ago it was, but just imagine, we were programming graphs in MATLAB. So a very, very long time. To me it was: okay, graphs are all in here, just like context, relations. And then semantics here, so it's the content. Context plus content automation together gives us, you know, this exquisite fabric, which can connect data to people, to workflows, and enable this programmable self-service, finally.
Nathan Labenz: (7:12) I think this idea of application free is really interesting, and it's a very different take, actually, on the future of software compared to a lot of others that I've been exploring recently. Like, I've done two episodes with companies that are creating sort of full-stack software developer AIs. And the idea there is, like, you can get any application you want in record time. And so the sort of vision is, like, we're going to have, you know, tons of applications, custom applications, personal applications, disposable applications. So tell me more about the application-free vision, because it's really the first time I've heard that phrase, and it's a striking contrast to some of the other visions for the future of software that are flying around.
Inna Tokarev Sela: (7:58) Yeah, so I do not really find it necessarily contradicting what you just described, like different functions and different niche agentic, full-stack implementations. So to me, especially for business users: we tech folks just love learning new technologies, new tools. The business side, they actually have a day job. And the day job is not looking into new software. So to me, having multiple interfaces, multiple applications, and context switching between them and integration between them is just too much for business folks to tackle. And right now, they have lots of embedded experiences. For example, if you have salespeople in Salesforce, you might have integrated analytics, agentic, you know, even plugins from customer success applications inside Salesforce. So it doesn't really contradict; we can always integrate experiences into experiences. To me, it will eventually boil down to us having a launcher, right? Like, think about this plain interface, and then you ask a question or narrate your task, and then it happens. And you do not really care which application you need to invoke, and which data is going into that, and which, I don't know, workflow process facilitates it. So you don't need to care about all this orchestration, and remember which order of clicks you need to perform to basically get your answer or perform a task. So to me, the application-free future is coming. It's not contradicting what you just described. It's just that this whole workstation is going to be in the background, and not necessarily an interface to the end customer.
Nathan Labenz: (9:40) So I'm totally with you that, like, it's not gonna be one or the other, and probably both of these visions, applications galore, applications, you know, conjured out of nothing, and your sort of application-free paradigm, can coexist. But I do think it is a really interesting idea that people should be considering more. Like, what would it look like if you could do everything while you're walking around? You know, I've been really wanting that with advanced voice mode from OpenAI and haven't quite got it yet, for what seem like very mundane reasons. Like, it doesn't have all the same features; I can't load a lot of context into it like I can with the normal chat interactions. But I do envision a future for myself where I'm, like, untethered from my chair and out in the world more, but still able to interact with information and even take actions within the digital world that are just not accessible to me right now if I'm not kind of locked in at the workstation. So I think that is a really fascinating and potentially kind of, like, liberating paradigm for people that feel like me, like they kinda can't do their usual thing unless they're at the desk. Do you wanna show what this looks like?
Inna Tokarev Sela: (10:52) Totally. So I just mentioned, like, would it be nice to just invoke something from your phone? So can you see my Slack screen at the moment?
Nathan Labenz: (10:59) Yes, okay, there we go.
Inna Tokarev Sela: (11:01) So you have your Slack on your phone, right? And you just ask, like, how many products do we have in stock right now? Maybe you just got a message about low stock, and so on and so forth. So we do have those two modes. "Let Omni decide," this is autopilot. And if you are more of a data person and you just want to dig deeper and choose by yourself and so on and so forth, we also have that. So "let Omni decide" just goes and matches your query with the semantic ontology that we create in the background. This is a more sophisticated way to say that we actually capture business logic from your data sources, from your data lake, warehouse, your database, your business intelligence tools, your SAP, and so on and so forth. And then we match your prompt to those business logic definitions and refine everything. There's nothing which is hidden from you in Illumex. It actually also explains to you, in more or less detail, what you actually see right now. So you can see that we mention the semantic entity. You can go into Illumex and actually explore that, and then it shows everything, from the data to the actual SQL code, to the number itself. When you actually click the semantic entity from your Slack, you can see all the explanation, you can see the lineage, you can see the attributes, you can see the actual business ontology behind that. So, you know, this is a platform for control freaks, right? The majority of people would just get the answer and, you know, take an action upon that. Some of the people who are interested to actually see the business ontologies, relationships, and definitions behind that will go deeper and actually see all those explanations and related metrics, and see the definitions of all the data and, basically, the code which goes in. So it's really fascinating how we can have different user experiences even within the same workflow. So, I love my Ray-Bans, because I do have the speaker inside, in addition to the camera and everything. It's actually liberating not to have headphones in addition to that. So it's the same with having self-service data copilot access. You are in your environment, you're in Teams, Slack, your regular tools, and suddenly you can have this friendly chat with a kind of analyst experience. So as I mentioned, rather than automation of jobs, it's actually about augmentation of jobs. All of us want to make smarter decisions, and this is what allows you to actually, you know, be factual about that.
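To make that "let Omni decide" flow concrete, here is a minimal sketch of the pattern Inna describes: routing a chat question to a pre-certified semantic entity and returning its governed SQL and lineage, rather than having an LLM write SQL from scratch at runtime. The entity names, descriptions, and the word-overlap matching heuristic are all invented for illustration; this is not illumex's actual API.

```python
# Minimal sketch (hypothetical names, not illumex's actual API): route a chat
# question to a pre-certified semantic entity and return its governed SQL.
from dataclasses import dataclass, field

@dataclass
class SemanticEntity:
    name: str
    description: str
    sql: str                                        # certified query definition
    lineage: list[str] = field(default_factory=list)  # upstream tables/pipelines

ONTOLOGY = [
    SemanticEntity(
        name="products_in_stock",
        description="count of products with on-hand inventory above zero",
        sql="SELECT COUNT(*) FROM inventory WHERE on_hand_qty > 0;",
        lineage=["warehouse.inventory", "etl.daily_stock_load"],
    ),
    SemanticEntity(
        name="open_orders",
        description="orders placed but not yet shipped",
        sql="SELECT COUNT(*) FROM orders WHERE status = 'OPEN';",
        lineage=["erp.orders"],
    ),
]

def match_entity(question: str) -> SemanticEntity | None:
    """Score entities by word overlap with the question ('autopilot' mode)."""
    words = set(question.lower().split())
    best, best_score = None, 0
    for ent in ONTOLOGY:
        score = len(words & set(ent.description.split()))
        if score > best_score:
            best, best_score = ent, score
    return best

entity = match_entity("How many products do we have in stock right now?")
if entity:
    print(f"Matched entity: {entity.name}")
    print(f"Certified SQL:  {entity.sql}")
    print(f"Lineage:        {' -> '.join(entity.lineage)}")
```

The point of the sketch is the direction of trust: the answer comes from a definition a human has already certified, and the lineage is attached so a "control freak" user can keep drilling down.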
Nathan Labenz: (13:42) So let's dig in, if we can, to, like, where does this data come from? You know, I know a lot about, like, SQL databases and, like, how to query them. I don't know nearly as much about where enterprises store their data, you know, in terms of, like, what platforms are they using, and how do they get those things to talk to each other, and what challenges exist there. So how do you create that understanding in the first place? If I'm grokking the approach, it's like: first go kind of scout out the data environment at a company, try to make sense of it, get to some, like, established canonical understanding, like, this is what this data means, this is what it represents, this is how it relates to everything else. And then once you have that, and that's, like, vetted and good, then these runtime queries become a lot safer to use, because you're not asking the language models, like, each and every time to make all these determinations on their own. Mhmm. So if I have that right, how do you actually go about doing it, from the access to the understanding of a big company's data?
Inna Tokarev Sela: (14:50) Yeah, totally. So it's actually fascinating; well, it's why I started this company in the first place, naturally. You have to have a corpus of knowledge about each industry and each line of business. So you need to have an understanding of terminology, processes, metrics, analysis, dependencies, in all areas that you cover for customers, so from IoT to manufacturing, to insurance, to pharma, to finance, all of that. So what we built is actually a domain of knowledge about all of the above, and we encapsulate this knowledge as business ontology. Architecture-wise, it's a knowledge graph of semantic embeddings. It's a combination of relational models and semantic models. And what we do for our customer onboarding, again a very interesting approach to me, is only looking into the metadata. So we work with pretty regulated and data-intensive companies which have a hybrid data stack, to your point: they have on-premise Oracle, might have SAP, MS SQL, Teradata, Vertica, all the more traditional stack, but they also have modern environments with Redshift and Snowflake and Databricks, BI, business intelligence tools, think about Tableau, it could be Power BI, lots of systems, dozens of them in each department, and absolutely no single source of truth. So what our system actually does is bring our ontologies as a benchmark. This is how an industry benchmark for a single source of truth would look for the specific customer. And then we retrain this ontology on the customer data stack by only using their metadata. So we only look into schemas and query logs and APIs for the applications, to actually automatically retrain those business ontologies. And as a result of this process, which usually takes a few days, it really depends how big the customer is. We have customers with millions of tables in their data sources, spread across their own data stack. So it really depends on their size: a few days to onboard them automatically and to have their own custom ontology. And this is, namely, their context and reasoning, automated. But in comparison to RAG, graph RAG, and other techniques, you actually do not need any manual onboarding, providing us any manual examples or labeling. You do not need to shift your data. So we do not require you to move your data to a vector database, for example. We do it as a virtual layer. So this virtual layer of a knowledge graph of semantic embeddings represents a semantic single source of truth of your whole data estate. We support both federated and centralized models. And this means that we have enough knowledge to understand that order ID in this table, in this system, is actually vendor ID in another system. We have incorporated this knowledge in our platform as well. A single source of truth is one thing, but it's also important because we do bring the human into the loop. Why is the human important? I think, for us, it's that when we model data for business intelligence, we're already creating these gaps between business users and data, right? We have this specific subjective understanding of business matters delegated to data people, to basically model data for applications. With generative AI it's even more complicated, because with RAG and other techniques, you actually trust your data scientists to correctly represent your business logic corpus in a semantic model. Why would this happen? Why would a data scientist actually understand all the required business examples to be fed into the system?
They are not the ones actually running those processes; that is, you know, in the background. So what's good about Illumex is that when we generate the semantic reasoning and context automatically, we actually have application workflows which are user-friendly, and even business users, non-technical users, can verify and certify our definitions. They do this either using our web interface, which I just showed, or using their own environments like Slack, Teams, or other environments they already work in. So it kind of brings humans and AI into the same playground.
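A toy sketch may help ground the metadata-only onboarding idea: harvest schema and query-log metadata and turn it into graph nodes and edges, without ever reading a row of data. The schema, query log, and edge types below are all invented for illustration, and real connectors would obviously be far more involved.

```python
# Toy sketch of metadata-only onboarding (my illustration, not illumex's code):
# build a small graph of tables, columns, and co-usage edges from a database,
# touching only schema and query-log metadata, never row values.
import sqlite3
from collections import defaultdict

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, total REAL);
    CREATE TABLE customers (customer_id INTEGER, name TEXT);
""")

# 1. Schema metadata: table and column names only.
graph_nodes, graph_edges = set(), defaultdict(int)
for (table,) in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'"):
    graph_nodes.add(("table", table))
    for row in conn.execute(f"PRAGMA table_info({table})"):
        column = row[1]
        graph_nodes.add(("column", f"{table}.{column}"))
        graph_edges[(table, f"{table}.{column}", "has_column")] += 1

# 2. Query-log metadata: joins in historical SQL reveal relationships
#    (hardcoded detection here; a real parser would extract the join keys).
query_log = [
    "SELECT c.name, SUM(o.total) FROM orders o "
    "JOIN customers c ON o.customer_id = c.customer_id GROUP BY c.name",
]
for sql in query_log:
    if "JOIN" in sql.upper():
        graph_edges[("orders", "customers", "joined_in_queries")] += 1

print(f"{len(graph_nodes)} nodes, {len(graph_edges)} edge types")
for edge, weight in graph_edges.items():
    print(edge, "weight:", weight)
```

Nothing in this sketch ever executes a SELECT over business data; everything comes from the catalog and the logs, which is the property Inna keeps emphasizing.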
Nathan Labenz: (19:30) Hey. We'll continue our interview in a moment after a word from our sponsors. In business, they say you can have better, cheaper, or faster, but you only get to pick two. But what if you could have all three at the same time? That's exactly what Cohere, Thomson Reuters, and Specialized Bikes have, since they upgraded to the next generation of the cloud, Oracle Cloud Infrastructure. OCI is the blazing-fast platform for your infrastructure, database, application development, and AI needs, where you can run any workload in a high-availability, consistently high-performance environment and spend less than you would with other clouds. How is it faster? OCI's block storage gives you more operations per second. Cheaper? OCI costs up to 50% less for compute, 70% less for storage, and 80% less for networking. And better? In test after test, OCI customers report lower latency and higher bandwidth versus other clouds. This is the cloud built for AI and all of your biggest workloads. Right now, with zero commitment, try OCI for free. Head to oracle.com/cognitive. That's oracle.com/cognitive.
Nathan Labenz: (20:44) It is an interesting time for business. Tariff and trade policies are dynamic, supply chains squeezed, and cash flow tighter than ever. If your business can't adapt in real time, you are in a world of hurt. You need total visibility, from global shipments to tariff impacts to real-time cash flow, and that's NetSuite by Oracle, your AI-powered business management suite trusted by over 42,000 businesses. NetSuite is the #1 cloud ERP for many reasons. It brings accounting, financial management, inventory, and HR all together into one suite. That gives you one source of truth, giving you the visibility and control you need to make quick decisions. And with real-time forecasting, you're peering into the future with actionable data. Plus, with AI embedded throughout, you can automate a lot of those everyday tasks, letting your teams stay strategic. NetSuite helps you know what's stuck, what it's costing you, and how to pivot fast. Because in the AI era, there is nothing more important than speed of execution. It's one system giving you full control and the ability to tame the chaos. That is NetSuite by Oracle. If your revenues are at least in the 7 figures, download the free ebook, Navigating Global Trade: 3 Insights for Leaders, at netsuite.com/cognitive. That's netsuite.com/cognitive.
Nathan Labenz: (22:08) Do I understand correctly that you have essentially created, like, the platonic form of enterprises in different businesses, representing the result of, I guess, extensive experience, but also brainstorming of, like, here's all the different aspects of, for example, an ecommerce business that we would expect to find
Inna Tokarev Sela: (22:33) Mhmm.
Nathan Labenz: (22:33) In a customer's data environment. And then maybe here's another 1 for a pharmaceutical company. And obviously, those could be, you know, quite different in terms of the information that they're handling. So you've gone ahead and done this work for different customer profiles. Mhmm. And then when you onboard a new customer, you're essentially comparing and contrasting and sort of finding all the idiosyncrasies that a certain business has that ultimately are the ways that they depart from the kind of idealized data model that you pre created. Is that the right way to understand the approach?
Inna Tokarev Sela: (23:08) Exactly, exactly. So every customer is special, and the customers, their systems, have preferences. Sometimes they use personal names to name tables and columns and all of that. So yes, we do have this automated cleaning of definitions and automated labeling and semantic entity resolution in the system, which comes from this canonical industry model and picks up different clues from customer metadata for the correct mapping. So, you might have heard about a company called Palantir. This approach is not novel: they have business ontologies, and then they have a process which maps organizational data into those business ontologies. We just do it automatically.
Nathan Labenz: (23:52) So that's really interesting, because I would have guessed that you could only do that with any reliability quite recently. Meaning, you know, if I imagine trying to do this, and I'm sure you've, you know, had many versions of this process, but if you take me back to, like, a 2021, 2022 timeframe: I did a lot of fine-tuning of models in that era, and I would have guessed that they would have been, like, pretty unreliable in, you know, reasoning through it. We're now, I'd say, mostly past the whether-or-not-AIs-can-reason debate. But back then, I think it was a much more reasonable question to sort of ask, like, are these things, you know, just stochastic parrots still, or are they, you know, reasoning a little bit? And so how did you manage to get anything working? And was it automated at that time? Like, I'm really struggling to imagine how you would get this to be reliable enough to be valuable with anything other than models that we've had maybe since, like, Claude 3.5 Sonnet.
Inna Tokarev Sela: (24:54) That's a great question, because there's some truth to it, but also, with these modern reasoning tools, you do need to program them and basically improve this reasoning, right? So we do not really trust semantic models' reasoning in any way, even till now, because for us it's not customized enough to what our customers require. We pick up this reasoning from the industry benchmark, those canonical ontologies which we built over time, and we are always enriching them; it's an ongoing process, always. And we also pick up this reasoning from existing relationships, which we pick up from customer metadata. So for example, if you have a procurement-to-purchase process, we already understand what the thresholds are, what the rule-based decisions in this process are, which are implemented in the application APIs. We pick up those cues from application APIs or from metadata.
Nathan Labenz: (25:54) It still sounds hard.
Inna Tokarev Sela: (25:55) Yeah, it is.
Nathan Labenz: (25:56) Who is making the connection? If I have for, you know, in your canonical thing, like let's say we have ecommerce business and you've got, you know, variations on a product. I was just looking at some Shopify data and they have this sort of collections, products, variations, and then you could have styles and all these, you know, different sort of cascades, right, of from high concept to, like, low level detail. Now I imagine you go into the data environment of a particular ecommerce company and it's, like, probably pretty easy to say, oh, okay. This looks like a product. You know, I can kinda understand that that's a product. But then I imagine they must have so many different low level things, right, and such idiosyncratic names and maybe even named in different languages, you know, depending on what the situation is. How do you actually, you know, if I've got in my, you know, Nathan's idiosyncratic e commerce business and it's like, I've got, instead of calling them product variations, I call them like V A R X. Like, how do you, how can you get confident enough to know what my sort of idiosyncratic thing is in the, as it relates to like the canonical sort of idealized representation? That part still sounds really hard to me.
Inna Tokarev Sela: (27:12) So it's a great question. We do not trust semantics. Even if a semantic is self-explanatory, we still do not trust it 100%. We analyze usage. We do not only build ontologies, so basically semantic entities and their relations, but we also build taxonomies. Taxonomy is understanding the usage context. So for example, for this proverbial column, is your first name embedded in it? We do analyze different usages in that context. So for example, you might have this column used for a transformation, and it's assigned an alias in your data pipeline, in your dbt data pipeline. This column might also be used by your business intelligence report to calculate channel attribution. So we analyze formulas, analyze logic, and we cross-validate it with the metrics which are already embedded in our platform. We understand the usage context and all its appearances, and the proximity of the usage context between different elements, and then we deduce the mapping to formulas to understand what it is. So namely, if you have a non-meaningful column name participating in a formula, in a metric, in some calculation, and then we have another example of the same non-meaningful column participating in a different calculation, and we mapped both of those calculations to industry metrics, we can deduce the meaning of this non-meaningful column. So, lots of context analysis. And this is only available when you do have a log history, when you interact with the systems and you have some application usage. If you're a blank page, you just created your warehouse and it's all blank, it's of course a harder case. We analyze the proximity of different semantic definitions, and we also analyze the data pipelines which feed those specific calculations and so forth. So, lots of usage analysis, which is what people usually use taxonomy for.
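Here is a small illustration of the formula-based deduction Inna just described, with the metric patterns, column names, and query log invented for the example: an opaquely named column inherits a meaning when the calculations it appears in match known industry metric definitions.

```python
# Illustrative sketch (hypothetical metric patterns, not illumex's actual
# rules): infer what an opaquely named column means by matching the formulas
# it participates in against known industry metric shapes.
import re

# Canonical metric patterns: formula shape -> role of each operand.
INDUSTRY_METRICS = {
    "revenue": {
        "pattern": r"SUM\((\w+)\s*\*\s*(\w+)\)",
        "roles": ["unit_price", "quantity"],
    },
}

query_log = [
    "SELECT SUM(prc * varx) FROM sales",           # 'varx' is the mystery column
    "SELECT SUM(prc * varx) FROM sales WHERE region = 'EU'",
]

def infer_roles(queries):
    """Collect (metric, role) votes for every column seen in a known formula."""
    votes = {}
    for sql in queries:
        for metric, spec in INDUSTRY_METRICS.items():
            match = re.search(spec["pattern"], sql, re.IGNORECASE)
            if match:
                for col, role in zip(match.groups(), spec["roles"]):
                    votes.setdefault(col, []).append((metric, role))
    return votes

for col, evidence in infer_roles(query_log).items():
    print(f"{col}: likely {evidence[0][1]} "
          f"(seen in {len(evidence)} '{evidence[0][0]}' calculations)")
```

Run on the toy log, `varx` comes back as a likely quantity column because it repeatedly multiplies a price inside a revenue-shaped aggregation, which is exactly the cross-validation against embedded metrics that she describes.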
Nathan Labenz: (29:18) Hey. We'll continue our interview in a moment after a word from our sponsors. Being an entrepreneur, I can say from personal experience, can be an intimidating and at times lonely experience. There are so many jobs to be done and often nobody to turn to when things go wrong. That's just one of many reasons that founders absolutely must choose their technology platforms carefully. Pick the right one and the technology can play important roles for you. Pick the wrong one and you might find yourself fighting fires alone. In the ecommerce space, of course, there's never been a better platform than Shopify. Shopify is the commerce platform behind millions of businesses around the world and 10% of all ecommerce in the United States. From household names like Mattel and Gymshark to brands just getting started. With hundreds of ready-to-use templates, Shopify helps you build a beautiful online store to match your brand's style, just as if you had your own design studio. With helpful AI tools that write product descriptions, page headlines, and even enhance your product photography, it's like you have your own content team. And with the ability to easily create email and social media campaigns, you can reach your customers wherever they're scrolling or strolling, just as if you had a full marketing department behind you. Best yet, Shopify is your commerce expert, with world-class expertise in everything from managing inventory to international shipping to processing returns and beyond. If you're ready to sell, you're ready for Shopify. Turn your big business idea into cha-ching with Shopify on your side. Sign up for your $1 per month trial and start selling today at shopify.com/cognitive. Visit shopify.com/cognitive. Once more, that's shopify.com/cognitive.
Nathan Labenz: (31:19) Interesting. So how much of that analysis is done by, because I can imagine, like, multiple modes of analysis. One level could be fully explicit code. Another is, you know, asking a language model to figure it out. Obviously, you can have a hybrid of those, with language models, you know, tapping into explicit code and calling functions and getting results. You also mentioned embeddings, which is another area where I'm a little bit kind of struggling to make the leap, because, you know, when I think of embeddings, I think of just highly semantic, you know, grounding for those embeddings, right? With Waymark, which is my company, we do video creation for mostly small businesses. And we've had this challenge over time of, like, okay, we've got, you know, a huge library of different forms of content, kind of similar in a way, where we have done the sort of canonical work of, like, this is a really good way for a small business to present itself. Now, instead of, like, mapping, you know, their presentation onto our form, we're kind of, like, painting, you know, their identity, their brand, their content onto this form. But we've often had this question of, well, how do we determine what is the best video template, essentially, to use for a given business? They're not gonna watch them all and pick, right? We wanna be able to make an intelligent choice. And so we've explored using embeddings, but then we've kind of found that our notation is not well understood at all by standard embeddings. And so we end up getting these matches that are based on things that are, like, not actually what we want the match to be on. Like, if we have any sample data in a template, that will dominate over the actual vibe. We want to match on vibe. We wanna match on sort of pace and kind of energy. But if we put any placeholder text in there, those things seem to be the real matches. And so we don't have an embedding component in our system, because of that sort of notation-versus-semantics disconnect. So the big question, I guess, is: what's the mix of different kinds of analysis that go into this process? Then specifically, I'd love to hear a little bit more on the embedding side, of how have you made that work? Because I have kind of tried and failed personally to make that work.
Inna Tokarev Sela: (33:41) Well, this is a complicated space, and the structured data side especially. Of course, with documents it would be easier, naturally, because you have, like, lots of context and corpus and all of that. So yeah, structured data is almost bare of context unless you analyze, like, metadata and usage and all of that. So how do you do that? The short answer is we leave absolutely no ground-truth embeddings from the input ontology in the output. There will be absolutely no traces or leftovers from the basic input ontology in the output. What we do not find matching between the input and the customer ontology, we remove completely. It actually means that there is a business concept or business metric which other companies use in a workflow, for example, but this specific customer hasn't connected us to a data source which supports this workflow. So we simply cannot introduce a business concept or semantics which is not grounded in customer data. So this is how we keep it clean; we do not leave it to chance. We do have dozens of semantic models and graph models for different tasks. For example, explicit semantic labeling will be done by one model. SQL usage queries are analyzed by different models. Semantic entity resolution is an additional one. So we do use a combination of different models, and we always benchmark them on performance for the specific task. We haven't found one open source model that is actually good for everything. It's usually a combination of many of them, but also ROI-wise. So we do have this combination. So after semantic labeling we understand the usage; after usage we understand relationships. We build those graphs and we analyze subgraph matches. So for example, we determine what the sub-clusters in your ontology are and how they match to the canonical ontology. Some of the algorithms are GNNs, graph neural nets, and some of them are more traditional ones, for sub-clusters, for example. So it sounds tedious; the thing is, it's automated. So we don't really feel that. We always benchmark it with our golden datasets. It's funny, but based on customer requests, we actually built a benchmark on different versions of the Spider academic benchmark, the Spider dataset which is used for text-to-SQL benchmarks. What we built is an automated ontology for all of the versions, and then we ran our comparison, and we were 91% accurate. And then we analyzed why it's 91 and not 99, and then we understood: because it's open source, there was lots of garbage. So some of the examples were actually not true. Now, you know, in Spider you have around 60 different corpuses or domains in the same database, from baseball to cooking, to flying, with very limited context, very limited examples of queries. And we're able to do it automatically. And from there the system can only improve, because in real life we actually have a lot of context from the usage part. So this is a very long answer to say: we simply do not leave any remainders of the original canonical ontology in the end.
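A much-simplified sketch of that alignment-and-pruning behavior, using plain string similarity where the real system uses an ensemble of graph and embedding models (all concept and column names here are made up): canonical concepts that find no grounded match in the customer's metadata are dropped, so nothing ungrounded survives into the output ontology.

```python
# Simplified sketch of alignment-and-pruning (string similarity stands in for
# the graph/embedding ensemble described above): canonical concepts with no
# grounded match in customer metadata are removed entirely from the output.
from difflib import SequenceMatcher

canonical = {"order", "customer", "shipment", "loyalty_points"}
customer_columns = {"ord_id", "cust_name", "shpmnt_date"}  # from metadata only

def best_match(concept, columns, threshold=0.5):
    """Return the most similar customer column, or None if nothing clears the bar."""
    scored = max(columns, key=lambda c: SequenceMatcher(None, concept, c).ratio())
    ratio = SequenceMatcher(None, concept, scored).ratio()
    return scored if ratio >= threshold else None

grounded = {}
for concept in sorted(canonical):
    match = best_match(concept, customer_columns)
    if match:
        grounded[concept] = match   # concept survives, mapped to evidence
    # else: pruned, no trace of the ungrounded canonical concept remains

print(grounded)
# {'customer': 'cust_name', 'order': 'ord_id', 'shipment': 'shpmnt_date'}
```

Note that `loyalty_points` disappears from the output: the customer's systems gave no evidence for it, which is the "we cannot introduce semantics not grounded in customer data" guarantee in miniature.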
Nathan Labenz: (37:17) What do you do if there's something, and I imagine this must happen too, right, where the customer has some workflow or some sort of data structure that is not represented in your starting canonical version? Is that something that you, like, just have to flag and say, I guess we need to make another expansion of our canonical version?
Inna Tokarev Sela: (37:38) It's more about supported connectors. So basically, we do have a list of supported connectors in all formats. We need to understand what the format of the schema and the APIs is, like signatures, API Swaggers, to support them for automated onboarding. We can always do manual onboarding, basically say, okay, so this JSON format is different. And you know, it's more engineering; it's not really the data science or agentic part. But we do have a list of supported connectors for automated onboarding, because we already know the formats. It's not really about logic formats, it's more about system formats.
Nathan Labenz: (38:18) I guess I was wondering how often, I mean, think the manual onboarding sort of answered it, where if there's just something that the customer has that you're not fully prepared for in an automated way, that's where the manual supplement comes in basically.
Inna Tokarev Sela: (38:30) From ontology creation, we have it all automated. For where API is in box, for example, how export metadata, this is technicality which we might set up manually. Basically if it's manual, the metadata exports, if it's like scripted or API, but again, this is an engineering. We absolutely have no manual tasks for ontology creation, but I'm proud to say that we do have the certification workflows for business users or domain experts to go and certify this ontology, because I think it's super important for you to trust the answers of agentic workflows. You must be able to go deeper and understand what's the concepts behind that. So for us to build this application workflows where it allows people either to review where the answer came from or to certify the definitions ahead of time is crucial. It's also cost saving because if you pre build this context and reasoning and your call, your prompt is not going to LLM but it goes to for this pre build filter, you actually save up to 80% of your tokens. It's like closed systems for all your calls and only as a result of this matching is going as a runtime call. So it's huge saving. About that you don't need to actually train every time new runtime, right? So for you to invoke a new semantic model, an agentic model, you need to create a context in the form of this model knows how to digest and feed this context and spend a lot of money on that. We created context in this format which is pluggable for any LLM runtime. So for example, for this proverbial agentic workflow orchestration everyone speaks about, right? It's not good enough to have agentic niche applications. Now we, as industry, we're heading into workflows, agentic workflows or agentic orchestration. You absolutely have to have the shared context. If each of your models will have separated contexts, they're never aligned. So we do have that by automating this context and reasoning and having connectors to different run types like AWS Bedrock or Nvidia names or other platforms, we're actually able to allow our users, our customers to build workflows from different agentic models by different providers and keep them together aligned over the same context.
Nathan Labenz: (41:00) How have you noticed that that has evolved over the last couple of years? Like, in my experience, again, with the video creation, which is a less complicated problem, but it's a multimodal problem and definitely has its own complications: the trend has been pretty clear. Initially, we basically had to fine-tune, and we had to fine-tune one model per task, and we had a lot of subtasks. You know, we had to break the tasks down into a lot of subtasks. And often, especially in the early days, you know, certainly in, like, the '21, '22 time frame, we couldn't necessarily provide all the context that we ideally might have liked to, because, you know, at one point in time the context limit was 2,000 tokens or 4,000 tokens or 8,000 tokens, and you couldn't describe everything that you wanted to describe. And so you had to be very, you know, careful with context management. For us, we've definitely seen a trend toward less need to break things down. Like, the tasks are getting a little bit bigger. There are fewer of them. There are then also fewer models. We can definitely put more context in. And in some cases, we don't even have to fine-tune for certain tasks anymore, because the base models are just doing well enough, or we used to fine-tune and ensemble or whatever. Now, a great example of that for us is understanding the images that a small business has in their image library. We used to do convoluted things like caption them and then try to figure out, just from these often very generic captions, which of these images, like, seemed relevant, you know, to actually use. And we could only do one at a time. And now we just throw, like, a bunch of images into a vision language model and say, which are the right ones to use? And more often than not, it gives us just a really good answer. I think you're dealing with, like, far bigger environments. So I imagine your life has not been simplified as much as ours has been, I would guess. But how would you describe that evolution over the last couple of years? And maybe, like, if you dare, how do you think that will continue to evolve for you?
Inna Tokarev Sela: (43:07) It depends. So we benchmark all the time. When a new model comes around, we always benchmark it. I must say, because we deal with very proprietary corpuses and domains of knowledge, we do not have significant breakthroughs on general understanding of business corpus from out-of-the-box models yet. It's because the context and reasoning is improved on public domain, internet data, news, and so on and so forth; concepts which do not necessarily belong to the business world. In many cases it's very hard to actually build those concepts and this reasoning, and for the providers of the foundational models, the goal is the widest common denominator. What we do see is on specific tasks, for example, query description, SQL query description. We used to have many models for decomposing a query and then categorizing it: this is a filter, this is a dimension, this is a measure. So we used to do it with different models automatically, but it's like an ensemble, a whole ensemble, tackling that. And now we use just one simple semantic model. It had to be trained on the specific corpus which we have, but still, it outperformed the ensemble. So I think it's a combination of what we have as training data, but also our ability to basically benchmark, all the time, what's the latest and greatest. We still do not see significant breakthroughs, or even small breakthroughs, in business understanding from general-purpose models.
Nathan Labenz: (44:46) Yeah, that's quite interesting. Do I understand, then... I could imagine a couple different ways that you could architect this. One is, like, I could imagine that you might say, oh, we fine-tune a model just on your dataset, you know, for each and every customer, and therefore, you know, it's gonna be the best for you because it's really dialed in. Or you could say
Inna Tokarev Sela: (45:06) Mhmm.
Nathan Labenz: (45:06) We train 1 model that's like the best at handling all this complexity. And then we sort of do a mapping from your world to our world where our model always kind of speaks in the, like, it's native tongue of our sort of canonical idealized data structure. And then that gets kind of mapped to each customer in its own idiosyncratic way. I'm gathering that it's the latter that you're doing, and so you're getting benefit of the core model is, like, are getting smarter, but that mapping is the part that, like, they can't do and where the ontology creation is, like, really important. Am I inferring the right things here?
Inna Tokarev Sela: (45:48) The foundational models get much better at basically understanding the intent of the users. So for example, if you use words like "just show me," it can already infer that it's a count, like, if it's something which is numerical, right? So that takes care of how the question is understood, but the topic of the question is where foundational models are very much struggling in a specific organizational context. So to your point, yes, we are closer to the second, which means we already have an ensemble of graph and semantic models which are trained on a domain-specific corpus. And then it's automatically fine-tuned on organizational metadata, which means every organization gets their own custom uber semantic LLM graph thingy. Okay, so it's customized on their own; it's automatically fine-tuned with all possible examples. Think about an uber-RAG, right? Every example, every possibility, every combination, we already fed in to this fine-tuning. On the other side, we always extend and improve our ontologies. For example, we do have right now cross-industry ontologies. There are many concepts which are applicable between industries. So just sharing ontologies, projecting ontologies between different use cases, is always improving the enrichment. So it doesn't really end with having this fine-tuned, uber agentic model for a customer. It also gets enriched all the time. Which means, if a new employee comes, and for example it's a supply chain company and they ask something like, how many pieces do we have today? We know that a piece is, like, a delivery, right? And we can map it to semantics and so on and so forth, without them knowing that it's actually mapped. So it's a simple example, but there are lots of lingos, lots of jargon, which are used in different industries, and now we have cross-pollination between them. We do not assume anymore that the person who talks to us talks in a specific industry's jargon, because we have this projection. So it's actually fascinating how this context can be shared across users, across domains, across companies, and we're able to pick it up automatically. So it's having your fine-tuned model, but also enriching it all the time.
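The jargon projection she describes might look something like the following toy sketch (the vocabulary and mappings are invented for illustration): per-industry synonym maps all resolve to the same shared canonical concepts, so "pieces" from a supply chain user lands on the certified "delivery" definition.

```python
# Toy sketch of cross-industry jargon projection (vocabulary invented):
# per-industry synonym maps resolve a user's term to a shared canonical
# concept, so new employees can use their old industry's lingo.
JARGON = {
    "supply_chain": {"piece": "delivery", "lane": "shipping_route"},
    "ecommerce":    {"sku": "product_variant", "cart": "open_order"},
}

CANONICAL = {
    "delivery": "count of completed shipment events",
    "product_variant": "sellable variation of a product",
}

def resolve(term: str) -> str | None:
    """Project a term through every industry map to the shared concept."""
    for industry, synonyms in JARGON.items():
        if term in synonyms:
            concept = synonyms[term]
            return (f"'{term}' ({industry} jargon) -> {concept}: "
                    f"{CANONICAL.get(concept, 'no certified definition')}")
    return None

print(resolve("piece"))
# 'piece' (supply_chain jargon) -> delivery: count of completed shipment events
```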
Nathan Labenz: (48:17) Yeah. Okay. That's fascinating. I'm always kind of looking for patterns of things that I can come back to, and I think this is a really interesting pattern: trying to create a platonic ideal of what an enterprise looks like, then getting really good at handling that, and then dealing with whatever sort of departures from that via fine-tuning or mapping or all these different tricks, which are really important, but it does create a sort of software-engineering-style separation of concerns, it sounds like. Do you organize your team that way? Is there a sort of, like, structure within the company where certain people are working on the canonical ontologies and getting the core models to be amazing at those, and then other people are working on the projections and adaptations to different particular instantiations?
Inna Tokarev Sela: (49:08) I would love that. But, you know, those are data scientists, and they get bored. So we are all for rotation; we have rotation. And of course, if we need to have, like, I don't know, a quick evaluation of a new model coming out, and someone is an expert at, like, building those benchmarks very quickly, something which you need to publish tomorrow, like a POV or something like that, this person is going to be assigned to the task. But naturally, we would rotate tasks between team members, just because then everyone learns to learn everything. And I'd also say the biggest benefit of actually working for Illumex is not only creating this new future and meeting happy customers, but also developing your skill set. This industry changes all the time, and every employee has to keep up. And to me, part of the benefits which we give, you know, beyond salary and equity and all of that, is the ability to keep your relevancy in this fast-paced environment. So this is why everyone is allowed to touch every aspect of our business.
Nathan Labenz: (50:14) What would you say are the sort of performance benchmarks that we should have kind of roughly in mind? For example, I wanna run a certain query, and you can complicate my framing, because I'm sure there'll be some nuances around, like, well, it depends on how difficult it is and so on. But let's say I have sort of a database, or maybe a couple different data sources, and I just go, like, okay, here's my schema, here's my other data source and its structure, paste that into ChatGPT and give my, like, language for what I want and ask it to kind of handle it, you know, sight unseen previously, but it does have my, at least, like, database definition. If I go that route, versus if I go to a data analyst, you know, in my company and just give them the question and say, hey, here's my question, can you do this? Versus if I go to Illumex and give that same question, you know, as you showed earlier, through the Slack chat or whatever. How good are each of those things? I think people often sort of assume tacitly, if not consciously, that, like, the human is 100% reliable. Whereas I know from personal experience, you know, in my own businesses, that's definitely not true. I've actually become, like, kind of radically skeptical over time of things that I get from data analysts. I'm always like, really?
Inna Tokarev Sela: (51:32) So we always benchmark to human analysts, because of this bias, right? The thing is, if we're on a POC, the system naturally creates queries on the data source, and then they're going to compare those queries to, for example, BI reports which were created by analysts. If we are better than that, we are good, right? So this is a benchmark. To me, Illumex is a perfect platform for control freaks. Why? It's because you can check everything. With human analysts, they get agitated; sometimes they get annoyed by all the follow-up questions, like, why did you use this data? How did you calculate that? And, you know, there is a limit to how many questions you can interrogate a human analyst with. With Illumex, it's endless. You can interrogate the system, you know, to death. You can ask as many follow-up questions, or as many, like, reverse-engineering questions, as you would like. So for example: okay, I would like to do a channel attribution. This is, you know, this is the answer, like a table, right? And then: why did you calculate it by this definition of channel? Okay, because this is what I found in the definition of your business metrics, blah, blah, blah. And why did you use this? Because it's the data source used by 90% of API calls. So you can actually interrogate the system in many ways by asking questions. And this usually doesn't work with humans this way; you cannot really interrogate a person to death. And I think it's a good thing, because usually people will do it, like, once or twice to understand how the logic works, and then they'll be confident to actually make decisions based on data. The whole thing is to bring trust and bring awareness about how the cake is baked. It's a cake analogy and not sausage, on purpose. And this is where it comes to, you know, business users getting this trust. In addition, governance is a big thing. We didn't touch governance at all during this conversation, but the thing is, in generative AI, and especially for data analytics, governance is not a built-in component in any RAG- or ontology-based stacks. So governance is a separate practice; someone runs it in, like, the GRC department, and it has nothing to do with the agentic workflows. And this is a bad thing, because here is where we need governance the most, right? So bias, ethics, access rights, skewed data, all of that should be taken into consideration, and Illumex has a governance component built in. We audit, by ourselves: conflicts, duplications, PII. We audit all of that, and you can export those audit reports, but you can also go and verify and certify and govern by yourself. So this is the flexibility that the platform gives. In addition, on the input level, on the interface level, what happens is that when you interact with the system, your prompt is always going to be mapped to part of your business ontology. So you cannot reprogram the system from the interface, from the input, on purpose, because we do not expect business people to reprogram business logic which is accepted at the organizational level. Again, on purpose, we grounded that. And the only way for this logic to change is that we see that the metadata is changing, the systems are changing. So you might add synthetic data, you might want to bring an API, maybe you just deleted a few tables. So this is how changes arrive to the system, and we flag them, we alert about them, we generate new descriptions, new definitions, automatically. This is how the system is programmed: not by the interface, but by inputs from the underlying systems.
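The guardrail she describes, where chat can select certified logic but never rewrite it, and change only enters through metadata, can be sketched roughly like this (all names and keyword markers invented; a real system would classify intent with a model rather than keyword matching):

```python
# Illustrative guardrail sketch (invented names): chat input can only select
# from certified definitions; attempts to redefine business logic from the
# prompt are rejected, and definitions change only via metadata events.
CERTIFIED_METRICS = {"channel_attribution": "SUM(revenue) GROUP BY channel"}

REDEFINE_MARKERS = ("define", "redefine", "from now on", "instead use")

def handle_prompt(prompt: str) -> str:
    text = prompt.lower()
    if any(marker in text for marker in REDEFINE_MARKERS):
        return ("Rejected: business logic is certified and cannot be "
                "reprogrammed from chat.")
    for metric, definition in CERTIFIED_METRICS.items():
        if metric.replace("_", " ") in text:
            return f"Answering with certified definition: {definition}"
    return "No certified definition matched; routed to a human analyst."

def on_metadata_change(event: dict) -> str:
    # The only sanctioned path for logic to change: schema/pipeline diffs.
    return f"Flagged for review and regeneration: {event}"

print(handle_prompt("From now on, define channel attribution by last click"))
print(handle_prompt("Show me channel attribution for Q3"))
print(on_metadata_change({"table_dropped": "legacy_channels"}))
```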
Nathan Labenz: (55:19) That is really interesting, though, and it definitely does sort of foreshadow, I think, you know, changes to how companies are gonna be run. I guess before maybe going into that: are there, like, rough numbers right now? Is there a mental reference that I should have in mind, like, if I paste my schema into ChatGPT, I'll get x percent accuracy; if I go to a human, I'll get... and do you do these benchmarks? Like, you don't benchmark the humans at every customer, do you?
Inna Tokarev Sela: (55:50) You know, it depends on the requirements. It really depends on the company. It might be just generic tests that we need to pass. It might be a real comparison to the actual activity that's happening. It doesn't really come from this place of, of course, replacing analysts. Analysts are irreplaceable. I wouldn't imagine, in the near future, having this SEC report or board report for a public company generated by a data copilot without a human analyst approving and confirming it, because there's, like, legal liability around that. So of course we need analysts. The thing is, analysts are always strained. They have an endless pipeline of requests they never get to. So we're kind of serving this underserved customer. We just had this inquiry come in, like: we have a specific department, and we have been waiting for nine months to have a BI dashboard implemented for us. Can we go around that and just give a service to our users, finally? Because they're just desperate for that. It's not a priority for the company to build the BI dashboard, it's very expensive, but having this data copilot facilitated for them is something they can see tomorrow. So it's kind of cool, if you're speaking about underutilized data, underserved employees, under-tackled use cases.
Nathan Labenz: (57:15) If I had to guess, I would say that my typical sort of 1 off data request that I might give to an analyst at a big company would come back totally accurate somewhere in the like 90 to 95% of the time range. How does that match your experience or understanding for accuracy in real, live, large complicated businesses?
Inna Tokarev Sela: (57:43) Above that, if you speak about utilized data; it goes a little bit below 95 on underutilized data. So when you have unused data and you start asking questions, we might return that there is no answer due to corrupted data, and actually it's 100% true that it is. Corrupted data could be, for example, missing values, things like that. Or duplicated data, which is, for example, you have a single source of truth and this is just a duplicate of it with a different value distribution. So by our assumption, we should never use that. When we analyze actually unutilized data, and look at why we have no answer coverage, we have 100% conviction that it was due to corrupted data. So frame it as you would like, but it's above 95.
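As a rough illustration of what "no answer due to corrupted data" can mean when you only see metadata, here is a toy check over metadata profiles; the profile fields, thresholds, and numbers below are assumptions for illustration, not illumex's actual logic.

```python
# Toy checks over metadata profiles (null rates, schemas, coarse value
# distributions); the raw rows are never read. Fields/thresholds are assumed.

def answerable(profile: dict) -> tuple[bool, str]:
    # Refuse to answer rather than compute on corrupted data.
    if profile["null_rate"] > 0.5:
        return False, "no answer: too many missing values"
    return True, "ok"

def shadow_duplicate(candidate: dict, source_of_truth: dict) -> bool:
    # Same schema as the single source of truth but a different value
    # distribution: treat as a duplicate that should never be used.
    return (candidate["columns"] == source_of_truth["columns"]
            and candidate["value_histogram"] != source_of_truth["value_histogram"])

truth = {"columns": ["order_id", "amount"], "value_histogram": {"0-100": 0.9}, "null_rate": 0.01}
copy_ = {"columns": ["order_id", "amount"], "value_histogram": {"0-100": 0.4}, "null_rate": 0.01}
print(answerable(copy_), shadow_duplicate(copy_, truth))
```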
Nathan Labenz: (58:40) Yeah, that's really interesting too. I mean, I've found in my own experience that when I get an answer that's not right, often it's that they did something that, at first pass, seems reasonable.
Inna Tokarev Sela: (58:54) Mhmm.
Nathan Labenz: (58:54) But then it turns out that there was some flawed assumption, which ultimately means that the answer is wrong. And it's always kind of tricky to figure out, like, should I have expected this person to question that assumption at that time in the right way or not? But yeah, it's a good nuance to point out that a lot of times the systems themselves contain lots of problems. So you can't just apply the naive query to a database and expect to always get the right answer; you are, in many cases, building on a flawed foundation.
Inna Tokarev Sela: (59:31) The thing is, because we automatically learn from all those human interactions, we actually learn from everything which has already happened. So think about this über-analyst who has the combined experience of all your analysts together.
Nathan Labenz: (59:45) So how do you manage that? Because certainly at any large company, the volume of interactions is gonna be too large to, like, throw into one context window. And then also, I was just talking to a friend the other day who is basically working on a research product, not with internal company data, but with, you know, kind of broad open literature as the grounding data source. And they said, you know, a huge challenge is keeping the AI system making progress step over step.
Inna Tokarev Sela: (1:00:20) That's right.
Nathan Labenz: (1:00:20) How do we make sure that our answer is actually getting better as we apply more and more inference, as opposed to just kind of cycling or drifting or, you know, going off in random directions? It seems like there's probably kind of a diamond-in-the-rough sort of problem here, where if you're looking at query logs, you've got just an overwhelming amount of information, and then, needle-in-a-haystack, in there there are gonna be a few queries that tell you, oh, somebody realized there's a problem with this data and this is how they're fixing it, and whatever. Right? So how do you think about identifying these anomalies that are super informative, and not missing them? Especially because I'm just imagining so many of these. I mean, good God. You know? My business is not that big, right? It's only 30 people, and our database is nothing on the enterprise scale, but I could take you down memory lane and remember many instances of this. So how do you handle the volume, I guess, is the core question there.
Inna Tokarev Sela: (1:01:24) So naturally we do not use the same context windows, for starters. And because we have our own architecture and our own models, we do not use third-party APIs; of course, it would be super expensive to do it with a third-party tool. That's why it's more optimized to our needs and to our processing. To your point, logic does change over time, and we pick up those changes and we see if it's an ad hoc change. It might just be a new analyst running an ad hoc query with faulty logic, and then we don't need to change anything in our model. So we check whether a change gains consistency. We do have those building blocks; it's an iterative approach. We not only have the workflows mapped, we also cross-validate that the workflow embedding stays similar to its component embeddings. Iteratively we go down to the lowest level of semantic entities, definitions, and relationships, then we build it up, right? So we have a top-down and bottom-up comparison that the logic stays the same, that it doesn't deviate over time. And if you do have conflicting logic introduced, we do not embed it into the model right away, on the spot. We actually flag the semantic entity as conflicted, and we have this workflow for the data domain owner or domain experts: okay, we have this totally new definition, not a deviation but a serious conflict, introduced by this analyst, by this report, and so on and so forth. Would you like us to incorporate it into the ranking system? So we really ask humans about that. We don't strictly need to; we could use different heuristics, for example, if you have above 20% of deviations, we start to adjust to them, and with a different benchmark we could automate this process. In reality, what we see is that companies prefer to understand the logic deviations before incorporating them into the model, even though we flag them automatically. It's complicated in the sense that you need to really understand all the nuances, but it's easy in how it works, because you do not build anything, you just review stuff. It's actually much easier. Having agentic workflows actually saves 90-plus percent of the effort, because the context and the reasoning are built, the changes are introduced, and they go into review, right? You have explainability, so you can dig in, you know, top-down and bottom-up. So it might sound complicated, but the experience is 90% less friction than what companies have today.
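A sketch of the top-down/bottom-up consistency idea as described in the conversation: compare a workflow's embedding to a composition of its components' embeddings, and route large deviations to human review rather than adopting them automatically. Mean pooling as the composition, the thresholds, and the function names are all assumptions, not a published spec.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def workflow_consistent(workflow_emb: np.ndarray,
                        component_embs: list[np.ndarray],
                        threshold: float = 0.8) -> bool:
    # Bottom-up: rebuild the workflow from its semantic components
    # (mean pooling is an assumed composition) and compare against the
    # top-down workflow embedding.
    rebuilt = np.mean(component_embs, axis=0)
    return cosine(workflow_emb, rebuilt) >= threshold

def route_deviation(n_deviating: int, n_total: int,
                    auto_ratio: float = 0.2) -> str:
    # The >20% heuristic mentioned in the episode could automate adoption;
    # in practice, per the episode, changes are flagged for a domain owner.
    if n_deviating / n_total > auto_ratio:
        return "candidate for automatic adjustment"
    return "flag as conflicted; request domain-owner review"
```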
Nathan Labenz: (1:04:06) You've mentioned a couple times that it's all metadata based and never looking at the actual content of the tables or, you know, data stores, whatever they may be. Is that a decision that you sort of had to make because people just don't want to allow other companies to see their data? Like, would this whole thing be a little easier if you had some visibility into what is actually contained in the tables? I assume it would have to be helpful to have that, right?
Inna Tokarev Sela: (1:04:38) It would be, it would be, but to me, this is the SaaS model and this is our business model. I think it's the future, and we stick to it. For the companies that we work with, it's absolutely imperative to keep the data to themselves, and even, to some extent, to separate definitions and business logic from the data values. Because if you have the same SaaS provider who has all your business logic and all your data, it becomes a great liability. And bearing those liabilities is not a priority for us; lowering the risk for customers is the biggest priority.
Nathan Labenz: (1:05:20) So when I do something in Slack and I talk to the Illumex bot, my query gets sent to your system, and your system knows about the metadata.
Inna Tokarev Sela: (1:05:32) Mhmm.
Nathan Labenz: (1:05:32) It then sends a sort of tool call essentially back to, like, an app, like a Slack integration or whatever that lives on their infrastructure, then that goes and actually calls the data source and then returns to them. And so you're only generating the tool calls basically, but not actually directly interfacing with the database at runtime. Is that right?
Inna Tokarev Sela: (1:05:55) Exactly. So we basically send the query for execution and the results are presented in the same interface the prompt is coming from. And again, this is for security reasons.
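A rough sketch of the runtime split just confirmed: the vendor side plans a query from the prompt and metadata alone, while execution and result delivery stay on the customer's side. The function names, payload shape, and SQL below are hypothetical, a sketch of the pattern rather than the actual integration.

```python
import json

def plan_query(prompt: str, metadata: dict) -> dict:
    # Vendor side: sees only the prompt and metadata, emits a query plan.
    sql = (f"SELECT channel, SUM(revenue) AS revenue "
           f"FROM {metadata['fact_table']} GROUP BY channel")
    return {"sql": sql, "reply_to": metadata["interface"]}

def execute_customer_side(plan: dict, run_sql) -> str:
    # Customer side: a local integration runs the SQL against the warehouse
    # and posts results back to the interface the prompt came from;
    # raw data values never reach the vendor.
    rows = run_sql(plan["sql"])
    return json.dumps({"interface": plan["reply_to"], "rows": rows})

plan = plan_query("channel attribution?",
                  {"fact_table": "sales.orders", "interface": "slack"})
print(execute_customer_side(plan, run_sql=lambda q: [["web", 1200], ["email", 300]]))
```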
Nathan Labenz: (1:06:07) Yeah. That's fascinating. That's very clever. I mean, the idea that you can do this much with data without ever actually having to see any customer data, I do imagine that has to be a major advantage in the sales cycle. Right?
Inna Tokarev Sela: (1:06:21) Yes. Well, I wouldn't know otherwise. You know, people ask me what it's like being a female founder, and I'm like, you know, this is the only experience I have. So it was an early decision to base our solution on metadata, due to concerns in our early discovery calls that it would be an absolute no-no to have a SaaS solution touching enterprise data, with generative AI and all of that. So we made this decision early on, and then it was easy. And now when we get those security questionnaires, it's like: everything is metadata, we do not touch PII, we do not touch this and that, and, you know, it gets easier, of course. And with these automated vendor-qualification tools, you skip, like, ten other things. So it does streamline some operations, but the majority of companies do want to understand exactly what you do with the metadata, how it flows, who your subprocessors are, where you're hosted, can we have instance separation, can we have account separation. It's all justified. To me, all the concerns are justified. We need to make sure enterprises feel very comfortable with this implementation for them to trust the results. Because if you do not trust your vendor, you cannot trust the output of your system.
Nathan Labenz: (1:07:33) I know there are different reasons, of course, that people are very sensitive about their customer data: obviously wanting to maintain customers' trust, not getting sued, and probably regulation in different jurisdictions as well. But as I'm hearing all the stuff you can do with metadata, I'm almost wondering, should people be more concerned about their metadata? Because in a sense it's sort of the scaffolding of the business, right? If I wanted to compete with one of these businesses, it sounds like my ability to access the metadata would be maybe even more valuable than the actual underlying raw data.
Inna Tokarev Sela: (1:08:12) So I would say processes are less differentiating, you know, and business metrics usually differentiate basically on the results. The majority of public companies anyhow report their leading metrics and business structure and all of that; they have to report on it. So I will take your question in a different direction. Right now, companies are saying that the majority of their differentiation is basically the way they're doing business. I'm saying the majority of the differentiation is in their data, because foundational models are going to be faster, cheaper, more agile. Companies are going to build lots of automations around them, and those automations are going to be as personalized and as good as your data is. So the real differentiation for companies is the richness of the data they've accumulated over time. And if companies do not accumulate data about everything they can put their hands on, they're actually missing out, because it might be the next revenue engine. It's like, I don't know, 5 years ago, everyone was saying every company is a digital company, right? A digital product company or something like that. So I believe this is the future. You're going to have lots of data products and services that you can sell to your customers, to your partners, maybe even as data agents, what have you, but you have to have this differentiated data.
Nathan Labenz: (1:09:46) So in other words, you sort of see a maturation of data management and querying and analysis, perhaps analogous to earlier waves of computing, where at some point an e-commerce business might have said, we're the best at running the servers for this e-commerce business, and that's why we're gonna win, because our site loads faster and people have a better experience that way. And now everybody's site can load, and it's not really about that anymore. It sounds like you're saying something similar, and obviously you're wanting to play a big part in making that happen. But the idea is that before too long, everybody's gonna have the ability to get all of the value from their data, and the question is gonna be how much value is there actually to get, as opposed to can you get it?
Inna Tokarev Sela: (1:10:39) Yeah. And also how integrated you are. We're going to have industry clouds, cross-company services. So how well do you recognize the data's value, and how well do you integrate it into different systems? We started from this application-free future. For this application-free future to happen, a few things have to happen: shared context, which we discussed in depth, shared context data that agentic workflows can run around as orchestration. You have to have the same data formats, so basically being able to share data between companies, between industries, and so on. It could also be facilitated with semantic mapping, a semantic exchange of context; that could be part of it. And finally, you have to be able to integrate your software into other companies' offerings. Customers can actually invoke a cross-company workflow and pay in what they call this consumption-based model, where they ask a question and they don't really care which systems answer it. It's all encapsulated, so no actual data values are shared; it's all secure and everything on one side, but on the other side there's business alignment, semantic alignment, so you don't have another lost-in-translation experience. So it's very forward-thinking on one side, but on the other side we already have pieces of it. We have these semantic models which came from nowhere and suddenly are expected to understand your business logic, and now we have mechanisms to feed them this business logic, with RAG or with Illumex, what have you, right? With different benefits, and so on with vector databases. So we already have this beginning, so it's not far-fetched at all.
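To make the cross-company idea slightly more concrete, here is a speculative sketch: a shared semantic mapping resolves one business term to each company's local vocabulary, and only encapsulated aggregates, never raw data values, cross the boundary. Everything here (the map, the providers, the metric names) is invented for illustration.

```python
# Speculative sketch of semantic alignment across companies: a shared map
# translates one business concept into each company's local name, and
# providers return encapsulated aggregates, never raw rows.
SEMANTIC_MAP = {
    "net revenue": {"acme": "rev_net", "globex": "netRevenue"},
}

def invoke_cross_company(metric: str, providers: dict) -> dict:
    answers = {}
    for company, resolve in providers.items():
        local_name = SEMANTIC_MAP[metric][company]
        answers[company] = resolve(local_name)  # aggregate value only
    return answers

print(invoke_cross_company("net revenue", {
    "acme": lambda name: 1_000_000,   # stand-in for a secure company endpoint
    "globex": lambda name: 750_000,
}))
```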
Nathan Labenz: (1:12:38) So who are your users today? How does it break down? Because you've got the data analyst persona, of course. Then you've got the sort of executive or business leader or whatever, who might like to get a quick answer; this is another pain point for me over the years. For some reason, I always found myself asking the questions after hours, and the person who would answer that question for me was usually not working right when I wanted to ask it. So you could break it down into multiple personas. And then I'm also interested in, it's probably a little early for this, but I sort of expect that you're gonna have AI users before too long as well. Right? I mean, Google has talked about making virtual coworkers, OpenAI is making all their various specialist agents, and I can easily imagine that it might be an AI that's chatting in Slack with Illumex in the not-too-distant future.
Inna Tokarev Sela: (1:13:32) Yeah. Illumex is definitely a playground for machines, humans, data, and applications. It's a playground where everyone can collaborate. So, very good question. Our users, and this is changing rapidly: I would say 6 months ago, the majority of our users would be more like tech-savvy people, from data management to governance teams to analysts and so on. And of course they use self-service, but maybe for domains they're not experts in. So, I'm an analyst, and now someone requested a report on a database I don't know anything about, so I'll just run Illumex; something like that. In the last 6 months, all of our inbound requests are around business-user self-service. As you know, the industry itself has matured rapidly, and self-service for business users is not as intimidating as it used to be, especially when we bring built-in governance and also cost management, so you do not go over a specific budget, and so on and so forth. And I expect, as you mentioned, these agentic workflows to become our users as well, faster than we expect.
Nathan Labenz: (1:14:40) How do you price the product with that in mind? Because obviously with the per-seat model, people have never said there's only one model, right? But there's certainly been a lot of emphasis on per-seat: we can grow with your growing team, all that kind of good stuff. But if you imagine a future in which an AI agent working on behalf of, you know, some senior leader is doing a lot of querying, your per-seat model, if you have one, might not play out super well. So do you have any thoughts about how you are, and how enterprise SaaS products in general should be, thinking about adapting pricing to the new reality?
Inna Tokarev Sela: (1:15:19) So, our investors... you know, we're a venture-backed company, and investors like predictability. And I do as well: when I pay the bills at the end of the month, I don't like surprises at all. I mentioned enterprise scale earlier; I think the biggest factor in enterprise adoption slowing down last year is actually this surprise factor regarding costs and ROI, the "cost of AI success," as it's called right now. So in Illumex we actually accounted for that. We have capped tiers for different sizes of companies, based on the number of data sources. If you just feed 1 data source into Illumex, it will be a starter package. If you feed in, like, 3 to 5 data sources, it will be mid-size. And then we have the all-you-can-eat enterprise level higher up, and it's capped. So you will never have surprises over the term of the contract, no limitation on seats or consumption, just because no one likes surprises, and they can really slow down adoption. But we are not crazy; it's not like the Silicon Valley episode where they're selling pizzas at a loss and going bankrupt. We're all about actually encapsulating those calls, this interaction with the agentic side, inside the company itself. We pre-build this context, so the majority of interactions stay inside; all the users share the same pre-built context. We don't need to send it again and again, or retrain or fine-tune all of that. If we fine-tune the context, we fine-tune just one specific block of it. So it's very compartmentalized: we have components and a composite architecture, so we do not fine-tune everything. By our calculation, we save up to 80% of the costs otherwise incurred. So basically our pricing is much more cost-effective than using just off-the-shelf APIs and traditional methods like RAG. RAG, OntoRAG: there are lots of graph-based RAGs right now. So we benchmark against the latest and greatest, not the lowest common denominator, of course.
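The tiering described reduces to a small lookup; the exact boundaries and tier names below are assumptions based on her description, not illumex's price list.

```python
def pricing_tier(num_data_sources: int) -> str:
    # Capped tiers keyed to connected data sources, per the description:
    # no per-seat or consumption surprises within a tier.
    if num_data_sources <= 1:
        return "starter"
    if num_data_sources <= 5:
        return "mid-size"
    return "enterprise (all you can eat)"

print(pricing_tier(4))  # -> "mid-size"
```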
Nathan Labenz: (1:17:42) So if you extrapolate this out: obviously the models are getting better all the time, and the baseline model costs you're saving against are also dropping all the time.
Inna Tokarev Sela: (1:17:53) Mhmm.
Nathan Labenz: (1:17:53) You mentioned earlier that you need to have analysts, if only because there's sort of a liability component; somebody needs to sign these reports if it's a public company or whatever. In the software development domain, it's a little hard to tell right now, but it does seem like we might be entering a place where entry-level, fresh-out-of-college software developers are starting to really struggle. There are a number of data points that show this, right? Like, how many job postings are out there? They're way down. And if you go to different forums, where I don't spend a lot of time, to be clear, but I've had a few glimpses of online discussion places, people are saying, I was promised a good job when I got this CS degree, and it's not happening. It does feel like the current senior people are not immediately threatened, both because the technology isn't quite good enough and because, for very good reasons, I think, people are not ready to go all in on it from a trust standpoint. And it sounds like that's true for data as well. But do you think it's maybe also true on the data side that the ladder is sort of being pulled up for people? Like, would you advise somebody to go into a data analytics type role today if they're in college, for example, and trying to figure out what to study? Because I kind of have a hard time saying, yeah, you should go do that. It seems like, man, I don't know; by the time you finish college, it might not really be there in the same way that it has been.
Inna Tokarev Sela: (1:19:36) I believe our professions are going to be reinvented a few times during our lifetimes, that's for sure, and we should all keep up and keep upskilling. It's the same as moving from pulling carts to riding horses to driving with a steering wheel; it will be the same for data and analytics professions. They're going to be less content creators and more content moderators. This is even more critical because, to me, it's actually not feasible to have everything done with human resources. Think about how much we spend right now on the data analytics practice, and we are doing maybe 10% of what could be done in the data domain. So let's upgrade our data and analytics people to actually handle alerts, resolve different conflicts, talk to business users about what experience is expected of them, really moderating and fine-tuning, customizing, tailoring this experience by giving instructions to the software to do so, right? Software at the moment, agentic included, is not very good at understanding the actual requirements. I can say that depending on how you prompt a system, except for Illumex, it gives a different answer, right? So this is fair: we still need humans, because humans understand analogies, metaphors, the different intentions of the customers. And for sure they can also understand the technicalities, like, why do I have this alert that the data is not sufficient for this model to perform? Ah, okay, they just removed this crucial column. Why do I know that? Because I'm actually experienced; I've seen different iterations of the semantic systems and what goes on, right? I'm not saying that being a moderator will still be a role in 50 years, say; it will be something else. But I do see this shift from content creation to content moderation at scale, for sure. And it does require some expertise: new graduates need to be proficient, or start being proficient, in something deep enough to become moderators in a short time.
Nathan Labenz: (1:22:01) It's a brave new world, increasingly coming at us extremely fast. Maybe the last question, if you have time for one more: what research, academic or otherwise, would you point people to if they want to go deeper on this sort of work? I have a couple of my own that I'll put in the show notes, but I'd be really interested to hear, in terms of academics or papers or, you know, any open source projects, what you think are the things people should go and study in the broader world if they wanted to upskill, you know, to be a good candidate to come work at your company.
Inna Tokarev Sela: (1:22:38) It's a great question. Something new has happened. We always had academia as, you know, this never-ending way to feed us news and inspiration and everything. But especially in agentic AI, what I see is that the biggest inspiration is actually coming from the biggest players, or in some cases from unexpected sources, like DeepSeek, for example, right? I really do advise you to follow the industry's biggest players in the space to see the developments, the trajectory, what they're saying even in business keynotes, because they're saying a lot; you just need to read between the lines sometimes. So to me, the biggest inspiration is coming from industry and not from academia. I do read all kinds of academic research on OntoRAG, different approaches to how to marry context and content. The thing is, I do not see great novelty in this yet, but there is a great understanding that the purely semantic approach will not cut it for this multimodal experience, right? If you want to have multimodal agentic AI, you have to have different combinations of content, context, timeline, what you'd call vibe, right, for your solution. So basically incorporating different factors, which right now are very flat. So it's a big development that academic research is already shifting in this direction of multimodal context, let's say, but it's still not there. So I do not see, you know, something which is outstanding. There is pretty nice research, published at the NIST conference, about an ontology-based RAG system specifically for supply chain. But again, it's locked to this industry-specific data, this industry-specific ontology, to prove that the RAG performs much better and much cheaper and all of that. This is great, but now extrapolate that: do you need to manually create an ontology for every use case? It's not scalable, right? So I see that academic research is advancing, but not at the pace that I would expect. And now everyone is going to throw stones my way, you know, folks from Stanford, for example, but it's my personal take.
Nathan Labenz: (1:24:54) That broadly aligns with my sense. I mean, people ask me, where should I go to get up to speed on AI? Can I take a course or something? And I'm often like, you know, honestly, the best education I see for general purpose, like what's going on with generative AI, is a lot of times coming from online hustlers, you know, who are
Inna Tokarev Sela: (1:25:16) Exactly.
Nathan Labenz: (1:25:17) Not like professors. But one thing they often do have going for them is that they're just moving really fast, trying to keep up with the frontier and trying to ship something that is roughly up to date with it. And so in many ways, I have found that less credentialed but more current sources are, for many people, better in today's world. And of course, you know, read a lot of academic stuff too, but that is striking. And there's a big difference, of course, between academic publications, you know, what's on arXiv every day, versus what they're actually offering in the curriculum at a school; that's way behind in most cases these days. But anyway, that's an interesting answer. This has been great. I really appreciate you taking me through all this stuff; there are a couple of new concepts, maybe even paradigms, here that I'll definitely be thinking about going forward. You've definitely hit on my main goal of learning something new and interesting with every episode; I definitely come away with a couple of good takeaways from this one. Anything else you want to share with the audience before we break?
Inna Tokarev Sela: (1:26:25) I think it was a super enlightening conversation between us. I'm sure your listeners are not going to want to hear the word ontology for the next 2 days, at least. To come out of that: you know, just be brave. I'm really hesitant here; I have a 9-year-old kid, and I wonder whether he's going to attend university or teach himself through some new kind of discourse. But I'm all in on digging into it: put your hands on the latest and greatest, feel the technology for yourself. You know, I did like 5 degrees at different institutions and so on, and it's kind of funny to say, but to me that was the inspiration and all of that. Now we can find inspiration in different places, and life is very short and everything is moving so fast. So maybe, you know, academic degrees will reinvent themselves.
Nathan Labenz: (1:27:21) Yeah, I have 3 kids. The oldest one is almost 6, so not quite as far along as yours. But are there any AI education or tutoring type products that you have found to be particularly valuable?
Inna Tokarev Sela: (1:27:36) It's a tricky one. When he was a baby, I bought all those books, like neural nets for babies, this for toddlers and that. It didn't stick. Honestly, just throwing a ball and explaining gravitation was much more educational than having books on it. So I actually introduced him to ChatGPT early on, and it might not be so educational, because I explained to him how to use it to basically do his homework. But to me, it's more important to teach him critical thinking, checking sources, juggling different types of technology to achieve a task, than, you know, manually doing stuff. I know there's a different approach where you just give your kids a pencil and a notebook and no access to electronics at all, to teach them to think. I'm on the other extreme: give them the fullest extent of technology, of course age-appropriate, and then teach them to be critical about it, to think about what could go wrong, to look for faults. So this is just my approach, and he was super happy at first to put his hands on ChatGPT and ask questions. And then he's like, you know what? Right now the homework is kind of trivial, so I'll just do it myself. And I'm like, okay, great. So you don't need it right now, or at least not to some extent. That's great, but you have these tools available to you; it's your judgment.
Nathan Labenz: (1:29:07) Yeah. Fascinating. I think I lean more in your direction, but we're just starting to figure this out. All right. Well, this has been amazing. Thank you again for taking all the time. Inna Tokarev Sela, CEO of Illumex, thank you for being part of the Cognitive Revolution. It is both energizing and enlightening to hear why people listen and learn what they value about the show. So please don't hesitate to reach out via email at tcr@turpentine.co, or you can DM me on the social media platform of your choice.