Dialpad's Chief Strategy Officer, Dan O'Connell, on AI-Powered Business Communications
Nathan and Dan O'Connell discuss Dialpad's AI, trained on 5 billion minutes of calls, and the integration of AI in business.
Video Description
In this episode, Nathan sits down with Dan O'Connell, Chief AI and Strategy Officer at Dialpad. They discuss building their own language models using 5 billion minutes of business calls, custom speech recognition models for every customer, and the challenges of bringing AI into business. If you need an e-commerce platform, check out our sponsor Shopify: https://shopify.com/cognitive for a $1/month trial period.
We're hiring across the board at Turpentine and for Erik's personal team on other projects he's incubating. He's hiring a Chief of Staff, EA, Head of Special Projects, Investment Associate, and more. For a list of JDs, check out: eriktorenberg.com.
---
SPONSORS:
Shopify is the global commerce platform that helps you sell at every stage of your business. Shopify powers 10% of ALL eCommerce in the US. And Shopify's the global force behind Allbirds, Rothy's, Brooklinen, and millions of other entrepreneurs across 175 countries. From their all-in-one e-commerce platform to their in-person POS system – wherever and whatever you're selling, Shopify's got you covered. With free Shopify Magic, sell more with less effort by whipping up captivating content that converts – from blog posts to product descriptions – using AI. Sign up for a $1/month trial period: https://shopify.com/cognitive
Omneky is an omnichannel creative generation platform that lets you launch hundreds of thousands of ad iterations that actually work, customized across all platforms, with the click of a button. Omneky combines generative AI and real-time advertising data. Mention "Cog Rev" for 10% off: www.omneky.com
NetSuite has been providing financial software for all your business needs for 25 years. More than 36,000 businesses have already upgraded to NetSuite by Oracle, gaining visibility and control over their financials, inventory, HR, eCommerce, and more. If you're looking for an ERP platform, head to NetSuite: http://netsuite.com/cognitive and download your own customized KPI checklist.
X/SOCIAL:
@labenz (Nathan)
@dialdoc (Dan)
@dialpad
@CogRev_Podcast (Cognitive Revolution)
TIMESTAMPS
(00:00) - Introduction and Welcome
(06:50) - Interview with Dan O'Connell, Chief AI and Strategy Officer at Dialpad
(07:13) - The Functionality and Utility of Dialpad
(17:20) - The Development of Dialpad's Large Language Model Trained on 5 Billion Minutes of Calls
(19:56) - The Future of AI in Business
(22:21) - Sponsor Break: Shopify
(23:56) - The Challenges and Opportunities of AI Development
(31:17) - Prioritizing Latency, Capacity, and Cost When Evaluating AI
(39:41) - Most Loved AI Features in Dialpad
(42:01) - The Role of AI in Quality Assurance
(43:10) - The Future of Transcription Accuracy
(44:06) - The Importance of Speech Recognition in Business
(46:59) - Personalizing AI for Better Business Interactions
(47:01) - The Role of AI in Content Generation
(52:47) - The Challenges and Opportunities of AI in Sales and Support
Full Transcript
Nathan Labenz: (0:00)
It's not that I'm racing to replace people or cut costs or whatever, but I always kind of come back to a Bezos-style question: what does the customer really want? And the customer wants immediate response 24/7, where I can pause the conversation where I want to at my convenience and be able to come back and pick it up right where I left off, and maybe even switch modalities. ChatGPT offers me all these things today.
Hello and welcome to the Cognitive Revolution, where we interview visionary researchers, entrepreneurs, and builders working on the frontier of artificial intelligence. Each week, we'll explore their revolutionary ideas, and together, we'll build a picture of how AI technology will transform work, life, and society in the coming years. I'm Nathan Labenz, joined by my co-host Erik Torenberg.
Hello, and welcome back to the Cognitive Revolution. As we head into 2024, I've been thinking a lot about where we are in terms of AI's impact on knowledge work. While 2023 certainly brought explosive growth in AI adoption, to be honest, things have moved a little less quickly than I had expected. At retail prices, $1 billion only buys one to two GPT-4 API calls for each of the world's 8 billion citizens, which really just goes to show what a tiny toehold language models have established in knowledge work globally. Even if this were to go 100x over the next year, it's still just one GPT-4 API call per person per day—still a tiny fraction of the knowledge work that humans are doing.
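For a rough sense of that arithmetic, here is a minimal back-of-the-envelope sketch. The GPT-4 retail prices and per-call token counts are illustrative assumptions, not figures from the episode:

```python
# Back-of-the-envelope check on "one to two GPT-4 calls per person for $1B".
# Assumed (not from the episode): GPT-4 retail pricing of ~$0.03 per 1K prompt
# tokens and ~$0.06 per 1K completion tokens, and a typical call of roughly
# 1,000 prompt tokens plus 500 completion tokens.
prompt_price_per_1k = 0.03       # USD, assumed
completion_price_per_1k = 0.06   # USD, assumed
prompt_tokens, completion_tokens = 1_000, 500  # assumed per call

cost_per_call = (prompt_tokens / 1_000) * prompt_price_per_1k + \
                (completion_tokens / 1_000) * completion_price_per_1k   # ~ $0.06

budget, population = 1e9, 8e9    # $1 billion, ~8 billion people
calls_per_person = budget / cost_per_call / population
print(f"~{calls_per_person:.1f} GPT-4 calls per person")  # ~2 under these assumptions
```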
So why this delay relative to my admittedly very high expectations, and what are the likely solutions? First, while I've definitely argued that OpenAI has moats, I have been a bit surprised by how long it has taken for other companies to match the quality of GPT-4. It's fair to say, I think, at this point, that GPT-3.5 level models are effectively commoditized. But GPT-4 is different. Only Anthropic and now Google with Gemini are really even in the ballpark in the West, though it's worth noting—and I do hope to have another episode on this soon—that Baidu's ERNIE 4 also appears to be a worthy contender.
A second major issue is that language models are weird, and the know-how to successfully implement them into automated workflows remains relatively scarce. Most companies are naturally excited about opportunities to save 90% of their costs by automating tasks that nobody enjoys anyway. But if they don't have or know someone who can execute on such a project, they really have no choice but to wait.
A third issue is that AI agents, which promise to allow people to delegate work to AI on a real-time, more ad hoc basis, have not matured as fast as I thought they might. In part, that's due to ongoing GPU shortages. I've repeatedly said that I think GPT-4 Vision, originally introduced in March but only now hitting general availability, will boost agent performance. And indeed, we already are starting to see new reports of much better performance at significantly lower cost simply because computer interfaces are designed to be interpreted visually, and GPT-4 Vision now makes that possible.
What else is standing in the way? Another challenge is that it turns out that we can boost AI performance significantly by decomposing tasks into sub-parts. But then when we try to string long chains of these sub-parts back together into meaningful work, it becomes very tricky to determine what information to give the AI at each step along the way. Too much information can be unwieldy and in any case makes the agents slower and much more expensive, while too little information leads to bad decisions and overall failure. GPT-4 fine-tuning, which is also still in early access-only mode as of now, will probably help here. Fine-tuning in general is very useful for shaping model behavior, and regular listeners will know from the recent emergency episode on the MAMBA architecture that I expect state space models to deliver more coherent agentic behavior as well.
But even if so, this would pose another challenge: where are we going to get all the long-episode, context-rich, "how we work in practice" sort of data that we're going to need to train these long-context AIs? Where does that data exist today, if at all? One likely source of such data, I think, is the software platforms in which people currently do their day-to-day work, which naturally capture records of the work and also implicitly define the space of possible actions that a human or a hypothetical AI agent might take at any given time along the way.
So with all that in mind, I was very interested to discover Dialpad, a communications platform that uses AI to deliver assistance, automation, and insights to human users, who today are mostly sales and customer success teams. Having been in business for years and having grown a customer base to more than 30,000 businesses, Dialpad is now in a unique position with a proprietary dataset of more than 5 billion minutes of sales and service calls, which they've now used to train their own transcription and large language models.
This looks to me like a near-ideal opportunity to begin experimenting with AI employees. Though as you'll hear in this conversation with Dan O'Connell, Dialpad's Chief AI and Strategy Officer, he's not expecting these changes quite as rapidly as I am. And based on my predictions last year, I have to say there's a pretty good chance that he'll end up being right. Yet at the same time, regardless of how long it takes, it does seem clear to me that huge amounts of routine knowledge work will ultimately be done by AI. So while I'll continue to hold myself accountable for the best possible predictions that I can give, it does also seem to make sense for me to bias my analysis toward shorter timeline scenarios simply because those are the ones in which AI scouting will ultimately become most valuable.
As always, if you're enjoying the show, we'd love it if you'd take a minute to share it with your friends or leave us a review on Apple Podcasts or Spotify. You would be amazed at how few reviews we get. I literally get 10 times as many personal notes as we get online reviews. And yet, I understand that it does make a huge difference to podcast distribution. So I would really appreciate it if you would take just a minute to write a short online review.
Now without further ado, I hope you enjoy this conversation about AI and the present and future of knowledge work with Dan O'Connell, Chief AI and Strategy Officer at Dialpad.
Dan O'Connell, Chief AI and Strategy Officer at Dialpad, welcome to the Cognitive Revolution.
Dan O'Connell: (6:53)
Thanks for having me, Nathan.
Nathan Labenz: (6:54)
I'm excited to talk to you about the platform that you guys have built and are continuing to build and are making more and more AI-centric. One of my calling cards in doing this show is I do my homework. So I've gone in and created a free trial account and bounced around and called myself and transcribed a little bit. So I would briefly describe Dialpad as an almost-everything hub for the work that sales and customer service teams in particular are actually doing. You're actually in Dialpad a lot doing your work. Tell me about the product and your customers, I guess, for starters.
Dan O'Connell: (7:33)
Yeah. You see me on video smiling and laughing as you're describing it. You did it exceptionally well. I would add—a communications platform is the easiest way to think about it. So we do voice, video, messaging on any device, anywhere in the world. That can be used for internal communications between teams, but it can also be used, as you mentioned, for external communications, supporting sales organizations or customers. And we build AI throughout that platform. So how do we capture a conversation, transcribe it, and then focus on building features that help with assistance, automation, and insights from those conversations?
Nathan Labenz: (8:09)
Assistance, automation, and insights. Okay. We'll come back to that. I looked you up on G2 as well. Good reviews, upper-right quadrant. 56% of the businesses are SMB, 37% mid-market. What would you say is the state of—you could even maybe broaden a little bit beyond AI, but I'm very curious about AI adoption, just knowing this business segment as I do and knowing that they're not always the quickest to embrace new technology.
Dan O'Connell: (8:37)
Yeah. And those are obviously self-reported from G2. We would classify ourselves on our revenues. And then just for context on the size of business, we're north of $200 million ARR, 30,000 customers. Pretty much a third, a third, a third across SMB, mid-market, and enterprise. We have some really big brands that use this: Stripe, Twitter, Uber, HubSpot. Then we also can go and power really small businesses that might be a law firm of two or three people.
I would say the general thing that we see from our customers is very much people that believe in the cloud, to no surprise. But when you talk about communications, there's still a lot of people that run their whole PBX system or phone system in a closet with wires behind it. For us, we see a lot of people that believe in managed services living in the cloud, and those people tend to be on the forefront of wanting to try out AI-powered features. So they believe in having a transcription and being able to map sentiment, as opposed to getting into the arguments around wiretapping laws and call recording laws and things like that. Those are not the types of conversations that I would say many of our customers are pushing.
Nathan Labenz: (9:49)
Yeah. Interesting. Okay. Well, I think the third, a third, a third split is an interesting accomplishment, really. I mean, to build software that can serve the enterprise and small business at the same time is a real challenge. I've dabbled in that a little bit. And again, this goes beyond the AI focus. But the sheer number of features and configurations—next thing you know, you're asking, how do we not become Salesforce while we try to meet all these different customer requests? And at the same time, you need something that's really intuitive for small businesses who don't need or want any of that stuff.
I guess one thesis I've had, and I think you've largely addressed it so far with just disciplined product development, but I've had the idea that a lot of that complexity maybe gets smoothed over over the next year or two with natural language interfaces. Maybe everybody who has a crazy complicated platform can start to go down-market because they hide 17 menus and say, "What do you want to do?" and just translate the user need into configuration on their own platform as appropriate. Does that seem realistic to you, even if it's not maybe your most pressing problem relative to other software platforms?
Dan O'Connell: (11:09)
Yeah. I actually agree with you on that. I think one of the unique opportunities for large language models actually is putting them on top of analytics. You think of business intelligence platforms today—because as much as we're a communications platform, as I said, the stuff that got me really excited is, hey, we want to power communication on any channel. We want to understand it in real time. So let's capture it, transcribe it. And once we have it transcribed, well, now it's in text format. We can do all sorts of things with it.
If you put a large language model on top of that, you suddenly have this robust analytics platform. If you think about building an analytics platform today or even a database, you need a bunch of filters to search things, right? And so that creates a lot of complexity. Everyone wants different filters. They want their data in different ways. And to me, one of the biggest opportunities now is you can put a conversational search interface on top of the analytics. And that, to your point, streamlines the complexity of the software you need to build, simplifies the experience that users have, and I think really starts to unlock the potential of this last—I call it the last offline dataset, which is these conversations. So I get really excited thinking about that. Tying it back to what you're saying, yes, I think there's actually a simplification that happens. And we naturally think of simplification coming from the SMB market as opposed to the enterprise market, but I do think that's something that's going to start to play out.
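As a concrete illustration of the pattern Dan describes—a conversational layer in place of hand-built filters—here is a minimal sketch. The filter schema, model name, and question_to_filter helper are assumptions for illustration, not Dialpad's implementation:

```python
# Minimal sketch of a conversational layer over call analytics: ask an LLM to
# translate a natural-language question into a structured filter, then run the
# filter against whatever analytics store already exists. The filter schema
# and model name here are illustrative assumptions, not Dialpad's design.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

SCHEMA_HINT = (
    "Return JSON with keys: team (string or null), sentiment ('positive'|'negative'|null), "
    "date_range ({'start': 'YYYY-MM-DD', 'end': 'YYYY-MM-DD'} or null), topic (string or null)."
)

def question_to_filter(question: str) -> dict:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # any chat-capable model works for this sketch
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": "Translate analytics questions into filters. " + SCHEMA_HINT},
            {"role": "user", "content": question},
        ],
    )
    return json.loads(response.choices[0].message.content)

# The resulting dict would then be applied to the transcript/analytics store.
print(question_to_filter("Show me negative support calls about billing last week"))
```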
Nathan Labenz: (12:34)
Okay. Cool. So can you give me a little bit better sense of how—when I signed up, there was a list of 10 possible departments and roles. And so I wasn't quite clear on just how broadly the software gets deployed within an organization or, you know, maybe it gets deployed very widely, but there's certain people that are the real hour-by-hour users. Am I right to believe—I have some big questions in mind that I think depend on this answer—how much of it is the sales and customer support versus the whole organization?
Dan O'Connell: (13:17)
Yeah. I would say probably the fastest-growing segment of our business is providing that unique communications and AI for sales and support organizations. That said, the vast majority of those customers—if you take those 30,000 customers—are deployed across internal communications. And, as I said, our friendly competitors—to actually put it out there and clarify for folks—are the Microsoft Teams, Zoom, and RingCentrals of the world. And then when you get into the sales and service side, the Five9s, 8x8s, and Talkdesks of the world.
But as I said, the nice part about our business is we sell into these three segments, but we also sell this one consolidated piece of software that can support these different users. So we might be deployed just on a sales organization, or we might be deployed just within a service organization, but that gives us these kinds of upsell opportunities to demonstrate the value of a unified communications platform, the power of AI, and then talk to them about the internal communications side. But—my long-winded answer—the vast majority of our customers are deploying this across the organization. And then they're leveraging different features depending on the persona or the use case that they have.
Nathan Labenz: (14:35)
Gotcha. Okay. So for the power users, how much of their time on a given day would you say is in the Dialpad UI versus tasks where they sort of have no choice but to leave the Dialpad experience to go do something else?
Dan O'Connell: (14:56)
Service and support, they can live in that application. If you think of a contact center agent, they're probably sitting there waiting for—taking inbound calls. We've got integrations with all sorts of the CRMs they would have—Zendesk, HubSpot, Salesforce—to pull information and provide them with the record information they would need or help them drive assistance from those conversations. But the intention is, the way that we use Dialpad internally is that's our internal communication platform, and that's also what our sales and support teams are using.
So I look here—that app is loaded for me up in my browser window, front and center, my entire workday. But as I said, there's always going to be these other interfaces that I need to engage with, but we want to make this the single destination to power the communications when you need to have a video meeting, personal phone call, business phone call, and then to pull the contextual information that you need from different integrations or workflows, whether that's from systems of record or from ticketing systems, whatever it might be.
Nathan Labenz: (16:01)
Yeah. So it sounds like you're developing your own version of the RAG paradigm as well, like access into even broader knowledge bases than sales and support. AI question answering, all that kind of stuff, I imagine, is on your mind?
Dan O'Connell: (16:21)
It is. Yeah. You think about recruiting, right? So you think about these tangential markets that open up where recruiting organizations or just recruiters—maybe you're interviewing for a role, and I need to make sure that I'm asking the right topics and doing that in a consistent manner. We can help guide that conversation so everyone that's doing the interview is asking the same questions to the same person. But we can capture those responses and then write them into Greenhouse, for example, which would be a CRM or a system of record for recruiting organizations or recruiters to leverage.
But again, it's really about, hey, we want to power those communications wherever they're happening and then start to provide either context to the user or start to automate tasks and workflows that stem from those conversations.
Nathan Labenz: (17:10)
Okay. Cool. Let's change topics for a second, then I'll switch back to more practical, closer to the user, more product-experiential things in a minute.
So I saw that you just announced not too long ago the creation of this Dialpad GPT large language model, which has been trained on—and I'm just reading from the blog post—5 billion minutes of business calls and online interactions, the world's first business-focused LLM. 5 billion minutes is a lot of calls. I guess for starters, I'm translating that in my head to 500 billion tokens. If I'm thinking just 100 words a minute of conversational speech, that becomes 5% of GPT-4 training scale if I'm triangulating effectively here. So that's pretty big.
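For reference, the rough math above works out as follows; the tokens-per-word ratio is an added assumption:

```python
# Rough translation of "5 billion minutes" into training tokens, using the
# ~100 words per minute stated above plus an assumed ~1.3 tokens per word.
minutes = 5e9
words = minutes * 100        # ~ 500 billion words
tokens = words * 1.3         # ~ 650 billion tokens -- same order of magnitude
                             # as the "500 billion tokens" estimate above
print(f"~{tokens:.2e} tokens")
```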
Bunch of questions around that, like where does all that data come from? Is that just stuff that's been recorded in the platform over time? And then how did you actually undertake the project of creating your own language model? I imagine, potentially some partnership involved?
Dan O'Connell: (18:25)
Yeah. So one thing is—we had started a startup called TalkIQ, and that was one of the first real-time speech recognition engines back in 2016. And at that time, you know, to give a little bit of context to the listeners, if you were going to get a transcription, you would typically capture a conversation—a WAV file or an MP3 file—and put it through a third party. If your conversation was 30 minutes, it could take another 30 minutes before you got an output back. So there's this big delay.
And we thought there was this really interesting opportunity, especially for sales and support conversations, to say, look, if you can capture and build a streaming engine and understand a conversation in real time, well, then I can start to route a conversation based on what Nathan might be asking about. I can start to map sentiment in real time. I can start to guide a person. I can do live agenda tracking—all these opportunities open up.
So we got really excited about that opportunity. We built a real-time speech recognition engine. We started to leverage some open-source software—I think when a lot of people say, hey, we're building something from the ground up, you tend to start with the fantastic open-source software that's out there. So we started with Kaldi, which is a toolkit for speech recognition, and built this fantastic, really accurate, fine-tuned model to do long-form conversations.
And I say long-form meaning when you do a phone conversation, that's very different than when you talk to Google and Siri and say, "What's the weather?" or "Set a timer." And these were really complex, difficult challenges, roughly a decade ago, which is kind of funny to think about.
So we built this engine. And then when you're building a startup, what you need is distribution. And so Dialpad at that time was, as I said, a really broad communications platform. Their founding team came out of Google. They had built Google Voice. And so it was a really kind of peanut butter and jelly moment, to use an analogy—hey, we can understand the conversations, and you've got the distribution to power the conversations.
And over those years, as I said, I've been here now for five and a half years. When people leverage our platform, they can opt in to share that data, and these are all long-form business conversations or enterprise conversations, and the vast majority of them are sales and support-related. And so that becomes this fantastic training set for us, again, that people opt into. We are—when we leverage the data, we obviously anonymize it, strip it of any personally identifiable information, et cetera, et cetera, et cetera.
And when we started to—I think the world got enamored with large language models a year ago, thanks to ChatGPT. And so we were really quick at saying, look, there's suddenly this great opportunity. We've got this training set. We power these communications, and if we put a large language model on top of it, all these opportunities open up. But these foundational models have some challenges with them. Capacity—you talked about token limits. So if you think about a transcript, initially you would have to break a transcript up six different times, and even that creates challenges when you start to do things like summaries.
And so it became—the challenges for these foundational models for us were capacity, the token limits, latency, you know, how quickly were we going to get an output back, cost. And those challenges were things that we didn't think we could wait for. They're all engineering problems that get resolved over time.
And so what we did was actually focus on building our own large language model. And so there's two ways that we've approached that. One is you take fantastic open-source software, you fine-tune it. Fortunately for us, as I said, we've been at this for some time. We've got a bunch of experts, everyone from conversational neuroscience to linguistics. We do our own labeling. They know how to build these models, and they have the dataset, and they know how to do it. And then we're also working on building our own large language model from the ground up. And when I say large language model, it's ultimately a smaller language model that's going to be built specifically for specific industries and use cases.
Nathan Labenz: (22:21)
Hey, we'll continue our interview in a moment after a word from our sponsors.
So you're doing this training in-house? Like, you're managing your own cloud? I would have guessed that you were partnered with a Mosaic or something like that to get over the gnarly parts of the large-scale pre-training.
Dan O'Connell: (22:39)
Yeah. We do everything in-house. We have our own GPU hardware, so we've got our own capacity for A100s. And the nice part for us is we do all the bare metal. So even our telephony platform—what's unique for us is, when you tie this back to the world of communications, a lot of people when they think about building a communications platform today, think about, oh, Twilio as a CPaaS provider—hey, there's APIs to power voice and messaging, so why build that yourself?
For us, we think that the best businesses verticalize their stack. You can see that in different industries, whether it's Apple now building its own silicon, or the automotive industry with businesses like Tesla. But I think the biggest uniqueness that comes from owning your stack is obviously the pace of innovation, and then cost advantages at scale. And those two things allow you to actually bring, I think, uniqueness and better products to the market faster than anybody else.
And so, again, I don't know if we would always make the same decisions that we have in the past if we were to start a startup today, but based on the proprietary dataset we had, the people that have the know-how, the team of experts, we've done a lot of this in-house. That doesn't mean that we don't partner—we do partner with OpenAI to leverage the foundational models in certain aspects of our business. We also partner with Google, leveraging their text model, Bison, which is their large language model today, to power some features as well. So we do plug in some partners in different places, but the vast, vast, vast majority of what we do is done in-house.
Nathan Labenz: (24:17)
Yeah. I guess I'd love to hear—this is, I think, such a big challenge for so many, right, to know what to build, what to buy, what to partner for in a world where things are moving extremely fast. And I think—I'm a little bit surprised by your answer, and I think it's a flex because most—maybe not most, but certainly, I think a lot of companies out there might say, "Oh, let's Llama 2, let's save money," or "Let's own our stack," combination of the two, whatever. "Let's go, we'll fine-tune that. We'll control our own destiny."
And then it's like, but if you don't have a real skill set and kind of a knack for it—and there are multiple critical components to that skill set, I would say, from managing a cluster, especially if you're doing large-scale stuff from scratch, to what OpenAI recently described as the artisanal process of shaping language model behavior. And I would expect so many companies that may be great at what they do to try to add on this type of discipline and sort of get bogged down in it. And next thing you know, OpenAI has kind of released their next version before you've made Llama 2 really work.
So do you have—is it about just the team that you guys had already assembled that you think is just differentiated? Or do you have strategies that you could recommend for how to avoid falling into that trap? Because I do think a lot of people are headed that direction if they try to follow your specific path.
Dan O'Connell: (26:04)
Yeah. And I think that's why I say, even here, I try to be really keen on all this stuff. I think, fortunately, because of the decisions we made in the past, and it's been kind of fun to see how this has played out, I don't always know if we would make the same decision starting from scratch. So some of that is, we have a team of experts that's been in these fields working with these technologies for 10-plus years, just at the forefront of building product.
But a lot of the people on our team—we've got a team of over 50 focused on AI, 18 PhDs. We've got 16 patents. We show up at the best NLP conferences in the world and win awards for the models that we're building and developing. So some of it, I think, much like anything, always comes back to the team of experts that you have and the access to the dataset that you can leverage.
So as I said, because of those two things, that decision six years ago for Dialpad to be focused on AI has played out nicely to position us today. That gives us, I think, some advantages over relying on a third-party API to do things. And when ChatGPT was giving us some excitement, you know, two weeks ago, I can tell you—I'll be conservative—we were all following this with interest because I thought it was just a really interesting business thing that was going on. But we were probably one of the few businesses that was like, hey, this doesn't impact us. We're not worried about whether our summarization is going to go down because the company may cease to exist overnight. We don't rely on them to do that.
And so, again, as a technologist, I do believe in controlling your own destiny—that innovation is where you're going to really win or lose markets. And don't—I say outsource for lack of a better word—don't outsource what you think are the most important parts of your business.
And right now—I know it sounds a little bit generic given everything in the news and so forth—I do believe that AI is the biggest opportunity. And I think there is a better, more cost-effective experience that we as a business can deliver by doing that ourselves. But that comes with a bunch of landmines and a bunch of challenges, and it's not easy for sure.
Nathan Labenz: (28:27)
Yeah. What do you think businesses should prioritize most? I mean, you talked about a few different values there where there are almost always trade-offs. Right? You've got, on the one hand, what's the absolute best performance that I could get if I don't worry about latency or financial cost? Then there's, if I put some constraints on those, what's the best I can get? And then there's, how much would I trade off that performance for further improvement in latency and cost? I think even that decision-making structure is really tough for a lot of folks. It's pretty unfamiliar territory just given the newness of this technology. It's weird. It's definitely slower than most software we're used to. It's more expensive on the margin than most software we're used to, probably in both of those by orders of magnitude or can be orders of magnitude. Obviously, it can do tremendously more stuff than most software we're used to. But do you have any coaching for how people should think about those relative dimensions of the good in their AI product development?
Dan O'Connell: (29:42)
Yeah. And I can give you our matrix. Obviously, it depends on the business that you're in. So we have a matrix around that. Number one is latency. And what I mean by that is we think the biggest opportunity for our business is how we can do things in real time for a conversation. What I mean by that is, you know, we do things like agent assist, so we can pull information from a knowledge base and present it to a user in real time, in terms of milliseconds of latency. If that assist card doesn't show up for three seconds, it loses its utility—even a second matters—and if it's ten seconds, there's just no point in even building that feature. So everything, when we talk about the most innovative features, the most interesting things we can do, they have to be as close to real time as possible. So that means latency really matters.
Two is, then, how do we get these features? If you can do that in real time, how can you get them in the hands of as many people as possible? So that, again, comes back to the question of capacity and what's available from these large language models. And those are still challenges today for a lot of these models, which is at what scale can they operate or can you get them? And then the third is cost. Now I think cost is one of those things that you can solve over time. If you're really providing value to a user, then you can always get the additional cost or price that you need to cover and maintain your margins or improve your margins, whatever it might be. And margins tend to improve over time. As I say, you can always optimize that. So the cost piece tends to be third on the list. There's other things, but to give you the way we think about it, cost is not the blocker for us. We've raised a half billion dollars in venture from the best investors in the world: Andreessen Horowitz, Google Ventures, ICONIQ Capital, just to name a few. And we think there's just this unique moment in time to drive innovation, and so let's really focus on the latency piece because that's going to provide the best experience and get in the hands of as many people as possible, and worry about the cost later. And if you're an earlier-stage startup that is more capital constrained, then cost might be at number one of that list.
Nathan Labenz: (31:48)
Yeah. Interesting. So then how do you think about performance in terms of just the quality of the insights or whatever the task is? Because I have mostly thought about—now, I haven't really done anything in this sort of highly real-time way, and I totally appreciate the rapid depreciation of a tip, you know, that once a moment has passed, it's passed. So totally get that. I haven't operated in that environment. But my intuition in most of the things that I've built has been to try to maximize the quality of the AI output first, subject to limits—if it takes three minutes, you know, in terms of latency, then that can start to be a deal breaker. Again, it depends on exactly how much better the best configuration of this model might be relative to my alternative. But do you think of it in terms of, we want to get above some threshold of utility and then focus on all these other things, or how does that kind of fit in?
Dan O'Connell: (32:50)
Yes. So let me reframe my answer, assuming that you are pleased with the output and accuracy. When we build models, there are two parts. One, obviously, we do come up with thresholds of what we think is going to provide quality output and utility. Two is, we all dogfood these features ourselves, meaning our support and sales teams leverage the models that we build, along with every one of our employees. If we don't think it's providing utility as a good feature, then there's no way that we would expect anybody ever to go and pay us for it. So it starts there. And then the other part, which I think gets skipped by businesses or kind of not thought about, is the feedback loop for users. One is labeling anything that's AI-generated. And two is providing that feedback loop for users to say, hey, this recommendation was good or bad, or this summary was good or bad, or the identification of this action item was good or bad, whatever it might be. Those are things that I think are intrinsically really important, having that feedback loop to get back to the teams. We're also on a biweekly release schedule, meaning every two weeks. So when we talk about tuning our models and making adjustments, we are constantly looking at that data, going back, redeploying models. And, again, there's that constant feedback loop to make sure that there's both utility and function in it. So, as I said, if I reframe my answer: assume that you're doing the right things and building an accurate model that provides utility, and then we're focused on these parts around latency, capacity, and then cost.
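A hypothetical sketch of what such a per-feature feedback event might look like; the field names are invented for illustration and are not Dialpad's schema:

```python
# Hypothetical shape for the feedback loop Dan describes: every AI-generated
# artifact is labeled as such, users can rate it good or bad, and the events
# are routed back to the model team ahead of the next biweekly release.
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
from typing import Optional

@dataclass
class AIFeedbackEvent:
    feature: str            # e.g. "call_summary", "agent_assist_card", "action_item"
    model_version: str      # which deployed model produced the output
    call_id: str
    output_id: str          # the specific AI-generated artifact being rated
    rating: str             # "good" | "bad"
    comment: Optional[str]
    created_at: str

event = AIFeedbackEvent(
    feature="call_summary",
    model_version="summarizer-2024-01-08",
    call_id="call_123",
    output_id="summary_456",
    rating="bad",
    comment="Missed the pricing discussion",
    created_at=datetime.now(timezone.utc).isoformat(),
)
print(asdict(event))  # in practice this would be queued for the model-tuning review
```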
Nathan Labenz: (34:28)
Hey, we'll continue our interview in a moment after a word from our sponsors. Yeah. It's definitely more art than science, still, at this point. And I think we're kind of in this moment where, again, the dynamics are just changing so quickly. I've noticed this in my own fine-tuning work, which I've mostly done on the OpenAI platform. And the reasons for that are, interestingly, also starting to change. It used to be because I couldn't match the performance anywhere else, and it was kind of as simple as that. Now I would say it's pretty reasonable to expect that I could match a fine-tuned 3.5 for my use case—this is at Waymark, where we do essentially video script writing and related tasks. And I think I could probably match the 3.5 fine-tuning with not a huge amount of effort on top of Mixtral or whatever the case may be. But now the better reason for me to stay with OpenAI is that pricing actually works out more favorably for us, and hosting is just much simpler—the managed hosting, relative to having to work with other platforms to provision dedicated instances and so on. You know, and some of those really don't have much in the way of auto-scaling yet at all. Now it's on that other dimension that they're kind of ahead.
Dan O'Connell: (35:50)
And you bring up auto-scaling. These are all things that we have to go through, right? And, again, on a communications platform there are different demands—sorry to interject. Right? You can imagine, like, a meetings platform. Well, guess when meetings pop up: top of the hour, bottom of the hour. Right? Those are the two spikes. So even just having auto-scaling on a communications platform for all of these things—you know, most people are not starting their meeting at 11:17. But these are all, as I said, reasons it's important to have this stack and to have these controls and capabilities. You know, as I said, many of us are former Googlers, and so we're obviously on Google Cloud. These are all of the things that come into play with auto-scaling and doing that at scale, for sure.
Nathan Labenz: (36:32)
Cool. Okay. So let's go back to kind of product and practical user experiences a little bit. You want to just run down the kind of most common, maybe most loved AI features and experiences that you are powering today and, you know, maybe we can kind of start to get into some that are emerging as well?
Dan O'Connell: (36:52)
Yeah. Most loved, obviously, real-time transcription. We do a summary on top of those conversations, which I think the world is suddenly becoming enamored with—you know, instantly summarized blocks of text. But we do some really cool things with summarized text. So you imagine a support and sales conversation: we can go label the purpose and the outcome of those conversations. So suddenly, you can take unstructured data and make it structured. Previously, people would have to label it—literally, people in the contact center reviewing calls and tagging this stuff.
Then the agent assist features. As I said, my first job was in a contact center, and I was terrified of the first conversation I was going to get because I was like, what is this person going to ask me? So, again, being able to guide people, pull information, and present it to them in the moment they need it.
And then the last one that I would highlight is, I think, the perfect application of AI, which is that we can infer customer satisfaction from conversations. You think of what happens in the support world: you and I have this conversation, and at the end of it, you send me a survey, and you're like, hey, Dan, what was your experience? And I'll tell you it was either really good or really bad—those are the only two people you kind of hear from. And so for us, it was, hey, look, we can infer customer satisfaction with really high accuracy. We can do it across every conversation. There's no change in behavior, and there's no new software that needs to happen. And you get a hundred times the volume of data, so it's much more representative. So these are just a few of the things, as I said, that we're working on, but we're really focused on how we capture these conversations. And then, as I said at the beginning of our chat, really focused on delivering features that are focused on assistance, automation, and then telling you things that are happening to help you make better business decisions.
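As a minimal sketch of the idea of inferring satisfaction from the transcript itself rather than from a survey: Dialpad uses its own in-house models, so the prompt-based classifier and model name below are stand-in assumptions, not their approach.

```python
# Minimal sketch of transcript-based CSAT inference: ask a model to rate the
# customer's apparent satisfaction on a fixed scale. Purely illustrative; a
# production system would use a purpose-built (and evaluated) classifier.
from openai import OpenAI

client = OpenAI()

def infer_csat(transcript: str) -> int:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model choice for this sketch
        messages=[
            {"role": "system",
             "content": "Rate the customer's satisfaction in this support call "
                        "from 1 (very unsatisfied) to 5 (very satisfied). "
                        "Reply with the number only."},
            {"role": "user", "content": transcript},
        ],
    )
    return int(resp.choices[0].message.content.strip())

print(infer_csat("Agent: ... Customer: That fixed it, thanks so much for the quick help!"))
```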
Nathan Labenz: (38:39)
Of the stuff that the AI does today, how much of it would you say in the absence of AI gets done by a human versus simply doesn't get done at all? Presumably, if there was no AI transcription, very few of these calls would be transcribed. Right? But maybe many would be summarized. How do you kind of think about how much is this doing work that otherwise wouldn't have happened versus, you know, taking work off of humans' plates?
Dan O'Connell: (39:10)
Yeah. I think about it in two ways. One is, you get into the crazy world of quality assurance—and I say crazy world, and you see me smiling and laughing as I say it, in the nicest way possible, but it is. What happened before there were even transcriptions was people listening in: sitting down, you know, if you're in a contact center, sitting next to Nathan, listening to the call, and then making sure that you're following the process and handling things in a friendly manner. So one is, this stuff is either happening today and there's not even a record of it, or two, it's happening and it's really, really, really time-intensive. Meaning, somebody's probably not even providing the transcript or taking the notes, but they're saying, hey, here's the scorecard and the rubric, and I'm giving Nathan a grade to say whether he's doing a good or bad job. Or you get into the legal world, right? Somebody may want a record of a conversation, whatever it might be. Or a doctor's office, right? Somebody's probably sitting there taking notes, or the doctor's taking notes.
So I think, honestly, our application of the things that we're focused on are these mundane tasks that I think are really ripe for AI to actually do them. Transcription accuracy is near human levels. As I said, kind of everyone's in the same realm of 90, 95% plus accuracy. And then you get into this realm of, hey, we can summarize that information and structure it in really fantastic ways. And I think there are immense opportunities to either completely automate tasks or augment the tasks in really fantastic ways to free up people.
Nathan Labenz: (40:47)
Just as an aside on the transcription, are you still using the original stack that you created some years ago there? That's all been Whisperified at this point?
Dan O'Connell: (40:57)
Yeah. So all of that we still do in-house, and I'll share a little context on that. We're on the NVIDIA NeMo toolkit for our latest models. And when you talk about telephony, there are different codecs, so everything is fine-tuned for the codecs that we leverage and use. So, really high transcription accuracy currently. And when you're talking about driving automation, assistance, and insights with GPT-4—we say those three things to kind of ingrain it in people—transcription accuracy becomes the foundation of all of that. And, you know, I've listened to some of the other podcasts, and people talk about garbage in, garbage out on data. But there's always that argument that comes up from investors or analysts: look, isn't speech recognition just a commoditized technology? And I'm like, I don't think so. I think we're a long way from solving that problem when you get into accents, words that don't show up in the dictionary, proximity to the microphone. You have all sorts of complex challenges around it, but that to me is the foundation for all of this: if you can't accurately transcribe information, then everything else goes out the window. And, again, that's why we recognize that and think it's important to have an in-house speech recognition team and build those models and fine-tune those models for our telephony network.
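For readers curious what the NeMo side looks like in practice, here is a minimal sketch using a public pretrained checkpoint. It is not Dialpad's codec-tuned production model, and the exact transcribe() signature and return type vary slightly across NeMo versions:

```python
# Minimal NVIDIA NeMo ASR sketch: load a public pretrained English model and
# transcribe an audio file. This stands in for Dialpad's in-house models,
# which (per Dan) are fine-tuned for their own telephony codecs.
# pip install "nemo_toolkit[asr]"
import nemo.collections.asr as nemo_asr

# A publicly available checkpoint; Dialpad's production models are proprietary.
asr_model = nemo_asr.models.ASRModel.from_pretrained(
    model_name="stt_en_conformer_ctc_large"
)

# Telephony audio is typically 8 kHz; most public checkpoints expect 16 kHz,
# so real pipelines resample or train on the target codec, as Dan describes.
transcripts = asr_model.transcribe(["sample_call.wav"])
print(transcripts[0])  # plain string or Hypothesis object, depending on NeMo version
```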
Nathan Labenz: (42:19)
Yeah. Interesting. The codec aspect to that, there's always a lot of little nuances once you really get into the weeds on these things.
Dan O'Connell: (42:27)
Yeah. People will come to you and say, hey—they'll see a benchmark, you know, much like anything. And benchmarking always has caveats and nuance too, and I always encourage people to understand that. You may see a benchmarking report of a business that does speech recognition and says they have the highest accuracy, and it all comes down to, well, what training set was that? And what network was that on? There are all these things that matter around the codecs. And so for us, as I said, we have the best accuracy on our codecs and telephony network. So there's nuance. And we're not a provider of speech recognition for somebody as an API—somebody can't take our models and plug them in on the T-Mobile or Verizon or AT&T network, whatever it might be—and that matters. And I think it's important for people to understand the context of these decisions and the context of these benchmarks.
Nathan Labenz: (43:21)
Yeah. Interesting. How much do you personalize? I guess you could customize to the level of the customer, or you could personalize down to the level of the customer's customer. And I wonder how much of each of those two levels you do—and I guess you could even do it in multiple ways. This could be context management at the prompt level—here's a profile of this business or a profile of the customer—or you could go into a RAG setup, or you could go into fine-tuning. How do you think about it? I guess I'm just thinking about these tips, right? A tip pops up; obviously that's got to be specific—certainly got to be specific to the business. It can't just be a generic "try to smile more." Maybe it even needs to be specific to the individual end customer too. So, yeah, personalization. Tell us everything.
Dan O'Connell: (44:13)
We start with custom models for every customer. And so that is the ability for them to fine-tune the speech recognition model they would have. And I always go back to, look, in startup and SaaS land, we tend to come up with acronyms that are new. We come up with funky spellings of product names to be unique. So those tend to be the most important things to a business, right, the uniqueness of those products or those acronyms. And so, again, it comes back to you need to accurately transcribe them.
Where we want to get to is, hey, Nathan, if your business is a customer of Dialpad, we would like to provide Nathan his own model for himself. And I always go back to—you think of somebody that has a name like Sarah, and there are multiple ways to spell that. You want to piss somebody off really well? Show them a transcription or a summarization and spell their name wrong every single time they see it. So those are things that we're working on. We don't have that problem solved today, but, again, these are all of the challenges that start to pop up at scale. And you bring up two interesting points: if you're a customer, can we then build another model for the customers that are leveraging your platform, or can I start to have a unique model for the individual employees that might be leveraging the piece of software? Those are all still opportunities for us that we haven't solved. I don't think anyone has solved it at those levels. But when we think about where we want to get to and what the annoyances are that show up in this stuff, those are the very real annoyances, just thinking about transcribing somebody's name.
Nathan Labenz: (45:52)
Yeah. If you're doing this at home, one very practical tip that I have—which I don't think would work at your scale, but works for me when I'm just processing podcast transcripts, for example—is to take a raw transcript and then run it through a language model to clean it. And at the top, I'll just say: here are all the proper nouns that you need to know to clean this up, so that you get the company names right and all that sort of thing. Again, that works at low scale. I don't mind dropping what might be a dollar on Claude or whatever to do that. It's not going to work for you at the scale of all your customers' many, many calls. But how much data do they need to do this sort of customization? Is there a rule of thumb of, you need X hours of speech? And I assume that they need ground truth as well—you know, the correct transcript of those inputs?
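Nathan's cleanup trick, sketched as a small script against the Anthropic API since he mentions Claude; the model name, proper-noun list, and prompt wording are assumptions:

```python
# Low-scale transcript cleanup: send the raw transcript to a chat model along
# with the proper nouns it should spell correctly. Illustrative only.
# pip install anthropic
from anthropic import Anthropic

client = Anthropic()  # assumes ANTHROPIC_API_KEY is set

PROPER_NOUNS = ["Dialpad", "Dan O'Connell", "Kaldi", "NVIDIA NeMo", "TalkIQ"]

def clean_transcript(raw: str) -> str:
    prompt = (
        "Clean up this raw call transcript: fix punctuation, casing, and obvious "
        "mis-transcriptions, but do not change the meaning.\n"
        f"Proper nouns that must be spelled exactly as given: {', '.join(PROPER_NOUNS)}\n\n"
        f"Transcript:\n{raw}"
    )
    message = client.messages.create(
        model="claude-3-haiku-20240307",  # assumed model choice for this sketch
        max_tokens=4096,
        messages=[{"role": "user", "content": prompt}],
    )
    return message.content[0].text

print(clean_transcript("thanks for joining the dial pad call with dan o connel ..."))
```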
Dan O'Connell: (46:49)
Yeah. The hours—I don't have the number of hours. I can ping my cofounder, and I'll get you the hours. But there is a threshold, and this always comes up for customers, to put it in relevant terms. We have customers all around the world, and so somebody will say, hey, we want to have transcription in a language you don't have today. And so then we need to find the audio for that and make sure that there's enough to do it. And to your point on the ground truth, this all stemmed from our first startup, when we were building TalkIQ. We had to build our own telephony platform. We had to build our own speech recognition. We had to build our own NLP, and we had to build the tools at that time to even do it: how do we capture the audio, literally start to create 15-second clips of the audio, and then build the tools that our labelers could go through? You need to start by accurately listening to this 15-second clip. I need the ground truth, which is: go and transcribe it. Somebody then has to make sure that's accurate, because that's a tedious task at scale. And then we have to take those and build the models and improve the models from them. So these are all of the very real challenges that, as I said, because we have this past experience, we fortunately went through or faced back then. And, as I said, much of that has been solved today, and different startups have shown up to build tools to do it.
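A minimal sketch of the first step of that labeling pipeline—slicing call audio into 15-second clips for human transcription. The library choice and file layout are assumptions:

```python
# Slice call audio into ~15-second clips that human labelers can transcribe to
# produce (audio, text) ground-truth pairs, as in the workflow Dan describes.
# pip install pydub  (also requires ffmpeg on the system)
from pydub import AudioSegment

CLIP_MS = 15_000  # 15-second clips

def make_clips(path: str, out_prefix: str) -> list:
    audio = AudioSegment.from_file(path)
    clip_paths = []
    for i, start in enumerate(range(0, len(audio), CLIP_MS)):
        clip = audio[start:start + CLIP_MS]
        clip_path = f"{out_prefix}_{i:04d}.wav"
        clip.export(clip_path, format="wav")
        clip_paths.append(clip_path)
    return clip_paths

# Each clip then goes to a labeler for a verbatim transcript, a second pass
# verifies it, and the resulting pairs feed model training.
print(make_clips("call_recording.mp3", "clips/call_123"))
```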
Nathan Labenz: (48:22)
Yeah. I'm struck as I'm listening to that description that it seems like it's probably gotten an order of magnitude more efficient even just to set that up. Right? Because now, you know, there's the Facebook model, for example, that supports like a thousand languages or something. So it may not be up to your standard on language 900, but it probably saves 90% of the time of having to go hire somebody on Upwork to get started.
Dan O'Connell: (48:47)
It does. And you do kind of hack around it, right? You know, somebody shows up today and may say, hey, we want to have Dutch. And we would say, oh, we don't have Dutch today. So what we could do is plug in, you know, Google; we could plug in Rev, Speechmatics, Deepgram, Whisper, you name it. And, again, it goes back to where we started the conversation: you make sure you pick whoever is going to get you the best accuracy on the codec that you're on. So we would go and do that. And then that becomes training data for us, right? So suddenly, as people just use the platform and they opt in the data, that helps us over time, and we can then do this ourselves and reduce costs and improve accuracy faster than a third party. So those are all the very real things that we go through, and there are times we would take that approach, and there are times that we would potentially plug in a third party and just let it be. And that usually stems from thinking about the opportunities for the business: hey, this language might not be in high demand, so we're okay with the pass-through costs of just using a third party for it. Versus, hey, there are very real opportunities for us at scale to do this in-house.
Nathan Labenz: (49:58)
How about on the content generation side? In my limited free trial, I didn't have enough content in the platform to really see how that can work. But obviously this is another thing that has to be customized to really be useful. And I've done some experiments trying to get AI to write as me. I haven't quite gotten out of the uncanny valley yet, I'm sorry to say. How are you guys doing in terms of generating content? How do you think about measuring how well you're doing? And, you know, we'll predict into the future, I guess, from there?
Dan O'Connell: (50:34)
Yeah. We're early on that. So as I said, when we think about content generation as a communications platform, we think the easiest thing is generating a reply. You and I have a conversation, and one thing that we've kind of noodled on is, hey, can I send Nathan this summary? And if I send you this summary, then I want to have an email that accompanies it. And can we auto-draft the email based on, again, the context of the conversation we just had? I tend to have similar opinions to yours about this valley, where I'm like, I think there's some utility there, but is it more trouble than it's worth? I use ChatGPT to write me first drafts of blog posts, you know, as a starting point for things. So I do think there's utility in some opportunities there when we get into content generation. We're still early in exploring that, but I also kind of have this uncanny thing of, I don't know—for things like email responses, for example, is it really worth the time? Is it really going to provide a ton of value for people? Are they going to end up just customizing it themselves and writing it? You know, as you start to engage with a large language model, you suddenly realize that you've spent as much time as you would have just drafting a two-sentence email and sending it off.
Nathan Labenz: (51:50)
That ultimately comes down to just how good the generations are, though, right? This seems like a very dynamic situation. I haven't gotten it to work. I guess I haven't even really tried it with GPT-4 as much as I probably could have—I've been kind of holding off to try fine-tuning GPT-4 on my own writing to see how close I can get it. Presumably, if you could fine-tune GPT-4 on a decent body of business correspondence for a business, you may not be writing love letters to partners or taking over corporate strategy, but presumably you could get pretty high acceptance rates on routine communication, I would think. Right?
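A minimal sketch of what Nathan is describing, using OpenAI's fine-tuning API. Since GPT-4 fine-tuning was still in limited access at the time, the sketch targets gpt-3.5-turbo, and the file contents are placeholders:

```python
# Fine-tune an OpenAI chat model on a body of someone's own correspondence.
from openai import OpenAI

client = OpenAI()

# training.jsonl contains chat-format examples, one JSON object per line, e.g.:
# {"messages": [{"role": "system", "content": "Write email replies in Nathan's voice."},
#               {"role": "user", "content": "<incoming email>"},
#               {"role": "assistant", "content": "<the reply actually sent>"}]}
training_file = client.files.create(
    file=open("training.jsonl", "rb"),
    purpose="fine-tune",
)

job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-3.5-turbo",  # GPT-4 fine-tuning was early-access only at the time
)
print(job.id, job.status)
```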
Dan O'Connell: (52:39)
I think you're probably right. I don't—you see my hesitancy—I don't know if I'm bought into that. Meaning, I think we're still a ways off from this stuff. And I say this meaning, I think there are very real opportunities for large language models, and what will still play out this next year is that I still think we have a ways to go. I think they're going to do some things really broadly, really well, and we're still going to need the next level of development to start doing things that are really going to seamlessly work and be the next wow factor that shows up. And I just think this is one of those examples where I'm like, I do a lot of content generation today, and I'm always like, it's a great first draft. It's a great starting point. And I think about putting those features in our app and potentially charging for them, and I'm like, I don't know if that's the wow-factor feature that we should be building. I think it's a quality-of-life feature to say, hey, end of call, we can generate this amazing recap and structure it. And I'm like, do I need to have the auto-generated email to go along with it? It's a couple of sentences, and is it really going to do it that well, and is that where we should be spending, you know, our product and engineering time?
And there are different competitors in the space. Like, Teams is very much focused on that with Copilot. You know, Zoom does that a little bit with AI Companion, and I'm just, I don't know if I'm personally convinced on that use case yet, especially one that you might potentially charge for. I just think there's some extra development that needs to happen. And to your point, that might be like, hey, there's a bunch of fine-tuning that you individually need to do, and then you think about, well, how do you do that at scale for every user? These are just the things that pop up. It sounds great, but you're like, okay, so then you want every user to hop into your app and suddenly do fine-tuning to customize the emails.
Nathan Labenz: (54:35)
Well, actually, one of the big reasons I thought this conversation was so interesting is because in looking at Dialpad, I've been kind of thinking, this feels like the gym, in the reinforcement learning sense, that is both capturing the data of what people do, incompletely, and maybe there's a little caveat we can discuss there, but definitely capturing a lot of what people do, especially in these sales and marketing roles that are the most intensive users. And then also the Dialpad environment seems like a very natural place in which to deploy and evaluate and ultimately refine an AI agent, because you've got the scope of possible actions much more clearly defined than if you're just saying, oh, go use a computer or go use the web at large. There's a bunch of different questions here, but if it's not there yet, or if the next generation of model quality is not going to take us there, and I mean, you're saying maybe not even on email drafting, I'm thinking even a little farther than that: virtual AI employees that may start limited, but can hopefully get in there and actually advance your work for you. What do you think is missing that would stand in the way of that happening in 2024?
Dan O'Connell: (56:00)
Well, I think I'll take a small tangent and tie things back. The content generation piece to me, the part that excites me and I think has the higher utility, is understanding the recommended next best step or automated workflow that somebody should take. What I mean by that is we can understand these conversations, and we're tied into a CRM and a database so we know the outcome. If I'm using specific language in a conversation and that language leads to a more positive outcome for the business, then the next time we see that conversation happening for a different person, we should highlight to them, "Hey, here's the right course of action." So when I think of content generation, it's more around generating content around here's the next best action as opposed to content generation of, you know, here's the email to write.
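To make that concrete, here is one minimal way such a next-best-action signal could be computed. This is purely an illustrative sketch, not Dialpad's method; the phrases, outcomes, and lift threshold are invented.

```python
# Sketch: correlate phrases detected in calls with CRM outcomes and
# recommend the highest-lift phrases as a "next best action".
# Phrases, records, and the lift threshold are illustrative assumptions.
from collections import defaultdict

calls = [
    # (set of phrases detected in the call, deal_won)
    ({"pricing flexibility", "security review"}, True),
    ({"security review"}, False),
    ({"pricing flexibility"}, True),
    ({"free trial"}, False),
]

baseline_win_rate = sum(won for _, won in calls) / len(calls)

stats = defaultdict(lambda: [0, 0])  # phrase -> [wins, total uses]
for phrases, won in calls:
    for phrase in phrases:
        stats[phrase][0] += int(won)
        stats[phrase][1] += 1

def recommendations(min_lift=1.2):
    """Yield phrases whose win rate beats the baseline by min_lift."""
    for phrase, (wins, total) in stats.items():
        lift = (wins / total) / baseline_win_rate if baseline_win_rate else 0
        if lift >= min_lift:
            yield phrase, round(lift, 2)

print(sorted(recommendations(), key=lambda x: -x[1]))
```

Phrases whose historical win rate beats the baseline by the chosen margin would then surface as in-call suggestions for other reps, which is the "here's the right course of action" experience Dan describes.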
Nathan Labenz: (56:59)
Okay. Well, let me keep pushing you to think a little farther into the future. I feel like if I imagine AI realizing its potential, a lot of sales and customer service ultimately ends up getting mediated by AI. And it's not that I'm racing to replace people or cut costs or whatever, but I always kind of come back to a Bezos-style question: what does the customer really want? And the customer wants immediate response 24/7, where I can pause the conversation at my convenience and be able to come back and pick it up right where I left off and maybe even switch modalities. And ChatGPT offers me all these things today. It's like, I had a conversation while driving this weekend about the new state space model moment that's happening. And I would say ChatGPT was maybe not the best conversation partner in my life possibly for that conversation, but certainly among them. And for many people who aren't as connected as maybe I am, it probably is the best conversation partner that they could have. So do you think that I'm wrong about that? Going back to the stats at the beginning, third, third, third, these small and medium-sized businesses. Enterprises maybe don't have to go this way, but a small business just can't staff the phones 24/7. They can't pick up all the calls. You try calling the local restaurant around here at a lunch rush and they don't even want to answer. You just have to walk in and wait in line. Seems like there is a qualitative change that can happen here. Obviously it comes with a lot of potential disruption. I don't know, it seems like you're not buying that vision. I'm not sure if it's because you think that technology is not going to pan out or you just don't want to go there.
Dan O'Connell: (58:56)
Automating support, doing digital deflection. I think first-generation chatbots have been really good password reset bots. Obviously, you can put a large language model on your chatbot, and it's going to be able to handle much more complex requests and sound much more natural. I fully believe in that for support. And I would say, you know, today, probably 80% of people start engaging with a brand for support on a digital channel. So I think the very first thing they want to do is engage with this next generation experience with a large language model, get better answers, more natural-sounding answers as quickly as possible. For sure, we are working on that. I believe wholeheartedly in that opportunity.
This automated sales one, I don't know about that. And when I say that, I mean outbound sales and creating an outbound presence, because there are some startups working on this. I don't know about that. We had dinner with Marc Andreessen the other week, and Marc mentioned that he has this idea that it might be two bots that chat with each other: the sales bot, and each of us has our own automated bot that understands us, and I have my own that's selling. I don't know if I agree with that. I personally think people like to buy from people. I also think, when you think of a support use case, that, yeah, I want to engage with a digital bot, but the second I need to escalate that to a phone, I don't know if I want to talk to an automated voice bot. As good as that bot might be, I think when you're really frustrated about something, you want to be talking to a human. And then you get into all sorts of things. Do you have to let somebody know that it's a voice bot if it's so good and you've got speech synthesis and everything? I don't know. And I do think that it's going to take time. As I said, when I go back to the time thing, I don't think that stuff shows up in 2024. I've seen some really impressive demos, and at the end of the day, you can still tell. And so that's why I said, I don't think in the next year you suddenly have an experience where you can't tell. Maybe on a digital channel for sure, but not on a voice channel. And I'm happy, as I said, if somebody's got a really cool demo out there, I would love to see it and play around with it. But those are some things that we spend time thinking about as a communication platform. Does this automated sales agent show up, and does somebody want that? But I very much view it as the next-generation chatbot on a digital channel, for sure. Instant answers, better answers, faster answers, all of that stuff is an immediate big opportunity for us.
Nathan Labenz: (1:01:29)
So what is your kind of 2025 and beyond expectation? Mine is pretty much that all this stuff does happen. I'm not as sure if we get AI automated science, although even that is looking increasingly likely. But definitely AI-mediated sales and support seems quite likely to me. Yes, I would agree they are still kind of uncanny valley, even the GPT voices. They don't handle interruptions very well. They don't handle my long pauses very well. So they start talking when I'm still trying to talk. I was looking back at the transcript with ChatGPT and I'm like, "Stop interrupting me. I'm known for my long pauses for God's sake." So they're not quite there. I would definitely agree with that.
I'm also a big supporter of the notion that I don't think we should have AIs deceiving people. Even if it does sound fully human and could pass, I think we should probably either as responsible corporate actors or as government say, "We're going to just be clear when you're talking to AI." That seems healthy. I don't think people will mind. And it does feel like the ability to respond immediately and put the phone down for 10 hours and then come back and pick it up later after my kids go to bed when it's convenient for me and seamlessly do that, that seems so helpful, not to mention the ability to pass the cost on to the customer. I guess you could think, pick your year. And I'm not saying I care exactly what your specific year is, but do you think it's so far off that it's just not really something that we need to be concerned with right now? Because if it's not 2024, but it is 2025, it's still very close.
Dan O'Connell: (1:03:25)
The reason I have to express that hesitancy (and honestly, I think there are much smarter and greater people who will tell me it's faster and I'm wrong) is that my experience with technology has followed that trough of disillusionment. I look at self-driving cars as a great example of this. We're now a decade into this journey. If you had asked me a decade ago, based on what people were saying, we would have figured this was a solved problem by now. And when you really get into the complexity of human language and of work, especially some of the complex workflows that we have every day, these turn out to be much harder problems in the real world. That doesn't mean you can't be really enamored and amazed by a demo, or that there aren't all of these little things we do every day, mundane tasks that are ripe for complete automation or better experiences. But in the aggregate, I think it just is going to take more time. And I don't have a guess on time. The only thing I feel is, when I think about a year, I'm like, look, I've been blown away by large language models and ChatGPT, and even just technology and its drama over the past decade. I just have this inkling and this hesitancy that it's going to take more time. And I hope I'm wrong, because as I said, as somebody who grew up in Silicon Valley, as a technologist, and as somebody who just loves innovation, I hope I'm wrong. But I have this sinking feeling, unfortunately, that it's going to be harder and more complex than we think.
Nathan Labenz: (1:05:04)
That's really interesting because I also kind of hope I'm wrong, but in the other direction. I feel like I see this coming at us so fast.
Dan O'Connell: (1:05:12)
What have you seen that you're like, "This is it"?
Nathan Labenz: (1:05:16)
Well, just broadly extrapolating the last 2 years. Just going back to literally 2 years ago today, and we're recording on 12/19/2023. So December 2021, there still wasn't an instruction-following model in public. The state of the art for me at Waymark was fine-tuning GPT-3. And I think we had just gotten, we had moved from Curie scale to DaVinci scale, and that was enough to create our first script writer that was like, "Okay, this will definitely be useful to our customers." Our use case is quite narrow compared to certainly the Dialpad use case and certainly the broader agent challenge globally. But we have gone from couldn't get actual utility from it in late 2021 to it is a totally qualitatively different experience 2 years later. And that includes the script writing having gotten very good, the image understanding having also progressed by leaps and bounds such that we're now able to choose the appropriate image out of a user's image library with pretty high consistency. The text-to-speech has also gotten dramatically better. Again, it wasn't even 2 years ago, but probably 1 year ago that we went from, "Hey, this just doesn't sound good." We partner with media companies. We do also sell direct to small business, but our kind of standard is, would a media company that owns TV stations or owns a cable network, would they put this on their air and feel like that is the kind of thing that isn't going to make people turn their TVs off? It wasn't a year ago. And now most of the stuff that we put on the air is AI voice. And it's not perfect, but all of these things have kind of gone from 30% of human to 90% of human in just this less than 2-year timeframe. And I don't think we're done. I look at our model and I'm like, I see actually a lot of low-hanging fruit still where some of the techniques that we've developed for past generations are now hurting us. We've tried to fit so much into so few tokens and whatever in the past. Now it's like, you know what we should really do is go back and look at that again because the tokens are a lot cheaper, the models can handle a lot more of them.
Dan O'Connell: (1:07:58)
To me, it's always that last mile and the complexity of that last mile. As you said, I totally agree, and I go back to the bad analogy of the self-driving cars because I live in San Francisco. So we have Waymo cruising around all the time and doing a good job. But that's not without issue. And so I think it always comes back to, I agree with you, it's that last mile, the piece of, hey, we're 70% of the way there. What does it really take to have just that amazing, beautiful experience? I hate to say it, and I'm not getting into the AGI space, but that's the piece that I wonder about, you know, I truly wonder whether that shows up. And I hope so in the next couple of years. But I just have this, as I said, this hesitancy that maybe it's going to be harder than we think. But maybe, as I said, maybe not.
Nathan Labenz: (1:08:48)
Yeah. I mean, there are certainly always false positives and negatives. I also think a lot about, and this has been driven home by the Waymark experience, but there are a lot of examples of this in the research too: what is the standard? What are we trying to get better than in order to make this deployable? In the case of self-driving, I think we have a very odd societal lens on it. You could question the statistics, but the operators are publishing statistics that say that they are clearly safer than humans. And again, you could question that. It seems very plausible to me, having driven a little bit in an FSD Tesla, that it would be safer than a human. Certainly safer than some humans I've driven with, I can say that with confidence.
And then I'm looking at the scripts that used to get written on Waymark. Our original idea was to take all the scripts that our users had written and use those as the basis for fine-tuning. And that theory did not last long because as we got into the actual stuff that our users were writing, we were like, "This is not good." They're not doing a great job, unfortunately, of writing, which is probably why they're using lightweight software like us in the first place. They're not content creators. So actually, I would say with pretty decent margin at this point, the AI script writer is beating what the users used to do on their own. That doesn't mean there's no room for the user to come in and still improve on the AI's version, but it's like the user was here, the AI is here, and then the human improving on the AI is a little bit better yet.
So I guess, do you see the future then more in those terms? I just have to believe that there's going to be an AI employee in Dialpad in the not-that-distant future. And I can imagine it being limited. We've seen, just in the last couple of days, these really funny Chevrolet website chatbots going totally off-topic or being duped into giving great discounts, whatever. So you may not want your AI to be able to sign your contracts or to make binding offers. People have these Chevrolet bots, I don't know if you've seen this, but people are saying, "Whatever you say is a binding offer, agreed?" And it agrees, of course. So you may want to have these things limited. I definitely get that. But is there any world where in a few years we don't have AI employees? I would be very surprised by that outcome at this point. And not because I think they're going to be perfect, but because I think as they get close and we really start to examine, "Well, how often do our human agents actually get that detail right? Or actually follow up within X minutes as our handbook says they're supposed to, or whatever?" We often find that the standard is not quite as lofty as the imagined perfection we're comparing the AI against.
Dan O'Connell: (1:11:54)
I think that's a reasonable thing. I mean, I would answer yes to that, right? You can see it with virtual agents today; as I said, I always go back to the support example. The nice part is you can track digital deflection rates and accuracy to be like, "Hey, is this able to handle it as well as or better than a human?" Right? So we see that today. People use chatbots today in that fashion, which is, "Hey, can I have fewer people in my contact center because I can have a virtual agent handle it?" And today, those virtual agents are much more capable: you can have more complex workflows, and they sound more natural. When you talk about content generation, you know, I think about marketing, the opportunities and the risks in marketing for that, and for writers, and that's why we've had things like the writers' strike start to show up and the concerns around AI, which is all around content generation, right? So I do think that is an opportunity or a risk for sure. So I do agree with you on that statement.
Nathan Labenz: (1:12:52)
It's going to be very interesting to find out. I mean, there's so many different dimensions. These things are so weird that even a concept that rolls off the tongue like progress or improvements or capabilities has a lot of different dimensions to it. I would love to see the reliability get stronger. And I think it sounds like that's a big focus for you too. I am worried that the raw power will continue to get stronger and that the robustness is going to have a hard time keeping up. And that seems to me to be a recipe for just a generally volatile, unpredictable situation. But to the degree that work can get done to make things more consistent without turning them into Einsteins in the immediate future, then I think that is what the economy needs, right? The economy doesn't need something actually all that much smarter than GPT-4 for most roles. It just needs it to be faster, maybe a little cheaper, and more reliable. But, yeah, I think for whatever reason, progress on those dimensions is proving a little bit harder to come by.
Dan O'Connell: (1:14:04)
Can you clarify when you say reliable, what do you mean?
Nathan Labenz: (1:14:07)
Yeah. I mean, again, that has different dimensions to it. But I've been thinking a lot about this state space model moment recently, and I think one of the key limitations of current transformer-based language models is that they're purely episodic. The episodes are getting longer as the context windows get longer, but they have no mechanism to carry anything forward from one episode to the next. So it's like they always have total amnesia. There's that gap between the massive knowledge encoded in the weights and the little bit of context in the window, with kind of nothing in between. Well, now we have RAG and stuff in between, but it's kind of hacky stuff. Something like an integrated memory feels to me like maybe the way we get to this higher level of reliability. And again, I think there are a lot of ways to measure or conceive of reliability, but to some degree it's predictability, legibility, consistency, being told once that you did the wrong thing and then actually listening to that and doing it right in the future. And those are the things that I think transformers fall down on today, and that prevents them from being reliable coworkers.
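For concreteness, the "hacky stuff in between" usually looks something like retrieval over past conversations: store snippets, pull back the most relevant ones, and feed them into the next prompt. The sketch below is a toy illustration, not any particular product's implementation; the bag-of-words vectors are a stand-in for a real embedding model, and the snippets are invented.

```python
# Sketch of retrieval-style "memory" across conversations: store snippets,
# retrieve the most relevant ones, and feed them back into the next prompt.
# The bag-of-words vectors are a toy stand-in for a real embedding model.
import math
import re
from collections import Counter

memory = []  # list of (snippet, vector) pairs

def embed(text):
    return Counter(re.findall(r"[a-z']+", text.lower()))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def remember(snippet):
    memory.append((snippet, embed(snippet)))

def recall(query, k=2):
    q = embed(query)
    ranked = sorted(memory, key=lambda item: cosine(q, item[1]), reverse=True)
    return [snippet for snippet, _ in ranked[:k]]

remember("Customer prefers annual billing and asked about SSO last call.")
remember("Follow up after their board meeting in March.")
print(recall("What did they say about billing?", k=1))
```

Whatever the retrieval step fails to surface is effectively forgotten on the next turn, which is why this kind of bolt-on memory feels less satisfying than the integrated memory Nathan is pointing toward.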
Dan O'Connell: (1:15:32)
That makes sense.
Nathan Labenz: (1:15:33)
Any thoughts on memory? This is a real kick of mine at the moment and I'm going to have an insanely long monologue episode coming out about it. But is that an area that you are thinking about as well?
Dan O'Connell: (1:15:48)
Yeah. I mean, there's the context window, and when you think of applications for customers, right, it all stems from being able to personalize the content or that experience. And it all stems from, yes, being able to pass information forward from, as you classified it, multiple episodes, multiple conversations in the past. And today, the way you handle that is you pull information from a system of record and try to provide that context to the model. It doesn't have the full context, because old records might be only partially intact. They might not even exist. Nobody has them. And to me, that again comes back to our biggest opportunity. What really excites me about the platform, really any communications platform, is you can suddenly have a record for every conversation, assuming you want to record, I'll put the preface out there, assuming you want to have a record of it. Great, if that is true, then suddenly you have these really amazing opportunities that open up of saying, "Suddenly, I can personalize the content in the moment. Suddenly, I have this understanding of the past seven conversations that we had." And that really opens up the opportunity for these virtual agents or virtual sales assistants, however you want to frame it.
So I think figuring out this next step of long-term memory, that to me is something that needs to happen. Again, I have this hesitancy because these are all the things we have to have happen, right? And how quickly does that change? But, you know, GPT-4 and the token limits, those created big challenges. To take a step back: you take a long-form transcript. We would have to break the transcript up, and you're like, "Hey, go summarize this piece of the transcript." And suddenly you've taken a 30-minute conversation, you've got to split it up 16 times, and you're taking 16 summaries and asking for an aggregate summary. These are all the challenges that show up. And you're like, "Well, the better way to get a summary is if the token limit is increased." I make one request for a summary, and guess what? The summary I get from that has much better context. So these are all, I think, the big implications for these next experiences that we want to see. And for me, as I said, why I get excited about this is you see these experiences, and they're still, to use your words, 70, 80% of the way there, and you can see what it's going to look like when it's 100%.
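The chunk-and-aggregate workflow Dan describes is often called map-reduce summarization. Here is a minimal sketch of the pattern, assuming an OpenAI-style chat completions client; the model name and chunk size are placeholders, and this is a generic illustration rather than Dialpad's actual pipeline.

```python
# Sketch of map-reduce summarization for transcripts that exceed the
# context window: summarize each chunk, then summarize the summaries.
# Model name and chunk size are placeholder assumptions.
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4"  # placeholder

def ask(prompt):
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

def summarize_transcript(transcript, chunk_chars=8000):
    chunks = [transcript[i:i + chunk_chars]
              for i in range(0, len(transcript), chunk_chars)]
    partials = [ask(f"Summarize this part of a call transcript:\n{c}")
                for c in chunks]
    # With a context window large enough for the whole transcript, this
    # collapses to a single request, which is the improvement Dan describes.
    return ask("Combine these partial summaries into one call recap:\n"
               + "\n".join(partials))
```

The aggregate summary loses detail at each hop, which is why a single request over the full transcript tends to produce a recap with better context.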
Nathan Labenz: (1:18:19)
Fascinating conversation. I especially love getting into just kind of the somewhat divergent views of the, or you know, expectations, not necessarily views, but expectations for the short-term future. Anything else you want to cover today that we haven't got to so far?
Dan O'Connell: (1:18:33)
No. I think we covered a lot of ground. As I said, you know, reasons we kind of chose our path of trying to do a lot in-house, and it was fun talking about, as I said, the views. I was like, "Oh, I got to bookmark this, and we'll hopefully come back a year or two later and see how this plays out." But, you know, for me, as I said, I think there's just tremendous opportunities in sales and support and marketing. I think those are three just ripe areas for opportunity for AI to be leveraged, right? Whether it's around content generation or just understanding conversations and powering assistance and automation within them.
Nathan Labenz: (1:19:09)
Dan O'Connell, AI and strategy at Dialpad. Thank you for being part of the Cognitive Revolution.
Dan O'Connell: (1:19:16)
Alright. Thanks, Nathan.
Nathan Labenz: (1:19:17)
It is both energizing and enlightening to hear why people listen and learn what they value about the show. So please don't hesitate to reach out via email at tcr@turpentine.co, or you can DM me on the social media platform of your choice.