
Voice Cloning: The Text-to-Speech Feature You Never Knew You Needed And Why It Matters

Over the holiday break, I started experimenting with cloning my voice for reasons I will get to later in this blog post. As I walked down the list of Voice Cloning providers out there and began to evaluate them using my cost-to-benefit ratio scale, a set of requirements and must-have capabilities emerged.

In this blog post, we will cover what those required features are, why they are essential for my scenario, why I believe they carry over to the general use case, and, ultimately, what it all means for text-to-speech providers moving forward.

First Some Background

I have been in the Natural Language Processing (NLP) space for over 3 years. In that time, as most people do, I started looking to obtain accurate transcription from speech and then moved into trying to digest conversation to create “computer-generated” interactions. Large Language Models (LLMs) dramatically accelerated the accessibility and, quite frankly, the ability to do so in a meaningful way without a lot of effort.

After comprehension, most individuals move into increasing the level of interaction by interfacing with these systems using humans' other amazing tool: hearing. As humans, we don't want to talk into a device and then have to read its output. I mean, heck, most people find subtitled movies beyond annoying if those subtitles drag on for anything more than a few minutes. Here, we start to see the need for text-to-speech, but what kind of voice should we use?

How I Tried Automating Myself

That voice depends on the use case. More to the point, it depends on how familiar you are with the "thing" you are interacting with. I use "thing" as a catch-all, but in reality, it's some device you are conversing with. Depending on what that device is and what our connection with it is, the voice used makes all the difference in the world to the experience of that interaction.

Let’s consider these scenarios:

Siri, Alexa, or Google

These devices are simple. You say a command, and Siri, Alexa, or Google (hopefully) give you a meaningful answer. You don’t place much weight on what kind of voice it replies with. Sure, it’s cute if it replies in an accent or if it can reply in Snoop Dogg’s voice, but in the end, it doesn’t really matter all that much for that interaction.

Call Center, Tech Support, etc.

The next wave of voice interactions is replacing humans with voice automation systems. This is where most companies are today in this evolution. There are a ton of companies trying to do this for a variety of reasons, usually driven by the desire to decrease labor costs.

The most common use case is replacing customer support staff with these automated systems. Today, this usually entails using Speech-to-Text to transcribe what someone on the phone is saying, passing that transcript to a Large Language Model (LLM) or, more correctly, a Retrieval-Augmented Generation (RAG) system for better context, and then running the output through Text-to-Speech to generate a human-like voice to feed back to the listener on the other end of the phone.
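
To make that loop concrete, here is a minimal sketch in Python. The transcribe, retrieve_context, generate_reply, and synthesize helpers are stand-ins I made up for whichever STT, vector store, LLM, and TTS providers you end up choosing; the point is the shape of the pipeline, not any particular vendor's SDK.

```python
# Minimal sketch of the Speech-to-Text -> RAG/LLM -> Text-to-Speech loop.
# Every provider call below is stubbed out; swap in your actual SDKs.

def transcribe(audio_chunk: bytes) -> str:
    """Placeholder: call your Speech-to-Text provider here."""
    return "caller transcript goes here"

def retrieve_context(query: str) -> list[str]:
    """Placeholder: query your knowledge base / vector store (the RAG step)."""
    return ["relevant policy or FAQ snippet"]

def generate_reply(transcript: str, context: list[str]) -> str:
    """Placeholder: prompt your LLM with the transcript plus retrieved context."""
    return f"Answer grounded in {len(context)} retrieved document(s)."

def synthesize(text: str) -> bytes:
    """Placeholder: call your Text-to-Speech provider (stock or cloned voice)."""
    return text.encode("utf-8")

def handle_caller_turn(audio_chunk: bytes) -> bytes:
    """One turn of the conversation: caller audio in, spoken reply out."""
    transcript = transcribe(audio_chunk)         # Speech-to-Text
    context = retrieve_context(transcript)       # RAG retrieval
    reply = generate_reply(transcript, context)  # LLM response
    return synthesize(reply)                     # Text-to-Speech
```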

That human-like voice is essential for many reasons. It turns out that when people on the phone hear a robotic computer voice straight out of a Felix the Cat cartoon from the 60s, they are more likely to hang up, because no one wants to deal with a computer unless the matter is important enough to stay on the line. That last part is very true: if I really, really need something, then I am going to endure the computer-based interaction and not hang up.

It all comes down to companies (and the people in the next section) wanting to keep engagement (i.e., not hanging up the phone) as high as possible because they get something out of that interaction.

Content Creator to Mimic Myself

For this last use case, not only do we want the voice to be indistinguishable from a human, but we also want that voice to sound EXACTLY like me. This is the use case I was exploring. I want that voice to sound personalized because it will be associated with my brand and, more importantly, with a level of personalization and relatability in my content. That only works if the voice creating the content is genuinely mine.

Why was I interested in this use case? In this age of social media, there has been a huge emphasis on creating more meaningful content. For those who do this for a living, creating content in the form of audio (e.g., podcasts) and especially recorded video (e.g., vlogs, TikToks) is extremely time-consuming. So, wouldn't it be great if there were a way to offload some lower-value voice work to voice cloning? That's the problem I was trying to solve.

If you are looking to tackle this use case, then, based on the Call Center use case above, having your real voice intermixed with an AI clone of your voice that is just slightly off will likely be off-putting. In the worst case, your listeners might just "hang up the phone" on your content. This is why quality, intonation, pauses, and the like will make or break the platforms that offer voice cloning. If it doesn't sound like you, you risk alienating your audience.

Why Voice Cloning Is Important

For Text-to-Speech platforms out there, voice cloning will be a huge deal, but the mainstream is not there yet… This is not because the technology doesn't exist (it does) but because corporations are still the primary users of Text-to-Speech by volume (for now), and they are busy trying to automate jobs away with AI systems.

In my opinion, there is already a bunch of social media content being generated with human-like voices; just spend 5 minutes on TikTok and you will hear plenty of those annoying canned voices. I think once people start to realize the value of automating their own personal brand/content on social media, and once it's accessible enough for creators, you are going to see an explosion of growth on the platforms that provide voice cloning.

Platforms that don't offer voice cloning will need to add it at some point or die. Why? Why pay for two subscriptions, one for a platform that provides human-like voices for the Call Center use case and another for a platform that provides the same pre-canned human-like voices but also allows you to clone your own voice for social media (a clone that could, in turn, be used to create your own set of pre-canned voices)? The answer is: you don't.

Where To Go From Here

In this quest to clone my voice, I tried a bunch of platforms out there and found the one that works best for me, taking things like price and intonation into account. I may write a follow-up blog post about the journey and the process I used to compare and select among the services. If there is interest, I might also share a behind-the-scenes look at what I will be using voice cloning for.

I hope you found this analysis interesting and the breakdown of the various use cases enlightening. Until next time… happy hacking! If you like what you read, check out my other stuff at: https://linktr.ee/davidvonthenen.

2024 RTC Conference Recap: Shining a Spotlight on AI in Healthcare and Voice AI Assistants

The 2024 Real Time Communication Conference at Illinois Tech was an electrifying event, showcasing emerging technologies across Voice, WebRTC, IoT/Edge, and groundbreaking research. But if you ask me, the real magic happens in the conversations between sessions. These impromptu chats with attendees always spark new ideas, collaborations, and insights that you won’t find on any slide deck. It’s a space where cutting-edge tech meets human curiosity and creativity, making for an unforgettable experience.

I had the pleasure of presenting two sessions this year, both deeply focused on AI’s transformative potential. From training machine learning models for medical analysis to mining digital conversations for actionable insights, here’s a recap of the key takeaways from both sessions—and resources to keep the learning going.

Session 1: Machine Learning for Good – Training Models for Medical Analysis

In this keynote, co-presented with Nikki-Rae Alkema, we explored how machine learning is reshaping healthcare, especially in diagnostics. We focused on multi-modal models, which fuse audio, video, and sensor inputs to catch conditions like Parkinson's Disease early. By analyzing subtle cues across different data types, we're not just looking at isolated symptoms but building a more comprehensive picture of patient health.
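
As a purely illustrative aside (this is a toy sketch, not the pipeline from the session), one simple way to think about fusion is "late fusion": reduce each modality to a feature vector, concatenate the vectors, and train a single classifier on the fused representation. The feature dimensions below are made up.

```python
import numpy as np

# Toy late-fusion sketch: each modality has already been reduced to a
# feature vector (by whatever audio/video/sensor models you use), and
# the vectors are concatenated into one input for a downstream classifier.

def fuse_modalities(audio_feats: np.ndarray,
                    video_feats: np.ndarray,
                    sensor_feats: np.ndarray) -> np.ndarray:
    """Concatenate per-modality feature vectors into a single fused vector."""
    return np.concatenate([audio_feats, video_feats, sensor_feats])

# Made-up dimensions: 40 audio features (e.g., prosody), 64 video features
# (e.g., facial or gait cues), 16 sensor features.
fused = fuse_modalities(np.random.rand(40), np.random.rand(64), np.random.rand(16))
print(fused.shape)  # (120,) -> fed to whatever classifier you train next
```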

This session emphasized the human aspect of AI. It’s not about replacing healthcare professionals but augmenting their abilities. Every algorithm, every data point analyzed, translates to real human stories and health outcomes. The goal? To move healthcare from a reactive to a proactive stance, where early detection becomes the norm rather than the exception.

This work underscores the potential for machine learning to empower medical professionals with insights that weren’t possible before, bringing us closer to a future where AI truly enhances human care.

Session 2: Mining Conversations – Building NLP Models to Decode the Digital Chatter

In our increasingly digital world, conversation data is a treasure trove of insights. This session dove into the intricacies of Natural Language Processing (NLP), specifically how to build multiple NLP models to work in concert. Whether it’s Slack messages, Zoom calls, or social media chatter, there’s a wealth of unstructured data waiting to be harnessed.

We walked through collecting raw data from WebRTC applications, then cleaning, tokenizing, and preparing it for machine learning pipelines. This process enables us to extract meaningful insights, classify content, and recognize entities—turning raw digital chatter into a strategic asset.
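
As a flavor of what that preprocessing looks like (a minimal sketch, not the workshop's exact pipeline), the cleaning and tokenizing step can be as simple as stripping noise out of each raw message and splitting it into tokens:

```python
import re

# Minimal clean/tokenize sketch for raw conversation text: strip URLs and
# @mentions, normalize whitespace and case, then split into word tokens.

def clean_text(raw: str) -> str:
    """Remove URLs, @mentions, and extra whitespace from a raw message."""
    text = re.sub(r"https?://\S+", " ", raw)   # drop URLs
    text = re.sub(r"@\w+", " ", text)          # drop @mentions
    text = re.sub(r"\s+", " ", text)           # collapse whitespace
    return text.strip().lower()

def tokenize(text: str) -> list[str]:
    """Very simple word tokenizer; swap in spaCy or NLTK for real work."""
    return re.findall(r"[a-z0-9']+", text)

message = "Hey @dana, did the build pass? Logs: https://ci.example.com/123"
tokens = tokenize(clean_text(message))
print(tokens)  # ['hey', 'did', 'the', 'build', 'pass', 'logs']
```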

Whether you’re analyzing customer service interactions or mining social media for trends, these NLP techniques open doors to more profound, data-driven insights, directly applicable to real-world use cases.

The Magic of In-Between Sessions: Final Thoughts

What makes the RTC Conference truly special is the community. Between presentations, I had fascinating discussions with industry leaders, researchers, and fellow AI enthusiasts. These conversations often linger on the edges of what we’re presenting, pushing ideas further and sparking fresh perspectives. From discussing the ethics of AI in diagnostics to exploring how NLP can evolve to understand more nuanced human emotions, these interactions made for a vibrant and thought-provoking experience.

If you missed the event, the session recordings are now available through the official conference site! Take a look at the slides, code, and more. Here's to embracing AI's potential together. Until next time!

Unlocking Conversations: Hands-On NLP for Real-World Data Mining

Hey there, tech enthusiasts! I’m thrilled to share that I’ll be hosting an exciting workshop at the upcoming Open Data Science Conference (ODSC). Titled “Building Multiple Natural Language Processing Models to Work in Concert Together”, this workshop will give you a practical, hands-on approach to creating and orchestrating NLP models. It’s not just another “hello world” session—this is about tackling real-world data and making it work for you.

Session Info:
Building Multiple Natural Language Processing Models to Work in Concert Together
Date: Oct 30, 2024
Time: 4:35pm

Why NLP and Why Now?

As conversations around the world explode in number, the need to make sense of them has become more critical than ever. Think about it: 1.5 billion messages on Slack every week, 300 million daily virtual meetings on Zoom at peak, and 260 million conversations happening on Facebook every day. The sheer scale of this data is astounding. But more than that, these conversations have transformed social platforms into treasure troves of information, offering insights into emerging trends, new associations, and evolving narratives.


At the workshop, we’ll delve into how to capture, analyze, and gain insights from this data using NLP. Whether you’re looking to spot trends, extract key information, or mine metadata, this session will provide you with the tools and techniques to turn this overwhelming amount of unstructured conversation data into something meaningful.

What You Can Expect

This workshop will be hands-on and highly interactive, featuring three primary components:

  1. Building a Question Classifier: We’ll start with a straightforward model that classifies sentences as questions or non-questions. You’ll see that even seemingly simple tasks can get complex as we deal with language’s natural ambiguity.

  2. Creating a Named Entity Recognition (NER) Model: Next, we’ll move into identifying specific entities within text, such as names, places, and organizations. I’ll show you how to gather, clean, and process data to build a reliable NER model that can extract meaningful information from conversations.

  3. Developing a Voice AI Assistant Demo: We’ll bring it all together by integrating both models into a voice assistant app that uses a RESTful API to process input and return classified and annotated data. This is where you’ll see how these models can work together in a real-world application, adding layers of context and relevance to raw data; a toy sketch of what such a combined service might look like follows below.
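
Purely as an illustration of that "working in concert" idea (and not the actual workshop code), here is a toy Flask service where the trained question classifier and NER model are replaced with trivial stand-ins; the /analyze endpoint name is made up for this sketch.

```python
from flask import Flask, request, jsonify

# Toy sketch: two "models" behind one REST endpoint. The stand-in
# functions below take the place of the trained classifier and NER model.

app = Flask(__name__)

def classify_question(text: str) -> bool:
    """Stand-in for the trained question classifier."""
    return text.strip().endswith("?")

def extract_entities(text: str) -> list[dict]:
    """Stand-in for the trained NER model: flag capitalized tokens."""
    return [{"text": tok, "label": "UNKNOWN"}
            for tok in text.split() if tok[:1].isupper()]

@app.route("/analyze", methods=["POST"])
def analyze():
    text = request.get_json(force=True).get("text", "")
    return jsonify({
        "is_question": classify_question(text),
        "entities": extract_entities(text),
    })

if __name__ == "__main__":
    app.run(port=5000)
```

A client would POST JSON such as {"text": "Is Acme hiring in Chicago?"} and receive both the question flag and the extracted entities in a single response, which is the same request/response shape the voice assistant demo builds on.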

Why Attend?

There are plenty of reasons to be excited about this workshop, but here are a few highlights:

  • Hands-on Learning: We’ll be coding live! For those who are less technical and/or don’t have the laptop prerequisites set up, I’ll be using Jupyter notebooks in Google Colab, so everyone can follow along.

  • Real-World Applications: While many workshops focus on isolated NLP models, we’ll be tackling multiple models and showing how they can be combined for enhanced functionality. It’s a rare opportunity to see how these technologies can be applied in real-world scenarios.

  • Open Resources: I’ll provide code, data resources, and examples that you can take with you, adapt, and use on your projects. This workshop isn’t just about learning theory—it’s about equipping you with tools you can use.

See You at ODSC!

I’m incredibly excited to share this workshop with you all and to dive into the nitty-gritty of NLP. Whether you’re an experienced data scientist, an NLP enthusiast, or just curious about how these systems work, there will be something for you. Plus, you’ll walk away with new skills and practical examples that can help you build better models and unlock new insights from conversation data.

So, if you’re planning to attend ODSC, be sure to check out this session. You won’t want to miss it!

Workshop Info:
Building Multiple Natural Language Processing Models to Work in Concert Together
Date: Oct 30, 2024
Time: 4:35pm