Over the holiday break, I started experimenting with cloning my voice for reasons I will get to later in this blog post. As I walked down the list of voice cloning providers out there and began to weigh each one's cost against its benefits, a set of requirements and must-have capabilities emerged.
In this blog post, we will cover what those required features are, why they are essential for my scenario, why I believe they carry over to the general use case, and, ultimately, what it means for text-to-speech providers moving forward.
First Some Background
I have been in the Natural Language Processing (NLP) space for over 3 years. In that time, as most people do, I started by looking to obtain accurate transcription from speech and then moved into trying to digest conversations to create "computer-generated" interactions. Large Language Models (LLMs) dramatically accelerated the accessibility of that second step and, quite frankly, made it possible to do in a meaningful way without a lot of effort.
After comprehension, most individuals move into increasing the level of interaction by interfacing with these systems through another of humanity's amazing tools: hearing. As humans, we don't want to talk into a device and then have to read its output. I mean, heck, most people find subtitled movies beyond annoying if those subtitles drag on for anything more than a few minutes. Here, we start to see the need for text-to-speech, but what kind of voice should we use?
How I Tried Automating Myself
That voice depends on the use case. More to the point, it depends on how familiar you are with the "thing" you are interacting with. I use "thing" as a catch-all, but in reality, it's some device you are conversing with. Depending on what that device is and what our connection with it is, the voice used makes all the difference in the world to the experience of that interaction.
Let’s consider these scenarios:
Siri, Alexa, or Google
These devices are simple. You say a command, and Siri, Alexa, or Google (hopefully) give you a meaningful answer. You don’t place much weight on what kind of voice it replies with. Sure, it’s cute if it replies in an accent or if it can reply in Snoop Dogg’s voice, but in the end, it doesn’t really matter all that much for that interaction.
Call Center, Tech Support, etc.
The next wave of voice interactions is replacing humans with voice automation systems. This is where most companies are today in this evolution. There are a ton of companies trying to do this for a variety of reasons, usually led by decreasing labor costs.
The most common use case is replacing customer support staff with these automated systems. Today, this usually entails using Speech-to-Text to transcribe what someone on the phone is saying, passing that transcript to a Large Language Model (LLM), or, more precisely, a Retrieval-Augmented Generation (RAG) system for better context, and then running the output through Text-to-Speech to generate a human-like voice reply for the listener on the other end of the phone.
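The pipeline above can be sketched in a few lines. This is a minimal illustration, not a real implementation: the `speech_to_text`, `rag_answer`, and `text_to_speech` functions here are hypothetical stand-ins for calls to actual STT, LLM/RAG, and TTS provider APIs.

```python
def speech_to_text(audio: bytes) -> str:
    """Placeholder STT step: transcribe caller audio into text.
    A real system would call a speech recognition API here."""
    return "what is my account balance"


def rag_answer(transcript: str, knowledge_base: dict) -> str:
    """Placeholder RAG step: retrieve relevant context for the
    transcript, then (in a real system) hand it to an LLM."""
    context = next(
        (v for k, v in knowledge_base.items() if k in transcript), None
    )
    return context if context else "Sorry, I can't help with that."


def text_to_speech(text: str) -> bytes:
    """Placeholder TTS step: synthesize a human-like voice reply.
    Encoded text stands in for actual audio bytes."""
    return text.encode("utf-8")


def handle_call(audio: bytes, knowledge_base: dict) -> bytes:
    """Chain the three stages: STT -> RAG/LLM -> TTS."""
    transcript = speech_to_text(audio)
    reply_text = rag_answer(transcript, knowledge_base)
    return text_to_speech(reply_text)
```

The key design point is that each stage is swappable: you can change the TTS provider (say, to one that supports voice cloning) without touching transcription or retrieval.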
That human-like voice is essential for many reasons. It turns out that when people on the phone hear a computer voice straight out of a Felix the Cat cartoon from the 60s, they are more likely to hang up, because no one wants to deal with a computer unless staying on the line is important enough. That last statement is very true: if I really, really need something, I will endure the computer-based interaction and not hang up.
It all comes down to companies (and the people in the next section) wanting to keep engagement (i.e., not hanging up the phone) as high as possible because they get something out of that interaction.
Content Creator to Mimic Myself
For this last use case, not only do we want the voice to be indistinguishable from a human, but we also want that voice to sound EXACTLY like me. This is the use case I was exploring. I want that voice to sound personalized because it will be associated with my brand and, more importantly, with a level of personalization and relatability in my content. That only works if the content is delivered in a voice that is unmistakably mine.
Why was I interested in this use case? In this age of social media, there has been a huge emphasis on creating more meaningful content. For those who do this for a living, creating content in the form of audio (i.e., Podcasts, etc.) and especially recorded video (i.e., Vlogs, TikToks, etc.) is extremely time-consuming. So, wouldn't it be great if there were a way to offload some lower-value voice work to voice cloning? That's the problem I was trying to solve.
If you are looking to tackle this use case, then, based on the Call Center lessons above, having your real voice intermixed with an AI clone that is just slightly off will likely be off-putting. In the worst case, your listeners might just "hang up the phone" on your content. This is why the quality, intonation, pauses, etc., in voice cloning will make or break the platforms that offer it. If it doesn't sound like you, you risk alienating your audience.
Why Voice Cloning Is Important
For Text-to-Speech platforms out there, voice cloning will be a huge deal, but the mainstream is not there yet. This is not because the technology doesn't exist (it does), but because corporations are still, by volume, the primary users of Text-to-Speech (for now), and they are busy replacing human jobs with automated AI systems.
In my opinion, there is already a bunch of social media content being generated with human-like voices; case in point, the annoying voice in the video below. Just spend 5 minutes on TikTok. Once people realize the value of automating their own personal brand and content on social media, and once the tooling is accessible enough for creators, you are going to see an explosion of growth on the platforms that provide voice cloning.
Those platforms that don't offer voice cloning will need to add it at some point or die. Why? Why pay one subscription for a platform that provides human-like voices for the Call Center use case, and a second subscription for a platform that provides the same pre-canned human-like voices but also lets you clone your own voice for social media (clones that could, in turn, be used to create your own set of pre-canned voices)? The answer is: you don't.
Where To Go From Here
In this quest to clone my voice, I tried a bunch of platforms out there, and I found one that works best for me, taking things like price and intonation into account. I may write a follow-up blog post about the journey and the process I used to compare and select among the services. If people are interested, a behind-the-scenes look at what I will use voice cloning for might also be worth sharing.
Until then, I hope you found this analysis interesting and the breakdown for the various use cases enlightening. Until the next time… happy hacking! If you like what you read, check out my other stuff at: https://linktr.ee/davidvonthenen.