In this age of AI, synthetic brand voices have experienced a remarkable rise. They promise consistent auditory signatures, accessibility, and inclusivity. Branded voices are becoming increasingly indistinguishable from human voices, revolutionizing how we engage with human-like technology and shaping the future of communication. All the more surprising, and reassuring, that human voices still play a significant role in the development of these synthetic brand voices.
As the head of design at whoozy.agency, Phoebe Ohayon is immersed in the development of synthetic voices. But for Phoebe, the secret to creating better synthetic brand voices is not all about tech; it’s about infusing them with the soul of human expression. In this conversation, we delve into the intricacies of maintaining authenticity and evoking genuine engagement for your brand, while exploring the possibilities of synthetic voice creation.
Brandingmag: Let’s dive right in and start with the question: “What does your brand sound like?” How do you answer this question when working with brands?
Phoebe Ohayon: Voice branding is pretty much like regular branding: it heavily relies on how people perceive it. Just because a brand claims to be all sporty and active doesn’t automatically mean people perceive it as such. It’s a two-way street, really. The way the brand perceives itself and the way the audience perceives it are closely intertwined. That’s why we put a lot of effort into understanding how users and consumers truly see and feel the brand. Once we have that clear, we can dig deeper into the brand voice.
It’s important to become familiar with why a brand has chosen specific words or terminologies to describe itself, how its perception has fluctuated over time, and the team dynamics that influenced the development of its branding guidelines. Additionally, I examine the brand’s current trajectory and future growth plans.
Some brands have reputation-related concerns they might want to address. In such cases, we assess whether these concerns still exist among their audience. We take into account any areas the brand may not be proud of and shape the voice accordingly, making it even friendlier or more caring.
Bm: I’m pretty sure gender is a big discussion for brands: do we pick a female, male, or non-binary voice?
PO: Oh, absolutely! And it’s not just a one-sided conversation here, either. Usually, the branding team has had their own internal discussions before they come to us. Sometimes brands already have a voice or a character and say, “Well, we had this character for five years, but now we’re not so sure about it anymore. Should we make it more male or more female?”
The brand audience might actually prefer a gender-neutral voice or maybe even different voices to choose from. They might want a kids’ version or an elderly version because the brand has a wide audience. It’s all about considering these decisions and incorporating them into the strategy. It not only shapes the way the voice is perceived but also affects how the brand as a whole is perceived.
Bm: What is a branded, synthetic voice mostly used for?
PO: Right now, the companies that are diving into conversational technology are the ones driving the development. They’ve invested in these technologies and are exploring various use cases for voice branding. Think of Interactive Voice Response (IVR) systems, the technology used to automate interactions with callers. You know, the systems you encounter when you call customer support or service hotlines. In voice branding, they play a crucial role in shaping the overall experience of a brand’s customer service. A chatbot on a website or online shop can also have a voice, to make customer service feel less robotic. And there are the voice interfaces in car navigation systems and voice bots in the healthcare industry.
In the future, we’ll see a shift towards spoken content with fluid voices or personas, with multiple voices that can be used for different content.
We’re already seeing publishers like the Dutch newspaper NRC giving a voice to their content.
Bm: Can you tell us more about that case?
PO: NRC was keen to make its content more accessible, for example to visually impaired users. But that’s not all: with voice, NRC responds to shifting consumer preferences, since audio offers convenience and multitasking capabilities. Offering all articles in audio makes sense, but a high-quality newspaper is unlikely to settle for robotic, standard voices. For human voices, it would be an almost impossible, and very expensive, task to read the 70 to 80 pieces that the NRC newspaper publishes daily. A voice model is then a cost-effective approach to making the newspaper more accessible to readers.
The synthetic voices we developed for them are copies of the voices of Mischa Spel, their Deputy Culture Editor and music critic, and Egbert Kalse, their Economics Editor and podcast presenter, who were chosen as favorites by the readers. Mischa and Egbert read a total of 4,000 sentences in the studio, allowing the AI program to learn to imitate their voices. We have a voice coach who ensures that the pitch, volume, and breathing are right, and a sound engineer who checks that the audio signal is properly captured, without any clicks, hoarseness, or lisping, as these can confuse the program. From the recordings for NRC, four hours of “net data” were extracted per voice to train the program. From there, the model learns from all the data points in the audio files, along with the written text, how to pronounce things: the sound of certain phrases, how punctuation affects the delivery, and it tries to replicate this as accurately as possible in other texts.
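To make that training step a little more concrete for technically minded readers: models like this are typically trained on paired audio recordings and transcripts, trimmed down to the usable “net data”. Below is a minimal, hypothetical Python sketch of how such pairs might be gathered into a training manifest; the folder names, CSV layout, and the `clean_enough` check are illustrative assumptions, not NRC’s or whoozy.agency’s actual pipeline.

```python
import csv
import wave
from pathlib import Path

RECORDINGS = Path("recordings")   # hypothetical folder of studio takes, e.g. 0001.wav
SCRIPTS = Path("scripts.csv")     # hypothetical CSV with rows: utterance_id,transcript
MANIFEST = Path("train_manifest.csv")

def duration_seconds(wav_path: Path) -> float:
    """Length of a WAV file, so we can total up the usable 'net data'."""
    with wave.open(str(wav_path), "rb") as w:
        return w.getnframes() / w.getframerate()

def clean_enough(wav_path: Path) -> bool:
    """Placeholder for the sound engineer's checks (clicks, hoarseness, clipping)."""
    return True  # in practice a signal-quality or manual review step

total_seconds = 0.0
rows = []
with open(SCRIPTS, newline="", encoding="utf-8") as f:
    for utt_id, transcript in csv.reader(f):
        wav = RECORDINGS / f"{utt_id}.wav"
        if not wav.exists() or not clean_enough(wav):
            continue  # skip takes that would confuse the model
        total_seconds += duration_seconds(wav)
        rows.append((str(wav), transcript))

with open(MANIFEST, "w", newline="", encoding="utf-8") as f:
    csv.writer(f).writerows(rows)

print(f"{len(rows)} utterances, {total_seconds / 3600:.1f} h of net data for voice training")
```

A manifest like this is what a speech-synthesis model would then be trained or fine-tuned on, learning how the written text maps onto the recorded sound.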
Bm: A common issue with conversational AI is that it can reach a point where it doesn’t know how to respond and needs to transfer the conversation to a human. If the AI voice feels human, it can feel strange for the listener to be passed on to a human. So, how do you handle this transition from the AI to a real human without causing any confusion or discomfort?
PO: It’s a really important question. When it comes to designing the conversation, we usually leave that up to our clients. However, we always stress the importance of one key principle: let the user know upfront that they’re talking to a bot or AI. It’s crucial–and ethical–to be transparent about it. If you don’t make it clear that it’s an AI, users can get frustrated or even mad: “Wait a minute, I thought I was talking to a real person!”
Bm: Let’s say my father, who isn’t familiar with synthetic voices, is on the phone with an elderly-sounding voice. He heard the “I am a robot” disclaimer at the beginning, but along the way he forgets he’s talking to a bot. Or take children: they might not grasp the implications of a talking bot. How can brands inform or educate their audience about the nature of synthetic voices?
PO: It’s a topic that sparks significant debate and requires extensive thought. Different methodologies are being explored, and there isn’t one superior approach for brands to tackle this topic. There are some key takeaways, though. One is that if you give too much freedom in bot conversations, people start testing the limits. So, it’s important to design the conversation with some limitations in place. If a child says something inappropriate to a talking toy, for example, the toy should respond by expressing disapproval, just like a real human would. Otherwise, it creates a false notion that such behavior is acceptable. Moreover, some of these patterns might seep into their real-life relationships. I don’t claim to be the ultimate authority on ethical design, but it’s important to thoroughly analyze each use case, as there are many potential pitfalls to address.
Bm: We can design AIs like humans, but in the end ChatGPT or Midjourney will do the task as ordered regardless of whether I say please and thank you. Don’t you think that if we’re not careful, constant interaction with algorithms and robots can lead to people behaving like them?
PO: When it comes to natural speech, we tend to adapt and mimic different language styles and accents depending on who we’re talking to. It’s a chameleon effect; we do it daily without even thinking about it. When we apply those patterns consistently, it can lead to changes in our behavior over time. The same can happen with algorithms.
We might start thinking and speaking more like robots as we interact with them, trying to optimize our communication to get the best results.
We see this with tools like ChatGPT, where users experiment with different prompts to achieve desired outcomes. We have different ways of communicating with different people or personas in our lives. I have different ways of speaking when I talk to my dad, my mom, my boyfriend, or my cat. But speaking is such a fundamental human trait, one we’ve been honing for ages, that it might be harder to alter its patterns fundamentally. So, I’m not sure if we’ll see significant changes in speaking patterns.
Bm: Let’s say I’m someone who tends to be confrontational on Twitter. If my tweets and comments were in voice, do you think it would make me more considerate in my responses?
PO: If it was your own voice, yes. If you picked somebody else’s voice, it’s a very different matter. People can have good and bad intentions when they use a voice that is not their own. I know some girls who alter their own voice while gaming just to avoid harassment.
Bm: Over to the technology. How does it build a synthetic, human-like brand voice?
PO: There are various technologies to create a synthetic voice. I like to use models based on deep neural networks. By extracting voice features and learning from the data, models learn how to reproduce speech patterns, intonation, and voice qualities. Depending on what you want to do with the voice model, there are different factors to consider. For example, some voice models work well in offline scenarios such as cars, drive-thrus, planes, or remote areas with no internet. There are also hybrid models that combine speech synthesis with audio samples, like in navigation systems. These models ensure prompts are played at the right time and don’t rely solely on real-time synthesis. The choice of model and the amount of speech data needed really depends on the specific use case and implementation requirements.
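As a rough illustration of that hybrid idea, the Python sketch below plays back pre-recorded studio prompts where they exist and falls back to synthesis only for text that has to be generated on the fly, such as a street name. The file layout and the `synthesize()` stand-in are assumptions made for the sake of the example, not any particular product’s API.

```python
from pathlib import Path
from typing import Optional

# Hypothetical library of studio-recorded prompts, e.g. for a navigation system.
PRERECORDED = {
    "turn_left": Path("prompts/turn_left.wav"),
    "turn_right": Path("prompts/turn_right.wav"),
    "arrived": Path("prompts/arrived.wav"),
}

def synthesize(text: str) -> bytes:
    """Stand-in for an on-device speech-synthesis engine (assumed to work offline)."""
    raise NotImplementedError("plug in the actual text-to-speech model here")

def get_audio(prompt_id: str, dynamic_text: Optional[str] = None) -> bytes:
    """Hybrid playback: prefer a recorded sample, synthesize only what must change."""
    sample = PRERECORDED.get(prompt_id)
    if sample is not None and sample.exists() and dynamic_text is None:
        return sample.read_bytes()   # fixed prompt: play the recording as-is
    text = dynamic_text or prompt_id.replace("_", " ")
    return synthesize(text)          # dynamic content: fall back to real-time synthesis

# Example usage (hypothetical):
#   get_audio("turn_left")                                            # recorded sample
#   get_audio("street", dynamic_text="Turn left onto Keizersgracht")  # synthesized
```

The design choice mirrors what Phoebe describes: fixed prompts keep studio quality and predictable timing, while synthesis covers the long tail of content that can never be recorded in advance.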
Our main goal is to determine how the synthetic voice should sound, how it should speak to convey emotions, expressions, and align with a specific brand image. We go through an intense research process to find the right (human) voice talent with the desired speech qualities. We then use their recorded speech data to train our speech synthesis model to sound just like the original voice talent.
Bm: So human voice actors are still needed?
PO: There’s no doubt in my mind that we need human voice actors; it’s the way of working that has changed. It’s like a factory worker’s job that is now partially automated: voice actors aren’t out of work, they just have a different way of working and monetizing their talent.
A potential future scenario: voice actors control their own voice model and license it to the agencies or companies they have a deal with, receiving royalties based on how much their voice model has been used over the past months or years.
For example, in the gaming and animation industry, where a wide range of emotions and expressions is needed, they use a mix of voice models and human voice actors. In the development phase, voice models are used to produce placeholder prompts; in the finalizing phase, the voice actor overdubs most of the lines.
Bm: Voice actors could put their voice on the blockchain to secure the licensing rights.
PO: Blockchain is really a big opportunity to control who owns your voice data, where the data is used, and how your speech data or voice model is licensed to a certain company. I think it’s a really great way to track who’s using what, and where the royalties go.
Bm: Recently, I was listening to the song “All I Need” by the band Air. The singer’s voice is so authentic that I can hear a delicate accent and even hear the subtle sound of her saliva when she sings. It’s those little human imperfections that make me feel a deep connection to the song.
PO: In recordings, we can still easily pick up on those nuances and tell if it’s a human voice or not. People with stutters or lisps, for example, are not commonly found in synthetic voices, and therefore we immediately recognize them as genuinely human. So far, brands prefer a more neutral accent unless they’re targeting a specific market. So yes, there is still exciting territory to explore with branded voices that have accents!
Cover image source: deagreez