Google upgrades its text-to-speech tech for natural-sounding responses

28 Mar, 2018

In an effort to create more natural-sounding responses from artificial intelligence-driven virtual assistants, Internet giant Google has introduced a new AI-driven voice synthesiser that runs on its cloud-based tensor processing unit (TPU) infrastructure, the company said in a blog post.

The new service, called Cloud Text-to-Speech, can be used for various purposes such as powering voice response systems for call centers (IVRs) and enabling real-time natural language conversations; allowing Internet of Things devices, such as TVs, cars, and robots, to talk back to users; and converting text-based media like news articles and books into spoken format such as a podcast or an audiobook.

The AI-driven synthesiser uses WaveNet technology developed by UK-based AI firm DeepMind. Google acquired the startup, founded by neuroscientist and former chess prodigy Demis Hassabis, Shane Legg, and Mustafa Suleyman, for over $500 million in 2014. Skype developer Jaan Tallinn was among the startup's investors.

WaveNet works very differently from the voice synthesisers used in most AI assistants today. Most virtual assistants (Siri and others) rely on concatenative synthesis, a process in which a dedicated programme stores recorded speech fragments such as syllables and later stitches them together to form words and sentences. This process was initially slow to convert text to sound, but its conversion speed has improved greatly over the years.
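
As a rough illustration of the concatenative idea described above, the following Python sketch looks up stored fragments and joins them end to end. It is a toy under stated assumptions, not how Siri or any real assistant is built: the unit names, the sample values and the `synthesise` helper are all made up for the example.

```python
# Toy illustration of concatenative synthesis: stored units are looked up
# and joined end to end. Unit names and sample values are invented.
from typing import Dict, List

# Hypothetical unit inventory: each fragment maps to a short list of samples.
UNIT_STORE: Dict[str, List[int]] = {
    "hel": [0, 3, 5, 3],
    "lo": [2, 4, 2, 0],
    "world": [1, 6, 1],
}


def synthesise(units: List[str], gap: int = 1) -> List[int]:
    """Concatenate stored units, with `gap` silent samples between them."""
    audio: List[int] = []
    for i, unit in enumerate(units):
        if unit not in UNIT_STORE:
            raise KeyError(f"no recording stored for unit {unit!r}")
        if i > 0:
            audio.extend([0] * gap)  # short silence between units
        audio.extend(UNIT_STORE[unit])
    return audio


print(synthesise(["hel", "lo"]))
```

The sketch can only produce speech from fragments that were recorded in advance, which is the limitation that generating waveforms directly is meant to remove.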

In contrast, DeepMind's WaveNet generates audio from scratch with the help of machine learning instead of stitching together stored recordings. A neural network trained on a large database of human speech produces the raw waveform directly, at 24,000 samples per second. This helps it add more natural touches to speech, such as better-matched accents and reproduced lip smacks. According to Google, the technology narrows the gap between human speech and AI assistant output by 50%.

"The new, improved WaveNet model generates raw waveforms 1,000 times faster than the original model, and can generate one second of speech in just 50 milliseconds. In fact, the model is not just quicker, but also higher-fidelity, capable of creating waveforms with 24,000 samples a second. We’ve also increased the resolution of each sample from 8 bits to 16 bits, producing higher quality audio for a more human sound," Dan Aharon, product manager of Cloud AI at Google, wrote in the blog post.

DeepMind first introduced WaveNet in 2016 as a neural network trained on a large volume of speech samples that was able to create raw audio waveforms from scratch. The technology was built into Google Assistant in October 2017, when it could be used for only two languages: English and Japanese.
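
The figures in Aharon's quote above can be checked with a little arithmetic. The sketch below takes the 50 milliseconds, 24,000 samples and 8-bit and 16-bit values from the quote; the assumption of a single (mono) channel and the helper names are ours, not Google's.

```python
# Back-of-the-envelope check of the figures quoted from the blog post.
SAMPLE_RATE_HZ = 24_000        # samples per second, from the quote
GEN_MS_PER_AUDIO_SECOND = 50   # time to generate one second of speech


def real_time_factor(gen_ms_per_audio_second: float) -> float:
    """How many seconds of audio are produced per second of compute."""
    return 1000.0 / gen_ms_per_audio_second


def raw_bytes_per_second(sample_rate_hz: int, bits: int, channels: int = 1) -> int:
    """Uncompressed data rate; a mono channel is an assumption."""
    return sample_rate_hz * bits * channels // 8


def amplitude_levels(bits: int) -> int:
    """Number of distinct values a sample of this bit depth can take."""
    return 2 ** bits


print(real_time_factor(GEN_MS_PER_AUDIO_SECOND))   # 20.0
print(raw_bytes_per_second(SAMPLE_RATE_HZ, 16))    # 48000
print(amplitude_levels(8), amplitude_levels(16))   # 256 65536
```

In other words, the quoted speed works out to about 20 times faster than real time, and moving from 8 to 16 bits raises the number of possible amplitude values per sample from 256 to 65,536.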

Aharon also said that the new service will allow users to choose from 32 different voices from 12 languages and variants. "Cloud Text-to-Speech correctly pronounces complex text such as names, dates, times and addresses for authentic sounding speech right out of the gate. It also allows you to customise pitch, speaking rate, and volume gain, and supports a variety of audio formats, including MP3 and WAV," he wrote.
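
To give a sense of how the options Aharon mentions might be expressed, here is a minimal Python sketch that builds the JSON body for a synthesis request without sending it. The field names follow our understanding of the service's public REST interface; the voice name, the `LINEAR16` encoding as the WAV-style option, the numeric limits and the `build_synthesis_request` helper are assumptions for illustration and should be checked against Google's current reference.

```python
# Builds (but does not send) a request body for a text-to-speech call.
# Field names and limits are assumptions based on the public REST docs.
import json
from typing import Any, Dict, Tuple

# Assumed limits; verify against the current API reference.
SPEAKING_RATE_RANGE = (0.25, 4.0)
PITCH_RANGE = (-20.0, 20.0)            # semitones
VOLUME_GAIN_DB_RANGE = (-96.0, 16.0)   # decibels


def _check(name: str, value: float, bounds: Tuple[float, float]) -> float:
    low, high = bounds
    if not low <= value <= high:
        raise ValueError(f"{name}={value} is outside the assumed range {low}..{high}")
    return value


def build_synthesis_request(
    text: str,
    language_code: str = "en-US",
    voice_name: str = "en-US-Wavenet-A",  # assumed voice name
    audio_encoding: str = "MP3",          # "LINEAR16" assumed for WAV-style output
    pitch: float = 0.0,
    speaking_rate: float = 1.0,
    volume_gain_db: float = 0.0,
) -> Dict[str, Any]:
    """Return a dict shaped like the assumed request body."""
    return {
        "input": {"text": text},
        "voice": {"languageCode": language_code, "name": voice_name},
        "audioConfig": {
            "audioEncoding": audio_encoding,
            "pitch": _check("pitch", pitch, PITCH_RANGE),
            "speakingRate": _check("speaking_rate", speaking_rate, SPEAKING_RATE_RANGE),
            "volumeGainDb": _check("volume_gain_db", volume_gain_db, VOLUME_GAIN_DB_RANGE),
        },
    }


request = build_synthesis_request("Your order ships on 28 March.", speaking_rate=0.9)
print(json.dumps(request, indent=2))
```

Validating the values locally is a design choice for the sketch: it surfaces an out-of-range pitch or speaking rate before any network call is made.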

Google has said that Cisco's cognitive collaboration unit and Dolphin One, a telephony platform, are already using the new service.

Last month, the Internet giant started offering its new AI-tailored chips on its cloud platform to other companies for advanced testing as part of its effort to accelerate machine learning models and get them running faster.