Content creators and voice actors in today's digital age have their work cut out for them, with intelligent software mimicking their writing, art, voices, and even their emotions. OpenAI's DALL-E can generate realistic art and images from plain-text prompts, and ChatGPT can write poems, articles, books and even code. Now there is one more artificial intelligence (AI)-powered tool, one that can speak and emote like us, in most cases without listeners being able to spot the difference.
Microsoft published a paper early this month (https://arxiv.org/abs/2301.02111) about its new text-to-speech AI model, VALL-E, which can simulate a person's voice from just a 3-second recording. Initial results show that VALL-E can also preserve the speaker's emotional tone. The paper describes VALL-E as "a new language model approach for text-to-speech synthesis (TTS) that uses audio codec codes as intermediate representations".
According to the paper's authors, VALL-E was pre-trained on 60,000 hours of English speech data, which the paper claims is "hundreds of times larger than existing systems".
But what's new about this technology, you may ask? And with good reason. Text-to-speech (TTS) systems have been around for a while. Free TTS tools include Natural Reader; WordTalk; ReadLoud; Listen, which uses Google's TTS application programming interface (API) to convert short snippets of text into natural-sounding synthetic speech; Free TTS (again from Google); Watson Text to Speech, a tool from IBM that supports a variety of voices in different languages and dialects; and Neosapience, which allows users to write out the emotion they want virtual actors to use when speaking.
That said, for commercial applications, TTS tools typically require high-quality, studio-recorded, annotated audio from different speakers with different styles and emotions. The models also typically need at least 30 minutes of such data, far more than the 3-second recording VALL-E works from.
Read the full story on Mint.