Meta has developed a generative artificial intelligence model for text-to-speech
Meta has unveiled a generative text-to-speech model called Voicebox. According to the developers, the algorithm will do for speech what ChatGPT and DALL-E did for text and images.
Here's What We Know
Like generative systems for text and images, Voicebox can create output from scratch, convert styles, and modify a provided sample. The system was trained on more than 50,000 hours of recorded speech and transcripts from public domain audiobooks in English, French, Spanish, German, Polish and Portuguese.
As a result, Voicebox is able to edit clips, remove noise and replace mispronounced words.
"A person could identify which raw segment of the speech is corrupted by noise (like a dog barking), crop it, and instruct the model to regenerate that segment," the developers explain.
Voicebox can also synthesize speech in the style of an audio sample as short as two seconds, transfer a speaker's style across languages, and generate varied speech samples for synthetic datasets.
When We Can Expect It
Meta has not released the model's source code. The developers cited "the potential risks of misuse", despite "many exciting use cases for generative speech models".
Source: Meta