Microsoft’s VALL-E Generates Speech From Just 3 Seconds of Audio
But it could lead to a proliferation of deepfake voices
Microsoft has unveiled VALL-E, an AI model that can generate speech in a speaker’s voice from an audio sample just three seconds long.
VALL-E performs text-to-speech synthesis (TTS) from very little prior data and could be used for tasks such as speech editing and content creation when combined with other generative AI models like GPT-3.
Trained on 60,000 hours of English language speech from Meta’s LibriLight audio library, VALL-E mimics how a target speaker would sound when speaking a given text input. It can also preserve the emotion of the speaker in the sample audio.
VALL-E can be demoed via GitHub. According to the Microsoft researchers behind it, the model “significantly outperforms” other zero-shot TTS systems in terms of speech naturalness and speaker similarity.
One possible use for VALL-E could be to narrate audiobooks. Just last week, Apple published a series of audiobooks narrated by an AI voice via its Books app.
For Microsoft, VALL-E represents its latest foray into generative AI. The tech giant is already exploring ways to incorporate OpenAI’s ChatGPT into its Bing search engine and Office line of products.
VALL-E: How does it work?
Microsoft describes VALL-E as a neural codec language model. The model was trained on discrete codes derived from the LibriLight library.
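To give a rough sense of what “discrete codes” means here, the toy sketch below maps continuous audio-like frames to integer tokens by picking the nearest entry in a small codebook. It is an illustration only: the codebook, the frames, and the function names are hypothetical, and it is not the codec VALL-E actually uses.

```python
# Toy illustration only: hypothetical codebook and frames, not VALL-E's real codec.

def nearest_code(frame, codebook):
    """Return the index of the codebook vector closest to `frame`
    (squared Euclidean distance)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: sq_dist(frame, codebook[i]))

def encode(frames, codebook):
    """Map each continuous frame to a discrete code (an integer token)."""
    return [nearest_code(frame, codebook) for frame in frames]

def decode(codes, codebook):
    """Map each discrete code back to its codebook vector (a lossy reconstruction)."""
    return [codebook[code] for code in codes]

# Hypothetical 2-dimensional codebook with four entries.
CODEBOOK = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
# Hypothetical "audio frames": noisy points near the codebook entries.
FRAMES = [(0.1, -0.2), (0.9, 0.1), (0.2, 0.8), (1.1, 0.9)]

codes = encode(FRAMES, CODEBOOK)
print(codes)  # [0, 1, 2, 3]
```

A language model can then be trained on sequences of such integer tokens in much the same way a text model is trained on word tokens, which is the general idea behind the “codec language model” label.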
For pre-training, the team scaled up VALL-E’s training data to make it “hundreds of times larger than existing (TTS) systems” like CereProc’s CereVoice or ReadSpeaker, according to the research team behind the model.
“While advanced TTS systems can synthesize high-quality speech from single or multiple speakers, it still requires high-quality clean data from the recording studio. Large-scale data crawled from the Internet cannot meet the requirement, and always lead to performance degradation,” according to the paper’s authors.
“Because the training data is relatively small, current TTS systems still suffer from poor generalization. Speaker similarity and speech naturalness decline dramatically for unseen speakers in the zero-shot scenario.”
The Microsoft researchers argue that although VALL-E’s training data contains “more noisy speech” with “inaccurate transcriptions,” it also covers more “diverse speakers and prosodies,” which makes the approach more robust.
The paper outlining VALL-E can be accessed via arXiv. However, no code or repository was released. One likely reason is that VALL-E can synthesize speech that preserves a speaker’s identity. The paper acknowledges this, saying the model “may carry potential risks in misuse, such as spoofing voice identification or impersonating.”
Such misuse has precedent: in January 2020, scammers used an AI speech tool to steal $35 million from an unwitting UAE bank.
To mitigate such risks, the researchers suggest that a detection model could be built to determine whether an audio clip was synthesized by VALL-E.