Meta Shows Off Revolutionary Audio AI − But Won’t Release It

The model can generate audio in six languages from just a 2-second sample

June 22, 2023

3 Min Read

Illustration of face profiles with chat bubbles in different colors

Kubkoo/Getty Images

At a Glance

Meta has unveiled Voicebox, a generative AI model that can create audio from existing clips just two seconds long.
CEO Mark Zuckerberg calls it the "most versatile" speech generative model but won’t release it over fears of misuse.
Earlier this month, U.S. lawmakers asked Meta how it assessed the risk of releasing powerful AI models to the public.

AI researchers at Facebook parent Meta have unveiled Voicebox, an AI model that can generate speech in different styles using an audio sample that’s just two seconds long.

Using a new approach, Voicebox learns from both raw audio and an accompanying transcription. Prior generative AI speech models required specific training for each task using carefully prepared training data.

Voicebox can match the style for text-to-speech generation and can also be used to edit audio – such as removing background noises of a dog barking or distant car horns. The editing process works by recreating a portion of the speech that's interrupted by noise without having to re-record an entire speech.

Image: Meta

The model is also multilingual and can produce speech in six languages – English, French, German, Spanish, Polish and Portuguese. Voicebox can produce a version of the text in different languages, even when the sample speech and the text are in different languages.

Meta CEO Mark Zuckerberg demoed the model in a video on the company's research blog, saying his team believes Voicebox is “the most versatile speech generative model out there.”

“This is still a research project, but I think we’re going to build a lot of interesting things with tools like this,” said Zuckerberg.

Stay updated. Subscribe to the AI Business newsletter

Voicebox is designed to be multipurpose – or used for a variety of tasks – something Meta has been pushing for its AI models so that they can be applied to a variety of use cases and applications.

Meta said that in the future, multipurpose generative models like Voicebox could “give natural-sounding voices to virtual assistants and non-player-characters in the metaverse.”

“They could allow visually impaired people to hear written messages from friends read by AI in their voices, give creators new tools to easily create and edit audio tracks for videos, and much more,” the research team behind it said.

Not making it public

Meta is not making the model public, however, over fears of potential misuse. “While we believe it is important to be open with the AI community and to share our research to advance the state of the art in AI, it’s also necessary to strike the right balance between openness with responsibility,” Meta said.

Instead of publishing codes or the model itself, Meta has shared audio samples and a research paper detailing the approach and results.

Earlier this month, U.S. lawmakers sent a letter to Zuckerberg saying they were concerned about the release of Meta’s LLaMA given the power of this large language model.

“We are writing to request information on how your company assessed the risk of releasing LLaMA, what steps were taken to prevent the abuse of the model, and how you are updating your policies and practices based on its unrestrained availability,” wrote Sen. Richard Blumenthal (D-CT), chair of the Senate Subcommittee on Privacy, Technology & the Law, and Josh Hawley (R-MO), ranking member.

They added that “even in the short time that generative AI tools have been available to the public, they have been dangerously abused – a risk that is further exacerbated with open source models.”

About the Author(s)

Ben Wodecki

Jr. Editor

Ben Wodecki is the Jr. Editor of AI Business, covering a wide range of AI content. Ben joined the team in March 2021 as assistant editor and was promoted to Jr. Editor. He has written for The New Statesman, Intellectual Property Magazine, and The Telegraph India, among others. He holds an MSc in Digital Journalism from Middlesex University.

See more from Ben Wodecki

Related Topics

Recent in ML

Related Topics

Recent in NLP

Related Topics

Recent in Data

Related Topics

Recent in Automation

Related Topics

Recent in Verticals

Related Topics

Recent in Responsible AI

Related Topics

Recent in Companies

Related Topics

Meta Shows Off Revolutionary Audio AI − But Won’t Release It

At a Glance

Not making it public

About the Author(s)

Latest News

Trending articles