AI Audio Generation: Everything You Need to Know

AI tools can generate audio from text and existing audio

Ben Wodecki, Jr. Editor

August 30, 2023

7 Min Read
AI Business takes a deep dive into a range of audio generation models and tools.Getty

At a time when artificial intelligence is creating art and writing copy, there's another exciting development occurring in the world of generative AI: audio generation.

Consider being able to type a few lines about a piece of music you’d like to hear and a system would make it happen for you. Or perhaps generating voice lines for a podcast or video you’re trying to edit.

A range of models and tools see AI not just the right notes but also create an entirely new auditory landscape.


What are text-to-audio AI models?

A text-to-audio AI model takes text as an input and generates audio content as a result. The output could vary from speech to music.

The most common iteration is TTS – or text-to-speech. This is used in the development of voice assistants like Siri or Alexa. TTS can be used to create spoken content for various languages.

Text-to-audio AI models


Creators: Google

First published: January 2023

MusicLM can generate high-fidelity music from text inputs. Users can type in a prompt such as "a guitar riff with air horns playing in time" and the model will generate a musical output. 

MusicLM can generate music at a consistent 24 kHz over several minutes.

The model works by taking audio snippets and mapping them in a database of words that describe the music sounds. It then takes the user's text or audio input and generates the resulting sound.

Check out MusicLM examples -

Read more about MusicLM on AI Business:


Creators: Google

First published: June 2023

AudioPaLM can generate text and speech for speech recognition and speech-to-speech translation.

To create AudioPaLM, Google combined the audio generation model AudioPaLM with its flagship large language model, PaLM-2, to create a model designed to leverage larger quantities of text training data to assist with speech tasks.

AudioPaLM can also be fine-tuned to consume and produce tokenized audio on a mixture of speech-to-text tasks. The model can also perform zero-shot speech-to-text translation for languages not seen in its training simply based on a short spoken prompt.

Check out the AudioPaLM paper -

Read more on AI Business about AudioPaLM -


Creators: Meta, FAIR

First published: June 2023

Voicebox is a generative AI model that can create audio from existing clips just two seconds long. Voicebox learns from both raw audio and an accompanying transcription to generate audio.

Voicebox can match the style for text-to-speech generation and can also be used to edit audio, such as removing background noises of a dog barking or distant car horns. 

Read more about Voicebox on AI Business:


Creators: ByteDance - researchers from the AI lab at TikTok’s parent company

First published: January 2023

Make-An-Audio is a prompt-enhanced diffusion model capable of generating audio from text prompts.

The model can be used to create personalized audio snippets from natural language inputs and existing audio. It can also be applied to video-to-audio generation.

Read the Make-An-Audio paper -

AI text-to-audio platforms

Want to try some AI text-to-audio tools? Try these:


PlayHT offers a range of text-to-audio tools, including voice generation for podcasts and voice cloning.

The startup behind it is trying to empower businesses to create natural speech content using state-of-the-art AI voices. The likes of Amazon, Samsung and Verizon have all used PlayHT to generate audio content.

Try PlayHT’s AI text-to-speech tech - offers text-to-audio tools for corporate or entertainment purposes. Its studio includes text-to-speech for adverts, education lessons or presentations, among others.

The likes of Nasdaq, Oracle and Toyota are among Murf’s users, with its enterprise plan including a project-sharing workspace for teams to view or edit audio projects.

Text-to-audio tools on enable users to create realistic voiceovers. Resemble also offers voice cloning and tools to localize audio content in various languages.

Nextflix, the World Bank Group and Boingo count among’s users.

Wellsaid Labs

Seattle-based Wellsaid Labs offers text-to-speech for voiceovers. It offers a studio platform where users can craft and curate custom voices for specific use cases.

Wellsaid users include Boeing, Snowflake, Intel and Peloton.

Audio-to-text models


Creators: OpenAI

First published: September 2022

Whisper is an open-source speech recognition system. Trained on 680,000 hours of data collected from the web, the model can transcribe into multiple languages.

Around one-third of the audio used to build Whisper is non-English, according to OpenAI. The company opened up access to the model so developers can use it as a foundation for building applications

Check out the Whisper paper -

Read more about Whisper on AI Business -


Creators: Microsoft

First published: January 2023

VALL-E can generate speech audio from just three-second samples. VALL-E essentially mimics the target speaker and what they would sound like when speaking a desired text input. It can also maintain the emotion of the speaker in the sample audio.

It can also be used for text-to-speech synthesis of little prior data and could be used for tasks such as speech editing and content creation when combined with other generative AI models.

Check out the VALL-E paper -

Read more aboutVALL-E on AI Business -

Fairseq S2T


First published: October 2020

Fairseq S2T is a Transformer-based seq2seq model designed for automatic speech recognition and speech translation.

Fairseq S2T generates transcripts and translations autoregressively. It uses a convolutional downsampler to significantly reduce the length of speech inputs before they are fed into the encoder.

Read more about Fairseq S2T -


Creators: Meta

First published: August 2023

AudioCraft is a suite of text-to-audio and music models.

It includes MusicGen, which generates Meta-owned and licensed music from text prompts, AudioGen, which generates sound effects trained from public audio, and EnCodec, which enables “higher quality” music generation with fewer artifacts.

All models are open source to researchers and practitioners. Access the AudioCraft code on GitHub. Try demos of MusicGen or listen to AudioGen samples. Read the EnCodec paper and download the code.

Read more about AudioCraft on AI Business -

AI audio generation applications & tools


AssemblyAI offers production-ready AI models for speech recognition, speech summarization, and more. Among its offerings is LeMUR, which helps businesses build large language model apps on spoken data.


Speech recognition startup Speechmatics offers AI-powered transcription and translation offerings that span almost 50 languages. Its API allows users to send audio and receive both transcription and translation to enable business users to open their products or services to wider audiences. Video game developers Ubisoft, consulting giant Deloitte and chipmakers Nvidia are listed as partners on its website.

AWS Transcribe

Amazon Transcribe is an automatic speech recognition service from AWS designed to help developers add speech-to-text capability to their applications.

Google Speech-to-Text

Speech-to-Text from Google Cloud can be integrated into applications – allowing users to send audio and receive a text transcription.


Kaldi is the least flashy application on this list – found on GitHub, it’s designed for speech recognition researchers.


Meta’s Wav2letter is an automatic speech recognition for researchers and developers to transcribe speech. is an AI-powered transcription platform. Users can upload or record audio and Otter will generate a text transcription.


Trint is an AI transcription service where users can obtain transcription from audio inputs. Among its co-founders is former ABC journalist Jeffrey Kofman, who serves as CEO of Trint.

Audio to audio models



First published: October 2020

SepFormer is a Transformer-based neural network for speech separation.  It can separate multiple speeches in a single recording.

Read more about SepFormer -


Creators: Google

First published: September 2022

AudioLM is an audio generation model.

It can generate semantically plausible speech from existing spoken word input while maintaining speaker identity.

Read more about AudioLM on AI Business -

About the Author(s)

Ben Wodecki

Jr. Editor

Ben Wodecki is the Jr. Editor of AI Business, covering a wide range of AI content. Ben joined the team in March 2021 as assistant editor and was promoted to Jr. Editor. He has written for The New Statesman, Intellectual Property Magazine, and The Telegraph India, among others. He holds an MSc in Digital Journalism from Middlesex University.

Keep up with the ever-evolving AI landscape
Unlock exclusive AI content by subscribing to our newsletter!!

You May Also Like