Google Discovers Novel Method to Improve Speech Processing

AudioPaLM combines its LLM with an audio generation model to outperform OpenAI’s Whisper Large-v2

Ben Wodecki, Jr. Editor

June 27, 2023

At a Glance

  • Google has unveiled AudioPaLM, a combination of its PaLM-2 LLM and its AudioLM audio generation model.
  • AudioPaLM can generate text and speech for speech recognition and speech-to-speech translation.

AI researchers from Google have found that adding a large language model to an audio generation model improves performance on tasks such as speech recognition and translation.

They developed AudioPaLM, a combination of AudioLM, an audio generation model, and PaLM-2, Google’s flagship large language model. It is designed to draw on the larger quantities of text training data available to the language model to assist with speech tasks.

Google’s researchers contend that adding a text-only large language model to an audio-generative system improves speech processing, and that the resulting model outperforms existing systems for speech translation tasks.

A paper outlining the model shows AudioPaLM outperforming speech models such as Whisper Large-v2 from OpenAI, mSLAM-CTC 2B and Google’s own USM-M on BLEU scores for speech translation measured on the CoVoST 2 corpus.
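For readers unfamiliar with BLEU, the metric scores a machine translation by how many of its word sequences (n-grams) also appear in a human reference translation, with a penalty for output that is too short. The Python sketch below is a simplified, single-reference, sentence-level version written for illustration only. It is not the evaluation code used in the paper, the function names are hypothetical, and published results typically rely on corpus-level tooling and report scores on a 0 to 100 scale.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count every n-gram (as a tuple) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, reference, max_n=4):
    """Simplified sentence-level BLEU against one reference, in [0, 1].

    Assumptions for this sketch: whitespace tokenization, a single
    reference, and no smoothing (any zero n-gram precision gives 0.0).
    """
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0

    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        total = sum(cand_counts.values())
        # Clipped matches: a candidate n-gram is credited at most as
        # many times as it occurs in the reference.
        matched = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        if total == 0 or matched == 0:
            return 0.0
        log_precisions.append(math.log(matched / total))

    # Brevity penalty: candidates shorter than the reference are penalized.
    if len(cand) >= len(ref):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(ref) / len(cand))

    # Geometric mean of the n-gram precisions, scaled by the penalty.
    return brevity_penalty * math.exp(sum(log_precisions) / max_n)


if __name__ == "__main__":
    reference = "the cat sat on the mat"
    print(bleu("the cat sat on the mat", reference))
    print(bleu("the cat sat on a mat", reference))
```

A perfect match scores 1.0, while a translation that shares no four-word sequence with the reference scores 0.0 under this unsmoothed variant.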

AudioPaLM can be fine-tuned to consume and produce tokenized audio on a mixture of speech and text tasks. The model can also perform zero-shot speech-to-text translation for language combinations not seen in its training, and can transfer a speaker’s voice across languages based on a short spoken prompt.

Google has opted not to release the code for the model, instead publishing a series of examples to GitHub.

Rival Meta opted for a similar release method for its recently unveiled generative speech model, Voicebox, for fear it could be used for malicious purposes. Google's researchers did not say why they elected not to publish the code, however.

Alongside AudioPaLM, Google has adapted PaLM to various other fields with sector-specific models, including Sec-PaLM, which can detect malicious scripts for cybersecurity experts, and Med-PaLM-2, which can help identify medical issues from images such as X-rays.

About the Author

Ben Wodecki is the Jr. Editor of AI Business, covering a wide range of AI content. Ben joined the team in March 2021 as assistant editor and was promoted to Jr. Editor. He has written for The New Statesman, Intellectual Property Magazine, and The Telegraph India, among others. He holds an MSc in Digital Journalism from Middlesex University.
