
Baidu's Deep Voice AI Can Talk Like a Human Being

by Ed Lauder

Baidu has developed Deep Voice, a speech synthesis AI that can talk like a human being and which the company bills as the most advanced system of its kind.

Before Deep Voice came along, Google's WaveNet, developed by its DeepMind subsidiary, was the most advanced speech synthesis system in the world. Baidu has now gone one better. WaveNet generates speech directly from text, whereas Deep Voice uses deep learning to break text down into phonemes, the small units of sound needed to pronounce any language accurately.
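To illustrate the idea of breaking text into phonemes, here is a minimal toy sketch. The dictionary and function below are hypothetical stand-ins for illustration only; a real system such as Deep Voice learns the grapheme-to-phoneme mapping with neural networks rather than using a hand-made lookup table.

```python
# Toy grapheme-to-phoneme sketch (illustrative only).
# Deep Voice learns this mapping; here we fake it with a tiny
# hand-made dictionary using ARPAbet-style phoneme symbols.
TOY_PHONEME_DICT = {
    "deep":  ["D", "IY", "P"],
    "voice": ["V", "OY", "S"],
}

def text_to_phonemes(text):
    """Split text into words and map each word to its phoneme sequence."""
    phonemes = []
    for word in text.lower().split():
        # Unknown words get a placeholder token.
        phonemes.extend(TOY_PHONEME_DICT.get(word, ["<unk>"]))
    return phonemes

print(text_to_phonemes("Deep Voice"))  # ['D', 'IY', 'P', 'V', 'OY', 'S']
```

A synthesis model then maps phoneme sequences like this one to audio, which is what lets the system pronounce words it has never seen recorded.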

Deep Voice was developed in Baidu's Silicon Valley lab, and it is a major breakthrough in speech synthesis because it largely does away with the laborious hand-tuned feature engineering behind traditional text-to-speech pipelines. Thanks to the deep learning techniques the algorithm uses, it can learn to speak accurately in just a few hours; all the researchers needed to do was train it properly.

“For the audio synthesis model, we implement a variant of WaveNet that requires fewer parameters and trains faster than the original,” wrote the Baidu researchers in a study published online. “By using a neural network for each component, our system is simpler and more flexible than traditional text-to-speech systems, where each component requires laborious feature engineering and extensive domain expertise.”

Of course, speech synthesis itself isn't new. Simple systems are present in most of our mobile devices, in some modern alarm clocks, and even in automated answering-machine messages. Where Deep Voice differs is that it can produce free-flowing, human-like speech, rather than piecing utterances together from large databases of recorded human voices.

Deep Voice therefore represents a big step towards Baidu's goal of creating a truly human-like personal assistant, as opposed to one that mimics intelligence with pre-recordings. Deep Voice will actually be able to talk to you like a real human being. “We optimize inference to faster-than-real-time speeds, showing that these techniques can be applied to generate audio in real-time in a streaming fashion,” said Baidu's researchers in an interview with MIT Technology Review.

They continued, “To perform inference at real-time, we must take great care to never recompute any results, store the entire model in the processor cache (as opposed to main memory), and optimally utilize the available computational units.”
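The "faster-than-real-time" claim above is usually expressed as a real-time factor (RTF): compute time divided by the duration of the audio produced. The function and the numbers below are an illustrative sketch, not figures from Baidu's paper.

```python
def real_time_factor(synthesis_seconds, audio_seconds):
    """RTF = compute time / audio duration.
    An RTF below 1.0 means audio is generated faster than it plays back,
    which is the requirement for streaming synthesis."""
    return synthesis_seconds / audio_seconds

# Hypothetical example: 0.4 s of compute producing 1.0 s of audio.
rtf = real_time_factor(0.4, 1.0)
print(rtf)          # 0.4
print(rtf < 1.0)    # True: fast enough to stream
```

Keeping the whole model in the processor cache, as the researchers describe, is one way to push the numerator down so the RTF stays below 1.0.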

At the moment, Deep Voice is too demanding for our current devices to handle, but given time, our phones, watches and tablets will be able to run it. All we have to do is wait for that day to come.
