Baidu has developed a speech synthesis AI called Deep Voice, billed as the most advanced of its kind, that can talk much like a human being.
Before Deep Voice came along, WaveNet, the voice synthesis program built by Google's AI company DeepMind, was the most advanced in the world. Baidu has now gone one better. WaveNet generates speech from text, whereas Deep Voice uses deep learning techniques to break text down into phonemes, the small units of sound needed to speak any language accurately.
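To make the idea of phonemes concrete, here is a minimal Python sketch that maps words to phoneme symbols using a tiny hand-written table. The table, its ARPAbet-style symbols and the function name text_to_phonemes are all hypothetical, and the lookup is only a stand-in: the researchers say Deep Voice uses a neural network for each component, which this sketch does not attempt to reproduce.

```python
# Hypothetical toy pronunciation table (ARPAbet-style symbols).
# This is an illustration only and is not taken from Baidu's system.
TOY_LEXICON = {
    "deep": ["D", "IY", "P"],
    "voice": ["V", "OY", "S"],
    "can": ["K", "AE", "N"],
    "talk": ["T", "AO", "K"],
}


def text_to_phonemes(text):
    """Convert text to a flat list of phoneme symbols by table lookup."""
    phonemes = []
    for word in text.lower().split():
        # Strip surrounding punctuation so "talk." matches "talk".
        word = word.strip(".,!?;:\"'")
        if not word:
            continue
        # Words missing from the table are marked rather than guessed.
        phonemes.extend(TOY_LEXICON.get(word, ["<unk>"]))
    return phonemes


if __name__ == "__main__":
    print(text_to_phonemes("Deep Voice can talk."))
```

A fixed table like this fails on any word it has never seen, which is one reason a learned model is attractive for this step.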
Deep Voice was developed at Baidu’s Silicon Valley lab. It is a major step forward in speech synthesis technology because it does away with much of the manual engineering that traditional systems need, which means it can learn to talk accurately in just a few hours with little human help. This is thanks to the deep learning techniques the algorithm uses: all the researchers needed to do was train Deep Voice properly.
“For the audio synthesis model, we implement a variant of WaveNet that requires fewer parameters and trains faster than the original,” wrote the Baidu researchers in a study published online. “By using a neural network for each component, our system is simpler and more flexible than traditional text-to-speech systems, where each component requires laborious feature engineering and extensive domain expertise.”
Of course, speech synthesis itself is nothing new. It is built into most of our mobile devices, and the simplest versions can be found in some modern alarm clocks and even automated answering-machine messages. Where Deep Voice differs is that it can produce free-flowing human speech rather than piecing it together from large databases of human voice recordings.
Deep Voice therefore represents a big step towards Baidu’s goal of creating a truly human-like personal assistant, as opposed to one that uses pre-recordings to mimic intelligence. Deep Voice will actually be able to talk to you like a real human being. “We optimize inference to faster-than-real-time speeds, showing that these techniques can be applied to generate audio in real-time in a streaming fashion,” said Baidu’s researchers in an interview with MIT Technology Review.
They continued, “To perform inference at real-time, we must take great care to never recompute any results, store the entire model in the processor cache (as opposed to main memory), and optimally utilize the available computational units.”
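The "never recompute any results" principle can be illustrated with a small Python sketch. It assumes a toy generator in which each new audio sample depends on a summary of all earlier samples. The functions summarize, next_sample, generate_naive and generate_cached and the constants in them are hypothetical, and nothing here reflects Baidu's actual inference kernels.

```python
def summarize(history, decay=0.5):
    """Toy stand-in for an expensive model layer: a decayed sum of past samples."""
    state = 0.0
    for sample in history:
        state = decay * state + sample
    return state


def next_sample(state):
    """Toy rule that turns the summary into the next audio sample."""
    return 0.25 * state + 1.0


def generate_naive(n, decay=0.5):
    """Recompute the summary over the whole history at every step (quadratic work)."""
    samples = []
    for _ in range(n):
        state = summarize(samples, decay)
        samples.append(next_sample(state))
    return samples


def generate_cached(n, decay=0.5):
    """Keep the summary between steps and update it incrementally (linear work)."""
    samples = []
    state = 0.0  # cached result, never rebuilt from scratch
    for _ in range(n):
        sample = next_sample(state)
        samples.append(sample)
        state = decay * state + sample
    return samples


if __name__ == "__main__":
    # Both versions produce the same audio; only the amount of work differs.
    assert generate_naive(100) == generate_cached(100)
    print(generate_cached(3))
```

Both functions return identical samples, but the cached version does a constant amount of work per sample, which is the property that matters when audio has to be produced as fast as it is played.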
At the moment, Deep Voice is too computationally demanding for our current devices, but given time, our phones, watches and tablets should be able to run it. Until then, all we can do is wait.