How smart speakers work

by Charly Walther, Lionbridge 16 March 2020

You’ve probably heard of voice assistants like Siri and Alexa, but what about the actual hardware these tools use to hear what people tell them? The listening device itself is called a smart speaker. Typically compact, these devices respond to wake words and commands like “Hey, Alexa” or “Okay, Google.”

Working in tandem with the assistants themselves, smart speakers can perform a variety of tasks: at home, they can turn on lights, play music, make restaurant reservations, and so on. In cars, drivers use them to get directions and parking information from vehicles’ internal navigation systems.

As accessible as smart speakers are, they still have their limits – the first being speech recognition technology itself.

It’s always about data

Unlike voice recognition, which identifies who is speaking, speech recognition analyzes voices to determine what was said. To do this, the system first digitizes a person’s voice into a machine-readable format, then analyzes the meaning of the words and passes the result to an artificial intelligence (AI) system.
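
To make the digitizing step concrete, here is a minimal Python sketch of turning a continuous sound wave into numbers a machine can read. It is an illustration only, not how any particular smart speaker works: the 16 kHz sampling rate, the 16-bit depth, and the 440 Hz test tone standing in for a microphone signal are all assumptions.

```python
import math

SAMPLE_RATE_HZ = 16_000  # assumed sampling rate, a common choice for speech
BIT_DEPTH = 16           # assumed number of bits per sample


def analog_signal(t):
    """Stand-in for microphone input: a 440 Hz tone with amplitude in [-1, 1]."""
    return math.sin(2 * math.pi * 440 * t)


def digitize(signal, duration_s, sample_rate=SAMPLE_RATE_HZ, bit_depth=BIT_DEPTH):
    """Sample a continuous signal and quantize each sample to a signed integer."""
    max_level = 2 ** (bit_depth - 1) - 1  # 32767 for 16-bit audio
    n_samples = int(duration_s * sample_rate)
    return [round(signal(i / sample_rate) * max_level) for i in range(n_samples)]


# Ten milliseconds of "audio" becomes a plain list of integers.
samples = digitize(analog_signal, duration_s=0.01)
print(len(samples), samples[:5])
```

Everything downstream, from recognizing words to working out what they mean, operates on sequences of numbers like these rather than on sound itself.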

Voice assistant technology then uses this data to determine what the user needs. And since human speech and desires are complex and ever-changing, this takes copious amounts of data.

As with any machine learning system, raw speech data must be cleaned and labeled before it can be used for training. For smart speaker algorithms to understand and respond to human speech across contexts and environments, that training must also incorporate a large amount of accurate linguistic data. Engineers then have to account for the fact that not all users speak the same language – and even when they do, they don’t all speak it the same way, using a myriad of dialects and accents. Take Chinese, for example, which has 130 different spoken variants.

Then there are all the different ways to say something, like “Will it rain this afternoon?” versus “Are they calling for rain?” – or, even more difficult for AI to understand, “Do I need my umbrella?” In order to build effective smart speakers, the speech recognition tech they use has to recognize that these questions all ask the same thing.
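
The idea of treating differently worded questions as one request can be sketched in a few lines of Python. This is a toy, hand-written keyword matcher, not the trained models production assistants rely on, and the intent names and keyword lists are hypothetical, chosen only for illustration.

```python
import re

# Hypothetical intents and keywords; a real system would learn these from data.
INTENT_KEYWORDS = {
    "check_rain_forecast": {"rain", "umbrella"},
    "play_music": {"play", "song", "music"},
}


def detect_intent(utterance):
    """Return the intent whose keywords overlap most with the utterance, or None."""
    words = set(re.findall(r"[a-z']+", utterance.lower()))
    best_intent, best_overlap = None, 0
    for intent, keywords in INTENT_KEYWORDS.items():
        overlap = len(words & keywords)
        if overlap > best_overlap:
            best_intent, best_overlap = intent, overlap
    return best_intent


questions = [
    "Will it rain this afternoon?",
    "Are they calling for rain?",
    "Do I need my umbrella?",
]
for question in questions:
    print(question, "->", detect_intent(question))
```

All three questions land on the same intent here only because someone listed “umbrella” by hand. A phrasing nobody anticipated would fall through, which is why real systems need the large volumes of labeled data described above.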

Human conversations also bounce around a lot: we forget what we’re talking about, ask questions out of order, or sometimes ask more than one thing at a time, which makes the requests we give smart speakers just as hard to follow. Add other people talking in the background, and it’s amazing this technology can function at all!

Despite these challenges, today’s AI engineers are making remarkable strides toward developing smart speakers that can recognize all these forms of speech. At Google, that means improving recognition for those with speech impairments, while startup Voiceitt is designing technology to better understand stroke victims, and Amazon has developed smart speakers that understand sign language.


Charly Walther is VP of product and growth at Lionbridge, a company specializing in machine translation and language data services.