AI Business Contributor Chris Price talks to Joel Susal, Senior Director of Product, Platform & AI at Rev, about how its super-fast and accurate speech-to-text solutions are powering the world’s voice and video applications.

August 15, 2022



All the research shows that people generally don’t trust AI. How do you think AI businesses can earn trust?

I think trust is best earned through demonstration. Many people would prefer to speak to a human rather than an AI, even if it means waiting an extra hour on hold. This is a phenomenon called ‘algorithm aversion’ and it’s a side effect of not earning the trust of the consumer. 

The path to earning that trust is improving performance so that, eventually, people don't feel the need to talk to an operator. The comparison is with getting cash out of ATMs or using online banking. At first, consumers were wary of the technologies, but over time people began to prefer them to going into a bank and waiting in line to speak with a teller or banker. 

To accelerate trust you need to be able to explain and demonstrate how one system is better than another and be transparent about your performance. This transparency is critical to us at Rev – and to our customers – because successful adoption of AI at scale depends on being not just a little better, but significantly better than the non-AI systems that exist today. This dedication to transparency and accountability is why we’ve released our datasets to the public and open-sourced a piece of technology called FST Align, which is used to compare word error rates between different transcripts (human vs AI, AI vs AI, etc.) more easily.

Finally, one challenge AI providers face is how to explain when something goes wrong and what they are going to do to ensure it doesn’t happen again. At Rev, our approach has proven highly effective at solving many problems others have faced with AI-based speech-to-text because we train using highly diverse content that has been refined into “ground truth” by a diverse group of qualified experts (our Revvers).    

More generally, do you think AI can help bridge the productivity gap in business? 

Absolutely. The fact is humans value diversity in their work, which is to say they generally don’t find fulfillment in repetitive tasks. Whereas in the past humans had to adapt their work knowing the limitations of machines, one of the premises of Industry 4.0 is that machines are now learning how to perform these repetitive tasks more effectively. 

Within our industry, the exciting AI advances come from micro modifications - looking at the edits and improvements that humans make on top of a high-quality AI transcript. Not only are these changes used to measure our internal performance, but they’re also fed into a constantly reinforcing model. 

How have the datasets you’ve worked on helped to provide more accurate results from your speech-to-text service?

Our data has two important dimensions of diversity built in. First, the audio itself is from real-world transcription scenarios with countless permutations for different industries, acoustic characteristics, accents, jargon, gender, etc. Second, that audio is transcribed by a diverse set of freelancers, who collectively produce high-quality ground truth data across these different audio characteristics. When you add both of those things together with a team of world-class speech scientists, the result is you have a system that mitigates bias more effectively than even the most heavily resourced companies like Google and Amazon.

How do you ensure greater diversity and inclusion in your own staff and suppliers, including in-house developers and freelance Revvers?

Mitigating bias through the diversity of input data and ground truth generation has been a happy side effect of our business model, which is freelancer-based and therefore open to anyone who’s eligible to become a Revver.

We have found that our approach to using freelancers who are qualified transcribers is inherently effective at mitigating bias. The fact we operate a transcription business for everyone - from grandma who wants to caption a grandchild's birthday party to media businesses putting captions on their valuable content - shows how our approach is inherently diverse. We are very happy with the results so far, but our job - and the job of any AI technology company - is never done. There is still much more bias to be eliminated, but we’re inspired by our progress so far.   

Do the Revvers remain as important as you expand the use of AI in your business?

Yes, they remain critical even as AI continues to advance. That said, the work they do may shift over time as AI carries out more of the repetitive tasks with a level of accuracy that is acceptable, or even superior to that of humans.

Here the comparison is with healthcare where AI is increasingly used to detect some cancers, providing a second set of eyes to radiologists. Even when this AI alone becomes superior to the current hybrid situation, expert radiologists will still be instrumental in researching, identifying, and training AI models to detect rarer cancers.

Certainly, we’ve seen real benefits of having both human and AI business lines. It enables us to compete and win against large companies like Google, Amazon, and Microsoft - as well as smaller ASR providers - because no one else has the highly motivated, trained, and properly equipped workforce of freelancers that we have invested in over the last 12 years.

Are there still big differences in the word error rate between humans and AI?

When we started the business, AI was not yet viable. At that time, Revvers transcribed or captioned each file from scratch. Today, we operate a hybrid approach, which means the Revver starts with an AI transcript and modifies and refines it from there. It’s part of a virtuous cycle where humans train the AI, which in turn makes the job of the humans easier and creates additional capacity in our marketplace.

When it comes to word error rates, humans are imperfect and in some cases exhibit word error rates of between 2% and 3%. Our v1 ASR model, which is best in class, was at a 13% word error rate 15 months ago. Today, our v2 model is at an 8.9% error rate. There’s great promise for future improvements as well, so it’s definitely a very exciting time in the speech-to-text space.
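For readers unfamiliar with the metric, word error rate is conventionally defined as the number of word substitutions, deletions, and insertions needed to turn a transcript into the reference, divided by the number of words in the reference. The following is a minimal sketch of that textbook definition, not Rev's implementation or the FST Align tool; the function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Illustrative only: tokenizes on whitespace and does no text
    normalization (casing, punctuation), which real tools usually do.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words so far
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr

    return prev[-1] / len(ref)


if __name__ == "__main__":
    reference = "the cat sat on the mat"
    hypothesis = "the cat sat on a mat"
    print(f"WER: {word_error_rate(reference, hypothesis):.3f}")
```

On this definition, a rate of 8.9% corresponds to roughly 9 word errors per 100 reference words, and the move from 13% to 8.9% is a relative reduction of about 32%.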

Are there differences in the types of errors committed by humans and machines?

Yes, there are. Humans tend to make errors in confusing situations like when there’s overspeak - when one person is interrupting another - and words can get jumbled or assigned to the wrong person. However, while human errors tend to be localized, machines can get words wrong in the middle of a sentence, perhaps because they don’t recognize a proper noun. And while it may just be one word that's wrong, if the error looks nonsensical it may stick out more to a reader than a human error. 

Do you think having such diverse content to transcribe is a strength, or is it better to specialize so that AI can learn words in a particular vertical?

It’s been a strength. While most players in the AI space have specialized models for different industries, we have been able to outperform all other players in most verticals. The reason is that most companies are starved of data whereas we have an excess of data. 

Now, as you start to look at highly specialized verticals, such as healthcare or science classrooms, we could either adjust our general model to include more of that data or customize it to have a bias towards a certain subject. 

So far, we’ve seen tremendous performance in our general model, which is operationally simpler for both us and our customers. Of course, we are ultimately driven to deliver the quality our customers demand, and if the benefits of customization outweigh the operational complexity, we will do what is required to remain the accuracy leader.

Do you think we’ll ever get close to a zero-error rate with AI?

Well, not a zero error rate (that is impossible in pretty much anything), but superior to humans, yes. It’s a similar path to autonomous vehicles, where there are published levels of autonomy up to Level 5, at which there isn’t any need for human involvement. We're not there yet, but I believe we’ll get there. When you talk about having a human at a steering wheel, there’s an error rate built into that. Humans may be impaired or distracted. Machines will not. So absolutely, AI has the potential to be better than a human.

To find out more about Rev's transcription services, visit www.rev.com
