Real-time speech recognition: What it is, what sets it apart, and why you need it for business communications

Manager, Speech Recognition

Making speech recognition work in real time

Automatic speech recognition (ASR) is the technology that converts spoken language into written text, replacing manual transcription of speech.

In the last 20 years, ASR technology has improved by leaps and bounds, and for some speech the most accurate systems out there can probably transcribe better than you or I can. But this state-of-the-art technology usually requires a very large number of computations for even a small clip of speech. In some cases, you'll have time to make a coffee while you wait for the transcription to complete. (If you’ve used transcription services like Rev before, you’re probably familiar with having to wait hours or even a day to receive a transcript of a call.)

But there are some cases where it certainly wouldn't be convenient to wait until long after a call has ended to finally get the speakers' words in written form. There are two kinds of "fast" here that may be important: computation time and online streaming. We'll get to computation time in a minute, but for now, let's focus on the streaming component.

Most ASR systems out there are what we call "offline", which means that they need a complete audio recording before they can start transcribing. Practically speaking, an offline system wouldn't start the transcription process for a call until after you hang up, and you'd be sitting there waiting for a notification that your transcript is ready. This might be okay if you don't need the transcript right away, but if you do, it's less than convenient.

Fortunately, we can dramatically speed up the transcription process by starting before the speaker has finished talking, an approach known as "online" or streaming ASR that is the secret to providing real-time transcripts. We're all familiar with how streaming works and why it's advantageous, especially now that we can instantly start our favorite TV show without having to wait for the entire episode to download to our device. But there are some challenges to streaming transcription that are less obvious.

If you had to transcribe a phone call yourself (a lot harder than it sounds!) you'd probably find that you'd do a better job if you had an audio recording of the whole call that you could play, pause, and rewind as necessary rather than trying to transcribe a live call as it happens. You'd also find that having the whole call available helps you make sense of some of the harder words, since clues elsewhere in the call can tell you what was said. However, unless you're a professional stenographer, you'd have to choose between transcribing accurately and transcribing quickly.

Similarly, an online transcription service that operates in real-time is capable of popping out the transcription word-by-word while the speaker is talking, but it can't fast-forward to get more context because the speech doesn't exist yet! We also have to make sure that online ASR never gets "stuck" and takes a long time to figure out a particular phrase, since it might not be able to catch up. To ensure our backlog never piles up, we need to focus on keeping the computation time as low as possible.
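To see why a slow model can't catch up, here is a minimal sketch of the backlog idea. It assumes a deliberately simple model that is not taken from our production system: audio arrives in fixed-size chunks, a single worker decodes them in order, and every chunk costs the same amount of compute. The function name and parameters are made up for illustration.

```python
def backlog_ms(chunk_ms: int, compute_ms_per_chunk: int, num_chunks: int) -> int:
    """Milliseconds of queued decoding work right after the last chunk arrives.

    Hypothetical model: one worker, fixed-size chunks, constant cost per chunk.
    """
    backlog = 0
    for _ in range(num_chunks):
        # While this chunk was being spoken, the worker drained up to
        # chunk_ms of previously queued work.
        backlog = max(0, backlog - chunk_ms)
        # The chunk that just arrived adds its own decoding cost.
        backlog += compute_ms_per_chunk
    return backlog


# Decoding a 100 ms chunk in 60 ms: the queue never grows.
print(backlog_ms(chunk_ms=100, compute_ms_per_chunk=60, num_chunks=100))   # 60
# Decoding a 100 ms chunk in 150 ms: after 10 seconds we are 5.1 seconds behind.
print(backlog_ms(chunk_ms=100, compute_ms_per_chunk=150, num_chunks=100))  # 5100
```

In this toy model, any per-chunk cost above the chunk length makes the backlog grow for as long as the speaker keeps talking.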

Accuracy or latency: Choose your pill

A user’s speech is modeled by doing a series of very large matrix computations: think millions of math equations for even a few seconds of speech. The accuracy of an ASR system depends heavily on the type of computations and the size of the matrices involved.

Even with today's computers, the model will get noticeably slower as the number and size of the computations increase, which may lead you to ask, "Why not reduce the size and the number of computations to increase the speed of the speech recognition system?"

Well, picture the ASR model like a seesaw. On one side you have accuracy, and on the other side you have speed. If you only add bricks to one side, your performance might look something like the GIF below!

Sure, it would be faster. But we also need to worry about the accuracy of the model. For example, if we simply used the lowest possible number of computations, we’d end up with a "word salad," where the model produces a jumble of valid but random words.

So just like balancing on a seesaw, there’s a trade-off between accuracy and latency. The trick is to find the sweet spot between the two. This means that for an optimal system we need to aim for the best accuracy we can achieve, measured using word error rate, while maintaining an acceptable latency, which we measure as follows.
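For readers who haven't met word error rate before, here is a small sketch using its standard definition: the minimum number of word substitutions, deletions, and insertions needed to turn the reference transcript into the ASR output, divided by the number of words in the reference. The function name and the lowercase-and-split normalization are assumptions for illustration; real evaluations usually normalize text more carefully.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / words in the reference."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words (classic Levenshtein, one row at a time).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("me" -> "be") and one deletion ("tomorrow") over 5 words.
print(word_error_rate("please call me back tomorrow", "please call be back"))  # 0.4
```

Lower is better, and a perfect transcript scores 0.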

Measuring the speed of an ASR system: Real Time Factor & latency

Real Time Factor

Here on the ASR team, we use a metric called Real Time Factor (RTF for short) to determine if our models are fast enough to run on live calls. RTF is calculated by dividing the time taken by the ASR system to transcribe by the total duration of the spoken audio.

If someone speaks for five seconds and it takes five seconds to get the transcript, we have an RTF of one. However, for these systems, we are generally looking for RTF values that are less than one! So for 10 seconds of speech, it may only take six seconds to decode, meaning our RTF is 0.6.
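As a quick sketch, the calculation looks like this. The `measure_rtf` helper and the `transcribe` callable it times are hypothetical stand-ins, not a real API.

```python
import time
from typing import Callable


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent transcribing / duration of the spoken audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def measure_rtf(transcribe: Callable[[bytes], str], audio: bytes, audio_seconds: float) -> float:
    """Time one call to a (hypothetical) transcribe function and return its RTF."""
    start = time.perf_counter()
    transcribe(audio)
    return real_time_factor(time.perf_counter() - start, audio_seconds)


print(real_time_factor(5, 5))   # 1.0
print(real_time_factor(6, 10))  # 0.6
```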

For an offline system, we have to wait for the speech to finish before starting to transcribe, so after a 10-second clip of speech, we'd have to wait six seconds for the transcript to appear if the RTF is 0.6. However, for online systems, we break the speech down into smaller chunks. If our chunks were 100 milliseconds each, then at the same RTF of 0.6 it would take only 60 milliseconds more to evaluate whether we should add a new word to the transcript stream, which is why you can see words appear right after you speak them!

Latency

Latency, on the other hand, can be described as the time it takes for transcripts to be presented to users. If we wait for someone to finish speaking before we transcribe, it might take a few seconds before the transcript appears; on the other hand, if we're actively transcribing while the speaker is talking, words will appear much faster, often taking just a fraction of a second to appear. In both of these cases, the RTF could very well be the same, even though one may take much longer to output words than the other!
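Here is a rough sketch of how the same RTF can produce very different latencies. It is a simplified model under stated assumptions: decoding cost scales linearly with audio length, the streaming decoder always keeps up, and nothing else (network, formatting, display) adds delay. The function names are made up for illustration.

```python
def offline_latency_ms(word_end_ms: int, audio_ms: int, rtf: float) -> int:
    """Delay between a word being spoken and shown when decoding starts after the clip ends."""
    # Decoding starts only once the whole clip has been received.
    finished_at = audio_ms + round(audio_ms * rtf)
    return finished_at - word_end_ms


def streaming_latency_ms(word_end_ms: int, chunk_ms: int, rtf: float) -> int:
    """Delay between a word being spoken and shown when decoding chunk by chunk."""
    # The word can be decoded once the chunk containing its end has arrived.
    chunk_end = -(-word_end_ms // chunk_ms) * chunk_ms  # round up to a chunk boundary
    finished_at = chunk_end + round(chunk_ms * rtf)
    return finished_at - word_end_ms


# A word that ends 2 seconds into a 10-second clip, with an RTF of 0.6 in both cases.
print(offline_latency_ms(word_end_ms=2000, audio_ms=10_000, rtf=0.6))  # 14000
print(streaming_latency_ms(word_end_ms=2000, chunk_ms=100, rtf=0.6))   # 60
```

Same model speed, but the user waits 14 seconds in one case and 60 milliseconds in the other.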

Unlike RTF, latency is measured in milliseconds; zero is the lowest latency possible, since ASR models can't currently predict the future 🔮. Many other components are involved in delivering transcripts to our users and therefore contribute to latency measurements, such as the telephony systems, which provide the audio, and all of the downstream systems that format and present the transcripts in the app. So when we work on our ASR models, we usually stick to using RTF; however, latency is still important to consider, since it directly relates to how the user subjectively perceives the speed of the system. Therefore, decisions about latency should be made based on the experience we want to create for our users.

Measuring the speed of ASR: A real-world example

We at Dialpad process voice calls, voicemails, and video conferences with durations ranging from a few seconds to many hours. Although it is possible to process long-running audio as a single chunk of speech, we prefer to split it into shorter segments for ease of reading as well as for better performance of our ASR system.

For the phone calls that we normally transcribe, most speakers take turns speaking. When one person ends their turn and lets the other talk, this pause is a natural stopping point to segment our speech. The resulting clip of speech is called an utterance.
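As an illustration only, here is one simple way pause-based segmentation could be sketched. It assumes we already have the start and end times of detected stretches of speech, and it uses a made-up 500 ms pause threshold; neither the input format nor the threshold is taken from our actual system.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start_ms, end_ms)


def split_into_utterances(speech_spans: List[Span], min_pause_ms: int = 500) -> List[Span]:
    """Merge speech spans separated by short gaps; a long enough pause starts a new utterance.

    speech_spans must be sorted by start time. min_pause_ms is a hypothetical threshold.
    """
    utterances: List[Span] = []
    for start, end in speech_spans:
        if utterances and start - utterances[-1][1] < min_pause_ms:
            # Short gap: treat it as the same utterance and extend its end time.
            utterances[-1] = (utterances[-1][0], end)
        else:
            # First span, or a pause long enough to count as a stopping point.
            utterances.append((start, end))
    return utterances


spans = [(0, 900), (1000, 2400), (3200, 4100), (4250, 5000)]
print(split_into_utterances(spans))  # [(0, 2400), (3200, 5000)]
```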

Our system breaks a call down into individual utterances and transcribes each one in real time using streaming ASR. When the utterance ends, it does a final pass over the whole utterance, since the entire clip is then available.

For example, if you’re watching a live transcript as you or someone else is talking, sometimes you can see one word change to another as the system gains more context and edits the previous word choice. However, once the utterance ends, the final transcript for that utterance is frozen in time and will not change for the duration of the call.
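The difference between changeable and frozen text can be sketched with a tiny data structure. The class and method names are hypothetical and only illustrate the idea: partial results for the current utterance overwrite each other, while finalized utterances are appended and never touched again.

```python
from typing import List


class LiveTranscript:
    """Toy model of a live transcript with one in-progress utterance."""

    def __init__(self) -> None:
        self.finalized: List[str] = []  # frozen utterances, in order
        self.partial: str = ""          # current best guess for the ongoing utterance

    def update_partial(self, text: str) -> None:
        # Streaming results replace the previous guess as more context arrives.
        self.partial = text

    def finalize(self, text: str) -> None:
        # The final pass result is stored and the in-progress guess is cleared.
        self.finalized.append(text)
        self.partial = ""

    def render(self) -> List[str]:
        lines = list(self.finalized)
        if self.partial:
            lines.append(self.partial + " ...")
        return lines


transcript = LiveTranscript()
transcript.update_partial("I scream")
transcript.update_partial("ice cream is")
print(transcript.render())  # ['ice cream is ...']
transcript.finalize("Ice cream is great.")
transcript.update_partial("thanks for")
print(transcript.render())  # ['Ice cream is great.', 'thanks for ...']
```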

Not only does this improve the user experience, but segmenting long-running speech into manageable utterances also makes processing faster and memory use more efficient within our ASR pipeline.

The end result

Dialpad’s blazing fast real-time transcription is the key to unlocking our full suite of Ai features without needing to wait around for a transcript to be processed after a call. In addition, in-call tools such as real-time assist and live call sentiment can be used on top of real-time transcripts to help agents and managers make quick decisions while calls are in progress.

And this is just the beginning, as advancements in Ai technology coupled with real-time ASR will continue to result in more powerful tools for having better business conversations.

