Why punctuation matters in AI transcriptions

Xue-Yong Fu

Senior Applied Scientist

February 11, 2021

Punctuation is important for the user to understand transcripts of business conversations clearly. For example, consider the following two sentences:

“Let’s eat Grandpa!”

“Let’s eat, Grandpa!”

A single punctuation mark changes things drastically; getting punctuation wrong can completely alter the meaning of a sentence, with potentially serious consequences. And of course, in lower-stakes situations, proper punctuation also makes the reading experience more pleasant, providing a general sense of coherence and flow.

Apart from benefiting humans, accurate punctuation also allows for better automated analysis: with more accurate and readable punctuation, the AI team here at Dialpad can better analyze transcripts, improving our capabilities for sentiment analysis, highlighting action items, identifying the purpose of a call, addressing interesting questions, and studying other important moments in a conversation.

How the machine learns punctuation

Getting punctuation right might sound like a simple task, and for humans it is fairly easy. But this is another instance where the human capacity for language is greatly underestimated: something that seems straightforward to the human mind is quite challenging for a computer. The machine needs to know enough about English grammar and writing to put every comma, question mark, and period where it belongs in a transcript.

Luckily, deep learning techniques allow a computer to learn how to do things from data, and punctuating transcripts is an ideal task for such techniques.

First, we need correctly punctuated transcripts: data from which the machine can learn. To get this data, we hired a team of annotators to manually add the correct punctuation marks to raw transcripts (transcripts with no punctuation). Then, we used this annotated data to “teach” the computer how to correctly punctuate raw transcripts. When the computer predicts punctuation marks incorrectly during its “learning” process, we tell it the correct answer, and it learns how to avoid the same kind of mistake next time. Little by little, the model develops the ability to punctuate strings of words in a way that approximates the human annotators’ own ability. The result of this “learning” is transcripts that are automatically punctuated in a way that’s coherent to human readers.
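To make the idea of “annotated data” concrete, here is a minimal sketch of how a punctuated sentence could be turned into a training example. It assumes punctuation prediction is framed as labeling each word with the mark that follows it, which is a common framing but not necessarily the one used at Dialpad. The label names and the `to_training_pairs` and `strip_labels` helpers are hypothetical.

```python
import re

# Hypothetical label set: the mark that follows a word, or "O" for no mark.
PUNCT_LABELS = {",": "COMMA", ".": "PERIOD", "?": "QUESTION", "!": "EXCLAIM"}


def to_training_pairs(punctuated):
    """Turn an annotated sentence into (lowercased word, label) pairs."""
    # Tokens are either words (apostrophes allowed) or one tracked mark.
    tokens = re.findall(r"[A-Za-z0-9']+|[,.?!]", punctuated)
    pairs = []
    for tok in tokens:
        if tok in PUNCT_LABELS:
            if pairs:  # attach the mark to the word just before it
                word, _ = pairs[-1]
                pairs[-1] = (word, PUNCT_LABELS[tok])
        else:
            pairs.append((tok.lower(), "O"))
    return pairs


def strip_labels(pairs):
    """Rebuild the raw, unpunctuated input the model would see."""
    return " ".join(word for word, _ in pairs)


with_comma = to_training_pairs("Let's eat, Grandpa!")
without_comma = to_training_pairs("Let's eat Grandpa!")
print(strip_labels(with_comma))  # same raw input for both sentences
print(with_comma)
print(without_comma)
```

Both sentences produce the same raw input, and only the label on “eat” differs. That single label is what the model has to learn to predict from context.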

Challenge: Manual annotation is expensive

Unfortunately, manually adding punctuation to transcripts is time-consuming and tedious. To place punctuation marks correctly, human annotators have to read through unpunctuated transcripts very carefully, and unpunctuated text is much harder to read than well-punctuated text. It can take months to collect enough data for the computer to learn from, long enough to make one wonder: wouldn’t it be nice if we could find texts that someone has already punctuated, so we could use those directly to teach the machine?

The first thing that comes to mind is Wikipedia: since there are so many informative and well-punctuated articles in Wikipedia, it must be good for learning punctuation, right?

This train of thought might lead someone to feed millions of English Wikipedia articles to the computer, and hope it can learn something from them. But it turns out the computer does not learn anything that improves its punctuation for Dialpad calls. Why? What is the difference between Wikipedia articles and Dialpad call transcripts?

The answer is that one is written text and the other is spoken language, and the way people use language is very different when they’re having a spoken conversation versus writing a formal article for Wikipedia. For example, written language tends to have more complex sentence structures and longer sentences; spoken language tends to have more repetitions and shorter sentences that are easy to follow.
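As a rough illustration of this difference, the sketch below measures average sentence length and the share of immediately repeated words in two short texts. Both texts are invented for this example and are not taken from Wikipedia or from Dialpad calls, and the `sentence_stats` helper is hypothetical.

```python
import re


def sentence_stats(text):
    """Return (average words per sentence, share of words repeating the previous word)."""
    sentences = [s for s in re.split(r"[.?!]+", text) if s.strip()]
    words_per_sentence = [len(re.findall(r"[A-Za-z']+", s)) for s in sentences]
    words = [w.lower() for w in re.findall(r"[A-Za-z']+", text)]
    # Count words that are identical to the word right before them.
    repeats = sum(1 for prev, cur in zip(words, words[1:]) if prev == cur)
    return sum(words_per_sentence) / len(sentences), repeats / len(words)


# Made-up examples of a written style and a spoken style.
written = ("The treaty, which was signed after months of negotiation, "
           "established a framework that later governments would repeatedly revise.")
spoken = "Yeah, so, I I think that works. Let me check. Can you hold on?"

print(sentence_stats(written))
print(sentence_stats(spoken))
```

On these toy inputs the written text has one long sentence with no repetition, while the spoken text has several short sentences and a repeated word. A model trained only on the first style sees few examples of the patterns it must punctuate in the second.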

Solution: Movie subtitles

To address the issue above, we need texts that are conversational. And we need a lot of them.

Movie subtitles are a good candidate for this: there are many conversations in movies, and subtitles are usually punctuated directly by the movie creators, so we know they will usually be punctuated in a way that human readers find coherent.

But if we use millions of subtitles to teach the computer, we see that, once again, the machine doesn’t learn to punctuate call transcripts very well.

But why? Subtitles are spoken language, after all.

The thing is, we need to consider not just the style of the language (conversational vs. formally written), but also the topics of conversation and the variety of genres. For example, there are subtitles like “They were smuggling diamonds.” or “Archaeologists have always been mystified.” These kinds of conversations are not likely to happen on sales calls, or calls from call centers, or even in normal, everyday conversations between colleagues.

So, we need to filter out the subtitles that don’t match the context of Dialpad calls (mostly business): instead of feeding any and all kinds of movie subtitles to our machine, we can choose to feed it only those subtitles that come from movie snippets where the interactions between characters are similar to business conversations or sales calls. This way, the machine learns from the correct sociolinguistic context.

For example, the following subtitle snippets are quite distant from the context we are trying to capture; these are not things someone is likely to say during a business call:

Distant from the business context

Our dog disappeared.

Sometimes those serious ones fool you.

He looks so utterly vulnerable.

Your paintings were impressive joker.

God destroyed whole cities to punish man's wickedness.

Your module, being flown through an unstable wormhole, piloted by this creature.

I mean, who comes here after dark besides murderers?

On the other hand, the following example subtitle snippets are much more similar to the context of our call transcripts; these are things someone might actually say during a business call:

Closer to business context

Well, thank you for coming down.

What did you find out so far?

I guess you could give me her number.

Eh, I really don't know what to do about it!

Um, just hold on a second.

It's gonna be very nice.

We used a machine learning technique to select five million of the most business-like subtitles. More specifically, we first trained a language model on business-context transcripts so that it learned to estimate how likely an utterance is to be spoken on a business call. We then used this language model to select the “business-like” movie subtitles from the OpenSubtitles corpus, which contains 300 million subtitles from over 4,000 movies. We used this subset of subtitles as learning material for our new punctuation machine. After several rounds of “learning”, the machine was able to automatically punctuate transcripts much more coherently for human readers.
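The selection step can be sketched in miniature as follows. This is an illustration only: it uses a tiny add-one-smoothed unigram model as a stand-in for the language model, since the post does not say which kind of model was used. The in-domain utterances are invented, the `UnigramLM` class and `select_business_like` function are hypothetical, and scoring by average log-probability per word is an assumption made so that short and long subtitles are comparable.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())


class UnigramLM:
    """Add-one-smoothed unigram model trained on in-domain utterances."""

    def __init__(self, utterances):
        self.counts = Counter(t for u in utterances for t in tokenize(u))
        self.total = sum(self.counts.values())
        self.vocab_size = len(self.counts) + 1  # one extra slot for unseen words

    def avg_log_prob(self, utterance):
        tokens = tokenize(utterance)
        if not tokens:
            return float("-inf")
        denom = self.total + self.vocab_size
        return sum(math.log((self.counts[t] + 1) / denom) for t in tokens) / len(tokens)


def select_business_like(lm, subtitles, k):
    """Keep the k subtitles the in-domain model finds most likely."""
    return sorted(subtitles, key=lm.avg_log_prob, reverse=True)[:k]


# Invented stand-ins for business-call transcripts.
in_domain = [
    "thank you for calling how can i help you",
    "can you hold on a second while i check",
    "what did you think about the pricing so far",
    "i will give you a call back with the number",
]
# Candidate subtitles taken from the example lists above.
subtitles = [
    "God destroyed whole cities to punish man's wickedness.",
    "Um, just hold on a second.",
    "Our dog disappeared.",
    "What did you find out so far?",
]

lm = UnigramLM(in_domain)
selected = select_business_like(lm, subtitles, k=2)
print(selected)
```

Even this toy model ranks the two call-like subtitles above the other two, because their words overlap with the in-domain vocabulary. A real system would use a much stronger language model and far more in-domain data.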

Success!

Movie subtitles not only help us understand conversations in movies better, but also give Dialpad users a better transcript-reading experience, one that is vital to properly understanding what has been said on a call. Indeed, deep learning on movie subtitles from the appropriate sociolinguistic context makes the difference between inviting Grandpa to a meal and having Grandpa as the meal: thanks to this learning technique, Grandpa can now enjoy his lunch without fear.