
Dialpad’s artificial intelligence enables users to get more from their calls. One of the features that allows this is the call transcript: after making a call with Dialpad, users can go back and read through their call again. But reading a transcript of real dialogue can be difficult.

Natural, spoken dialogue—like the kind that happens in a meeting—contains more information than you might think. Even though you’re not always consciously aware of it, you use this paralinguistic information all the time to understand the flow of a conversation. For example, we unconsciously interpret cues about politeness, social hierarchies, and conversational turn-taking from so-called “filler words”: if we don’t use filler words, our conversations just don’t sound natural.

On the other hand, scripted interactions like a movie or novel have already been written. The writers know how it will play out; the actors know what they’re going to say and when they’re going to say it; they know when their co-stars’ turns are over and theirs begin. Scripted dialogues, therefore, don’t really need interactional tools like filler words. This is even more true of dialogue in novels: you won’t even see the characters act out the words, and you can read at your own pace, so there’s even less need to include these conversational devices.

Similarly, when we read a transcript of a conversation, we know it’s an interaction that already happened. So, even though it was originally created on the fly, we don’t think of it that way when we’re reading it later. This makes filler words in a transcript look strange, and can make the transcript harder to read.

Transcripts aren’t just for human consumption, either: the AI team also analyzes them to perform sentiment analysis, highlight action item moments, and identify important questions asked during a call. The presence of filler words makes it more difficult to automatically identify these aspects of the transcript.

A linguistic problem: One form, many functions

Unfortunately, cleaning up a voice transcript is not as simple as removing any occurrence of any word that can be used as a filler word, because many words serve multiple purposes: they can function as both filler words and content words, depending on the context.

For instance, the word right is always pronounced and written the same way: it has a single form. But that one form can serve many different functions:

Speaker 1: So what he said to me was that they were trying to find a distributor.

Speaker 2: Right, right.

Speaker 1: So once we can find a distributor, we'll know how long it will take to deliver the materials.

Here, right is being used as a type of filler word called a discourse marker: it signals that Speaker 2 is following what Speaker 1 is saying, but it doesn't add any literal meaning. Removing these occurrences of right from the transcript would not alter the overall meaning.

But right can also be an adjective:

He injured his right leg playing football.

I want to do the right thing.

Or it can be an adverb meaning fully or exactly:

The store is right around the corner.

It can be used to reply in the affirmative:

Speaker 1: So just to confirm, you said I can call you tomorrow at 5:00?

Speaker 2: Right.

Similarly, the phrase you know has more than one function: it can be a discourse marker, or it can be a literal subject and verb:

So tell him, you know, when you’re free next week, and he’ll get back to you. (discourse marker)

I’ll call her today, but you know I won’t be in the office next week so I can’t check in on Monday. (verb)
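To make the problem concrete, here is a minimal Python sketch of the context-blind approach: delete every occurrence of a filler candidate. The function name `naive_strip` and the two-item `FILLERS` list are assumptions made up for this illustration, not Dialpad's code.

```python
import re

# Hypothetical candidate list for illustration only.
FILLERS = ["you know", "right"]

def naive_strip(utterance: str) -> str:
    """Delete every filler candidate, ignoring context entirely."""
    out = utterance
    for filler in FILLERS:
        # \b avoids matching inside longer words such as "bright";
        # an optional trailing comma and whitespace are removed too.
        pattern = rf"\b{re.escape(filler)}\b,?\s*"
        out = re.sub(pattern, "", out, flags=re.IGNORECASE)
    return out.strip()

# Works when the candidate really is a discourse marker:
print(naive_strip("So tell him, you know, when you're free next week."))
# -> So tell him, when you're free next week.

# Changes the meaning when it is a content word:
print(naive_strip("He injured his right leg playing football."))
# -> He injured his leg playing football.
print(naive_strip("The store is right around the corner."))
# -> The store is around the corner.
```

The last two outputs are still grammatical, which is what makes the error easy to miss: the transcript reads fine but no longer says what the speaker said.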

The solution: A classification task

Our approach to solving this problem is to treat discourse marker removal as a binary classification task. We train a machine learning classifier on a set of linguistic features surrounding any given occurrence of right or you know: cues from nearby words, or from the location of the candidate within the sentence. Once trained, the classifier can use these features to predict whether an occurrence of right or you know is a discourse marker or not.

If it is a discourse marker, removing it from the transcript improves readability without altering the meaning of the utterance, and improves the other linguistic analyses that we do, such as extracting action items. For example, “I’ll, you know, give you a call tomorrow” becomes “I’ll give you a call tomorrow,” making it obvious that the utterance is an action item.
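The removal step itself can be sketched separately from the classifier. In the Python below, `remove_markers` takes any yes/no predicate and deletes only the occurrences that predicate flags. The names `remove_markers`, `Predicate`, and `followed_by_comma` are hypothetical, and `followed_by_comma` is a deliberately crude stand-in for a trained classifier, used only so the example runs.

```python
import re
from typing import Callable

# Candidate phrases from this post, plus any commas that set them off.
CANDIDATE = re.compile(r",?\s*\b(you know|right)\b,?", re.IGNORECASE)

# (full utterance, start of candidate, end of candidate) -> discourse marker?
Predicate = Callable[[str, int, int], bool]

def remove_markers(utterance: str, is_marker: Predicate) -> str:
    """Delete only the candidates that the predicate labels as markers."""
    def replace(match: re.Match) -> str:
        if is_marker(utterance, match.start(1), match.end(1)):
            return ""            # drop the marker and its commas
        return match.group(0)    # keep content words untouched
    cleaned = CANDIDATE.sub(replace, utterance)
    # Collapse any doubled spaces left behind by a removal.
    return re.sub(r"\s{2,}", " ", cleaned).strip()

def followed_by_comma(text: str, start: int, end: int) -> bool:
    """Toy stand-in for a classifier: a marker is a candidate followed by a comma."""
    return text[end:end + 1] == ","

print(remove_markers("I'll, you know, give you a call tomorrow", followed_by_comma))
# -> I'll give you a call tomorrow
print(remove_markers("He injured his right leg playing football.", followed_by_comma))
# -> He injured his right leg playing football.
```

Keeping the decision behind a callback means the toy rule can be swapped for a real model without touching the text-editing logic.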

While challenging, this task was not impossible: there are many linguistic features available to train such a classifier.

For example, people tend to use discourse markers around pauses, to hold the floor and continue their conversational turn. That makes the presence of other pause-adjacent words that link ideas together, like and, so, and but, a useful feature. Linguists call these words conjunctions.

We also know that people often use discourse markers at the beginning of their conversational turn, to signal that they are about to speak but are still formulating their thoughts. Therefore, another feature we included was whether or not right or you know occurred at the beginning of an utterance.
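The two features just described can be computed with a few lines of Python. This is a rough sketch under simplifying assumptions: the tokenizer is a bare regular expression, "adjacent" means the immediately preceding or following token, and the names `extract_features`, `CONJUNCTIONS`, and `CANDIDATES` are invented for the example.

```python
import re

CONJUNCTIONS = {"and", "so", "but"}          # the three named above
CANDIDATES = (("you", "know"), ("right",))   # phrases to inspect

def tokenize(utterance: str) -> list:
    """Lowercase the utterance and keep only word-like tokens."""
    return re.findall(r"[a-z']+", utterance.lower())

def extract_features(utterance: str) -> list:
    """Return one feature dict per occurrence of a candidate phrase."""
    tokens = tokenize(utterance)
    rows = []
    i = 0
    while i < len(tokens):
        for cand in CANDIDATES:
            n = len(cand)
            if tuple(tokens[i:i + n]) == cand:
                prev_tok = tokens[i - 1] if i > 0 else None
                next_tok = tokens[i + n] if i + n < len(tokens) else None
                rows.append({
                    "phrase": " ".join(cand),
                    "utterance_initial": i == 0,
                    "conjunction_adjacent": (prev_tok in CONJUNCTIONS
                                             or next_tok in CONJUNCTIONS),
                })
                i += n - 1   # skip the rest of a multi-word candidate
                break
        i += 1
    return rows

for row in extract_features("Right, and that's, you know, how we solved the problem."):
    print(row)
```

On that sentence, the first row describes right (utterance-initial, next to and) and the second describes you know (neither feature fires), which is why a real system needs more than two features.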

Combining these with other relevant linguistic features, we created a dedicated machine learning module that can determine with very high accuracy whether a given instance of right or you know is a discourse marker or not.
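As a rough illustration of how such features can feed a binary classifier, here is a tiny logistic regression written in plain Python. Everything in it is an assumption for the sake of the example: this post does not say which model Dialpad uses, the four training rows are invented toy labels rather than real transcript data, and only the two features named above are included.

```python
import math

# Each row: (utterance_initial, conjunction_adjacent) -> 1 if discourse marker.
# Invented toy labels purely for illustration; not real transcript data.
TOY_DATA = [
    ((1, 1), 1),
    ((1, 0), 1),
    ((0, 1), 1),
    ((0, 0), 0),
]

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def train(data, epochs: int = 2000, lr: float = 0.5):
    """Fit weights and a bias with batch gradient descent on log loss."""
    n_features = len(data[0][0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for features, label in data:
            score = bias + sum(w * x for w, x in zip(weights, features))
            error = sigmoid(score) - label
            for j, x in enumerate(features):
                grad_w[j] += error * x
            grad_b += error
        # Step against the mean gradient.
        weights = [w - lr * g / len(data) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(data)
    return weights, bias

def predict(model, features) -> bool:
    """True means 'discourse marker', using a 0.5 probability threshold."""
    weights, bias = model
    score = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(score) >= 0.5

model = train(TOY_DATA)
print(predict(model, (1, 0)))  # utterance-initial candidate
print(predict(model, (0, 0)))  # mid-utterance, no conjunction nearby
```

With only two features the model can do little more than memorize the toy table, so treat this as a picture of the mechanics rather than of the accuracy a production module would reach.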

Right, and that's, you know, how we solved the problem.

Or as we would prefer to say: that's how we solved the problem.