January 05, 2004

and uh -- then what?

There's a piece in the 1/3/2004 NYT, featuring recent research on disfluencies by Liz Shriberg, Herb Clark, Jean Fox Tree and others. This is an area where a lot of good work has been done over the past decade or so. Predictably, the writer is most impressed by Nicholas Christenfeld's 1991 finding that "humanities professors say you know and uh 4.85 times per minute, social scientists 3.84 and natural science professors 1.39 times", and that "drinking alcohol reduces ums." (Christenfeld seems to have a flair for catchy research -- he's also known for studying whether a machine can tickle.)

One of the things that I like about disfluency research is that it has produced some exemplary collaborations between psycholinguists and engineers, especially in the work of Andreas Stolcke and Liz Shriberg. As an example of how this interplay works, I'll describe one of their early papers, "Statistical language modeling for speech disfluencies", Proc. IEEE Intl. Conf. on Acoustics, Speech and Signal Processing, vol. 1, pp. 405-408, 1996. They test the hypothesis that conversational transcripts would be more coherent (from an information-theoretic point of view) if disfluencies such as filled pauses (ums and uhs) were removed.

They trained a trigram model on 1.8M words of Switchboard transcripts, and tested on 17.5K words of held-out transcripts, comparing a model in which the filled pauses were edited out with one in which they were left in place. They did this in two different ways: in one case dividing the conversations into "phrases" based on the occurrence of pauses, and in the other case dividing them into "phrases" on the basis of linguistic content. The initial and final "phrase boundaries" (however they are defined) function like words in the sequence, so that after the final word of a phrase, the thing to be predicted is the phrase end, rather than the first word of the next phrase. Likewise, the first word of a phrase is predicted as following a phrase boundary, rather than as following the final word of the previous phrase.
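To make the mechanics concrete, here's a minimal Python sketch (mine, not theirs -- the boundary-token name and the toy transcript are invented for illustration) of how phrase boundaries can function as pseudo-words in the token stream, and what "editing out" the filled pauses does to that stream:

    from collections import Counter

    BOUNDARY = "</s>"  # phrase-boundary pseudo-word; the name is arbitrary

    def to_tokens(phrases, keep_filled_pauses=True):
        # Flatten phrases into one stream in which each phrase ends with an
        # explicit boundary token, predicted like any other word.
        tokens = []
        for phrase in phrases:
            for w in phrase:
                if keep_filled_pauses or w not in ("uh", "um"):
                    tokens.append(w)
            tokens.append(BOUNDARY)
        return tokens

    # The same toy transcript, with filled pauses left in vs. edited out.
    phrases = [["so", "i", "went", "uh"], ["to", "the", "um", "store"]]
    unedited = to_tokens(phrases, keep_filled_pauses=True)
    edited = to_tokens(phrases, keep_filled_pauses=False)

    # Raw trigram counts over a stream; a real model would also smooth them.
    def trigram_counts(tokens):
        return Counter(zip(tokens, tokens[1:], tokens[2:]))

    print(unedited)  # ['so', 'i', 'went', 'uh', '</s>', 'to', ...]
    print(edited)    # ['so', 'i', 'went', '</s>', 'to', ...]

Everything downstream -- training, testing, perplexity -- then treats the boundary like any other vocabulary item.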

In both cases, the measure of coherence is the local perplexity, which is 2 raised to the power of the local entropy. This is a way of quantifying how predictable the next word is. It's simple to calculate this, given a statistical model of word sequences. Let's say we want the perplexity of the word immediately following (all the examples of) uh in the test data. For each such word w_i we estimate its conditional probability p_i (given the previous two words in the text), and across all the uhs (about 500 in their test), we average the quantity -log2(p_i). This average is an estimate of the local entropy e (with respect to the statistical model), and the local perplexity is just 2^e. For those who aren't familiar with this measure, it may help to note that if N different words are possible at a given point, and all of them are equally likely, then the perplexity at that point is N. The information-theoretic perplexity is just a way of keeping track of the degree of uncertainty -- the effective "branching factor" -- when the alternatives are not equally likely.
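For concreteness, here's the calculation as a few lines of Python, with made-up probabilities standing in for what a real trigram model would assign to the words that followed each uh:

    import math

    # Hypothetical conditional probabilities p_i assigned by a trigram model
    # to each word that actually followed an "uh" in some test data.
    probs_after_uh = [0.05, 0.20, 0.01, 0.10, 0.02]

    # Local entropy: the average of -log2(p_i) over those positions.
    entropy = sum(-math.log2(p) for p in probs_after_uh) / len(probs_after_uh)

    # Local perplexity: 2 raised to the local entropy -- the effective
    # "branching factor" at these positions.
    perplexity = 2 ** entropy

    print(f"local entropy    = {entropy:.2f} bits")
    print(f"local perplexity = {perplexity:.1f}")

If all five probabilities had been 0.1, the entropy would be log2(10) and the perplexity exactly 10, matching the equally-likely intuition above.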

Stolcke and Shriberg's test was to compare the results of statistical language models prepared with the uhs and ums left in as 'words', with the results of otherwise-identical models with the uhs and ums left out. On the hypothesis that the uhs and ums are not really part of the message, the model should make better predictions if we leave them out. However, using the acoustic segmentation, Stolcke and Shriberg found that the perplexity immediately after the uh or um was significantly increased in the "edited" model, not decreased:

 
            UH+1     UM+1     Overall
unedited    223.5    36.7     101.9
edited      291.5    73.4     103.3

In contrast, if they divided the conversation up on linguistic grounds (i.e. based on the syntax and semantics), and looked only at phrase-medial filled pauses, the edited model was a better predictor (i.e. gave lower perplexity):

 
            UH+1     UM+1
unedited    849.0    437.4
edited      606.2    361.7

You should be able to see what happened. When the phrasing was pause-based, the uhs and ums were often phrase-final. So when you see an uh, you have a good chance of predicting a (pause-based) final phrase boundary right after it. If you edit out the uh, you lose that predictive ability. But if you divide phrases on the basis of linguistic structure, the uh will generally not be phrase-final, and the word following the uh will usually be pretty high-entropy -- after all, the speaker is emitting an uh precisely while dredging that word up -- and you'll have a slightly better chance of predicting it from the preceding two words than from the uh and one preceding word.
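A toy simulation (entirely invented -- the phrase generator and its probabilities are mine, and I use bigram counts rather than trigrams to keep it short) shows the artifact directly:

    import random
    from collections import Counter

    random.seed(0)
    BOUNDARY = "</s>"
    WORDS = ["so", "well", "then", "maybe", "right", "okay", "sure", "now"]

    def make_phrase():
        # Pause-based phrasing: speakers often trail off with an "uh"
        # just before a pause, so "uh" is frequently phrase-final.
        phrase = random.choices(WORDS, k=random.randint(2, 5))
        if random.random() < 0.6:
            phrase.append("uh")
        return phrase

    corpus = [make_phrase() for _ in range(20000)]

    def stream(phrases, keep_uh):
        out = []
        for p in phrases:
            out += [w for w in p if keep_uh or w != "uh"]
            out.append(BOUNDARY)
        return out

    def prob_next(tokens, prev, nxt):
        # P(nxt | prev) from raw bigram counts (no smoothing; just a sketch).
        bigrams = Counter(zip(tokens, tokens[1:]))
        return bigrams[(prev, nxt)] / Counter(tokens)[prev]

    unedited = stream(corpus, keep_uh=True)
    edited = stream(corpus, keep_uh=False)

    # With "uh" kept in, the boundary right after it is perfectly predictable
    # in this toy corpus; with "uh" edited out, the boundary has to be guessed
    # from an ordinary word, which carries much less information about it.
    print(prob_next(unedited, "uh", BOUNDARY))   # 1.0
    print(prob_next(edited, "well", BOUNDARY))   # much lower

Editing out the uh doesn't change what the speaker said next; it just throws away a token that happened to be a near-perfect predictor of the pause-based boundary.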

This is a good example of why purely mechanical applications of statistical analysis procedures can be misleading. When Andreas and Liz first did this work (at the Johns Hopkins summer workshop in 1995), they initially thought that the pause-phrasing results showed that disfluencies really carry information about the word sequence. However, being smart, sensible and careful researchers, they went on to look more closely at the situation, with the results that you can read in the cited paper.

There has been a lot of work over the years suggesting that disfluencies are often really communicative choices rather than system failures. I have a favorite anecdote about this. Former New York mayor Ed Koch has (or used to have?) a radio talk show, which I would sometimes listen to in the car when I lived in northern New Jersey, back in the neolithic era. Though highly verbal and even glib, Ed is a big um-and-uh-er, to the point that he would often introduce himself by saying "This is Ed uh Koch." Since it's not credible that he was having trouble remembering his own last name, I concluded that he often used a filled pause as a sort of emphatic particle.

Ideas like this would have made it easy to interpret the first (pause-based) results that Andreas and Liz found as confirming that filled pauses are communicatively significant. They are, no doubt about it, but not in the sense that they help a trigram model to predict the words that follow them. As Dick Hamming used to say, "beware of finding what you're looking for." (I haven't been able to find a web link for this aphorism, but you can find some other good advice from Hamming here.) Liz and Andreas were (and are) really interested in the foundational questions about this problem, and so they didn't just go for the quick score, but probed their results carefully, re-did the analysis in other ways, and made a solid contribution rather than a flashier but more ephemeral one.

Andreas, Liz and others have gone on to learn a lot more about the science of disfluency as well as about how to solve the engineering problems involved in recognizing and understanding disfluent speech. It's too bad that (as far as I know) linguists who study syntax, semantics and pragmatics have not been involved in this enterprise to any significant extent.

Posted by Mark Liberman at January 5, 2004 10:16 AM