February 21, 2004

Speech Recognition Recognizes Intonation?

The other day Bill Poser posted a comment on the linguistics sessions at last weekend's annual meeting of the American Association for the Advancement of Science. Not only were there six sessions on language at the meeting, but we had our own theme track for the first time ever: the AAAS called it "Language, Origins and Development".

But there was also linguistics-related material in some other sessions. The one that surprised me was Mari Ostendorf's talk on "Overview of Speech Recognition" in a mostly-Microsoft session called "Scientific Problems Facing Speech Recognition Today".

Ostendorf, a professor in Electrical Engineering at the University of Washington, gave an interesting survey of what's going on in this field and then turned to her own research. She said that she's currently exploring a new aspect of speech that is especially useful and exciting: prosody! Her main example illustrating this exciting discovery was a recording of a natural speech segment consisting of two sentences, each ending with clearly audible declarative sentence-final intonation; the accompanying written text on the screen lacked punctuation entirely. She pointed out that this short segment was problematic until one considered prosody, at which point the text suddenly made excellent sense. Of course, it was only a problem as long as one omitted all standard punctuation from the written version, a point she didn't mention.

Taking prosody into account, she said, makes her automatic speech recognition system vastly more efficient. Surprise, surprise.

I don't have any trouble understanding why many people who work on speech recognition and machine translation find much of linguistic theory unhelpful, because only some of what linguists know is likely to be useful for these purposes. But announcing the discovery of prosody (she was actually talking mainly about sentence intonation, not other prosodic features) seems a bit much.

It seems especially odd because intonation has been a feature of at least one rather low-tech automatic voice system for a long time. In the wonderful film "American Tongues", a celebration of American English dialects, one segment features the woman whose voice is (or was?) the source of the telephone numbers you get automatically when (for instance) you dial a number that has been changed. She explains on screen that she recorded each numeral from 1 to 9, plus 0, in several different pronunciations, differing by intonation (I don't remember whether she uses that word), so that the entire number will sound fairly natural to the listener. So in a telephone number such as 228-2228, the first 8 will have clause-final-but-not-sentence-final intonation, and the second 8 will have falling-pitch sentence-final intonation.
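For readers who think in code, the selection scheme described above can be sketched in a few lines of Python. This is only an illustration of the idea, not a description of how the telephone company's system was actually built: the function name `intonation_plan` and the labels `medial`, `continuation`, and `final` are hypothetical, and the sketch assumes that digits inside a group get a neutral variant and that groups are separated by hyphens.

```python
def intonation_plan(number: str) -> list[tuple[str, str]]:
    """Pair each digit with the (hypothetical) recorded variant to play.

    Labels are illustrative assumptions, not terms from the film:
      medial       - digit inside a group, neutral pitch
      continuation - last digit of a non-final group
                     (clause-final but not sentence-final)
      final        - last digit of the whole number
                     (falling-pitch, sentence-final)
    """
    # Assumption: groups are separated by hyphens, as in "228-2228".
    groups = [g for g in number.split("-") if g]
    plan = []
    for group_index, group in enumerate(groups):
        for digit_index, digit in enumerate(group):
            if digit_index < len(group) - 1:
                label = "medial"
            elif group_index < len(groups) - 1:
                label = "continuation"
            else:
                label = "final"
            plan.append((digit, label))
    return plan


if __name__ == "__main__":
    for digit, label in intonation_plan("228-2228"):
        print(digit, label)
```

With the number 228-2228, the first 8 is paired with `continuation` and the second 8 with `final`, which is the contrast the recording artist describes.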

So AT&T knew a long time ago what some speech recognition experts are apparently just finding out. In fairness to Ostendorf, I should add that her talk was aimed at a nonspecialist audience, so it's quite possible that she knows about the vast amount of work on intonation in linguistics, including work in computational linguistics. But she certainly didn't mention any.

Posted by Sally Thomason at February 21, 2004 05:32 PM