February 03, 2006

Who wants to die, asks Baby David

There's an urban legend that an early speech recognition system heard "recognize speech" as "wreck a nice beach". That one's made up, but it's not a legend that BBN's Podzinger recently transcribed "say Jesus is Lord" as "Beijing this morning", or "a moment in your life" as "remote wooded delight". The perils of ASR should be getting some sympathy these days from the publisher of that Elmo book on potty training, the one where you press Baby David and hear (something that sometimes sounds like) "who wants to die?" for what was recorded as "who has to go?".

This appears to be an unfortunate artifact of carelessly done compression, which was not detected before publication because of the effects of lexical priming: as everyone who has ever given a speech synthesis demo knows, perceptions of distorted speech are strongly influenced by expectations. Thus the folks responsible for quality control on this book weren't irresponsible; they were just primed. Well, maybe they were a little irresponsible -- if you're going to be in that business, you should know that you need to check how the compressed and re-created sounds are perceived by people in the position of the product's users, rather than its creators.

I don't have a lot to add to Brent Edwards' discussion, except to observe that the actor who produced Baby David's voice is using a fundamental frequency (pitch) that peaks (in "go"/"die") at about 690 Hz, or about f2 (the second F above middle C), well into the soprano range. This is higher than the lowest vocal tract resonance for a vowel like [o], and produces a set of overtones (690, 1380, 2070, 2760) that will not fill in the standard resonance pattern for such a vowel very clearly, since the resonance peaks are likely to be at roughly 500, 1000 and 2500 Hz. In fact, it's a bit of a mystery why such high voices can be understood at all -- presumably it has to do with the brain's ability to recover the resonance pattern from the way that time-varying overtones sweep across it -- but listeners will still sometimes mistake overtones for resonances.
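
For anyone who wants to check the arithmetic, here is a minimal Python sketch. It lists the overtones of a 690 Hz fundamental, finds how far the nearest overtone falls from each of the rough resonance peaks mentioned above (500, 1000 and 2500 Hz), and names the nearest musical note. The function names are my own illustrative choices, the note lookup assumes equal temperament with A4 = 440 Hz, and it uses scientific pitch notation, in which middle C is C4 and the second F above it is F5.

```python
import math

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def harmonics(f0_hz, count):
    """Return the first `count` integer multiples of the fundamental."""
    return [f0_hz * k for k in range(1, count + 1)]

def nearest_harmonic_gaps(f0_hz, resonances_hz, count=4):
    """For each resonance peak, return (peak, closest harmonic, gap in Hz)."""
    hs = harmonics(f0_hz, count)
    gaps = []
    for peak in resonances_hz:
        closest = min(hs, key=lambda h: abs(h - peak))
        gaps.append((peak, closest, abs(closest - peak)))
    return gaps

def nearest_note(freq_hz, a4_hz=440.0):
    """Name the nearest equal-tempered note (assumes A4 = 440 Hz)."""
    midi = round(69 + 12 * math.log2(freq_hz / a4_hz))
    return f"{NOTE_NAMES[midi % 12]}{midi // 12 - 1}"

F0_HZ = 690                       # peak pitch cited in the post
RESONANCES_HZ = [500, 1000, 2500] # rough resonance peaks cited in the post

print("overtones:", harmonics(F0_HZ, 4))
print("nearest note:", nearest_note(F0_HZ))
for peak, overtone, gap in nearest_harmonic_gaps(F0_HZ, RESONANCES_HZ):
    print(f"resonance near {peak} Hz: closest overtone {overtone} Hz, off by {gap} Hz")
```

With these numbers, no overtone lands within 190 Hz of any of the three peaks, which is one way of seeing why the resonance pattern is so sparsely sampled at this pitch.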

I guess I could also observe that aside from the effects of compression distortion and high fundamental frequency, many American youth have hardly any high back vowels left at all anyhow, having fronted /u/ and /o/ in pretty much every context. But the quality of the available recording in this case is very poor -- I recorded it from the compressed stream of a TV clip available on the internet, so that we have the original compression, the TV technician's recording of the book's playback over its tiny little speaker, and the compression involved in the TV clip's distribution as well. With a low-quality nth-generation copy like this, it's hard to assign clear causes to the obvious effect.

Posted by Mark Liberman at February 3, 2006 06:35 AM