A couple of weeks ago, David Pogue wrote in the NYT about speech recognition in a way that might lead you to believe that all the problems are solved ("Like Having a Secretary in your PC", 7/20/2006). On the other hand, this more recent video of a demo gone awry might lead you to believe that the technology doesn't work at all:
The truth, as far as I can tell, is somewhere in between. The technology is getting better all the time, and most people can use dictation software effectively if they want or need to. But the technology is also rather fragile, and unexpected acoustic conditions can wreak havoc with speech-to-text systems in cases where normal human ears don't detect any particular problem. I would be surprised to learn that there is any significant difference in basic recognition performance between the Dragon system that Pogue raved about and the Microsoft system that experienced the embarrassing demo failure.
[Hat tip: Spreeblick.]
[Update -- there are a couple of weblog entries from people involved in speech development at Microsoft, explaining what happened: Rob Chambers' post "Vista SR Demo failure -- and now you know the rest of the story" (7/29/2006), and Larry Osterman's post "Wait, that was my bug? Ouch!" (7/31/2006). Here's Larry Osterman's explanation:
About a month ago (more-or-less), we got some reports from an IHV that sometimes when they set the volume on a capture stream the actual volume would go crazy (crazy, for those that don't know, is a technical term). [...] The annoying thing about it was that the bug wasn't reproducible - every time he stepped through the code in the debugger, it worked perfectly, but it kept failing when run without any traces.
If you've worked with analog audio, it's pretty clear what's happening here - there's a timing issue that is causing a positive feedback loop that resulted from a signal being fed back into an amplifier.
It turns out that one of the common causes of feedback loops in software is a concurrency issue with notifications - a notification is received with new data, which updates a value, updating the value causes a new notification to be generated, which updates a value, updating the value causes a new notification, and so-on...
The code actually handled most of the feedback cases involving notifications, but there were two lower level bugs that complicated things. The first bug was that there was an incorrect calculation that occurred when handling one of the values in the notification, and the second was that there was a concurrency issue - a member variable that should have been protected wasn't (I'm simplifying what actually happened, but this suffices).
As a consequence of these two very subtle low level bugs, the speech recognition engine wasn't able to correctly control the gain on the microphone, when it did, it hit the notification feedback loop, which caused the microphone to clip, which meant that the samples being received by the speech recognition engine weren't accurate.
There were other contributing factors to the problem (the bug was fixed on more recent Vista builds than the one they were using for the demo, there were some issues with way the speech recognition engine had been "trained", etc), but it doesn't matter - the problem wouldn't have been nearly as significant.
This confirms my view that the problem had nothing to do with the basic quality of the speech recognition software, where Microsoft's systems are no doubt roughly comparable with the Dragon software that Pogue liked so much (and for all I know, might even be better). All contemporary commercial speech-to-text systems use the same basic design, though the implementations are of course different. As a result, they have roughly the same strengths and weaknesses.
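To make the notification feedback loop in Osterman's explanation concrete, here is a toy simulation in Python. It is only a sketch under my own assumptions, not the Vista audio code: the function `run_notifications`, its parameters, and the 1.25 scale factor are all hypothetical. A handler receives a gain notification, recomputes the gain, and writes it back, which raises another notification. The `guard` flag stands in for the usual protection of not re-notifying when the value is unchanged.

```python
from collections import deque


def run_notifications(initial_gain, handler_scale, guard, max_steps=50):
    """Simulate a gain setting whose change notifications are handled by
    code that writes the gain back (hypothetical model, not real Vista code).

    Returns (final_gain, notifications_handled).
    """
    gain = initial_gain
    pending = deque([initial_gain])  # queue of notified gain values
    handled = 0
    while pending and handled < max_steps:
        notified = pending.popleft()
        handled += 1
        # The handler's calculation; a scale other than 1.0 models a wrong one.
        new_gain = min(1.0, notified * handler_scale)
        if guard and new_gain == gain:
            continue  # value unchanged, so no new notification is raised
        gain = new_gain
        pending.append(gain)  # setting the value raises another notification
    return gain, handled


# Correct calculation with the guard: one notification and the loop ends.
print(run_notifications(0.5, 1.0, guard=True))
# No guard: notifications keep arriving until the step cap is reached.
print(run_notifications(0.5, 1.0, guard=False))
# Guard present but the calculation is wrong: the gain climbs to its maximum.
print(run_notifications(0.5, 1.25, guard=True))
```

In this toy model the guard alone stops the loop only when the handler's calculation is right. With a faulty calculation, every pass changes the value, so the guard never triggers until the gain is pinned at its maximum, which is roughly the situation in which a microphone would clip.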
It would be interesting to listen to the audio stream as received by the ASR engine during that demo. Though the result of the gain-control bug might have been totally unintelligible noise, it also might well have been seriously distorted-sounding, but still intelligible to human listeners. But massively clipped or otherwise distorted audio causes big problems for all current speech analysis methods. You could think of this as a sort of unplanned audio CAPTCHA puzzle, and in the general case, computer algorithms are (so far) no better at solving such puzzles for sounds than they are for images. ]
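For readers who have not seen clipping, here is a small Python sketch of what excessive gain does to a signal. It is an illustration under my own assumptions (a pure sine wave, a limit of 1.0, and the hypothetical helpers `clip` and `clipped_fraction`), not a reconstruction of the audio in the demo. It measures what fraction of samples get flattened at the limit once the gain is applied.

```python
import math


def clip(sample, limit=1.0):
    """Hard-limit a sample to the range [-limit, limit]."""
    return max(-limit, min(limit, sample))


def clipped_fraction(gain, n=1000, cycles=5):
    """Fraction of samples of a unit sine wave that are altered by
    clipping after the given gain is applied."""
    samples = [gain * math.sin(2 * math.pi * cycles * i / n) for i in range(n)]
    altered = sum(1 for s in samples if clip(s) != s)
    return altered / n


# Modest gain: nothing exceeds the limit, so the waveform is untouched.
print(clipped_fraction(0.5))
# Runaway gain: most samples are flattened, leaving something close to a square wave.
print(clipped_fraction(10.0))
```

With runaway gain, nearly all of the waveform's shape is replaced by flat tops and bottoms. A human listener can often still follow speech distorted this way, but the spectral features that recognizers depend on are badly altered.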
Posted by Mark Liberman at August 1, 2006 04:43 PM