I thought I'd try out Blinkx to see if the transcriptions it creates for searching video clips might have any value for linguistic research. So far, it looks like the television transcripts are more valuable as evidence of how far the company's automated speech-to-text technology still has to go. I took a look at four snippets of transcripts (two from US-based FOX News and two from UK-based GMTV), which I happened to discover while searching on terms of interest to me and finding false matches.
A search on Malaysia, for instance, turns up this transcript snippet from FOX News, Oct. 25, 2005:
the conviction idea my principal is getting very bad this sort of not Malaysia but this sordid tale the feeling in Washington
A look at the video reveals that this comes from an interview with Sen. Bill Frist on "Hannity & Colmes." Here's the relevant section of the transcript as it appears on the Nexis news database:
Right now there's not a lot on offense, leading by conviction, by idea, by principle. It's getting buried by this sort of, not malaise, but this sort of down feeling in Washington.
Lining up the transcripts, we get this unit-by-unit comparison:
Nexis: [right now there's not a lot on offense leading] by conviction by idea by principle
Blinkx: the conviction idea my principal

Nexis: it's getting buried by this sort of not malaise
Blinkx: is getting very bad this sort of not Malaysia

Nexis: but this sort of down feeling in Washington
Blinkx: but this sordid tale the feeling in Washington
This example has an error rate comparable to many of the transcriptions I've seen from Blinkx. We can see that speech-recognition errors often emerge due to quasi-homonymic pairs (malaise-Malaysia, sort of-sordid). Other errors stem from the software's reliance on collocational data — for instance, the transcription follows sordid with tale, which appears to be nothing more than a hunch based on the frequency of the collocation sordid tale.
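One common way to put a rough number on "error rate" is word error rate (WER): the count of word substitutions, insertions, and deletions needed to turn the automated transcript into the reference, divided by the length of the reference. What follows is a minimal sketch, not anything Blinkx is known to use. The function name `word_error_rate` and the crude lowercase, whitespace-based tokenization are assumptions made for illustration; the two strings are the Nexis and Blinkx versions of the Frist passage quoted above, with the unmatched opening words left out.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    # Assumption: lowercase and split on whitespace; punctuation is not handled.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


nexis = ("by conviction by idea by principle it's getting buried by this "
         "sort of not malaise but this sort of down feeling in Washington")
blinkx = ("the conviction idea my principal is getting very bad this sort "
          "of not Malaysia but this sordid tale the feeling in Washington")
print(f"WER: {word_error_rate(nexis, blinkx):.2f}")
```

Because insertions count as errors, the figure can exceed 1.0 when the automated transcript is much longer than the reference.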
Frist's speech is relatively measured and proceeds without
interruption from the interviewers. We would naturally expect a higher
error rate for speech that is rapid, overlapping, or indistinct. Here's
a more muddled example from FOX News, Dec. 3, 2005:
no one should not we should be cutting education likely of people in his mind and a shock to the ballot identify cardiac arrest and taken from Malaysia to a town in shock everybody clear
In the video we see that this is taken from a demonstration of a home defibrillator by Ed Stapleton of the American Heart Association. Stapleton is explaining how to use the defibrillator as recorded instructions issue from the device (indicated in italics):
No one should touch the patient.
Nobody should be touching the patient. I'd clear people at this point.
Analyzing. Shock advised.
Now it's identified cardiac arrest, ventricular defibrillation. It will tell me to shock. Everybody clear.
Here is how the transcriptions line up:
Actual: no one should touch the patient nobody should be touching the patient
Blinkx: no one should not we should be cutting education

Actual: analyzing shock advised I'd clear people at this point
Blinkx: likely of people in his mind and a shock to the

Actual: now it's identified cardiac arrest ventricular defibrillation
Blinkx: ballot identify cardiac arrest and taken from Malaysia

Actual: it will tell me to shock everybody clear
Blinkx: to a town in shock everybody clear
The error rate is predictably higher here, and we also see some runaway collocational guesswork going on, with "...ventricular defibrillation. It will tell me..." getting transformed into "...and taken from Malaysia to a town in...".
Next let's turn to the two GMTV examples. (I collected these last week, but now it appears that GMTV's transcript snippets are no longer displayed in the Blinkx search results, replaced instead by descriptions of the videos. The video clips will still appear when words in the transcript text are searched on, though now one needs to play the clip to determine if a particular match is false or not.) The first video is an interview with the American recording artist John Legend on Sep. 1, 2005, which includes this exchange about his album "Get Lifted":
GM: Oh it's just lovely. I think it's good music for the whole time. There's just been a resurgence of real music, hasn't there, lately?
JL: A little bit, yeah.
GM: Well certainly in this country anyway. "Get Lifted" has sold millions everywhere...
Here is how the actual transcript compares with a snippet that
Blinkx provided last week:
Actual: oh it's just lovely I think it's good music for the whole time
Blinkx: see him about his kidney for the whole time

Actual: there's just been a resurgence of real music hasn't there lately
Blinkx: this has been a resurgence of real music as a Malay

Actual: a little bit yeah well certainly in this country anyway
Blinkx: I'm a bit here whatsoever in this country anyway

Actual: Get Lifted has sold millions everywhere
Blinkx: get the lift it has sold millions everywhere
There are decent stretches in this example, though the whole thing is rather undermined by the inexplicable transmogrification of "Oh it's just lovely, I think, it's good music..." into "See him about his kidney."
Finally, here is an excerpt from a GMTV interview with Emma Thompson about her movie Nanny McPhee, broadcast on Oct. 18, 2005:
ET: Cos you know what it was, her great-grandfather used to make up stories about her naughtiness when she was little, and then she wrote it into this kind of made up, he'd made up this awful, dreadful-looking nanny.
GM: Nurse Matilda.
ET: Called Nurse Matilda, exactly, and we changed that for obvious reasons...
And here is the comparison with the Blinkx snippet:
Actual: [her great-grandfather used to make up stories about her naughtiness when she was little] and then she wrote it into this kind of made up
Blinkx: Valenti rate here into this camp is made up

Actual: he'd made up this awful dreadful looking nanny
Blinkx: he'd made up of all four were dreadful looking man he

Actual: Nurse Matilda called Nurse Matilda exactly
Blinkx: knew LASIK lawyer called natural hair day factory

Actual: and we changed that for obvious reasons
Blinkx: we save that for obvious reasons
Again, reliable bits of transcription are undercut by bizarre transformations. "Nurse Matilda" first turns into "knew LASIK lawyer" and then moments later becomes "natural hair day." It's possible that the speech-recognition software used by Blinkx has been trained to work more effectively on U.S. accents, increasing the error rate when transcribing the speech of U.K. speakers.
In the AP article I mentioned in Part 1 of this post, one of the founders of Blinkx acknowledged the limitations of their speech-recognition technology but remained optimistic about future progress:
The good news for speech-to-text services is that they might improve with use. That's partly because the engines can learn better ways to determine words from their context.
Blinkx co-founder Suranga Chandratillake illustrates the process this way: If a podcast were made about the topics in this story, a computer probably would be right if it detected the phrase "recognize speech."
But in a podcast about last year's tsunami, the computer would do better to hear almost the same sounds as "wreck a nice beach."
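The idea in that quotation, choosing between near-identical-sounding candidates according to the surrounding topic, can be shown with a deliberately tiny sketch. This is a toy stand-in, not a description of how Blinkx or any real engine works; production systems rely on statistical language models rather than simple word overlap. The function name `pick_by_context` and both context word lists are hypothetical, invented only for illustration.

```python
def pick_by_context(candidates: list[str], context_words: set[str]) -> str:
    """Return the candidate sharing the most words with the topic vocabulary."""
    context = {word.lower() for word in context_words}

    def overlap(candidate: str) -> int:
        # Count candidate words that also occur in the context vocabulary.
        return sum(1 for word in candidate.lower().split() if word in context)

    # On a tie, max() keeps the earliest candidate in the list.
    return max(candidates, key=overlap)


candidates = ["recognize speech", "wreck a nice beach"]

# Hypothetical topic vocabularies, made up for this example.
speech_story = {"speech", "recognition", "transcript", "podcast"}
tsunami_story = {"tsunami", "beach", "wave", "wreck", "coast"}

print(pick_by_context(candidates, speech_story))
print(pick_by_context(candidates, tsunami_story))
```

With these made-up vocabularies the first call favors "recognize speech" and the second favors "wreck a nice beach", which is the behavior Chandratillake describes. The same mechanism also hints at how context-driven guessing can run away with itself when the context is misjudged.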
For now, if nothing else, the automated transcriptions provided by Blinkx are at least good for some comic relief.
Posted by Benjamin Zimmer at December 13, 2005 11:30 PM