December 13, 2005

From TV to text, Part 2

I thought I'd try out Blinkx to see if the transcriptions it creates for searching video clips might have any value for linguistic research. So far, it looks like the television transcripts are more valuable as evidence of how far the company's automated speech-to-text technology still has to go. I took a look at four snippets of transcripts (two from US-based FOX News and two from UK-based GMTV), which I happened to discover while searching on terms of interest to me and finding false matches.

A search on Malaysia, for instance, turns up this transcript snippet from FOX News, Oct. 25, 2005:

the conviction idea my principal is getting very bad this sort of not Malaysia but this sordid tale the feeling in Washington

A look at the video reveals that this comes from an interview with Sen. Bill Frist on "Hannity & Colmes." Here's the relevant section of the transcript as it appears on the Nexis news database:

Right now there's not a lot on offense, leading by conviction, by idea, by principle. It's getting buried by this sort of, not malaise, but this sort of down feeling in Washington.

Lining up the transcripts, we get this unit-by-unit comparison:

Actual: [right now there's not a lot on offense leading] by conviction by idea by principle
Blinkx: the conviction idea my principal

Actual: it's getting buried by this sort of not malaise
Blinkx: is getting very bad this sort of not Malaysia

Actual: but this sort of down feeling in Washington
Blinkx: but this sordid tale the feeling in Washington

This example has an error rate comparable to many of the transcriptions I've seen from Blinkx. We can see that speech-recognition errors often emerge due to quasi-homonymic pairs (malaise-Malaysia, sort of-sordid). Other errors stem from the software's reliance on collocational data — for instance, the transcription follows sordid with tale, which appears to be nothing more than a hunch based on the frequency of the collocation sordid tale.
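One common way to put a number on "error rate" is word error rate: the word-level edit distance between the reference transcript and the machine transcript, divided by the number of words in the reference. The following is a minimal Python sketch of that calculation applied to the aligned part of the Frist snippet. The function names and the normalization (lowercasing, dropping punctuation) are assumptions made for illustration, not anything Blinkx has described.

```python
import re

def words(text):
    """Lowercase the text and pull out word tokens (an assumed normalization)."""
    return re.findall(r"[a-z']+", text.lower())

def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the reference length."""
    ref, hyp = words(reference), words(hypothesis)
    if not ref:
        raise ValueError("reference transcript is empty")
    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,              # word dropped
                            curr[j - 1] + 1,          # word inserted
                            prev[j - 1] + (r != h)))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)

# The Nexis text (minus the opening words missing from the snippet)
# and the Blinkx snippet quoted above.
reference = ("by conviction, by idea, by principle. It's getting buried by "
             "this sort of, not malaise, but this sort of down feeling in "
             "Washington.")
hypothesis = ("the conviction idea my principal is getting very bad this "
              "sort of not Malaysia but this sordid tale the feeling in "
              "Washington")

print(f"word error rate: {word_error_rate(reference, hypothesis):.2f}")
```

The measure counts every wrong word equally, so it treats a near miss like principal for principle the same as a wild guess like tale.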

Frist's speech is relatively measured and proceeds without interruption from the interviewers. We would naturally expect a higher error rate for speech that is rapid, overlapping, or indistinct. Here's a more muddled example from FOX News, Dec. 3, 2005:

no one should not we should be cutting education likely of people in his mind and a shock to the ballot identify cardiac arrest and taken from Malaysia to a town in shock everybody clear

In the video we see that this is taken from a demonstration of a home defibrillator by Ed Stapleton of the American Heart Association. Stapleton (ES) is explaining how to use the defibrillator as recorded instructions issue from the device (marked "Device"):

Device: No one should touch the patient.
ES: Nobody should be touching the patient. I'd clear people at this point.
Device: Analyzing. Shock advised.
ES: Now it's identified cardiac arrest, ventricular defibrillation. It will tell me to shock. Everybody clear.

Here is how the transcriptions line up:

Actual: no one should touch the patient
Blinkx: no one should

Actual: nobody should be touching the patient
Blinkx: not we should be cutting education

Actual: analyzing shock advised / I'd clear people at this point
Blinkx: likely of people in his mind and a shock to the

Actual: now it's identified cardiac arrest ventricular defibrillation
Blinkx: ballot identify cardiac arrest and taken from Malaysia

Actual: it will tell me to shock everybody clear
Blinkx: to a town in shock everybody clear

The error rate is predictably higher here, and we also see some runaway collocational guesswork going on, with "...ventricular defibrillation. It will tell me..." getting transformed into "...and taken from Malaysia to a town in...".
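A rough version of this kind of line-up can be produced automatically. The sketch below uses Python's standard difflib module to find the runs of words that the two transcripts share and to pair off whatever falls between them. This is an assumption-laden illustration: difflib looks for long common runs rather than for similar-sounding words, so its pairings will not always match a comparison made by ear, and the function name line_up is invented here.

```python
from difflib import SequenceMatcher

def line_up(reference, hypothesis):
    """Return (tag, reference words, hypothesis words) for each aligned chunk."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # autojunk=False turns off difflib's heuristic for very common items.
    matcher = SequenceMatcher(a=ref, b=hyp, autojunk=False)
    rows = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # tag is one of "equal", "replace", "delete", "insert"
        rows.append((tag, " ".join(ref[i1:i2]), " ".join(hyp[j1:j2])))
    return rows

# The end of the defibrillator exchange and the matching Blinkx text.
reference = ("now it's identified cardiac arrest ventricular defibrillation "
             "it will tell me to shock everybody clear")
hypothesis = ("ballot identify cardiac arrest and taken from Malaysia "
              "to a town in shock everybody clear")

for tag, ref_part, hyp_part in line_up(reference, hypothesis):
    print(f"{tag:8} | {ref_part or '-':45} | {hyp_part or '-'}")
```

The "equal" rows pick out the islands of correct transcription, such as "cardiac arrest" and "shock everybody clear", and the rows between them show what the software substituted.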

Next let's turn to the two GMTV examples. (I collected these last week, but now it appears that GMTV's transcript snippets are no longer displayed in the Blinkx search results, replaced instead by descriptions of the videos. The video clips will still appear when words in the transcript text are searched on, though now one needs to play the clip to determine if a particular match is false or not.) The first video is an interview with the American recording artist John Legend on Sep. 1, 2005, which includes this exchange about his album "Get Lifted":

GM: Oh it's just lovely. I think it's good music for the whole time. There's just been a resurgence of real music, hasn't there, lately?
JL: A little bit, yeah.
GM: Well certainly in this country anyway. "Get Lifted" has sold millions everywhere...

Here is how the actual transcript compares with a snippet that Blinkx provided last week:

Actual: oh it's just lovely I think it's good music
Blinkx: see him about his kidney

Actual: for the whole time
Blinkx: for the whole time

Actual: there's just been a resurgence of real music hasn't there lately
Blinkx: this has been a resurgence of real music as a Malay

Actual: a little bit yeah well certainly in this country anyway
Blinkx: I'm a bit here whatsoever in this country anyway

Actual: Get Lifted has sold millions everywhere
Blinkx: get the lift it has sold millions everywhere

There are decent stretches in this example, though the whole thing is rather undermined by the inexplicable transmogrification of "Oh it's just lovely, I think, it's good music..." into "See him about his kidney."

Finally, here is an excerpt from a GMTV interview with Emma Thompson about her movie Nanny McPhee, broadcast on Oct. 18, 2005:

ET: Cos you know what it was, her great-grandfather used to make up stories about her naughtiness when she was little, and then she wrote it into this kind of made up, he'd made up this awful, dreadful-looking nanny.
GM: Nurse Matilda.
ET: Called Nurse Matilda, exactly, and we changed that for obvious reasons...

And here is the comparison with the Blinkx snippet:

Actual: [her great-grandfather used to make up stories about her naughtiness when she was little] and then she wrote it into this kind of made up
Blinkx: Valenti rate here into this camp is made up

Actual: he'd made up this awful dreadful looking nanny
Blinkx: he'd made up of all four were dreadful looking man he

Actual: Nurse Matilda / called Nurse Matilda exactly
Blinkx: knew LASIK lawyer called natural hair day factory

Actual: and we changed that for obvious reasons
Blinkx: we save that for obvious reasons

Again, reliable bits of transcription are undercut by bizarre transformations. "Nurse Matilda" first turns into "knew LASIK lawyer" and then moments later becomes "natural hair day." It's possible that the speech-recognition software used by Blinkx has been trained to work more effectively on U.S. accents, increasing the error rate when transcribing the speech of U.K. speakers.

In the AP article I mentioned in Part 1 of this post, one of the founders of Blinkx acknowledged the limitations of their speech-recognition technology but remained optimistic about future progress:

The good news for speech-to-text services is that they might improve with use. That's partly because the engines can learn better ways to determine words from their context.

Blinkx co-founder Suranga Chandratillake illustrates the process this way: If a podcast were made about the topics in this story, a computer probably would be right if it detected the phrase "recognize speech."

But in a podcast about last year's tsunami, the computer would do better to hear almost the same sounds as "wreck a nice beach."
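The idea in that illustration can be sketched as a toy calculation: given two candidate transcriptions of the same sounds, prefer the one whose words are more at home in the surrounding text. The Python below is only a sketch under stated assumptions. The two context passages are invented for the example, the scoring is simple add-one-smoothed word frequency, and nothing here is claimed to be how Blinkx's engine works (a real recognizer would also weigh how well each candidate fits the audio).

```python
import math
from collections import Counter

def pick_by_context(candidates, context, alpha=1.0):
    """Pick the candidate whose words are most frequent in the context.

    Scores are average log probabilities per word, with add-alpha
    smoothing, so a longer candidate is not penalized for its length.
    """
    context_words = context.lower().split()
    counts = Counter(context_words)
    # Shared vocabulary: context words plus every candidate's words.
    vocab = set(context_words)
    for candidate in candidates:
        vocab.update(candidate.lower().split())
    denom = len(context_words) + alpha * len(vocab)

    def avg_log_prob(candidate):
        ws = candidate.lower().split()
        return sum(math.log((counts[w] + alpha) / denom) for w in ws) / len(ws)

    return max(candidates, key=avg_log_prob)

candidates = ["recognize speech", "wreck a nice beach"]

# Hypothetical context passages, made up for this example.
speech_context = ("speech recognition software must recognize speech "
                  "from audio and turn speech into text")
tsunami_context = ("the tsunami hit the beach and the waves left a wreck "
                   "on the nice beach resort")

print(pick_by_context(candidates, speech_context))
print(pick_by_context(candidates, tsunami_context))
```

With these made-up passages the first call favors "recognize speech" and the second favors "wreck a nice beach." The same mechanism, pushed too hard, would also account for guesses like sordid tale, where a familiar word sequence wins out over what was actually said.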

For now, if nothing else, the automated transcriptions provided by Blinkx are at least good for some comic relief.

Posted by Benjamin Zimmer at December 13, 2005 11:30 PM