April 06, 2005

I didn't know...

that the audio from some broadcasts and other sources can now be searched via transcripts produced automatically by speech recognition software, using an HP system called SpeechBot.

Here's the SpeechBot search page for WBUR's Here and Now, which still features the Compaq logo.

I tried searching for umami but got nothing -- it's probably an out-of-vocabulary ("OOV") word, though I guess it's also possible that today's broadcasts haven't been indexed yet.

A query for {"silver leaf gospel"} got me the April 1 story "A Joyful Noise" that I was searching for, though. The surrounding bit of transcript looked like this in the ASR output:

...this is a road it's english move the use of the only city to the phrase day that that I now for more information on the silver leaf gospel singers of roxbury massachusetts go to our with you that that that in the end of the man in the state hanging on and they go on and I can be a reduction in the view are obliged to me to ..

The words in the associated clip actually seem to be something like:

Deacon Randy Green: ...like I said, that the Lord has uh much more for me to do as I always say, he ain't through with me yet.

Singers: ... ((we ain't got no)) Three gates ((will)) open over here, I got my religion and I won't be late.

Robin Young: For more information on the Silver Leaf Gospel Singers of Roxbury, Massachusetts, go to our web site here dash now dot org.

Singers: ... gates to the city, hallelu, hallelu.

though it's hard to tell, in places, because the singers are always in the background.

This example shows off two of the worst aspects of current speech recognition technology: the lack of robust "diarization" (i.e. keeping track of who is talking when, rather than running everything together as if it were from a single source), and the poor handling of overlapping speech, speech over music, etc. Still, at least it accomplished the indexing that I asked for!

Looking for "pope john paul" found six extracts from the April 1 show (which therefore must indeed be the most recent day indexed), with the most relevant passage (or at least the one presented first) being given as:

...in the holy cross and mr. massey and senora furnaces says his sentencing and that is instances the pope's condition has worsened you're listening to here and now you're a growing young is here and now and if you just joined us the vatican has just released a statement saying that to a pope john paul the second's conditions has seriously worsened we are following the situation in rome where pope john paul the 2nd would seem to be close to death and we're also speaking here in united states today that o'brien professor of history at the college of holy cross in worcester massachusetts an expert on the american catholic church and david welcome back to his and joining us and now the studio to hear now ...

My transcript of the associated clip is

[Female speaker]: ... of the Holy Cross in Worcester, Mass; we're going to continue our conversation in a few seconds; again, the Vatican statement says the Pope's condition has worsened. You're listening to Here and Now; we'll be right back.

Robin Young: I'm Robin Young, it's Here and Now, and if you've just joined us, the Vatican has just released a statement saying that uh Pope John Paul the Second's conditioned [sic] {breath} has seriously worsened. Uh we are following the situation in Rome uh where Pope John Paul the Second seems to be close to death, and we're also speaking here in the United States to David O'Brian, professor of history at the College of Holy Cross in Worcester, Massachusetts, an expert on the American Catholic Church. Uh David, welcome back --

David O'Brian: Hi. ((Glad to be here))

Robin Young: And joining us- and joining us uh now in the studio here and now is uh ...

Again, pretty good indexing; semi-crappy transcript; lack of diarization and other punctuation-type formatting makes the ASR transcript pretty hard to read, even where it's mostly correct.

Though it's hard to tell from two short passages, the speech-recognition engine used in this system seems to be a generation or two behind the state of the art. These days, the best systems should be able to achieve an overall word error rate of about 10% on broadcast material. These two passages are not ideal test material because of the background music (fairly loud in the first one, softer in the second), so even a state-of-the-art system would be expected to do somewhat worse than that on them.
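For readers who haven't met the term, word error rate is usually computed as the word-level edit distance (substitutions + deletions + insertions) between the recognizer's output and a reference transcript, divided by the number of words in the reference. Here is a minimal Python sketch of that calculation. The function names and the normalization (lowercasing, stripping punctuation) are my own illustrative choices, not anything taken from SpeechBot, and the test fragment is just the stretch of the first passage where the ASR output and my transcript line up.

```python
import string


def normalize(text: str) -> list[str]:
    """Lowercase, strip ASCII punctuation, and split on whitespace.

    This normalization is an assumption made for illustration; real
    scoring tools apply more elaborate rules.
    """
    table = str.maketrans("", "", string.punctuation)
    return text.lower().translate(table).split()


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by reference length."""
    ref = normalize(reference)
    hyp = normalize(hypothesis)
    if not ref:
        raise ValueError("reference transcript is empty")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1        # reference word missing from output
            insertion = curr[j - 1] + 1   # extra word in the output
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Fragment of my transcript and the matching stretch of ASR output.
REFERENCE = ("For more information on the Silver Leaf Gospel Singers "
             "of Roxbury, Massachusetts, go to our web site")
HYPOTHESIS = ("for more information on the silver leaf gospel singers "
              "of roxbury massachusetts go to our with you")

print(f"WER on this fragment: {word_error_rate(REFERENCE, HYPOTHESIS):.1%}")
```

On this fragment there are two substitutions ("web site" came out as "with you") against 17 reference words, so the rate is about 12%. That is one of the cleanest stretches in either passage; scored over the whole of the two clips, the rate looks considerably higher.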

However, the overall application design is impressive, and the indexing performance is decent. A sign of things to come, I think.

Posted by Mark Liberman at April 6, 2005 04:57 AM