If you haven't already done so, go check out BBN's Podzinger service for searching podcasts. More exactly, according to its current banner, it's "searching 48797 podcasts" -- and growing. Podzinger applies automatic speech recognition to turn podcasts into text, and lets you search the stored texts for words or strings. You can sort results by date or by "Relevancy". Each hit is shown with the harvested title and abstract of the podcast, and 25 words or so of textual context around the matched search term -- more if there are several matches within the window shown -- and an indication of the time point in the podcast where the match occurred. In principle, Podzinger lets you access the audio of the podcast at the point of the match (although in my experience this often doesn't work due to server load or other issues), and it gives you links for the original source of the podcast (URL or RSS).
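For readers who want a feel for what such a search involves, here is a minimal sketch of keyword-in-context lookup over a time-stamped transcript. It is not Podzinger's implementation, which isn't public in any detail I know of. The function names, the `window` parameter and the per-word start times are all assumptions made up for illustration.

```python
def format_time(seconds):
    """Format a time offset in seconds as H:MM:SS."""
    hours, rest = divmod(int(seconds), 3600)
    minutes, secs = divmod(rest, 60)
    return f"{hours}:{minutes:02d}:{secs:02d}"


def search_transcript(timed_words, term, window=12):
    """Return (timestamp, snippet) pairs for each occurrence of term.

    timed_words is a list of (word, start_time_in_seconds) pairs, the
    kind of output a speech recognizer could plausibly produce.
    """
    term = term.lower()
    hits = []
    for i, (word, start) in enumerate(timed_words):
        if word.lower() == term:
            first = max(0, i - window)
            snippet = " ".join(w for w, _ in timed_words[first:i + window + 1])
            hits.append((format_time(start), snippet))
    return hits


# Hypothetical data: the words come from a news transcript,
# but the start times are invented for this example.
timed_words = [("zoellick", 274.0), ("is", 274.6), ("in", 274.8),
               ("beijing", 275.0), ("where", 275.6)]
print(search_transcript(timed_words, "Beijing", window=1))
```

A window of 12 words on each side gives roughly the 25 words of context described above.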
I should say right up front that I think Podzinger is terrific. I've been using it for several days with considerable satisfaction. And it's an excellent display of the strengths and weaknesses of state-of-the-art speech recognition technology.
To start your own tour, try a search term like Beijing. You'll get some very plausible hits, especially from podcasts of news programs. One that I got this morning was this stretch, from 04:34 in NPR's January 24th 10:00 a.m. news summary:
Deputy secretary of state robert zoellick is in beijing where he began talks today with senior chinese officials the nuclear standoff with iran and north korea are high on the agenda The two sides also are expected to discuss bilateral relations and preparations for a strategic dialogue later this year China is the host of six party talks aimed at ending north korea's nuclear weapons ambitions -- visit to beijing follows a recent visit by north korean leader kim jong il i'm carl -- NPR news in washington ...
This has the ring of truth -- without even bothering to check, I'm confident that this transcript is mostly correct. Except for the lack of appropriate punctuation and capitalization, it's pretty readable. And all things considered, I think this is an extraordinary achievement. Before we get to some of the areas where today's speech recognition technology still needs improvement, we should pause and reflect on how good these programs have gotten to be.
Sometimes.
Speech-to-text (STT) programs are still heavily dependent on their "language model" -- their statistical appreciation of what words and word sequences are likely to occur -- and still find reverberant (and otherwise distorted) recordings difficult. I imagine that these are the factors that led Podzinger to return, as the first hit on my search this morning for Beijing, a passage at 0:28:42 of a sermon titled "Crown Him with Many Crowns", which it renders as:
... by the dot all eaten the curse and the -- in beijing this morning -- finest art so against -- has -- decade remote wooded delight When music scene it is my lord ...
Though I can't identify any particular theological error, this hardly seems like a suitable message to be delivered from the pulpit. When I listen to the appropriate section of the podcast, I hear it as:
(no one speaking) by the spirit of God calls Jesus a curse, and no one can say Jesus is Lord except by the spirit. So yes, you s- have said there came a moment in your life when you said Jesus is my Lord ...
The recording is a bit reverberant, and it's about topics that are not often featured in the newswire text that Podzinger's language model is apparently trained on, but it's not at all hard to follow for a human listener.
You can see what has happened, to some extent, if we line up the passages wordwise:
Podzinger: by the dot all eaten the curse
Heard: by the spirit of God calls Jesus a curse
Here the "spirit" is missing, probably because the phrase "spirit of God" is spoken very rapidly, and the word sequence "God calls Jesus" has been rendered as "dot all eaten".
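A line-up of this kind can be approximated automatically. What follows is a minimal sketch using `difflib.SequenceMatcher` from Python's standard library, which finds matching runs of words rather than computing the minimum-edit-distance alignment that speech recognition scoring tools normally use. The function name `align_words` is my own, and the lower-casing is an assumption made so that capitalization differences don't count as errors.

```python
from difflib import SequenceMatcher


def align_words(hypothesis, reference):
    """Return (tag, hypothesis_words, reference_words) triples.

    tag is one of 'equal', 'replace', 'delete' or 'insert', describing
    how a stretch of the hypothesis relates to the reference.
    """
    hyp = hypothesis.lower().split()
    ref = reference.lower().split()
    matcher = SequenceMatcher(a=hyp, b=ref, autojunk=False)
    return [(tag, hyp[i1:i2], ref[j1:j2])
            for tag, i1, i2, j1, j2 in matcher.get_opcodes()]


hypothesis = "by the dot all eaten the curse"
reference = "by the spirit of God calls Jesus a curse"
for tag, hyp_words, ref_words in align_words(hypothesis, reference):
    print(f"{tag:8} {' '.join(hyp_words):20} | {' '.join(ref_words)}")
```

The matching words at the edges ("by the", "curse") anchor the alignment, and everything between them comes out as a single substituted stretch. Deciding which word inside that stretch corresponds to which is still a job for a human ear.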
Podzinger: and the -- in beijing this morning
Heard: and no one can say Jesus is Lord
This time "say Jesus is Lord" has been rendered as "beijing this morn(ing)".
The double hyphens in the Podzinger transcript represent unknown words, or rather (I presume) regions where none of the program's hypotheses reached its threshold of confidence. As this example indicates, the state of the art in assigning confidence ratings to recognition hypotheses is not very good.
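To make the presumed mechanism concrete, here is a minimal sketch of confidence thresholding. It rests on my guess about what the double hyphens mean, not on anything known about Podzinger's internals. The function name, the threshold value and every confidence score below are invented for illustration.

```python
def render_transcript(words, threshold=0.5):
    """Join recognized words, replacing low-confidence ones with '--'.

    words is a list of (word, confidence) pairs, with confidence
    between 0 and 1.
    """
    return " ".join(word if confidence >= threshold else "--"
                    for word, confidence in words)


# Hypothetical recognizer output: the scores are made up.
recognized = [("and", 0.9), ("the", 0.8), ("no", 0.2), ("in", 0.7),
              ("beijing", 0.85), ("this", 0.9), ("morning", 0.9)]
print(render_transcript(recognized))
```

The weakness is visible in the toy data: a threshold only helps if wrong words get low scores. A recognizer that is confidently wrong about "beijing" will print it anyway.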
Podzinger: decade remote wooded delight When music scene it is my lord
Heard: there came a moment in your life when you said Jesus is my Lord
Here "you said Jesus" has been rendered as "music scene it". I think we've seen enough to suspect that Podzinger is not yet ready to accept Jesus into its vocabulary, much less into its stony little silicon heart.
But no -- if we search for {Jesus}, we find 7,990 hits. Some are plausible, if not entirely correct. The 3rd hit I got, for instance, was at 0:04:05 in The Bible Podcast's reading of Genesis 38, which Podzinger rendered as:
... turned to prostitution and as a result she has become pregnant Jesus said Bring her out and let her be burned While they were bringing her route she sent word her father in ...
This is almost entirely correct, except that of course it's Judah, not Jesus, who is featured in the story of Tamar and Onan in Genesis 38.
Podzinger's first hit for {Jesus} this morning was at 0:09:50 of Rounders - The Poker Show for January 22, 2006, in a passage which it rendered as
... six names including daniel le grande do when jennifer harman and jesus ferguson and also -- -- had reaction doctor do we instead of tonight with the winner but at a -- that's ..
Not knowing much about poker, I figured this instance of "jesus" was another error, but in this case Podzinger had it right. At least the "jesus" part. My transcription of the corresponding stretch:
... six names including Daniel Negreanu and Jennifer Harman and uh Jesus Ferguson and also Robert Williams, and so had we actually talked to him two weeks ago instead of tonight, we wouldn't have uh chatted with that, so ...
So it seems that Podzinger is ready to accept Jesus after all, at least as the nickname of the poker player Chris "Jesus" Ferguson.
[I should make it clear that this post's focus on mistranscriptions of "Jesus" is just a humorous way to highlight some issues with STT technology. If you search Podzinger for "Jesus", you will certainly find plenty of examples where the word has been correctly recognized, and I certainly don't mean to suggest that Podzinger has any special problems with religious as opposed to secular words, or with Christian words as opposed to those associated with any other religion. Bill O'Reilly need not get indignant.
On the other hand, the examples cited above are exactly those that came up as I explored the Podzinger service this morning in writing this post. I first searched for "Beijing", and checked two of the top three hits, one of which looked good while the other looked bad; having observed some problems in recognizing the word "Jesus" in one of the podcasts, I tried a search for "Jesus", and again checked two of the top three hits. The one I left out (from 36:56 of the PK & J Show) was transcribed by Podzinger as "embedded this can all learn sues outside community of (%EXPLETIVE) jesus was laying in finance -- awesome -- And The ..." Since this is a family weblog, you'll have to find out for yourself what the transcription should actually have been. Suffice it to say that "Jesus" is one of the few words that Podzinger got right.]
Posted by Mark Liberman at January 25, 2006 10:26 AM