December 13, 2005

From TV to text, Part 1

The Google Foundation recently announced an intriguing philanthropic partnership with an organization called PlanetRead, designed to improve literacy levels in India. As described on the official Google Blog, PlanetRead's goal is to increase literacy by means of "Same-Language Subtitling" (SLS), which provides on-screen subtitles for television programming. So far, PlanetRead has focused primarily on subtitling music videos for songs in various Indian languages, with text streaming across the bottom of the screen karaoke-style. As the organization's president, Dr. Brij Kothari, explains in the blog entry, SLS is a cost-effective method of providing reading practice for more than 200 million "early-literates" in India, most of whom live in poverty.

Google has provided PlanetRead with a grant to support SLS programs and is also hosting content via Google Video, a Google Labs project that is still in beta testing. So far the PlanetRead videos hosted by Google are limited to a handful of samples, consisting of some newly composed Telugu folk songs (with Telugu subtitles in both Indic script and Roman transliteration, as well as a running English translation) and clips of Bollywood musical numbers (Hindi subtitles only, in Indic script). Somewhat surprisingly, Google Video does not allow searching on the subtitles of these videos (regardless of the language or script), only on the accompanying descriptive text. That's a shame, as it would be great for researchers to have access to such multilingual text samples — a linguistic fringe benefit to an admirable philanthropic endeavor.

The lack of searchability on the PlanetRead subtitles would seem to go against Google's aim of making as much of the world's content searchable as possible. But Google Video only indexes text that is provided to it in the form of metadata. An earlier version of Google Video included screen shots of U.S. television programming with searchable closed-captioning transcripts, but the company has evidently removed all of the TV material from its database due to rights issues (even though the About Google Video page still acts as if the TV screen shots and transcripts are available). As Google Video director Jennifer Feikin recently discussed at a Silicon Valley conference, the video project has moved on to a new phase, relying entirely on user-provided video content (and metadata). There are hints that Google Video will soon begin offering commercial video content from TV and elsewhere, but the company seems mainly concerned these days with figuring out how to charge people for viewable content. Under such a scheme, searchable closed captioning might at least be offered as an enticement to get people to pay for content, as long as the snippets of transcripts are considered "fair use" (mirroring some of the controversy involving the Library Program of Google Book Search, formerly Google Print).

If Google Video eventually brings back searchable transcripts of closed-captioned TV shows (from the US or other countries), it would be a fascinating addition to the field of Googlinguistics. Granted, snippets of closed captioning wouldn't provide anywhere near the transcriptional precision that one might expect from data collected by researchers of spoken interaction, such as the corpus of telephone conversations compiled by the Linguistic Data Consortium. But as we've seen with Googlinguistics more generally, sheer quantity of data can sometimes overcome shortcomings in quality. Just as Google searches can yield a rich variety of written texts from the most formal to the most informal registers, so too could a database of closed-captioning transcripts illuminate a broad cross-section of mass-mediated discourse, from scripted dramas to celebrity interviews to the more free-and-easy conversational exchanges seen on American shows presided over by the likes of Jerry Springer or Judge Judy. And as long as the corresponding video clips are easily accessible, even inaccurate or limited transcriptions can serve as a rough guide for pinpointing relevant stretches of discourse to analyze.

Ultimately, one of Google's competitors may step up to the plate first and make TV clips with closed-captioning transcripts widely available. Yahoo Video Search could very well take the lead, if the recently announced partnership between Yahoo and TiVo is any indication. Another Internet player is already exploring this market: Blinkx, one of Time Magazine's "50 coolest websites of 2005." Blinkx pulls together a wide variety of clips from TV shows (not just American ones) and does its own indexing, not with closed captioning but via in-house speech-recognition technology (details here). Blinkx also indexes podcasts, but it may be losing that territory to podcast-specific search engines that use speech-to-text automation, such as Podzinger and Podscope. See this AP article for a comparison of Blinkx, Podzinger, and Podscope, and see Part 2 of this post for some fun with Blinkx transcriptions.

Posted by Benjamin Zimmer at December 13, 2005 11:30 PM