April 26, 2006

A million words here, a billion words there...

It looks like 2006 is going to be a banner year for misinformed reporting on the English language. Numerous journalists have already swallowed the absurdly specious claim that the English language is going to add its millionth word sometime later this year. But doesn't "one million" sound a little paltry? Well, never fear. Today the Associated Press trumpets even bigger news:

"English Language Hits 1 Billion Words"

Do I hear a trillion?

That's how the headline reads on Yahoo! News, but you can find identical headlines on the websites for the Washington Post, the Los Angeles Times, the Boston Globe, the San Francisco Chronicle, Newsday, CBS News, ABC News, Fox News, and dozens of other media outlets. We've already seen ample evidence that news organizations relying on the AP wire very often reproduce the headlines provided to them in an entirely uncritical fashion. But this is not a case of circulating a grammatically questionable construction like "Skilling Calls He and Lay 'A Good Team.'" Here we have editors around the country blithely accepting a laughable assertion.

The article itself is relatively straightforward, belying the ridiculous headline:

A massive language research database responsible for bringing words such as "podcast" and "celebutante" to the pages of the Oxford dictionaries has officially hit a total of 1 billion words, researchers said Wednesday.

Drawing on sources such as weblogs, chatrooms, newspapers, magazines and fiction, the Oxford English Corpus spots emerging trends in language usage to help guide lexicographers when composing the most recent editions of dictionaries.

The press publishes the Oxford English Dictionary, considered the most comprehensive dictionary of the language, which in its most recent August 2005 edition added words such as "supersize," "wiki" and "retail politics" to its pages.

Oxford University Press lexicographer Catherine Soanes said the database is not a collection of 1 billion different words, but of sentences and other examples of the usage and spelling.

So there you have it: it's a lexicographical corpus of texts that has hit a billion words, and like any corpus it contains lots and lots of duplicated lexical items. How unobservant does a headline-writer or copy editor have to be to construe this to mean that the "English language" has hit a billion words? Apparently the good people at Oxford have a corpus that encompasses the entire language! Pretty darn impressive.
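For anyone who wants the distinction spelled out, here is a minimal sketch in Python of the difference between the size of a corpus (running words, or tokens) and the number of distinct word forms in it (types). The two sample sentences, the function names, and the naive tokenizer are all invented for illustration; nothing here reflects how the Oxford English Corpus is actually built or counted.

```python
import re
from collections import Counter

def tokenize(text):
    """Split text into lowercase word tokens (a deliberately naive tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())

def corpus_stats(texts):
    """Return (tokens, types): running words vs. distinct word forms."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    tokens = sum(counts.values())  # every occurrence counts toward corpus size
    types = len(counts)            # each distinct form counts only once
    return tokens, types

# Hypothetical two-sentence "corpus", made up for this example.
sample = [
    "The podcast was good.",
    "The other podcast was not good.",
]

tokens, types = corpus_stats(sample)
print(f"{tokens} running words, {types} distinct word forms")
```

Even in this toy sample the running-word count outpaces the count of distinct forms, and the gap only widens as a corpus grows, since common words keep recurring. A billion-word corpus is a statement about the first number, not the second.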

Kudos to those news outlets that recognized the AP headline as bunk and provided their own, though they're few and far between:

"Wordy? Dictionary database hits 1 billion mark" (MSNBC/Newsweek)
"Oxford database reaches 1 billion words" (CNews)
"English language database reaches 1 billion words" (AZCentral)
"Oxford English Corpus database of 21st century usage reaches 1 billion words" (San Diego Union-Tribune)

(A tip of the hat to Lance Nathan, who observes that the outrageously inflated headlines paradoxically represent "new lows in linguistic reporting.")

Posted by Benjamin Zimmer at April 26, 2006 09:28 PM