February 06, 2006

986,120 words for snow job

The subject of the Language Log final exam, loyal readers will recall, was a peculiar article in the New York Times real estate section on the power of buzzwords in the New York housing market. In need of a language expert to lend the imprimatur of authority, the reporter turned to one Paul JJ Payack, president of the Global Language Monitor:

Mr. Payack, who graduated from Harvard with a bachelor's degree in comparative literature, calculated the popularity of some 36 buzzwords chosen by a reporter. He used his Predictive Quantities Indicator, or P.Q.I., an algorithm that tracks words and phrases in the media and on the Internet in relation to frequency, contextual usage and appearance in global media. It is a weighted index that takes into account year-to-year increases and acceleration in the last several months.

Along with calculating the "popularity" of market buzzwords (using his magic "algorithm"), Payack also revealed to the credulous reporter that "as of Jan. 26 at 10:59 a.m. Eastern time, the number of words in the English language was 986,120." This is one of those pronouncements so exquisitely silly that you figure it has to be a put-on. Who would possibly claim to have determined the exact number of words in the language, and that the number would be anything like 986,120? But there it is, proudly featured on GLM's page of language statistics. And now this absurd declaration has spread far beyond the New York Times real estate pages, as the Times of London has spun an entire article out of Payack's number, trumpeting the news that the English language will be welcoming its millionth word some time this summer. Break out the party hats!

It's hard to know even where to begin in analyzing Payack's specious claim. The description of GLM's "methodology" in calculating the number of English words is hard to take seriously:

The Global Language Monitor has attempted to pinpoint the precise number of words in the English Language at a given point in time. To do so, it first established a base number of words in the language using the generally accepted unabridged dictionaries (the O.E.D., Merriam-Webster's, etc.), that contain the historic 'core' of the English language: every word found in the works of Shakespeare, the King James Bible, and the other 'classics'.  It then created a proprietary algorithm, the Predictive Quantities Indicator (PQI) that attempts to measure the language as currently found in print (including technical and scientific journals), the electronic media (transcripts from radio and television), on the Internet and, increasingly, in web logs (blogs). GLM then assigned a number to the rate of creation of new words and the adoption and absorption of foreign vocabulary into the language. The result, though an estimate, has been found to be quite useful as a starting point of the discussion for lay persons, students, and scholars the world over.

So GLM starts with a "core" number of words, evidently based on the sum of entries in unabridged dictionaries. Who knows what that number might be, since even if we consider one particular dictionary there is no simple answer to how many "words" it contains. The second edition of the Oxford English Dictionary has about 300,000 headwords, covering 640,000 words and phrases, according to AskOxford. (The Third Edition, now in preparation, will increase that number to 1.3 million or more.) So do we count headwords? All defined words and phrases? Every distinct sense and subsense of those words and phrases? Every spelling variant? Do archaic words make the cut, and if so, what's the chronological cutoff for "English"? In estimating the size of the lexicon, AskOxford remains admirably agnostic in its FAQ (emphasis mine):

How many words are there in the English language?

There is no single sensible answer to this question. It is impossible to count the number of words in a language, because it is so hard to decide what counts as a word. Is dog one word, or two (a noun meaning 'a kind of animal', and a verb meaning 'to follow persistently')? If we count it as two, then do we count inflections separately too (dogs plural noun, dogs present tense of the verb). Is dog-tired a word, or just two other words joined together? Is hot dog really two words, since we might also find hot-dog or even hotdog?
It is also difficult to decide what counts as 'English'. What about medical and scientific terms? Latin words used in law, French words used in cooking, German words used in academic writing, Japanese words used in martial arts? Do you count Scots dialect? Youth slang? Computing jargon?
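
To make the AskOxford point concrete, here is a minimal Python sketch of how the tally for a single toy sentence shifts with the counting rule. Everything in it is an assumption made for illustration: the sample text is stitched together from the FAQ's own examples, and the names `count_types` and `naive_lemma` are hypothetical (the latter is a crude stand-in for a real lemmatizer). It is emphatically not a reconstruction of GLM's undisclosed method.

```python
import re

# Toy sample built from the AskOxford examples; not real corpus data.
SAMPLE = "The dog dogs the dogs. A hot dog, a hot-dog, a hotdog: dog-tired."

def tokens(text, split_hyphens):
    """Lowercase the text and pull out word-like strings."""
    # Either break hyphenated forms apart or keep them as single tokens.
    pattern = r"[a-z]+" if split_hyphens else r"[a-z]+(?:-[a-z]+)*"
    return re.findall(pattern, text.lower())

def naive_lemma(token):
    """Hypothetical stand-in for a lemmatizer: strip a final 's'."""
    return token[:-1] if token.endswith("s") and len(token) > 3 else token

def count_types(text, split_hyphens=False, fold_inflections=False):
    """Count distinct 'words' under one particular set of rules."""
    toks = tokens(text, split_hyphens)
    if fold_inflections:
        toks = [naive_lemma(t) for t in toks]
    return len(set(toks))

if __name__ == "__main__":
    for split in (False, True):
        for fold in (False, True):
            n = count_types(SAMPLE, split_hyphens=split, fold_inflections=fold)
            print(f"split_hyphens={split!s:5} fold_inflections={fold!s:5} -> {n}")
```

Four defensible rule sets give three different totals for one short sentence, and that is before deciding whether dog the noun and dog the verb are separate words, which this sketch cannot tell apart without part-of-speech information.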

Once we step beyond the seemingly authoritative pages of the OED and other unabridged dictionaries, these questions only get muddier. Are we to count every nonce form of the sort found in Urban Dictionary or Merriam-Webster's Open Dictionary? If so, what about all the fleeting nonce usages that existed in those benighted pre-Internet days? Even a big historical dictionary like the OED won't tell you about those, since words and phrases still need to show some staying power before lexicographers will consider them for entry. Or can GLM's mysterious "proprietary algorithm" somehow discern what is nonce and what is here to stay?

None of these questions get addressed by the Times of London correspondent, who seems happy to take Payack at his word as a reliable linguistic authority. (The article dubs him "a Harvard-educated linguist," but the bio on Payack's other Web venture, Yourdictionary.com, mentions no linguistic credentials.) Though we are left in the dark as to the criteria used by GLM to count "words," Payack does divulge that Chinese-English, or "Chinglish," is largely responsible for the lexicon's latest expansion:

Chinglish terms include "drinktea", meaning closed, derived from the Mandarin Chinese for resting; and its opposite, "torunbusiness", meaning open, from the Mandarin word for operating.
While some are amusing to the British ear, others are abrasive. Public toilets for disabled people in Beijing are marked "deformedman" and in Hong Kong "kweerboy" denotes a homosexual.
The Chinese government has vowed to sweep Chinglish from road and shop signs before the 2008 Beijing Olympics, but is fighting an uphill battle.
Payack...said 20,000 new English words were registered on the company's databases last year — twice as many as a few years ago. Up to 20% were in Chinglish.

Now there's obviously nothing wrong with incorporating various World Englishes in appraising innovations in the language. (See, for instance, the diverse international sources for neologisms catalogued on Double-Tongued Word Wrester.) But how many of the thousands of "Chinglish" words that Payack claims to have recorded are in common use even in China? Google turns up very few attestations of the examples given in the article. For instance, torunbusiness (a running together of to run business?) only shows up in two Chinese-language news articles about improperly used English on Chinese street signs (one of which can be "gisted" with Google's translation). If GLM is going to include every novel use of half-learned English on the world's street signs, they've got an awful lot of work ahead of them (even if they restrict their attention to, say, Japan).

All of this merely elaborates on the eloquent observation made by OED pioneer J.A.H. Murray way back in 1888: "The circle of the English language has a well-defined centre but no discernible circumference." So why are ostensibly scrupulous news sources so eager to accept a calculation of this unknowable circumference, without at least asking the opinion of an established linguist or lexicographer? Apparently this is yet another language-related factoid that's just too good to check.

[Update: Grant Barrett notes that there are "millions of chemical names alone," and also takes on some of the other "low-hanging fruit" in the Times of London article.]

Posted by Benjamin Zimmer at February 6, 2006 06:00 AM