October 02, 2006

The imminent lexicographic singularity

In his 1940 essay "New Words", George Orwell wrote:

I have read somewhere that English gains about six and loses about four words a year.

In a review of Don Watson's 2003 jeremiad Death Sentence: The Decay of Public Language, James Button wrote:

The genius of English is the way it updates itself every day, with 20,000 new words a year, Watson read somewhere.

Now, a BBC News book review from 10/1/2006 ("New insults for English language"), writing about a book that documents cutesy new words like "tanorexic" and "celebutard", tells us that

The words have been taken from entries in the Collins Word Web, which monitors sources to pick up any new additions.

The Word Web lists genuine words and phrases that have entered the English language, and contains more than 2.5bn words, expanding at a rate of 30m words per month.

Holy singularity, Batman! Have we really come so far since 1940? Has the internet turned English into a real-world lexicographical Naian?

George Orwell would be intrigued by this prospect, I think, since he felt that

Everyone who thinks at all has noticed that our language is practically useless for describing anything that goes on inside the brain. [...] [I]t seems to me that from the point of view of exactitude and expressiveness our language has remained in the Stone Age.

And like many people (but not everyone), Orwell felt that the key to improving exactitude and expressiveness is to increase the stock of words. And 30 million new words a month would be a hefty rate of increase indeed, even if the voice of common sense whispers that you'd need to learn 12 words per second, 24/7, in order to keep up.
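That whispered common-sense figure is easy to verify with back-of-the-envelope arithmetic (assuming a 30-day month):

```python
# Sanity check: how fast would you have to learn to keep up with
# 30 million new "words" a month?
new_words_per_month = 30_000_000
seconds_per_month = 30 * 24 * 3600  # assuming a 30-day month

rate = new_words_per_month / seconds_per_month
print(round(rate, 1))  # about 11.6 -- call it 12 words per second
```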

However, if Orwell has been reading the BBC news in some heavenly pub, I'm afraid that he's in for a disappointment. This particular BBC article is even more confused than the usual BBC piece. A trivial web search turns up a 2004 press release, which makes it clear that the "Collins Word Web" does not "list genuine words and phrases that have entered the English language". Rather, it's a text corpus used by Collins (a Murdoch subsidiary) for lexicographical purposes -- that is, a collection of texts in digital form. By a sensible convention, the word count given for such a collection is the total number of character strings separated by white space, not the number of different space-separated strings, much less the number that are deemed to represent distinct "words" in any interesting sense. (You'd want to eliminate inflected forms of existing words, digit strings, typographical errors, and so on.)
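The token-counting convention is easy to illustrate. A minimal sketch, using a made-up scrap of text (not, of course, the Collins corpus):

```python
# Corpus "word counts" conventionally count tokens: whitespace-separated
# strings, repeats and all. The number of distinct strings (types) is far
# smaller -- and the number of genuinely new dictionary-worthy words
# smaller still, once inflections, digits, and typos are weeded out.
text = "the cat sat on the mat , and the cat purred"

tokens = text.split()   # every whitespace-separated string
types = set(tokens)     # each distinct string counted once

print(len(tokens))  # 11 tokens
print(len(types))   # 8 types ("the" and "cat" collapse into one entry each)
```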

The sheer size of the "Collins Word Web" corpus is not so impressive -- anyone can now use web search engines to search the text collection of the whole internet (minus various excluded bits), which amounts to many trillions of words. The Collins corpus is presumably selected in various ways to be more representative, to include less junk, etc. -- but the size alone shouldn't impress anyone.

A couple of items further along in the Google hits for "Collins Word Web", you can find a page on the Collins Word Exchange site, "How to make a dictionary", which explains the conceptual issues clearly, and explains that "[e]very month [the Collins lexicographers] collect several hundred new [words] and then monitor them to see if they are becoming part of general language".

The writer(s) and editor(s) of the BBC piece apparently didn't know about any of this, and didn't have the time or inclination to learn. ("30 million, several hundred -- whatever.") That's not surprising, I guess, since they also neglected to mention the names of the authors of the book that they were reviewing -- for the record, the book title (which the review does give) is I Smirt, You Stooze, They Krump, and the authors (whom the review fails to mention) are Justin Crozier & Cormac McKeown.

All jokes aside, I wonder what is going on at the BBC these days. Do they give their writers impossible quotas to fulfill? Is this the insidious onset of nvCJD, seeded by the BBC cafeteria's roast beef? Does the organization's worldwide reach create a thriving internal market in unusually potent recreational drugs?

If you're curious about how many distinct new words are actually added to the English language every year, I can't give you an answer, but I can point you to a discussion of some of the additional questions that would have to be answered before you could get an answer that would mean anything.

And if you're a fan of jokey neologisms like "celebutard" and "tanorexic" (also known as "stunt words"), you should check out Mark Peters' blog Wordlustitude.

[Orwell link by email from Ian Cooper]

[Ben Zimmer points out that this confusion (between "corpus" and "lexicon") was also featured in recent headlines on the billion-word Oxford English corpus. But headline writers, especially at newswires like the AP, are notoriously careless. The novelty in the current story is that the confusion is embedded in the article itself.]

Posted by Mark Liberman at October 2, 2006 07:42 AM