December 23, 2003

Counting new words: is there a lexicography gap?

Geoff Pullum is absolutely right to observe that Don Watson's notion of 20,000 new English words a year is probably an example of the well-known fact that 57% of all quoted statistics are made up on the spot, while another 34% are an inflated quotation of someone else's extemporaneous fabrication. People do this because it sounds better than saying "quite a few, I don't know how many".

Geoff is also right to observe that an accurate count of how many new English words come up every year is almost impossible to define in any useful way, since the meaning of the terms "word" and "English" in such statements is so vague. Nevertheless, it's easy to come up with some specific numbers that are not completely devoid of interest.

The OED's four most recent quarterly updates (through Dec. 11 2003) added 487 new "out of sequence" entries (leaving aside the much larger number of new-edition words in designated alphabetical ranges, such as the most recent batch Nipkow disc-nuculoid, since these have presumably been in preparation for a longer time). Even so, the great majority of the past year's 487 out-of-sequence additions were words that have been around for a while, but had previously been missed. These are not just stuffy old formal-language words, though -- the list includes backassward, digerati, fuckwit, gang-bang, infoholic, perl, Queer Nation, studmuffin, Thinsulate and Wonderbra. If there are really 20K new words a year, the OED's lexicographers are almost two orders of magnitude short of keeping up -- they'd be falling behind by more than 1.9 million words per century, the poor saps. But perhaps we should give them credit for all the new-edition entries -- adding 545 of the Nipkow disc-nuculoid batch in the last quarter alone, plus about a hundred new sub-entries in the same range. Along with relative newcomers like Nomex and nitrox, this would include definitely older words such as non-abelian and nonadditive; but it's all arguably part of the same lexical ledger, so let's give full credit for all the additions. If we do that, then I guess that the OED is adding about 2500-3000 new items per year -- and only falling behind Watson's estimate by some 17,000 per year, or 1.7M words per century :-).
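For anyone who wants to check the back-of-the-envelope arithmetic, here is a minimal sketch. The figures are the ones quoted in this post; the variable and function names are made up for illustration, and the upper end of the 2500-3000 guess is used for the generous case.

```python
import math

# Figures quoted in the post; all names here are illustrative only.
CLAIMED_NEW_WORDS_PER_YEAR = 20_000   # Don Watson's figure, as quoted
OED_OUT_OF_SEQUENCE = 487             # past year's out-of-sequence additions
OED_GENEROUS_PER_YEAR = 3_000         # upper end of the 2500-3000 guess


def shortfall(claimed_per_year, added_per_year, years=100):
    """Return (gap per year, cumulative gap over `years` years)."""
    gap = claimed_per_year - added_per_year
    return gap, gap * years


strict_gap, strict_century = shortfall(CLAIMED_NEW_WORDS_PER_YEAR, OED_OUT_OF_SEQUENCE)
generous_gap, generous_century = shortfall(CLAIMED_NEW_WORDS_PER_YEAR, OED_GENEROUS_PER_YEAR)

# How many times larger the claimed figure is than the OED's out-of-sequence count.
ratio = CLAIMED_NEW_WORDS_PER_YEAR / OED_OUT_OF_SEQUENCE

print(f"strict:   {strict_gap:,} per year, {strict_century:,} per century")
print(f"generous: {generous_gap:,} per year, {generous_century:,} per century")
print(f"claimed / out-of-sequence = {ratio:.1f} (about {math.log10(ratio):.1f} orders of magnitude)")
```

On the strict count the gap comes to a bit over 1.9 million per century, and on the generous count to 1.7 million, in line with the figures above.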

The 2001 edition of Microsoft's Encarta Dictionary advertises "over 5000 new words", presumably relative to the 1999 edition. This would be 2500 new words per year; but there is no reason to think that these are all novel words, as opposed to older words that the editors decided on reflection to include. If they were indeed all new, and if there really were 20,000 new words a year to keep track of, Encarta would be falling behind roughly at the same rate as the OED :-).

I'm sure that there are lexicographers out there who can give a more exact account of the number of apparently novel English coinages or borrowings they observe per year, independent of the number that they decide to include in their published dictionaries. I'll be somewhat surprised if those estimates are higher than 5,000 words a year, if they are that high; and I'll be very surprised if there really is a "lexicography gap", in the sense that the profession is falling behind by millions of bona fide words per century.

On the other hand... At the other end of several scales, the USPTO's TESS "contains more than 3 million pending, registered and dead federal trademarks" (as of 10 November 2003), whereas when it was started on Feb. 14, 2000, "TESS [allowed] the public to search ... the 2.6 million plus pending, registered, abandoned, cancelled or expired trademark records found in PTO’s X-Search system." This is about 400,000 added in roughly 3.75 years, or more than 100,000 added per year.
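The rate is easy to reproduce. The sketch below assumes the two quoted totals can be treated as round numbers (3.0 million and 2.6 million), although both are really lower bounds ("more than", "plus"), so the result is only a rough estimate. The variable names are invented for the example.

```python
from datetime import date

# Dates and totals as quoted above; the totals are rounded lower bounds,
# so treat the result as approximate.
tess_start = date(2000, 2, 14)
tess_snapshot = date(2003, 11, 10)
records_at_start = 2_600_000
records_at_snapshot = 3_000_000

elapsed_years = (tess_snapshot - tess_start).days / 365.25
added = records_at_snapshot - records_at_start
per_year = added / elapsed_years

print(f"{added:,} records added over {elapsed_years:.2f} years")
print(f"roughly {per_year:,.0f} per year")
```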

A lot of these are things like FUSION WAKEBOARD TOWERS AND ACCESSORIES or ROCK WAX SLAM'N HAIR WAX THAT ROCKS -- but wakeboard tower really is a word, and so is hair wax. The three-letter acronym IED has been trademarked 13 times, and none have anything to do with Improvised Explosive Devices :-). So there might well be more than 20,000 new company and product names invented every year, not to speak of semi-compositional complex nominals like "wakeboard tower," but I suspect that this is not what Don Watson was talking about.

In various areas of science and technology, there are many new terms of art added every year, and in some of these areas, some more or less official group keeps track. The Enzyme Commission's Enzyme Nomenclature Supplement 9 for 2003 includes around 200 new items, each of which may involve several new "words" (if we take the registered terms to be "words") -- thus EC is

Common name: GDP-L-fucose synthase.
Other name(s): GDP-4-keto-6-deoxy-D-mannose-3,5-epimerase-4-reductase
Systematic name: GDP-L-fucose:NADP+ 4-oxidoreductase (3,5-epimerizing).
The cross-listed NiceZyme entry gives another "alternative name" GDP-fucose synthetase.

If each of these variants is a different word, and if this entry is typical, then there might have been 800 or more new enzyme names registered officially in 2003. From my recent experience in biomedical information extraction, I can say that many "names" of enzymes (and genes and structural proteins and ...) are used without being officially registered. These are names, not words in the general sense, though the shorter variant names of a few of them might come into general use from time to time (like caspase-9 or topoisomerase 1).
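The 800 figure is just the number of distinct names in the sample entry multiplied by the approximate number of new items. A small sketch, assuming (as the paragraph does) that this one entry is typical; the dictionary keys are labels invented for the example.

```python
# The four name variants listed for the sample entry above.
sample_entry_names = {
    "common": "GDP-L-fucose synthase",
    "other": "GDP-4-keto-6-deoxy-D-mannose-3,5-epimerase-4-reductase",
    "systematic": "GDP-L-fucose:NADP+ 4-oxidoreductase (3,5-epimerizing)",
    "alternative (NiceZyme)": "GDP-fucose synthetase",
}

NEW_ITEMS_2003 = 200  # "around 200 new items" in Supplement 9

# Assumption: every new item has about as many distinct names as this one.
names_per_item = len(set(sample_entry_names.values()))
estimated_new_names = NEW_ITEMS_2003 * names_per_item

print(f"{names_per_item} names per item x {NEW_ITEMS_2003} items = {estimated_new_names}")
```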

If we looked across all the different sub-areas of science and technology, there would probably be many more than 20,000 (durable and generally-recognized) new names coined every year -- new genes, new species, new stars, new algorithms, whatever -- but I don't think that's what Don Watson had in mind either.

I also admit that there's lots of stuff going on under the lexicographical radar of all these monitors. Neither the OED nor Encarta nor the USPTO nor the Enzyme Commission has glemphy, craptacular, or Falluja. My personal guess is that craptacular (with 16,500 Google hits) will make it into the dictionaries before long, and that the other two won't (because glemphy won't ever be generally used, while Falluja will fall back into the category of foreign-language place names that are not really part of the general English vocabulary, even though they once might have been, like Qui Nhon and Echternach); but I'm skeptical that the list of also-rans as plausible as these is anywhere near as big as 20,000 a year.

Without spinning out the obscurities any further, it's clear that there are meanings of "new", "word" and "English" under which you could argue that there are 20,000 new English words per year, or even more -- but these meanings are pretty loose and even unreasonable ones. A more plausible guess, closer to the core interpretation of the terms by working lexicographers, seems to be in the range of the two or three thousand items that the OED and the creators of Encarta seem to be adding (though I look forward to hearing other numbers from people in a better position to know).

Like Geoff Pullum, I haven't read what Don Watson has to say about the globo-downfallization of language, because Watson's book is not available here. Maybe some reader down under can take a look? If Watson shows any evidence of having thought at all about what it means to say that "there are X new English words every year", rather than just blurting out some implausibly large estimate because he didn't want to say "a whole bunch", I'll buy a round of drinks at the LSA for anyone who cites his evidence or his arguments.

[By the way, we can't answer the question just by looking at the growth over time of the list of lexical tokens in some very large electronic corpus, because after a while, most of the new tokens are typos or mis-spellings. In addition, this method doesn't find new words that happen to be written with internal white space. One can imagine a variety of ways to deal with both of these problems, and people have tried some of them, but that's another story, or at least another post :-)].
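To make the two problems concrete, here is a toy sketch of the naive method: split each slice of a corpus on white space and count the tokens not seen in any earlier slice. The function name, the crude punctuation stripping, and the three-sentence "corpus" are all made up for illustration; nothing here is anyone's actual method.

```python
def new_types_per_chunk(chunks):
    """For each chunk of text, count the distinct tokens not seen in earlier chunks."""
    seen = set()
    counts = []
    for text in chunks:
        # Naive tokenization: split on white space, trim punctuation, lowercase.
        tokens = {tok.strip('.,;:!?"()').lower() for tok in text.split()}
        tokens.discard("")
        fresh = tokens - seen
        counts.append(len(fresh))
        seen |= fresh
    return counts


toy_chunks = [
    "the wakeboard tower is new",
    "the wakebord tower is new",   # a typo shows up as a "new word"
    "the hair wax is new",         # "hair wax" is only ever seen as two separate tokens
]

print(new_types_per_chunk(toy_chunks))
```

In the second slice the only "new word" is the misspelling, and in the third the method registers hair and wax separately but never the compound.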

[While we're on the subject, I need to very gently correct Geoff's statement that "the 5 exabyte mistake about word tokens uttered in human history [is] much repeated but known to be completely false." It's not completely false, it's just off by a factor of 8 thousand or so :-).]

[A few other relevant sites: (adds one new word a day)
The most commonly misspelled words on the web -- 2.86M cited for transexual, Google now says 4.47M...
The Dictionary Forum ]

Posted by Mark Liberman at December 23, 2003 05:31 PM