May 01, 2006

Oxford English Corpus: infested with eggcorns!

The billion-word Oxford English Corpus continues to make news, though thankfully no longer under the farcical headline, "English Language Hits 1 Billion Words." Now we get this dire headline from the Guardian: "Internet culture spells doom for strait-laced orthographers." The opening paragraphs elaborate the theme of linguistic degenerationism:

If you believe the internet is the fount of all wisdom, giving free rein to bloggers to exercise their vocal cords, think again. Ancient English cliches and expressions are being mangled by the culture of cut and paste and the spread of unchecked writing on the internet.
According to the Oxford English Corpus, a database of a billion words, dozens of traditional phrases are now more commonly misspelled than rendered correctly in written English.

Though the Guardian article doesn't say so explicitly, the common misspellings taken from the Oxford English Corpus are all semantically motivated — in other words, they're eggcorns.

As it happens, every example given by the Guardian has already been entered into the Eggcorn Database, maintained by Chris Waigl. (They also can be found in Paul Brians' comprehensive site, Common Errors in English.) Here's more of the article, hyperlinked to the appropriate Eggcorn Database entries:

"Straight-laced" is used 66% of the time even though it should be written "strait-laced", according to lexicographers working for Oxford Dictionaries, who record the way English is spoken and written by monitoring books, television, radio and newspapers and, increasingly, websites and blogs.
"Just desserts" is used 58% of the time instead of the correct spelling, "just deserts" (desert is a variation of deserve), while 59% of all written examples of the phrase in the Corpus call it a "font of knowledge or wisdom" when it should be "fount".
It has become so widely used that the wrong version is now included in Oxford dictionaries alongside the right one.
Other mistakes fast becoming the received spelling include substituting "free reign" for the correct phrase, "free rein".
The original refers to letting a horse loose, but many use "reign" and assume the expression means to allow a free rule.
Other examples of common mistakes include "slight of hand" instead of "sleight", "phased by" when it should be "fazed by", "butt naked" instead of the correct "buck naked" and "vocal chords" for "vocal cords."

All of these examples are marked in the Eggcorn Database as "nearly mainstream," so it's nice to get corroboration from a corpus more reliable than Google's. Whether you consider the mainstreaming of eggcorns to be a simple matter of language change in progress or something more nefarious depends on your perspective, of course. Catherine Soanes of Oxford Dictionaries takes the lexicographical long view as an antidote to the Recency Illusion:

"We have to accept spelling is not fixed and can change over the years," said [Soanes]. "You only have to look back 100 years, when the word rhyme was spelled rime. But since then we adopted rhyme as the correct spelling because this is more like the Greek word from which it originally came."
She added: "Our Corpus has around 150m words from the web and the way words are written often has to do with familiarity.
"For instance, 35% of people say 'a shoe-in' when actually it should be 'a shoo-in'.
"But the original is an American phrase using a US version of the word shoe in the first place."

(I'm unclear what Ms. Soanes means by the "original" form of shoo-in. The OED derives shoo-in from the verbal phrase to shoo in, and the verb shoo is in turn derived from the interjection shoo! used to drive away animals or intruders — similar forms include German schu, French shou, and Italian scioia. I don't see anything suggesting a derivation from the word shoe, unless she means that the eggcorn variant shoe-in has been present from early on in American usage.)

The Guardian article squarely lays the blame for this rampant eggcornification on "the culture of cut and paste," and particularly on "the spread of unchecked writing" found on the godforsaken Internet. Note, however, that only about 15 percent of the Oxford English Corpus (150 million of the billion words) is gleaned from online usage, so that can't be the whole explanation. As we've seen in previous cases of hell-in-a-handbasket reporting on the degeneration of the English language, the Internet (and especially the discourse of young people on the Internet) is always an easy target for condemnation, since it brings into easy reach a whole range of non-standard usage. Because we're reading so much more text that has not been professionally edited, what may previously have risen only to the level of pet peeve may now appear as a grave threat to the future of the language.

Getting past the doom-and-gloom soothsaying, the Guardian article highlights some other findings from the Oxford English Corpus:

According to the Corpus, another linguistic trend is the American habit of turning two words into one, such as someday, anymore and underway.

I'm not convinced that "turning two words into one" is a particularly "American habit," or at least not as the phenomenon has been illustrated by the Guardian. For example, the earliest example given by the OED for single-word someday is from George Bernard Shaw in 1898, while the single-word form of underway was popularized in nautical usage on both sides of the Atlantic, as far as I know. And the use of single-word anymore has more to do with its development as an adverb meaning 'nowadays.' Even the dialectally distinct "positive anymore" is not particular to American usage, as the OED also marks it "Irish English" (as in the citation from Tom Murphy's 1961 play "A Whistle in the Dark": "We'll squeeze Michael a bit. He'll chip in anymore").

And finally:

The Corpus also records how some words are used almost exclusively to apply to men and others to women.
Only men seem to hijack, crouch, kidnap, rob, grin, shoot, dig, stagger, leap, invent or brandish.
Women, meanwhile, tend to be the only ones to consent, faint, sob, cohabit, undress, clutch, scorn or gossip.

There is no doubt some interesting gender-based variation to be found in the corpus, but as with the other findings the Guardian's framing is a bit dubious. Only men (or women) commit the actions of those verbs? That's a wild overstatement, but a bit of Googling corroborates large gender disparities when comparing the simple collocations "he + V" versus "she + V" for the verbs given. It will be quite intriguing to see the actual results from an analysis of the Oxford corpus, rather than having to rely on the unfortunate simplifications given in the Guardian article and other glib media accounts.
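For anyone who wants to try the "he + V" versus "she + V" comparison on a text collection of their own, here is a minimal sketch in Python. It is not how Oxford's lexicographers query their corpus, and it is not how Google counts hits. The function name pronoun_verb_counts, the sample sentence, and the list of verb forms are all made up for illustration. The sketch assumes the caller supplies the inflected forms to look for, and it only counts a pronoun immediately followed by one of those forms.

```python
import re
from collections import Counter


def pronoun_verb_counts(text, verb_forms):
    """Count 'he + V' and 'she + V' bigrams for the given verb forms.

    Hypothetical helper: it ignores sentence boundaries, adverbs between
    pronoun and verb, and every subject other than 'he' or 'she'.
    """
    forms = {form.lower() for form in verb_forms}
    counts = {"he": Counter(), "she": Counter()}
    # Crude tokenizer: lowercase runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    # Walk over adjacent token pairs and tally pronoun + verb-form matches.
    for pronoun, following in zip(tokens, tokens[1:]):
        if pronoun in counts and following in forms:
            counts[pronoun][following] += 1
    return counts


# Invented sample text, not drawn from any real corpus.
sample = (
    "He grinned and she sobbed. Then she grinned, he staggered, "
    "and he grinned again."
)
verbs = ["grinned", "sobbed", "staggered"]
counts = pronoun_verb_counts(sample, verbs)
for verb in verbs:
    print(verb, "he:", counts["he"][verb], "she:", counts["she"][verb])
```

Raw counts like these say nothing by themselves about "only men" or "only women": they would need to be normalized against how often "he" and "she" occur overall in the text, which is one reason the Guardian's framing deserves skepticism.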

[Update: The Guardian's technology blog links to the article in an entry titled "Watch your language — most of you are wrong." Comments are already turning nasty.]

Posted by Benjamin Zimmer at May 1, 2006 07:21 PM