June 22, 2006

Time after time after time...

The Oxford English Corpus, a lexicographical research project on 21st-century English, has generated a surprising amount of copy for news organizations lately. First it was announced that the corpus had surpassed 1 billion words, though unfortunately the accompanying Associated Press article ran under the risible headline, "English Language Hits 1 Billion Words." Then it was revealed that the OEC is chock full (chalk full?) of pernicious eggcorns. Now the AP brings the latest news: the most frequently occurring noun in the OEC is time. The wire story (which was considered newsworthy enough to be reproduced by CNN, the LA Times, the Washington Post, the Guardian, and a host of other media outlets) begins as follows:

For those who think the world is obsessed with "time," an Oxford dictionary added support to the theory Thursday in announcing that the word is the most often used noun in the English language.

The AP writer apparently followed the lead of the press release out of Oxford, which reads:

It's official: we're a nation ruled by time

We like to be punctual, we expect our trains to run to schedule, and many of us spend our working day watching the clock. Now the new revised eleventh edition of the Concise Oxford English Dictionary can officially confirm that we are indeed ruled by time. Drawing on evidence from the Oxford English Corpus, the word time comes top in the list of commonest nouns in the English language, with year (3rd), day (5th) and week (17th) not far behind.

So is the ranking of time as the top noun really indicative of anything about the way we live in the 21st century? Well, first of all, who are "we"? The AP writer refers to "the world," apparently under the belief that the English language is now completely universal, while the Oxford press release writer more modestly refers to the "nation" (i.e., the United Kingdom), pitching the story to a domestic audience. The truth, of course, lies somewhere in between, as the sources used for the OEC are neither restricted to the UK nor open to non-English texts. The coverage is intended as a representative snapshot of the world's Anglophones, with US and British English dominating (together accounting for 80 percent of all texts) and the remainder devoted to a variety of world Englishes (Australian, South African, Canadian, Caribbean, Indian, Singaporean, etc.).

Restricting the discussion to the Anglosphere then, do the OEC findings imply that we are "obsessed with" or "ruled by" time? I'm not convinced. This seems like a kind of pop-Whorfianism not too far removed from the old "Eskimo words for snow" meme. But at least frequency data from a broad corpus of texts should be a little more telling than a simple count of words in the lexicon, right? Depends on what you want the data to be telling you. For instance, though time is the top-ranked noun in the OEC's list of most common words, overall it only ranks #55. The top spots are dominated by boring words like the, to, of, and, a, and in, which are hardly conducive to lively PR copy. There are also several verbs that rank ahead of time on the list: be (#2), have (#9), do (#19), say (#28), get (#47), go (#49), and make (#52). Does this mean we English speakers are obsessed with being, having, doing, saying, getting, going, and making?

Nouns are easier to get a hold on, even abstract ones like time, so it's not surprising that other parts of speech got neglected by the copywriters. (Imagine the headline: "We definitely love definite articles! The triumphs over a!") But even just considering nouns, I fail to find anything particularly newsworthy about time being ranked the most frequent. Taking a look at other extensive English corpora collected in the past, we can see that the dominance of time is nothing new. The British National Corpus, compiled in the early 1990s, had time as the top noun, according to the frequency lists published in a book based on BNC data (Word Frequencies in Written and Spoken English). And we can go all the way back to the Brown Corpus of Standard American English, a million-word corpus derived from texts printed in 1961. Time was the top noun back then too (according to this list), coming in at #66 overall.

So has time at least jumped in the standings, if it's at #55 in 21st-century texts? Probably not. The OEC uses slightly different coding standards from the Brown Corpus, choosing to merge items that differ by number or tense. So, for instance, the Brown Corpus separately ranks is (#8), was (#9), be (#17), are (#24), and were (#34), while the OEC lumps them all together under be, bringing that item all the way up to second place. Similarly, say, says, and said are all merged together under say, which stands at #28 on the OEC list; the same goes for have, has, and had. Each merger collapses several separately ranked forms into a single entry, freeing up slots above time. Once all of those mergers are accounted for, time ends up ranking just about where it was back in 1961.
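To make the merging effect concrete, here is a minimal sketch in Python. The counts are toy numbers invented purely for illustration (they are not Brown Corpus or OEC figures), and the lemma table and the helper names merge_by_lemma and rank_of are hypothetical. The point is only that a word can climb the list without becoming any more frequent, simply because the inflected forms ranked above it get folded into one entry.

```python
from collections import Counter

# Toy counts, invented for illustration only (NOT Brown or OEC data).
surface_counts = Counter({
    "the": 120, "of": 60, "is": 30, "was": 28, "be": 20,
    "are": 15, "time": 12, "were": 10, "said": 9, "say": 5, "says": 4,
})

# Hypothetical lemma table: maps inflected forms to a single headword.
LEMMAS = {
    "is": "be", "was": "be", "are": "be", "were": "be",
    "said": "say", "says": "say",
}

def merge_by_lemma(counts, lemmas):
    """Sum the counts of all forms that share a headword."""
    merged = Counter()
    for form, n in counts.items():
        merged[lemmas.get(form, form)] += n
    return merged

def rank_of(word, counts):
    """1-based rank of a word when entries are sorted by descending count."""
    ordered = [w for w, _ in counts.most_common()]
    return ordered.index(word) + 1

lemma_counts = merge_by_lemma(surface_counts, LEMMAS)

print(rank_of("time", surface_counts))  # rank with every form listed separately
print(rank_of("time", lemma_counts))    # rank after merging inflected forms
print(rank_of("be", lemma_counts))      # the merged "be" entry moves up too
```

In this toy list, time moves from seventh to fifth place even though its own count never changes, which is the same kind of bookkeeping shift that separates the Brown ranking from the OEC one.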

This is not to say that the OEC is just a retread of old corpora. From what I've seen, it has already yielded fascinating new findings on a range of research topics. For instance, the OEC website gives some glimpses into the collocational data for the corpus, which could be immensely valuable for the study of snowclones and other phrasal patterns. Here's a sampling:

The idea of one's 'inner child', popularized in psychotherapy in the 1980s, has spawned an array of humorous variations. These illustrate the way that language is routinely exploited and extended, not as part of a literary endeavour but simply as part of normal creativity in language use. In the Oxford English Corpus the most common of these are (in order):

  • inner geek
  • inner nerd
  • inner diva
  • inner dweeb
  • inner slut
  • inner cynic
  • inner hippie
  • inner brat

Or how about this insight into the productivity of the suffix -fest?

The most common uses of -fest are: slugfest, lovefest, gabfest, crapfest, talkfest, gorefest, snoozefest, hatefest, bitchfest, snorefest, geekfest, gabfest, bloodfest, blogfest, songfest, shitfest, screamfest, filmfest, yawnfest, funfest, sobfest, plugfest, mudfest, fragfest, and suckfest.

Surely the rise of formations such as inner slut and suckfest would make for far more interesting reading than a story about the frequency of the word time? I await the next round of reporting on the corpus, which I hope my inner cynic will not consider a snoozefest.

[One final caveat about word frequency lists. If you see a list of "the most common nouns of English" — say, on Wikipedia — and that list finds room for such words as colony, continent, and slave, but not for way, thing, or life (all among the OEC's top ten nouns), be very, very skeptical. That list is evidently derived from one attributed to Jerry Jones on esl.about.com, which, to its credit, actually does show way, thing, and life in its top 250 overall. But Jones' corpus-gathering techniques are still highly suspect, since his list contains such oddities as hot at #30 (#776 in the Brown Corpus). The Jones list also has unusually high rankings for word, write, sentence, and spell, which suggests that the corpus leans heavily on ESL texts and the like.]

[Update: The Wikipedians are on the case, as that page of "most common nouns in English" has disappeared — the link now redirects you to a page of "most common words in English" based on OEC data. The old list of nouns to which I referred is still visible here.]

Posted by Benjamin Zimmer at June 22, 2006 01:17 AM