Language Log: Corpus fetishism

November 16, 2003

Corpus fetishism

A depressing tendency is apparent in a couple of the published reviews of The Cambridge Grammar of the English Language. (Don't ask me to name the reviewers. It would be unkind. A couple of the reviews published in Britain have been so stupid that the only thing a fair-minded man like me can wish upon the reviewers is that they should die in obscurity.) The tendency is to grumble that the grammar does not cite corpus sources for its examples, and to imply that this means Huddleston and I are bad people.

The charge that we did not use exclusively corpus data to illustrate points of grammar in the book is certainly true. We sometimes used examples taken from texts, even well known ones, but never with a source citation (the source was not the point). We sometimes used edited versions of sentences from texts (omitting irrelevant clutter, shortening clumsy noun phrases where they didn't matter, replacing unusual names, etc.), or sentences we heard on the radio and jotted down. And sometimes we used natural-sounding made-up examples. It depended on what would do the job best. The subject matter of chapters 16 and 17 (information packaging and anaphora) makes style and context highly relevant, so there the frequency of attested examples is very high. But in Chapter 4, basic clause structure is under discussion, and the chief need is for very short and simple examples, not rich and ornate ones.

The reviewers whine on about our policies as if there were something improper and disappointing and unrigorous about a grammarian ever making up an exemplificatory sentence. I disagree. I think we have to draw a line between sensible use of corpora and a perversion that I call corpus fetishism.

You see, if you look at what someone like Mark Liberman does with corpora (often the gigantic corpus constituted by Google and the complete copy of the entire web that it keeps in a barn in Sunnyvale), you will note (e.g. here and here, and especially here) that he uses the corpus for investigation. He probes the text that is out there to see what sentences can be found, and he changes his mind about what the facts are according to what he finds in natural use of the language that appears to emanate from native speakers and seems not to have unintentional slips in it. This is because (and here I reveal a fact about Mark's private life, but only because it is highly relevant)... Mark is not a moron. Mark knows how linguistic investigation is done, not because he once read about it in a book he got out of the library, but because he actually does it. He is not attached to the corpus as if it were the object of study, like a twisted lover obsessed with the shoe of his beloved instead of the woman who wears it.

More than one of the reviewers of The Cambridge Grammar on the Old Europe side of the Atlantic -- reviewers who were clearly not grammarians themselves -- have hinted that no facts can be trusted if they are presented in terms of examples written by the grammarian. They claim that The Cambridge Grammar should have used corpus data throughout for illustration. But this is madness.

Take the beginning of Chapter 10, "Clause type and illocutionary force" (see page 853). There we list the five basic clause types, and give an example of each. We exemplify imperative clauses by giving the example Be generous. Rodney Huddleston chose it, and I have no doubt that he thought it up. Now, using "real" data (as the corpus fetishists always say) would have been trivially easy. We could have used "Call me Ishmael." (We wouldn't even have needed to take the book down from the shelf to cite the source, would we? Moby Dick, by Herman Melville, page 1.) But the question is, why would we or should we do this?

Would it have improved our exposition of clause type? No, it would have worsened it. It would have ruined the symmetry of the set of near-minimal contrasts we give between the five clause types: You are generous for the declarative, Are you generous? for the closed interrogative, etc. Using random attested examples from wherever we could find them attested would have lessened the clarity of the illustration.

Would it ensure a convincing answer to some contested question? No. Nothing is at issue here. There is no possibility that Be generous might be ungrammatical. No point is being missed if we use that rather than a different example that came from a corpus. We just need a clear and simple illustrative example so that you can see what we mean when we say "imperative clause".

In any case, there isn't really a line here between attested and non-attested data. Check out Be generous on Google and you find it gets roughly 120,000 hits, and thousands of them are imperatives. So it is attested, though choosing a source from the thousands available would have been arbitrary. If you want a literary citation, a few seconds of experimentation with the little corpus of uncopyrighted Victorian materials I keep on my Linux box plucks out this:

Don't mind Mrs. Dean's cruel cautions, but be generous, and contrive to see him.

We could have used that, though it has an extra twelve words of clutter, bloating the example up from 12 characters to 80, a factor of 6.67, as well as messing up the symmetry with the other clause types. We could have given the citation too: "Wuthering Heights by Emily Brontë (1847)", plus a specific edition, and a page reference. The whole thing would take more than an order of magnitude more space on the page. Why didn't we do this? Because (you know what I'm going to say, don't you?)... Huddleston and I are not morons.
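For anyone who wants to check the arithmetic, here is a minimal sketch in Python. It assumes the short example is counted with its final period ("Be generous."), which is what gives 12 characters, and it counts words by splitting on whitespace. The variable names are mine, purely for illustration.

```python
# Compare the made-up example with the attested one from Wuthering Heights.
# Assumption: the short example is counted with its closing period.
short = "Be generous."
attested = ("Don't mind Mrs. Dean's cruel cautions, "
            "but be generous, and contrive to see him.")

# Character counts and the bloat factor.
ratio = len(attested) / len(short)

# Word counts, splitting on whitespace (punctuation stays attached to words).
extra_words = len(attested.split()) - len(short.split())

print(len(short), len(attested), round(ratio, 2), extra_words)
# prints: 12 80 6.67 12
```

The count covers the sentences alone; a source citation with edition and page reference would come on top of that.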

There are way over 10,000 numbered examples in The Cambridge Grammar, and thousands more given in passing in the text. To use only corpus examples, and to give full source citations of all examples used, would have added scores of pages (possibly a hundred pages or more) to a book that is already 1,842 pages long. You really would have to be a moron to do it. But because we didn't, we are getting accused of not being adequately responsive to the corpus revolution in modern syntax. Only two or three so far, but already I am getting tired of them. The charge is nonsense. Huddleston and I used corpora constantly. The British National Corpus was not available to us back in the 1990s, but we slaved over printouts from three well-matched and well-balanced small corpora (the Brown, LOB, and ACE corpora, representing American, British, and Australian English respectively); in addition I ran thousands of searches on the Linguistic Data Consortium's famous Wall Street Journal corpus of 1987-1989 journalism to check points of American English; we paid attention to both spoken and written English (notice, any spoken English caught by reporters turns up inside quotation marks); in every way we could think of we sought out evidence from attested linguistic material -- not just one fixed corpus serving as the only source for everything (that turns the language into a dead language -- corpus necrophilia), but a dynamically evolving collection embracing any kind of material that might be of use.

But what it was of use for was the investigation phase, when we were finding out what was true of English and what was not. To suggest that we then should have set out our illustrations only (or even largely) with unedited examples together with full text locations is just nuts.

I defend the rights of consenting adults to engage in corpus fetishism if they wish, in the privacy of their own homes. But it is a perversion, and I don't want its perverted adherents trying to tell me that The Cambridge Grammar would be a better book if its exemplifications were exclusively long and ungainly attested utterances taken unedited from corpora of text with location information attached, because it wouldn't.

Posted by Geoffrey K. Pullum at November 16, 2003 02:14 AM