December 31, 2003

Beware, corpus fetishists

Apropos of my recent transsexual pronominal reference story, let me just mention that the word "transsexual" gives us a truly frightening glimpse of the giant reservoir of error out there that Google keeps an index of. Google reported, the last time I checked, that the incorrect spelling "transexual" occurs on at least 1.87 million web pages, while the correct spelling "transsexual" occurs on only 1.37 million. Now, I take it that it is quite clear what is correct or incorrect in this domain: spelling is conventionalized and fixed in a way that grammar is not. This is not the time of the Paston letters, when spelling varied regionally and between families. So beware, corpus fetishists! The possibility of corpus research is a great asset to linguistics, and no one should try to work without corpus material; but there are major pitfalls for those who take the corpus to be the object of study. It is not the object of study. The language is the object of study. A corpus is just an assemblage of material through which we can study the language, and virtually any corpus is going to have errors in it. Possibly numbering in the millions, even outnumbering the correct forms. Deciding when some new expression type has become a part of the language and when we are simply dealing with a lot of people messing up is not an easy process.

