January 07, 2004

Transexual, transsexual, and restricted Google searches

Natalja Schmidt wrote a thoughtful note to me from Germany on New Year's Day about my post "Beware corpus fetishists", in which I discussed the curious fact (originally pointed out by Mark Liberman, citing this) that the incorrect spelling transexual is significantly more common on the web than the correct spelling transsexual. What she says merits some thought. Let me quote her in full.

Says Natalja:

Unfortunately you didn't say what settings you used for your Google search. I'm sure you are aware of the fact that the results can vary considerably depending on the Google configuration, especially language and country.

Per default, Google searches the entire web, ignoring what country the page is from or what language it's written in. In this case the results for transexual indeed outnumber the ones with the correct spelling transsexual, although I get a different number (correct spelling: 2.24 million, incorrect spelling: 2.83 million).

Obviously, sources in foreign languages must be excluded from the results. In Spanish, for example, "transexual", not "transsexual" is the correct spelling (And you get lots of Spanish pages if you use the default search settings). Another problem are foreign speakers of English (such as myself) who make many mistakes a native speaker would never make. These results shouldn't count either I think.

Unfortunately, even if you narrow down the search to English pages (i.e. pages in English) located in the US, the results are not much better, mainly because of the fact that English is an international language and because the "American" TLDs .com, .net, .org etc. are not only used by Americans. Another option are UK or Australian domains, since they are rarely used by foreigners. So I tried searching for UK/Australia pages. The number of incorrect spellings for UK and Australia was very small compared with the correct ones and compared with the results of web-wide searches, i.e. not restricted by language or country.

I guess what I'm trying to say here is that the "reservoir of error" in Google is not as giant as you think after all (if you know how to use it).

These sensible words of Natalja's are well worth keeping in mind, and for the most part I do not dispute them -- except for that very slight hint of an implication ("if you know how to use it") that I am a meathead web surfing moron who doesn't know how to Google his way out of a wet paper bag and believes fgrep -i on large text samples gives you a direct line to God; and I forgive her for any such suggestion, since there are so many meatheads out there, and who knows, I could easily have been one of them, rather than the sophisticated guru-class data wrangler and enemy of corpus fetishism (not to mention sexy super fun wild and crazy guy) that I actually am.

One can indeed tell Google to limit its searches in all sorts of ways to make the results more useful. There are some nice books on using Google in a sophisticated way; the wonderful O'Reilly & Associates publishes a couple of them, the more technical one being Google Hacks; I chose Google Pocket Guide by Tara Calishain, Rael Dornfest, and DJ Adams, which supplies enough for my purposes. The results Natalja got with her restricted searches were these:

Jan 1, 2004, Google search results for transsexual / transexual

Search restriction          transsexual (correct)    transexual (incorrect)
Any language or country                 2,240,000                 2,830,000
English, any country                    1,400,000                 1,630,000
English, USA                            1,280,000                 1,390,000
English, UK                                14,300                     5,360
English, Australia                         12,300                     3,060
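
To make the comparison easier to see at a glance, here is a minimal sketch in Python that turns Natalja's hit counts into the share of hits carrying the incorrect spelling under each restriction. The numbers are copied from her results; the names counts, incorrect_share, and shares are merely labels chosen for this illustration, and nothing here queries Google.

```python
# Hit counts from Natalja's searches (Google, Jan 1, 2004).
# Each entry maps a search restriction to
# (hits for "transsexual", hits for "transexual").
counts = {
    "Any language or country": (2_240_000, 2_830_000),
    "English, any country": (1_400_000, 1_630_000),
    "English, USA": (1_280_000, 1_390_000),
    "English, UK": (14_300, 5_360),
    "English, Australia": (12_300, 3_060),
}


def incorrect_share(correct, incorrect):
    """Fraction of all hits for either spelling that use the incorrect one."""
    return incorrect / (correct + incorrect)


# Share of the incorrect spelling under each restriction.
shares = {name: incorrect_share(*pair) for name, pair in counts.items()}

for name, share in shares.items():
    print(f"{name:<26}{share:6.1%}")
```

On these figures the incorrect spelling accounts for a little over half the hits in the three broadest searches (roughly 56%, 54%, and 52%), and only about 27% for UK pages and 20% for Australian ones.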

Let me say a word first about the results I originally cited, and how I got them. I took a quick and simple route (perhaps too quick and simple). I went to GoogleDuel, a student experiment site that compares two given words or phrases with regard to their occurrences on Google, typed in the two words, and hit Go. No prizes for sophistication there.

Now, Natalja has used her human intelligence to figure out that if you restrict yourself to British and Australian sites, the correct spelling does come up as more frequent. But of course, she needed to know in advance what the correct spelling was, and she needed to use some of her general knowledge (like the number of Spanish pages, and the frequency of non-US-controlled .com sites) in order to come to her conclusion. When you use the web as a corpus (and that idea is the theme for the whole of the latest issue of the journal Computational Linguistics published by the Association for Computational Linguistics), you have to use it with care, and intelligence, and caution. And that, precisely the point I was originally making, is a point on which I think we agree.

Posted by Geoffrey K. Pullum at January 7, 2004 01:07 PM