November 16, 2005

Googlinguistics: the good, the bad, and the ugly

Language Log popped up rather unexpectedly today in an entry on Peter Suber's Open Access News, an informative blog collecting reports on the open access movement. Suber links to an article in the Cornell Daily Sun by Elise Kramer on the potential uses of Google for social scientists. Here's the relevant passage:

If linguistics is more your thing, Google's index of billions of webpages provides insight into how people across the globe use written language. A variety of academics at the Language Log, a linguistics blog, use Google to assess common usage — for example, how often the word "guttural" is used incorrectly (pretty often), or whether people more frequently say "in the circumstances" or "under the circumstances" (it depends on the circumstances, so to speak).

It's interesting that the writer singled out these particular posts ("Guttural politics" and "In or under") as her examples of Googlinguistics in action. Both entries do indeed illustrate the possibilities of using Google for rough-and-ready corpus linguistics (as previously discussed in The Economist back in January), but they also point to some severe limitations that go unmentioned in the article.

In my post on the shifting semantics of the word guttural under influence from both gutter and gut, I invoked Google in a very limited way, only mentioning the search engine in a parenthetical note about instances of "guttural/gutteral reaction" and "guttural/gutteral instinct." Of course, Google-searching informed other parts of the post in a more implicit manner (for instance, I Googled to find examples of guttural used as a pejorative tag for unpleasant speech patterns, and also as a description of nonlinguistic vocalizations). But I wouldn't go so far as to say I was using Google to "assess common usage" to determine "how often the word 'guttural' is used incorrectly." Google could do no more for me than provide some anecdotal (though enlightening) evidence about changes in usage of this particular term. (Also, good descriptivist that I am, I never weighed in on which particular usage should be labeled "incorrect" — and even if I did, I wouldn't necessarily rely on Google to tell me "how often" it occurs!)

Arnold Zwicky's post on "in the circumstances" vs. "under the circumstances" did use Google in a more systematic manner, with calculated ratios of Googlehits for "in..." vs. "under..." in a variety of collocational contexts. Arnold used the Google data to form some preliminary conclusions, but he acknowledged that his analysis "just scratches the surface of the phenomenon." Though I found the conclusions quite intriguing, I would have preferred some additional caveats, since many of the Googlecounts extend into the hundreds of thousands or even millions, magnitudes at which the counts have proven to be quite unreliable. (See the roundup of Language Log commentary on Googlecount problems at the end of this post.)

When Mark Liberman and Jean Véronis were investigating shortcomings in Googlecounts back in January, they cautioned against expecting much reliability in counts above about 100,000, particularly when boolean operators were at play. Since then, the prospects for even the roughest kind of Googlinguistics have only become more dire.
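For readers who want to see the arithmetic, here is a minimal Python sketch of a hit-count ratio with a flag for counts above the roughly 100,000 level mentioned above. It is only an illustration: the function name, the constant, and the sample numbers are hypothetical placeholders, not figures from Arnold's post or from Google.

```python
# Hypothetical ceiling, taken from the "above about 100,000" caution.
RELIABILITY_CEILING = 100_000


def hit_ratio(in_count, under_count):
    """Return (ratio, within_ceiling) for two raw hit counts.

    ratio is in_count divided by under_count; within_ceiling is False
    when either count exceeds RELIABILITY_CEILING.
    """
    if under_count == 0:
        raise ValueError("under_count must be nonzero to form a ratio")
    ratio = in_count / under_count
    within_ceiling = max(in_count, under_count) <= RELIABILITY_CEILING
    return ratio, within_ceiling


# Made-up counts, for illustration only.
ratio, within_ceiling = hit_ratio(1_200, 4_800)
print(f"in/under ratio: {ratio:.2f}, counts within ceiling: {within_ceiling}")
```

A ratio built from counts that fail the check is not necessarily wrong, but it deserves an explicit caveat.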

In August, Google announced that it was using "softer pattern matching" to make searching more effective. Google never explained exactly what this entails, beyond a mysterious change in the already troublesome use of the asterisk as a full-word wildcard. (Try searching on strings like "fourscore * ago" and "fourscore * * fathers" and figuring out how many "filler" words an asterisk can stand for!) Google also introduced automatic stemming for plurals (try "seven years ago our" + father), tense endings (try "conceived in liberty and" + dedicate), and some derivational suffixes (try "all men are created" + equally). All this may indeed improve results for the average searcher, but it makes large Googlecounts even less meaningful from a computational standpoint.
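To see why stemming muddies the counts, consider a toy Python sketch. Google has not published its algorithm, so the suffix stripper below is a deliberately naive stand-in, and the function name and word list are invented for illustration. The point is only that once a search term is matched by stem rather than by exact form, a count for one word silently absorbs its relatives.

```python
def toy_stem(word):
    """Crude suffix stripper for illustration; NOT Google's algorithm."""
    stem = word.lower()
    for suffix in ("ly", "ed", "s"):
        # Strip at most one suffix, and keep at least three letters.
        if stem.endswith(suffix) and len(stem) - len(suffix) >= 3:
            stem = stem[: -len(suffix)]
            break
    # Drop a final "e" so that "dedicate" and "dedicated" share a stem.
    if stem.endswith("e") and len(stem) > 3:
        stem = stem[:-1]
    return stem


# Invented mini-corpus of word tokens.
tokens = ["father", "fathers", "fathers", "dedicate", "dedicated"]

exact_hits = sum(token == "father" for token in tokens)
stemmed_hits = sum(toy_stem(token) == toy_stem("father") for token in tokens)
print(f"exact: {exact_hits}, stemmed: {stemmed_hits}")
```

The stemmed tally comes out larger than the exact one, and nothing in the reported number says which kind of match produced it.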

Despite these criticisms, I'm actually an enormous Google fan (developments in Google Print [*] are particularly exciting for linguists and lexicographers). But unless Google starts offering search services specifically designed for researchers concerned with standards of precision, I despair for the future of Googlinguistics. Signs are not particularly positive from Google's camp. In Carl Bialik's Sep. 15 "Numbers Guy" column in the online Wall Street Journal, Peter Norvig, Google's director of search quality, had this to say about the unreliability of Googlecounts: "It's only reporters and computational linguists who care if it's really precise." Well, at least they know that computational linguists care! Perhaps that's the first step.

[* Update 11/17/05: Make that Google Book Search.]

[Update 11/18/05: It's been pointed out to me that Google's automatic stemming can be avoided by prefixing a search term with a plus sign. So, for example, <"all men are created" +equally> will not return matches for "all men are created equal." But there doesn't seem to be any way around the algorithm that allows an asterisk to stand for two or three words in a search string rather than a single one.]

Posted by Benjamin Zimmer at November 16, 2005 04:46 PM