January 28, 2004

Google frequency agen (um, agin) (oops, again)

Apparently linguists aren't the only ones out there using Google frequencies to make inferences about language. So the community of people who need to be warned to do such things with care is wider than we thought. Here's an example, plus a few constructive ideas.

Jason Eisner pointed me to a story in today's NY Times, which describes how some sellers on eBay are losing money by failing to spell their offers correctly. As an example, one eBay auctioneer was unable to sell a pair of chandelier earrings. (Linked in case you, like me, had no idea what these are. Useful information, come to think of it, what with Valentine's Day approaching. Curiously, though, my wife did the same search and came up with a different page.)

The problem was failure to use Google properly in doing a frequency comparison. From the Times article:

    Ms. Marshall, who lives in Dallas, said she knew she was on shaky ground when she set out to spell chandelier. But instead of flipping through a dictionary, she did an Internet search for chandaleer and came up with 85 or so listings.

    She never guessed, she said, that results like that meant she was groping in the spelling wilderness. Chandelier, spelled right, turns up 715,000 times.

Apparently eBay does try to warn users when they are using a common misspelling, but

    wrong spellings can also turn up similar misspellings, so that buyers and sellers frequently read past the Web site's slightly bashful line asking, by any chance, "Did you mean . . . chandelier?"

For what it's worth, three solutions come to mind. First, a simple Google search may result in Google's version of "Did you mean...". If the suggested correction is the same as eBay's suggested correction, that should increase your confidence that it's right -- two data points are better than one.

Second, if you're going to use search engine frequencies, at least try a bunch of different alternatives. And if the tallies are a closer call than 715,000-to-85, it's easy to do an on-line statistical test to see if the difference is significant. Read details about one such test, or just try one of the many on-line applets that let you do the computation. For example, here's a link to one particularly easy-to-use tool. Should you trust a Google frequency difference between, say, 5000 hits and 6000 hits? Enter X1=5000, X2=6000, and both N1 and N2 as 3307998701 (from the bottom of http://www.google.com). Click "Submit". If the p value is less than 0.05, conventional statistical wisdom says you can trust that the difference between 5000 hits and 6000 hits did not happen by chance. (I'm sure someone can suggest a better statistical test, taking advantage of the fact that N1=N2. And of course beware that just because something is statistically significant, it doesn't necessarily mean it's meaningful! With such a huge N, even relatively small differences will give you significance on this test.)
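
For the curious, here's a rough sketch of one such test, a two-proportion z-test, written out in Python. This is just my illustration of the idea -- the choice of test and language are mine, and it isn't necessarily the same computation the applet above performs:

    from math import sqrt, erfc

    def two_proportion_z_test(x1, n1, x2, n2):
        """Test whether counts x1-out-of-n1 and x2-out-of-n2 differ significantly."""
        p1, p2 = x1 / n1, x2 / n2
        # Pooled proportion under the null hypothesis that both counts
        # reflect the same underlying rate.
        p = (x1 + x2) / (n1 + n2)
        se = sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
        z = (p1 - p2) / se
        # Two-tailed p-value from the standard normal distribution.
        p_value = erfc(abs(z) / sqrt(2))
        return z, p_value

    # The example above: 5000 vs. 6000 hits, with N1 = N2 taken to be
    # the index size reported at the bottom of Google's home page.
    n = 3307998701
    z, p = two_proportion_z_test(5000, n, 6000, n)
    print(f"z = {z:.2f}, p = {p:.2g}")

Plugging in those numbers gives |z| of roughly 9.5 and a vanishingly small p, which is exactly the caveat above in action: with N in the billions, even a 5000-versus-6000 split comes out wildly "significant".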

The third alternative, of course, would be to use the dictionary.

Posted by Philip Resnik at January 28, 2004 11:55 AM