Language Log: Pass the hát.

May 29, 2005

a -a = 35,800,000. According to Google that is. The "-" sign is advertised by Google as a way to remove stuff from a search. So you would have thought that any string of the form X -X would produce 0 hits. But it doesn't. Try it: a -a.

Or try espanol -espanol.

Or achete -achete: infantile as I am, I really like this one, since it produces 1,890,000 hits, while Google helpfully suggests the alternative acheter -acheter, which produces no hits. Surely that is a new record of bad performance for a search-enhancing feature.

Or else try agreable -agreable, which yields 1,450,000 hits. Restricting a search for agreable to French pages only also yields 1,450,000 hits. Coincidence, n'est-ce pas?

By now you must think you've figured out what Google is doing. Simple positive queries return hits that ignore diacritics, while the negation operator removes stuff that perfectly matches its ASCII argument, right? But Google search results are never as simple as they appear.
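To make that guess concrete, here is a toy model of it in Python. It is only a sketch of the hypothesis stated above, not a description of what Google actually does; the function names and the three sample pages are invented for illustration.

```python
import unicodedata


def strip_diacritics(word: str) -> str:
    """Fold a word to its unaccented form, e.g. 'hát' -> 'hat'."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def guessed_search(pages, include, exclude):
    """Hypothetical matcher: positive terms ignore diacritics,
    negated terms remove only pages containing an exact match."""
    hits = []
    for page in pages:
        # Normalize so that accented letters compare reliably.
        words = [unicodedata.normalize("NFC", w).lower() for w in page.split()]
        folded = {strip_diacritics(w) for w in words}
        has_all = all(strip_diacritics(t.lower()) in folded for t in include)
        has_excluded = any(
            unicodedata.normalize("NFC", t).lower() in words for t in exclude
        )
        if has_all and not has_excluded:
            hits.append(page)
    return hits


# Invented sample "web" of three pages.
pages = ["une soirée agréable", "agreable sans accent", "red hat linux"]

# Under the guess, X -X keeps exactly the accented spellings of X.
print(guessed_search(pages, ["agreable"], ["agreable"]))
# A word with no accented variants on any page gives nothing.
print(guessed_search(pages, ["hat"], ["hat"]))
```

Under this model, agreable -agreable returns only the page spelled with the accent, which is the pattern the French examples suggest.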

First, try resume -resume. No hits. Weird. You might speculate that the earlier explanation is right, but words that Google thinks are of financial value, i.e. adwords that might be sold to a vendor, are indexed using a different algorithm that ignores diacritics completely.

No, no, no, a thousand times no. You still don't get it, do you? Google hits are not only never as simple as they appear, they are never simple at all. Although the problems with Google's wildcard * described in this earlier post of mine seem to be all but fixed, Google's secret algorithms are still tied up in knots that nobody understands. You don't believe me? Try some searches that include accents:

Query	Hits	Notes
hat	172,000,000	Matches "hat" and who knows what else.
hat hat	52,500,000	Matches "hat", but now without mistaken Google extrapolation. For an explanation of extrapolation see this earlier Language Log post (and this one too), based on a superb analysis by Jean Véronis.
hát	52,500,000	Similar to the above, but the results are at least ordered differently. Could diacritics be another trick we could use to remove mistaken extrapolation without repeating the whole query?
hát -hat	664,000	Matches hát with accent, loads of Vietnamese hits. It means "to sing", apparently.
hat -hát	52,400,000	Similar to hát, but different, e.g. our friend languagehat is on the first page!
hát -hát	52,400,000	Like the previous one, except "red hat" is higher ranked.
hat -hat	0	Uh, yeah, right.
hát hát -hat	11,200,000	My head aches.
hát hát hát -hat	11,300,000	Your head aches.
hát hat -hat	11,300,000	What do you mean you knew that would happen?
hát -hat -hát	0	Oh, OK, I think I get it, perfectly logical after all...
hát hát -hat -hát	11,200,000	Seems sensible, same number as hát hát -hat. But wait! Those hits were Vietnamese and these are in English. And none of them are for "hat". They are for "hats". Plural. WTF?
hát hát -hat -hats	671,000	OK, so there were "hats" in the "hát hát -hat" count, albeit not in any pages of hits I sampled.
hát hát -hat -hát -hats	0	Back to zero again! But what were the hats doing there in the first place?
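One way to see how far these numbers stray from ordinary Boolean logic is to check them mechanically. The sketch below assumes the textbook semantics that Google advertises, not whatever it really implements: repeating a term changes nothing, and adding a positive or negated term can never increase the number of hits. The counts are copied from the table above; the helper names are made up for this example.

```python
from itertools import permutations

# Hit counts copied from the table above (May 2005).
COUNTS = {
    "hat": 172_000_000,
    "hat hat": 52_500_000,
    "hát": 52_500_000,
    "hát -hat": 664_000,
    "hat -hát": 52_400_000,
    "hát -hát": 52_400_000,
    "hat -hat": 0,
    "hát hát -hat": 11_200_000,
    "hát hát hát -hat": 11_300_000,
    "hát hat -hat": 11_300_000,
    "hát -hat -hát": 0,
    "hát hát -hat -hát": 11_200_000,
    "hát hát -hat -hats": 671_000,
    "hát hát -hat -hát -hats": 0,
}


def parse(query):
    """Split a query into sets of required and excluded terms.
    Accented and unaccented spellings are treated as distinct terms."""
    required, excluded = set(), set()
    for token in query.split():
        if token.startswith("-"):
            excluded.add(token[1:])
        else:
            required.add(token)
    return frozenset(required), frozenset(excluded)


def violations(counts):
    """Return (looser, tighter) query pairs where the tighter query,
    which has at least the same constraints, reports MORE hits."""
    bad = []
    for looser, tighter in permutations(counts, 2):
        req_a, exc_a = parse(looser)
        req_b, exc_b = parse(tighter)
        if req_a <= req_b and exc_a <= exc_b and counts[tighter] > counts[looser]:
            bad.append((looser, tighter))
    return bad


for looser, tighter in violations(COUNTS):
    print(f"{tighter!r} ({COUNTS[tighter]:,}) > {looser!r} ({COUNTS[looser]:,})")
```

Every line it prints is a pair of queries whose counts cannot both be right if the operators mean what they are supposed to mean. For instance, hát hát -hat reports far more hits than hát -hat, even though the two queries impose the same constraints.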

This is all scary stuff if, like me, you want to use Google counts for complex boolean searches to get a statistical handle on how language works. Using Google to measure language frequencies is like trying to measure the circumference of the Earth by putting live snakes of unknown length end to end around the equator. If I tell you the answer is 68,000,000 snakes, will you be any the wiser?

Posted by David Beaver at May 29, 2005 03:44 AM