May 29, 2005

Pass the hát.

a -a = 35,800,000. According to Google, that is. The "-" sign is advertised by Google as a way to remove stuff from a search. So you would have thought that any string of the form X -X would produce 0 hits. But it doesn't. Try it: a -a.

Or try espanol -espanol.

Or achete -achete: infantile as I am, I really like this one, since it produces 1,890,000 hits, while Google helpfully suggests the alternative acheter -acheter, which produces no hits. Surely a new record of bad performance for a search-enhancing feature.

Or else try agreable -agreable, which yields 1,450,000 hits. Restricting a search for agreable to French pages only also yields 1,450,000 hits. Coincidence, n'est-ce pas?

By now you must think you've figured out what Google is doing. Simple positive queries return hits that ignore diacritics, while the negation operator removes stuff that perfectly matches its ASCII argument, right? But Google search results are never as simple as they appear.
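For concreteness, that tempting hypothesis can be written down as a toy model. This is only a sketch of the guess, not anything Google has documented: the function names `strip_diacritics` and `matches` are made up for illustration, and a "page" is assumed to be nothing more than a set of words.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Fold accented letters to their base letters, e.g. 'hát' -> 'hat'."""
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks that NFD splits off from the base letters.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def matches(query: str, page_words: set[str]) -> bool:
    """Hypothetical model of the guess (an assumption, not Google's algorithm):
    positive terms match ignoring diacritics, while negated terms exclude
    a page only when the page contains exactly that string."""
    folded_page = {strip_diacritics(word) for word in page_words}
    for term in query.split():
        if term.startswith("-"):
            # Negation: exact string match only, accents included.
            if term[1:] in page_words:
                return False
        elif strip_diacritics(term) not in folded_page:
            # Positive term: compare after folding accents on both sides.
            return False
    return True


# A French page with the accented spelling survives "agreable -agreable" ...
print(matches("agreable -agreable", {"agréable", "journée"}))
# ... while a page with the plain ASCII spelling is removed.
print(matches("agreable -agreable", {"agreable", "day"}))
```

Under this model X -X keeps exactly the pages that spell X with different accents, which would explain why agreable -agreable looks so much like a French-only search.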

First, try resume -resume. No hits. Weird. You might speculate that the earlier explanation is right, but words that Google thinks are of financial value, i.e. adwords that might be sold to a vendor, are indexed using a different algorithm that ignores diacritics completely.

No, no, no, a thousand times no. You still don't get it, do you? Google hits are not only never as simple as they appear, they are never simple at all. Although the problems with Google's wildcard * described in this earlier post of mine seem to be all but fixed, Google's secret algorithms are still tied up in knots that nobody understands. You don't believe me? Try some searches that include accents:

hat 172,000,000
matches "hat" and who knows what else.
hat hat
matches "hat", but now without mistaken Google extrapolation.
For an explanation of extrapolation see this earlier Language Log post (and this one too), based on a superb analysis by Jean Veronis.
hát 52,500,000
similar to above, but the results are at least ordered differently. Could diacritics be another trick we could use to remove mistaken extrapolation without repeating the whole query?
hát -hat 664,000
Matches hát with accent, loads of Vietnamese hits. It means "to sing", apparently.
hat -hát 52,400,000
Similar to hát, but different, e.g. our friend languagehat is on the first page!
hát -hát 52,400,000
Like the previous one, except "red hat" is higher ranked.
hat -hat
Uh, yeah, right.
hát hát -hat
My head aches.
hát hát hát -hat
Your head aches.
hát hat -hat
What do you mean you knew that would happen?
hát -hat -hát 0
Oh, OK, I think I get it, perfectly logical after all...
hát hát -hat -hát
Seems sensible, same number as hát hát -hat. But wait! Those hits were Vietnamese and these are in English. And none of them are for "hat". They are for "hats". Plural. WTF?
hát hát -hat -hats 671,000
OK, so there were "hats" in the "hát hát -hat" count, albeit not in any pages of hits I sampled.
hát hát -hat -hát -hats
Back to zero again! But what were the hats doing there in the first place?

This is all scary stuff if, like me, you want to use Google counts for complex boolean searches to get a statistical handle on how language works. Using Google to measure language frequencies is like trying to measure the circumference of the Earth by putting live snakes of unknown length end to end around the equator. If I tell you the answer is 68,000,000 snakes, will you be any the wiser?

Posted by David Beaver at May 29, 2005 03:44 AM