Pass the hát.
a -a = 35,800,000. According to Google that is. The "-" sign is advertised by Google as a way to remove stuff from a
search. So you would have thought that any string of the form X -X
would produce 0 hits. But it doesn't. Try it: a -a. Or try espanol -espanol.
Or achete -achete: infantile as I am, I really like this one since it
produces 1,890,000 hits, while Google helpfully suggests the alternative
acheter -acheter, which produces no hits, surely a new record of bad
performance for a search-enhancing feature.
Or else try agreable -agreable, which yields 1,450,000 hits. Restricting a
search for agreable to French pages only also yields 1,450,000 hits.
Coincidence, n'est-ce pas?
By now you must think you've figured out what Google is doing. Simple
positive queries return hits that ignore diacritics, while the negation
operator removes stuff that perfectly matches its ASCII argument, right?
But Google search results are never as simple as they appear.
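That guess can be written down as a toy model. This is only a sketch of the hypothesis, not of anything Google is known to do: the `fold` and `matches` helpers and the three sample pages are invented for illustration.

```python
import unicodedata

def fold(word):
    """Strip combining accents: 'agréable' -> 'agreable'."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def matches(query, page_words):
    """Hypothesized behavior (an assumption, not Google's documented one):
    positive terms ignore diacritics, negated terms exclude exact matches only."""
    exact = {unicodedata.normalize("NFC", w) for w in page_words}
    folded = {fold(w) for w in exact}
    for term in unicodedata.normalize("NFC", query).split():
        if term.startswith("-"):
            if term[1:] in exact:
                return False
        elif fold(term) not in folded:
            return False
    return True

# Three made-up pages, each reduced to a set of words.
pages = [
    {"un", "moment", "agréable"},  # accented spelling
    {"tres", "agreable"},          # unaccented spelling
    {"mon", "résumé"},
]

def hits(query):
    """Return the sample pages that the toy model says match the query."""
    return [p for p in pages if matches(query, p)]

print(len(hits("agreable -agreable")))  # 1: only the accented page survives
print(len(hits("resume -resume")))      # 1: the model predicts a hit here too
```

Under this model X -X returns exactly the pages that spell X with accents, which fits agreable -agreable. It also predicts hits for any unaccented query whose accented spelling appears on some page, so it is worth testing against more queries before trusting it.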
First, try resume -resume. No hits. Weird. You might speculate
that the earlier explanation is right, but words that Google thinks are
of financial value, i.e. adwords that might be sold to a vendor, are
indexed using a different algorithm that ignores diacritics completely.
No, no, no, a thousand times no. You still don't get it, do you? Google
hits are not only never as simple as they appear, they are never simple
at all. Although the problems with Google's wildcard * described in this
earlier post of mine seem to be all but fixed, Google's secret
algorithms are still tied up in knots that nobody understands. You
don't believe me? Try some searches that include accents:
Query | Hits | Comment
hat | 172,000,000 | Matches "hat" and who knows what else.
hat hat | 52,500,000 | Matches "hat", but now without mistaken Google extrapolation. For explanation of extrapolation see this earlier language log post (and this one too), based on a superb analysis by Jean Veronis.
hát | 52,500,000 | Similar to above, but the results are at least ordered differently. Could diacritics be another trick we could use to remove mistaken extrapolation without repeating the whole query?
hát -hat | 664,000 | Matches hát with accent, loads of Vietnamese hits. It means "to sing", apparently.
hat -hát | 52,400,000 | Similar to hát, but different, e.g. our friend languagehat is on the first page!
hát -hát | 52,400,000 | Like the previous one, except "red hat" is higher ranked.
hat -hat | 0 | Uh, yeah, right.
hát hát -hat | 11,200,000 | My head aches.
hát hát hát -hat | 11,300,000 | Your head aches.
hát hat -hat | 11,300,000 | What do you mean you knew that would happen?
hát -hat -hát | 0 | Oh, OK, I think I get it, perfectly logical after all...
hát hát -hat -hát | 11,200,000 | Seems sensible, same number as hát hát -hat. But wait! Those hits were Vietnamese and these are in English. And none of them are for "hat". They are for "hats". Plural. WTF?
hát hát -hat -hats | 671,000 | OK, so there were "hats" in the "hát hát -hat" count, albeit not in any pages of hits I sampled.
hát hát -hat -hát -hats | 0 | Back to zero again! But what were the hats doing there in the first place?
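One way to see how far these counts stray from ordinary Boolean logic is to check them mechanically. The sketch below assumes plain set semantics with exact string matching, which is my assumption and not a description of Google's engine: repeated terms should change nothing, a term that is both required and excluded should give zero, and adding terms or exclusions should never raise the count. The names `REPORTED`, `parse`, `contradictions` and `monotonic_violations` are made up for this example; the numbers are the ones reported above.

```python
import unicodedata

# Hit counts as reported above.
REPORTED = {
    "hat": 172_000_000,
    "hat hat": 52_500_000,
    "hát": 52_500_000,
    "hát -hat": 664_000,
    "hat -hát": 52_400_000,
    "hát -hát": 52_400_000,
    "hat -hat": 0,
    "hát hát -hat": 11_200_000,
    "hát hát hát -hat": 11_300_000,
    "hát hat -hat": 11_300_000,
    "hát -hat -hát": 0,
    "hát hát -hat -hát": 11_200_000,
    "hát hát -hat -hats": 671_000,
    "hát hát -hat -hát -hats": 0,
}

def parse(query):
    """Split a query into (required terms, excluded terms) as frozensets."""
    required, excluded = set(), set()
    for term in unicodedata.normalize("NFC", query).split():
        if term.startswith("-"):
            excluded.add(term[1:])
        else:
            required.add(term)
    return frozenset(required), frozenset(excluded)

parsed = {q: parse(q) for q in REPORTED}

# Rule 1: requiring and excluding the same term should match nothing.
contradictions = [
    q for q, (req, exc) in parsed.items() if req & exc and REPORTED[q] > 0
]

# Rule 2: if query b demands everything a does (and possibly more),
# b should never report more hits than a.
monotonic_violations = [
    (a, b)
    for a, (req_a, exc_a) in parsed.items()
    for b, (req_b, exc_b) in parsed.items()
    if a != b and req_a <= req_b and exc_a <= exc_b and REPORTED[b] > REPORTED[a]
]

for q in contradictions:
    print(f"self-contradictory but nonzero: {q!r} -> {REPORTED[q]:,}")
for a, b in monotonic_violations:
    print(f"{b!r} ({REPORTED[b]:,}) should not exceed {a!r} ({REPORTED[a]:,})")
```

Because a query with the same required and excluded sets as another counts as "at least as restrictive" in both directions, Rule 2 also flags pairs such as hat versus hat hat, where repeating a word changed the total.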
This is all scary stuff if, like me, you want to use Google counts for
complex boolean searches to get a statistical handle on how language works.
Using Google to measure language frequencies is like trying to measure
the circumference of the Earth by putting live snakes of unknown length
end to end around the equator. If I tell you the answer is 68,000,000
snakes, will you be any the wiser?
Posted by David Beaver at May 29, 2005 03:44 AM