Google recall (They stole his mind, now he wants it back.)
Google does weird stuff with disjunctive searches, the stuff Mark
Liberman just discussed here, and I don't know why. I am impressed by
Geoff Nunberg's partial explanation, that only when the results count
is low (under 1000, he says) is Google's estimate based on something
similar to actual counting. I'm even more impressed by his cunning
workaround, of adding garbage search terms to keep the count low, and
then extrapolating. But this still leaves open the question of what on
Earth, or elsewhere, Google does to produce lower numbers for
disjunctive searches like quantum|mice (= "about 15,700,000") than for
searches on the individual disjuncts quantum (= "about 22,600,000") and
mice (= "about 22,000,000"). No obvious numeric combination of the
results for conjunctions, disjuncts and exclusively tested disjuncts
(e.g. quantum -mice) produces the number of hits given for the
disjunction. This
is a matter of some importance for me, since I'm currently working on a
dataset which involves cross-linguistic quantitative assessment of
disjunctive Google searches. I can manage without the disjunctions, but
it's a lot slower.
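One way to see why the disjunctive figure looks wrong is a simple
set-theoretic sanity check. The pages matching quantum|mice are the
union of the pages matching each word, so a true count can be no
smaller than the larger single-word count and no larger than the sum of
the two. Here is a minimal Python sketch using the estimates quoted
above; the function name is mine, not anything Google exposes.

```python
def union_bounds(count_a, count_b):
    """Return (lowest, highest) possible size of A-or-B given |A| and |B|."""
    lowest = max(count_a, count_b)   # one set entirely contains the other
    highest = count_a + count_b      # the two sets do not overlap at all
    return lowest, highest


quantum = 22_600_000          # Google's "about" figure for quantum
mice = 22_000_000             # Google's "about" figure for mice
quantum_or_mice = 15_700_000  # Google's "about" figure for quantum|mice

lowest, highest = union_bounds(quantum, mice)
consistent = lowest <= quantum_or_mice <= highest
print(f"a consistent count lies between {lowest:,} and {highest:,}")
print(f"reported disjunction count is consistent: {consistent}")
```

The reported disjunction count falls below the lower bound, which is
why no combination of the other figures can reproduce it.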
Since I'm just a linguist and don't know how Google works, it's really
not safe for me to speculate on why Google is as Google does. So fasten
your safety belts!
Suppose that Google's approach to a search request you make is like
this:
- you have connected to some node on Google's network, i.e. a
processor somewhere
- a check is made in a database to see if the search has already
been made recently, in which case you get back the canned response from
the previous search
- your processor sends out a bunch of requests to other Google
servers
- your processor spends a little time waiting for results to come
in; if it's still waiting after, say, 0.2 seconds, it makes an
estimate of the total number of hits based on the number of responses
it received in that time, while if the number of results is small it
actually tallies them
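The guessed-at steps above can be written down as a toy simulation in
Python. Everything here is an assumption made for illustration: the
class and function names, the 0.2 second deadline, the server
latencies and the hit counts are all invented, and for simplicity the
toy peeks at the true total to decide whether a result set is "small".
It is a sketch of my speculation, not of what Google actually does.

```python
from dataclasses import dataclass

DEADLINE = 0.2        # seconds the node is willing to wait (a guess)
SMALL_RESULT = 1000   # below this, tally properly (Geoff's threshold)


@dataclass
class ShardReply:
    hits: int        # matching pages held by one hypothetical server
    latency: float   # seconds that server needs to answer


class SearchNode:
    def __init__(self):
        self.cache = {}  # query -> count returned by an earlier search

    def count(self, query, replies):
        # A recent identical search gets the canned response.
        if query in self.cache:
            return self.cache[query]
        total = sum(r.hits for r in replies)
        if total < SMALL_RESULT:
            # Small result sets are waited for and tallied exactly.
            result = total
        else:
            # Otherwise only replies that beat the deadline are counted.
            result = sum(r.hits for r in replies if r.latency <= DEADLINE)
        self.cache[query] = result
        return result


def shards(hits_each, latencies):
    """Build one reply per server, all holding the same number of hits."""
    return [ShardReply(hits_each, t) for t in latencies]


node = SearchNode()
# Same true frequency (4 servers x 5,000,000 hits each), but the
# disjunctive query is assumed to be slower on every server.
simple = node.count("quantum", shards(5_000_000, [0.05, 0.10, 0.15, 0.18]))
disjunctive = node.count("quantum|mice", shards(5_000_000, [0.10, 0.19, 0.25, 0.30]))
print(simple, disjunctive)
```

In this toy the slower disjunctive query is reported at half the size
of the single-word query, purely because two of its four servers miss
the deadline.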
Since complex (boolean or string) searches are harder to process than
one word searches, it might be expected that the response rate is lower
for complex searches than for single word searches that have a similar
actual web frequency. And that should artificially lower your personal
Google node's estimate of the result count. This is precisely what
Mark, c/o Jean Véronis, showed. Note also that the above
pseudo-algorithm predicts that a first search on a term will tend to
produce a slower response than later searches, because look-up is used
for the later searches. I did a tiny study of this using obviously rare
disjunctive searches, and it appeared to hold up. Initial searches
clock in at around 0.25-0.4 seconds, while later searches take 0.08-0.2
seconds. (These figures are in any case a marvel of engineering, which
even after years of Google use still blows me away.)
Yahoo search appears to work much better on counting for disjunctive
queries (though not string searches, which is what I REALLY care
about). Even if it does do something similar to Google, perhaps it
does what seems most obvious: splitting the disjunctive search into
multiple non-disjunctive searches, making them separately, and then
combining the results. I can only conjecture (you didn't unbuckle your
safety belt, did you?) that Google's PageRank algorithm makes
recombination of separated disjunctive searches unattractive.
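The "most obvious" approach can be sketched in a few lines of Python.
This is a toy: the miniature index and the function names are made up
for illustration, and I have no idea whether Yahoo does anything like
it.

```python
def search(index, word):
    """Return the set of page ids containing `word` in a toy index."""
    return index.get(word, set())


def disjunctive_count(index, words):
    """Run one non-disjunctive search per word, then merge the results."""
    pages = set()
    for word in words:
        pages |= search(index, word)  # set union drops pages found twice
    return len(pages)


# Hypothetical miniature index: word -> ids of the pages that contain it.
toy_index = {
    "quantum": {1, 2, 3, 4},
    "mice": {3, 4, 5},
}

print(disjunctive_count(toy_index, ["quantum", "mice"]))  # 5
```

Because the merge is a set union, the combined count can never come out
lower than the count for either word alone.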
If you have a better explanation of Google's behavior, please send it
to dib at-sign stanford DOT edu. Unless of course you know the truth,
which would be cheating. In that case send it to Mark and Geoff right
away. And if you think Google is confusing, just look at the plot
outline for Total Recall:
What is reality when you can't trust your memory? Arnold
Schwarzenegger is an Earthbound construction worker who keeps having
dreams about Mars. A trip to a false memory transplant service for an
imaginary trip to Mars goes terribly wrong and another personality
surfaces. When his old self returns, he finds groups of his friends and
several strangers seem to have orders to kill him. He finds records his
other self left him that tell him to get to Mars to join up with the
underground. The reality of the situation is constantly in question.
Who is he? Which personality is correct? Which version of reality is
true? (John Vogel's summary, originally here.)
Posted by David Beaver at January 24, 2005 03:37 AM