January 24, 2005

Google recall (They stole his mind, now he wants it back.)

Google does weird stuff with disjunctive searches, the stuff Mark Liberman just discussed here, and I don't know why. I am impressed by Geoff Nunberg's partial explanation, that only when the results count is low (under 1000, he says) is Google's estimate based on something similar to actual counting. I'm even more impressed by his cunning workaround, of adding garbage search terms to keep the count low, and then extrapolating.

But this still leaves open the question of what on Earth, or elsewhere, Google does to produce lower numbers for disjunctive searches like quantum|mice (= "about 15,700,000") than for searches on the individual disjuncts quantum (= "about 22,600,000") and mice (= "about 22,000,000"). No obvious numeric combination of the results for conjunctions, disjuncts and exclusively tested disjuncts (e.g. quantum -mice) produces the number of hits given for the disjunction. This is a matter of some importance for me, since I'm currently working on a dataset which involves cross-linguistic quantitative assessment of disjunctive Google searches. I can manage without the disjunctions, but it's a lot slower.
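To spell out why the number is odd: every page matching either disjunct also matches the disjunction, so the true count for quantum|mice can be no smaller than the larger of the two individual counts and no larger than their sum. Here is a minimal Python sketch of that sanity check, using the estimates quoted above. The function name is made up for illustration, and the check assumes the individual counts are themselves roughly right, which is of course part of what is in doubt.

```python
def disjunction_bounds(count_a, count_b):
    """Return (lowest, highest) possible true count for the query A|B."""
    lowest = max(count_a, count_b)   # B adds nothing new: all B pages are A pages
    highest = count_a + count_b      # no page matches both A and B
    return lowest, highest

# Google's reported estimates, as quoted in the text
quantum, mice, reported = 22_600_000, 22_000_000, 15_700_000

lowest, highest = disjunction_bounds(quantum, mice)
print(lowest, highest)                 # 22600000 44600000
print(lowest <= reported <= highest)   # False
```

The reported figure falls some seven million short of even the lowest possible value, so no amount of overlap between the two result sets can account for it.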

Since I'm just a linguist and don't know how Google works, it's really not safe for me to speculate on why Google is as Google does. So fasten your safety belts!

Suppose that Google's approach to a search request you make is like this:
  • you have connected to some node on Google's network, i.e. a processor somewhere
  • a check is made in a database to see if the search has already been made recently, in which case you get back the canned response from the previous search
  • your processor sends out a bunch of requests to other Google servers
  • your processor spends a little time waiting for results to come in; if it's still waiting after, say, 0.2 seconds, it makes an estimate of the total number of hits based on the number of responses it received in that time, while if the number of results is small it actually tallies them
Since complex (boolean or string) searches are harder to process than one-word searches, it might be expected that the response rate is lower for complex searches than for single-word searches that have a similar actual web frequency. And that should artificially lower your personal Google node's estimate of the result count. This is precisely what Mark, c/o Jean Véronis, showed. Note also that the above pseudo-algorithm predicts that a first search on a term will tend to produce a slower response than later searches, because look-up is used for the later searches. I did a tiny study of this using obviously rare disjunctive searches, and it appeared to hold up. Initial searches clock in at around 0.25-0.4 seconds, while later searches take 0.08-0.2 seconds. (These figures are anyway a marvel of engineering which even after years of Google use still blows me away.)
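For the programmatically inclined, the guess above can be written down as a toy Python model. Everything in it is an assumption made for illustration: the names, the 0.2 second deadline, the 1000-hit threshold (Geoff's figure), and above all the idea that the node simply adds up whatever arrives before the deadline. It says nothing about what Google really does.

```python
DEADLINE = 0.2          # seconds the node waits (the guess in the text)
EXACT_THRESHOLD = 1000  # below this, tally exactly (Nunberg's figure)

cache = {}  # hypothetical store of recently answered queries

def estimate_hits(query, replies):
    """replies: one (latency_seconds, hits) pair per server asked.

    In this toy model all replies are known up front; a real node
    would only learn about them as they arrive.
    """
    if query in cache:
        return cache[query], "cached"
    total = sum(hits for _, hits in replies)
    if total < EXACT_THRESHOLD:
        count, kind = total, "exact"
    else:
        # Only replies that beat the deadline contribute to the estimate.
        count = sum(hits for latency, hits in replies if latency <= DEADLINE)
        kind = "estimate"
    cache[query] = count
    return count, kind

# Same true frequency (1,500,000), but the complex query's servers answer late.
fast = [(0.05, 500_000), (0.10, 500_000), (0.15, 500_000)]
slow = [(0.05, 500_000), (0.25, 500_000), (0.30, 500_000)]

simple = estimate_hits("mice", fast)
complex_ = estimate_hits("quantum|mice", slow)
repeat = estimate_hits("quantum|mice", slow)
print(simple)    # (1500000, 'estimate')
print(complex_)  # (500000, 'estimate')
print(repeat)    # (500000, 'cached')
```

In the model, the slower query gets the lower estimate even though both queries have the same true count, and the repeated query is answered from the cache without consulting the servers at all.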

Yahoo search appears to work much better on counting for disjunctive queries (though not string searches, which is what I REALLY care about). Even if it does do something similar to Google, perhaps it does what seems most obvious: splitting the disjunctive search into multiple non-disjunctive searches, making them separately, and then combining the results. I can only conjecture (you didn't unbuckle your safety belt, did you...) that Google's PageRank algorithm makes recombination of separated disjunctive searches unattractive.
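The "most obvious" split-and-combine strategy is easy to sketch in Python. This is only an illustration of the idea, not a claim about how Yahoo or anyone else implements it: the tiny index of page numbers and both function names are invented.

```python
def search(term, index):
    """Return the set of page IDs matching a single term (toy lookup)."""
    return index.get(term, set())

def disjunctive_count(terms, index):
    """Run one search per disjunct, then count the union of the results."""
    pages = set()
    for term in terms:
        pages |= search(term, index)  # union, so shared pages count once
    return len(pages)

# Hypothetical index: pages 3 and 4 mention both words.
index = {"quantum": {1, 2, 3, 4}, "mice": {3, 4, 5}}

print(disjunctive_count(["quantum", "mice"], index))  # 5
```

Counted this way, a disjunction can never come out smaller than its largest disjunct, which is exactly the property Google's numbers lack.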

If you have a better explanation of Google's behavior, please send it to dib at-sign stanford DOT edu. Unless of course you know the truth, which would be cheating. In that case send it to Mark and Geoff right away. And if you think Google is confusing, just look at the plot outline for Total Recall:

What is reality when you can't trust your memory. Arnold Schwarzenegger is an Earthbound construction worker who keeps having dreams about Mars. A trip to a false memory transplant service for an imaginary trip to Mars goes terribly wrong and another personality surfaces. When his old self returns, he finds groups of his friends and several strangers seem to have orders to kill him. He finds records his other self left him that tell him to get to Mars to join up with the underground. The reality of the situation is constantly in question. Who is he? Which personality is correct? Which version of reality is true? (John Vogel's summary, originally here.)

Posted by David Beaver at January 24, 2005 03:37 AM