January 23, 2005

when things don't add up

I've been frustrated for a long time by the deficiencies in Google's hit-count estimation algorithms involving Boolean searches that Mark points to, citing Jean Véronis. (Another, possibly related problem arises when you do date-restricted searches in Google Groups, say for every year: when you sum the results you often get a wildly different total from what you get if you simply search on the same terms without date restrictions.)

My understanding is that this is old code that is low on Google's priority list, particularly since Google isn't going to actually show you more than 1000 hits for a given query in any case. There is a work-around to the problem with Boolean searches, though, if you're looking at fairly common terms and are interested only in determining relative frequencies.

The work-around is to run several searches that each include restrictors (extra, unrelated terms) that keep the hit count under 1000, which makes it more likely that Google will actually count all the hits.

For example, a search on "criticize" turns up "about 2,360,000" hits, and a search on "criticized" turns up "about 4,000,000." But a search on "criticize OR criticized" turns up "about 1,670,000," which is obviously not consistent with the other results: the disjunctive search should return at least as many hits as either term alone (though you shouldn't expect it to equal the exact sum of the individual searches, since some pages contain both terms).
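To make the inconsistency concrete, here is a minimal sketch of the sanity check implied above: an OR count should be no smaller than the larger single-term count and no larger than the sum of the two. The function name `union_bounds` and the variable names are my own, purely for illustration; the numbers are the estimates quoted above.

```python
def union_bounds(count_a: int, count_b: int) -> tuple[int, int]:
    """Return the (lowest, highest) plausible hit count for 'A OR B'.

    Lowest: every page with the rarer term also has the commoner one.
    Highest: no page contains both terms.
    """
    return max(count_a, count_b), count_a + count_b


# Google's reported estimates for the two single-term searches.
lower, upper = union_bounds(2_360_000, 4_000_000)

# Google's reported estimate for the disjunctive (OR) search.
reported_or = 1_670_000

is_consistent = lower <= reported_or <= upper
print(lower, upper, is_consistent)  # 4000000 6360000 False
```

The reported OR count falls below even the lower bound, which is why the estimate can't be right.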

But now let's add some irrelevant restrictors that reduce the hit counts to under 1000:

criticized OR criticize cleveland squash: 891

criticized cleveland squash: 637

criticize cleveland squash: 331

criticize criticized cleveland squash: 78

In this case, the sum of the results for the two individual searches (968) minus the result of the conjunctive search (78) comes out to almost exactly the same total (890) as the disjunctive search (891). Iterate this a couple of times using other restrictors that limit the totals to a similar range (e.g., "york dictionary" or "multilingual"), and you'll have a fairly reliable way of estimating the relative frequency of the two terms, or of comparing their combined frequency to that of some other term. But of course you still won't know with any certainty how many pages Google has indexed in absoluto that contain the items.
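The arithmetic here is just inclusion-exclusion: pages with A or B = pages with A + pages with B - pages with both. A minimal sketch of the check, using the "cleveland squash" counts from this post (the function and variable names are hypothetical, chosen only for illustration):

```python
def predicted_union(count_a: int, count_b: int, count_both: int) -> int:
    """Inclusion-exclusion: hits for 'A OR B' implied by the other three counts."""
    return count_a + count_b - count_both


# Hit counts reported with the restrictor "cleveland squash".
criticized = 637
criticize = 331
both = 78
reported_or = 891

predicted = predicted_union(criticized, criticize, both)
discrepancy = reported_or - predicted

# Relative frequency of the two terms within this restricted sample.
relative_frequency = criticized / criticize

print(predicted)    # 890
print(discrepancy)  # 1
```

Repeating the same check with other restrictors, and comparing the resulting ratios, is what gives the estimate of relative frequency some reliability; a single restrictor only shows that the counts are internally consistent.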

Posted by Geoff Nunberg at January 23, 2005 09:00 PM