Language Log: Uh Oh...

January 23, 2005

Uh Oh...

I hate to rain on Geoff Nunberg's parade, since I'm also really happy about Google's increase of the word count limitation in queries from 10 to 32. And regular readers know that on the whole, there are few Google users more enthusiastic than me. But there's a cloud on the horizon...

Over the past few days, Jean Véronis has been doing a little experimental science on Google. His extrapolation of the time series of Google index counts demonstrates that the number of pages searched today must not be 8,058,044,651, as Google has been claiming since November, but actually 9,105,590,456 or so. That's the good news.

The bad news: Jean shows that Google's implementation of boolean search has some serious problems.

Here's the evidence (numbers from Jean's post of 1/19/2005 and from my replication at about 3:45 this afternoon):

	Searching for:	Jean's Count	My Count
1	Chirac	3,260,000	3,230,000
2	Chirac OR Sarkozy	1,570,000	1,420,000
3	Chirac OR Chirac	1.950,000	1,780,000
4	Chirac AND Chirac	1,950,000	1,760,000
5	Chirac Chirac	2,010,000	1,760,000

As Jean points out, it's hardly logical (for example) that lines 2 and 3 have smaller counts than line 1, or that lines 1, 4 and 5 have three different counts.

And another puzzle --

Searching for:	Jean's Count	My Count
Chirac AND Sarkozy	154,000	167,000
Chirac -Sarkozy	1,950,000	1,790,000
-Chirac Sarkozy	320,00	315,000
TOTAL	2,424,000	2,272,000

As Jean points out, the total from the second table (2,272,000) should be the same as the {Chirac OR Sarkozy} search in the first table (1,420,000).

The small differences between my counts and Jean's probably reflect not only the 3-day difference in query times (though this should not produce smaller counts for many of the later queries), but also the fact that Google's results come from a massively parallel set of parallel sets of machines, which have somewhat different indices. (And it may also be true that some of the search algorithms are implemented in essentially stochastic ways, e.g. estimating set intersection and union counts by extrapolating from random (initial) samples). But the big differences are harder to explain, and more troubling from the point of view of those who would like to use Google as an instrument of research. As Jean says:

Je n'ai pas la moindre idée de l'origine du problème. Bien sûr, je sais que les nombres retournés par Google sont des approximations (d'ailleurs le moteur précise bien environ x résultats), que les valeurs peuvent légèrement varier en fonction des "centres de données" qui traitent la requête et qui peuvent varier d'un moment à l'autre. Ces raisons pourraient expliquer de petites différences, mais pas des différences du simple au double. J'ai cherché sur les différents forums. Personne ne semble avoir la solution (si certains parmi vous l'ont, je serais très curieux de la connaître !).

[translated by myl] [I don't have the slightest idea of the source of the problem. Of course, I know that the numbers returned by Google are approximations (also the engine specifically says 'about x results'), that the numbers can slightly vary as a function of the "data centers" that process the request and can vary from one time to another. These reasons can explain small differences, but not differences of a factor of two. I've asked in different forums. No one seems to have the solution (if some among you have it, I'll be very curious to know!)]

I don't really have any idea either. One possible component of the problem is that (some) boolean queries might be sent to a search unit whose index is way out of date. [Update: see this post for the real story, which is more along the lines of wrongly extrapolating intersection counts from a sample, as I mention above.] But whatever the explanation, the fact is clear: Google's boolean queries are badly broken, at least if you care about the counts.

Jean also shows that the results from Yahoo! for similar tests are more logically plausible, though the total numbers are somewhat smaller. Advantage: Yahoo?

Posted by Mark Liberman at January 23, 2005 04:10 PM