August 02, 2005

Google gods: please make the * shine again!


Once upon a time, on a world wide web far, far away, Google wildcards seemed to work pretty well. Must have been all of a year ago. Now, the Google gods must be crazy. Consider this: a search on any pattern of the form "* X Y" now also matches strings of the form "X * Y", and most of the latter are included in the count estimate.

Example: "whether nobler in the mind to suffer" produces (until Google indexes the current page!) 0 hits. Eminently reasonable, since Hamlet never said that. On the other hand, he didn't say "* whether nobler in the mind to suffer" for any choice of word for the * either. But that produces 15,500 hits, only 500 fewer than "whether * nobler in the mind to suffer", and a few thousand more than "whether tis nobler in the mind to suffer".

And no, the * does not merely hop over one word. It jumps anywhere into a string. More than once. The search "whether tis in the mind to suffer" gives no hits, but "* whether tis in the mind to suffer" produces 14,900. And "* whether tis nobler in the mind  suffer" produces 14,700. And "* whether tis in the mind  suffer" gives 14,900, although without the * we get none.

Heck, let's try to pretend this is a feature rather than a bug. "To be, or not to be: That is the question:-- Whether tis nobler in the mind to" produces 3,440, which is plausible. "To be,  not to: That is question:-- Whether nobler in mind to", which has every third word removed, produces 0, which again seems fair (until this post is indexed). "* To be,  not to: That is question:-- Whether nobler in mind to", which is the same quote with a *, produces 7,690. And "* To be,  not to: That question is:-- Whether nobler in mind to", which is just like the 7,690 search but with the order of two words swapped, produces 0 again. So sticking a star at the start of a quoted string will tell you whether that sequence of words occurs on the net in that order with any combination of single words stuck in between.

But I can't really turn this into something useful. If you leave out pairs of words, weird stuff happens: you only get a tiny fraction of the results. "* To be,  not to: That is :-- Whether nobler in mind to" gives 16 hits, apparently full Hamlet quotes. And I tried once taking out three words ("* To be,  not to: That  Whether nobler in mind to") and got zero hits. I'll leave you all to experiment.
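For anyone who wants to experiment offline, here is a toy model of the behaviour described above: a leading star seems to let the quoted words match in order, with at most one extra word slipped in between neighbours. This is only a sketch of what the counts suggest, not a description of how Google actually works. The names `tokenize`, `loose_match`, and `max_gap` are made up for illustration, and the tokenizer simply throws away punctuation and the star itself.

```python
import re

def tokenize(text):
    """Lowercase, drop apostrophes, and keep only alphabetic word tokens.

    Punctuation and the * itself are discarded, so in this toy model the
    star only switches loose matching on; its position carries no meaning.
    """
    return re.findall(r"[a-z]+", text.lower().replace("'", ""))

def loose_match(query, page, max_gap=1):
    """Return True if the query words occur in the page in the same order,
    with at most max_gap extra page words between neighbouring query words.

    max_gap=0 is an ordinary exact phrase match (the unstarred search);
    max_gap=1 is the assumed behaviour of a search with a leading star.
    """
    q = tokenize(query)
    p = tokenize(page)
    if not q:
        return True

    def match_from(qi, pi):
        # q[qi] has been matched at p[pi]; try to place the rest of the query.
        if qi == len(q) - 1:
            return True
        for gap in range(max_gap + 1):
            nxt = pi + 1 + gap
            if nxt < len(p) and p[nxt] == q[qi + 1] and match_from(qi + 1, nxt):
                return True
        return False

    return any(p[i] == q[0] and match_from(0, i) for i in range(len(p)))

HAMLET = "To be, or not to be: that is the question: Whether 'tis nobler in the mind to suffer"
STARRED = "* To be,  not to: That is question:-- Whether nobler in mind to"
SWAPPED = "* To be,  not to: That question is:-- Whether nobler in mind to"
PAIRS = "* To be,  not to: That is :-- Whether nobler in mind to"

print(loose_match(STARRED, HAMLET))             # True: single-word gaps are allowed
print(loose_match(STARRED, HAMLET, max_gap=0))  # False: not an exact phrase
print(loose_match(SWAPPED, HAMLET))             # False: word order matters
print(loose_match(PAIRS, HAMLET))               # False: a two-word gap is too wide
```

The model reproduces the hit/no-hit pattern for the every-third-word and swapped-order searches. It flatly rejects the search with a pair of words removed, whereas Google returned 16 hits for it, so whatever Google is doing there is not captured by this sketch.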

By the way, you can put the star elsewhere in the string and get similar results, but I think there's a proviso: as well as any number of extra words before or after the location of the star, there must be a match at the location of the star. Thus "To be,  not to: That * is question:-- Whether nobler in mind to" has one hit, and the match is: "To be, or not to be, that is the question;/ whether 'tis nobler in the mind to suffer.", so * matched "/" or ";/". However, "To be,  not to: That is * question:-- Whether nobler in mind to" gives 8,040 hits, most of them presumably where * matches "the". I'll leave you all to experiment even more.

I want to be able to make linguistic claims based on web counts, and wildcards allow me to get at really thorny data amazingly quickly. But I cannot trust the wildcard results any more. Let us all pray to the Google gods that one day we shall return to that land of innocence we knew a year ago, a far off place where the * shone and it never rained on the linguists' parade.



Here's an index of past LL posts on Google count problems:

Pass the hát.
*
Type twice for truth?
More arithmetic problems at Google
Questioning reality
Google recall (They stole his mind, now he wants it back.)
When things don't add up
Uh Oh...


Posted by David Beaver at August 2, 2005 01:59 AM