Google gods: please make the * shine again!
Once upon a time, on a world wide web far, far away, Google wildcards
seemed to work pretty well. Must have been all of a year ago. Now, the
Google gods must be crazy. Consider this: a search on any pattern of the
form "* X Y" now also matches strings of the form "X * Y", and most of
the latter are included in the count estimate.
Example: "whether nobler in the mind to suffer" produces (until Google
indexes the current page!) 0 hits. Eminently reasonable, since Hamlet
never said that. On the other hand, he didn't say "* whether nobler in
the mind to suffer" for any choice of word for the * either. But that
produces 15,500 hits, only 500 fewer than "whether * nobler in the mind
to suffer", and a few thousand more than "whether tis nobler in the mind
to suffer".
And no, the * does not merely hop over one word. It jumps anywhere into
a string. More than once. The search "whether tis in the mind to suffer"
gives no hits, but "* whether tis in the mind to suffer" produces 14,900.
And "* whether tis nobler in the mind suffer" produces 14,700. And
"* whether tis in the mind suffer" gives 14,900, although without the *
we get none.
Heck, let's try to pretend this is a feature rather than a bug.
"To be, or not to be: That is the question:-- Whether tis nobler in the
mind to" produces 3440, which is plausible.
"To be, not to: That is question:-- Whether nobler in mind to", which has
every third word removed, produces 0, which again seems fair (until this
post is indexed).
"* To be, not to: That is question:-- Whether nobler in mind to", which is
the same quote with a *, produces 7690. And
"* To be, not to: That question is:-- Whether nobler in mind to", which is
just like the 7690 search but with the order of two words swapped,
produces 0 again.
So sticking a star at the start of a quoted string will tell you
whether that sequence of words occurs on the net in that order with any
combination of single words stuck in between. But I can't really turn
this into something useful. If you leave out pairs of words, weird
stuff happens: you only get a tiny fraction of the results.
"* To be, not to: That is :-- Whether nobler in mind to" gives
16 hits, apparently full Hamlet quotes. And I tried once taking out
three words ("* To be, not to: That Whether nobler in mind to") and got
zero hits. I'll leave you all to experiment.
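For anyone who wants to experiment offline, the guess above (words in order, with at most one stray word allowed between neighbours) can be mocked up in a few lines of Python. This is only a toy model of the observed behaviour, not Google's algorithm: the names tokenize and loose_match are made up for illustration, and treating punctuation as invisible is an assumption.

```python
import re

def tokenize(text):
    # Assumption: case and punctuation are ignored, so "*" and ":--" vanish.
    return re.findall(r"[a-z']+", text.lower())

def loose_match(query, text):
    """Toy model of a leading-star search: the query words must occur in
    the text in the same order, with zero or one extra word between each
    adjacent pair. A guess at the behaviour, nothing more."""
    words = tokenize(query)
    if not words:
        return False
    # Between neighbouring query words, allow one optional intervening token.
    pattern = (r"(?<!\S)"
               + r"(?: \S+)? ".join(re.escape(w) for w in words)
               + r"(?!\S)")
    haystack = " ".join(tokenize(text))
    return re.search(pattern, haystack) is not None

HAMLET = ("To be, or not to be: that is the question: "
          "Whether 'tis nobler in the mind to suffer")

# Every third word removed: matches under the toy model.
print(loose_match("* To be, not to: That is question:-- Whether nobler in mind to", HAMLET))
# Same query with two words swapped: no match under the toy model.
print(loose_match("* To be, not to: That question is:-- Whether nobler in mind to", HAMLET))
```

Under this model the first query matches the Hamlet line and the swapped one does not, which agrees with the 7690 and 0 counts. The query with a pair of words removed does not match either, so the model has nothing to say about where those 16 hits come from.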
By the way, you can put the star elsewhere in the string and get
similar results, but I think there's a proviso: as well as any number of
extra words before or after the location of the star, there must be a
match at the location of the star.
Thus "To be, not to: That * is question:-- Whether nobler in mind to"
has one hit, and the match is: "To be, or not to be, that is the
question;/ whether 'tis nobler in the mind to suffer.", so * matched "/"
or ";/". However, "To be, not to: That is * question:-- Whether nobler
in mind to" gives 8040 hits, most of them presumably where * matches
"the".
I'll leave you all to experiment even more.
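One way to read the proviso is that a star in the middle of the string must swallow at least one word where it stands, while every other gap may still absorb one stray word. The Python sketch below encodes that reading. It is a guess, not a description of what Google does: star_match and tokenize are hypothetical names, and punctuation is assumed to be invisible.

```python
import re

def tokenize(text):
    # Assumption: keep words and "*" as tokens, drop all other punctuation.
    return re.findall(r"[a-z']+|\*", text.lower())

def star_match(query, text):
    """Toy model: each "*" in the query must consume exactly one word at
    its own position; every other gap between neighbouring tokens may
    hold zero or one extra word."""
    pieces = [r"\S+" if tok == "*" else re.escape(tok)
              for tok in tokenize(query)]
    if not pieces:
        return False
    pattern = r"(?<!\S)" + r"(?: \S+)? ".join(pieces) + r"(?!\S)"
    # A literal "*" in the text is not a word, so leave it out.
    haystack = " ".join(t for t in tokenize(text) if t != "*")
    return re.search(pattern, haystack) is not None

HAMLET = ("To be, or not to be: that is the question: "
          "Whether 'tis nobler in the mind to suffer")

# Star where "the" belongs: matches under the toy model.
print(star_match("To be, not to: That is * question:-- Whether nobler in mind to", HAMLET))
# Star between "That" and "is", where the text has no word: no match.
print(star_match("To be, not to: That * is question:-- Whether nobler in mind to", HAMLET))
```

Because the sketch throws punctuation away, it cannot reproduce the single hit in which * apparently matched "/" or ";/"; it only captures the contrast between a star that has a word to match and one that does not.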
I want to be able to make linguistic claims based on web counts, and
wildcards allow me to get at really thorny data amazingly quickly. But
I cannot trust the wildcard results any more. Let us all pray to the
Google gods that one day we shall return to that land of innocence we
knew a year ago, a far-off place where the * shone and it never rained
on the linguists' parade.
Here's an index of past LL posts on Google count problems:
Posted by David Beaver at August 2, 2005 01:59 AM