*
N stars in quotes now mean zero or N words, but nothing in between zero
and N. According to Google. I'm pretty sure it didn't use to mean that
(correct me if I'm wrong by email: dib AT stanford DOT edu). But times
have changed. Google reality is fleeting.

In a quoted Google string search, * used to match exactly one word, any
word at all. So "language * log" would have produced similar (or
identical) hits to "language * log" -"language log". Not so any more!
Iván García-Álvarez, a graduate student at Stanford University's
linguistics department, pointed out to me that a * now sometimes
matches zero words. The full story is even more complex.

"there is * * * * a house in New Orleans" gets, at time of writing,
precisely the same number of hits, 3560, as "there is a house in New
Orleans", and they all seem to be hits on "there is a house in New
Orleans", although the hits are not in the same order. I can't even be
sure whether they are the same hits, since I would only be able to
check 1000 of them.

On the other hand, "there is * * * * house in New Orleans" only gives 3
hits, and all of them are for "there is house in New Orleans".
Meanwhile "there is * house in New Orleans" matches both (i) the three
"there is house in New Orleans" strings, which are in fact top ranked,
and (ii) "there is a house in New Orleans". Google claims 4550 hits for
this one-star search, though whether there is an actual surplus of
matches beyond the sum of "there is house in New Orleans" and "there is
a house in New Orleans" I don't know. Add one more star, i.e. "there is
* * house in New Orleans", and we get 5 hits. As far as I can tell,
these consist of cases where "* *" matches either zero words or two.

More generally, I hypothesize that N stars now match either zero or N
words, where by words I mean non-empty strings containing nothing
Google treats as separators. Which means that the meaning of * is now
context sensitive. It used to match exactly one word, and it carried on
meaning this when there were other *s around. Not so any more: now we
cannot give a natural interpretation to a single * within a list of *s,
but rather have to interpret the whole list. (BTW: you don't actually
need spaces between *s. "there is * * house in New Orleans" is
interpreted by Google in exactly the same way as "there is ** house in
New Orleans".)
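
For anyone who wants to play with the hypothesis offline, here is a minimal sketch of it as a regular expression. It is only a model of my guess, not of anything Google has documented: the names `compile_starred` and `matches` are made up for illustration, and I am assuming that whitespace is the only separator and that matching ignores case.

```python
import re
from itertools import groupby


def compile_starred(query: str) -> re.Pattern:
    """Compile a starred phrase under the hypothesis that a run of
    N stars matches either zero words or exactly N words."""
    # One token per star or per literal word, so "**" and "* *" tokenize alike.
    tokens = re.findall(r"\*|[^\s*]+", query)
    pattern = ""
    for is_star, group in groupby(tokens, key=lambda t: t == "*"):
        group = list(group)
        if is_star:
            # Interpret the whole run at once: N words, or nothing at all.
            pattern += r"(?:(?:\s+\S+){%d})?" % len(group)
        else:
            for word in group:
                pattern += r"\s+" + re.escape(word)
    return re.compile(pattern, re.IGNORECASE)


def matches(query: str, text: str) -> bool:
    """True if the whole of `text` fits the starred phrase `query`."""
    # Assumption: whitespace is the only separator between words.
    normalized = " " + " ".join(text.split())
    return compile_starred(query).fullmatch(normalized) is not None
```

Under this model, `matches("there is * * house in New Orleans", "there is a house in New Orleans")` is false, because one word is neither zero nor two, while the same text does fit the one-star and four-star-plus-"a" patterns discussed above. Note that the optional group wraps the entire run of stars, which is exactly the context sensitivity at issue: no single * gets an interpretation of its own.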

It probably does not matter to many people that the meaning of * is no
longer context free. And if what you want to do is match N words and
not zero words, you can just use the minus operator, as in "there is **
house in New Orleans" -"there is house in New Orleans", although
admittedly it is a pain in the butt. But as a linguist who likes to do
rapid prototyping of theories using Google, all this is a little scary.
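
To take some of the pain out of the minus-operator workaround, the exclusion can be generated mechanically. This is a small sketch with a hypothetical helper name, `require_star_words`; it simply builds the query string by deleting the stars, and it assumes the phrase has a single run of stars, since I have not tested how several separate runs interact.

```python
import re


def require_star_words(starred_phrase: str) -> str:
    """Build a query that keeps the starred phrase but excludes the
    reading in which the stars match zero words."""
    # Delete every star, then collapse the whitespace left behind.
    starless = " ".join(re.sub(r"\*", " ", starred_phrase).split())
    return f'"{starred_phrase}" -"{starless}"'


print(require_star_words("there is ** house in New Orleans"))
```

This prints the same query as the hand-written one above: "there is ** house in New Orleans" -"there is house in New Orleans".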

Google's semantics is in such rapid flux that my little web experiments
become even more unrepeatable than they would be if we only had to deal
with the ever-changing web. And what if someone has been tracking how
some aspect of the web changed over time using a string search
involving stars? They would now have to change their search pattern,
and probably start their cyberdiachronic investigation from scratch.
Then again, maybe there is no such thing as a cyberdiachronist who uses
starry Google strings. Then double-again, maybe there is such a person,
but I don't yet know how to search for them. But then triple-again, and
to paraphrase Berkeley: if a tree falls in the forest, but it doesn't
show up on Google, did it really happen? Now perhaps you begin to see
why the ethereal transience of Google scares me so much...
Posted by David Beaver at March 6, 2005 01:30 AM