March 06, 2005

*

N stars in quotes now mean zero or N words, but nothing in between zero and N. According to Google. I'm pretty sure it didn't use to mean that (correct me if I'm wrong by email: dib AT stanford DOT edu). But times have changed. Google reality is fleeting.

In a quoted Google string search, * used to match exactly one word, any word at all. So "language * log" would have produced similar (or identical) hits to "language * log" -"language log". Not so any more! Iván García-Álvarez, a graduate student at Stanford University's linguistics department, pointed out to me that a * now sometimes matches zero words. The full story is even more complex.

"there is * * * * a house in New Orleans" gets, at time of writing, precisely the same number of hits, 3560, as "there is a house in New Orleans", and they all seem to be hits on "there is a house in New Orleans", although the hits are not in the same order. I can't even be sure whether they are the same hits, since I would only be able to check 1000 of them.

On the other hand, "there is * * * * house in New Orleans" only gives 3 hits, and all of them are for "there is house in New Orleans". Meanwhile "there is * house in New Orleans" matches both (i) the three "there is house in New Orleans" strings, which are in fact top ranked, and (ii) "there is a house in New Orleans". Google claims 4550 hits for this one-star search, though whether there is an actual surplus of matches beyond the sum of "there is house in New Orleans" and "there is a house in New Orleans" I don't know. Add one more star, i.e. "there is * * house in New Orleans", and we get 5 hits. As far as I can tell, these consist of cases where "* *" matches either zero words or two.

More generally, I hypothesize that N stars now match either zero or N words, where by words I mean non-empty strings containing nothing Google treats as separators. Which means that the meaning of * is now context sensitive. It used to match exactly one word, and it carried on meaning this when there were other *s around. Not so any more: now we cannot give a natural interpretation to a single * within a list of *s, but rather have to interpret the whole list. (BTW: you don't actually need spaces between *s. "there is * * house in New Orleans" is interpreted by Google in exactly the same way as "there is ** house in New Orleans".)
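
To make the hypothesis concrete, here is a minimal Python sketch that compiles a starred query into a regular expression under the zero-or-N reading. It is only a model of the hypothesis, not a description of what Google actually does. The function name starred_query_to_regex and the allow_zero flag are invented for illustration, and "word" is crudely assumed to mean any run of non-whitespace characters, which is surely not how Google tokenizes.

```python
import re

def starred_query_to_regex(query: str, allow_zero: bool = True) -> re.Pattern:
    """Compile a quoted query containing stars into a regex.

    allow_zero=True  : hypothesized new reading, a run of N stars matches 0 or N words.
    allow_zero=False : old reading, every star matches exactly one word.
    """
    # Stars are their own tokens, so "**" and "* *" tokenize identically.
    tokens = re.findall(r"\*|[^\s*]+", query)
    pattern = ""
    at_start = True  # True until the first literal word has been emitted
    i = 0
    while i < len(tokens):
        if tokens[i] != "*":
            pattern += ("" if at_start else r"\s+") + re.escape(tokens[i])
            at_start = False
            i += 1
            continue
        # Count the whole run of adjacent stars: the run is interpreted as a unit.
        n = 0
        while i < len(tokens) and tokens[i] == "*":
            n += 1
            i += 1
        # Assumption: a "word" is any run of non-whitespace characters.
        one_word = r"\S+\s+" if at_start else r"\s+\S+"
        run = "(?:%s){%d}" % (one_word, n)
        pattern += "(?:%s)?" % run if allow_zero else run
    return re.compile(pattern, re.IGNORECASE)


rx = starred_query_to_regex("there is * * house in New Orleans")
print(bool(rx.fullmatch("there is house in New Orleans")))        # zero words: True
print(bool(rx.fullmatch("there is a big house in New Orleans")))  # two words: True
print(bool(rx.fullmatch("there is a house in New Orleans")))      # one word: False
```

Note that the optional group wraps the whole run of stars rather than each star separately. That is the context sensitivity in miniature: under the old reading (allow_zero=False) each star could be compiled on its own, whereas here the run has to be counted before anything can be emitted.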

It probably does not matter to many people that the meaning of * is no longer context free. And if what you want to do is match N words and not zero words, you can just use the minus operator, as in "there is ** house in New Orleans" -"there is house in New Orleans", although admittedly it is a pain in the butt. But as a linguist who likes to do rapid prototyping of theories using Google, all this is a little scary.
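
The minus-operator workaround is mechanical enough to script. The following is a small sketch, with a made-up helper name (exclude_zero_word_reading), that takes a starred phrase and builds the query string which subtracts the starless version. It only assembles the string in the form shown above; it does not talk to Google, and it assumes the phrase itself contains no double quotes.

```python
import re

def exclude_zero_word_reading(phrase: str) -> str:
    """Return a query that asks for the starred phrase minus its starless version."""
    # Drop every run of stars, then collapse the leftover whitespace.
    starless = " ".join(re.sub(r"\*+", " ", phrase).split())
    return f'"{phrase}" -"{starless}"'


print(exclude_zero_word_reading("there is ** house in New Orleans"))
# "there is ** house in New Orleans" -"there is house in New Orleans"
```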

Google's semantics is in such rapid flux that my little web experiments become even more unrepeatable than they would be if we only had to deal with the ever-changing web. And what if someone has been tracking how some aspect of the web changed over time using a string search involving stars? They would now have to change their search pattern, and probably start their cyberdiachronic investigation from scratch. Then again, maybe there is no such thing as a cyberdiachronist who uses starry Google strings. Then double-again, maybe there is such a person, but I don't yet know how to search for them. But then triple-again, and to paraphrase Berkeley: if a tree falls in the forest, but it doesn't show up on Google, did it really happen? Now perhaps you begin to see why the ethereal transience of Google scares me so much...
Posted by David Beaver at March 6, 2005 01:30 AM