March 07, 2006

Collocation provocation

When Crash upset Brokeback Mountain at the Academy Awards, the media and entertainment blog Gawker added fuel to the anti-Crash fire by claiming, "Google Can't Hide Its Oscar Disappointment." They point out that searching with Google for "i'm really glad crash won" prompts the question "Did you mean: 'i'm really glad trash won'?" At first this might seem like a case of Googlebombing, where a group of people (say, in this case, bitter Brokeback fans) try to skew Google's search returns by linking to pages with particular keywords. But Googlebombing only affects the ranking of pages, not the search engine's "Did you mean..." suggestions. (One enterprising Googlebomber did manage to get the top result for "french military victories" to link to a spoof page asking if you meant "french military defeats," but that suggestion never actually showed up on Google's result page.)

So why does Google ask if you're "really glad trash won"? As noted here before, Helpful Google sometimes moves in mysterious ways. But this case isn't so mysterious, since it looks like Helpful Google's algorithm is simply relying on collocational frequencies. It notices that "glad crash" isn't a common search return, so it looks for near misses on the assumption that the original query is a misspelling. And changing just one letter yields "glad trash," a much more common collocation thanks to GLAD trash bags. The same change from "crash" to "trash" occurs with various other search strings like "glad crash is" and "glad crash has," though Helpful Google demurs with plain old "glad crash."
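For the programmatically inclined, the collocational guess described above can be sketched in a few lines of Python. This is only a toy under stated assumptions, not Google's actual algorithm, which isn't public: the bigram counts are invented for illustration, and the names `BIGRAM_COUNTS`, `one_letter_variants`, `suggest` and the `ratio` threshold are all hypothetical.

```python
from string import ascii_lowercase

# Toy bigram counts, invented for illustration only (not real search data).
BIGRAM_COUNTS = {
    ("glad", "trash"): 5000,
    ("glad", "crash"): 12,
    ("really", "glad"): 8000,
}


def one_letter_variants(word):
    """Yield every string formed by substituting exactly one letter of word."""
    for i, original in enumerate(word):
        for letter in ascii_lowercase:
            if letter != original:
                yield word[:i] + letter + word[i + 1:]


def suggest(prev_word, word, counts=BIGRAM_COUNTS, ratio=10):
    """Return a one-letter variant of word that forms a much more frequent
    bigram with prev_word (more than ratio times as frequent), or None."""
    base = counts.get((prev_word, word), 0)
    best, best_count = None, base * ratio
    for candidate in one_letter_variants(word):
        count = counts.get((prev_word, candidate), 0)
        if count > best_count:
            best, best_count = candidate, count
    return best


print(suggest("glad", "crash"))   # a far more common neighbor exists
print(suggest("really", "glad"))  # already common, so nothing is suggested
```

With these made-up counts the first call proposes "trash" and the second proposes nothing. Note that the toy would also fire on plain old "glad crash," where the real Helpful Google demurs, so whatever Google is actually doing is evidently more involved than this.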

Google's algorithms have been mistakenly ascribed intentionality in another recent case. The Times of London reports:

Google has been asked to explain why the name of the Premiership footballer Ashley Cole has been linked to the word "gay" in internet search results.
Lawyers acting for the Arsenal and England defender want the internet company to disclose why typing his name into the search engine generates "See results for: ashley cole gay". ...
Graham Shear, solicitor for Mr Cole, said that he is interested in the origin of Google's decision to display the "gay" results alongside general searches for his client.
He said: "I am keen to find out whether the decision to automatically include the term 'gay' to the keyword 'Ashley Cole' was an editorial decision or one made by a computer based on the volume of searches for 'Ashley Cole' linked to the word 'gay'."

The dispute hinges on a new feature from Helpful Google — now Extra Helpful Google! — as Search Engine Watch explains (with an illustrative screenshot):

This is an example of the middle-of-the-page query refinement that Google's been testing over the past several months, as we wrote about back in August.
In particular, what seems to be happening is that Google is performing "clustering," a long-standing technique of grouping pages on a similar topic together. In other words, it sees there are lots of pages about "ashley cole" along with a subgroup of those on the topic of "ashley cole gay."
That there might be a subgroup like this isn't surprising. Cole is currently suing newspapers The Sun and The News Of The World over allegations they printed that he is gay. Those allegations have fueled discussion on the web, leading to a subgroup of pages on this topic.

And it's precisely the tabloid allegations of Cole's gay relations that prompted his solicitor to go after Google. As Mr. Shear told the Times, "I would be interested in when and what prompted this and whether the process started since we launched the cases against the News of the World and The Sun or before." He implies, bizarrely, that Google might somehow be colluding with the tabloids to tarnish his client as gay (apparently a fate worse than death for British footballers).

Leaving aside the not-too-subtle homophobia underlying the solicitor's request, it is indeed curious why <ashley + cole + gay> should be suggested, when it's not even the most common subsearch of three terms where two of them are <ashley> and <cole>. Blogger Chromatius notes that the Google toolbar autofills the most common search keywords, and typing <ashley + cole + g...> doesn't automatically suggest "gay" or rank it as the most common choice of keyword. (Cole and his solicitor should be happy to know that the first g-word suggested is "girlfriend." And the g-word listed by the toolbar with the highest number of results is "gallery.") Once again, Search Engine Watch sheds light on the ways of Google. The suggestion <ashley + cole + gay> shows up in the middle-of-the-page refinement not because it's the subsearch with the most results but because it's now the most commonly queried subsearch:

Why bring up this particular topic when something like "ashley cole" cars comes up with more matches (60,100 of them)? That brings me back to search volume. If Google's noticing that there are a lot of queries on a particular subtopic (ashley cole gay) related to the main topic (ashley cole) plus a significant number of pages on that topic, that might cause this refinement to kick in.

So apparently it's the controversy itself that is to blame. Lots of people have heard the allegations and are entering <ashley + cole + gay> into Google, which in turn is triggering the search engine to suggest that particular combination of keywords as a refinement of <ashley + cole>. I have a funny feeling that once the hacker types figure out the algorithm for the new feature, they'll use it for a new round of Googlebombing. All they would need to do is set up a "bot" that enters a particular query — let's say <george + bush + nincompoop> — over and over again, and eventually that will be the suggested refinement for <george + bush>. It might take people doing it from multiple IP addresses for this to work, but given the great success Googlebombers have already had, I'm sure it's just a matter of time before Extra Helpful Google is brought low.
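To make the volume-driven explanation concrete, here is a minimal Python sketch of how such a refinement might be chosen, and why a bot could game it. It is a guess at the general shape, not Google's method: the query log and every page count except the 60,100 figure quoted above are invented, and `suggest_refinement`, `min_queries` and `min_pages` are hypothetical names.

```python
from collections import Counter

# Toy data. Only the 60,100 page count comes from the Search Engine Watch
# quote; the query log and the other page counts are invented.
QUERY_LOG = (
    ["ashley cole gay"] * 5
    + ["ashley cole cars"] * 2
    + ["ashley cole girlfriend"] * 2
)
PAGE_COUNTS = {
    "ashley cole cars": 60100,
    "ashley cole gay": 20000,
    "ashley cole girlfriend": 15000,
}


def suggest_refinement(base_query, query_log, page_counts,
                       min_queries=3, min_pages=1000):
    """Return the most frequently queried extension of base_query that also
    has at least min_pages matching pages, or None if nothing qualifies."""
    prefix = base_query + " "
    volume = Counter(q for q in query_log if q.startswith(prefix))
    for query, times_asked in volume.most_common():
        if times_asked >= min_queries and page_counts.get(query, 0) >= min_pages:
            return query
    return None


print(suggest_refinement("ashley cole", QUERY_LOG, PAGE_COUNTS))
```

In this toy, "ashley cole cars" has the most pages but loses to the more frequently queried "ashley cole gay." Pad the log with enough repeats of any other query that clears the page threshold and the suggestion flips, which is the weakness a bot would exploit. A real system would presumably discount repeated queries from a single source, hence the need for multiple IP addresses.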

Posted by Benjamin Zimmer at March 7, 2006 09:26 PM