September 24, 2004

An odd little linguistic artifact

On Aug. 24, J.D. Lasica posted a blog entry noting that when he went to Google News and clicked on the link "John Kerry" under In the News, the first 35 results (other than those from mainstream newspapers and magazines) were from anti-Kerry right-wing sites, with at least a dozen of these appearing on the first page.

Lasica's post was spotted by the editor of the Online Journalism Review, who asked Lasica to write about it, which he did, in an article appearing on 9/23. The explanation for the effect was worked out by Ethan Zuckerman (see his post about it here), whom Lasica quotes in his OJR piece. Lasica discussed the sequence of events in this 9/23 blog entry.

Here's the proposed explanation:

"I think what you're seeing is an odd little linguistic artifact," said [Ethan] Zuckerman, former vice president of and now a fellow at Harvard's Berkman Center for Internet and Society who studies search engines. The chief culprit, he theorized, is that mainstream news publications refer to the senator on second reference as Kerry, while alternative news sites often use the phrase "John Kerry" multiple times, for effect or derision. To Google News' eye, that's a more exact search result.

A second possible factor, Zuckerman said, is that small, alternative news sites have no hesitancy about using "John Kerry" in a headline, while most mainstream news sites eschew first names in headlines. The inadvertent result is that the smaller sites score better results with the search engines.
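To make Zuckerman's two conjectures concrete, here is a toy sketch of a ranker that rewards exact-phrase hits and gives headline hits extra weight. Everything in it is an assumption for illustration: the function names, the headline weight of 3.0, and the two invented sample stories. Nothing here is known about how Google News actually scores pages.

```python
import re

def phrase_count(text, phrase):
    """Count case-insensitive, whole-word occurrences of an exact phrase."""
    pattern = r"\b" + re.escape(phrase) + r"\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))

def toy_score(headline, body, query, headline_weight=3.0):
    """Hypothetical relevance score: exact-phrase hits, headline hits weighted more.

    The weighting is an assumption, not Google's actual formula.
    """
    return headline_weight * phrase_count(headline, query) + phrase_count(body, query)

# Invented sample stories, written only to mimic the two styles described above.
mainstream = {
    "headline": "Kerry outlines health plan",
    "body": ("Sen. John Kerry spoke in Ohio on Monday. Kerry said the plan "
             "would cut costs. Kerry also criticized the administration."),
}
alternative = {
    "headline": "John Kerry flip-flops again",
    "body": ("John Kerry changed his position. Critics say John Kerry cannot "
             "be trusted. John Kerry declined to comment."),
}

query = "John Kerry"
mainstream_score = toy_score(mainstream["headline"], mainstream["body"], query)
alternative_score = toy_score(alternative["headline"], alternative["body"], query)

print("mainstream:", mainstream_score)
print("alternative:", alternative_score)
```

Under these assumptions, the story that repeats the full name and puts it in the headline outscores the one that switches to "Kerry" on second reference, even though both are about the same person. That is the whole of the proposed artifact.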

Zuckerman gives some advice on how to game Google News, if you want to do it:

With an occasional exception, Weblogs are generally not found among the Google News results, so Zuckerman had some advice for aspiring political publishers who want to game the search engines: Don't blog -- start an alternative news network. Use terms like George Bush and John Kerry frequently, rather than their last names alone, in both your text and headlines. Publish new works frequently.

I'm not sure how smart Google's algorithms are -- it would be pretty easy to spin off a blog's RSS feed into something that looked like an "alternative news network".

In any case, whether because Google has changed its algorithms or because the pro-Kerry sources have changed their behavior, the results at Google News today seem to be more balanced than what Lasica reports from a month ago. At least, the first page of search results for "John Kerry" now includes AlterNet and Daily Kos (which is a blog, right?!?), not to speak of the Socialist Worker and the Collective Bellaciao, who attack both Kerry and Bush.

This new-found balance is too bad, in a way, because to satisfy a linguist (well, at least to satisfy this one), we should look at some controls. Specifically, we should examine the treatment of a wider variety of political and non-political figures. Do the left-wing "alternative news networks" use George Bush's full name more often than right-wing ones do? What about Dick Cheney and John Edwards? What about down-ticket candidates, and other national political figures like Ted Kennedy or Tom DeLay? How about celebrities without significant partisan political associations, like Ray Charles or Dan Brown?

Zuckerman closes his blog entry this way:

Basically, I think it's an interesting, accidental linguistic artifact which demonstrates just how hard it is to get an AI to do something as complex as laying out a page of news stories. But stop listening to me and go read Lasica's excellent article.

It's clear from Lasica's results that there is some kind of "linguistic artifact" here, but without some further work in quantitative rhetoric, I'm not sure what it is. So my advice is to put on your pajamas and start counting.
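For anyone who does want to start counting, a minimal sketch of the basic measurement might look like the following: for a given person, how often does a text use the full name as opposed to the bare last name? The function names and the sample sentence are mine, invented for illustration. The matching is deliberately crude: "John F. Kerry" would be counted as a bare last name, and a search for Bush would also pick up "Laura Bush".

```python
import re

def name_usage(text, first, last):
    """Return (full_name_count, bare_last_name_count) for one person in a text."""
    full_pattern = rf"\b{re.escape(first)}\s+{re.escape(last)}\b"
    last_pattern = rf"\b{re.escape(last)}\b"
    full = len(re.findall(full_pattern, text))
    all_last = len(re.findall(last_pattern, text))
    # Every full-name hit also contains the last name, so subtract those out.
    return full, all_last - full

def full_name_rate(text, first, last):
    """Proportion of references that use the full name, or None if there are none."""
    full, bare = name_usage(text, first, last)
    total = full + bare
    return full / total if total else None

# Invented sample text, for illustration only.
sample = ("John Kerry spoke today. Kerry said he would win. "
          "Critics of John Kerry disagreed, and Kerry's aides replied.")

counts = name_usage(sample, "John", "Kerry")
rate = full_name_rate(sample, "John", "Kerry")
print(counts, rate)
```

Run over a collection of stories from each source, for each of the control names, this would give a table of full-name rates by outlet and by person, which is the comparison needed to tell whether the effect is specific to "John Kerry" and to one side of the aisle.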

However, if it no longer affects Google News, then your motivation can only be an interest in the statistics of political rhetoric. And I suspect that interest in this topic is somewhat lower than interest in googlebombing. At least, some popular application like googlebombing would be needed to spur the efforts of most amateur quantitative rhetoricians, just as debunking Dan Rather inspired the interest of those who understand the details of typography.

One last point. Apparently, the folks behind Google News changed their algorithms within 24 hours of the publication of Lasica's article. I guess they might have started earlier, perhaps alerted by Lasica's phone call to Krishna Bharat while the story was in preparation. But whatever the exact timeline, this looks like pretty fast work. It took CBS longer than that to decide to 'fess up about those forged memos, and they didn't even have to modify any software.

Posted by Mark Liberman at September 24, 2004 12:41 AM