February 26, 2006

en language log splitter

Anyone who has used a blog search engine or set up a blog feed knows that spam has thoroughly infested the blogosphere. Spam blogs, or splogs, have been running rampant for the past year or so, especially since last October's "splogsplosion" (to use a lovely double-blend coined by Tim Bray of Sun Microsystems). As with email-borne spam, much of the text of splogs is randomly generated, or at least generated according to a set of esoteric rules known only to the splogger.

By way of example, entering "language log" on Google Blog Search turns up a recent post appearing on a splog with the perfectly Dadaist name of "Separates on Crustal Gerald." Like so much spam text these days, the post reads like found poetry:

en language log splitter
February 25th, 2006

Captain combination upward language log space buy becoming chapter thirty. Tomorrow satisfied draw lie language log castle whispered. Given act wish log splitter establish discover bent park. Pleasant mood lungs funny log splitter splitter Mike variety soon uncle. Wonder flew promised en language tonight half fifty load. Brain fallen fort drawn en language development catch deeply tonight imagine. Attached concerned. Fifty yard log splitter ago Autumn forgot curious obtain. Mississippi constantly en language accept baseball beside. Indicate stop deeply vessels log splitter passage. Exact report american splitter ourselves completely grandfather language log thread continued black. Nearby exclaimed earlier record orange en language meet.

The two embedded links lead to pages full of Google ads for retailers selling — surprise — log splitters. The two pages are similarly packed with spam text, and one of them even shares the rubric of "en language log splitter" with the referring splog. If I had to guess, I would think the splogger's text-generating algorithm loosely relies on collocational frequencies to determine how to string words along. So the target phrase of "log splitter" is preceded by "language" because of the frequency of the collocation "language log" on the Web. (Now there's some dubious fame for you.) The "en" preceding "language" is more difficult to explain, though I see that the collocation "en language" is favored by spammers for some reason.
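To make that guess concrete, here is a minimal sketch of how a generator might pick filler words from collocational frequencies. It is only an illustration of the idea, not the splogger's actual method, which is unknown. The toy corpus, the function names, and the rule of prepending the most frequent predecessor are all assumptions made for the example.

```python
from collections import Counter

# Toy corpus standing in for Web text; invented purely for illustration.
CORPUS = (
    "language log posts about language log readers "
    "a log splitter review and a log splitter sale "
    "the language log archive"
).split()


def bigram_counts(tokens):
    """Count adjacent word pairs (collocations) in a list of tokens."""
    return Counter(zip(tokens, tokens[1:]))


def most_frequent_predecessor(word, counts):
    """Return the word that most often precedes `word`, or None if unseen."""
    candidates = {prev: n for (prev, nxt), n in counts.items() if nxt == word}
    if not candidates:
        return None
    return max(candidates, key=candidates.get)


def extend_left(phrase, counts, steps=1):
    """Prepend up to `steps` words, each the likeliest predecessor of the current first word."""
    words = phrase.split()
    for _ in range(steps):
        prev = most_frequent_predecessor(words[0], counts)
        if prev is None:
            break
        words.insert(0, prev)
    return " ".join(words)


counts = bigram_counts(CORPUS)
print(extend_left("log splitter", counts))  # language log splitter
```

In this toy corpus "language" precedes "log" three times and "a" only twice, so the target phrase "log splitter" picks up "language" as its left-hand neighbor, much as in the splog's title.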

(I'm loath to direct any additional traffic the sploggers' way, so you needn't bother clicking on the above links. Better yet, you can follow the advice Wired gave at the time of the October splogsplosion and report offenders to the proper authorities. Splog Reporter is fighting the good fight, accumulating examples of splogs to help search engines identify and exclude likely suspects. And sploggers like dear old Crustal Gerald who take advantage of Google ads should be reported to Google's AdSense program.)

[Update: Bruce Rusk sends along some useful reflections on "en language":

The "en" is somehow parasitized from the undergirdings of the web: "en" is the generic code for English (vs. "en-us" for US English, etc.), and often appears in HTML tags and in URLs. Googling "language en" will reveal lots of URLs with the string "Language=en" and of course Google ignores the "="; many such hits are from pages available in multiple languages, with the "en" indicating a directory containing English-language files. Now why that translates into the collocation "en language" and why an "en language" search yields mostly sploggy sites I am unsure, but it must have something to do with how Google finds similar pages.]
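Rusk's point about "Language=en" can be illustrated with a short sketch. The URL below is hypothetical, and the tokenizer is a deliberately crude stand-in: how Google actually tokenizes URLs and queries is not something this post can confirm. The sketch only shows that once punctuation such as "=" is dropped, a query-string parameter looks just like the two-word sequence "language en".

```python
import re
from urllib.parse import urlsplit, parse_qs

# Hypothetical URL of a multilingual page; not taken from any real site.
url = "http://example.com/page?Language=en"

# The query string carries the language code as an ordinary parameter.
query = parse_qs(urlsplit(url).query)
print(query)  # {'Language': ['en']}


def tokenize(text):
    """Crude tokenizer (an assumption): lowercase, keep only runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


# With "=" treated as a separator, the parameter becomes two adjacent words.
print(tokenize("Language=en"))  # ['language', 'en']
```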

Posted by Benjamin Zimmer at February 26, 2006 01:42 AM