August 06, 2007

Annals of spam


The most recent (8/6/07) New Yorker has an unsettling piece by Michael Specter on spam, in an "Annals of Technology" category: "Damn Spam: The losing war on junk e-mail" (pp. 36-41).  The title pretty much tells the story: it's an arms race, with both sides evolving, but the spammers seem to be winning.


Bayesian filters try to catch spam by learning from the properties of previous spam: the occurrence of the word Viagra, for instance.  Spammers respond by re-spelling the word, as, say, "\ / iagra"; this one looks transparent, but Specter reports (p. 40) that a blogger has estimated that there are over 600 quintillion ways to "spell" Viagra.
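To make the arms race concrete, here is a minimal sketch of a word-based Bayesian score, in Python.  It is not any real filter's implementation: the training messages are invented, the names (train, spam_score) are hypothetical, and it simply ignores tokens it has never seen, which is one common simplification.

```python
import math
from collections import Counter

# Toy training data, invented purely for illustration.
SPAM = ["buy viagra now", "cheap viagra online", "viagra best price"]
HAM = ["meeting notes attached", "lunch on friday", "draft of the paper attached"]

def train(messages):
    """Count how often each whitespace-separated token occurs."""
    counts = Counter()
    for msg in messages:
        counts.update(msg.lower().split())
    return counts

spam_counts = train(SPAM)
ham_counts = train(HAM)
vocab = set(spam_counts) | set(ham_counts)

def log_likelihood(tokens, counts):
    """Sum of log P(token | class), with add-one smoothing over the vocabulary."""
    total = sum(counts.values())
    return sum(math.log((counts[t] + 1) / (total + len(vocab))) for t in tokens)

def spam_score(message):
    """Log-odds that a message is spam, assuming equal priors.

    Tokens never seen in training carry no evidence and are skipped.
    """
    tokens = [t for t in message.lower().split() if t in vocab]
    return log_likelihood(tokens, spam_counts) - log_likelihood(tokens, ham_counts)

print(spam_score("viagra"))       # positive: the word was seen only in spam
print(spam_score(r"\ / iagra"))   # 0.0: none of the pieces was ever seen
```

The re-spelled version scores as if it carried no evidence at all, because the filter has never met any of its pieces; each new spelling has to be learned separately.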

Another technique that's been around for a while is to bury a few instances of tell-tale words in bizarrely phrased, but comprehensible, text, and to toss in some ordinary text lifted from another source, to throw the Bayesians off the scent, as in these two gems from my mailbox last week (I've removed the "From" and "To" addresses, since these were probably hijacked or forged):

(1) From: ...
Subject: Find out the sex craving all guys have
Date: August 2, 2007 7:39:52 AM PDT
To: ...

Dames always hee-hawed at me and even men did in the urban water closet!
Well, now I laugh at them, because I took M_E_G. ADI. K
for 7 months and now my dick is dreadfully more than civil.
market http://neuyormet.com/
--------------------------
Dadullah.
Mr Ducat said that he had no intention of harming the
According to Reuters, police have found no evidence of
Mr. Kenrick and Mr. Smith both denied to disclose how
on Google Video and YouTube. It is a segment taken from

(2) From: ...
Subject: These positions will help you reach your peak
Date: August 3, 2007 2:10:27 AM PDT
To: ...

Cuties always srieked at me and even boys did in the unrestricted toilet!
Well, now I laugh at them, because I took M E _G_A_D_ IK
for 7 months and now my dick is dreadfully preponderant than civil.
earn http://goosdon.cn/
--------------------------
surface.
Saturn's moon Enceladus taken in 2005, has shown that
Department and the CIA approved of using harsh
represents them, and the courts are closed to public
The Duke of York will leave New Zealand on Thursday 22nd

Each message has only one occurrence of the word dick, and the product name MEGADIK (in all-caps, iconic of bigness), itself a (modest) re-spelling, is expanded via spaces, underscores, and periods in such a way that human readers will have no problem recognizing it, but programs that search for orthographic patterns might be out-foxed.
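A sketch of how a program might catch up with this particular trick, assuming the only padding is spaces, underscores, and periods (as in the two messages above).  The helper name loose_pattern is hypothetical, and nothing here handles letter substitutions like "\ /" for V.

```python
import re

def loose_pattern(word):
    """Build a regex matching `word` with optional padding between its letters.

    The padding allowed is any run of whitespace, periods, or underscores.
    """
    sep = r"[\s._]*"
    return re.compile(sep.join(re.escape(ch) for ch in word), re.IGNORECASE)

MEGADIK = loose_pattern("megadik")

samples = [
    "because I took M_E_G. ADI. K",
    "because I took M E _G_A_D_ IK",
]
for s in samples:
    # First value: plain substring search; second: padded-letter search.
    print("MEGADIK" in s, bool(MEGADIK.search(s)))
```

For both samples the plain substring search fails and the padded search succeeds.  The cost is that a pattern this loose can also match across the boundaries of innocent words, so it trades missed spam for possible false alarms.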

As Specter notes, another strategy is to conceal the message one level down, in something other than a text file at top level.  Specter describes the image file strategy, where the message is encoded in an image rather than in text.  I get piles of penis-enlargement spam in image form every day (much of it depicting monstrously large and monstrously ugly penises), and it gets past both the spam filters I currently have in place (one at CSLI, one on my Mac).

The amount of spam sent to me has clearly been increasing; the amount of spam sent to everybody has.  I'm now getting significant amounts of spam in languages other than English: German especially, but also (just in the last week) Chinese, Japanese, Hebrew, and Spanish.  And recently there's been a flood of two new (to me) types of one-level-down spam: stuff in a "greeting card" from someone -- "a friend", "a family member", etc. -- and a minimalist strategy, involving an e-mail message headed X (for some innocuous word or phrase X) and containing nothing but an attached file named X.pdf or X.zip: for example, header "alert", file "alert.pdf" or "alert.zip".  My spam filters are getting better at weeding out the first sort (I fear the legitimate electronic greeting card business may be in for a bad time), but so far they don't catch any of the second sort.

Well, I get a LOT of e-mail that's basically just a .pdf file -- departmental business, research project files, reports to and from students -- so it's hard to see how the junk could be filtered out without looking into the files, and in a sophisticated way.  (I also trade some .zip files -- on Friday, a zipped version of Garner's Modern American Usage, with my current undergraduate intern.)
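The shape of the minimalist messages (header X, nothing but a file X.pdf or X.zip) is easy enough to test for; what is hard is telling junk from legitimate mail of the same shape.  A sketch using Python's standard email package, with a hypothetical function name and an invented sample message; it checks only the shape and never looks inside the file, and it assumes the message arrives as a multipart with an empty text part.

```python
from email.message import EmailMessage
from pathlib import PurePosixPath

def looks_minimalist(msg):
    """True if msg is a subject X plus nothing but one attachment X.pdf or X.zip.

    Shape only: a legitimate message of the same form is flagged too.
    """
    subject = (msg["Subject"] or "").strip().lower()
    attachments = list(msg.iter_attachments())
    if len(attachments) != 1:
        return False
    name = PurePosixPath(attachments[0].get_filename() or "")
    body = msg.get_body(preferencelist=("plain",))
    text = body.get_content().strip() if body is not None else ""
    return (
        text == ""
        and name.suffix.lower() in {".pdf", ".zip"}
        and name.stem.lower() == subject
    )

# Invented example of the pattern: header "alert", file "alert.pdf".
junk = EmailMessage()
junk["Subject"] = "alert"
junk.set_content("")
junk.add_attachment(b"%PDF-1.4", maintype="application",
                    subtype="pdf", filename="alert.pdf")

# Invented example of ordinary mail with an attachment and a real body.
legit = EmailMessage()
legit["Subject"] = "report"
legit.set_content("Here is the report we discussed.")
legit.add_attachment(b"%PDF-1.4", maintype="application",
                     subtype="pdf", filename="report.pdf")

print(looks_minimalist(junk), looks_minimalist(legit))
```

A colleague who sends a bare "report.pdf" under the header "report" would be flagged as well, which is exactly the difficulty: the shape alone doesn't separate the junk from the departmental business.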

So for the moment I'm getting a lot of junk I have to expunge by hand.  And I'm not alone.

(Notes: Thanks to Doug Kenter for getting me to think about spam in the first place.  And a warning to readers: this is only a report of personal experience; I am an idiot about spammish details, and I'm not proposing to survey the topic or to keep on top of developments in the worlds of spam dissemination and detection.)

(Note of a more linguistic sort: Specter's article has the noun spam "doubly categorized", used sometimes as a mass noun and sometimes as a count noun, even in the same sentence:

As the Web evolves into an increasingly essential part of American life, the sheer volume of spam [Mass] grows exponentially every year, and so, it would appear, do the sophisticated spams [Count] from a twenty-dollar broadband account each month; at those rates, a penny would pay for fifty thousand pieces of mail.

Double categorization is a wrinkle in the system of Count/Mass assignment in English that I keep putting off for a future posting.  But if you don't mind a moderately technical and abbreviated treatment, you can look at the discussion here.)

zwicky at-sign csli period stanford period edu

Posted by Arnold Zwicky at August 6, 2007 12:15 PM