Annals of spam
The most recent (8/6/07) New Yorker has an unsettling piece by Michael
Specter on spam, in an "Annals of Technology" category: "Damn Spam: The
losing war on junk e-mail" (pp. 36-41). The title pretty much tells the
story: it's an arms race, with both sides evolving, but the spammers
seem to be winning.
Bayesian filters try to catch spam by looking at properties of previous
spam: looking for the word Viagra, for instance. Spammers respond by
re-spelling the word, as, say, "\ / iagra"; this one looks transparent,
but Specter reports (p. 40) that a blogger has estimated that there are
over 600 quintillion ways to "spell" Viagra.
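To make the re-spelling problem concrete, here is a minimal sketch of a word-based naive Bayes score. It is a toy, not the filter Specter describes: the four training messages are invented, the function names are hypothetical, and skipping unseen tokens is one design choice among several. The re-spelled token is written in a compact form, "\/iagra", which is an assumption about how it would appear in a message.

```python
import math
from collections import Counter

def train(messages):
    """Count word occurrences per label from (text, label) pairs."""
    counts = {"spam": Counter(), "ham": Counter()}
    totals = Counter()
    for text, label in messages:
        counts[label].update(text.lower().split())
        totals[label] += 1
    return counts, totals

def spam_log_odds(text, counts, totals):
    """Log-odds that `text` is spam, using add-one smoothing."""
    vocab = set(counts["spam"]) | set(counts["ham"])
    n_spam = sum(counts["spam"].values())
    n_ham = sum(counts["ham"].values())
    score = math.log(totals["spam"] / totals["ham"])
    for word in text.lower().split():
        if word not in vocab:
            continue  # a token never seen in training carries no evidence
        p_spam = (counts["spam"][word] + 1) / (n_spam + len(vocab))
        p_ham = (counts["ham"][word] + 1) / (n_ham + len(vocab))
        score += math.log(p_spam / p_ham)
    return score

# Invented training data, for illustration only.
TRAINING = [
    ("buy viagra now", "spam"),
    ("cheap viagra online", "spam"),
    ("meeting notes attached", "ham"),
    ("lunch on friday", "ham"),
]
counts, totals = train(TRAINING)
plain = spam_log_odds("viagra", counts, totals)       # positive: looks like spam
respelled = spam_log_odds("\\/iagra", counts, totals)  # zero: token is unknown
```

The plain spelling gets a positive score because it appeared only in the spam examples; the re-spelling was never seen, so the filter learns nothing from it until enough messages with that exact variant have been labeled.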
Another technique that's been around for a while is to bury a few
instances of tell-tale words in bizarrely phrased, but comprehensible,
text, and to toss in some ordinary text lifted from another source, to
throw the Bayesians off the scent, as in these two gems from my mailbox
last week (I've removed the "From" and "To" addresses, since these were
probably hijacked or forged):
(1) From: ...
Subject: Find out the sex craving all guys have
Date: August 2, 2007 7:39:52 AM PDT
To: ...
Dames always hee-hawed at me and even men did in the urban water closet!
Well, now I laugh at them, because I took M_E_G. ADI. K
for 7 months and now my dick is dreadfully more than civil.
market http://neuyormet.com/
--------------------------
Dadullah.
Mr Ducat said that he had no intention of harming the
According to Reuters, police have found no evidence of
Mr. Kenrick and Mr. Smith both denied to disclose how
on Google Video and YouTube. It is a segment taken from
(2) From: ...
Subject: These positions will help you reach your peak
Date: August 3, 2007 2:10:27 AM PDT
To: ...
Cuties always srieked at me and even boys did in the unrestricted
toilet!
Well, now I laugh at them, because I took M E _G_A_D_ IK
for 7 months and now my dick is dreadfully preponderant than civil.
earn http://goosdon.cn/
--------------------------
surface.
Saturn's moon Enceladus taken in 2005, has shown that
Department and the CIA approved of using harsh
represents them, and the courts are closed to public
The Duke of York will leave New Zealand on Thursday 22nd
Each message has only one occurrence of the word dick, and the product
name MEGADIK (in all-caps, iconic of bigness), itself a (modest)
re-spelling, is expanded via spaces, underlines, and periods in such a
way that human readers will have no problem recognizing it, but
programs that search for orthographic patterns might be out-foxed.
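A program can, of course, be taught to see through this particular disguise. The sketch below builds a regular expression that allows spaces, underlines, and periods between the letters of a known word; the helper name `spaced_out_pattern` is hypothetical, and the separator set is an assumption drawn only from the two samples above.

```python
import re

# Separators seen in the two samples: whitespace, underlines, periods.
SEPARATORS = r"[\s_.]*"

def spaced_out_pattern(word):
    """Regex matching `word` with optional separators between its letters."""
    letters = (re.escape(ch) for ch in word)
    return re.compile(SEPARATORS.join(letters), re.IGNORECASE)

pattern = spaced_out_pattern("MEGADIK")
samples = [
    "because I took M_E_G. ADI. K",
    "because I took M E _G_A_D_ IK",
    "departmental business, research project files",
]
hits = [bool(pattern.search(s)) for s in samples]
```

This catches both variants quoted above and leaves the ordinary sentence alone, but only because the target word and the separators are known in advance. A new separator, or a substituted character, would slip past it, which is the arms race in miniature.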
As Specter notes, another strategy is to conceal the message one level
down, in something other than a text file at top level. Specter
describes the image file strategy, where the message is encoded in an
image rather than in text. I get piles of penis-enlargement spam
in image form every day (much of it depicting monstrously large and
monstrously ugly penises), and it gets past both the spam filters I
currently have in place (one at CSLI, one on my Mac).
The amount of spam sent to me has clearly been increasing; the amount
of spam sent to everybody has. I'm now getting significant
amounts of spam in languages other than English: German especially, but
also (just in the last week) Chinese, Japanese, Hebrew, and
Spanish. And recently there's been a flood of two new (to me)
types of one-level-down spam: stuff in a "greeting card" from someone
-- "a friend", "a family member", etc. -- and a minimalist strategy,
involving an e-mail message headed X (for some innocuous word or phrase
X) and containing nothing but an attached file named X.pdf or X.zip:
header "alert", file "alert.pdf" or "alert.zip". My spam filters
are getting better at weeding out the first sort (I fear the legitimate
electronic greeting card business may be in for a bad time), but so far
they don't catch any of the second sort.
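The surface shape of the second sort is at least easy to describe in code. The following is a rough sketch using Python's standard `email` package; the function name `looks_like_bare_attachment` and the rule itself (one .pdf or .zip attachment, named after the subject, with no body text) are assumptions based only on the description above, not a tested filter.

```python
from email.message import EmailMessage
from pathlib import PurePosixPath

def looks_like_bare_attachment(msg):
    """True if the message is just one .pdf/.zip file named after its subject."""
    subject = (msg["Subject"] or "").strip().lower()
    attachments = list(msg.iter_attachments())
    if len(attachments) != 1:
        return False
    name = PurePosixPath(attachments[0].get_filename() or "")
    body = msg.get_body(preferencelist=("plain",))
    text = body.get_content().strip() if body is not None else ""
    return (
        name.suffix.lower() in {".pdf", ".zip"}
        and name.stem.lower() == subject
        and text == ""
    )

# A message shaped like the "alert" example: header X, file X.pdf, nothing else.
spam = EmailMessage()
spam["Subject"] = "alert"
spam.add_attachment(b"%PDF-1.4", maintype="application",
                    subtype="pdf", filename="alert.pdf")

# A message with the same naming pattern but an actual note in the body.
ham = EmailMessage()
ham["Subject"] = "report"
ham.set_content("Here is the draft we discussed.\n")
ham.add_attachment(b"%PDF-1.4", maintype="application",
                   subtype="pdf", filename="report.pdf")

flags = [looks_like_bare_attachment(spam), looks_like_bare_attachment(ham)]
```

A rule this crude would also flag any legitimate message that consists of nothing but a file named after its subject line, so it describes the pattern more than it solves the problem.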
Well, I get a LOT of e-mail that's basically just a .pdf file --
departmental business, research project files, reports to and from
students -- so it's hard to see how the junk could be filtered out
without looking into the files, and in a sophisticated way. (I also
trade some .zip files -- on Friday, a zipped version of Garner's Modern
American Usage, with my current undergraduate intern.)
So for the moment I'm getting a lot of junk I have to expunge by
hand. And I'm not alone.
(Notes: Thanks to Doug Kenter for getting me to think about spam in the
first place. And a warning to readers: this is only a report of
personal experience; I am an idiot about spammish details, and I'm not
proposing to survey the topic or to keep on top of developments in the
worlds of spam dissemination and detection.)
(Note of a more linguistic sort: Specter's article has the noun spam
"doubly categorized", used sometimes as a mass noun and sometimes as a
count noun, even in the same sentence:
As the Web evolves into an increasingly
essential part of American life, the sheer volume of spam [Mass] grows exponentially
every year, and so, it would appear, do the sophisticated spams [Count] from a twenty-dollar
broadband account each month; at those rates, a penny would pay for
fifty thousand pieces of mail.
Double categorization is a wrinkle in the system of Count/Mass
assignment in English that I keep putting off for a future
posting. But if you don't mind a moderately technical and abbreviated
treatment, you can look at the discussion here.)
zwicky at-sign csli period stanford period edu
Posted by Arnold Zwicky at August 6, 2007 12:15 PM