May 26, 2004

Unwanted linguistic material: the spam problem

Those who have noted that Language Log doesn't have an open comments section have sometimes wondered why. Those who are aware that robots can be programmed to scour the blogosphere for open comments sections and spit spam into them may have some sense of why Language Log has been cautious so far.

Here is just one small story about what's out there itching to get at you: recently posted statistics reveal that the percentage of material mailed to Debian Linux mailing lists that passes all ID checks and content X-raying and security screenings and is duly made available to the subscribers on one of the lists is just 3.5%. That's not a typo: three point five percent. Only 35 out of every thousand items sent to the lists are genuine postings by human beings that pertain to Debian Linux. The rest, the spam, is caught by various filters, which some human has to constantly tune and maintain.

Such is the flood of mass-mailed garbage travelling around the net looking for a way to get to your screen. The percentage of all email that is from spammers is said to be as high as 80% as of April 2004, and rising. These may be underestimates. Even running SpamAssassin in fairly aggressive mode does not prevent my address, which I do not advertise, from getting at least one Nigerian scam letter per day that skips past the filtering (plus a dozen pieces of other junk that get caught, which is a very light spam load by some people's standards).

This is Geoff Pullum, looking forward to not hearing too much from anyone at ssshhhh!@censored.sorry.xyz.

Posted by Geoffrey K. Pullum at May 26, 2004 07:07 PM