February 14, 2004

WebSense -- Not!

I guess there's some satisfaction to be taken from Geoff Pullum's discovery that the WebSense filter's block of Language Log as a sex site wasn't the result of an overzealous reading of the site's content ("copular sentences," anyone?), but merely of the filter's having blocked the IP of Language Log's host machine on the basis of what turned out to be a misclassification of another site that was hosted there.

But I'd demur from Geoff's description of the WebSense error as the result of a kind of "typo" -- it's a lot more ominous than that.

In fact this sort of misclassification is extremely common, as I learned when I served as an expert witness for the American Library Association in its challenge to the Children's Internet Protection Act, which mandates the use of filtering software in all libraries receiving certain federal subsidies. (The law was overturned by a three-judge federal panel in June of 2002, but was ultimately held constitutional by the Supreme Court.)

All the filtering companies routinely block the IP addresses associated with any site their software flags as objectionable, even if the machine in question hosts dozens or even hundreds of innocuous sites. The rationale for this procedure is that porn sites frequently change their URLs, so that IP blocking is a necessary back-up. This IP blocking accounts for a large proportion of the misclassifications of sites by filters. In records that N2H2 (makers of the "Bess" filter) produced for the ALA trial, it turned out that more than half of the overblocks for which the filter received unblocking requests over one seven-week period several years ago involved virtual hosting: 583 sites in that period alone. Note that these were merely the sites whose owners had discovered that they were being blocked and had taken the trouble to write to the filtering company; the actual number of innocuous sites blocked by this procedure was surely orders of magnitude greater. And the proportion of sites that are improperly blocked by this procedure is doubtless a lot higher today, owing to the sharp increase in the number of blogs and other sites that are virtually hosted.
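To see why virtual hosting makes IP blocking so indiscriminate, here is a minimal sketch in Python. It is only an illustration of the general mechanism, not a description of how WebSense, Bess, or any other product is actually implemented; the hostnames, the addresses, and the function names blocked_ips and overblocked_hosts are all invented for the example.

```python
# Hypothetical hosting table: hostname -> IP address.
# Every name and address here is invented for illustration.
HOSTING = {
    "flagged-site.example": "192.0.2.10",
    "language-blog.example": "192.0.2.10",
    "dollhouse-furniture.example": "192.0.2.10",
    "obituaries.example": "192.0.2.10",
    "unrelated.example": "192.0.2.77",
}


def blocked_ips(flagged_hosts, hosting):
    """Return the set of IP addresses that host at least one flagged site."""
    return {hosting[host] for host in flagged_hosts if host in hosting}


def overblocked_hosts(flagged_hosts, hosting):
    """Return the unflagged sites that share an IP address with a flagged one."""
    ips = blocked_ips(flagged_hosts, hosting)
    return sorted(
        host
        for host, ip in hosting.items()
        if ip in ips and host not in flagged_hosts
    )


# One site is flagged; everything else on the same machine is blocked with it.
flagged = {"flagged-site.example"}
collateral = overblocked_hosts(flagged, HOSTING)
print(collateral)
```

In this toy table a single flagged site takes three unrelated sites down with it, while the site on a different machine is untouched. Scale the shared machine up to hundreds of virtually hosted sites and the overblocking grows accordingly.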

In fact, while critics of filters have chiefly noted the overblocking of sex- and health-information sites and the like, IP blocking is responsible for restricting access to a huge amount of utterly irrelevant protected speech, and the burden is entirely on the site owners to discover the errors and report them to the filtering companies. (For more on this, you can look at the pieces I've done on filters in The American Prospect and The New York Times.) In our investigations in connection with the ALA case, we discovered that filters were blocking sites devoted to dollhouse furniture, obituaries, wrestling, Latin music, celebrity autographs, and the computer society of Lulea University of Technology. Most of these overblocks were probably due to IP blocking, so Language Log is in good, if depressingly abundant, company.

Note, by the way, that the filters also block all translation sites, Google cache and image pages, anonymizer sites, and other sites that return a URL different from the one that was requested -- but that's another issue.

Posted by Geoff Nunberg at February 14, 2004 04:33 PM