July 25, 2007

I18N invective

In this digital and international age, it's hard out there for a Bowdler. Just think how tough it is to find all the spammers' creative ways of spelling words that they hope will attract an occasional sucker.

My own introduction to the censor's side of this duel of wits came in 1982, back when Disney World's Epcot ("Experimental Prototype Community of Tomorrow") first opened. AT&T sponsored a large exhibit there, and one of the initial installations was a real-time speech synthesizer that I had helped to develop. The idea was that visitors could type text on a keyboard, and hear the synthetic results immediately every time they hit 'enter'. (This was so long ago that the system ran on a PDP-11, or perhaps it was clockwork, I'm not sure...)

Anyhow, the Disney people immediately saw that we'd have to figure out how to thwart kids who type taboo words or phrases. So of course we added a list of words to the pronouncing dictionary, all with the pronunciation "cough" (or sometimes "cough cough"). But this wasn't really enough, of course, since we also had letter-to-sound rules, and with a bit of effort kids could figure out how to get the system to deliver their message. (Of course we tried to forestall this as well, but in the battle between censorship and creativity, censorship usually loses sooner or later.)
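For the curious, here is a toy sketch of why the dictionary trick leaks. It is not the Epcot system's code: the word lists, the phoneme notation, and the function names are all hypothetical stand-ins, and the "letter-to-sound rules" here just spell the word out. The point is only that a lookup keyed on exact spelling does nothing for a spelling it has never seen.

```python
# Hypothetical stand-ins: the post gives neither the real word list nor the
# synthesizer's phoneme notation.
TABOO = {"darn", "heck"}             # words whose "pronunciation" is a cough
LEXICON = {"hello": "HH AH L OW"}    # ordinary pronouncing-dictionary entries


def letter_to_sound(word: str) -> str:
    """Toy fallback: spell the word out letter by letter.

    A real synthesizer would apply letter-to-sound rules here.
    """
    return " ".join(word.upper())


def pronounce(word: str) -> str:
    """Look the word up; fall back to letter-to-sound rules if it is unknown."""
    key = word.lower()
    if key in TABOO:
        return "cough"
    return LEXICON.get(key) or letter_to_sound(key)


print(pronounce("heck"))   # listed spelling: caught by the dictionary
print(pronounce("hekk"))   # creative spelling: goes straight to the fallback
```

Any respelling that the fallback rules pronounce the same way gets through, which is the loophole the kids found.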

I was reminded of this by an email from Mike Albaugh:

Reading your "Expressions of negative Clippy feelings" post I flashed back to two memories.

In one, a friend was bitterly criticized on an online "forum", because he had ported a coin-op game to the PC, and left the "stop list" of words that users should not be able to enter as "names" in the high-score list. Someone had run "strings" or the DOS equivalent on the game, and was _outraged_ that three of the words, adjacent simply because they were alphabetized, formed what the complainer considered a phrase, and an obscene, racist one at that. No amount of explanation on my friend's part would mollify this possible troll.

Perhaps for similar reasons, the VMS "password suggestion" feature "encrypted" its stop-list. Not industrial-strength, but a bit more than ROT-13, IIRC. So of course folks had to figure it out and someone posted the list to comp.os.vms or the like, whereupon various people commented on what languages the forbidden words came from. Many were easy, but one stayed unattributed for a few days, until a post saying "It's Turkish. Don't ask".

I mention this because I can imagine, what with the deep well of love for Microsoft in the world at large, that they might fall prey to similar ill-will for simply having such a list for many languages. I hope they have learned about hashes.  

Well, I'm highly confident that there are many people in Microsoft R&D who know about hashes. I'm somewhat less certain that the software engineers involved will have seen this one coming -- but I guess by now they've probably experienced, at least once, pretty much every way that ill-wishers of one kind or another might approach their creations.
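To make Mike's suggestion concrete, here is a minimal sketch of a hashed stop list. It is an illustration only, not anything Microsoft or the VMS developers are known to have shipped: the choice of SHA-256, the normalization step, the names `digest`, `STOP_DIGESTS` and `is_blocked`, and the harmless placeholder words are all assumptions.

```python
import hashlib


def digest(word: str) -> str:
    """Return the SHA-256 hex digest of a word after trimming and lowercasing."""
    normalized = word.strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


# In practice this set would be built offline from the plaintext list, and only
# the digests would ship. The words below are harmless placeholders.
STOP_DIGESTS = frozenset(digest(w) for w in ("darn", "heck", "drat"))


def is_blocked(candidate: str) -> bool:
    """Check a candidate name against the list without storing any plaintext."""
    return digest(candidate) in STOP_DIGESTS


print(is_blocked("Heck"))    # True
print(is_blocked("hello"))   # False
```

Running "strings" on a program that ships only the digests turns up no readable words and no accidental phrases. It is not real secrecy, though: anyone who hashes a dictionary and compares the results can recover most of the list, so this defeats only casual inspection.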

Another response to the Clippy post: Joseph Kynaston Reeves writes:

Just read your fascinating Language Log post about how Clippy responds to verbal abuse, so (like everyone else who read it, I suspect) I gave it a try. The results are generally as you say, with one notable exception: tell Clippy "Fuck you" in Word and the first template he offers you is "Thank-you for job interview". Genius.

"Mute inglorious Milton" doesn't quite cover this one. "Mute inglorious Dilbert", maybe.

Posted by Mark Liberman at July 25, 2007 06:57 AM