March 16, 2004

Making Email Eight-Bit Safe

While we're on the topic of improvements in the infrastructure for writing systems other than the Roman alphabet, I thought I'd mention that the email system seems to have become eight-bit safe. What that means is that it is safe to include in email messages bytes whose high (most significant) bit is set. The original network architecture assumed that all email would consist of ASCII characters. ASCII character codes range from 0 through 127, which is to say, from 00000000 through 01111111. The eighth or high bit has the value 128 and so is set (has the value 1) only in bytes whose value ranges from 128 through 255. If you sent a byte with the high bit set through the email system, you couldn't be sure what would happen to it.
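To make the high-bit test concrete, here is a minimal sketch in Python. The function names are my own, chosen for illustration; they are not part of any mail software.

```python
def high_bit_set(byte: int) -> bool:
    """Return True if the most significant bit (value 128) of a byte is 1."""
    return (byte & 0x80) != 0


def is_seven_bit_clean(data: bytes) -> bool:
    """Return True if no byte in data has its high bit set (all values 0-127)."""
    return not any(high_bit_set(b) for b in data)


# Plain ASCII text passes; Latin-1 text with an accented letter does not.
print(is_seven_bit_clean("plain text".encode("ascii")))
print(is_seven_bit_clean("caf\u00e9".encode("latin-1")))
```

A message for which is_seven_bit_clean returns True is one that the original, ASCII-only email system could be relied on to carry unchanged.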

As long as all that people wanted to send was ASCII text, this was all right, but soon people began to want to send data other than text, such as images and programs. In order to get such non-textual data through the email system, it had to be encoded in such a way that what was sent over the network consisted entirely of bytes with values between 0 and 127. The most common way of doing this was by means of uuencoding. The name uuencode originally stood for Unix-to-Unix encode, but the same encoding technique soon spread to non-Unix systems. Uuencoding distributes the information in three eight-bit bytes (24 bits) over four bytes, each of which carries six bits of data and has its seventh and eighth bits unset. Essentially, abcdefgh ijklmnop qrstuvwx is reencoded as 00abcdef 00ghijkl 00mnopqr 00stuvwx, where each letter of the alphabet represents a 0 or a 1. (In practice, 32 is then added to each of the four values so that they come out as printable ASCII characters.) On Macintoshen, the program that performs the same function is called BinHex. In both cases binary data was encoded, sent over the network, and then decoded back into its original form. When you send a file as an attachment, the same sort of encoding and decoding is performed.
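The three-into-four regrouping can be sketched in a few lines of Python. This shows only the bit shuffling and the conventional offset of 32 that makes the result printable; it leaves out the line-length prefixes and header lines of a real uuencoded file, and the function names are my own.

```python
def regroup_3_to_4(chunk: bytes) -> list:
    """Split three 8-bit bytes (24 bits) into four 6-bit values (0-63)."""
    if len(chunk) != 3:
        raise ValueError("expected exactly three bytes")
    # Pack the three bytes into one 24-bit integer: abcdefgh ijklmnop qrstuvwx
    n = (chunk[0] << 16) | (chunk[1] << 8) | chunk[2]
    # Peel off six bits at a time: 00abcdef 00ghijkl 00mnopqr 00stuvwx
    return [(n >> shift) & 0x3F for shift in (18, 12, 6, 0)]


def ungroup_4_to_3(values: list) -> bytes:
    """Inverse of regroup_3_to_4: rebuild the original three bytes."""
    n = 0
    for v in values:
        n = (n << 6) | (v & 0x3F)
    return bytes([(n >> 16) & 0xFF, (n >> 8) & 0xFF, n & 0xFF])


def to_printable(values: list) -> str:
    """Add 32 to each 6-bit value so that it becomes a printable ASCII character."""
    return "".join(chr(v + 32) for v in values)


six_bit = regroup_3_to_4(b"Cat")
print(six_bit)                  # [16, 54, 5, 52]
print(to_printable(six_bit))    # 0V%T
print(ungroup_4_to_3(six_bit))  # b'Cat'
```

None of the four output values ever exceeds 63, so even after the offset is added the result stays well below 128 and the high bit is never set.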

This system wasn't too inconvenient as long as people were willing to write in plain ASCII, but it becomes tedious if you want to write in a language whose writing system requires more than 128 code points. That means using one of the ISO-8859 encodings for European languages, Arabic, and Hebrew, or using Unicode, and in UTF-8 every character outside ASCII is encoded entirely in bytes with the high bit set. For example, the UTF-8 Unicode encoding of the Korean word 한글 [hangul] (the name of the Korean alphabet) looks like this when written out in binary:

11101101 10010101 10011100 11101010 10111000 10000000

Every byte has its high bit set.
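You can check this for yourself with a short Python sketch. The word is written with escape sequences (U+D55C and U+AE00) so that the source file itself stays pure ASCII.

```python
word = "\ud55c\uae00"  # the two syllables of the Korean word hangul

# Encode to UTF-8 and write each byte out as eight binary digits.
encoded = word.encode("utf-8")
bits = " ".join(format(b, "08b") for b in encoded)
print(bits)  # 11101101 10010101 10011100 11101010 10111000 10000000

# Confirm that every byte has its high bit (value 128) set.
print(all(b & 0x80 for b in encoded))  # True
```

Each syllable takes three bytes in UTF-8, which is why two characters come out as six bytes.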

A lot of software still isn't eight-bit safe. One such program is Movable Type, the software that runs Language Log. You can enter UTF-8, for example, but your post will be truncated at the first byte with its high bit set. In order to get non-ASCII characters in, therefore, you have to use HTML numeric character entities, which represent Unicode code points using ASCII characters. For instance, the phonetic symbol ʤ has the Unicode code point U+02A4, which comes out as 0xCA 0xA4 in UTF-8. But it won't work to enter these two bytes directly into Movable Type, since both have their high bits set. So instead, I enter the eight-byte sequence &#x02A4;. Each of these eight characters is an ASCII character, so Movable Type is happy, but your browser knows that such a sequence is to be interpreted as "the Unicode character whose hexadecimal value is 02A4", and, if you have a suitable font installed, displays the correct character.
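The conversion to numeric character entities is easy to automate. The following is a minimal sketch in Python, not anything that Movable Type provides; to_numeric_entities is a name I have made up, while html.unescape is the standard library function that does what the browser does.

```python
import html


def to_numeric_entities(text: str) -> str:
    """Replace every non-ASCII character with a hexadecimal numeric entity."""
    return "".join(
        ch if ord(ch) < 128 else "&#x{:04X};".format(ord(ch))
        for ch in text
    )


symbol = "\u02a4"  # the phonetic symbol discussed above
print(symbol.encode("utf-8"))       # b'\xca\xa4'

entity = to_numeric_entities(symbol)
print(entity)                       # &#x02A4;
print(len(entity))                  # 8

# Every character of the entity is plain ASCII, and decoding restores the symbol.
print(all(ord(c) < 128 for c in entity))   # True
print(html.unescape(entity) == symbol)     # True
```

ASCII text passes through unchanged, so the function can be applied to a whole post rather than to individual characters.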

The email system, however, seems now to be eight-bit safe. A little while back, I tried sending myself unencoded email in Unicode, and somewhat to my surprise, it worked. Since then, I've corresponded successfully in Korean with a friend at Rutgers. There may still be parts of the network that are not eight-bit safe, but things seem to be looking up.

Posted by Bill Poser at March 16, 2004 09:35 PM