August 31, 2005

Encoding Puzzle Answer

Here's the solution to yesterday's encoding puzzle. If you look at the HTML metadata, the page claims to be in ISO-8859-1 (aka Latin-1), an ASCII extension in which things like accented characters occupy codepoints above the ASCII range, while still remaining in a single byte. The claim, though technically true, is misleading. All of the characters are ASCII characters. That is, not a single byte on that page has a value greater than 0x7F. Technically, you can call that ISO-8859-1, since it is consistent with it, but really the page is in the ASCII subset of ISO-8859-1.
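If you want to check a claim like this yourself, a minimal Python sketch follows. The function name is_pure_ascii and the sample strings are my own illustrations, not taken from the page in question.

```python
def is_pure_ascii(data: bytes) -> bool:
    """Return True if no byte has a value greater than 0x7F."""
    return all(b <= 0x7F for b in data)


# Hypothetical samples: the entity-laden text is pure ASCII,
# while genuine ISO-8859-1 text contains the byte 0xE9.
entity_text = b"repr&#195;&#169;sente"
latin1_text = "repr\u00e9sente".encode("iso-8859-1")

print(is_pure_ascii(entity_text))
print(is_pure_ascii(latin1_text))
```

The first sample passes the test and the second does not, which is the difference between being consistent with ISO-8859-1 and actually using its upper half.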

Inspection of the page source reveals that the accented letters are each represented by a sequence of two HTML decimal numeric character entities. For example, é (e with acute accent) is not represented by the single byte with value 0xE9 as it would be in ISO-8859-1. Rather, it is represented by a sequence of twelve bytes: &#195;&#169;. &#195; is an HTML representation for Ã (upper case A with tilde); &#169; is an HTML representation for © (copyright symbol). That's why the word représente comes out as reprÃ©sente on your terminal. (Don't anybody write in to say that this is the usual spelling used by dyslexic speakers of North African French when text-messaging after they've had a few drinks or something like that. Writing Unix man pages is a serious, indeed sacred, matter. Learned authors have compared the interpretation of Unix man pages to the study of the Talmud.)

What do I mean by saying that &#195; is an HTML representation of Ã and that &#169; is an HTML representation of ©? In HTML, characters may be represented in as many as four ways:

  • They may be directly encoded. For example, a byte with the numerical value 0x26 (38 to the hexadecimally challenged) is the ASCII code, and therefore also the ISO-8859-1 and Unicode, for the ampersand character &. Most of the text that you see on web pages is represented this way.
  • They may be represented by character references. These are little labels enclosed between an ampersand and a semi-colon. For example, ampersand may be represented as &amp;. Such character references exist for most of the common symbols, such as © (&copy;) and ® (&reg;), and for letters with diacritics, such as é (&eacute;) and à (&agrave;). (You may be wondering how it is that I am writing things containing & if it is used to introduce character references. The answer is that I am very clever. If that doesn't satisfy you, the view page source command in your browser should clue you in.)
  • Characters may be represented by means of hexadecimal numeric character entities. A numeric character entity begins with an ampersand and a cross-hatch and ends with a semi-colon. Between them is a numerical representation of the character's Unicode value. If the number is base 16, it is preceded by an x. The hexadecimal numeric character entity for ampersand is &#x26;.
  • Characters may be represented by decimal numeric character entities. These are just like hexadecimal numeric character entities except that the number is in base 10. That it is decimal is marked by omission of the x that marks hexadecimal numbers. Ampersand is represented as a decimal numeric character entity as &#38;. No one knows for sure why decimal numeric character entities exist since they are wholly redundant and not nearly as elegant as their hexadecimal counterparts. Some scholars suspect that they have a symbological value. Perhaps a sequel to The Da Vinci Code will enlighten us.
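The four representations listed above can be compared with a short Python sketch. It leans on html.unescape from the standard library; the dictionary keys are just labels I made up for this illustration.

```python
import html

# Four ways of writing an ampersand in HTML source.
forms = {
    "direct": "&",         # the raw byte 0x26
    "named": "&amp;",      # character reference
    "hex": "&#x26;",       # hexadecimal numeric character entity
    "decimal": "&#38;",    # decimal numeric character entity
}

# Decode each form the way an HTML parser would.
decoded = {label: html.unescape(text) for label, text in forms.items()}

for label, char in decoded.items():
    print(f"{label:8} {forms[label]!r:10} -> {char!r} (0x{ord(char):02X})")
```

All four should decode to the same character, with value 0x26 (38 to the hexadecimally challenged).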

So, how did é end up represented as &#195;&#169;? Well, &#195;&#169; is an ASCII-fied representation of a sequence of bytes whose numerical values are 0xC3 (aka 195) and 0xA9 (aka 169). Notice how the use of decimal numeric character entities obscures things. It just happens that 0xC3 0xA9 is the UTF-8 encoding of UTF-32 0xE9, which is é. In their pure and ethereal form, Unicode codepoints are all 32 bits, or 4 bytes. For various reasons (discussed previously on Language Log and in more detail here) the preferred form for exchange of Unicode-encoded text is UTF-8, in which most characters are encoded as two or more bytes.
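The byte values in question are easy to confirm. Here is a small Python sketch; the variable names are mine, and I use the big-endian UTF-32 codec only so that no byte-order mark gets in the way.

```python
e_acute = "\u00e9"  # é, Unicode codepoint 0xE9

utf8_bytes = e_acute.encode("utf-8")        # two bytes
utf32_bytes = e_acute.encode("utf-32-be")   # four bytes, no byte-order mark
latin1_bytes = e_acute.encode("iso-8859-1") # one byte

print([hex(b) for b in utf8_bytes])    # the UTF-8 form
print([b for b in utf8_bytes])         # the same bytes in decimal
print([hex(b) for b in utf32_bytes])   # the 32-bit form
print([hex(b) for b in latin1_bytes])  # the single-byte Latin-1 form
```

The decimal values of the two UTF-8 bytes are exactly the numbers that show up inside the two entities.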

To pull all this together, the garbled man pages are what you would get if you started off with a page in UTF-8 and, mistakenly thinking that it was in ISO-8859-1, ran it through an HTML-izer that converted anything outside the ASCII range to numeric character entities.
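That hypothesis can be played out in a few lines of Python. This is only a sketch of the suspected chain of events, not the actual tool that produced the man pages; garble and ungarble are names I invented for the illustration.

```python
import re


def garble(text: str) -> str:
    """Mimic the suspected mishap: UTF-8 bytes misread as ISO-8859-1,
    then every non-ASCII character turned into a decimal entity."""
    utf8_bytes = text.encode("utf-8")
    misread = utf8_bytes.decode("iso-8859-1")  # one character per byte
    return "".join(c if ord(c) < 128 else f"&#{ord(c)};" for c in misread)


def ungarble(markup: str) -> str:
    """Reverse the mishap: entities back to bytes, bytes read as UTF-8.
    Assumes every decimal entity in the input has a value below 256."""
    misread = re.sub(r"&#(\d+);", lambda m: chr(int(m.group(1))), markup)
    return misread.encode("iso-8859-1").decode("utf-8")


garbled = garble("repr\u00e9sente")
print(garbled)
print(ungarble(garbled))
```

I decode the entities with a regular expression rather than html.unescape because HTML parsers remap numeric references in the range 128 to 159 to Windows-1252 characters, which would spoil the round trip for letters such as É.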

The reason for using an HTML-izer is that some software, such as the software that runs this blog, cannot handle bytes whose high bit is set. If you enter such a byte into a Language Log entry, it looks fine when you enter it, but you will find the post truncated immediately before the first such byte. So if you want to use non-ASCII characters with confidence in web pages, it is wise to convert them all to character entities. I have myself written a couple of programs that do this.
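For illustration, here is roughly what a correctly behaved HTML-izer might look like in Python. It is a hypothetical sketch, not one of the programs mentioned above; the name htmlize is made up, and it assumes the input has already been decoded into a proper Unicode string.

```python
def htmlize(text: str) -> str:
    """Replace every non-ASCII character with a hexadecimal
    numeric character entity; leave ASCII characters alone."""
    return "".join(c if ord(c) < 128 else f"&#x{ord(c):X};" for c in text)


word = "repr\u00e9sente"
print(htmlize(word))

# The standard library offers a decimal variant via an error handler.
print(word.encode("ascii", "xmlcharrefreplace").decode("ascii"))
```

The important part is the assumption: the conversion works on characters, so the bytes must be decoded with the right encoding first. Skip that step and you get the garbled man pages.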

Several of our readers figured this out: Diane Bruce responded in the wee hours of last night not long after I posted the puzzle. The others are Aaron Elkiss and John O'Neill.


Posted by Bill Poser at August 31, 2005 01:54 AM