Here's the solution to yesterday's encoding puzzle. If you look at the HTML metadata, the page claims to be in ISO-8859-1 (aka Latin-1), an ASCII extension in which things like accented characters occupy codepoints above the ASCII range while still fitting in a single byte. The claim, though technically true, is misleading. All of the characters are ASCII characters. That is, not a single byte on that page has a value greater than 0x7F. Technically, you can call that ISO-8859-1, since it is consistent with it, but really the page is in the ASCII subset of ISO-8859-1.
Inspection of the page source reveals that the accented letters are each represented by a sequence of two HTML decimal numeric character entities. For example, é (e with acute accent) is not represented by the single byte with value 0xE9, as it would be in ISO-8859-1. Rather, it is represented by a sequence of twelve bytes: &#195;&#169;. &#195; is an HTML representation for Ã (upper case A with tilde); &#169; is an HTML representation for © (copyright symbol). That's why the word représente comes out as reprÃ©sente on your terminal. (Don't anybody write in to say that this is the usual spelling used by dyslexic speakers of North African French when text-messaging after they've had a few drinks or something like that. Writing Unix man pages is a serious, indeed sacred, matter. Learned authors have compared the interpretation of Unix man pages to the study of the Talmud.)
What do I mean by saying that &#195; is an HTML representation of Ã and that &#169; is an HTML representation of ©? In HTML, characters may be represented in as many as four ways:
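As a quick illustration, and assuming the four forms meant here are the literal character, a named entity, a decimal numeric reference, and a hexadecimal numeric reference, Python's standard html.unescape should decode the last three to the same character as the first. The dictionary and variable names are mine, chosen only for this sketch.

```python
import html

# Four ways to write e-with-acute in HTML.
# Assumption: these are the four forms the text has in mind.
forms = {
    "literal": "\u00e9",
    "named entity": "&eacute;",
    "decimal reference": "&#233;",
    "hexadecimal reference": "&#xE9;",
}

# Decode each form into the character a browser would display.
decoded = {name: html.unescape(text) for name, text in forms.items()}

for name, char in decoded.items():
    print(f"{name:22} {forms[name]!r:12} -> U+{ord(char):04X}")

# The two references found in the garbled page decode to two characters, not one.
print(html.unescape("&#195;&#169;"))
```

Every row reports U+00E9, while the last line prints Ã©, which is exactly what showed up in the man pages.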
So, how did é end up represented as &#195;&#169;? Well, &#195;&#169; is an ASCII-fied representation of a sequence of two bytes whose numerical values are 0xC3 (aka 195) and 0xA9 (aka 169). Notice how the use of decimal numeric character entities obscures things. It just happens that 0xC3 0xA9 is the UTF-8 encoding of UTF-32 0xE9. In its pure and ethereal form, UTF-32, every Unicode codepoint occupies 32 bits, or 4 bytes. For various reasons (discussed previously on Language Log and in more detail here) the preferred form for exchange of Unicode-encoded text is UTF-8, in which most characters are encoded as two or more bytes.
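The arithmetic can be checked in a few lines of Python. This is only a sketch: the two-byte bit layout worked out by hand below applies to codepoints from 0x80 through 0x7FF, and the variable names are my own.

```python
char = "\u00e9"  # e with acute accent, codepoint 0xE9

utf32 = char.encode("utf-32-be")  # big-endian, no byte-order mark
utf8 = char.encode("utf-8")

print(utf32.hex(" "))  # 00 00 00 e9
print(utf8.hex(" "))   # c3 a9
print(list(utf8))      # [195, 169]

# The same two bytes by hand: a two-byte UTF-8 sequence has the shape
# 110xxxxx 10xxxxxx, where the x's are the low 11 bits of the codepoint.
cp = ord(char)
byte1 = 0b11000000 | (cp >> 6)          # top 5 of the 11 bits
byte2 = 0b10000000 | (cp & 0b00111111)  # bottom 6 bits
print(hex(byte1), hex(byte2))           # 0xc3 0xa9
```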
To pull all this together, the garbled man pages are what you would get if you started off with a page in UTF-8 and, mistakenly thinking that it was in ISO-8859-1, ran it through an HTML-izer that converted anything outside the ASCII range to numeric character entities.
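Here is a minimal sketch of that scenario in Python, offered as a plausible reconstruction rather than the actual tool chain that produced the pages. The function names garble and ungarble are hypothetical. The reversal assumes every decimal reference in the input came from the faulty conversion.

```python
import re

def garble(text: str) -> str:
    """Hypothetical reconstruction of the faulty pipeline."""
    utf8_bytes = text.encode("utf-8")          # the page really was UTF-8
    misread = utf8_bytes.decode("iso-8859-1")  # but each byte was taken for a Latin-1 character
    # HTML-ize: every non-ASCII character becomes a decimal numeric reference.
    return misread.encode("ascii", "xmlcharrefreplace").decode("ascii")

def ungarble(garbled: str) -> str:
    """Undo garble(): references -> Latin-1 bytes -> UTF-8 text."""
    # Turn each decimal reference back into the character with that codepoint.
    chars = re.sub(r"&#(\d+);", lambda m: chr(int(m.group(1))), garbled)
    # Those characters stand for the original bytes; reinterpret them as UTF-8.
    return chars.encode("iso-8859-1").decode("utf-8")

word = "repr\u00e9sente"
print(garble(word))            # repr&#195;&#169;sente
print(ungarble(garble(word)))  # the original word again
```

The reversal uses a plain regular expression rather than a general HTML entity decoder because HTML decoders typically remap references in the range 128 to 159 to Windows-1252 characters, which would corrupt some of the bytes.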
The reason for using an HTML-izer is that some software, such as the software that runs this blog, cannot handle bytes whose high bit is set. If you enter such a byte into a Language Log entry, it looks fine when you enter it, but you will find the post truncated immediately before the first such byte. So if you want to use non-ASCII characters with confidence in web pages, it is wise to convert them all to character entities. I myself have written a couple of programs that do this.
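For readers who want to try this, a minimal HTML-izer might look like the following in Python. It is not one of the programs mentioned above, just a sketch; the function name htmlize and its encoding parameter are my own. The point is that the input encoding has to be stated correctly before anything is converted.

```python
def htmlize(raw: bytes, encoding: str = "utf-8") -> str:
    """Return pure-ASCII text in which every non-ASCII character
    is replaced by a decimal numeric character reference."""
    text = raw.decode(encoding)  # decode with the encoding the bytes are really in
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")

page = "repr\u00e9sente".encode("utf-8")

print(htmlize(page))                         # repr&#233;sente
print(htmlize(page, encoding="iso-8859-1"))  # repr&#195;&#169;sente
```

The second call names the wrong encoding and reproduces the garbling seen in the man pages.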
Several of our readers figured this out: Diane Bruce responded in the wee hours of last night not long after I posted the puzzle. The others are Aaron Elkiss and John O'Neill.