March 21, 2004

Them old diacritical blues again

Depending on your browser, you may have noticed some oddities in the Chuvash endings cited in this recent Language Log post about Attila the Goth Hun. The privative and benefactive suffixes should have vowels (a and e) written with underdots. Since there are no economically important languages that use vowels with underdots, the Unicode Consortium in its wisdom has determined that such characters must be handled in the virtuous fashion, by composition of character features, rather than in the convenient and workable fashion, using pre-composed characters such as those provided for the major European and East Asian languages. For the same reason, software writers have been lackadaisical at best about supporting character composition. This creates a catch-22 of global proportions: diacritic-heavy languages like Yoruba don't have the clout to force Unicode to include pre-composed variants, as for instance Korean did; but they also don't have the clout to make software writers render the relevant combining-character sequences correctly.

As I've mentioned in an earlier post, the problems of complex character rendering can be very complex indeed. However, putting a dot under a vowel is not exactly rocket science, and you'd think that people could agree about how to do that much, and then implement that agreement in a consistent way. Of course, you'd be naive and foolish to think that.

In order to get around the fact that not all browsers (and/or browser character encoding and font settings) deal with raw Unicode correctly, I dutifully transformed Unicode 0323 "COMBINING DOT BELOW" into its HTML character-entity form &#803; (changing the code point number from hex to decimal and wrapping it with &#___;). And then I put this abominable string after the vowel to be underdotted, like so:

-sa&#803;r

which should produce "-sạr" with a nicely underdotted "a".
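For anyone who wants to check the arithmetic, here is a minimal Python sketch of the hex-to-decimal step described above. The function name numeric_reference is made up for illustration; html.unescape is the standard-library routine that decodes such references, and it only shows which characters the string stands for, not how any browser will draw them.

```python
import html

def numeric_reference(code_point_hex: str) -> str:
    """Build a decimal HTML numeric character reference from a hex code point."""
    return f"&#{int(code_point_hex, 16)};"

# U+0323 COMBINING DOT BELOW, written after the vowel it should attach to.
dot_below = numeric_reference("0323")
source = "-sa" + dot_below + "r"

print(dot_below)  # &#803;
print(source)     # -sa&#803;r

# Decoding the reference gives back the raw combining character after the "a".
decoded = html.unescape(source)
print([f"U+{ord(ch):04X}" for ch in decoded])
```

The decoded string is five code points long: hyphen, s, a, U+0323, r.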

But in fact it produced a bewildering variety of different outcomes in the different circumstances that I've tried so far.

In Internet Explorer on my Windows laptop, depending on what font I've selected, it produces "sar" with a single dot below the "a", "sar" with a pair of side-by-side dots below the "a", or "sar" with an empty square glyph (indicating a missing character), either after the "a" or superimposed on the "a" (I don't know which settings are responsible for the last difference).

In Mozilla/Netscape/Firebird/Firefox, depending on what font I've selected, it produces the single or double dots NOT under the (preceding) "a" BUT RATHER under the following "r". As I understand the Unicode and HTML specs, this is wrong. Of course, Mozilla is also happy to produce versions with empty boxes in various locations, if the font is missing the combining diacritics, as many fonts are.
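In Unicode's model a combining mark follows the base character it modifies. A rough Python sketch of that rule, using the standard unicodedata module, groups each combining mark with the character before it. The helper name clusters is hypothetical, and this is a simplification rather than the full Unicode text segmentation algorithm.

```python
import unicodedata

def clusters(text: str) -> list[str]:
    """Attach each combining mark to the character that precedes it."""
    out: list[str] = []
    for ch in text:
        # unicodedata.combining() returns a nonzero class for combining marks.
        if out and unicodedata.combining(ch) != 0:
            out[-1] += ch
        else:
            out.append(ch)
    return out

groups = clusters("-sa\u0323r")
print(groups)
```

Under this rule the dot ends up in the same group as the "a", and the "r" stands alone.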

Macromedia's Dreamweaver doesn't even try to do any composition in such cases. I haven't had the heart to check Opera or Safari or Java's HTML rendering classes or any of the other options.

The empty boxes are just a matter of fonts lacking the diacritic glyphs -- OK, that's just a residue of history. The double underdots are apparently a particular font that has gotten 0323 and 0324 mixed up -- OK, that's just a little mistake. But not being able to agree on whether combining diacritics combine to the right or the left?

Come on, people, this is pathetic. It's a gratuitous, ongoing insult to the hundreds of millions of people around the world whose languages are normally written in a Latin alphabet with diacritics that Unicode doesn't happen to provide in precomposed form. And since Microsoft often takes a few lumps in the blogosphere, let's specify that it's the Beast of Redmond that did the right thing here, and Mozilla that gets it wrong.

What I actually decided to do on the page in question was to put the &#803; character entities in the wrong place (before rather than after the vowels) so that the whole mess renders correctly in the Mozilla-based browsers that I usually use. If I set the font right.

Here it is, so you can see what happens in your environment: "-ṣar".

Of course, the result is that the page renders incorrectly for the 55% of our readers who use some form of Internet Explorer. Sorry, folks. What I'm supposed to do, I guess, is to put in some JavaScript code figuring out what browser people are using, and then select different stretches of HTML, depending. But I won't.

[If one of you can tell me how to do underdots in HTML in a reliably portable way, I'll buy you a very good dinner the next time we're in the same city.]

[Update: Tenser, said the Tensor finds underdot e and a in Unicode! He spies them "in the Latin Extended Additional range, which I believe is the 'tricked out Latin characters for use in Vietnamese' range." He has a few other useful suggestions as well, all of which I'll try out when I next have a few minutes. Meanwhile, I believe that I owe a dinner, and will make arrangements to pay up.

One of the reasons that I didn't find these characters is that the index of character names given seems quite incomplete. None of the obvious index points (such as LATIN or SMALL or A or DOT or BELOW) seems to turn up, e.g., LATIN SMALL LETTER A WITH DOT BELOW.

In any case, what I said about combining diacritics still stands -- for example, to handle Yoruba, you need to be able to combine underdotted vowels with acute and grave accents (for tone).]
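The two points in the update can be checked with Python's standard unicodedata module. This is only a sketch: the helper name describe is made up, and the results reflect the character database shipped with the Python in use. It looks up the precomposed underdotted "a" by name, composes the plain sequences with NFC normalization, and shows that an underdotted "e" with an acute accent still needs a combining character.

```python
import unicodedata

DOT_BELOW = "\u0323"  # COMBINING DOT BELOW
ACUTE = "\u0301"      # COMBINING ACUTE ACCENT

def describe(text: str) -> list[str]:
    """List each code point in text with its official Unicode name."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]

# Finding the precomposed character by its full name.
by_name = unicodedata.lookup("LATIN SMALL LETTER A WITH DOT BELOW")

# NFC normalization composes wherever a precomposed form exists.
a_dot = unicodedata.normalize("NFC", "a" + DOT_BELOW)
e_dot_acute = unicodedata.normalize("NFC", "e" + DOT_BELOW + ACUTE)

print(describe(by_name))
print(describe(a_dot))
print(describe(e_dot_acute))
```

The underdotted "a" collapses to the single code point U+1EA1, while the Yoruba-style "e" with underdot and acute comes out as U+1EB9 followed by a separate U+0301.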

Posted by Mark Liberman at March 21, 2004 12:28 PM