March 31, 2004

Convenience for the wealthy, virtue for the poor

Warning: this is a rant. I don't do it very often, but after editing a grant proposal for a few hours this morning, I felt like indulging in one. So bear with me, or move along to the next post. (Now that I think of it, I did indulge in a similar rant just ten days ago. Well, you've been warned...)

I was pleased to find, via Nephelokokkygia, this page by Nick Nicholas on Greek Unicode issues. In particular, he gives an excellent account, in a section entitled "Gaps in the System," of a serious and stubborn problem for applying Unicode to many of the world's languages. He sketches the consortium's philosophy of cross-linguistic generative typography, showing in detail how it applies to classical Greek, and explaining why certain specific combinations of characters and diacritics still don't (usually) work.

Given the choice between the difficult logic of generative typography and the convenient confusion of presentation forms, the Unicode consortium has consistently chosen to provide convenient if confusing code points for the economically powerful languages, but to refuse them systematically to weak ones. As a result, software providers have had little or no incentive to solve the difficult problems of complex rendering.

This reminds me of what Churchill said to Chamberlain after Munich: "You were given the choice between war and dishonor. You chose dishonor and you will have war." The problems of reliable searching, sorting and text analysis in Unicode remain very difficult, in all the ways that generative typography and cross-script equivalences are designed to avoid -- due to the many alternative precomposed characters (adopted for the convenient treatment of major European and some other scripts), and the spotty equivalencing of similar characters across languages and scripts (adopted for the same reason). At the same time, it's still difficult or impossible to encode many perfectly respectable languages in Unicode in a reliable and portable way -- due to the lack of complex rendering capabilities in most software, and the consortium's blanket refusal to accept pre-composed or other "extra" code points for cloutless cultures. I'm most familiar with the problems of Yoruba, where the issue is the combination of accents and underdots on various Latin letters, and of course IPA, where there are many diacritical issues, but Nicholas' discussion explains why similar problems afflict Serbian (because of letters that are equivalent to Russian Cyrillic in plain but not italic forms) and Classical Greek (because of diacritic combinations again).
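To make the asymmetry concrete, here is a minimal sketch using Python's standard unicodedata module. It assumes that the Unicode data bundled with the interpreter behaves as the normalization rules describe; the variable and function names are illustrative only. It shows two things: a French-style "e with acute" has two canonically equivalent encodings that compare unequal until normalized (the searching problem), and a Yoruba-style "e with underdot and acute" can be typed at least four ways, none of which normalizes to a single code point (the rendering problem).

```python
import unicodedata


def code_points(s):
    """Render a string as a list of U+XXXX code points."""
    return " ".join(f"U+{ord(ch):04X}" for ch in s)


def nfc(s):
    """Canonical composition (Normalization Form C)."""
    return unicodedata.normalize("NFC", s)


# "e with acute": a precomposed code point exists, so there are two
# canonically equivalent spellings of the same letter.
e_acute_precomposed = "\u00e9"   # single code point
e_acute_decomposed = "e\u0301"   # e + combining acute

raw_equal = e_acute_precomposed == e_acute_decomposed
nfc_equal = nfc(e_acute_precomposed) == nfc(e_acute_decomposed)

# "e with underdot and acute": four plausible ways to type it.
yoruba_variants = [
    "e\u0323\u0301",   # e + dot below + acute
    "e\u0301\u0323",   # e + acute + dot below
    "\u00e9\u0323",    # precomposed e-acute + dot below
    "\u1eb9\u0301",    # precomposed e-dot-below + acute
]
yoruba_nfc = {nfc(v) for v in yoruba_variants}

print("raw comparison equal:", raw_equal)
print("NFC comparison equal:", nfc_equal)
for form in sorted(yoruba_nfc):
    # Still a base letter plus a combining mark after composition.
    print("Yoruba NFC form:", code_points(form))
```

Normalization collapses the Yoruba variants to one sequence, so comparison can be made to work, but that sequence still contains a combining accent that the font and rendering software have to position correctly.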

I'm in favor of Unicode -- to quote Churchill again, it's the worst system around "except all those other forms that have been tried from time to time." However, I think we have to recognize that the consortium's cynical position on character composition -- convenience for the wealthy, virtue for the poor -- has been very destructive to the development of digital culture in many languages.

There is a general issue here, about solving large-ish finite problems by "figuring it out" or by "looking it up." While in general I appreciate the elegance of "figure it out" approaches, my prejudice is always to start by asking how difficult the "look it up" approach would really be, especially with a bit of sensible figuring around the edges. My reasoning is that "looking it up" requires a finite amount of straightforward work, no piece of which really interacts with any other piece, while "figuring it out" suffers from all the classical difficulties of software development, in which an apparently logical move in one place may have unexpectedly disastrous consequences in a number of other places of arbitrary obscurity.
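As a rough illustration of "look it up, with a bit of sensible figuring around the edges," consider the following sketch. Everything in it is hypothetical: the table contents, the function name, and above all the Private Use Area code points, which stand in for precomposed characters that Unicode does not actually assign and which would not be portable between systems. The "figuring" is limited to canonical decomposition, so that every way of typing a combination reaches the same table key; the rest is a finite table in which no entry interacts with any other.

```python
import unicodedata

# Hypothetical table: canonically ordered decomposed sequences mapped to
# Private Use Area code points standing in for precomposed glyphs.
# These assignments are invented for illustration only.
LOOKUP = {
    "e\u0323\u0301": "\ue000",  # e + dot below + acute
    "e\u0323\u0300": "\ue001",  # e + dot below + grave
    "o\u0323\u0301": "\ue002",  # o + dot below + acute
    "o\u0323\u0300": "\ue003",  # o + dot below + grave
}
MAX_KEY = max(len(key) for key in LOOKUP)


def compose_by_lookup(text):
    """Replace known combining sequences with single table entries."""
    # The "figuring" step: decompose and canonically reorder the marks.
    nfd = unicodedata.normalize("NFD", text)
    out = []
    i = 0
    while i < len(nfd):
        # The "look it up" step: try the longest candidate first.
        for n in range(MAX_KEY, 1, -1):
            chunk = nfd[i:i + n]
            if chunk in LOOKUP:
                out.append(LOOKUP[chunk])
                i += n
                break
        else:
            out.append(nfd[i])
            i += 1
    # Recompose whatever the table did not cover.
    return unicodedata.normalize("NFC", "".join(out))


print(compose_by_lookup("e\u0301\u0323") == compose_by_lookup("\u00e9\u0323"))
```

Adding a new combination means adding one line to the table; nothing else in the function changes, which is the property the argument above depends on.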

I first argued with Ken Whistler about this in 1991 at the Santa Cruz LSA linguistic institute. At the time, he asserted (as I recall the discussion) that software for complex rendering was already in progress and would be standard "within a few years". It's now almost 13 years later, and I'm not sure whether the goal is really in sight or not -- perhaps by the next time the periodical cicadas come around in 2017, the problems will have been solved. Meanwhile, memory and mass storage have gotten so much cheaper that in most applications, the storage requirements for text strings are of no consequence; and processors have gotten fast and cheap enough that sophisticated compression and decompression are routinely done in real time for storage and retrieval. So the (resource-based) arguments against (mostly) solving diacritic combination and language specificity by "look it up" methods have largely evaporated, as far as I can see, while the "figure it out" approach has still not actually succeeded in figuring things out in a general or portable way.

There are still arguments for full decomposition and generative typography based on the complexities of cross-alphabet mapping, searching problems, etc. But software systems are stuck with a complex, irrational and accidental subset of these problems anyhow, because the current system is far from being based on full decomposition.

In sum, I'm convinced that the Unicode designers blew it, way back when, by insisting on maximizing generative typography except when muscled by an economically important country. Either of the two extremes would probably have converged on an overall solution more quickly. But it's far too late to change now. So what are the prospects for eventually "figuring it out" for the large fraction of the world's orthographies whose cultures have not had enough clout to persuade the Unicoders to implement a "look it up" solution for them? As far as I can tell, Microsoft has done a better job of implementing complex rendering in its products than any of the other commercial players, though the results are still incomplete. And there is some hope that open-source projects such as Pango will allow programmers to intervene directly to solve the problems, at least partially, for the languages and orthographies that they care about. But this is a story that is far from over.

Posted by Mark Liberman at March 31, 2004 11:56 AM