January 20, 2007

Standardizing away the world's languages

The transmutation of a prime into a skull and crossbones reported by Geoff is an example of the all too familiar incompatibility of files produced by different word processors and different versions of the same word processor, especially Microsoft Word. These incompatibilities are not only annoying and time-consuming; where the file formats are secret, as Microsoft's are, they also make it nearly impossible for competitors to interoperate completely, make access to archives difficult, and lock users into the same product line. Among other things, this reduces competition and therefore increases costs for consumers.

In response to this problem, a consortium was formed some years ago to create an open standard for the exchange of documents. The standards group began work in 2002 and completed its work in 2005. The result was the Open Document standard (ODF), which you can read for yourself here. The official version of the standard is that produced by the International Organization for Standardization (ISO), available here, but they'll charge you 342 Swiss francs ($274). Open Document is open in the sense that anyone may read it and anyone may use it without obtaining a license or paying royalties. Part of being open in this sense is being sufficiently specific that someone wishing to implement the standard has all of the information he or she needs. If you write a word processor that exports in ODF, I can, using only the specification and without any other information about your program, write a word processor that will import your document perfectly, and of course, conversely.

ODF is a very good thing for just about everybody, from Geoff to the Commonwealth of Massachusetts, the National Archives of Australia, the Allahabad High Court, and Belgium, all of which have adopted it. One entity that is not too keen on ODF is Microsoft. In an effort to prevent ODF from becoming the universal standard, Microsoft suddenly came up with its own "open standard" [49 MB PDF document] known as Open XML. Open XML is not actually an open standard because it leaves some elements publicly undefined. Some elements, for example, are defined only by reference to secret Microsoft specifications. In any case, it isn't really a standard of the usual sort because, in its attempt to enshrine every detail of the formats used by Microsoft products, it is much more specific than a normal standard. That is why it runs over 6,000 pages while ODF runs a mere 737. Rob Weir at An Antic Disposition has a hysterically funny blog post about the Open XML specification. I especially like his observation: "This is not a specification; this is a DNA sequence."

Microsoft is now trying to get Open XML approved by the ISO. The current phase of the process is what is called the "contradictions" stage, in which contradictions between the proposal and existing standards are investigated. The process is described by Pamela Jones in this Groklaw article. Some of the contradictions that have already been pointed out are discussed by Andy Updegrove in this Standards Blog post. For example, Open XML does not follow ISO 8601, the standard for representation of dates and times. Why? Because whoever wrote the code for computing dates in a Microsoft product long ago did not know that 1900 was not a leap year. (Years divisible by 100 are not leap years unless they are divisible by 400.) Open XML requires conforming implementations to replicate this Microsoft bug forever.
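
For readers who want to see the leap-year rule spelled out, here is a minimal Python sketch. It is only an illustration: the function names are my own, and the second function is a hypothetical simplification standing in for the kind of mistake described above, not the actual code of any Microsoft product.

```python
def is_leap_gregorian(year: int) -> bool:
    """Gregorian rule: divisible by 4, except century years not divisible by 400."""
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)


def is_leap_every_fourth_year(year: int) -> bool:
    """Hypothetical simplified rule that ignores the century exception."""
    return year % 4 == 0


# Compare the two rules on a few years; they disagree on 1900 and 2100.
for year in (1900, 2000, 2004, 2100):
    print(year, is_leap_gregorian(year), is_leap_every_fourth_year(year))
```

Under the simplified rule, 1900 gets a 29 February that never existed, so any day count that starts in early 1900 and relies on it is off by one from March 1900 onward.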

Now, you might be wondering what this all has to do with linguistics. Well, one of the things that document metadata specify is the language of the document. The Open Document standard does this correctly. It uses (p. 61) the three-letter language codes of ISO 639, followed by a two-letter country code following ISO 3166. This allows for the specification of any of the world's languages. A three-letter code allows for as many as 17,576 languages. ISO 639-3 in fact already encodes most of the world's approximately 6,700 languages. Open XML, on the other hand, does not follow ISO 639-3. Instead (section 2.18.52), it requires that languages be specified by means of two hexadecimal digits, e.g. 0x09 for English. That means that no more than 256 languages can be accommodated. The list of languages available is in the document referenced above on pp. 2531-2537, but for the two-digit hex codes you'll have to look elsewhere because Microsoft doesn't list them together with the languages. For some reason it gives a completely different set of non-hexadecimal codes ranging from 1025 to 58,380. The hex codes can be found in the fourth column of this table, the one labelled "Win Code".
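
The capacity figures above are easy to check. The following Python sketch does the arithmetic, assuming three-letter codes are drawn from the 26 lowercase ASCII letters and taking the rough figure of 6,700 languages from the paragraph above; the `coverage` helper is a name invented here for illustration.

```python
import string

# Distinct codes made of three lowercase ASCII letters (the ISO 639-3 shape).
LETTERS = len(string.ascii_lowercase)  # 26
three_letter_codes = LETTERS ** 3

# Distinct values expressible in two hexadecimal digits (0x00 through 0xFF).
two_hex_digit_codes = 16 ** 2

# Rough number of the world's languages, as cited in the post.
WORLD_LANGUAGES = 6700


def coverage(capacity: int, needed: int = WORLD_LANGUAGES) -> float:
    """Largest fraction of `needed` items that a code space of size `capacity` can label."""
    return min(capacity, needed) / needed


print(three_letter_codes)                       # 17576
print(two_hex_digit_codes)                      # 256
print(f"{coverage(three_letter_codes):.1%}")    # 100.0%
print(f"{coverage(two_hex_digit_codes):.1%}")   # 3.8%
```

In other words, even if every one of the 256 two-digit values were assigned to a different language, fewer than 4% of the world's languages could be identified.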

In short, the Open Document standard provides for all the languages in the world, while Open XML excludes the great majority. This isn't a matter of ignorance. Microsoft has employees like Michael Kaplan who are quite knowledgeable about the world's languages and the technical issues that they raise, but business strategy comes first.

Posted by Bill Poser at January 20, 2007 12:30 AM