Language Log: WEB ADDRESSES - IN ANY LANGUAGE?

August 03, 2003

On 20 June 2003, ICANN announced the deployment of a new system of Internationalized Domain Names (IDNs), which permit domain names to use non-Roman scripts. Thanks to this we now have spam about "multilingual web addresses" and hype which confuses languages and writing systems: "register any domain name in any language."

The new system permits web addresses in any Unicode-supported script, covering all the "prominent" languages but leaving many others unsupported.

To avoid modifying the infrastructure of the internet, each IDN is uniquely and reversibly translated into an ASCII string that carries the special prefix "xn--" and consists only of letters, digits and hyphens. These ASCII names are then resolved to IP addresses like 129.215.144.3 by nameservers in the usual way. End users don't normally see the ASCII names, since the mapping between Unicode and ASCII is handled by IDN-aware web applications. For example, consider the following Devanagari IDN:

यहलोगहिन्दीक्योंनहींबोलसकतेहैं.com

This would be mapped to: xn--i1baa7eci9glrd9b2ae1bj0hfcgg6iyaf8o0a1dig0cd.com
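The mapping can be tried out with a minimal Python sketch. It assumes Python's built-in "punycode" codec (an implementation of the Punycode encoding from RFC 3492) and handles a single label only; the helper names to_ascii_label and to_unicode_label are made up for this illustration, and the character preparation step that a real IDN-aware application performs before encoding is left out.

```python
# The Devanagari label from the example, spelled out as code points so the
# sketch does not depend on how this page's text happens to be encoded.
LABEL = (
    "\u092f\u0939\u0932\u094b\u0917\u0939\u093f\u0928\u094d\u0926"
    "\u0940\u0915\u094d\u092f\u094b\u0902\u0928\u0939\u0940\u0902"
    "\u092c\u094b\u0932\u0938\u0915\u0924\u0947\u0939\u0948\u0902"
)

ACE_PREFIX = "xn--"  # the special prefix that marks an encoded label


def to_ascii_label(label: str) -> str:
    """Encode one Unicode label as prefix + Punycode (hypothetical helper)."""
    return ACE_PREFIX + label.encode("punycode").decode("ascii")


def to_unicode_label(ascii_label: str) -> str:
    """Reverse the mapping; labels without the prefix are returned unchanged."""
    if not ascii_label.startswith(ACE_PREFIX):
        return ascii_label
    return ascii_label[len(ACE_PREFIX):].encode("ascii").decode("punycode")


ascii_label = to_ascii_label(LABEL)
print(ascii_label + ".com")                   # the ASCII form sent to nameservers
print(to_unicode_label(ascii_label) == LABEL)  # the mapping is reversible
```

The ".com" part is already ASCII and is left alone; only the non-ASCII label is encoded.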

RFC 3490, the first of three IDN Standards, recognizes a linguistic problem: "the introduction of the larger repertoire of characters creates more opportunities of similar looking and similar sounding names." ICANN's Guidelines address this by requiring top-level domain registries to associate each IDN with one language or set of languages, and to employ language-specific rules, such as reserving all domain names with equivalent character variants in the languages associated with the registered domain name.

Language identification: Associating an IDN with a language or set of languages is problematic when the standard for language identification covers less than a tenth of the world's languages. Peter Constable and Gary Simons (2000) explore this and a host of attendant problems in their paper Language identification and IT: Addressing problems of linguistic diversity on a global scale.

Language-specific rules: Unicode already introduces indeterminacy, since a single visual form, such as a URL printed on a business card, has many Unicode representations. For example, U+00C7 (LATIN CAPITAL LETTER C WITH CEDILLA) can also be expressed as the sequence U+0043 (LATIN CAPITAL LETTER C) U+0327 (COMBINING CEDILLA). Beyond this, each language has its own possibilities for orthographic confusability and variability, especially languages lacking a standard orthography. In ICANN's system, the top-level domain registries will handle these by establishing language-specific rules of character equivalences.
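The C-with-cedilla case can be checked with a short Python sketch using the standard unicodedata module. NFC normalization is used here purely as an illustration of folding equivalent representations together; it is an assumption of this sketch, not a description of the exact preparation profile the IDN standards prescribe, and same_after_nfc is a made-up helper name.

```python
import unicodedata

precomposed = "\u00c7"       # LATIN CAPITAL LETTER C WITH CEDILLA
decomposed = "\u0043\u0327"  # LATIN CAPITAL LETTER C + COMBINING CEDILLA


def same_after_nfc(a: str, b: str) -> bool:
    """Compare two strings after NFC normalization (hypothetical helper)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# The two spellings look identical in print but differ as code point sequences.
print(precomposed == decomposed)                # False
print(same_after_nfc(precomposed, decomposed))  # True

# Normalization does not merge look-alikes from different scripts:
# Latin "a" (U+0061) and Cyrillic "a" (U+0430) stay distinct.
print(same_after_nfc("\u0061", "\u0430"))       # False
```

The last check shows why Unicode-level normalization alone cannot settle confusability: look-alike characters across scripts, and language-specific variants, need the registry rules described above.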

This model has three indeterminacies: language identification, Unicode character identification, and language-specific character equivalences. Perhaps it had to be this complicated. If nothing else, the indeterminacies will be a great marketing opportunity. For example, Verisign's IDN Marketing Guide presents several ploys for getting people to buy variants of the same IDN.

Posted by Steven Bird at August 3, 2003 08:52 AM