September 29, 2004

Flash: fontgate at Language Log

This time, the pajamahadeen were coming after me.

A couple of days ago, I typed in a few paragraphs from the novel Gravity's Rainbow, dealing with the efforts of Soviet linguists around 1926-28 to establish a new alphabet for Turkic languages of Central Asia. The quoted passage included an odd character, one that I've never seen in any other context, which is described as representing "a kind of G, a voiced uvular plosive". I searched through the Unicode code charts for a few minutes without finding it, and finally came up with ଗ U+0B17 ORIYA LETTER GA, which is vaguely similar in appearance and also in phonetic value, but is clearly not the right thing. However, I was about to be late for an interesting talk. And maybe the right character wasn't really available at all -- Pynchon might have made it up, or it might be an obscure invention that never made it into Unicode. And who would notice, anyhow? So I figured U+0B17 (ଗ) would do.

Guess again. Within a few hours, Tim May emailed to suggest that (based on his knowledge of Unicode, and his memory of what the character looks like in the printed version of Pynchon's novel, which he didn't have at hand!) the Unicode code points should really be Ƣ - U+01A2 LATIN CAPITAL LETTER OI and ƣ - U+01A3 LATIN SMALL LETTER OI, about which the code chart for Latin Extended-B adds the note "= gha [in] Pan-Turkic Alphabets". Tim also pointed to this chart of the "Kirghiz (Kyrgyz) Latin alphabet (1928 - 1940)", which shows that (something looking like) Ƣ and ƣ were definitely in there, ordered right after G.

Tim commented gently that

The two characters are quite similar, and apparently both denote voiced back consonants. Still, a rather surprising substitution. Oriya's one of the more obscure of the Indic scripts in Unicode.
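
For anyone who wants to check the code points mentioned here, a minimal sketch using Python's standard unicodedata module follows. The function name describe and the list CANDIDATES are made up for illustration; the hex values are the ones quoted in this post, and the names printed come from whatever version of the Unicode Character Database ships with your Python.

```python
import unicodedata

# Code points quoted in the post: the Oriya stand-in and the two
# Latin Extended-B letters that Tim May suggested instead.
CANDIDATES = [0x0B17, 0x01A2, 0x01A3]


def describe(code_point: int) -> str:
    """Return 'U+XXXX NAME' for a code point (hypothetical helper)."""
    char = chr(code_point)
    return f"U+{code_point:04X} {unicodedata.name(char)}"


if __name__ == "__main__":
    for cp in CANDIDATES:
        # Print the character itself next to its code point and name.
        print(chr(cp), describe(cp))
```

Run as a script, this prints one line per character, which makes it easy to see that the two candidates belong to entirely different scripts even though the glyphs look somewhat alike.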

Right -- Oriya is spoken in Orissa state, on the eastern seacoast of India, south of Bengal. A very long way from Baku. If I were the president of CBS News, this would be my cue to mutter about the Silk Road and the spread of Buddhism, to object that the glyph in the printed form of Pynchon's novel looks as much like U+0B17 as it does like U+01A2, and to admit grudgingly that "Language Log cannot prove that this code point is authentic". But I'm not, so I'll just say that I made a mistake. I knew it was a mistake at the time, and considered adding a note about it, but thought that would be piling pedantry on top of pedantry, so... Hell, I didn't think anyone would notice. Isn't it incredible that there is someone who knows enough, and cares enough, to look at the page source of a post like that one, track down the Unicode code point involved, and send a helpful email to correct it?

The thing is, I thought Tim's intervention was terrific. I learned something, I improved the quality of the post, it's all good. In the opening of this post, I put myself in the position of CBS just to make a joke. Of course, Tim wasn't accusing me of basing an argument on forged documents.

If I had tried to, I surely wouldn't have gotten away with it. I do have that much in common with Dan Rather.


Posted by Mark Liberman at September 29, 2004 12:15 AM