June 04, 2006

ASCII the diacritic assassin

[Guest post by Jarek Weckwerth. Jarek was one of those who answered Sven Godtvisken's "Name that tune ('s language)" query. This morning, he sent along an interesting follow-up, presented here after the jump.

There are a few terms of art in Jarek's post that may not be familiar to everyone, so here's a quick glossary: ASCII is the American Standard Code for Information Interchange, a 40-year-old scheme for assigning 7-bit codes to characters used in representing text. ASCII lacks codes for characters with accents and other diacritics, such as háčeks and ogoneks. ISO Latin-1 is a slightly less obsolete character encoding, which uses 8 bits rather than 7, and includes many though not all of the characters used in "Latin" orthographies, i.e. those based on the Roman alphabet. Unicode is a modern, non-obsolete character encoding standard, increasingly (but still incompletely and often inadequately) supported by computer software that deals with text. UTF stands for Unicode Transformation Format; there are several different UTFs, of which the commonest (I believe) is UTF-8, designed by Ken Thompson and Rob Pike on a placemat in a New Jersey diner, one evening in 1992. Rob's description of the process will help you to understand why Unicode needs "transformation formats" and why there is more than one of them.]
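For readers who want to see the glossary in action, here is a minimal Python sketch of how the three encodings treat a handful of characters. The sample characters and the helper name `try_encode` are chosen purely for illustration and do not come from Jarek's note.

```python
# Illustrative sample: plain a, a with diaeresis, a with ogonek, c with hacek.
samples = ["a", "ä", "ą", "č"]

def try_encode(ch, encoding):
    """Return the encoded bytes as a hex string, or None if the
    encoding has no code for this character."""
    try:
        return ch.encode(encoding).hex()
    except UnicodeEncodeError:
        return None

for ch in samples:
    # One row per character: what each encoding makes of it.
    row = {enc: try_encode(ch, enc) for enc in ("ascii", "latin-1", "utf-8")}
    print(ch, row)
```

ASCII only has a code for the plain "a"; Latin-1 adds "ä" as a single byte but has nothing for "ą" or "č"; UTF-8 can represent all four, using two bytes for each of the accented ones.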

[Jarek's note follows.]

Just (?) three things more:

(1) Voltaj’s web page seems to have the official lyrics. “In chefuri si in fite.” So I think that should solve the Mondegreen problem :) (Notice the lack of diacritics on their page, which is otherwise quite nice in fact.)

(2) Stephen Smith’s comments re gashca etc.: In fact you can see the same in Poland. Those who cannot/do not use Polish diacritics e.g. in email will sometimes use English-like spellings instead, e.g. tesh for też /tɛʃ/ ‘too’. I would say this is done mainly for fun, and in an extremely irregular/idiosyncratic fashion. There are relatively few cases where you might see some real justification in avoiding “graphemic neutralisation” with another word. So, for example, the informal term for a mobile/cell phone, komóra /kɔˈmura/ may end up as komoora, since without the diacritic it’s in competition with komora /kɔˈmɔra/ ‘chamber’.

Another approach is to use “phonetic” spellings that would normally count as “spelling errors”, thus komura.

Interestingly, some owners of mobiles that do not offer Polish diacritics (still a majority) will use a slightly different technique when texting: they’ll use whatever diacritics are available, usually Western European characters with diacritics within Latin-1, thus e.g. mojä for moją ‘my sg.fem.acc./loc.’ to preserve the contrast with moja ‘my sg.fem.nom.’.

Of course, this is all redundant in the vast majority of cases; in context, Polish (like Romanian) is almost completely readable without the diacritics, and many people will happily write Polish without them. However, my impression is that these are mainly the “old ASCII guard” (like myself – I’ve switched to UTF especially for this message) rather than “internet-savvy modern” people, as Stephen says. Or perhaps those who realise what conversion problems there are. I never use the Polish characters in my phone because they invariably come out garbled at the other end unless the recipient happens to also use a relatively up-to-date Sony Ericsson. Nokia or Samsung won’t do ;)
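The two effects described so far, the loss of contrast when diacritics are dropped and the garbling when they are kept, can be sketched in a few lines of Python. The function names `strip_diacritics` and `garble` are hypothetical, and the garbling shown (UTF-8 bytes misread as Latin-1) is only one common failure mode assumed for illustration; it is not a claim about what any particular phone actually does.

```python
import unicodedata

def strip_diacritics(text):
    """ASCII-fold text: decompose each character, then drop combining marks."""
    # Polish l-with-stroke has no Unicode decomposition, so map it by hand.
    text = text.translate(str.maketrans({"ł": "l", "Ł": "L"}))
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

# Word pairs from the examples above: distinct with diacritics, identical without.
pairs = [("komóra", "komora"), ("moją", "moja")]
for marked, plain in pairs:
    folded = strip_diacritics(marked)
    print(marked, "->", folded, "| collides with", plain, ":", folded == plain)

def garble(text):
    """Simulate one assumed conversion failure: UTF-8 bytes read as Latin-1."""
    return text.encode("utf-8").decode("latin-1")

print(garble("też"))  # prints: teÅ¼
```

Both pairs fold to the same string, which is the “graphemic neutralisation” at issue; and the two bytes that UTF-8 uses for ż come out as two unrelated Latin-1 characters when misread.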

One interesting case is the demise of ę in some endings, as in piszę ‘I’m writing/I write’ vs. pisze ‘(s)he/it is writing/writes’. The ending lost nasality quite some time ago, with the two forms becoming homophonous for many verbs. Standard spelling still preserves the distinction but I think it won’t last long in the Internet age; people who otherwise do use diacritics omit the ogonek more and more often. ASCII reinforces a natural sound change.

Has there been any research into this?

(3) When re-reading my response to the name-that-song riddle (BTW, thanks for posting!), I noticed I had done some unexpected stereotyping: I seem to think that Slavic/Balkan rock is an identifiable musical style. Where did that come from?! My experience with Russian/Romanian/Serbian etc. rock/pop is rather limited (=close to nonexistent). Two options:

(i) There is indeed an identifiable musical “substratum” in at least some pop/rock produced in those countries; it’s rooted in folk music; this folk music has some shared characteristics; it’s part of the stereotypical images of those countries; and these stereotypical images are sufficiently familiar to me. Possible. My feeling was that e.g. Russia, Poland, Serbia, Romania, Bulgaria would qualify, but the Czech Rep. wouldn’t, and for some reason neither would Greece. (No intuitions about Albania whatsoever.) So, after all, this may be related to experience in some way, however subconscious and superficial it might be.

(ii) It was the Balkan Sprachbund at work. My first impression was based – I think – on the ubiquity of hard-ish sounding [tʃ, ʃ], and [ts], coupled with a rather simple vowel set. This was consistent with the initial Slavic hypothesis, and was amplified by [tatuˈaʒɛ] (whatever the final vowel), because the word sounded so instantly recognisable. (And it finally turned out to actually mean the same as in Polish!) Of course there was the schwa in one of the first words, but I seem to remember that it’s the only common phonetic feature of Romanian, Bulgarian and Albanian that is widely cited, most descriptions of the Bund focussing on morphosyntax, while my intuition was based on Slavic-like consonantal content. Could it be that the poor Romanians, having been surrounded by the Slavic element for so long, have not only borrowed some vocabulary but also some phonotactics? Not that unlikely. I’ll have to read up on this next week; a quick Google search doesn’t help this time.

It was probably both (i) and (ii). Anyway, the intuition was largely correct. Sometimes the apparent usefulness of stereotyping in everyday life scares me.

[Guest post by Jarek Weckwerth]

Posted by Mark Liberman at June 4, 2006 05:44 AM