February 05, 2004

HMMs at the fords of Ephraim

Anoop Sarkar has an interesting weblog Special Circumstances, which mingles (computational) linguistics with reviews of science fiction, movies and other notes. In the "other" category, he noted last week that

Australian science-fiction author, Greg Egan, has taken time off from his fiction writing to investigate the procedure of immigration detention in his country.

His essay on the topic is called The Razor Wire Looking Glass.

One sentence in this essay was particularly intriguing:

There are institutionalised flaws in the system, such as the language tests routinely used for validating people's nationality that have been discredited by professional linguists.

I wonder what kind of language test can prove that one is from a particular country. Kafka (if he used speech reco) might imagine the following scenario. Perhaps they ask people to talk into ViaVoice and measure the word error rate: "Edit distance of 24? You must be from Bhutan."

This is not entirely a joke, as Anoop doubtless knows, though probability with respect to a model is a more likely measure than edit distance -- here's a link to some relevant research. Of course, no responsible person would suggest assigning any legal weight to the results of such automated diagnosis. And I imagine that the Australian test cited by Egan is some paper-and-pencil thing, anyhow, though he doesn't give any details at all.
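To make "probability with respect to a model" concrete, here is a toy sketch in Python. It is not the method used in the research linked above, and it is certainly nothing anyone should base a decision on. It stands in a character-bigram model with add-one smoothing for what would really be an acoustic model over speech, and the function names and the two tiny "training corpora" are invented for illustration. The idea is only that each candidate variety gets its own model, and an utterance is assigned to whichever model gives it the higher log-probability.

```python
import math
from collections import Counter

def train_bigram_model(text):
    """Count character bigrams and their left contexts in a training string."""
    padded = "^" + text + "$"          # ^ and $ mark start and end
    bigrams = Counter(zip(padded, padded[1:]))
    contexts = Counter(padded[:-1])
    return bigrams, contexts, set(padded)

def log_likelihood(model, utterance, vocab_size):
    """Log-probability of the utterance under an add-one-smoothed bigram model."""
    bigrams, contexts, _ = model
    padded = "^" + utterance + "$"
    total = 0.0
    for a, b in zip(padded, padded[1:]):
        total += math.log((bigrams[(a, b)] + 1) / (contexts[a] + vocab_size))
    return total

def classify(utterance, models):
    """Return the name of the model that assigns the highest log-probability."""
    # Share one vocabulary size across models so the scores are comparable.
    vocab_size = len(set().union(*(m[2] for m in models.values())))
    return max(models, key=lambda name: log_likelihood(models[name], utterance, vocab_size))

# Hypothetical toy data: spellings standing in for two pronunciations.
models = {
    "gilead": train_bigram_model("shibboleth shalom shemesh shana"),
    "ephraim": train_bigram_model("sibboleth salom semes sana"),
}

print(classify("shibboleth", models))  # gilead
print(classify("sibboleth", models))   # ephraim
```

With a few words of training text the decision hinges almost entirely on whether "sh" was ever seen, which is exactly the weakness of the original test: the score measures resemblance to the training data and says nothing about where a speaker is actually from.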

A more traditional (and fatal) example of language as gatekeeper is given in Judges 12:


Jephthah then called together the men of Gilead and fought against Ephraim. The Gileadites struck them down because the Ephraimites had said, "You Gileadites are renegades from Ephraim and Manasseh."


The Gileadites captured the fords of the Jordan leading to Ephraim, and whenever a survivor of Ephraim said, "Let me cross over," the men of Gilead asked him, "Are you an Ephraimite?" If he replied, "No,"


they said, "All right, say `Shibboleth.'" If he said, "Sibboleth," because he could not pronounce the word correctly, they seized him and killed him at the fords of the Jordan. Forty-two thousand Ephraimites were killed at that time.

A 20th-century parallel to the Biblical shibboleth story took place in the Dominican Republic in 1937, when tens of thousands of Haitians were massacred on the basis of whether or not they could roll the /r/ in the Spanish word for "parsley" (the cited page is some on-line background reading for an introductory linguistics course).

[Update: the author of cannylinguist emails cannily that

I suppose it doesn't detract from the injustice, but the /r/ in "perejil" is flapped, not rolled.

I'm no expert in the phonetics of Spanish dialects, but I guess that's right -- at least it's consistent with what I've heard and seen in other varieties of Spanish, where (as I recall) word-initial /r/ is trilled, and word-medial /r/ is written "rr" when trilled (as in perro), but is written "r" when tapped (as in pero or, I suppose, perejil).

I just copied the story uncritically from Wucker's account and from Dove's poem, and of course neither of them is trained in phonetic vocabulary or its application to speech. I'll check among folks who know something about Dominican Spanish and Haitian Kreyol, and post an update if I learn anything new about what the difference in /r/ pronunciation in perejil would have been.]

Posted by Mark Liberman at February 5, 2004 07:51 AM