December 24, 2006

The ghost of Christmas past, and the entropy of (C)han(n)uk(k)a(h)

Merry Christmas to our readers! Some seasonally-appropriate reading from past editions:

"Same-sex Mrs. Santa: 'The semantics are confusing'" 11/27/2003
"'Twas the night before Christmas" 11/24/2003
"A 'Boxing Day Election' -- or not?" 12/5/2004
"Talking animals: Miracle or curse?" 12/24/2004
"Homo Hemingwayensis" 1/9/2005
"For linguists only" 2/4/2005
"Christmas trees and holiday trees" 12/2/2005
"Negation, over- and under-" 12/21/2005
"L(a)ying snow" 12/24/2005
"Zogby: Bill O'Reilly's bitches?" 12/22/2006

In other holiday news, a new survey by Language Log labs has found that Hanukkah is second only to Muammar al-Gaddafi in public spelling uncertainty.

We learned of this problem by data-mining the web. Ignoring case, here are some of the counts:

 
Spelling          Google         Yahoo        MSN
hanukkah      24,100,000    55,900,000  2,097,292
hanukah        1,160,000    56,600,000    537,348
hannukah       1,430,000     57,200,00    159,167
hannukkah         85,200        71,200     12,823
hanukka          194,000    33,600,000     21,469
hanuka           957,000    55,000,000     39,290
hannuka          125,000       126,000      9,031
hannukka           9,540         2,010      1,352

 

 
Spelling          Google         Yahoo        MSN
chanukkah        461,000    33,800,000     56,078
chanukah       5,380,000    38,600,000    767,919
channukah        560,000    33,900,000     46,153
channukkah           975         1,750        638
chanukka         359,000       291,000     24,662
chanuka          835,000    33,200,000     52,053
channuka           3,040        35,200      4,282
channukka            697         1,320        577

(Note that Yahoo is almost certainly doing some curious sort of "query expansion".)
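For the curious, the sixteen spellings counted above are exactly the combinations you get by switching each of the four parenthesized letters in "(C)han(n)uk(k)a(h)" on or off. Here is a minimal sketch in Python that enumerates them; the function name `spellings` is just an illustrative choice, and the pattern covers only the variants tabulated here, not forms like Khanike.

```python
from itertools import product

def spellings():
    """Enumerate the variants of (c)han(n)uk(k)a(h).

    Each of the four optional letters is either present or absent,
    which gives 2**4 = 16 candidate spellings.
    """
    variants = []
    for c, n, k, h in product(("", "c"), ("", "n"), ("", "k"), ("", "h")):
        variants.append(f"{c}han{n}uk{k}a{h}")
    return variants

variants = spellings()
for v in variants:
    print(v)
```

Each string in `variants` could then be sent to a search engine as a query, which is roughly how counts like those above would be collected.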

The orthographic background of this problem is discussed in the Wikipedia article, from which I learned about Khanike, the "YIVO standard transliteration from the Yiddish and/or Ashkenazic pronunciation of the Hebrew", which has 10,300 Google hits; and also about Robert Siegel's entertaining and informative exploration of the issues on All Things Considered last year.

In our survey results, 31.2% of the American public claimed to know how to spell Hanukkah, while 63.4% said they had no clue, and 5.4% responded that "it's people like you who are ruining Christmas". When we asked those who claimed to know the spelling what it actually is, we got 11 different versions from the 15 people who actually made it through to the end of the word. A typical response from the others: "Hey, man, what is this, fifth grade?"

For those who care about such things, the entropy of the MSN distribution is almost exactly 2 bits, corresponding to the amount of uncertainty in four equally likely alternatives.
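For readers who want to check the arithmetic, here is a rough sketch of the calculation in Python. It treats the sixteen MSN counts from the tables above as a probability distribution over spellings and computes the Shannon entropy, H = -Σ p log2 p. The function name `entropy_bits` is an illustrative choice, and normalizing raw hit counts into probabilities is an assumption about how the figure was obtained.

```python
import math

# MSN hit counts copied from the two tables above.
msn_counts = {
    "hanukkah": 2_097_292, "hanukah": 537_348, "hannukah": 159_167,
    "hannukkah": 12_823, "hanukka": 21_469, "hanuka": 39_290,
    "hannuka": 9_031, "hannukka": 1_352,
    "chanukkah": 56_078, "chanukah": 767_919, "channukah": 46_153,
    "channukkah": 638, "chanukka": 24_662, "chanuka": 52_053,
    "channuka": 4_282, "channukka": 577,
}

def entropy_bits(counts):
    """Shannon entropy, in bits, of the distribution given by raw counts."""
    counts = list(counts)
    total = sum(counts)
    # Zero counts contribute nothing to the sum, so skip them.
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

h = entropy_bits(msn_counts.values())
print(f"entropy: {h:.3f} bits")
print(f"equivalent number of equally likely spellings: {2 ** h:.2f}")
```

The second printed figure is 2 raised to the entropy, which is where the "four equally likely alternatives" reading comes from: four equiprobable choices carry exactly log2(4) = 2 bits.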

[Several readers have pointed out that it's strange that the only consistent part in the many common spellings of this word is the vowel sequence 'a u a', which is also the only part that isn't specified by the Hebrew orthography (heth nun vav kaf hey). Others have pointed out that this is completely expected, given that the name of the first letter itself has the common variants Ḥet, H̱et, Khet, Kheth, Chet, Cheth, Het, and Heth. And then there are those who have pointed to additional variants in which the vowels are also altered, like "Hanakah" (13,200 Google hits). Well, as Don Rumsfeld said about the looting of Baghdad, "Freedom's untidy".]

Posted by Mark Liberman at December 24, 2006 08:22 AM