June 02, 2004

(Mis)spelling Gandhi

Shankar Kalyanaraman observes that people often write Gandhi as "Ghandi". In fact, this misspelling is much commoner than either of the other two errors, "Ghandhi" and "Gandi":

 

 
            dh          d
gh       8,220    261,000
g    2,260,000     78,000

The difference is even larger considering that many of the "gandi" hits are really examples of gandi.net or other completely different but equally valid words. We've commented on this pattern of errors many times before, for example with respect to Jennifer, tomorrow, parallel, Karttunen, Attila, and so on. What this means in the case of Gandhi is that people know there is an "h" in there somewhere, and just one of them, but they're not too sure where it is. As a result, the omission of the "h" after the "d" and the insertion of an "h" after the "g" are not statistically independent processes.
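As a rough check on the non-independence claim, here is a minimal Python sketch that treats the four hit counts above as a 2x2 contingency table. The variable names are made up for illustration, and the counts are raw search hits (including the noisy "gandi" figure), so the result is indicative at best. It compares the observed number of "Ghandi" hits with the number expected if inserting an "h" after the "g" and omitting the "h" after the "d" were independent events.

```python
# Approximate hit counts from the table above, keyed by
# (spelling at the "g" position, spelling at the "d" position).
counts = {
    ("gh", "dh"): 8_220,      # Ghandhi
    ("gh", "d"): 261_000,     # Ghandi
    ("g", "dh"): 2_260_000,   # Gandhi
    ("g", "d"): 78_000,       # Gandi
}

total = sum(counts.values())

# Marginal rate of each error, ignoring what happens at the other position.
p_insert_h = (counts[("gh", "dh")] + counts[("gh", "d")]) / total
p_omit_h = (counts[("gh", "d")] + counts[("g", "d")]) / total

# If the two errors were independent, "Ghandi" would occur at the
# product of the two marginal rates.
expected_ghandi = total * p_insert_h * p_omit_h
observed_ghandi = counts[("gh", "d")]
excess = observed_ghandi / expected_ghandi

# Odds ratio for the 2x2 table; a value of 1 would mean independence.
odds_ratio = (counts[("gh", "d")] * counts[("g", "dh")]) / (
    counts[("gh", "dh")] * counts[("g", "d")]
)

print(f"expected Ghandi under independence: {expected_ghandi:,.0f}")
print(f"observed Ghandi: {observed_ghandi:,}")
print(f"observed / expected: {excess:.1f}")
print(f"odds ratio: {odds_ratio:.0f}")
```

On these figures, "Ghandi" turns up several times more often than independence would predict, and the odds ratio is far from 1.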

It's no doubt also relevant, in this case, that "gh" is a commoner sequence of letters in English than "dh" is, by a large factor.

I haven't seen a model of spelling/misspelling that does a very good job of predicting such patterns. The spelling-correction algorithms that I'm familiar with tend to assume independence of string-local edits, which is obviously wrong.
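To make the problem concrete, here is a small Python sketch of plain unit-cost Levenshtein distance, used as a stand-in for the kind of model that scores each local edit independently. It is not the algorithm of any particular spelling corrector, and the function name is chosen for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Unit-cost edit distance: insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, 1):
        cur = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]


for variant in ("ghandi", "ghandhi", "gandi"):
    print(variant, levenshtein("gandhi", variant))
```

Under this measure "ghandhi" and "gandi" are each one edit away from "gandhi", while "ghandi" is two edits away (one insertion plus one deletion). A model that multiplies independent per-edit probabilities would therefore rank "ghandi" as the least likely of the three errors, the opposite of what the counts show.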

Posted by Mark Liberman at June 2, 2004 09:22 PM