I'm just a humble collector of Cupertino curiosities, but Thierry Fontenelle of the Microsoft Natural Language Group is deep in the orthographic trenches, tinkering with the algorithms used by the Microsoft Office spellchecker so that users get the spelling suggestions they deserve. (Whether they choose to take those suggestions is of course another matter.) Below is Thierry's response to my latest Cupertino foray, "Cupertino, Part Deux: I read it on misplace."
Ben Zimmer’s recent post on “MySpace”, “Misplace” and spell-checkers is very interesting. As noted in his post, the word MySpace is now in the lexicon of the Office 2007 spellchecker. I guess nobody will complain about that addition. It shows that the tools evolve, just as our vocabulary does, and that solving the Cupertino issues he regularly describes on Language Log is not only an algorithmic problem but also a matter of dictionary coverage.
As you know, a spell-checker has two main functions: it should spot mistakes, but it should also try to suggest the most likely word form to replace the erroneous input. Computing the suggestions is usually an algorithmic process based upon the concept of “edit distance”, which measures the number of character manipulations needed to turn a correct word into an incorrectly spelled one. The most common manipulations are deleting, adding, transposing or replacing a character. Here are examples of such manipulations (the word to the right of the arrow is flagged with a red squiggle):
Deleting a character: information → infomation
Adding a character: developing → developping
Transposing 2 characters: believe → beleive
Substitution: independent → independant
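The four manipulations listed above can be counted with a short dynamic-programming routine. What follows is only a minimal sketch of one common formulation (the "optimal string alignment" variant of edit distance, which counts a swap of two adjacent characters as a single step). It is not the Office speller's actual algorithm, and the function name `osa_distance` is made up for this illustration.

```python
def osa_distance(source: str, target: str) -> int:
    """Count deletions, insertions, substitutions and adjacent
    transpositions needed to turn `source` into `target`.

    Illustrative only: a textbook formulation, not the Office speller.
    """
    rows, cols = len(source) + 1, len(target) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete every character of source[:i]
    for j in range(cols):
        d[0][j] = j  # insert every character of target[:j]

    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if source[i - 1] == target[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
            # Two adjacent characters swapped counts as one manipulation.
            if (i > 1 and j > 1
                    and source[i - 1] == target[j - 2]
                    and source[i - 2] == target[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


examples = [
    ("information", "infomation"),   # deleting a character
    ("developing", "developping"),   # adding a character
    ("believe", "beleive"),          # transposing 2 characters
    ("independent", "independant"),  # substitution
]
for correct, misspelled in examples:
    print(correct, "->", misspelled, osa_distance(correct, misspelled))
```

Under this formulation each of the four misspellings is exactly one manipulation away from its correct form.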
When a word is not found in the speller dictionary, the speller looks for the nearest candidates in terms of edit distance, and this distance is used to compute the order of the suggestions. In addition to “edit distance”, which is a general, language-independent concept, some language-specific knowledge may also be used to fine-tune the order of suggestions. There can be a specific rule saying, for instance, that some users mix up the letters “gh” with “f”. If you write “rouf”, the edit distance mechanism is responsible for the suggestion “roof” appearing in the first position, but you will also see “rough” in the list of suggestions offered by the speller, even though turning “rough” into “rouf” takes more manipulations (deleting “gh” and adding “f”). This is based upon an English-specific typology of errors, which enables us to take frequent mistakes into account.
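To make the “rouf” example concrete, here is a rough sketch of how a language-independent distance and a language-specific confusion rule might be combined. Everything in it is an assumption made for illustration: the tiny `LEXICON`, the `CONFUSIONS` table, the `max_distance` cutoff and the function names are hypothetical, and the real speller certainly works differently in its details.

```python
def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


# Hypothetical toy data, not the Office lexicon or rule set.
LEXICON = {"roof", "rough", "tough", "proof"}
CONFUSIONS = [("f", "gh"), ("gh", "f")]  # English-specific mix-ups


def rule_variants(word: str) -> set:
    """Rewrite each occurrence of a confusable sequence, one at a time."""
    variants = set()
    for wrong, right in CONFUSIONS:
        start = word.find(wrong)
        while start != -1:
            variants.add(word[:start] + right + word[start + len(wrong):])
            start = word.find(wrong, start + 1)
    return variants


def suggest(word: str, max_distance: int = 1) -> list:
    """Nearest words by edit distance first, then rule-based candidates."""
    ranked = sorted((edit_distance(word, w), w) for w in LEXICON)
    suggestions = [w for d, w in ranked if d <= max_distance]
    for variant in sorted(rule_variants(word)):
        if variant in LEXICON and variant not in suggestions:
            suggestions.append(variant)
    return suggestions


print(suggest("rouf"))                 # "roof" first, then "rough"
print(edit_distance("rouf", "rough"))  # 2: too far for the cutoff alone
```

With a cutoff of one manipulation, “rough” would be missed by distance alone; the confusion rule is what brings it into the list, behind “roof”.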
Ben cites one of his readers who points out that the latest Word spellchecker gives misplace as the first suggestion for Mysplace. That is true and is in fact expected, since going from Misplace to Mysplace only requires substituting “y” for “i” (a one-character change).
The distance is longer to transform MySpace into Mysplace (turning the capital “S” into a lower-case “s” and adding “l”). This is why MySpace appears as the 2nd suggestion.
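The arithmetic behind this ordering can be checked with a few lines of code. The sketch below assumes a plain, case-sensitive Levenshtein distance in which “S” and “s” count as different characters, as in the explanation above. The helper name `levenshtein` is illustrative, and the actual speller may treat capitalization differently.

```python
from functools import lru_cache


def levenshtein(source: str, target: str) -> int:
    """Case-sensitive edit distance: insert, delete, substitute."""

    @lru_cache(maxsize=None)
    def dist(i: int, j: int) -> int:
        # Distance between the first i chars of source and first j of target.
        if i == 0:
            return j
        if j == 0:
            return i
        cost = 0 if source[i - 1] == target[j - 1] else 1
        return min(dist(i - 1, j) + 1,         # deletion
                   dist(i, j - 1) + 1,         # insertion
                   dist(i - 1, j - 1) + cost)  # substitution (or match)

    return dist(len(source), len(target))


typed = "Mysplace"
for candidate in ("Misplace", "MySpace"):
    print(candidate, levenshtein(candidate, typed))
```

Misplace comes out one manipulation away from Mysplace, and MySpace two, which matches the order of the suggestions.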
Note that MySpace is listed as the first suggestion when you type Mispace in Office 2007, and not as the second one, as suggested by Ben’s reader (see screenshot below):
Of course, it will always be up to the writer to decide whether they really meant MySpace or something else. In any case, I don’t think this is a “Cupertino” issue, since there is no automatic replacement (as you know, the Cupertino issue affected the Word speller in the 1997 version, over 10 years ago, and was due to the AutoReplace function – many things have changed since then and the Office proofing tools have improved a lot, for instance with the introduction in Office 2007 of a contextual speller). The speller does its job when it flags the mistake in Mispace and also does its job when it suggests the most likely corrections. I would argue that if the user unfortunately clicks on “Misplace” in this list when they meant “MySpace” and had written “Mispace”, the tool cannot really be blamed, can it? ;-).
[guest post by Thierry Fontenelle, Microsoft Natural Language Group]
Posted by Benjamin Zimmer at January 29, 2008 09:53 AM