September 22, 2007

Automatic hyphenation

Having just touched on issues of hyphenation, I'm reminded that I should do a follow-up (note: usually this is follow-up for me, but sometimes followup; so sue me) on automatic hyphenation, a topic that I posted on at the end of June.  I relayed some mis-hyphenations resulting from early attempts to eliminate proofreaders in favor of hyphenation programs: kneep-ants, co-aches, and, in a manual for a program that was supposed to make proofreaders obsolete, pro-ofreaders.  Now it turns out that there's a name for such things.

I also commented:

In any case, pro-ofreaders were clearly not obsolete then.  Nor are they now.  Though brute-force methods -- really really big dictionaries with possible hyphenations specified -- can improve things considerably, and undoubtedly have.

But it turns out that there's an excellent hyphenation program that abstains from simple brute force.

First, the term for mis-hyphenations.  This is the wonderful mishy-phens, devised by Donna Richoux and reported on the newsgroup alt.usage.english.  In 2004 Richoux posted an entertaining list of all the mishy-phens she'd collected since 2000.  (Hat tips to Ben Zimmer, who posted on the topic on ADS-L in 2004; and to Aaron Dinkin.)  Meanwhile, Mark Mandel wrote to say that in 2000 he wrote a song, "Editors' Waltz", with examples of all sorts of things that can go wrong in manuscripts, including the mishy-phen moong-low from the NYT that year.

Then, the excellent hyphenation program.  Chris Lance and Jed Davis both wrote me about the TeX algorithm that was developed as a Stanford Ph.D. project by Frank Liang in 1983, under the direction of TeXman Donald Knuth.  (This is embarrassing, because Knuth is of course a colleague of mine at Stanford.  In my defense, I'm not a computer scientist, nor even a TeX person.)  The TeX hyphenation dictionary contains 4447 patterns that the algorithm uses, many fewer than the huge list of whole words that was used in developing it, so it's far from a brute-force scheme.  Both Lance and Davis report that the performance of the algorithm is very good.
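For readers curious how a few thousand patterns can stand in for a giant word list, here is a minimal Python sketch of pattern-based hyphenation in the general style of Liang's scheme. It is an illustration under stated assumptions, not the TeX implementation: the nine patterns are a tiny hand-picked set rather than the real 4447, and the names `PATTERNS`, `parse_pattern`, `hyphenate`, `min_left`, and `min_right` are invented for this example. Each pattern is a letter string with digits between letters; every pattern that matches somewhere in the word contributes its digits, the highest digit at each gap wins, and a break is allowed only where the winning digit is odd.

```python
# Illustrative patterns only: an assumed toy subset, not the full TeX pattern set.
PATTERNS = ["hy3ph", "he2n", "hena4", "hen5at", "1na", "n2at", "1tio", "2io", "o2n"]


def parse_pattern(pattern):
    """Split e.g. 'hy3ph' into letters 'hyph' and gap scores [0, 0, 3, 0, 0]."""
    letters = ""
    scores = [0]  # scores[i] is the value of the gap before letters[i]
    for ch in pattern:
        if ch.isdigit():
            scores[-1] = int(ch)
        else:
            letters += ch
            scores.append(0)
    return letters, scores


def hyphenate(word, patterns=PATTERNS, min_left=2, min_right=3):
    """Return word with hyphens at every gap whose winning score is odd.

    min_left and min_right (hypothetical parameter names) keep breaks away
    from the edges of the word.
    """
    table = dict(parse_pattern(p) for p in patterns)
    padded = "." + word.lower() + "."  # dots mark the word boundaries
    scores = [0] * (len(padded) + 1)   # scores[i] is the gap before padded[i]

    # Try every substring; where a pattern matches, keep the larger score per gap.
    for start in range(len(padded)):
        for end in range(start + 1, len(padded) + 1):
            piece_scores = table.get(padded[start:end])
            if piece_scores:
                for offset, value in enumerate(piece_scores):
                    pos = start + offset
                    scores[pos] = max(scores[pos], value)

    # The gap before word[k] is scores[k + 1] because of the leading dot.
    pieces = []
    last = 0
    for k in range(min_left, len(word) - min_right + 1):
        if scores[k + 1] % 2 == 1:  # odd score: a break is allowed here
            pieces.append(word[last:k])
            last = k
    pieces.append(word[last:])
    return "-".join(pieces)


print(hyphenate("hyphenation"))  # prints hy-phen-ation
```

With only these nine patterns the sketch handles little beyond its own example; a word that matches no pattern simply comes back unbroken. The even digits are what make the approach compact: they let a longer pattern veto a break that a shorter, more general pattern would otherwise permit, so exceptions need not be listed word by word.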

There's still plenty of work for proofreaders, of course.

zwicky at-sign csli period stanford period edu

Posted by Arnold Zwicky at September 22, 2007 01:31 PM