Automatic hyphenation
Having just touched on issues of hyphenation, I'm reminded that I
should do a follow-up (note: usually this is follow-up for me, but
sometimes followup; so sue me) on automatic hyphenation, a topic that
I posted on at the end of June. I relayed some mis-hyphenations
resulting from early attempts to eliminate proofreaders in favor of
hyphenation programs: kneep-ants, co-aches, and, in a manual for
a program that was supposed to make proofreaders obsolete,
pro-ofreaders. Now it turns out that there's a name for such things.
I also commented:
"In any case, pro-ofreaders were clearly
not obsolete then. Nor are they now. Though brute-force
methods -- really really big dictionaries with possible hyphenations
specified -- can improve things considerably, and undoubtedly have."
But it turns out that there's an excellent hyphenation program that
abstains from simple brute force.
First, the term for mis-hyphenations. This is the wonderful
mishy-phens, devised by Donna Richoux and reported on the newsgroup
alt.usage.english. In 2004 Richoux posted an entertaining list of all
the mishy-phens she'd collected since 2000. (Hat tips to Ben Zimmer,
who posted on the topic on ADS-L in 2004; and to Aaron Dinkin.)
Meanwhile, Mark Mandel wrote to say that in 2000 he wrote a song,
"Editors' Waltz", with examples of all sorts of things that can go
wrong in manuscripts, including the mishy-phen moong-low from the
NYT that year.
Then, the excellent hyphenation program. Chris Lance and Jed
Davis both wrote me about the TeX algorithm that was developed as a
Stanford Ph.D. project by Frank Liang in 1983, under the direction of
TeXman Donald
Knuth. (This is embarrassing, because Knuth is of course a
colleague of mine at Stanford. In my defense, I'm not a computer
scientist, nor even a TeX person.) The TeX hyphenation dictionary
contains 4447 patterns that the algorithm uses, a much smaller number
than the huge number of entire words that were used in developing it,
so it's far from a brute-force scheme. Both Lance and Davis
report that the performance of the algorithm is very good.
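To make the pattern idea concrete, here is a minimal sketch in Python of
how pattern-based hyphenation of this kind can work. It is not the TeX
implementation. In a pattern, a digit between letters scores that
position: where several patterns overlap the highest digit wins, and an
odd final score allows a break. The function names (parse_pattern,
hyphenate), the parameters left_min and right_min, and the nine-pattern
PATTERNS list are assumptions made for illustration; the list is written
in the style of the TeX pattern file and is nowhere near the full
dictionary, and the exception list that real systems also consult is
left out.

```python
def parse_pattern(pattern):
    """Split a pattern such as 'hy3ph' into its letters and its scores.

    Returns (letters, points), where points has one more entry than
    letters: points[j] is the score for the position before letter j.
    """
    letters = []
    points = [0]
    for ch in pattern:
        if ch.isdigit():
            points[-1] = int(ch)
        else:
            letters.append(ch)
            points.append(0)
    return "".join(letters), points


def hyphenate(word, patterns, left_min=2, right_min=3):
    """Return the pieces of word between permitted break points.

    left_min and right_min are the fewest letters allowed before the
    first break and after the last one (assumed defaults for this sketch).
    """
    table = dict(parse_pattern(p) for p in patterns)
    padded = "." + word.lower() + "."  # '.' marks the word boundaries
    scores = [0] * (len(padded) + 1)   # scores[i]: position before padded[i]

    # Try every substring; wherever a pattern matches, keep the highest score.
    for start in range(len(padded)):
        for end in range(start + 1, len(padded) + 1):
            points = table.get(padded[start:end])
            if points:
                for offset, value in enumerate(points):
                    pos = start + offset
                    scores[pos] = max(scores[pos], value)

    # An odd score permits a break; an even score forbids one.
    pieces = []
    last = 0
    for i in range(left_min, len(word) - right_min + 1):
        if scores[i + 1] % 2 == 1:  # position before word[i]
            pieces.append(word[last:i])
            last = i
    pieces.append(word[last:])
    return pieces


# Illustrative patterns only (an assumption, not the real pattern file).
PATTERNS = ["hy3ph", "he2n", "hena4", "hen5at", "1na", "n2at",
            "1tio", "2io", "o2n"]

print("-".join(hyphenate("hyphenation", PATTERNS)))  # hy-phen-ation
```

The even scores do the interesting work here: 1na on its own would allow
hyphe-nation, but he2n outranks it at that position, while hen5at wins
one position later and yields phen-ation. A few thousand short patterns
competing in this way can stand in for a very large word list.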
There's still plenty of work for proofreaders, of course.
zwicky at-sign csli period stanford period edu
Posted by Arnold Zwicky at September 22, 2007 01:31 PM