Christopher Farah's piece on statistical machine translation appeared in today's NYT Circuits section, under the headline "From Uzbek to Klingon, the machine cracks the code".
The story is timely and generally positive, but it's curiously lacking in translation examples, good or bad. And very strangely, it doesn't mention DARPA.
The story features Kevin Knight and David Yarowsky, whose contributions are well worth citing. It mentions the IBM "Bleu" metric, but it doesn't name any of the people (notably Salim Roukos) responsible for inventing and popularizing it. It also doesn't mention the role of George Doddington and others at NIST in developing the metric and applying it in the current DARPA TIDES research that funds both Knight and Yarowsky.
The article is rather short, and although MT is a topic that lends itself to vivid exemplification, the story doesn't have even one example of a translated passage or even a translated phrase. (See my earlier post for a short but I think compelling translation sample.) It's possible that this is the fault of the NYT's editors rather than the writer. The piece reads like it was edited by deleting large sections, and perhaps that's what happened.
The writer doesn't seem to entirely understand the nature of the n-gram-based MT evaluation metric, since he quotes Ophir Frieder, an information retrieval researcher at IIT, to the effect that n-grams "work poorly" as a search technique and are "a basic novice solution," which (true or false) is not directly relevant to MT evaluation.
This quote seems to be misleading at best, since state-of-the-art IR methods are still based on lexical statistics in one way or another, including for example the methods in this recent paper of Frieder's. I wonder if Frieder understood the point of Farah's questions, or if his answers were accurately presented in the NYT article.
In any case, there are plenty of potential issues with n-gram-based MT metrics -- one can construct good translations that score badly, and horrible translations that score very well. These issues (at least the bad translations that score well) may become a problem within a couple of years, as MT algorithms hill-climb on these Bleu-style metrics. MT output is already a lot better than it was pre-Bleu, and it will certainly keep improving, but the apparent quality may hide serious distortions of content due to errors that don't affect Bleu scores much. However, asking an IR researcher about the value of n-grams in text search is not a good way to help people understand this.
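For readers curious about what an n-gram match score actually computes, here's a toy sketch (in Python, with made-up example sentences) of the clipped n-gram precision at the core of Bleu-style metrics. The real metric combines several n-gram orders and adds a brevity penalty, but even this much is enough to show how a translation that reverses the meaning of a sentence, by dropping a "not", can still score very well:

```python
from collections import Counter

def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, references, n):
    """Clipped n-gram precision: each candidate n-gram counts only up to
    the maximum number of times it appears in any single reference."""
    cand_counts = Counter(ngrams(candidate, n))
    if not cand_counts:
        return 0.0
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    clipped = sum(min(count, max_ref_counts[gram])
                  for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())

# An invented example: the candidate drops "not", reversing the meaning,
# yet still shares almost all of its n-grams with the reference.
reference = ["the", "rebels", "did", "not", "attack", "the", "capital"]
candidate = ["the", "rebels", "did", "attack", "the", "capital"]
print(modified_precision(candidate, [reference], 1))  # 1.0 -- every unigram matches
print(modified_precision(candidate, [reference], 2))  # 0.8 -- 4 of 5 bigrams match
```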
Farah's article alludes to the concept of statistical MT, but doesn't explain much about how it works or how it is different from other approaches. It doesn't explain the key role of parallel text resources for training statistical MT, or the nature of the multiple-translation corpora used for evaluation. For an example of an interesting attempt to give lay readers a sense of what is happening, see this press release from ISI.
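To give a concrete (if entirely artificial) sense of what training on parallel text means, here's a minimal sketch, in Python, of the kind of word-translation probability estimation that statistical MT systems bootstrap from, in the spirit of the early IBM word-alignment models. The tiny corpus, the words, and the variable names are all made up for illustration:

```python
from collections import defaultdict

# A made-up parallel corpus of (foreign sentence, English sentence) pairs.
corpus = [
    (["la", "maison"],          ["the", "house"]),
    (["la", "maison", "bleue"], ["the", "blue", "house"]),
    (["la", "fleur"],           ["the", "flower"]),
]

f_vocab = {f for fs, _ in corpus for f in fs}
e_vocab = {e for _, es in corpus for e in es}

# Start with uniform translation probabilities t(e | f).
t = {f: {e: 1.0 / len(e_vocab) for e in e_vocab} for f in f_vocab}

for iteration in range(10):
    count = defaultdict(lambda: defaultdict(float))
    total = defaultdict(float)
    for fs, es in corpus:
        for e in es:
            # Expectation step: split each English word's "credit"
            # among the foreign words in the same sentence pair.
            norm = sum(t[f][e] for f in fs)
            for f in fs:
                frac = t[f][e] / norm
                count[f][e] += frac
                total[f] += frac
    # Maximization step: re-estimate t(e | f) from the fractional counts.
    for f in f_vocab:
        for e in e_vocab:
            t[f][e] = count[f][e] / total[f]

# After a few iterations, "house" dominates the distribution for "maison",
# even though no word-level alignments were ever given.
print(sorted(t["maison"].items(), key=lambda kv: -kv[1])[:2])
```

The point of the sketch is simply that co-occurrence statistics over sentence-aligned bilingual text, iterated a few times, are enough to pull word correspondences out of nothing but raw parallel data; the multiple-translation corpora used for evaluation play a different role, supplying several human references against which n-gram overlap is scored.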
But the really odd thing about the NY Times article is that it doesn't mention DARPA!
The research cited -- including the 1999 JHU summer workshop featured in the article's lead paragraph -- has all been funded and/or organized by DARPA, which since 1985 has been the main source of U.S. Government funding for speech and language technology research. At present, DARPA's Human Language Technology program is part of the Information Awareness Office, under John Poindexter, which has been getting rather bad press recently. Why not give them credit for the good stuff as well?

Full disclosure: I've been involved in DARPA-funded HLT research for over a decade. The Linguistic Data Consortium, which I direct, provides training and testing materials for the DARPA TIDES program, including TIDES MT research. I spoke on the phone with Christopher Farah while he was researching his story, and helped hook him up with Kevin Knight.
Posted by Mark Liberman at July 31, 2003 07:10 AM