Language Log: July 2003 Archives

July 31, 2003

"AI is brain dead."

According to Marvin Minsky, AI is brain dead.

Fernando Pereira concurs.

Minsky and Chomsky famously disagreed about the prospects for AI, but I don't think this means that Chomsky wins the argument.

At one level, perhaps Chomsky did win. Minsky argued that to study human language, you should combine models of language structure with models of meaning and common-sense reasoning, and that all this is accessible through the methods of classical AI. Chomsky argued that meaning is among the "mysteries" that "lie beyond the reach of the form of human inquiry that we call 'science'", while language structure is a "problem" where science can make progress.

In the 1960s, Minsky convinced a lot of people to follow his program. And 30 or 40 years later, the project is dead in the water, as he admits. So perhaps Chomsky was right about the "mysteries" business. (Though it's not obvious to outsiders that Chomsky's own theories are notably more successful than they were 30 years ago.)

But Pereira makes a different objection:

Coding up a tangle of "common sense knowledge" is useless if the terms of that knowledge are not endowed with meaning by their causal connection to perception and action. The grand challenge is how meaning emerges from a combination of genetically wired circuitry and learning.

On that view, Minsky and Chomsky were both wrong, and in the same way. They share the belief that activities of the mind can (and should) be understood in terms of the manipulation of formulae that are not essentially grounded in perception and action. Pereira, along with many others, suggests that this belief (which he himself once held or at least acted on) is fatally mistaken.

This debate -- whose roots go back through Descartes and Locke -- isn't over yet. The neo-Lockeans have some neat initial results, just as Minsky and Chomsky did circa 1970. But most of the problems (or mysteries?) remain unsolved.

Posted by Mark Liberman at 07:25 PM

NYT story on DARPA MT... doesn't mention DARPA!

Christopher Farah's piece on statistical machine translation appeared in today's NYT Circuits section, under the headline "From Uzbek to Klingon, the machine cracks the code".

The story is timely and generally positive, but it's curiously lacking in translation examples, good or bad. And very strangely, it doesn't mention DARPA.

The story features Kevin Knight and David Yarowsky, whose contributions are well worth citing. It mentions the IBM "Bleu" metric, but it doesn't name any of the people (notably Salim Roukos) responsible for inventing and popularizing it. It also doesn't mention the role of George Doddington and others at NIST in developing the metric and applying it in the current DARPA TIDES research that funds both Knight and Yarowsky.

The article is rather short, and although MT is a topic that lends itself to vivid exemplification, the story doesn't have even one example of a translated passage or even a translated phrase. (See my earlier post for a short but I think compelling translation sample.) It's possible that this is the fault of the NYT's editors rather than the writer. The piece reads like it was edited by deleting large sections, and perhaps that's what happened.

The writer doesn't seem entirely to understand the nature of the n-gram-based MT evaluation metric, since he quotes Ophir Frieder, an information retrieval researcher at IIT, to the effect that n-grams "work poorly" as a search technique and are "a basic novice solution," which (true or false) is not directly relevant to MT evaluation.

This quote seems to be misleading at best, since state-of-the-art IR methods are still based on lexical statistics in one way or another, including for example the methods in this recent paper of Frieder's. I wonder if Frieder understood the point of Farah's questions, or if his answers were accurately presented in the NYT article.

In any case, there are plenty of potential issues with n-gram-based MT metrics -- one can construct good translations that score badly, and horrible translations that score very well. These issues (at least the bad translations that score well) may become a problem within a couple of years, as MT algorithms hill-climb on these Bleu-style metrics. MT output is already a lot better than it was pre-Bleu, and it will certainly keep improving, but the apparent quality may hide serious distortions of content due to errors that don't affect Bleu scores much. However, asking an IR researcher about the value of n-grams in text search is not a good way to help people understand this.
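To illustrate how a horrible translation can score well while a good one scores badly, here is a minimal Python sketch. It is not Bleu itself, only a single-reference n-gram precision, and the function names, the whitespace tokenization, and all three example sentences are invented for illustration. The "wrong" candidate drops a negation and so reverses the meaning, yet nearly all of its n-grams appear in the reference; the paraphrase preserves the meaning but shares few n-grams with it.

```python
from collections import Counter

def ngram_counts(text, n):
    """Count the n-grams (as tuples) in a lowercased, whitespace-split text."""
    tokens = text.lower().split()
    return Counter(zip(*(tokens[i:] for i in range(n))))

def ngram_precision(candidate, reference, n):
    """Fraction of the candidate's n-grams that also occur in the reference."""
    cand = ngram_counts(candidate, n)
    ref = ngram_counts(reference, n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    # Counter intersection keeps the minimum count of each shared n-gram.
    return sum((cand & ref).values()) / total

# Hypothetical sentences, invented for this sketch.
REFERENCE = "the company will not resume its flights to libya tomorrow"
WRONG = "the company will resume its flights to libya tomorrow"   # negation lost
PARAPHRASE = "the airline won't restart service to libya tomorrow"  # same meaning

for name, sentence in [("wrong", WRONG), ("paraphrase", PARAPHRASE)]:
    scores = [round(ngram_precision(sentence, REFERENCE, n), 3) for n in (1, 2)]
    print(name, scores)
```

In this toy case the meaning-reversing candidate gets a perfect unigram score and loses only one bigram, while the faithful paraphrase matches half of its unigrams and fewer of its bigrams. Multiple reference translations soften the second problem but do nothing about the first.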

Farah's article alludes to the concept of statistical MT, but doesn't explain much about how it works or how it is different from other approaches. It doesn't explain the key role of parallel text resources for training statistical MT, or the nature of the multiple-translation corpora used for evaluation. For an example of an interesting attempt to give lay readers a sense of what is happening, see this press release from ISI.

But the really odd thing about the NY Times article is that it doesn't mention DARPA!

The research cited -- including the 1999 JHU summer workshop featured in the article's lead paragraph -- has all been funded and/or organized by DARPA, which since 1985 has been the main source of U.S. Government funding for speech and language technology research. At present, DARPA's Human Language Technology program is part of the Information Awareness Office, under John Poindexter, which has been getting rather bad press recently. Why not give them credit for the good stuff as well?

Full disclosure: I've been involved in DARPA-funded HLT research for over a decade. The Linguistic Data Consortium, which I direct, provides training and testing materials for the DARPA TIDES program, including TIDES MT research. I spoke on the phone with Christopher Farah while he was researching his story, and helped hook him up with Kevin Knight.

Posted by Mark Liberman at 07:10 AM

July 30, 2003

The value of evaluation

This is a story with a moral. It shows that a simple, cheap, quantitative measure of quality -- even one that is obviously flawed -- and a commitment to improving performance on that measure -- even over a relatively short time -- lead to improvement. Real improvement, not just improvement in terms of the flawed metric.

About two years ago, Salim Roukos and others at IBM suggested a remarkably simple method for evaluating translation quality: just count the number of words and word sequences ("n-grams") in common between a translation to be tested and a set of reference translations. They named this metric "Bleu", and they showed that despite its obvious flaws, it correlates well with human evaluations, not only for (generally poor) automatic translations, but even for human translations of varying quality.
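To make the counting concrete, here is a minimal Python sketch of the n-gram overlap idea. It is not the actual Bleu or NIST scoring code: it computes only a clipped n-gram precision against a set of references, and leaves out the other ingredients of the full metrics, such as the length penalty and the combination of several n-gram orders. The function names, the whitespace tokenization, and the lowercasing are assumptions made for illustration. The three headlines are the 2002, 2003 and human versions quoted in the samples later in this post.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def clipped_precision(candidate, references, n):
    """Fraction of candidate n-grams that also occur in some reference.

    Each candidate n-gram is credited at most as many times as it appears
    in the single reference where it is most frequent.
    """
    cand_counts = ngrams(candidate.lower().split(), n)
    if not cand_counts:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref.lower().split(), n).items():
            max_ref[gram] = max(max_ref[gram], count)
    matched = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return matched / sum(cand_counts.values())

# Headlines from the translation samples in this post.
HEADLINE_2002 = "insistent Wednesday may recurred her trips to Libya tomorrow for flying"
HEADLINE_2003 = "Egyptair Has Tomorrow to Resume Its Flights to Libya"
HEADLINE_HUMAN = "Egypt Air May Resume its Flights to Libya Tomorrow"

for label, headline in [("2002", HEADLINE_2002), ("2003", HEADLINE_2003)]:
    for n in (1, 2):
        score = clipped_precision(headline, [HEADLINE_HUMAN], n)
        print(label, f"{n}-gram precision: {score:.3f}")
```

With only one reference and one short headline the numbers mean little, but even here the 2003 headline scores higher than the 2002 one on both single words and word pairs. A real evaluation pools counts over a whole test set and several reference translations.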

DARPA researchers quickly adopted (a version of) this metric for TIDES MT research, as described in this NIST report.

As predicted by those who believe in the value of quantitative evaluation for "language engineering", the result has been an extraordinary improvement in the quality of machine translation. In the 2002 TIDES MT evaluation, the best research system for Arabic-to-English translation scored at 51% of human translation performance as measured by the NIST metric, while the best commercial system scored 57%. In the recent 2003 evaluation, the best research system scored 89%, while the best commercial system was at 58%.

The improvement can be seen in qualitative terms by reading some samples:

2002 System:
insistent Wednesday may recurred her trips to Libya tomorrow for flying

Cairo 6-4 ( AFP ) - an official announced today in the Egyptian lines company for flying Tuesday is a company " insistent for flying " may resumed a consideration of a day Wednesday tomorrow her trips to Libya of Security Council decision trace international the imposed ban comment .

And said the official " the institution sent a speech to Ministry of Foreign Affairs of lifting on Libya air , a situation her receiving replying are so a trip will pull to Libya a morning Wednesday " .

2003 System:
Egyptair Has Tomorrow to Resume Its Flights to Libya

Cairo 4-6 (AFP) - said an official at the Egyptian Aviation Company today that the company egyptair may resume as of tomorrow, Wednesday its flights to Libya after the International Security Council resolution to the suspension of the embargo imposed on Libya.

" The official said that the company had sent a letter to the Ministry of Foreign Affairs, information on the lifting of the air embargo on Libya, where it had received a response, the first take off a trip to Libya on Wednesday morning ".

Human Translation:
Egypt Air May Resume its Flights to Libya Tomorrow

Cairo, April 6 (AFP) - An Egypt Air official announced, on Tuesday, that Egypt Air will resume its flights to Libya as of tomorrow, Wednesday, after the UN Security Council had announced the suspension of the embargo imposed on Libya.

The official said that, "the company sent a letter to the Ministry of Foreign Affairs to inquire about the lifting of the air embargo on Libya, and in the event that it receives a response, then the first flight to Libya, will take off, Wednesday morning."

Posted by Mark Liberman at 08:51 AM

July 28, 2003

"It looks like it's his own tongue"

The Austrian recipient of a tongue transplant is said to be doing well:

"The tongue now has a completely normal color. When you look inside his mouth, it looks like it's his own tongue. The transplant has blood flowing to it, but there is a risk of rejection with every transplant operation," Ewers said.

He said that if the patient's body accepts the tongue and it heals properly, the man should be able to speak and swallow, but most likely he will not have a sense of taste.

Posted by Mark Liberman at 10:10 AM