February 02, 2006

Tong-maker the Kong-maker, and other translational follies

I recently came across a press release about an online English-Malay translation tool that promises "real-time translation and searching of the whole Internet in Malay." The Malaysia-based company, Linguamatix, claims that their product can translate between English and Malay at a rate of 500,000 words per minute, compared to the mere 5,000 words per minute achieved by commercial translation systems for other languages. The company's goal is to allow the Malay-speaking public to surf the Web and read all English-language webpages instantaneously translated into their native language. Linguamatix is also planning to apply its high-speed translation engine, LinguaBASE, to various other language pairs.

The company is currently offering its translation service, known as LinguaWeb, in a free online trial version. According to the press release, the trial version has been made publicly available "for a limited period while Linguamatix assesses its current capacity to continue providing such services." I took LinguaWeb out for a spin, and it looks like they're still working out the kinks. One can surf the Web in either English or translated Malay by entering a search term, but the Malay option frequently returns timeout errors. However, as with other translation tools like AltaVista's Babel Fish and Google Translate, one can also supply a URL and get a translated version of the page in return (either from English to Malay or Malay to English). That feature works quickly and cleanly, and the resulting translations seem roughly on par with Babel Fish et al. in terms of accuracy. But like other automatic translators, LinguaWeb has some peculiar stumbling blocks.

The first page that I chose to translate was my Language Log entry "Nias, Komodo, and 'Kong'," about Indonesian connections in the original 1933 version of King Kong, directed by Merrian Cooper. The output from LinguaWeb shows how much trouble most translation tools have with any snippet of text that is even slightly idiomatic or noncompositional. The first sentence of the entry begins:

I have yet to find three hours to devote to Peter Jackson's remake of King Kong...

And here is the Malay version from LinguaWeb, with my item-by-item gloss:

Saya tetapi ada untuk mencari tiga jam untuk menumpukan untuk buat semula Peter Jackson bagi Raja Kong

Saya                 I
tetapi               but/yet
ada                  is/have
untuk                for/to
mencari              look for
tiga                 three
jam                  hour(s)
untuk                for/to
menumpukan           devote
untuk                for/to
buat semula          remake (v.)
Peter Jackson bagi   Peter Jackson for
Raja                 King
Kong                 Kong

Needless to say, the Malay version makes only the vaguest sense when the sentence is strung together (I'd gloss it back into English as, "I but have for looking for three hours for devoting for remaking Peter Jackson for King Kong"). As LinguaWeb's press release acknowledges, the translation tool is mostly useful for "gisting purposes." (See the Wikipedia entry on machine translation, or MT, for more on "gisting.") Not that other automatic translators do a better job with such tasks: hence the bizarre results that quickly emerge from serial Babelfishing, or even from a single translational cycle of something particularly idiomatic, such as the English-to-Italian-to-English output from "Rapper's Delight." (Here's how my "Kong" post looks translated back into English.)

Finding humor in automatic translation is nothing new, by the way. Take this old MT urban legend, as it appeared in Art Buchwald's syndicated column of July 2, 1959 (Buchwald is discussing the International Conference on Information Processing):

At the beginning of the conference one of the lecturers was describing a machine which translates English into Russian. The first phrase put through the machine was, "The spirit is willing, but the flesh is weak." But the Russian equivalent which came out read: "The whisky is fair, but the meat is foul."

As it turns out, that joke is actually more than a century old, long predating the advent of MT. Here's the earliest version I've found in the newspaper databases:

Decatur (Ill.) Herald, Jan 20, 1903, p. 5
A student at Berkeley contributes the following: Many ludicrous mistakes are made by foreigners in grasping the meaning of some of our common English expressions. A young German attending the state university translated "The spirit is willing, but the flesh is weak" into "The ghost is willing, but the meat is not able." And a Filipino youth fairly set the class in an uproar by the statement that "Out of sight, out of mind" meant "The invisible is insane."

(The "out of sight, out of mind" line has also frequently been hauled out for the MT age, with the purported mistranslation sometimes appearing as "invisible idiot" or "blind idiot.") But beyond repeating stale jokes about the hazards of translating idioms literally, I'm curious about another common problem with automatic translators: the (mis)recognition of proper names.

When I was first skimming through the Malay translation of my "Kong" post, I noticed the collocation "Pembuat Tong" in places referring to the film's director, Merrian Cooper. Since pembuat means 'maker' in Malay, my first thought was, "That's strange... Did I write 'the maker of Kong' and it's coming out as 'the maker of Tong'?" Then it dawned on me that tong is Malay for 'barrel(s),' and LinguaWeb had translated Cooper's name according to the literal (but rare) meaning of cooper: 'one who makes or repairs casks or barrels.' Similarly, a mention of Mark Liberman in another post comes out as "tanda Liberman," or '(the) mark or sign (of) Liberman.'

Again, LinguaWeb is no worse than other translation tools in this department. Here's how Babel Fish does with these two names, regardless of the context of their appearance:


Language      Merrian Cooper                  Mark Liberman
Dutch         Kuiper Merrian                  Teken Liberman
German        Merrian Faßbinder               Markierung Liberman
French        tonnelier de Merrian            marque Liberman
Spanish       fabricante de vinos de Merrian  marca Liberman
Portuguese    cooper de Merrian               marca Liberman
Italian       cooper di Merrian               contrassegno Liberman
Russian       бондарем Merrian                меткой Liberman
Chinese-simp  Merrian 木桶匠                  标记Liberman
Chinese-trad  Merrian 木桶匠                  標記Liberman
Japanese      Merrian のたる製造人            印Liberman
Korean        Merrian술장수                   표Liberman

Babel Fish managed to recognize "Mark Liberman" as a proper name in only one language, Greek. But the Greek translation for "Merrian Cooper" came out as "βαρελοποιός Merrian." That first word is given as a Greek term for cooper by Answers.com, but when I feed it back into Babel Fish I'm told that it means "gravimeter," bafflingly enough.

Clearly Babel Fish and LinguaWeb are working from lexicons where mark means 'sign' and cooper means 'barrel-maker,' without any information about common given names like Mark or common surnames like Cooper. But how hard would it be to include a heuristic that guesses whether a given collocation is a proper name, especially when it is in the form Given Name + Surname? I would think in many cases there would be dead giveaways — little things like capitalization, or the relative frequency of the barrel-making cooper vs. the surname Cooper in contemporary English usage. But perhaps I'm expecting too much of tools intended merely for "gisting purposes."
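
To make that suggestion concrete, here is a minimal sketch in Python of the kind of heuristic I have in mind. It is not how LinguaWeb or Babel Fish actually works. The given-name list and the usage counts are invented for illustration, and only the two Malay renderings in the toy lexicon come from the LinguaWeb output quoted above. The sketch protects a Given Name + Surname pair when a known given name is followed by another capitalized word, and it protects a lone capitalized word in mid-sentence when it is (by the made-up counts) more often a name than a common noun.

```python
# Toy lexicon; the two Malay renderings are the ones quoted in the post.
LEXICON = {"cooper": "pembuat tong", "mark": "tanda"}

# Hypothetical given-name list (illustrative only, not a real resource).
GIVEN_NAMES = {"mark", "merrian", "peter"}

# Made-up counts of (uses as a name, uses as a common noun); not corpus data.
USAGE_COUNTS = {"cooper": (950, 50), "mark": (400, 600)}


def protected_indices(tokens):
    """Return the indices of tokens guessed to belong to a proper name."""
    protected = set()
    for i, token in enumerate(tokens):
        if not token[:1].isupper():
            continue  # lowercase tokens are never treated as names here
        nxt = tokens[i + 1] if i + 1 < len(tokens) else ""
        # Heuristic 1: a known given name followed by another capitalized token.
        if token.lower() in GIVEN_NAMES and nxt[:1].isupper():
            protected.update({i, i + 1})
            continue
        # Heuristic 2: a capitalized token that is not sentence-initial and is
        # used as a name more often than as a common noun.
        name_uses, noun_uses = USAGE_COUNTS.get(token.lower(), (0, 0))
        if i > 0 and name_uses > noun_uses:
            protected.add(i)
    return protected


def translate(sentence, guess_names=True):
    """Word-for-word lookup that leaves guessed proper names untouched."""
    tokens = sentence.split()
    keep = protected_indices(tokens) if guess_names else set()
    return " ".join(
        tok if i in keep else LEXICON.get(tok.lower(), tok)
        for i, tok in enumerate(tokens)
    )


print(translate("directed by Merrian Cooper", guess_names=False))
# directed by Merrian pembuat tong
print(translate("directed by Merrian Cooper"))
# directed by Merrian Cooper
print(translate("Mark the page"))
# tanda the page
```

Sentence-initial capitals are deliberately ignored by the second heuristic, since every sentence starts with one. The sketch also splits on whitespace only, so a name followed directly by punctuation would slip through. A real system would need proper tokenization and real frequency data.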

Posted by Benjamin Zimmer at February 2, 2006 05:00 PM