January 24, 2007

Tired food

I was puzzled about why Babel Fish translated the Dutch phrase "Harm Beukers, hoogleraar geschiedenis van de geneeskunde" ("Harm Beukers, professor of the history of medicine") as "Harm tired cherry, hoogleraar history of medicine". So was Ruud Visser, and so he (and others commenting on his blog) investigated and solved the problem.

It seems that the Fish is confident enough to split Beukers into

beu (ik ben het beu) = tired
kers = cherry

which (I gather) Dutch speakers do not find to be a plausible decomposition. On the other hand, hoogleraar (= "professor") is not in the Fish's lexicon, although it's quite a common Dutch word, with 2.8M Google hits; and the Fish is also unwilling to split it into the parts hoog (= "high") and leraar (= "teacher"), though these are commoner in Dutch than beu and kers.

Ruud observes that

The “tired cherry” pattern also holds for other fruits, including those with more than one syllable: beupeer (pear), beuappel (apple), beubanaan (banana), beumandarijn (mandarin) and even beusinaasappel (orange) are all translated as tired X. Don’t like fruits? Babel Fish provides tired vegetables as well, like beusla (lettuce) and beuwortel (carrot). That goes with a beubiefstuk (steak) and some beuaardappelen (potatoes); beupatat (fries/chips) is not on the menu, unfortunately. All of this is served by be(a)utiful, though somewhat weary, beumannen (men) and beuvrouwen (women) in your local beurestaurant.

The fact that hoogleraar is missing, and that Beukers is (unwisely) split while hoogleraar is not, suggests that the Babel Fish Dutch/English system was not constructed with adequate attention to lexeme frequency, not even in the obvious first-order sense of checking the translation dictionary against a frequency-ordered list of word forms.

A Dutch-language news search at news.google.nl gives these counts:

hoogleraar     844
hoog         5,977
leraar         364
beu            239
kers            51

I mean, even without going all the way to a statistical MT system (which would require more bilingual text than might easily be available), at least you could make common-sense use of first-order word frequencies in populating your lexicon. Digital texts and word-frequency lists in Dutch have been available for a long time.
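To make the "first-order" idea concrete, here is a minimal sketch in Python. It is not how Babel Fish works (I have no idea what its internals look like). It only illustrates the two checks suggested above, using the news counts quoted in this post. The toy lexicon, the function names, and the threshold of 100 are all assumptions made up for the illustration.

```python
# News-search counts quoted in this post (news.google.nl, January 2007).
NEWS_COUNTS = {
    "hoogleraar": 844,
    "hoog": 5977,
    "leraar": 364,
    "beu": 239,
    "kers": 51,
}

# Hypothetical toy lexicon, mimicking the behavior described above:
# the parts are present, but "hoogleraar" itself is missing.
TOY_LEXICON = {
    "hoog": "high",
    "leraar": "teacher",
    "beu": "tired",
    "kers": "cherry",
}


def missing_frequent_words(counts, lexicon, min_count):
    """Word forms with count >= min_count that the lexicon lacks,
    most frequent first."""
    gaps = [w for w, c in counts.items() if c >= min_count and w not in lexicon]
    return sorted(gaps, key=lambda w: counts[w], reverse=True)


def candidate_splits(word, lexicon):
    """All two-part splits of `word` whose parts are both lexicon entries."""
    word = word.lower()
    return [
        (word[:i], word[i:])
        for i in range(1, len(word))
        if word[:i] in lexicon and word[i:] in lexicon
    ]


def best_split(word, lexicon, counts, min_part_count):
    """Choose the split whose rarest part is most frequent.

    Returns None when there is no split, or when the rarest part of the
    best split falls below min_part_count (an arbitrary cutoff here).
    """
    scored = [
        (min(counts.get(a, 0), counts.get(b, 0)), (a, b))
        for a, b in candidate_splits(word, lexicon)
    ]
    if not scored:
        return None
    score, parts = max(scored)
    return parts if score >= min_part_count else None


if __name__ == "__main__":
    print(missing_frequent_words(NEWS_COUNTS, TOY_LEXICON, min_count=100))
    print(best_split("Beukers", TOY_LEXICON, NEWS_COUNTS, min_part_count=100))
    print(best_split("hoogleraar", TOY_LEXICON, NEWS_COUNTS, min_part_count=100))
```

With these numbers, the first check flags hoogleraar as a frequent word form missing from the lexicon. The second declines to split Beukers, because kers falls below the made-up cutoff, while still allowing hoog+leraar. A crude frequency cutoff like this would not address the morphological objections to beu+kers, and it knows nothing about proper names; it is only meant to show how little machinery the first-order check requires.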

[Update -- Bertil, commenting on Ruud's site, explains the real etymology of Beukers -- "cod-beater", roughly:

The site of the Meertens Instituut says that the name Beukers is related to a profession: http://www.meertens.knaw.nl/nfd/detail_naam.php?naam=Beuker.

My 1970 Dikke Van Dale writes that "beuken" means to hit the stockfish until it becomes soft. Apparently, it used to be someone's profession to do this all day.

]

[Update -- Tako Schotanus wrote to clarify that the decomposition of beukers into beu+kers is not just semantically implausible in Dutch; it also seems to him to be morphologically impossible, roughly as if Winston Churchill were to be translated as "victorious weight chapel sick".]

Posted by Mark Liberman at January 24, 2007 03:39 PM