December 10, 2007

More fun with machine translation

A reader asks:

Babelfish translates 'wow' into French as 'défaut de la reproduction sonore'.

I'm wondering: how on earth does that happen?

Well, "defect in sound reproduction" is one of the meanings of the word -- although there's no reason for a citizen of this digital age to know it. The online Merriam-Webster dictionary gives us:

Main Entry: ⁴wow
Function: noun
Etymology: imitative
Date: 1932

: a distortion in reproduced sound consisting of a slow rise and fall of pitch caused by speed variation in the reproducing system

The superscript 4 means that this is the fourth option -- the others are:

1. interjection : used to express strong feeling (as pleasure or surprise)
2. noun : a striking success : HIT
3. transitive verb : to excite to enthusiastic admiration or approval

A machine translation system needs to decide which of these to use. Without going into the theory and practice of machine translation, let's just say that Babel Fish makes the wrong choice here, just as Kingsoft so spectacularly did in the case of 干:

 

This doesn't really answer the question, but just pushes it back to another one: why do machine translation systems make the particular wrong choices that they do? That question presupposes that such errors are somehow abnormal or unexpected, which is a natural reaction to notably silly mistakes such as Babel Fish's translation of wow into French, or Kingsoft's (former) translation of 干 into English.

But in fact, mistakes of that general kind are all too easy for computer algorithms to make, and very difficult to eliminate entirely -- though uniformly translating wow as "défaut de la reproduction sonore", or 干 as "fuck", is unnecessarily and indeed inexcusably bad practice. The real question should be, how can an MT system possibly make such choices correctly, not just sometimes or even most of the time, but almost all of the time?

There is an answer, but the margin of my breakfast time is too small to contain it.
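Short of that answer, a toy sketch can at least show the shape of the choice. The Python below contrasts a system that always returns one fixed sense of wow with one that counts overlapping context words. Everything in it is an assumption made for illustration: the cue-word lists are hand-picked, the function names are hypothetical, the verb sense is left out for brevity, and none of this is how Babel Fish, Kingsoft, or Google actually works.

```python
import re

# Sense labels follow the dictionary entry quoted above; the cue words are
# hypothetical, hand-picked for this illustration only.
SENSE_CUES = {
    "interjection": {"oh", "amazing", "beautiful", "great"},
    "striking success": {"show", "audience", "box", "office"},
    "sound distortion": {"cassette", "tape", "turntable", "flutter", "pitch"},
}


def tokens(sentence):
    """Lowercase the sentence and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", sentence.lower()))


def fixed_choice(sentence):
    """Always return the same sense, whatever the context (the bad practice)."""
    return "sound distortion"


def context_choice(sentence, default="interjection"):
    """Return the sense whose cue words overlap most with the sentence.

    Falls back to `default` when no cue word appears at all.
    """
    words = tokens(sentence)
    best, best_score = default, 0
    for sense, cues in SENSE_CUES.items():
        score = len(words & cues)
        if score > best_score:
            best, best_score = sense, score
    return best


if __name__ == "__main__":
    for s in ("Wow, that is a beautiful view!",
              "The cheap cassette player introduced lots of wow and flutter."):
        print(s, "|", fixed_choice(s), "|", context_choice(s))
```

Even this crude overlap count separates the two example sentences, which are my own inventions, but it fails as soon as the context contains none of the listed words, and that is the general difficulty in miniature.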

To illustrate a closer approximation to the state of the art, here's what Google's current public system does with the Chinese title that Kingsoft 2002 translated as "Expand Enterprising and Really Grasp Solid Fuck and Continuously Expand and Great the New Situation of Buildings of Western Region":

Google's current English-to-French offering likewise gets wow-the-interjection right, at least in my concocted example:

However, when the meaning should actually be the "speed distortion" one, things don't go so well at all:

Leaving aside the translation of wow as "plaire" (which is a verb, as far as I know, as well as having entirely the wrong meaning), we can note that cheap is sometimes "bon" (= good), but not here.

[Sending the English-to-French output back through the French-to-English system, we get "The good cassette player, introduced many pleasing and flicker."]

Posted by Mark Liberman at December 10, 2007 07:38 AM