March 23, 2008

Made in USA == Made in Austria|France|Italy|... ?

Antonio Cangiano has noticed an odd thing about Google's statistical translation software.  As he puts it,

Google Translate sometimes changes the country mentioned within the source language to the main country of the translation language.

I've checked the examples that he cites, and they work exactly as he says.

For example, Austria can be rendered as "USA":

Or as "France", if the target language is French:

Of course Austria is not a German word, but an English one (the German equivalent would be "Österreich"). But something similar can happen in translating from English:

The phenomenon is a subtle one. In the German-to-English example, the source text is really English, not German. Similarly, in the German-to-French and English-to-Italian examples, the whole phrase is being "translated" from English (or English-pretending-to-be-German) into English-pretending-to-be-French or English-pretending-to-be-Italian, with just "France" or "Italy" substituted for "Austria" or "USA" respectively.

I can usually figure out the reasons for amusing translation errors, but I remain a bit puzzled about this one.

Statistical machine translation, of the kind that Google uses, traditionally combines two sorts of information:

  1. statistical relationships between words or phrases in the source language and words or phrases in the target language;
  2. statistical word-sequence patterns in the target language.
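To make the combination of those two sources concrete, here is a minimal sketch in Python of the textbook "noisy channel" way of putting them together. Everything in it is an assumption for illustration: the two tables, the probabilities in them, and the function names `score` and `decode` are invented, and nothing here is taken from Google's system. It shows only that when the cross-language table is unsure, a strong target-language preference can decide the outcome.

```python
import math

# Toy cross-language table: P(source phrase | candidate target phrase).
# All numbers are invented for illustration.
TRANSLATION_MODEL = {
    ("Made in Austria", "Made in Austria"): 0.6,
    ("Made in Austria", "Made in USA"): 0.4,
}

# Toy target-language model: P(target phrase). Also invented.
LANGUAGE_MODEL = {
    "Made in Austria": 0.001,
    "Made in USA": 0.02,
}


def score(source, target):
    """Log-probability of a candidate: translation term plus language-model term."""
    return (math.log(TRANSLATION_MODEL[(source, target)])
            + math.log(LANGUAGE_MODEL[target]))


def decode(source):
    """Return the highest-scoring candidate target phrase for the source phrase."""
    candidates = [t for (s, t) in TRANSLATION_MODEL if s == source]
    return max(candidates, key=lambda t: score(source, t))


print(decode("Made in Austria"))  # prints "Made in USA" with these toy numbers
```

With these made-up numbers, the faithful rendering wins on the translation table (0.6 against 0.4) but loses overall, because the toy language model rates "Made in USA" twenty times more probable. Whether anything like that happens in a real system is a separate question.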

One of the strengths of Google's MT systems is the size of the samples from which they build their models. Perhaps these samples are large enough for their German-to-English and German-to-French systems to have cross-language information for a significant number of non-German words -- like "Austria". But I don't see why this information would map "Austria" onto "USA" or "France" respectively. And it would not explain the English-to-Italian case, where "USA" becomes "Italy".

Then again, perhaps these amusing errors are a symptom of a different kind of statistical modeling, based on looking for source-target mappings in bodies of untranslated but somehow "comparable" text. For example, we look for an English string E, occurring in contexts AEB, and an Italian string I, occurring in contexts CID, such that A_B and C_D are sufficiently "similar" for us to conclude that E and I may be translations. But I don't see why comparable texts would contain patterns that would create these errors.
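The context-matching idea can be sketched in a few lines of Python. This is only a toy under stated assumptions: the sentences, the seed dictionary `SEED`, and the helper names `context_counts` and `cosine` are all hypothetical, and real comparable-corpus methods are far more elaborate. It collects the (left, right) neighbours of a string, maps the Italian neighbours into English through the seed dictionary, and compares the two context profiles by cosine similarity.

```python
import math
from collections import Counter

# Hypothetical seed dictionary (Italian -> English) used to compare contexts.
SEED = {"fatto": "made", "in": "in", "vivo": "live"}


def context_counts(sentences, target, translate=None):
    """Count (left neighbour, right neighbour) pairs around each use of target."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, token in enumerate(tokens):
            if token != target:
                continue
            left = tokens[i - 1] if i > 0 else "<s>"
            right = tokens[i + 1] if i + 1 < len(tokens) else "</s>"
            if translate is not None:
                # Map known context words into the other language's vocabulary.
                left = translate.get(left, left)
                right = translate.get(right, right)
            counts[(left, right)] += 1
    return counts


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


# Invented toy corpora, not real data.
english = ["made in usa", "i live in usa"]
italian = ["fatto in italia", "vivo in italia", "vengo dall austria"]

usa = context_counts(english, "usa")
italia = context_counts(italian, "italia", SEED)
austria = context_counts(italian, "austria", SEED)

print(cosine(usa, italia))   # contexts coincide in this toy data
print(cosine(usa, austria))  # no shared contexts in this toy data
```

In this invented data "usa" and "italia" happen to share all their contexts, so the measure rates them as likely translations. That shows the mechanics only; it does not show that real comparable texts contain such patterns.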

If you think you know -- or especially if you really know -- what's happened here, please let me know.

[A reader writes:

It looks like the translator is looking at "Made in USA" with a meaning of "Made in [here]" as opposed to "Made in [specific place]", so it just naturally swaps the country names. Just my guess, though.

That's the obvious route to the mistake. What I don't understand is what statistical patterns in parallel or comparable text would lead a modern MT algorithm to take that path. ]

[Pekka Karjalainen writes:

The posting on Language Log about the translation of country names prompted me to test the feature. I found that the punctuation following the phrase sometimes affects the translation. I couldn't find any consistency in how it happens with different language pairs, but I found that this worked the same way for me every time:

Made in Austria!! => Made in USA!
Made in Austria! => Made in Austria!

Here => represents using the translator to go from German to English. With German to French, you can try having a trailing comma right after the same phrase and then some other punctuation mark (or nothing at all).

This probably calls for more thorough testing (which many Language Log readers might volunteer to do). For starters, I hope you can in fact repeat my results.

(I used the Google translator at this address, just to make sure: http://translate.google.com/translate_t.)

]

[Empty Pockets writes:

Following up on Pekka's comment, I got the following (mystifying) results:

I live in Austria! ==> I live in Australia!
I live in Austria!! ==> I live in Canada!
I live in Austria!!! ==> I live in Canada!
I live in Austria!!!! ==> I live in Australia!!

In each example the translation is German ==> English.

Wow. ]

Posted by Mark Liberman at March 23, 2008 09:37 AM