March 24, 2008

The (probable) truth about Austria and Ireland

In a couple of earlier posts, I expressed puzzlement about what patterns in parallel or comparable text corpora could have persuaded Google's statistical MT algorithms to translate "Austria" as "Ireland", and so on. Several readers, and Melvyn Quince, had a bit of irreverent but irrelevant fun with the resulting silliness, of course. Anyhow, Bob Moore from Microsoft Research has sent in a very plausible explanation. Like many such theories, it's completely obvious in retrospect. Bob writes:

Although I obviously do not have access to the inner workings of Google's system, I am quite certain of this, because we have observed exactly the same thing happen at Microsoft in some of our research systems that are built along similar lines to Google's.

The problem comes about because these correspondences occur in the training data. As you know, a statistical MT system such as Google's is trained on a parallel corpus in a pair of languages. But much of the parallel data one might find is not simply translated, but is "localized". Lengths are changed from feet to meters, prices are changed from dollars to euros, and contact information is changed to be appropriate for the target audience. This means addresses are changed! The website of a multinational corporation might have a contact address in the French version that has "Paris, France" in a place exactly parallel to where the UK version has "London, England". If these are fed as parallel text into Google's training algorithm, it will learn that one possible translation of "Paris" is "London" and one possible translation of "France" is "England". Which translation is picked depends on frequency, but also on contextual factors, which is probably why the way "Austria" is translated depends on how many exclamation marks it is followed by.
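To make the mechanism concrete, here is a toy sketch of how localized addresses could pollute a learned translation table. It is emphatically not Google's or Microsoft's algorithm: real systems learn word alignments statistically, whereas this sketch simply assumes that tokens in the same position correspond and turns the counts into relative frequencies. The function name `learn_translation_table` and the three-line corpus are invented for illustration.

```python
from collections import Counter, defaultdict


def learn_translation_table(sentence_pairs):
    """Estimate P(target word | source word) from position-aligned token pairs.

    Toy assumption: the i-th source token corresponds to the i-th target token.
    Real statistical MT systems infer alignments instead of assuming them.
    """
    counts = defaultdict(Counter)
    for source, target in sentence_pairs:
        for src_tok, tgt_tok in zip(source.split(), target.split()):
            counts[src_tok][tgt_tok] += 1

    table = {}
    for src_tok, tgt_counts in counts.items():
        total = sum(tgt_counts.values())
        table[src_tok] = {t: n / total for t, n in tgt_counts.items()}
    return table


# Hypothetical "parallel" corpus: two localized contact addresses
# and one address that was genuinely translated.
corpus = [
    ("Paris , France", "London , England"),  # localized
    ("Paris , France", "London , England"),  # localized
    ("Paris , France", "Paris , France"),    # actually translated
]

table = learn_translation_table(corpus)
best_for_france = max(table["France"], key=table["France"].get)
print(table["France"])
print("Most probable translation of 'France':", best_for_france)
```

In this made-up corpus the localized lines outnumber the genuinely translated one, so the most frequent "translation" of "France" comes out as "England". With more honest translations in the data the balance tips back, which fits the observation that the outcome depends on frequency.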

I think that there are some other things going on as well -- the correspondences between "Indiana" and "Indianapolis", and "Austria" and "Australia", are very likely caused by a too-permissive model of probable transliteration relations. And the contextual effect of the number of exclamation points remains mysterious.
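As a rough illustration of what "too permissive" might mean here, the sketch below uses plain string similarity from Python's standard `difflib` module as a stand-in for a transliteration model. This is an assumption made purely for the sake of the example: real transliteration models are learned mappings between character sequences, not a similarity ratio, and the threshold of 0.7 and the helper names are arbitrary inventions.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio between two strings, in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def looks_like_transliteration(a: str, b: str, threshold: float = 0.7) -> bool:
    """Accept a pair as a plausible transliteration if it is similar enough.

    The threshold is a made-up knob; a low value makes the check permissive.
    """
    return similarity(a, b) >= threshold


pairs = [
    ("Austria", "Australia"),
    ("Indiana", "Indianapolis"),
    ("Austria", "Ireland"),
]
for a, b in pairs:
    print(a, b, round(similarity(a, b), 3), looks_like_transliteration(a, b))
```

With this threshold, "Austria"/"Australia" and "Indiana"/"Indianapolis" both pass as plausible transliterations, while "Austria"/"Ireland" does not. That matches the division of labor suggested above: string resemblance could account for the first two pairs, but something like the localized-address effect is needed for the third.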

But the idea of structural correspondences between different local contact addresses is something that should have occurred to me. I'm too used to thinking about translations of legal codes and news stories and such-like things as sources of parallel text.

[Update 3/25/2008: Most (maybe all?) of these oddities have now been fixed. Quick work!]

Posted by Mark Liberman at March 24, 2008 08:02 PM