April 22, 2004

Henning Mangled

Geoff Pullum wonders why he and his wife find the name "Henning Mankell" so much more confusable than the name of Henning's most famous creation, "Kurt Wallander." Could be "Hanning Menkell." Could be "Henkel Manking." Could be almost anything. Or, to restrict it a little, anything with an "M", an "H", an "en", an "an", a "k", an "ing" and an "el" or "ell."

Presumably, the reason is connected to the state of Geoff's mind. And that of his wife, philosopher Barbara Scholz. And presumably the states of their minds are related to what they have experienced. And presumably what they have experienced relates to what is in their environment. And I'm not in their environment very much, although I was in Geoff's environment last week, and I had a great time. Thanks for the curry, Geoff. But given that I'm not in their environment very much, I can only guess at what has been in it. And using the argument of the drunk who looks for his keys under the lamp post, what I guess is that the Google database provides a good impression of Geoff and Barbara's environment. Of course, this could be wrong.

First, the distribution of non-English words in Google is unlikely to be similar to that in Geoff and Barbara's environment. I'll conveniently ignore this. Second, the Google corpus, as Mark has impressed upon me indelibly, is wildly full of porn and gambling sites. But from what we all know of Geoff, the internet may underrepresent his (scholarly) interest in porn and gambling to the same extent that it overrepresents Barbara's. So let's not worry about that either.

Then again, the porn and gambling sites are chock-a-block with artificially created text - should we worry about that? Well, what I'm going to do now is compare the rates at which various possible Swedish mystery writer names arise. I suppose the porn and gambling sites are about as likely to use "Hanning" as "Henkel", with both rates possibly close to zero, so that although they skew any absolute frequency estimate, they probably won't affect a relative comparison too much. So no, let's not worry about the artificial text.

Let's get on with it!

Mystery name    Google hits
mennkell                106
mennkel                   1
mankel                 9660
menkel                 7710
mankell              206000
mannkell                 17
manning             2670000
menning               66800
hanning               75100
henning             2080000
henkell               11000
hankell                 160
hankel                70500
henkel               661000
hennkel                  18
hennkell                  9

First observation: f(Henning)*f(Mankell) > f(Kurt)*f(Wallander), where f is a name's Google frequency. So the confusability of "Henning Mankell" is likely not just a raw frequency issue. The problem, quite obviously, is that the "Henning Mankell" morpheme space is full of similarly plausible combinations. A full analysis would presumably involve looking at phonetic distance between alternatives, but I haven't the time for that. I'm not even going to consider orthographic distances, as could be measured by counting the number of changes to one word's spelling you would need to turn it into another. No, I'll assume that we are given that the first name ends in "ing" and the second in "el" or "ell", and satisfy myself just by looking at the two possible first-name/surname combinations which use up all the relevant morphemes the right number of times, and which are most popular in terms of the raw frequencies of the individual names, i.e. "Henning Mankell" and "Manning Henkel."
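For anyone who does have the time, the orthographic distance mentioned above is usually computed as Levenshtein edit distance: the smallest number of single-letter insertions, deletions and substitutions that turns one spelling into another. What follows is only a minimal sketch of that standard measure, not anything used in this post; the function name edit_distance is my own.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two spellings.

    Counts the minimum number of single-letter insertions, deletions
    and substitutions needed to turn `a` into `b`.
    """
    # prev[j] holds the distance between the first i-1 letters of `a`
    # and the first j letters of `b`; it starts as the i = 0 row.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]  # turning i letters into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, free if letters match
            ))
        prev = cur
    return prev[-1]


# Compare the real names with the lookalikes from the frequency list.
for x, y in [("henning", "manning"), ("mankell", "henkel"), ("mankell", "mankel")]:
    print(x, y, edit_distance(x, y))
```

On this measure the candidate names sit only a few letter changes apart, which is consistent with the crowded name space described above, though the measure says nothing about how the names sound.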

Doing the math, it turns out that, based naively on raw frequency of the individual words, "Manning Henkel" is over 4 times as likely as "Henning Mankell"! The fact that "manning" is a reasonably common gerund has little to do with this, since "Henning" competes admirably in frequency terms: the real problem, if Google is to be believed, is the far higher frequency of "Henkel" than "Mankell". This is in spite of the fact that half of Google's "Mankell" pages are "Henning Mankell" pages, so that in a survey that threw out actual mentions of the author, the odds would be stacked even more strongly against him. And in the wild feedback loops of the Pullum/Scholz household, it would take only one or two mentions of the wrong name for their linguistic environment to become even more polluted. No wonder Geoff and Barbara find it confusing.
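The arithmetic behind "over 4 times" can be reproduced from the counts listed earlier. This is a rough sketch that assumes, as the naive comparison does, that first name and surname are chosen independently, so that a combination's score is just the product of the two hit counts; the names hits and naive_score are mine.

```python
# Google hit counts copied from the frequency list in this post.
hits = {
    "henning": 2_080_000,
    "manning": 2_670_000,
    "mankell": 206_000,
    "henkel": 661_000,
}


def naive_score(first: str, last: str) -> int:
    """Product of the raw hit counts, treating the two names as independent."""
    return hits[first] * hits[last]


ratio = naive_score("manning", "henkel") / naive_score("henning", "mankell")
print(f"Manning Henkel vs. Henning Mankell: {ratio:.2f} to 1")
```

The ratio comes out a little above 4. Because "manning" and "henning" have similar counts, nearly all of it comes from "henkel" outnumbering "mankell".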

Suspiciously, I found only one instance of someone on the internet actually misnaming "Henning Mankell" as "Manning Henkel." The culprit appears to be Finnish - one Esa Tuomas Tikka. Having, as I do, a talent for making strong categorical claims on the basis of weak statistical data, and being prepared to overlook the fact that Geoff mentioned "Henkell" but not "Henkel" in his post, I therefore propose that Geoff is also Finnish. And if you know anything about Finnish orthography, you'll know what that means. It means that "Geoff" cannot be his real name. Too many "g"s, you see. Who is this blogger, linguist and distinguished university professor who conveniently claims a heritage from elsewhere (English, of all things - does he think it's classy?), yet is married to someone with a passion for Swedish murder mysteries and has an unusually deep knowledge of Eskimo snow vocabulary? "Geoff Pullum"? Pull the other one, I should say. The game's up - reveal your true identity!

Posted by David Beaver at April 22, 2004 03:29 AM