July 19, 2004

Eggcorn dynamics

Following up on Chris Weigl's lovely exploration of the airwaves→airways eggcorn, Ray Girvan discusses some examples of the inverse variety, airways→airwaves. After reading Chris' post, I looked for these but didn't find them, partly because I didn't look hard enough, but mostly because I was looking for them in text about the airline industry. Ray found his examples in texts about asthma and the like.

Chris hypothesized that "a great many of the google hits seem to indicate that those who write airways believe that TV signals travel along something like pipes. And this shapes the way they understand the words they use."

This general idea is clearly true: common eggcorns (including the original "egg corn" for acorn, and for that matter the etymology of acorn itself) always have a motivation in meaning as well as in sound. And I think that Chris makes a persuasive case for the relevance of something like Lakoff's Communication as Conduit metaphor in the airwaves→airways examples.

So what about Ray's airways→airwaves examples?

As Ray says, "It's hard to tell why the misnomer developed in this direction". He offers a couple of ideas: resonance from the Airwaves newsletter (brought to you by Combivent and Atrovent inhalation aerosols), or from Wrigley's Airwaves gum ("The Kick that Helps you Breathe Free").

Another idea is suggested by one of his examples, from a TV website (for GMTV, "Europe's Biggest Breakfast Show"), where "Dr. Hilary" answers the question "What is asthma?" The good doctor uses airways correctly in the body of the article -- it's a headline writer who has substituted airwaves, in the subhed "Sensitive or inflamed airwaves can be caused by pollution and smoking".

It makes sense for an editor at a TV website to have airwaves well primed and ready to jump out.

This press release from NCSU tells us that "Normally, when allergic mice are exposed to an allergen, their airwaves swell and mucus production increases dramatically. Treatment with the anti-mucus molecule prevented this mucus build-up." Again, these sentences were written by a PR person whose whole job is focused on getting NCSU discoveries onto the airwaves.

Well, it's too easy to give these post hoc explanations, as convincing as they may be in some cases. But at least in principle, there should be a genuinely scientific way to approach the question. More important, there might be some real value in modeling the relative frequency across contexts of substitutions of this sort, just as psycholinguists have learned a lot from modeling corpora of speech errors such as spoonerisms.

Independent variables would include similarity in sound and/or spelling; metaphorical resonance; and pragmatic association. The dependent variable would be the frequency of a given substitution in a given context or class of contexts. This is in principle a much better situation than analysis of speech-error corpora, since we can control for the relative frequency of the error-free cases.

In the example under discussion, I'm pretty sure that the airwaves→airways substitutions are much more frequent than the airways→airwaves ones; and the airways→airwaves ones seem (almost) always to occur in talking about the bronchial tubes rather than about the airline industry, and mostly in contexts where the writer is focused on media and/or publicity. That's all pretty vague, but that's because I'm not taking the time to frame the contexts and do the counts.

An even more interesting sort of research should soon be possible, if it isn't already, namely the study of eggcorn dynamics. Given a snapshot of substitution frequencies, it might be possible to make predictions (or at least retrodictions) about changes over time. My guess is that you'd need about 20 years of data to be able to see significant changes with any degree of clarity, and we don't have that yet, at least on the scale required. We'd be trying to track frequency ratios of events with frequencies between about 100 and 2,000 whG/bp, and even at the high end of that scale, a corpus of a mere few million words is not going to give us any useful information.

For example, the lexicographers at OUP apparently say that "diffuse" for "defuse" has become (one of?) the commonest substitutions (or what the Guardian calls "word crimes").

We can check the frame "___ the crisis" and see 6,350 hits for defuse vs. 773 for diffuse. That's enough for us to be fairly confident in the value of the ratio (8.2) for this snapshot. We can compare the frame "___ the bomb" and see 7,760 hits for defuse vs. 800 for diffuse, a slightly (but probably significantly) higher ratio of about 9.7. In contrast, in the frame "___ the light" we see 3,890 hits for diffuse, vs. 89 for the substitution defuse, for a ratio of 43.7. In the frame "___ the Gospel" we get 149 hits for diffuse vs. 3 for defuse, for a ratio of 49.7.

So it's clear already that defuse→diffuse is much commoner than diffuse→defuse -- apparently about 5 times commoner. (To do this sort of thing right, you'd have to check for false positives, either exhaustively or by sampling methods, but the basic results are not going to change in this case.)
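For anyone who wants to redo the arithmetic, here is a rough sketch of the ratio calculations, using only the hit counts quoted above. The significance check is my own addition for illustration: it treats the hit counts as if they were independent event counts and applies a simple z-test to the difference of log ratios. That assumption is shaky for search-engine hit counts (which are estimates, and which count pages rather than independent writers), so the z-score should be read as a rough indication at best. The names COUNTS, ratio and log_ratio_z are made up for this example.

```python
import math

# Hit counts quoted in the post: frame -> (expected word, substituted word).
COUNTS = {
    "defuse the crisis": (6350, 773),
    "defuse the bomb": (7760, 800),
    "diffuse the light": (3890, 89),
    "diffuse the Gospel": (149, 3),
}


def ratio(expected, substituted):
    """Hits for the expected word divided by hits for the substitution."""
    return expected / substituted


def log_ratio_z(a, b, c, d):
    """z-score for the difference between log(c/d) and log(a/b).

    Assumes (questionably, for web hit counts) that a, b, c, d behave
    like independent counts, so the variance of each log count is
    approximately 1/count.
    """
    diff = math.log(c / d) - math.log(a / b)
    std_err = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return diff / std_err


for frame, (expected, substituted) in COUNTS.items():
    print(f"{frame:20s} ratio = {ratio(expected, substituted):5.1f}")

# How much commoner is defuse->diffuse than diffuse->defuse?
# A higher ratio means a rarer substitution, so divide the "light"
# ratio by the "crisis" ratio.
relative = ratio(*COUNTS["diffuse the light"]) / ratio(*COUNTS["defuse the crisis"])
print(f"relative commonness: about {relative:.1f}x")

# Is the "bomb" ratio really higher than the "crisis" ratio?
z = log_ratio_z(*COUNTS["defuse the crisis"], *COUNTS["defuse the bomb"])
print(f"crisis vs. bomb: z = {z:.2f}")
```

Under those assumptions, a z-score above about 2 would be the conventional threshold for calling the difference significant. The same function could be pointed at any other pair of frames.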

However, in terms of brute frequency, all these patterns are still pretty rare. Google currently indexes 4,285,199,774 pages, so the 6,350 hits for "defuse the crisis" is about 1,482 whG/bp ("web hits on Google per billion pages"). I don't know what the word count of an average Google page is, but if it's as little as 500 words, this comes out to roughly 3 occurrences of "defuse the crisis" per billion words of indexed text.

And "diffuse the crisis" weighs in at 773/4.2852 = 180 whG/bp, which is probably less than 1 hit per billion words, while "defuse the light" is only 89/4.2852 = 21 whG/bp, which is roughly 4 hits per 100 billion words.

So you can see that conventional corpora -- even large ones like the hundred-million-word BNC -- are not going to work for this kind of research.
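Here is a small sketch of the frequency conversions above, in case you want to try other phrases or other guesses about page length. The index size and the hit counts are the ones quoted in the post; the 500-words-per-page figure is only my guess, as noted, and the function names are invented for this example. The last function estimates how many occurrences you would expect in a corpus the size of the BNC if the web rate carried over, which is itself a big assumption.

```python
GOOGLE_PAGES = 4_285_199_774  # index size quoted in the post
WORDS_PER_PAGE = 500          # a guess from the post, not a measurement
BNC_WORDS = 100_000_000       # the "hundred-million-word BNC"


def whg_per_bp(hits, pages=GOOGLE_PAGES):
    """Web hits on Google per billion indexed pages."""
    return hits / (pages / 1e9)


def per_billion_words(hits, pages=GOOGLE_PAGES, words_per_page=WORDS_PER_PAGE):
    """Rough hits per billion words, assuming a fixed page length."""
    return whg_per_bp(hits, pages) / words_per_page


def expected_in_corpus(hits, corpus_words=BNC_WORDS):
    """Expected occurrences in a corpus, if the web rate carried over."""
    return per_billion_words(hits) * corpus_words / 1e9


# Hit counts quoted in the post.
PHRASES = {
    "defuse the crisis": 6350,
    "diffuse the crisis": 773,
    "defuse the light": 89,
}

for phrase, hits in PHRASES.items():
    print(
        f"{phrase:20s}"
        f" {whg_per_bp(hits):8.0f} whG/bp"
        f" {per_billion_words(hits):7.3f} per billion words"
        f" {expected_in_corpus(hits):7.4f} expected in BNC"
    )
```

On these assumptions, even the commonest of the three phrases comes out at well under one expected occurrence in a hundred million words, and the substitutions are rarer still.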


Posted by Mark Liberman at July 19, 2004 09:17 AM