Language Log: September 2004 Archives

September 30, 2004

Decisiveness is SVO: a Hitlerian theory of communication?

The 2004 U.S. presidential election is turning out to be a rather linguistic affair. Unfortunately, this is not because reforming the language-related parts of the American educational system has become a major issue. Rather, this season's wonkery and punditry have been focused to an unusual degree on language-related properties of the candidates, and on linguistic aspects of their presentation of the issues. There's the whole Bushisms industry, and the rise of frame-speak. About a week ago, there was even a NYT Op-Ed by Stanley Fish, analyzing the candidates' stump speeches as if they were homework essays from a freshman composition class.

Continuing this trend, the Fashion & Style section of last Sunday's NYT had an article by Alex Williams about tonight's presidential debate, running under the headline "Live from Miami, a Style Showdown". According to this story, "[t]he subtle style cues of gesture, posture, syntax and tone of voice account for as much as 75 percent of a viewer's judgment about the electability of a candidate".

I don't expect much substance from these presidential debates. I expect even less substance from the newspaper articles discussing the debates in advance. And when we get to the pre-debate discussion in the Fashion & Style section of the New York Times, I'd be shocked to find any public-policy content at all. So when I read this article and stuck it in the list of links that I might blog about some day, my angle was how amusing it is, for a phonetician like me, to see someone calling syntax one of those "subtle style cues".

But not everyone was amused.

According to a story by Sarah Breger in the 9/29 Daily Pennsylvanian ("Critical letter circulates among profs"), my colleague John Richetti found this article "offensively vulgar". John was especially upset about the article's quotes from our colleague Kathleen Hall Jamieson:

"It is possible to be decisive and not sound decisive," said Kathleen Hall Jamieson, the director of the Annenberg Public Policy Center at the University of Pennsylvania. "People who speak in sentences that contain parenthetical phrases, people who begin a sentence and then deflect to add a series of illustrative examples before they end the sentences" do not seem authoritative, she said. "The language of decisiveness is subject, verb, object, end sentence."

Equally important to Mr. Kerry, she said, is to refrain from using words like "gilded" and "panoply" at the lectern, as he has on the stump.

"Words found on the SAT verbal exam," she added, "should not appear in candidate's speeches."

John expressed his opinions forcefully and eloquently, in a letter sent to the faculty of John's own department of English, and Kathleen's Annenberg School of Communications. According to Breger's story, Richetti's letter asserted that

"This disgracefully simple-minded, pseudo-rhetorical analysis of the most important events in the most crucial election of my lifetime from one of my Penn colleagues is an insult to me and I should think to all of us who teach writing and communications."

and ended with an even stronger condemnation:

"The theory of communication she enunciates is in my view nothing less than Hitlerian and [endorses] demagoguery of a pernicious kind with appalling complacency."

I think we can all agree that John's letter provides a sort of meta-level proof of its first point, by using words like demagoguery and pernicious with no sacrifice of authority or decisiveness.

But I'm less convinced by his second point, that a policy of using simple sentence structures and simple words should be viewed as a "Hitlerian theory of communication." It's certainly true that Goebbels' recipe for propaganda includes the view that

Political propaganda ... speaks the language of the people because it wants to be understood by the people. Its task is the highest creative art of putting sometimes complicated events and facts in a way simple enough to be understood by the man on the street. Its foundation is that there is nothing the people cannot understand, rather things must be put in a way that they can understand.

However, this prescription has been followed by the leaders of mass movements for millennia, and Goebbels was neither the first nor the last to recommend it explicitly. The advice to "keep it simple" follows directly from the premise that public opinion matters, a premise that can be adopted by good and evil forces alike. What I associate more specifically with "Hitlerian" views on mass communications is the "big lie" concept, which was described in this passage from Mein Kampf as a practice of Hitler's enemies:

... the grossly impudent lie always leaves traces behind it, even after it has been nailed down, a fact which is known to all expert liars in this world and to all who conspire together in the art of lying.

and was then, of course, taken up systematically in practice by Goebbels and others. But obviously, nothing in Kathleen Hall Jamieson's quoted remarks recommends Hitlerian theories in this sense.

I do agree with John that Kathleen's prescriptions (as quoted!) are simple-minded. Perhaps it would be better to say "overstated", "caricatured" and "linguistically not very careful". For example, if we take literally her advice that "The language of decisiveness is subject, verb, object, end sentence," we'll find that by this standard, political decisiveness is in rather short supply in all times and places.

George Bush's most recent speech on the whitehouse.gov website, one given extemporaneously yesterday in Lake Wales, Florida, contains 36 sentences (by my quick and informal count). The number of these that can be analyzed as "subject, verb, object" is zero. It's not the most decisive-sounding speech I've ever read, but it does have a sort of common touch, in a style that is curiously zany for a speech about hurricane damage:

I want to thank Adam Putnam, congressman from this part of the world. Every time I see Adam, all he does is talk about oranges. His hair is kind of orange.

W's 9/21/2004 speech to the U.N. was of course entirely scripted, and quite decisive-sounding. Among its 161 sentences, I found nine (about 5.6%) that are simply SVO in form -- if you allow complex subjects like "Both the American Declaration of Independence and the Universal Declaration of Human Rights", and complex objects like "the support of every nation that believes in self-determination and desires peace", and verbal sequences like "has earned" and "will honor". There are a few others that might be assimilated to an SVO pattern, depending on how you treat some loosely-associated modifying phrases, and whether you allow initial connectives, but in any case, the proportion is probably under 10%. More important, some of the most forceful and authoritative-sounding passages in this speech do not involve any simple SVO structures at all -- for example:

Terrorists and their allies believe the Universal Declaration of Human Rights and the American Bill of Rights, and every charter of liberty ever written, are lies, to be burned and destroyed and forgotten. They believe that dictators should control every mind and tongue in the Middle East and beyond. They believe that suicide and torture and murder are fully justified to serve any goal they declare. And they act on their beliefs.

At the same time, some of the SVO sentences strike me as valleys rather than peaks of rhetorical force: "The government of Prime Minister Allawi has earned the support of every nation that believes in self-determination and desires peace", or "History will honor the high ideals of this organization."

As a point of comparison, among the 223 sentences in John Kerry's 9/24/2004 address at Temple University (which focused on similar issues, and is similarly forceful and authoritative-sounding), about 21 can be given a Subject Verb Object analysis, using a similarly rough-and-ready analytic method. This is about 9.4% of the total. The difference in SVO proportions between these two foreign policy speeches is probably not statistically significant, but in any case I'm very sure that it's not rhetorically significant. As in the case of the Bush speech, most of the most authoritative and decisive-sounding sections of Kerry's speech happen not to be among the SVO sentences, but instead (in my opinion) are things like the 10-times-repeated phrase "That was the wrong choice", or the many other short, crisp sentences with non-SVO structure, like "I will show the world that America finishes what it begins", or "To destroy our enemy, we have to know our enemy."
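The guess that the difference is probably not statistically significant can be checked with a standard two-proportion z-test. The sketch below uses only the counts given above (9 of 161 and 21 of 223). The function names are mine, and treating each sentence as an independent trial is a simplifying assumption, so take the result as a rough sanity check rather than a careful analysis.

```python
import math

def svo_share(svo_count, total):
    """Fraction of sentences counted as plain subject-verb-object."""
    return svo_count / total

def two_proportion_z(k1, n1, k2, n2):
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    p1, p2 = k1 / n1, k2 / n2
    pooled = (k1 + k2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided tail area of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

bush_svo, bush_total = 9, 161
kerry_svo, kerry_total = 21, 223

z, p_value = two_proportion_z(bush_svo, bush_total, kerry_svo, kerry_total)
print(f"Bush:  {svo_share(bush_svo, bush_total):.1%}")
print(f"Kerry: {svo_share(kerry_svo, kerry_total):.1%}")
print(f"z = {z:.2f}, two-sided p = {p_value:.2f}")
```

With these counts the p-value comes out well above the conventional 0.05 threshold, which is consistent with the guess above.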

There's certainly a serious issue about how syntactic styles convey characteristics like clarity of vision or decisiveness. It would be interesting to study this carefully. To caricature the syntactic style of decisiveness as "subject, verb, object, end sentence" certainly doesn't advance the scholarship of political rhetoric, though I suppose that it may be like those odd things that voice teachers and golf instructors tell their pupils to do: "sing through a hole in your forehead", "don't follow the ball with your eyes after impact". Although these prescriptions seem impossible or irrelevant, apparently the consequences of trying to follow them are sometimes helpful to some people.

In the end, I'm not sure what to conclude about this brouhaha. I like and respect the two Penn participants, both personally and professionally. I sympathize with aspects of both positions, at least as seen through the dark journalistic glass of the DP and the NYT Fashion & Style section. And I don't know enough about either participant's position to judge to what extent they really disagree.

I don't know what Kathleen Hall Jamieson actually said to (NYT reporter) Alex Williams. It probably doesn't matter -- no one with any sense will believe that this reporter's overall conclusions were based on any actual "reporting". Here (as often) the reporter no doubt decided in advance on the content of the story, and then called the usual suspects to get some quotes to plug in. I'm also very much aware that quotes in such articles are not necessarily accurate, much less complete.

I also don't know what John Richetti really wrote about the NYT article. I'm not a member of one of the departments he sent his letter to, and so I'm relying on the couple of sentences quoted in the Daily Pennsylvanian article. Again, these quotes may or may not be accurate, and they are surely not a complete picture of what John said, much less what he meant.

So for now, I'll just take this as more evidence that (the discussion of) political language has become a central issue in (the discussion of) contemporary American politics. As a linguist, I reckon that this is good for business. As a citizen, I think that it's bad for the country.

Posted by Mark Liberman at 11:15 AM

Post-apocalyptic toponymy

Ray Girvan at the Apothecary's Drawer Weblog has an interesting post on the sometimes-odd development of place names in the wake of major political/linguistic changes. He was "partially inspired" by the series of Language Hat and Language Log posts on Pishpek → Frunze → Bishkek, and he gives links to toponymic developments both in fiction (in particular, Russell Hoban's Riddley Walker) and in fact (post-Roman Kent).

Posted by Mark Liberman at 07:06 AM

September 29, 2004

From hacker to high priest

In the Oct. 4 issue of Newsweek, Steven Levy has a column entitled "Memo to Bloggers: Heal Thyself", which ends with a memorable phrase:

We were promised a society of philosophers. But the Blogosphere is looking more and more like a nation of ankle-biters.

This is the same Steven Levy whose classic pop-ethnography Hackers is one of my favorite books. Hackers depicts the early evolution of the "hacker ethic", with its connections to traditional American virtues of individual initiative, informal voluntary cooperation, suspicion of authority, and intellectual curiosity.

Here's how Levy put it in 1984 -- a version that's made it into the Columbia World of Quotations:

The Hacker Ethic: Access to computers—and anything which might teach you something about the way the world works—should be unlimited and total.
Always yield to the Hands-On Imperative!
All information should be free.
Mistrust authority—promote decentralization.
Hackers should be judged by their hacking, not bogus criteria such as degrees, age, race, or position.
You can create art and beauty on a computer.
Computers can change your life for the better.

In blog years, 1984 was like the neolithic age. The PC was all of three years old. The first Apple Macintosh had just been announced. Only a few people had access to email, and even fewer used it. NSFNET was two years in the future. The first web browser (Mosaic) was made available eight years later, in 1992, at a time when there were about 50 web servers in the whole world. The first easy-to-use blogging software didn't emerge until 1999, fifteen years later. But there's a clear progression from Levy's 1984 version of the Hacker Ethic to the ethos of blogging today.

In contrast, consider the "frame" evoked by the word ankle-biters.

There are the big people, the grown-ups, trying to go about their important business. And then there are the little ones, down there on the floor, oblivious to the larger issues, just nipping at passing ankles and generally getting in the way. It's a term that's traditionally used for small children and yippy little dogs.

This is the perspective of those that Levy satirized in his 1984 book as the "high priests", the ones who depend on credentials, hierarchy, top-down goals and methods, controlled access. It's also, ironically, the way that powerful men and women tend to view an independent press.

What's happened? Levy is 20 years older, that's one thing. But there's also a big difference between then and now in his relationship to the subculture he's writing about. In Hackers, he was writing about a set of quaint, eccentric grouplets in which he had no real role. As a budding writer, fresh from an M.A. in literature at Penn State, he had no reason to have any allegiance to the High Priests of computing, who surely seemed much more alien to him than the members of the Tech Model Railroad Club or the Homebrew Computer Club did. But now Levy is writing as a senior editor at Newsweek, and he's writing about a subculture of people carrying out what he calls "their promise to 'fact-check Big Media's a--'." In this encounter, he's a High Priest himself.

This is the same impulse that led former CBS V.P. Jonathan Klein to characterize the typical blogger as a "guy sitting in his living room in his pajamas writing", and led NPR's "critic at large" John Powers to complain about how bloggers "shriek 'gotcha' at tiny factual errors in articles written on short deadlines by people who actually have to leave the house to do their work", and led Lewis Lapham to compare blogging to "scratching your name on the men's room wall of the, you know, Blue Moon Bar".

Recently, NPR's Ombudsman Jeffrey A. Dvorkin reflected on these issues as follows:

First, we must acknowledge that the blogs have truly arrived. It is hard for journalists who have led a sheltered life without public accountability to acknowledge that those days are over.
Second, it will be tough for ombudsmen and women to admit that their unique role as overseers on behalf of the public is also changing. We need to make room on the bench and give the bloggers a place at the dinner table. The question remains: who's for dinner?
NPR listeners have always been quick to point out our errors and lapses, and in a non-partisan way. The blogs are different because many are explicitly political. It will be interesting to see if the "blogosphere" still has as much impact on mainstream journalism once the election is over.

But blogs are also different because they have an independent way to reach the public, not subject to the control of NPR or any other institution. That's what "mainstream media" have the hardest time accepting, I think.

Posted by Mark Liberman at 11:00 AM

Splitters and Lumpers - Then and Now

As Mark has pointed out, the division between "lumpers" and "splitters" goes back a long way, and in their time, Benjamin Smith Barton and Thomas Jefferson represented opposing camps. But there is a big difference between the situation in 1804 and the situation now, two centuries later. In Jefferson and Barton's day, very little was known about historical linguistics, either about how languages change or about the problems of determining how languages are related to each other. There had been isolated sprouts of good ideas - both morphological evidence and sound laws had been used as arguments for linguistic relationship - but scientific historical linguistics had yet to develop. As a result, reasonable people could hold different views. Jefferson's position resonates with us today both because he turned out to be right and because his argument is a bit more nuanced, but at the time one couldn't say definitively that he was right on the basis of a deep understanding of the problem.

The situation is quite different today. We now have a much better understanding of the problem of determining whether and how languages are related, and we know quite a bit about how languages change. We've also had a good bit of experience, from which we have learned about what works and what doesn't. Among the things we've learned:

The probability of chance resemblance among languages is sufficiently high that we must make an effort to determine that the similarities we notice are not due to chance. The first discussion of the mathematics of linguistic comparison did not appear until 1819, four years after Barton's death, and it was wrong.
Some similarities among languages are due to the fact that the relationship between sound and meaning is not entirely arbitrary. These similarities are therefore not relevant as evidence of linguistic relationship.
Some grammatical properties of languages are either universal or admit of only a few possibilities, rendering them useless as evidence of linguistic relationship.
Sound change is regular, allowing us to establish sound correspondences between related languages.
Sound correspondences greatly reduce the number of degrees of freedom and so greatly reduce the probability of chance relationship.
Multiple sets of sound correspondences can be used to distinguish loans from inherited words.
Languages can borrow large amounts of vocabulary, including basic vocabulary.
In some circumstances, languages borrow morphology as well as vocabulary.
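The first and fifth points can be made concrete with a little binomial arithmetic. The sketch below asks how likely it is to find at least five lookalike pairs in a 100-item word list purely by chance, first under a loose "sounds similar" criterion and then under a stricter one of the kind that regular sound correspondences impose. The per-pair chance rates (5% and 0.5%), the list length, and the threshold are illustrative assumptions, not measured values.

```python
from math import comb

def prob_at_least(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p): at least k chance matches in n pairs."""
    below_k = sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k))
    return 1.0 - below_k

WORD_PAIRS = 100            # hypothetical size of the comparison list
MIN_MATCHES = 5             # hypothetical number of lookalikes offered as evidence
LOOSE_MATCH_RATE = 0.05     # assumed chance rate under impressionistic matching
STRICT_MATCH_RATE = 0.005   # assumed chance rate when correspondences must recur

loose = prob_at_least(MIN_MATCHES, WORD_PAIRS, LOOSE_MATCH_RATE)
strict = prob_at_least(MIN_MATCHES, WORD_PAIRS, STRICT_MATCH_RATE)
print(f"loose criterion:  {loose:.3f}")
print(f"strict criterion: {strict:.5f}")
```

Under these assumed rates, five chance lookalikes are more likely than not with the loose criterion and very unlikely with the strict one, which is the sense in which fewer degrees of freedom leave less room for coincidence.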

What we've learned has had two consequences. On the one hand, it has made us much more skeptical about proposed linguistic relationships. We know that we need to determine whether similarities are due to chance, that we need to exclude from consideration certain kinds of words, such as mama and papa words and onomatopoeia, and that we have to look very carefully at the possibility that we are dealing with borrowing rather than common descent. On the other hand, we've learned how to reconstruct unattested proto-languages from their attested descendants and how to work out the family tree of related languages. We've also learned a lot about the mechanisms of linguistic change.

There are plenty of things we don't know, but we know so much more now than in Jefferson and Barton's time that some ideas that were reasonable then are not reasonable now. One of these is the idea that a small number of vaguely perceived similarities between languages constitutes evidence of common descent. One sometimes sees the difference between splitters and lumpers presented as one of taste and personality. That isn't accurate. There may be such differences, but the disputes between mainstream historical linguists and "long-rangers" like Joseph Greenberg and Merritt Ruhlen are about methodology, namely whether historical relationships must be established by the comparative method or whether superficial lexical comparison is a valid alternative. When one looks at the evidence, the outcome is clear. The mathematics of probability shows that superficial lexical comparison fails to provide evidence that similarities are not due to chance. Even if similarities are not likely to be due to chance, superficial lexical comparison is unable to distinguish between borrowing and common descent.

These conclusions derive from our understanding of language change and of the problem of determining linguistic relationship; they are supported by the history of historical linguistics. Experience has shown that superficial lexical comparison leads to results that subsequently prove to be incorrect. In 1901 in his book Die Sprachwissenschaft Georg von der Gabelentz made this point eloquently in a passage (pp. 164-168) in which he pointed out that Franz Bopp, deservedly famous for his work on Indo-European morphology, had gone astray when he made claims about linguistic relationships without following the comparative method. Here is the German original followed by my translation. (Lyle Campbell and I have discussed this in our paper Indo-European Practice and Historical Methodology [PDF file].)

Es ist schrecklich verführerisch in der Sprachenwelt umherzuschwärmen, drauf los Vocabeln zu vergleichen und dann die Wissenschaft mit einer Reihe neu entdeckter Verwandtschaften zu beglücken. Es kommen auch schrecklich viele Dummheiten dabei heraus; denn allerwärts sind unmethodische Köpfe die vordringlichsten Entdecker. Wer mit einem guten Wortgedächtnisse begabt ein paar Dutzend Sprachen verschiedener Erdtheile durchgenommen hat, - studirt braucht er sie gar nicht zu haben, - der findet überall Anklänge. Und wenn er sie aufzeichnet, ihnen nachgeht, verständig ausprobirt, ob sich die Anzeichen bewähren: so thut er nur was recht ist. Allein dazu gehört folgerichtiges Denken, und wo das nicht von Hause aus fehlt, da kommt es gern im Taumel der Entdeckungslust abhanden. So ging es, wie wir sahen, dem grossen Bopp, da er es versuchte, kaukasische und malaische Sprachen dem indogermanischen Verwandtschaftskreise zuzuweisen. Das Schicksal hatte es merkwürdig gefügt. Es war, als hätte er die Richtigkeit seiner Grundsätze doppelt beweisen sollen, erst positiv durch sein grossartiges Hauptwerk, das auf ihnen beruht, - dann negativ, indem er zu Schaden kam, sobald er ihnen untreu wurde... Die Sprachen sind verschieden, denn die Lautentwickelung hat verschiedene Wege eingeschlagen. Hüben und drüben aber ist sie ihre Wege folgerichtig gegangen; darum herrscht in den Verschiedenheiten Ordnung, nicht Willkür. Sprachvergleichung ohne Lautvergleichung ist gedankenlose Spielerei.

It is terribly seductive to roam the world of languages comparing words from them at random and then to bestow upon scholarship a series of newly discovered relationships. Very many stupidities also result from this; for the most urgent discoverers have unmethodical minds. He who, endowed with a good memory for words, has gone through a couple of dozen languages from different parts of the Earth, - he need not at all have studied them -, finds familiar forms everywhere. And if he records them, investigates them, tests intelligently whether the indications pan out, he does only what is right. Only logically correct thought belongs here, and where it is not absent from the outset then he gladly gets lost in the giddiness of the mania of discovery. Thus it went, as we saw, with the great Bopp, when he sought to assign Caucasian and Malayan languages to the Indo-European language family. Fortune had decreed him a curious fate. It was, to have to prove the correctness of his principles twice, first positively through his magnificent main work, which is based on them, then, negatively, by coming to grief as soon as he was unfaithful to them ... Languages are different because sound change has taken different paths. But it has gone its way consistently hither and thither; therefore Order reigns in differentiation, not Chaos. Language comparison without comparison of sounds is irresponsible game-playing.

The final nail in the coffin for superficial lexical comparison is that it has proven barren. When linguistic relationships are established by the comparative method, the evidence that there is a relationship is the beginning, not the end. Historical linguists then move on to reconstruct the proto-language, work out the family tree, and figure out what changes took place and often how and why. Superficial lexical comparison yields no such fruit.

In 1804 it wasn't crazy to be a lumper like Benjamin Smith Barton, but it is today because over the last two centuries we have come to know better. The disagreement between Jefferson and Barton may have been largely one of taste, but the disagreement today between mainstream historical linguists and "long rangers" like Greenberg and Ruhlen is not; it's a disagreement between science and quackery.

Posted by Bill Poser at 02:11 AM

Plagiarism and coincidence

Joseph Bottum has an interesting recent article on scholarly plagiarism, and the related phenomenon that some are calling "plagiaphrase" (that's where you borrow all the main threads of the fabric of someone else's writing but you re-weave it a little). The issue relates very clearly to themes in statistical linguistics.

How frequent is the phrase "the vagaries of the Electoral College"? It gets about 50 webhits on Google (whG), but that's not very many in 4.3 billion pages.

What about the phrase "in the Holmes mold"? It gets about 19 whG. But more specifically, how frequent is it in contexts where the reference is to Oliver Wendell Holmes? There are no Google hits for that at all. It's a very unusual locution with that reference.

Finding both phrases in the same book already has a ridiculously low probability. But when unusual phrases such as these are repeated in a later-published book in just the parts where they discuss a point that was discussed in an earlier-published book that also used those phrases, you really have to stop and take a closer look.
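To put rough numbers on "ridiculously low": the sketch below turns the two hit counts into per-page rates and multiplies them, which is only valid if the two phrases turn up independently of each other. That independence is an assumption, web pages are not books, and Google's counts are estimates, so this is an order-of-magnitude illustration and nothing more. The variable and function names are mine.

```python
INDEX_SIZE = 4_300_000_000  # pages in the index, per the figure quoted above

web_hits = {
    "the vagaries of the Electoral College": 50,
    "in the Holmes mold": 19,
}

def page_rate(hits, index_size=INDEX_SIZE):
    """Share of indexed pages that contain the phrase."""
    return hits / index_size

def joint_rate_if_independent(rates):
    """Chance that one page contains every phrase, assuming independence."""
    joint = 1.0
    for rate in rates:
        joint *= rate
    return joint

rates = [page_rate(h) for h in web_hits.values()]
joint = joint_rate_if_independent(rates)
expected_pages = joint * INDEX_SIZE

print(f"joint per-page rate: {joint:.1e}")
print(f"expected pages containing both: {expected_pages:.1e}")
```

Restricting "in the Holmes mold" to uses that refer to Oliver Wendell Holmes would only push these numbers lower.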

Bottum finds evidence of this sort in abundance in a book by Professor Laurence Tribe (Harvard University), who he claims has been guilty of constantly repeated plagiaphrase of a book by Henry J. Abraham (University of Virginia). Bottum supplies enough quotations (dozens of them) that you can make up your own mind whether you think it goes far enough beyond the boundary of accidental phrase similarities, and into the land of dishonest appropriation.

Bottum did ask Abraham for comment, and the 83-year-old legal historian seems justifiably rather miffed. He talks of laziness, careless use of young assistants, and the desire to get a book out and make a quick buck.

Posted by Geoffrey K. Pullum at 01:55 AM

Ask, and ye shall receive

I asked for it. In a follow-up to my post on Thomas Pynchon's 1973 transduction of the Vsesoyuznyy Tsentral'nyy Komitet Novogo Tyurkskogo Alfavita into magic realism, I quoted from Mark Dickens' 1989 treatment of the real story of Soviet language policy in Central Asia. At the end of that post, I wondered "what were Pynchon's sources for the history of this period?"

This was around dinner time on 9/27. The next morning, I got the answer.

Jim Bisso (of Uncle Jazzbeau's Gallimaufrey) emailed:

According to Steven Weisenburger, in A Gravity's Rainbow Companion: Sources and Contexts for Pynchon's Novel, Pynchon's main sources for the Turkic linguistic escapades of Vaslav Tchitcherine are Professor Thomas G Winner: (1) Oral Art and Literature of the Kazakhs of Russian Central Asia, 1958, and (2) "Problems of Alphabetic Reform Among the Turkic Peoples of Soviet Central Asia" in Slavonic and East European Review 31, 1952, pp.133-47.

People like Tim May and Jim Bisso are an amazing resource, and it's wonderful that modern networked computing allows someone like me to get the benefit of their knowledge and interest.

However, some people seem to find this same phenomenon threatening. In a Fresh Air commentary aired on 9/23, "critic at large" John Powers said of bloggers that

Some shriek "gotcha!" at tiny factual errors in articles written on short deadlines by people who actually have to leave the house to do their work.

I guess it could look that way, if it bothers you to learn the facts of the case from people who know more about it than you do. And as for the "gotcha" part, which seems to refer to the practice of posting objections for others to read, what's the alternative? If someone sends an objection or correction to the journalist or the media organization, it'll either be ignored or else presented in an abridged form in some sort of letters or feedback area. That's fine, but why shouldn't they post it directly for others to read as well?

Bloggers make mistakes, too -- plenty of them. We also leave things out. We might not be writing on deadline, but then on the other hand we have real jobs, sometimes several of them, and we write for fun in odd moments of spare time. If our blogs have readers, our mistakes and omissions, big and small, usually get corrected. This generally happens in a pretty friendly way, even in areas like politics and linguistics where emotions have been known to run high in Real Life. I think that's partly because everyone is playing on a fairly level field. People like Jim Bisso know that if they send me email or post something relevant, I'll add their information, in the same place as the original or an even more prominent one, and give them credit for it. (Well, unless I get too far behind...) Or they can blog it themselves, and have the same access to interested readers that I do.

And although we bloggers are sometimes embarrassed by others' corrections, objections and amplifications (look down at the bottom of the cited post), we're always happy to get them. (Well, in principle we're happy, anyhow.)

Posted by Mark Liberman at 12:29 AM

Flash: fontgate at Language Log

This time, the pajamahadeen were coming after me.

A couple of days ago, I typed in a few paragraphs from the novel Gravity's Rainbow, dealing with the efforts of Soviet linguists around 1926-28 to establish a new alphabet for Turkic languages of Central Asia. The quoted passage included an odd character, one that I've never seen in any other context, which is described as representing "a kind of G, a voiced uvular plosive". I searched through the Unicode code charts for a few minutes without finding it, and finally came up with ଗ U+0B17 ORIYA LETTER GA, which is vaguely similar in appearance and also in phonetic value, but is clearly not the right thing. However, I was about to be late for an interesting talk. And maybe the right character wasn't really available at all -- Pynchon might have made it up, or it might be an obscure invention that never made it into Unicode. And who would notice, anyhow? So I figured ଗ (U+0B17) would do.

Guess again. Within a few hours, Tim May emailed to suggest that (based on his knowledge of Unicode, and his memory of what the character looks like in the printed version of Pynchon's novel, which he didn't have at hand!) the Unicode code points should really be Ƣ - U+01A2 LATIN CAPITAL LETTER OI and ƣ - U+01A3 LATIN SMALL LETTER OI, about which the code chart for Latin Extended B adds the note "= gha [in] Pan-Turkic Alphabets". Tim also pointed to this chart of the "Kirghiz (Kyrgyz) Latin alphabet (1928 - 1940)", which shows that (something looking like) Ƣ and ƣ were definitely in there, ordered right after G.

Tim commented gently that

The two characters are quite similar, and apparently both denote voiced back consonants. Still, a rather surprising substitution. Oriya's one of the more obscure of the Indic scripts in Unicode.

Right -- Oriya is spoken in Orissa state, on the eastern seacoast of India, south of Bengal. A very long way from Baku. If I was the president of CBS News, this would be my cue to mutter about the silk road and the spread of Buddhism, to object that the glyph in the printed form of Pynchon's novel looks as much like U+0B17 as it does like U+01A2, and to admit grudgingly that "Language Log cannot prove that this code point is authentic". But I'm not, so I'll just say that I made a mistake. I knew it was a mistake at the time, and considered adding a note about it, but thought that would be piling pedantry on top of pedantry, so... Hell, I didn't think anyone would notice. Isn't it incredible that there is someone who knows enough, and cares enough, to look at the page source of a post like that one, track down the Unicode code point involved, and send a helpful email to correct it?

The thing is, I thought Tim's intervention was terrific. I learned something, I improved the quality of the post, it's all good. In the opening of this post, I put myself in the position of CBS just to make a joke. Of course, Tim wasn't accusing me of basing an argument on forged documents.

If I had tried to, I surely wouldn't have gotten away with it. I do have that much in common with Dan Rather.

Posted by Mark Liberman at 12:15 AM

September 28, 2004

Giving the Army a the hand

In today's Today's Papers at Slate, Eric Umansky writes that "USA Today leads with word that one-third of the 1,600 former soldiers who've been ordered back into service are giving the Army the hand and haven't shown up."

I guess this is a blend of the idiom "give the finger to <someone>" and the catch phrase of a few years ago "talk to the hand", with its associated gesture. There's a stark contrast between giving someone a hand and giving them the hand: not your standard semantics of definiteness.

Posted by Mark Liberman at 09:19 AM

331 extra linguists are not enough

According to an article by Eric Lichtblau in today's NYT, the FBI still has some serious translation problems:

Three years after the Sept. 11 attacks, more than 120,000 hours of potentially valuable terrorism-related recordings have not yet been translated by linguists at the Federal Bureau of Investigation, and computer problems may have led the bureau to systematically erase some Qaeda recordings, according to a declassified summary of a Justice Department investigation that was released on Monday.

If you do the arithmetic, this doesn't seem very surprising. The article explains that "the number of linguists at the F.B.I. rose to 1,214 as of April 2004 from 883 in 2001, with sharp increases in the number of translators of Arabic, Farsi and other languages considered critical to counterterrorism investigations," meaning 331 extra translators overall. But if there was a backlog of 120,000 hours to transcribe (?) and translate, and if it takes ten translator-hours per hour of audio to do whatever they do with it, then it would take ((1,200,000/331)/40)/50 = about 1.8 years to catch up. I have no idea what the FBI's procedures are with respect to such material, but I would guess that 10 person-hours per hour of audio is a low estimate; and 40 hours/week, 50 weeks/year are surely high estimates for the amount of time translators can spend doing their core job, as opposed to attending planning meetings and training sessions and so on. And if the system was building up a backlog before, then some of those extra 331 translators must be helping to keep up with new stuff as well. And the overall backlog is apparently more than 500,000 hours, with the 120,000 hours just in counter-terrorism operations.
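Here is the same back-of-the-envelope calculation as a small Python sketch, so that the guesses can be varied. The linguist counts and the 120,000-hour backlog come from the article; the ten-to-one effort ratio and the 40-hour, 50-week working year are my assumptions, as noted above, and the function name years_to_clear is just for illustration.

```python
# Figures quoted in the NYT article.
backlog_audio_hours = 120_000        # untranslated counter-terrorism audio
extra_linguists = 1_214 - 883        # linguists added between 2001 and April 2004

# Assumptions (guesses, not FBI figures).
effort_per_audio_hour = 10           # translator-hours per hour of audio
hours_per_week = 40
weeks_per_year = 50


def years_to_clear(audio_hours, linguists, effort=effort_per_audio_hour,
                   week=hours_per_week, year=weeks_per_year):
    """Years for `linguists` people to work through `audio_hours` of audio."""
    total_work_hours = audio_hours * effort
    return total_work_hours / linguists / week / year


print(extra_linguists)                                              # 331
print(round(years_to_clear(backlog_audio_hours, extra_linguists), 1))  # 1.8
```

Raising the effort ratio or lowering the hours actually available for translation pushes the estimate up in direct proportion.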

Of course, the FBI's translation people are also spending time dealing with this type of stuff, discussed in an article by Lichtblau back in July. It sounds like a hard row to hoe.

A note of caution: NYT stories dealing with things that I actually know something about are often misleading, to the point sometimes of being tantamount to falsehoods. For example, this 9/13 NYT story by Steve Lohr on IBM's open-sourcing of some speech-technology software led to this angry rebuke to IBM from Fernando Cassia at the Inquirer. But as far as I can tell, Cassia is mostly slamming IBM for things they never said, but which were implied by Lohr's misleading NYT article, which Cassia cites as his ("extremely vague") source for the claims he's complaining about. I was being polite when I called Lohr's article "somewhat misleading" in my 9/13 post -- really, looking over it again and looking at the trade-press reaction exemplified by Cassia, I would judge now that the article was incompetent, written by someone who apparently has very little understanding of the technical area that he was writing about.

I'm not singling the NYT out because I think they're especially bad -- I'd say that they're still the best single source for general news that's available to me. To express my feelings about that fact, I can only quote André Gide's famous response when asked to name the greatest French poet: "Victor Hugo, hélas."

As a result, I can't accept the facts and (especially) the implications of Lichtblau's article about FBI translation problems as necessarily being true. I don't know any particular reason to doubt what he wrote this morning, although I would have liked to have seen some information about what these recordings actually are (some are presumably interview or interrogation tapes, some are presumably wiretaps?), what it is that the FBI translators actually need to do to these recordings (presumably it varies, but what is the range of treatments needed?), some reactions from commercial translation professionals about the scope, nature and prospects of the FBI's problems, etc. It seems as if Lichtblau's main goal is to demonstrate that the FBI has serious problems, without really clarifying to any significant extent what those problems are and what would be needed to solve them. (In fairness to Lichtblau, his story was prompted by the release yesterday of an edited form of a Justice Department report on the problems, and what he presumably did was mainly to read the released report and summarize it, with some (predictably outraged) quotes from relevant politicians. On the other hand, he's been on this beat for some time now, and has written at least one earlier story -- the one back in July -- on translation at the FBI.)

In the case of something like IBM's speech technology software, there are lots of alternative sources of information, and someone who knows the field and cares to investigate can figure out what's really going on. But in the case of the FBI's translation (and other) problems, most of the crucial information is probably classified, and most of the rest is locked up behind bureaucratic walls. So all I can really do is to keep an open mind, and accept whatever information is available with a grain or two of salt. It would be nice if more journalists were less committed to advocating a chosen narrative, and more competent in evaluating the information available to them.

Posted by Mark Liberman at 09:05 AM

Watch out, Ron

Ron Rosenbaum is an accomplished journalist who recently debunked Elizabeth Kübler-Ross in an obituary at Slate. Gutsy stuff, since she's become "a saintly icon, the Queen of Death", as Rosenbaum puts it. And, as he explains, she's not dead. But he should be more worried about his blurb at the Harry Walker Agency ("America's Leading Exclusive Lecture Agency"):

His latest bestseller, Explaining Hitler: The Search for the Origins of His Evil, has won critical acclaim for it's candid analyses of how we explain Hitler through literature and film.

According to her website, Lynne Truss isn't lecturing in the U.S. again until November. But Ron, if you see an intense-looking blonde coming towards you with a taser in one hand and a cleaver in the other, you might not want to stick around for the interview.

Posted by Mark Liberman at 08:51 AM

"A comparison from which something might have resulted"

There's a bit more to be said about Thomas Jefferson's linguistic research and its sad end, which Bill Poser has just described.

As I observed in a post last year, Thomas Jefferson understood in 1781 that the comparison of languages offers a way to understand the history of peoples:

How many ages have elapsed since the English, Dutch, the Germans, the Swiss, the Norwegians, Danes and Swedes have separated from their common stock? Yet how many more must elapse before the proofs of their common origin, which exist in their several languages, will disappear? It is to be lamented then . . . that we have suffered so many of the Indian tribes already to extinguish, without our having previously collected and deposited in the records of literature, the general rudiments at least of the languages they spoke. Were vocabularies formed of all the languages spoken in North and South America, preserving their appellations of the most common objects in nature, of those which must be present to every nation barbarous or civilised, with the inflections of their nouns and verbs, their principles of regimen and concord, and these deposited in all the public libraries, it would furnish opportunities to those skilled in the languages of the old world to compare them with these, now or at a future time, and hence to construct the best evidence of the derivation of this part of the human race.

His colleague, Benjamin Smith Barton M.D., Professor of Materia Medica, Natural History and Botany in the University of Pennsylvania, was eager to see linguistic similarities everywhere:

By a careful inspection of the vocabularies, the reader will find no difficulty in discovering that in Asia the languages of the . . . tribes of the Delaware-stock may be all traced to ONE COMMON SOURCE. Nor do I limit this observation to the languages of the American tribes just mentioned . . . HITHERTO, WE HAVE NOT DISCOVERED IN AMERICA. . . ANY TWO, OR MORE LANGUAGES BETWEEN WHICH WE ARE INCAPABLE OF DETECTING AFFINITIES (AND THOSE VERY OFTEN STRIKING) EITHER IN AMERICAN, OR IN THE OLD WORLD. [emphasis original]

(New Views of the Origin of the Tribes and Nations of America, 1798)

Barton went on to assert that "[m]y inquiries seem to render it probable, that all the languages of the countries of America may . . . be traced to one or two great stocks. . ." But Jefferson took a far more cautious line:

. . . imperfect as is our knowledge of the tongues spoken in America, it suffices to discover the following remarkable fact. Arranging them under the radical ones to which they may be palpably traced, and doing the same by those of the red men of Asia, there will be found probably twenty in America, for one in Asia, of those radical languages, so called because, if they were ever the same, they have lost all resemblance to one another. A separation into dialects may be the work of a few ages only, but for two dialects to recede from one another till they have lost all vestiges of their common origin, must require an immense course of time; perhaps not less than many people give to the age of the earth. A greater number of those radical changes of language having taken place among the red men of America, proves them of greater antiquity than those of Asia.

Thus the division between "lumper" and "splitter" was already well established by 1798 -- though Jefferson and Barton maintained friendly cooperation throughout, despite their disagreements.

I suppose that Jefferson's "course of time" that "many people give to the age of the earth" was most likely a reference to calculations like those of James Ussher, Anglican Archbishop of Armagh in Ireland, who had calculated based on biblical considerations that the earth had been created on October 23, 4004 B.C., or almost 6,000 years before Jefferson wrote those lines. "Not less than 6,000 years" is not too far off from conservative modern estimates of the time required "for two dialects to recede from one another till they have lost all vestiges of their common origin".

By 1801, Jefferson had collected vocabularies for about 30 indigenous languages, and began to arrange this material for publication "lest by some accident it might be lost" (as he had written in a letter to Benjamin Hawkins in March of 1800). He was apparently near to realizing this goal in 1803, but put it off due to the opportunity afforded by the Louisiana Purchase to obtain a large amount of additional data, and he formed the plan to devote himself to this work after retiring from the presidency.

At this point, the story that I've read elsewhere conflicts with the version given on the APS web page that Bill referenced, which says that the linguistic papers were being shipped from Monticello to Philadelphia, and were lost on the Rappahannock. But in American Science in the Age of Jefferson (1984), John C. Greene wrote (pp. 384-385) that

In September 1809 Jefferson wrote Barton that he would be glad to let him see any or all of his vocabularies if he were able to do so.

But Jefferson was now unable to oblige Barton or anyone else with Indian vocabularies. The accident he had dreaded in his letter to Hawkins in 1800 had happened. He had put off arranging his vocabularies for publication until he could incorporate the word lists brought back by Lewis and Clark. His turbulent last term as president of the United States having expired, he had packed the vocabularies and related materials in a trunk and sent them to Monticello with the rest of his things, looking forward eagerly to completing work on them when he arrived at his beloved plantation. Alas, the trunk never arrived. It was stolen from the ship that carried it up the James River, and the thief, disappointed to find nothing but "worthless" papers in the trunk, had emptied its contents into the river. Only a few of the precious documents floated ashore and were rescued from the mud. Among these was Lewis' vocabulary of the Pani language, which Jefferson sent to Barton along with a fragment of another vocabulary in Lewis' hand.

Jefferson's cover letter to Barton read in part:

It is a specimen of the condition of the little that was recovered. I am the more concerned at this accident, as of the two hundred and fifty words of my vocabularies, and the one hundred and thirty words of the great Russian vocabularies of the languages of the other quarters of the globe, seventy-three were common to both, and would have furnished materials for a comparison from which something might have resulted. ... Perhaps I may make another attempt to collect, although I am too old to expect to make much progress in it.

Jefferson was 66 at the time, and though he lived another 17 years, the field by then had left him behind.

So I'm not sure whether the stolen trunk was bound from Washington to Monticello, or from Monticello to Philadelphia, and whether it was emptied into the James River or the Rappahannock. But whatever the true historical facts of the case, it's a good argument for making sure you have off-site backups.

Posted by Mark Liberman at 08:31 AM

September 27, 2004

Words from the West

Mark's post on William Clark's spelling mentions that Thomas Jefferson was a linguist. Among Jefferson's interests was the light that language could shed on history, particularly the history of American Indians. The American Philosophical Society, which holds what survives of Jefferson's vocabularies, including the page illustrated above left, has a nice discussion of Jefferson's work here. Indeed, one of the tasks of the Lewis and Clark expedition was to collect lists of words in Indian languages. This is mentioned in Jefferson's instructions to Meriwether Lewis. Regrettably, much of Jefferson's material was lost when it was shipped from Virginia to Philadelphia. The trunk in which it was packed was rifled by a thief who threw the linguistic material, which had no value to him, into the Rappahannock River. The sheets held by the American Philosophical Society are those that survived and were picked out of the mud.

What many people don't know is that Lewis and Clark were not the first to reach the Pacific overland. A decade earlier, a party of Northwest Company men led by Alexander MacKenzie set out from Montreal and, after overwintering at Fort Chipewyan, reached the Pacific at Bella Coola on July 22nd, 1793. The route he followed from the Fraser River to Bella Coola was an existing Indian route known as the "Grease Trail", Carrier /tl'inaɣeti/. The name comes from the fact that the most important item traded into the interior was the processed oil of the eulachon fish Thaleichthys pacificus. Indeed, the Carrier word /tl'inaɣe/ "eulachon oil" is a compound of Carrier /xe/ "grease, oil" (combining form /ɣe/) with /tl'ina/, a loan from Heiltsuk or Haisla, North Wakashan languages spoken on the coast. You can still hike this route, now known as the Nuxalk-Carrier Grease Trail, or as the Alexander MacKenzie Heritage Trail.

MacKenzie's journey was primarily for the purpose of finding an overland route for the fur trade; it was smaller in scale than the Lewis and Clark expedition and was not explicitly scientific, though the Northwest Company was interested not only in the country but in its inhabitants, with whom they would have to deal. MacKenzie's Journal contains short word lists in several Indian languages, including the first record of Carrier, which he collected on June 22nd, 1793 while camped near what is now Alexandria reserve. The orange circle on the map of MacKenzie's route shows where he was camped.

Here is MacKenzie's Carrier vocabulary. The headwords are the words as he wrote them. These are followed by his gloss, the modern form in IPA, if discernible, and in some cases comments.

nah: "eye" -/na/
thigah: "hair" -/t̪s̪iɣa/
gough: "teeth" -/ɣu/
nenzeh: "nose" -/nintsis/
thie: "head" -/t̪s̪i/
dekin: "wood" /dʌtʃʌn/
lah: "hand" -/la/
kin: "leg" -/ketʃʌn/. The modern form consists of /ke/ "foot" plus /tʃʌn/ "stick". Either MacKenzie just didn't hear the first syllable, or in this dialect of Carrier at this time "stick" was used uncompounded for "leg".
thoula: "tongue" -/t̪s̪ula/
zach: "ear" -/dzoh/. MacKenzie's transcription probably reflects /dzʌx/, which is the expected earlier form of /dzoh/.
dinay: "man" /dʌne/
chiqoui: "woman" Perhaps /ts'eku/. If this identification is correct, and if MacKenzie is correct in glossing the form as "woman" rather than "women", it shows that the distinction between singular and plural nouns had already been lost in 1793. Athabaskan languages do not, in general, have distinct singular and plural forms for nouns. Most dialects of Carrier have distinct singular and plural forms only for nouns denoting people and dogs. In the southernmost group of dialects, however, the Blackwater group, even this distinction has been lost, and the word that means both "woman" and "women" is /ts'eku/. In the other dialects "woman" is /ts'eke/, of which /ts'eku/ is the irregular plural. On the other hand, it is conceivable that this dialect still had the distinction between singular /ts'eke/ and plural /ts'eku/ when MacKenzie recorded it, and that MacKenzie's <chiqoui> represents /ts'ekue/, a blend formed when a Carrier speaker started off with the plural form, then switched to the singular.
zah: "beaver" /tsa/
yezey: "elk" /yezih/
sleing: "dog" /ɬi/. MacKenzie's transcription presumably reflects /ɬĩ/.
thidnu: "ground-hog". I can't identify this. "ground-hog" is /dʌtni/ today.
thlisitoh: "iron" /ɬʌztih/
coun: "fire" /kwʌn/
tou: "water" /tu/
zeh: "stone" /t̪s̪e/
nettuny: "bow" Perhaps /neɬtʌɲ/ "our bows". This is currently pronounced /neɬti/ or /neʔʌɬti/ depending on dialect and means "our rifles", but there is other evidence for the sound change involved as well as the shift in meaning.
igah: "arrow". Probably /ʔi k'a/ "it" followed by "arrow" (which generally means "rifle cartridge" today).
nesi: "yes". I can't identify this. "yes" is /a/ at present.
thoughoud: "plains" Perhaps /tl'ok'ʌt/.
andezei: "come here". I can't identify this. "come here" is /ʔanih/ at present.

Although MacKenzie wasn't able to transcribe very accurately, due both to his lack of familiarity with the language and its sounds and to the lack of a notation like the International Phonetic Alphabet, not only can we still recognize most of what he wrote, we can even learn a little bit about the history of Carrier. His spellings of "wood" (<dekin>) and "leg" (<kin>) show that the Proto-Athabaskan velars had not yet become palatal affricates, as they soon thereafter did. (I recently wrote a little paper [PDF file] about this.)

His spelling of dog indicates that Carrier still had nasalized vowels. On the basis of comparison with the other Athabaskan languages Carrier must at some point in the past have had nasalized vowels, but it no longer does, and even records from the late nineteenth century don't show it.

The fact that MacKenzie wrote <th> where older speakers of the current language have /t̪s̪/, as in "hair", "head", and "tongue" (but not, for some reason, "stone"), suggests that he heard a true interdental affricate [tθ]. Nowadays, older speakers contrast apico-alveolars (like the usual American English pronunciation of /s/) with lamino-dentals. The distinction is difficult to hear and has been lost by younger speakers, who have merged the lamino-dentals with the apico-alveolars. The lamino-dentals were very likely once interdental, as their cognates are in some related languages.

One surprising feature of MacKenzie's word list is that it gives the bare stems of the body parts: "eye", "hair", "teeth", "nose", "head", "hand", "leg", "tongue", "ear". Carrier is a language in which body parts, as well as kinship terms and a few other nouns, are inalienably possessed. That is, they cannot be used in isolation but must occur either as part of a compound word or with a possessive prefix. A Carrier speaker will never say /na/ for "eye". He or she will say /sʌna/ "my eye" or /nena/ "our eyes" or /hʌbʌna/ "their eyes" etc. If he wants to refer to an eye without saying whose it is, he will say /ʔʌna/ "an eye, someone's eye". A common problem with lists of words such as this collected by people who did not have much knowledge of the language and were unfamiliar with this phenomenon is that they give as words for body parts or kinship terms forms that actually contain a possessive prefix. It is remotely possible that the body parts were not inalienably possessed at this time in this dialect of Carrier, but very unlikely, since they are today and are in all of the related languages. If they were inalienably possessed, it is a surprise that MacKenzie shows the bare stems. Perhaps he obtained more than one possessed form and extracted the common part as the stem. He never mentions any such analysis, nor is there any evidence that he had any interest in linguistics, so this would be surprising, but either he did this, or one of the Carrier speakers had performed the same analysis on his own language and gave MacKenzie bare stems. Somebody did a surprising bit of morphological analysis back in 1793.

Posted by Bill Poser at 11:45 PM

Birlashdirilmish yangi Turk alifbesi

Spurred by Language Hat's post on the origins of Bishkek, (the current name of) the capital of Kyrgyzstan, I typed in some sections of a fiction set against the background of Stalin-era Central Asia, featuring the effort to develop a "New Turkic Alphabet". Looking around for the true history, I found that Mark Dickens has posted a fascinating paper (from 1989) entitled "Soviet Language Policy in Central Asia".

Here are a few relevant quotes:

Central Asian culture was abruptly altered by the advent of Islam in the area, as the armies of the Caliph swept across the Oxus River (now called the Amu Darya) in 673 AD. By the early eighth century, the Arabs had consolidated their power in what was then known as Transoxiana (The Land Across the Oxus), and by the tenth century, Islam was firmly established as the religion of the general population (although some of the more nomadic tribes in the north continued their animistic and shamanistic practices for several centuries after). Arabic became the language not only of religion but also of higher learning and the Arabic script was employed in all writing, although only a privileged few were able to read and write. In addition to Arabic, classical Persian was also utilized in academic circles. However, most of the people continued to speak in various Turkic or Iranian dialects.

Over the next several centuries, the Central Asian cities of Bukhara, Khiva, and later Samarkand became elite centers of learning in the Islamic world. The area has produced several famous sons, including Al-Khwarizmi (783-847), a brilliant mathematician who has been called "the father of algebra", and the great philosopher, physician, and poet Ibn Sina (980-1037), known in the West as Avicenna. Over the years, a large body of Central Asian literature developed in Arabic, Persian, and Chagatay, a Turkic literary language named after one of the sons of Chingiz Khan. However, despite the great accomplishments of the scholars, most Turkestanis remained illiterate. Indeed, there was little need for the vast majority of them, whether merchants, farmers, or herdsmen, to know how to read or write. In the absence of widespread literacy, though, a rich body of oral literature developed; Central Asia is still renowned as the home of some of the longest epic poems in the world.

Dickens observes that the Bolsheviks decreed as early as 1919 that "All illiterate citizens of the Soviet Republic [the Russian Socialist Federal Soviet Republic, or RSFSR] aged between 8 and 50 years are required to learn to read and write in their native language, or in the Russian language, as they prefer", with a motivation expressed by Lenin as "It is impossible to build a Communist society in a country where people are illiterate". These efforts (to promote literacy, not to build communism) were eventually successful.

The second phase of the literacy campaign began in 1921, its completion coinciding with that of the First Five-Year Plan in 1932. By the end of this phase, the literacy rates in the Tajik SSR, Turkmen SSR, and Uzbek SSR had risen to 52%, 61%, and 72%, respectively (Tonkonogaja 1976:48 - it should be noted that these figures as well as those in Table 2, include all the inhabitants of a given SSR, not just the members of the ethnic group it is named after). The third phase in the campaign began in 1933 and, by the time of the 1939 census, the literacy rates in the five Central Asian republics were 83.6% (Kazakh SSR), 79.8% (Kirghiz SSR), 62.8% (Tajik SSR), 77.7% (Turkmen SSR), and 78.7% (Uzbek SSR). Although there were temporary setbacks due to World War II, this overall upward trend continued after the war until near universal literacy was achieved in Soviet Central Asia and throughout the USSR in 1950's (see Tables 2 and 3). Nothing like this has ever been achieved in any other Muslim country in Asia.[4]

Among the things that had to be done was to create "languages" out of traditional dialect continua:

One of the chief linguistic tasks of the new government was to develop a separate literary language for each significant ethnic group in the Soviet Union. "In the USSR, the emergence of a written language is not always the result of a long internal evolution; it is frequently the consequence of a decision by the central authorities who can present a community with a literary language worked out by Russian linguists" (Bennigsen and Quelquejay 1961:16). Each Central Asian Group chosen to constitute a nation was given a literary language which was artificially differentiated from those of neighbouring nations which were often linguistically similar (as, for instance, with the Kazakh and the Kirghiz). Thus, the linguistic unity of the area was broken up while differences between the languages were emphasized. This process of separation was helped further by the National Delimitation of 1924, which fixed the boundaries of the five Central Asian republics, primarily along ethnic and linguistic lines.

One obvious alternative would have been to create a single pan-Turkic (or at least a single pan-Central-Asian-Turkic) literary language, but this was not the party's choice:

It is interesting to note that the Soviets could have developed a common Turkic language in order to promote sliyaniye, but they chose not to.

Although on the surface the coalescence of many Turkic languages into a single Turkic language would have corresponded to the CPSU program position on 'purging language differences,' it would have contradicted the Bolshevik's real aim, that is to 'purge language differences' in such a way that the Russian language would eventually supersede all other languages (Bruchis 1984:135).

The development and promotion of a common Turkic language, though linguistically logical, would have been politically suicidal, especially as the Soviet leadership began to realize that the expected world revolution was not as imminent as they had hoped. The reality of the situation was that the USSR was increasingly surrounded by political systems hostile to Communism. There was a need to consolidate internal unity, identifying the various Soviet languages with Russian and setting them apart from outside influences.

In order to achieve this end in Central Asia, the Soviet language policy encompassed three broad aims: "first, 'the "completion" and "enrichment" of existing languages, the widening of their scope and the transformation of tribal and community languages into developed national languages with a rich terminology and vocabulary'; secondly, the removal of the large Arabic and Persian loan vocabulary inherited from the Muslim conquests; and thirdly, the establishment of Russian as 'a second native language'." (Wheeler 1964:195).

This was the political and linguistic background of the "New Turkic Alphabet".

At the time of the Revolution, many of the languages of the national minorities lacked written forms. Others employed alphabets which were deemed to be unsuitable by the authorities for various reasons.[6] Soviet linguists set about the monumental task of devising alphabets for those groups which lacked them (over fifty languages received a written form for the first time) and modifying the writing systems which were considered to be inadequate for the purposes of the state. Certainly, the need for an effective vehicle to spread literacy was a legitimate reason for doing so in many cases, but other motivations can be inferred from this action as well.

One of the alphabets slated for reform was the Arabic script used throughout Central Asia, as well as among the other Muslim nationalities in the newly-formed Soviet Union.[7] Various pragmatic reasons were given for the proposed reforms and indeed there were certain Central Asian intellectuals who wanted to get rid of the script. One of the chief problems was that the rich system of vowel harmony found in Turkic languages cannot be represented adequately by the Arabic alphabet, since it has letters for only three vowel phonemes. In addition, the script contains several letters for sounds not found in either Iranian or Turkic languages, and most graphemes have different forms depending on their position in the word. All this tended to make it a difficult alphabet to learn and hence a potential barrier to the spread of literacy.

However, there were equally important political reasons why the alphabet was not satisfactory to the new Soviet rulers. As the alphabet of the Qur'an and of all the great Islamic literature of the past, whether Arabic or Persian, it served as a powerful symbol of the natural ties that the Turkestanis had with the rest of the Muslim world, particularly the Arabs and Persians, who had so shaped the religious and cultural landscape of the area. Indeed, most of the Turkic languages had a significant percentage (e.g. 20-40%) of Arabic and Persian loan elements. In an atheistic state that realized the power of symbols, such a potential rallying point for pan-Islamism could not be permitted to remain. In addition, the common alphabet made communication between the Turkic peoples of the Soviet Union, as well as their kinsmen across the border, all too easy. The spectre of pan-Turkism was equally as threatening to the Soviets as that of pan-Islamism.

In the early 1920's, the steps taken away from Arabic script were gradual ones, limited to adding some diacritical marks and eliminating unused letters. However, the next stage was the one described (accurately, aside from the lurid and fantastical novelistic interpolations) by Pynchon:

The next step in alphabet reform came at the 1926 Baku (Azerbaijan) Turkological Congress, which proposed the adoption of the Latin script for all Turkic languages in the USSR. By 1930, the Arabic script had been replaced by the Birlashdirilmish yangi Turk alifbesi (New Unified Turkic alphabet). By 1935, a total of seventy Soviet languages (not all of them Turkic), representing 36 million people, were being written in the Latin alphabet, modified by diacritics where needed. Although this obviously slowed down the literacy campaign, it also came at a time when there was a new push to eliminate illiteracy. Furthermore, this changeover coincided with the adoption of the Latin alphabet in Turkey, at the instigation of Ataturk. The alphabet was viewed as a culturally neutral script, unlikely to communicate any desires for Russification on the part of the Communist leadership.

At the same time, however, the Latinization of the script "dealt a crushing blow to the Moslem clergy, which utilized the Arabic script as an instrument of spiritual oppression of the... working people" (cited in Isayev 1977:242). It cut off Soviet Muslims from their literary past and the traditional ties to Arab and Persian culture, as well as the rest of the Muslim world. Furthermore, it served to emphasize rather than diminish linguistic differences between the Soviet Central Asians and their compatriots in adjoining countries. Finally, the Muslim clerics and intelligentsia, two possible sources of leadership for anti-Soviet agitation, were essentially reduced to the status of semi-literates, having to learn how to read and write all over again. "For the generations beginning their education in Soviet schools and adult education classes, the literacy blackboard was wiped clean, ready for new writing" (Bacon 1966:191).

The Latin-alphabet writing systems were artificially differentiated to make the different languages seem more different -- and then they were all replaced by Cyrillic, forcing everyone to learn to read all over again!

Further modifications to the Latin script served to create artificial differences between related Turkic languages as the same phoneme was represented by different letters in different languages,[9] a practice which was intensified when these languages were subsequently switched over to the Cyrillic alphabet. There is no good linguistic reason for having done this. An important change in the alphabet of a specific language occurred in 1934 when the standard for literary Uzbek was switched from a northern dialect which utilized vowel harmony to the Iranized Tashkent dialect, which had lost its harmony. This necessitated removing four vowel letters from the alphabet, thus further differentiating Uzbek from related Turkic languages, as well as frustrating the attempts of Uzbek nationalists to maintain the purity of the language. A final change came in 1938 when the letters in the Latin alphabet were rearranged to conform to the sequence of the Cyrillic script,[10] as if in anticipation of the next move.

In the late 1930's, the suggestion was made that the Latin script should be replaced by the Cyrillic. Many of the potential voices of opposition had been silenced in the terrible purges carried out by Stalin during that decade, in which the majority of the Central Asian intelligentsia were liquidated and the remainder were reduced to unwilling collaboration with the regime. The switch to Cyrillic in Central Asia was largely completed by 1940. Again, linguistic reasons were given for this move but, contrary to what Soviet linguists may maintain, the Cyrillic alphabet is no better for representing the Turkic sounds than the Latin script, nor does it involve fewer diacritical marks. Extra letters for certain Turkic sounds are necessary in both systems.[11] The contention that the non-Russian peoples of the Soviet Union, recognizing the great value of the Russian script, desired to make this switch also arouses suspicion.

You can read the rest for yourself.

This leaves me with one more question: what were Pynchon's sources for the history of this period? Since Gravity's Rainbow was originally published in 1973, the source certainly wasn't Dickens' (1989) paper.

[Update: Jim Bisso answered my question, as you can read here. ]

Posted by Mark Liberman at 05:14 PM

How alphabetic is the nature of molecules

The business about Kumiss-whisk, the capital of Kyrgyzstan, reminded me of Thomas Pynchon's treatment of Soviet linguistic imperialism in Central Asia. This is in Gravity's Rainbow (chapter 34, pp. 338-359 in the 1995 Penguin edition). It's a long, typically strange mixture of obscure facts and wild inventions. I'll share some of it with you now, because mixed in with mentions of Pishpek, kumiss, an oil man from Midland, Texas with a curious relationship to Saudi Arabia, and the development of a new alphabet for Turkic languages, there's an interesting meditation on the similarity between linguistics and the oil business.

Here's how it starts:

During the early Stalin days, Tchitcherine was stationed in a remote "bear's corner" (medvezhy ugolok), out in Seven Rivers country. In the summer, irrigation canals sweated a blurry fretwork across the green oasis. In the winter, sticky teaglasses ranked the windowsills, soldiers played preference and stepped outside only to piss, or to shoot down the street at surprised wolves with a lately retooled version of the Moisin. It was a land of drunken nostalgia for the cities, silent Kirghiz riding, endless tremors of the earth ...

He had come to give the tribesmen out here, this far out, an alphabet: it was purely speech, gesture, touch among them, not even an Arabic script to replace. Tchitcherine coordinated with the local Likbez center, one of a string known back in Moscow as the "red džurts." Young and old Kirghiz came in from the plains, smelling of horses, sour milk and weed-smoke, inside to stare at slates filled with chalk marks. The stiff Latin symbols were almost as strange to the Russian cadre -- tall Galina in her cast-off Army trousers and gray Cossack shirts . . . marcelled and soft-faced Luba, her dear friend . . . Vaslav Tchitcherine, the political eye . . . all agents -- though none thought of it this way -- representing the NTA (New Turkic Alphabet) in uncommonly alien country.

After some more description, we meet Džaqyp Qulan:

Light pulses behind the clouds. Tchitcherine tracks mud off the street into the Center, gets a blush from Luba, a kind of kowtow and mopflourish from the comical Chinese swamper Chu Piang, unreadable stares from an early pupil or two. The traveling "native" schoolteacher Džaqyp Qulan looks up from a clutter of pastel survey maps, black theodolites, bootlaces, tractor gaskets, plugs, greasy tierod ends, steel mapcases, 7.62 mm rounds, crumbs and chunks of lepeshka, about to ask for a cigarette which is already out of Tchitcherine's pocket and on route.

He smiles thank you. He'd better. He's not sure of Tchitcherine's intentions, much less the Russian's friendship. Džaqyp Qulan's father was killed during the 1916 rising, trying to get away from Kuropatkin's troops and over the border into China -- one of about 100 fleeing Kirghiz massacred one evening beside a drying trickle of river that might be traceable somehow north to the zero at the top of the world. Russian settlers, in full vigilante panic, surrounded and killed the darker refugees with shovels, pitchforks, old rifles, any weapon to hand. A common occurrence in Semirechie then, even that far from the railroad. They hunted Sarts, Kazakhs, Kirghiz, and Dungans that terrible summer like wild game. Daily scores were kept. It was a competition, good-natured but more than play. ... This native uprising was supposed to be the doing of foreigners, an international conspiracy to open a new front in the war. More Western paranoia, based solidly on the European balance of power. How could there be Kazakh, Kirghiz -- Eastern -- reasons? Hadn't the nationalities been happy? Hadn't fifty years of Russian rule brought progress? enrichment?

Well, for now, under the current dispensation in Moscow, Džaqyp Qulan is the son of a national martyr. The Georgian has come to power, power in Russia, ancient and absolute, proclaiming Be Kind To The Nationalities. ...

A little later,

Out into the bones of the backlands ride Tchitcherine and his faithful Kirghiz companion Džaqyp Qulan. Tchitcherine's horse is a version of himself -- an Appaloosa from the United States named Snake. Snake used to be some kind of remittance horse. Year before last he was in Saudi Arabia, being sent a check each month by a zany (or, if you enjoy paranoid systems, a horribly rational) Midland, Texas oil man to stay off of the U.S. rodeo circuits, where in those days the famous bucking bronco Midnight was flinging young men right and left into the sun-beat fences. But Snake is not so much Midnight-wild as methodically homicidal. ...

Strange, strange are the dynamics of oil and the ways of oilmen. Snake has seen a lot of changes since Arabia, on route to Tchitcherine, who may be his other half -- lot of horse thieves, hard riding, confiscation by this government and that, escapes into ever more remote country. This time ... Snake is going out into what could be the last adventure of all...

The story works back through Tchitcherine's father's voyage with Admiral Rozhdestvenski in 1904 to the South-West African port of Lüderitzbucht, and Tchitcherine's own researches in the Krasnyy Arkhiv to learn about it, which is what originally caused his posting to central Asia:

And so it transpired, no more than a month or two later, that somebody equally anonymous had cut Tchitcherine's orders for Baku, and he was grimly off to attend the first plenary session of the VTsK NTA (Vsesoyuznyy Tsentral'nyy Komitet Novogo Tyurkskogo Alfavita), where he was promptly assigned to the ƣ Committee.

ƣ seems to be a kind of G, a voiced uvular plosive. The distinction between it and your ordinary G is one Tchitcherine will never learn to appreciate. Come to find out, all the Weird Letter Assignments have been reserved for ne'er-do-wells like himself. Shatsk, the notorious Leningrad nose-fetishist, who carries a black satin handkerchief to Party congresses and yes, more than once has been unable to refrain from reaching out and actually stroking the noses of powerful officials, is here -- banished to the Θ Committee, where he keeps forgetting that Θ, in the NTA, is œ, not Russian F, thus retarding progress and sowing confusion at every working session. Most of his time is taken up with trying to hustle himself a transfer to the Ņ Committee, "Or actually," sidling closer, breathing heavily, "just a plain, N, or even an M, will, do. . . ." The impetuous and unstable practical joker Radnichny has pulled the ə Committee, ə being a schwa or neutral uh, where he has set out on a megalomaniac project to replace every spoken vowel in Central Asia -- and why stop there, why not even a consonant or two? with these schwas here . . . not unusual considering his record of impersonations and dummy resolutions, and a brilliant but doomed conspiracy to hit Stalin in the face with a grape chiffon pie, in which he was implicated only enough to get him Baku instead of worse.

Naturally Tchitcherine gravitates into this crew of irredeemables. Before long, if it isn't some scheme of Radnichny's to infiltrate an oil-field and disguise a derrick as a giant penis, it's lurking down in Arab quarters of the city, waiting with the infamous Ukrainian doper Bugnogorkov of the glottal K Committee (ordinary K being represented by Q, whereas C is pronounced with a sort of tch sound) for a hashish connection, or fending off the nasal advances of Shatsk. ...

Most distressing of all is the power struggle he has somehow been suckered into with one Igor Blobadjian, a party representative on the prestigious G Committee. Blobadjian is fanatically attempting to steal ƣs from Tchitcherine's Committee, and change them to Gs, using loan-words as an entering wedge. In the sunlit, sweltering commissary the two men sneer at each other across trays of zapekanka and Georgian fruit soup.

There is a crisis over which kind of g to use in the word "stenography." There is a lot of emotional attachment to the word around here. Tchitcherine one morning finds all the pencils in his conference room have mysteriously vanished. In revenge, he and Radnichny sneak in Blobadjian's conference room next night with hacksaws, files and torches, and reform the alphabet on his typewriter. It is some fun in the morning. Blobadjian runs around in a prolonged screaming fit. Tchitcherine's in conference, meeting's called to order, CRASH! two dozen linguists and bureaucrats go toppling over on their ass. ... Could Radnichny be a double agent?

Here's where it gets serious, or at least seriously fantastic:

The time for lighthearted practical jokes is past. Tchitcherine must go it alone. Painstakingly, by midwatch lantern light, when the manipulations of letters are most apt to produce other kinds of illumination, Tchitcherine transliterates the opening sura of the holy Koran into the proposed NTA, and causes it to be circulated among the Arabists at the session, over the name of Igor Blobadjian.

[...] Does Tchitcherine know what he's doing with this forgery of his? It is more than blasphemy, it is an invitation to holy war. Blobadjian, accordingly, is pursued through the back end of Baku by a passel of screaming Arabists waving scimitars and grinning horribly. The oil towers stand sentinel, bone-empty, in the dark. [...]

"In here, Blobadjian -- quickly". Close behind, Arabists are ululating, shrill, merciless, among the red-orange stars over the crowds of derricks.

Slam. The last hatch is dogged. "Wait -- what is this?"

"But I don't want --"

"You don't want to be another slaughtered infidel. Too late, Blobadjian. Here we go. . . ."

The first thing he learns is how to vary his index of refraction. He can choose anything between transparent and opaque. After the thrill of experimenting has worn off, he settles on a pale, banded onyx effect.

"It suits you," murmur his guides. "Now hurry."

"No, I want to pay Tchitcherine what he's got coming."

"Too late. You're no part of what he's got coming. Not any more."

"But he -- "

"He's a blasphemer. Islam has its own machineries for that. Angels and sanctions, and careful interrogating. Leave him. He has a different way to go."

How alphabetic is the nature of molecules. One grows aware of it down here: one finds Committees on molecular structure which are very similar to those back at the NTA plenary session. "See: how they are taken out from the coarse flow -- shaped, cleaned, rectified, just as you once redeemed your letters from the lawless, the mortal streaming of human speech. . . These are our letters, our words: they too can be modulated, broken, recoupled, redefined, co-polymerized one to the other in worldwide chains that will surface now and then over long molecular silences, like the seen parts of a tapestry."

Blobadjian comes to see that the New Turkic Alphabet is only one version of a process really much older -- and less unaware of itself -- than he has ever had cause to dream. [...]

And print just goes marching on without him. Copy boys go running down the rows of desks trailing smeared galleys in the air. Native printers get crash courses from experts airlifted in from Tiflis on how to set up that NTA. Printed posters go up in the cities, in Samarkand and Pishpek, Verney and Tashkent. On sidewalks and walls the very first printed slogans start to show up, the first Central Asian fuck you signs, the first kill-the-police-commissioner signs (and somebody does! this alphabet is really something!) and so the magic that the shamans, out in the wind, have always known, begins to operate now in a political way, and Džaqyp Qulan hears the ghost of his own lynched father with a scratchy pen in the night, practicing As and Bs.

Back to the ride into the Kyrgyz backlands:

But right about now, here come Tchitcherine and Džaqyp Qulan riding up over some low hills and down into the village they've been looking for. The people are gathered in a circle: there's been a feast all day. Fires are smoldering. In the middle of the crowd a small space has been cleared, and two young voices can be heard even at this distance.

It is an ajtys -- a singing-duel. The boy and girl stand in the eye of the village carrying on a mocking well-I-sort-of-like-you-even-if-there's-one-or-two-weird-things-about-you-for-instance kind of game while the tune darts in and out of qobyz and dombra strummed and plucked. The people laugh at the good lines. You have to be on your toes for this: you trade four-line stanzas, first, second and last lines all have to rhyme though the lines don't have to be any special length, just breathable. Still, it's tricky. It gets insulting too. There are villages where some partners haven't spoken to each other for years after an ajtys. As Tchitcherine and Džaqyp Qulan ride, the girl is making fun of her opponent's horse, who is just a little -- nothing serious, but kind of heavy-set . . . well, fat, really. Really fat. And it's getting to the kid. He's annoyed. He zips back a fast one about bringing all his friends around and demolishing her and her family too. Everybody sort of goes hmm. No laughs. She smiles, tightly, and sings:

You've been drinking a lot of qumys,
I must be hearing the words of qumys--
For where were you the night my brother
Came looking for his stolen qumys?

Oh-oh. The brother she mentioned is laughing fit to bust. The kid singing is not so happy.

[...]

They sit down and are passed cups of the fermented mare's milk, with a bit of lamb, lepeshka, a few strawberries. . . The boy and girl go on battling with their voices -- and Tchitcherine understands, abruptly, that soon someone will come out and begin to write some of these down in the New Turkic Alphabet he helped frame . . . and this is how they will be lost.

[Note: Pynchon indicates that the "New Turkic Alphabet" was based on Latin characters:

...nobody is really too keen on a Cyrillic NTA. Old Czarist albatrosses still hang around the Soviet neck. There is strong native resistance in Central Asia these days to anything suggesting Russification, and that goes even for the look of a printed language. The objections to an Arabic alphabet have to do with the absence of vowel symbols, and no strict one-to-one relation between sounds and characters. So this has left Latin, by default.

but whatever the fate of the "New Turkic Alphabet" in "early Stalin times" (and Pynchon usually seems to give a fairly accurate picture of concrete historical details), Ethnologue indicates that the eventual outcome was for Kirghiz to be written in Cyrillic. ]

[Update: See my next post for the historical details. ]

[Update 9/28/2004: Busted. Tim May emailed to point out that my cursory scan through the Unicode code charts failed to find the correct Unicode code point for the letter whose Committee Tchitcherine was assigned to. The appropriate mapping to Unicode is undoubtedly Ƣ - U+01A2 LATIN CAPITAL LETTER OI and ƣ - U+01A3 LATIN SMALL LETTER OI, about which the code chart for Latin Extended B adds the note "= gha [in] Pan-Turkic Alphabets". Tim also points to this chart of the "Kirghiz (Kyrgyz) Latin alphabet (1928 - 1940)", which shows that Ƣ and ƣ were definitely in there, ordered right after G.

I had found and used ଗ U+0B17 ORIYA LETTER GA, which is vaguely similar in appearance and also in phonetic value, but comes from what Tim calls "one of the more obscure Indic scripts in Unicode". What with the silk road and the spread of Buddhism and all, I guess it's conceivable that there's some historical connection, but in this case it was just an unscholarly expedient on my part. Now, if I was the president of CBS News, I guess this would be my cue to say that "Language Log cannot prove that this code point is authentic". But I'm not, so I'll just say that I made a mistake. I knew it was a mistake at the time, and considered adding a note about it, but thought that would be piling pedantry on top of pedantry, so... Hell, I didn't think anyone would notice. Isn't it incredible that there is someone who knows enough, and cares enough, to look at the page source of a post like this, track down the Unicode code point involved, and send a helpful email to correct it?

Unicode Latin Extended B also includes a clue about the letter whose committee Shatsk is assigned to, which I mistakenly rendered as the html character entity Θ. The correct value is almost certainly Ɵ U+019F LATIN CAPITAL LETTER O WITH MIDDLE TILDE, which the code chart notes as "= barred o, o bar" and also as "→ 04E8 cyrillic capital letter barred o". ]
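For anyone who wants to check the code points named in this update without paging through the code charts, here is a minimal sketch using Python's standard unicodedata module. The dictionary labels and the describe helper are my own illustrative names, not anything from Tim's email or from the Unicode charts, and the official names come from whatever Unicode data ships with the Python interpreter in use.

```python
import unicodedata

# Characters discussed in the update above. The labels are informal
# glosses chosen for this sketch, not official Unicode terminology.
CANDIDATES = {
    "gha (capital)": "\u01A2",
    "gha (small)": "\u01A3",
    "barred o (Latin)": "\u019F",
    "barred o (Cyrillic)": "\u04E8",
    "Oriya ga (the stand-in)": "\u0B17",
}


def describe(ch: str) -> str:
    """Return the code point and official character name for one character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


for label, ch in CANDIDATES.items():
    print(f"{label}: {ch}  {describe(ch)}")
```

Running this prints one line per character, which makes it easy to see that the Oriya letter lives in an entirely different block from the two Latin Extended B letters.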

Posted by Mark Liberman at 01:08 PM

A capital to stir kumiss with

Steve at Language Hat has a wonderful post about the capital of Kyrgyzstan. About its name, of course, which used to be Pishpek, and then became Frunze in Soviet times ("Purunze" to the locals, at least in pronunciation). Since the Soviet name was a reference to the Bolshevik political and military leader Mikhail Frunze, the post-Soviet Kyrgyzstan decided to return to the old name. Unfortunately, no one knew its etymology. I'm not completely clear why this was viewed as a problem -- perhaps local linguistic nationalism prefers etymologically transparent place names? Anyhow, it was decided to use the Kyrgyz word nearest in sound, which is bishkek, meaning "whisk to stir kumiss with".

As Steve pointed out to me in email, this story (if true) means that the name of the capital of Kyrgyzstan is a very special type of eggcorn, namely a false analysis, with a slight change in sound, created on purpose to provide an interpretation for a name that otherwise lacks one. A few other examples of this sort of thing come to mind. For example, there are the cruel nicknames that children invent by malicious re-interpretation of the sound of other children's names. And there are some brand names that find more positive eggcornic analyses for common words.

It stirs the imagination to consider the possible results if other nations used a similar approach to replacing etymologically opaque place names.

Here's a recipe for kumiss: the step requiring a bishkek might be "dissolve the lactose in the water, add it to the milk, mix the yeast and brown sugar thoroughly, adding a little of the milk mixture to make it a thin paste, then add that to the rest of the milk solution and stir well". But I can't imagine that the nomadic Kyrgyz had either lactose or brown sugar -- not to speak of the champagne bottles suggested as containers. This recipe suggests using the mylar bags from boxed wine, which explains why a leather kumiss bag hanging on one's saddle would be a far more ecologically appropriate method. But there's still that lactose and brown sugar. Is mare's milk maybe just higher in sugar content? Or did the nomads of the Asiatic steppes just make do with larger amounts of weaker kumiss?

[Note: Ethnologue spells the name of the language as Kirghiz, even though the country is now officially Kyrgyzstan. ]

[See these two later posts for more of the linguistic history.]

Posted by Mark Liberman at 09:55 AM

September 26, 2004

Ocian in view! Oh! The Joy!

According to this CNN report, the U.S. Mint's new nickels feature a new profile of the well-known American linguist Thomas Jefferson. The reverse side will come in two kinds, one with the traditional buffalo, and the other featuring the Pacific Ocean, inscribed with the words that William Clark almost wrote in his journal when he reached the mouth of the Columbia River: "Ocean in view! O! The Joy!"

But what he actually wrote, it seems, was "Ocian in view! O! The Joy!" CNN explains the substitution:

According to spokeswoman Becky Bailey, the Mint considered the issue, and chose to use the modern spelling.

"We didn't want to confuse anyone into thinking we couldn't spell," she said.

Since the word is obviously derived from Latin oceanus and (the Greek equivalent which I don't have time to render into html entities) via French océan, I was surprised to see the suggestion that this was an old spelling rather than just a misspelling. Of course, there was no such thing as a misspelling in the libertarian English orthography of Shakespeare's time, but by November 1805, when Lewis and Clark reached the Pacific, things had settled down considerably. Even so, the great linguist Thomas Jefferson himself was known to misspell a word or two from time to time (a fact in which I take great personal comfort), so I thought maybe Clark was just exhibiting a similar sort of orthographic independence.

It's true that the OED has the following spellings for ocean among its citations: "occean", "occian", "oxian", "occion", "occione", "occyon", "occyan", "occeane", "occiane", "occæan", "ocian", "ociane", "ocyane", and (of course) "ocean". But (except for "ocean") these all seem to date from the fine free old days when men and women spelled as they pleased. In fact, the most recent non-"ocean" citations (based on a quick scan of the entry) are:

1545 Brinklowe Compl. 45, I thynck it is as well possyble for the ocyane se to be without water.
1591 Spenser Ruins of Time 541 For from the one he could to th' other coast, Stretch his strong thighes, and th' Occæan ouerstride.

And even from the late 16th century, most of the citations are spelled "ocean":

1590 Spenser F.Q. II. ii. 22 A Beare and Tygre being met..on Lybicke Ocean wide.
1591 Shakes. Two Gent. II. vii. 69 A thousand oathes, an Ocean of his tears,..Warrant me welcome to my Protheus.

So I'm quite skeptical that "ocian" was a spelling current at the time of Clark's education, two hundred years later. I think this was just an example of good old American freedom of expression.

Can you imagine the fuss, though, if the first explorer on Mars sends back a misspelled journal entry? Oh! The Shame!

[Update: I'm puzzled about how to square the CNN story with this version of Clark's journal, which treats the expedition's first sight of the Pacific as follows:

Encamped under a high hill on the starboard side, opposite to a rock situated half a mile from the shore, about 50 feet high and 20 feet in diameter. We with difficulty found a place clear of the tide and sufficiently large to lie on, and the only place we could get was on round stones on which we laid our mats. Rain continued moderately all day, and two Indians accompanied us from the last village. They were detected in stealing a knife and returned. Our small canoe, which got separated in a fog this morning, joined us this evening from a large island situated nearest the larboard side, below the high hills on that side, the river being too wide to see either the form, shape, or size of the islands on the larboard side.
Great joy in camp. We are in view of the ocean, this great Pacific Ocean which we have been so long anxious to see, and the roaring or noise made by the waves breaking on the rocky shores (as I suppose) may be heard distinctly.
Captain Clark, 7 November 1805

]

[Update 9/26/2004: Emily Bender emails that

I just read your Language Log entry on "Ocian in view!", and had a tidbit you might be interested in. For the past few months Smithsonian Magazine has been running excerpts of the Lewis & Clark journals, and the spelling is uniformly pretty awful. Given that background, I'd be surprised if "ocian" was anything other than another misspelling.

(Some of?) this feature is available on line here, and Emily is absolutely right. The most non-standard spellings come from the journals of other members of the expedition, for instance Sergeant Charles Floyd, who wrote on August 2, 1804 that

The Indianes Came whare we had expected thay fired meney Guns when thay Came in Site of us and we ansered them withe the Cannon.

but William Clark shows a robust orthographic independence of his own, as in this entry for August 4, 1804:

Set out early- (at 7 oClock last night we had a Violent wind from the NW Som little rain Succeeded, the wind lasted with violence for one hour after the wind it was clear Sereen and Cool all night.) proceeded on passed thro betwen Snags which was quit across the Rivr the Channel Confined within 200 yards one Side a Sand pt. S S. the other a Bend, the Banks washing away & trees falling in constantly for 1 mile, abov this place is the remains of an old Tradeing establishment L.S. where Petr. Crusett one of our hands Stayed two years and traded with the Mahars.

It seems that the version of the journals from which I took the (standard-spelling) quote above comes from a later published version, which was presumably copy-edited by someone. However, it still lacks the "O! The Joy!" quote.

The Smithsonian Magazine's publication of the journal selections is proceeding month-by-month (in a sort of monthly bloggish fashion), so I guess we'll probably find out the answer in their November issue. ]

[Update #3: We don't have to wait that long, as the University of Nebraska Press has a website devoted to Gary E. Moulton's edition of the Lewis and Clark Journals. In the time available to me, I didn't find the "O! The Joy!" post, but this entry (due to Clark, from Nov. 17, 1805) makes Clark's spelling propensities clear:

at half past 1 oClock Capt. Lewis and his Party returned haveing around passd. Point Disapointment and Some distance on the main Ocian to the N W. Several Indians followed him & Soon after a canoe with wapto roots, & [ML: Liquorice] [1] boiled, which they gave as presents, in return for which we gave more than the worth to Satisfy them a bad practice to receive a present of Indians, as they are never Satisfied in return. our hunters killed 3 Deer & th fowler 2 Ducks & 4 brant I Surveyed a little on the corse & made Some observns. The Chief of the nation below us Came up to See us [2] the name of the nation is Chin-nook and is noumerous live principally on fish roots a fiew Elk and fowls. they are well armed with good Fusees. I directed all the men who wished to See more of the Ocean to Get ready to Set out with me on tomorrow day light. the following men expressed a wish to accompany me i'e' Serj. Nat Pryor Serjt. J. Ordway, Jo: Fields R. Fields, Jo. Shannon, Jo Colter, William Bratten, Peter Wiser, Shabono & my Servant York. all others being well Contented with what part of the Ocean & its curiosities which Could be Seen from the vicinity of our Camp.

Score (in this paragraph): "Ocian" 1, "Ocean" 2. "Modern spelling", indeed!]
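The tally above is easy to reproduce mechanically. This is only a sketch: EXCERPT is an abridged copy of the three relevant phrases from Clark's entry rather than the full paragraph, and tally_spellings is a name made up for the illustration.

```python
import re
from collections import Counter

# Abridged from the Nov. 17, 1805 entry quoted above: only the three
# phrases that mention the ocean are kept, joined by ellipses.
EXCERPT = (
    "Some distance on the main Ocian to the N W. ... "
    "all the men who wished to See more of the Ocean to Get ready ... "
    "what part of the Ocean & its curiosities which Could be Seen"
)


def tally_spellings(text: str) -> Counter:
    """Count whole-word occurrences of 'Ocian' and 'Ocean' in the text."""
    return Counter(re.findall(r"\bOc[ie]an\b", text))


for spelling, count in tally_spellings(EXCERPT).items():
    print(spelling, count)
```

The same function could be pointed at a longer stretch of the journals to see how Clark's two spellings are distributed, though the pattern here only covers these two variants and not the older forms listed in the OED.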

Posted by Mark Liberman at 10:30 PM

Eska and Ringe on Forster and Toth

The September issue of Language has a "discussion note" by Joseph Eska and Don Ringe entitled "Recent work in computational linguistic phylogeny." If your institution has a Project Muse subscription, you can access the pdf here (though you still can't learn this from the LSA's pathetically out-of-date "Language Online" web page, nor anywhere else on the LSA's web site, as far as I can tell).

The article starts by listing "a number of recent attempts by nonlinguists to reconstruct linguistic evolutionary trees," including Rexová et al. 2003, Gray & Atkinson 2003, and Forster & Toth 2003, and asserting that "[s]cientific linguists have not been impressed for a variety of reasons". Eska and Ringe write that

Though no two of the publications in question exhibit exactly the same weaknesses, all can be impugned on one or more of the following grounds: the linguistic data employed have not been adequately analyzed, or—in some cases—even competently analyzed; the model of language change employed has not been shown to fit the known facts of language change; attempts to fix the dates of prehistoric languages have ignored the fatal shortcomings of glottochronology discovered by Bergsland and Vogt (1962...); the researchers assume that vocabulary replacement is governed by a LEXICAL CLOCK (similar to the controversial MOLECULAR CLOCK posited by some biological cladists); and/or the data set used is too small to yield statistically reliable conclusions.

A thoroughgoing critique of all recentlypublished work in this vein would be unwieldy and would require far more space than a discussion note permits. Instead, we focus on the article that best exemplifies the shortcomings listed above, namely the work of Forster and Toth.

Earlier, more informal analyses of Forster and Toth were already devastating, and under Eska and Ringe's scrutiny, things don't get any better. Let's just say that they make the rubble bounce, several times.

I'm not familiar with Rexová et al. 2003 (that's Rexova, K. Frynta, D. and Zrzavy, Jan. 2003. "Cladistic analysis of languages: Indo-European classification based on lexicostatistical data". Cladistics. 19: 120-127). However, I've read Gray & Atkinson, and discussed the work with Russell Gray as well as with Don Ringe, Bill Poser and Tandy Warnow. See here, here, here, here, here and especially here and here for various prior Language Log posts on this topic.

Based on that experience, I think it's unfair to put Gray & Atkinson in the same category as Forster & Toth. As I wrote earlier, Gray & Atkinson's stuff "is serious and interesting work. Its methods and conclusions remain controversial but they are worthy of very close attention".

I know that Don Ringe is not convinced by their arguments, but in my opinion, that belongs in the large category of differences of opinion among serious scholars.

Posted by Mark Liberman at 08:18 PM

Learn your grammar, Becky

A large amount of work went into the preparation of the recent spam message from "Becky Miranda" about tentative scheduling of a meeting that was sent to a random UCSC address and blind-copied to me (and doubtless hundreds of others). The body did nothing but display an icon which would take the viewer to a web site if clicked on. A significant amount of random Angloid text with English-type letter transition frequencies ("align fatbikini esquire granularhemorrhage applicable augerdominic chalet aggressivebarbudo wherefore verbsomewhat germane israelballroom toefl refrainnoetherian committal typewritethickish...") had been added to try and defeat spam detection algorithms which look for an excess of HTML over plain text. And work had been done to forward it through a trail of relay machines. But the imaginary Becky let herself down with the Subject line:

From: "Becky Miranda" <hgdrzftoelrwt@takas.lt>
To: <Prest@ucsc.edu>
Subject: tentative meeting on the 2th
Date: Fri, 24 Sep 2004 01:59:20 -0300

The suffix for numerically abbreviated ordinal numerals isn't always th in English, Becky. It's st for those that end in 1 but not in 11; it's nd for those that end in 2 but not in 12; it's rd for those that end in 3 but not in 13; and otherwise it's th. (I have to admit to you that on page 1718 of The Cambridge Grammar of the English Language this is only implicit; it's carefully described for the spelled-out words, but not for the numerical abbreviations.) That little detail of the lexical structure of English number names (that we don't have a "2th" of any month) gave you away, and would have revealed you as a foreign spammer even if the incongruity of someone at a Lithuanian address inviting me to a meeting had not. You see how important grammar is?
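For anyone who would rather see the suffix rule spelled out mechanically, here is a minimal Python sketch of it. The function names `ordinal_suffix` and `ordinal` are invented for this illustration; the logic is just the rule stated above, checking the last two digits first so that 11, 12 and 13 (and 111, 212, and so on) fall through to th.

```python
def ordinal_suffix(n: int) -> str:
    """Return the suffix used when the ordinal of n is written with digits."""
    last_two = abs(n) % 100
    last_one = abs(n) % 10
    # Numbers ending in 11, 12 or 13 are the exceptions: they take "th".
    if last_two in (11, 12, 13):
        return "th"
    # Otherwise the final digit decides: 1 -> st, 2 -> nd, 3 -> rd, else th.
    return {1: "st", 2: "nd", 3: "rd"}.get(last_one, "th")


def ordinal(n: int) -> str:
    """Return n followed by its ordinal suffix, e.g. 2 -> '2nd'."""
    return f"{n}{ordinal_suffix(n)}"


if __name__ == "__main__":
    print([ordinal(n) for n in (1, 2, 3, 4, 11, 12, 13, 21, 22, 23, 112)])
```

With this rule a meeting would be scheduled for the 2nd, the 12th or the 22nd, but never the "2th".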

Posted by Geoffrey K. Pullum at 03:30 PM

Happy birthday, eggcorn!

The term "eggcorn", meaning a nonce or sporadic folk etymology, was coined by Geoff Pullum almost exactly one year ago -- September 30, 2003. Since then, it's caught on to the extent that {"eggcorn|eggcorns"} gets 3,680 hits on Google.

Some of these hits are real eggcorns, if I can put it that way:

(link) Also I have oak trees in my pasture. My vet said my horses would be ok because they shouldn't eat the eggcorns. But everything I read says they are poiseneous...
(link) This animal eats mostly berries and eggcorns.

But most seem to be uses of the newly-coined term. I've been saving up new examples for the past couple of weeks, mostly as cited in weblogs I read, or sent in by email. Here are a few of them -- I'll add some more later.

On 9/05, entangled bank posted about "wrought with pain/errors":

Either an eggcorn or a development of a new verb form. Wrought is being treated as meaning racked or fraught, in senses such as wrought with pain, wrought with errors. In fact racked makes better sense as an origin for the first, fraught for the second, and it's picking up on the fact that racked with is often spelt wracked with by association with the older word wrack = wreck.

(And in response to entangled's 9/15 "FINAL POST", I sincerely echo TstT's entreaty: "Don't do it, man! You've got so much to blog for!")

On 9/15, wolf angel posted about the substitution "uncharted waters" → "unchartered waters", and commenters on the post added

"deep seated" → "deep seeded"
"a grain of salt" → "a grain assault"
"pummeling"→ "pommeling"

By email, Linda Seebach contributed an example from a comment on a post on Arnold Kling's blog:

The Coup de grasse must be a friend who bought a House for $18,500 in 1965, and sold it for $92,300 last year.

As Linda pointed out, "Notice he knows it isn't 'grass.'"

Linda also contributed a "different kind of mistake, from someone writing a letter to the Mensa Bulletin":

Few Americans seem to realize that, had the U.S. Supreme Court allowed the Florida Supreme Court to interfere extra-legally, blood would have flown in the streets of America.

I share Linda's reaction to that one: "That was so odd I had to think for a while before I could figure out what was wrong with it."

Philip Brooks emailed a citation from a Sports Illustrated story:

"What a game to go off for a career high. Forte scored 17 of his 28 in the second half, 10 during a 14-4 run that turned a tie game into a 53-53 Carolina lead. This wise-beyond-his-ears freshman has been UNC's go-to guy and only consistent scorer all season, and he delivered in a big way Sunday."

As Philip pointed out, this one could be a joke or a typographical error.

Badaunt emailed this:

I've just encountered another eggcorn to add to your list, (if you want more):

Someone wrote a comment on my blog saying: 'I suspect I had an outer body experience at the age of 10.'

I googled on 'outer body experience' and discovered there's a brand of bath and beauty products called 'Outer Body Experience' which confuses things a bit, but the eggcorn seems quite common, too. Googling on

"outer body experience" -beauty -products -soap

gets 1370 hits.

and Googling on

"out of body experience"

gets 62,400.

(What is the etiquette when someone lays an eggcorn on your blog? Should I correct him/her, or would that be rude?)

My own reaction, for what little it's worth, is (a) it would be rude to make an explicit correction in public, (b) it would be OK to explain the situation, in a gentle way, by email, and (c) if you wind up responding in a way that uses the same word or phrase, it's OK to use the standard version rather than accommodating to the eggcorn. Also, I think that outer body experiences are the very best kind!

Fernando Pereira emailed a subtle case. Is the phrase "ready and roaring to make a go" in this article a substitution for "raring to go", or just a different and less commonplace expression?

Since the June 21 flight of SpaceShipOne at the Mojave Airport in California, rumors have grown that Scaled Composites is ready and roaring to make a go at the Ansari X Prize.

Posted by Mark Liberman at 01:12 PM

September 25, 2004

A random monkey begins Julius Caesar

Bill Poser describes it as "a well known claim" that "If you have enough monkeys banging randomly on typewriters, they will eventually type the works of William Shakespeare." There's one little thing he does not note which I think might be worth pointing out here, and that is that the claim is definitely true. I think a lot of people might not realize that: some incorrectly call it a conjecture. No, it is absolutely true, a special case of a theorem of Kolmogorov. I went to the monkey simulator site that Bill refers to, and ran it during lunch. After lunch I found that my monkey had actually started typing out Act 1, Scene 1, of Julius Caesar, from the very beginning.

The play opens with a speech by Flavius (spelled "Flauius" in the old Latin style where u and v are not distinguished since the latter is just the semivowel corresponding to the vowel represented by the former):

Flauius. Hence: home you idle Creatures, get you home:
Is this a Holiday? What, know you not
Being mechanical, you ought not walk
Upon a labouring day without the sign
Of your profession? Speak, what trade art thou?

My monkey (actually a random character generator) had begun typing out the play, as far as the 17th character:

F	l	a	u	i	u	s	.		H	e	n	c	e	:		h
1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16	17

Now, to say that my monkey would not be permitted to choose o as the next character after a sequence like this would be to say that his choices are not random. Full randomness means that he can pick any letter he chooses, and we have no way to predict which one he might pick. Therefore he could choose o as his next letter. And by the same reasoning he could pick m as the one after that, and choose e after that, and so on. As the monkeys randomly type through the billions and trillions of years, one day I'll get the whole of Julius Caesar, simply by an accident of such choices. You can take that to the bank.

This time, I missed: my monkey chose b instead of o, and I got:

Flauius. Hence: hb'in-p1:s]Ij"PpXeygefFPXD)gg8Ns...

So it didn't work out. Not today. But it doesn't matter. The claim is true. One day the sequence Flauius. Hence: h will come back, this time with an o and the whole of the rest of Julius Caesar following it, followed by all the other plays. That's the really staggering thing about the claim: it's not a speculation that one day we'll get the whole of the Immortal Bard's works out of an untiring team of monkeys working away on keyboards; it's definitely true — unless astronomy imposes its cosmic time limit on everything and the earth is destroyed or the universe shuts down before we get there. And because of the randomness we have no way to tell whether that will happen before we get the right character sequence.

[Note added later: Fernando Pereira warns me to be cautious about jumping from mathematical monkeys (independent random variables) to real monkeys. He notes (quite rightly) that a finite computational device such as a monkey cannot generate an infinite sequence of independent random draws, because any random number generator has a cycle, however large. It may turn out that the cycle of your number generator is smaller than the number of draws needed before you ever hit what you want. The monkeys would need a true source of randomness to be really truly random. So let me just make it clear that I am assuming very special monkeys here, with access to a true and perfect source of total randomness. They never develop repetitive habits in their typing; it truly is wide open at every point what key they will hit next. They also don't crap on the keyboard or urinate into the system unit. You don't find monkeys of this sort in the typical zoo. In fact you won't even find them in the above-referenced simulation, since it will be using a finite computing device with a finite cycle for its random generation. But abstract away from all that. In the beautiful, abstract world of pure math, where random sequences are genuinely random, the monkey Shakespeare claim really is true. Trust me. Why would I lie to you?

But Mike Albaugh has convinced me that it may be true even with real monkeys, in principle. Mike points out that we don't need total randomness; we just need for the monkeys to be working their way through the logical space of all character sequences without getting stuck in repetitive loops that are too small (and for there to be enough of them working for long enough in a universe that lasts for long enough, of course): "if the state-space is sufficiently large to contain the complete works of Shakespeare," writes Mike, "then even the simplest, linear traversal of it will eventually 'find' the works. To a first approximation, all pseudo-random number generators are just 'fancy ways of counting' the states of their state-space, so as to 'look random'. There is a long history of folks 'looking sideways' at such series and seeing distinct patterns. This goes back at least to the cryptanalysis of the Lorenz machine, in WWII. John von Neumann is quoted to the effect that people who attempt to generate random numbers by computation are '.. in a state of sin.' But for a well-specified task of finite length, simply having a large enough state-space will do." Mike also points out that there will come a problem at the point where the first monkey violates someone's jealously guarded copyright. Presumably the monkey that starts typing out the source code for Windows XP will be a monkey in deep, deep trouble.]
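Mike's "simplest, linear traversal" can be made concrete with a toy sketch. The Python below is only an illustration under stated assumptions: the function name `linear_search_for`, the four-letter alphabet and the three-letter target are all invented here, and nothing like this would finish in a sensible time for a real play. It simply counts through every string over an alphabet, shortest first, and reports how many candidates it generated before the target turned up. No randomness is involved at all.

```python
from itertools import count, product


def linear_search_for(target: str, alphabet: str) -> int:
    """Enumerate all strings over `alphabet`, shortest first, and return how
    many candidates had been generated when `target` appeared."""
    if not target:
        raise ValueError("target must be non-empty")
    if any(ch not in alphabet for ch in target):
        raise ValueError("target uses characters outside the alphabet")
    generated = 0
    # Lengths 1, 2, 3, ...: the traversal is just an elaborate way of counting.
    for length in count(1):
        for candidate in product(alphabet, repeat=length):
            generated += 1
            if "".join(candidate) == target:
                return generated


if __name__ == "__main__":
    # Toy case: find "hom" (as in "home") over the alphabet e, h, m, o.
    print(linear_search_for("hom", "ehmo"))
```

The search is guaranteed to stop because every finite string over the alphabet has a fixed position in the enumeration; the only question, as with the monkeys, is whether the universe lasts long enough for the interesting targets.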

Posted by Geoffrey K. Pullum at 05:26 PM

Once more unto the Breach, search engine hackers!

Sometimes, John Henry still beats that steam drill. Well-informed humans can still be better at finding relevant information than search engines are. Here's an example.

In a recent post, I discussed some dismissive comments by Lewis Lapham about bloggers. The context was an interview with Marcie Sillman on a radio show called Weekday, which airs on KUOW in Seattle. As far as I know it's not syndicated, but it's available on the web. I found out about Lapham's remarks from a note scratched on the men's room wall in a local bar. Well, really it was from a post on Andrew Sullivan's weblog, which quoted Lapham comparing weblog postings to toilet-stall graffiti.

I tracked down the Lapham interview, and listened for myself, since I was planning to make fun of Lapham and didn't want to treat him unfairly. The relevant exchange, in my transcription, goes like this:

Sillman: We've had several listeners email in uh- uh- going back to the idea of- of media, asking you what you think about blogs, whether the information that- that has come out on various blogs is- is valuable, is reliable, is something that people should turn to.

Lapham: I don't know enough about blogs. I- I don't scan the Internet and the- so- but I I- guess as a source for clues, or for leads, uh for let's say a newspaper, or a m- Harper's Magazine's a monthly, so we're not into the timely news, might prove useful, but I'm sure it would be very difficult to learn which ones are worthwhile, and which one- I'm sure there- there are a lot of them that are uh simply uh the equivalent of scratching your name on the men's room wall of the, you know, Blue Moon Bar. I've-

Sillman: Have you been there? It's just down the street!

[both laugh]

It seemed clear from the context that there is a real Blue Moon Bar in Seattle, maybe with some relevant properties. But searching for {"blue moon bar" seattle} didn't turn up anything useful, at least not on the first couple of pages, so I settled for a generic reference to the myriad Blue Moon Bars around the world.

A little while later, I got an email from reader Mike Pope.

First, allow me to note how much I enjoy your posts and the Language Log generally. I was startled to read your post because I'd been listening to the same report and had the same reaction ("No, dope, it's Little Green Footballs") and a more general reaction that the guy's attitude toward bloggers was a bit supercilious.

Mike is talking about the first radio bit referenced in my post, which dealt with John Powers' commentary on Fresh Air. My impression is that Powers' attitude towards everything is a bit supercilious, but I guess you need that to be a "critic at large".

But I was also startled to read that you were listening to Marcie, which means either that you were in Seattle or that "Weekday" is syndicated. If the latter, that's great for them and news to me.

No, I couldn't find any evidence that the show is syndicated. I had to (or really I should say "got to") listen to it on the web.

Incidentally, you probably know that the Blue Moon Tavern is famously the one-time hangout of Theodore Roethke. While this lends the place a certain infamy, the place is still a dive. :-)

No, I didn't know that.

So Lapham got the name wrong -- it's the Blue Moon Tavern, not the Blue Moon Bar! If I search for {"Blue Moon Tavern" Seattle}, right at the top I find this. And down the rest of the first page of Google hits are nine other relevant and often interesting things, such as the second link, from which I learn that the comments scratched on the restroom walls at the Blue Moon Tavern might have been authored by Theodore Roethke, Carolyn Kizer, Dylan Thomas or Allen Ginsberg. Somehow, though, I don't think that's what Lapham meant.

The point is, Mike -- a well-informed human -- immediately knew that Lapham and Sillman must be talking about the (in)famous Blue Moon Tavern. In fact, it was so obvious to him that he figured I must know about it too, though perhaps he was just being polite.

But I didn't know about it, and neither did Google. Chalk one up for Mike Pope and John Henry -- though of course here the benefit comes from the combination of having access to a good search engine and knowing what search terms to use.

And we have to end with a quote from Theodore Roethke. Say, this one (from PRAISE TO THE END! 1951):

84 My friend, the rat in the wall, brings me the clearest messages;
85 I bask in the bower of change;

Here's to the writing on the wall!

Posted by Mark Liberman at 03:28 PM

September 24, 2004

Monkey Shakespeare

There's a well known claim that:

If you have enough monkeys banging randomly on typewriters, they will eventually type the works of William Shakespeare.

There's a web site devoted to simulating this task: Monkey Shakespeare Simulator. When someone goes to this site, it runs a simulation of monkeys typing randomly. So far, the record is 20 letters from Coriolanus after 462,060,000,000,000,000,000,000,000,000 monkey-years, sent in by Jens Ulrik Jacobsen from Denmark on 31 Aug 2004. With eighty letters available, there are 115,292,150,460,684,697,600,000,000,000,000,000,000 possible 20 letter sequences, so the probability of any particular 20 letter sequence is .00000000000000000000000000000000000000867361737988403547205962240695953369140625.
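The arithmetic is easy to check with exact integer and decimal arithmetic. The short Python sketch below assumes only the two figures given above (eighty available characters, sequences of length 20); the variable names are made up for the illustration.

```python
from decimal import Decimal, getcontext

ALPHABET_SIZE = 80      # "eighty letters available"
SEQUENCE_LENGTH = 20    # length of the record match

# Number of distinct 20-character sequences: 80 ** 20, computed exactly.
total_sequences = ALPHABET_SIZE ** SEQUENCE_LENGTH

# Probability of one particular sequence, as an exact decimal expansion.
# 80 ** 20 = 2 ** 60 * 10 ** 20, so the expansion terminates and 100
# significant digits are more than enough to hold it.
getcontext().prec = 100
probability = Decimal(1) / Decimal(total_sequences)

if __name__ == "__main__":
    print(f"{total_sequences:,}")
    print(format(probability, "f"))
```

Because 80 to the 20th power is a power of two times a power of ten, the probability has a finite decimal expansion, which is why the figure can be written out in full rather than rounded.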

Posted by Bill Poser at 11:57 PM

Little Green Apples at the Blue Moon Bar

Fresh Air "critic-at-large" John Powers, discussing Rathergate yesterday, said:

The first doubts came from conservative blogs with colorful names like "Free Republic" and "Little Green Apples".

Uh, that's "Little Green Footballs".

Complaining about the faults of weblogs, Powers observes that

Some shriek "gotcha!" at tiny factual errors in articles written on short deadlines by people who actually have to leave the house to do their work.

There's no evidence that Powers did much pounding the pavements in researching his commentary. Anyhow, more expenditure of shoe leather wouldn't have helped him to make a correct attribution of the name of the weblog that made the first convincing case for the CBS memos being forged. That's just a matter of elementary care with relevant facts.

Powers condescendingly allows that

Although I myself work in print, and am far too lazy to have a blog of my own, I must say that American political culture is far better for this explosion of lively political voices.

Apparently he's also too lazy to deal with details like the actual names of publications -- or does he count on his editors at Vogue (where he's now the film critic) to fix things up if he writes "Leonard Cohen" when he means "Leonard Bernstein", or "Alfred Prufrock" when he means "Alfred Hitchcock"?

A better known representative of the ancien regime, Lewis Lapham, was less gracious to weblogs when interviewed Wednesday by Marcie Sillman on the Weekday radio show:

Sillman: We've had several listeners email in uh- uh- going back to the idea of- of media, asking you what you think about blogs, whether the information that- that has come out on various blogs is- is valuable, is reliable, is something that people should turn to.

Lapham: I don't know enough about blogs. I- I don't scan the Internet and the- so- but I I- guess as a source for clues, or for leads, uh for let's say a newspaper, or a m- Harper's Magazine's a monthly, so we're not into the timely news, might prove useful, but I'm sure it would be very difficult to learn which ones are worthwhile, and which one- I'm sure there- there are a lot of them that are uh simply uh the equivalent of scratching your name on the men's room wall of the, you know, Blue Moon Bar. I've-

Sillman: Have you been there? It's just down the street!

[both laugh]

And that's the end of that question.

Lapham's own reputation for journalistic reliability recently took a hit when he described the speeches at the Republican national convention in an edition of his magazine that was mailed to subscribers several weeks before the convention took place. So of course those finicky bloggers did their gotcha-shrieking thing in response to this tiny factual error, despite the fact that Lapham was writing on deadline and was planning to actually leave his house to attend the convention. Lapham was honest and smart enough to apologize right away, though he explained the fault as a matter of poetic license and mixing up tenses.

I've never visited any of the many Blue Moon Bars that Google finds around the world, but I've read a lot of graffiti over the years, and I don't recall ever having seen a verb-tense error. I'll confess, though, that poetic license has long been as rampant in that medium as it apparently is now at Harper's.

As far as blogs are concerned, you just have to make your own judgment about which ones are worthwhile, based on your experience with the source. That seems to me quite a lot like making a judgment about a magazine article or a radio commentator, and quite different from evaluating restroom graffiti.

[Update 9/29/2004: Here's a relevant discussion by NPR's Ombudsman Jeffrey A. Dvorkin. Key quote:

First, we must acknowledge that the blogs have truly arrived. It is hard for journalists who have led a sheltered life without public accountability to acknowledge that those days are over.
Second, it will be tough for ombudsmen and women to admit that their unique role as overseers on behalf of the public is also changing. We need to make room on the bench and give the bloggers a place at the dinner table. The question remains: who's for dinner?
NPR listeners have always been quick to point out our errors and lapses, and in a non-partisan way. The blogs are different because many are explicitly political. It will be interesting to see if the "blogosphere" still has as much impact on mainstream journalism once the election is over.

But blogs are also different because they have an independent way to reach the public, not subject to the control of NPR or any other institution.

]

Posted by Mark Liberman at 09:32 PM

Role reversal

Reading Mark's response to Marc's phonoloblog post, I was reminded of an interesting example of the 'Old World' vs. 'New World' approach-dichotomy.

Around the time I spent a year at Harvard in 1998-1999, I had a conversation with someone (who would remain nameless even if I were able to recall who it was) who noted that my professional interactions with at least some folks at MIT would be particularly frustrating because (a) if you have the facts right, they'll take aim at the theory, and (b) if your theory is impeccable, they'll take aim at the facts. Old World or New World, you apparently can't win at MIT.

(I should note that my own interactions with MIT folks were not like this during my year in Cambridge, though my e-mail interactions with [famous Spanish phonologist X] -- all prior to that year -- certainly had that character.)

I did see one interesting example of this kind of interaction, however, between Chomsky and Anders Holmberg, at one of Chomsky's Thursday afternoon lectures. (I had to go to at least one -- when-in-Cambridge and all that). After discussing some facts from Icelandic (which he referred to as "the ebola of linguistics"), Chomsky was explaining an important empirical consequence of some minimalist assumption or other. While the rest of us were still processing it (or trying not to fall asleep too conspicuously), Holmberg raised his hand and Chomsky called on him.

Holmberg explained a set of facts from mainland Scandinavian that, by the time we had all processed everything, clearly contradicted the empirical consequence of Chomsky's minimalist assumptions. I wish I had a record of the actual wording that Chomsky used, but his reply was basically this:

You don't throw out all of chemistry just because you've thrown a couple of elements together and caused a beaker to explode.

True enough, but the analogy escaped me then and continues to escape me now.

Now, in this scenario, Holmberg was arguably the New World representative. But does the Old World really want to claim (this example of) Chomsky?

[ Comments? ]

Posted by Eric Bakovic at 02:40 PM

Old Geeks for Truth

Andrew Cline at Rhetorica: Press-Politics Journal says that

There will be a serious announcement appearing on this blog, and on A Voyage to Arcturus, sometime in the next few days regarding the creation of an orderly mechanism by which the mainstream media can draw upon the expertise of old geeks, young geeks, and SME (subject-matter expert) bloggers in order to improve the quality of their reporting, especially in regard to technical/specialty items. Watch for it. We prospectively thank you for your support ;-)

Jay Manifold at A Voyage to Arcturus (jokingly?) refers to this in advance as "Old Geeks for Truth".

I'm a big fan of disorderly mechanisms, myself. At least, I think it's hard to create "orderly mechanisms" that are worth more than their cost, so that the partisans of order (among whom I also include myself) need to pick their spots. Still, I look forward to this experiment with great interest.

With respect to linguistic issues, we here at Language Log have spent some time examining (what we think are) mistakes in the popular press and also sometimes in the scientific literature. My prejudice is that the best long-term way to improve the coverage of language-related issues is to work for better linguistic education, not to promote access to a better set of linguistic experts. But there's nothing wrong with better experts, either...

And heck, maybe someday a big political controversy will hinge on the PDP-9's control panel, or the instruction set of the DDP-224, or one of my other long-useless bits of specialized knowledge.

Posted by Mark Liberman at 10:22 AM

Sleep your way to ocular health

I stopped by the eye care center for a routine check of my ocular acuity, and although by the end of the session my pupils had been chemically dilated so things were looking a bit strange to me, I had enough clarity of vision to notice the title of a pamphlet in a rack near the reception desk:

SLEEP YOUR WAY TO GREAT VISION!™

It did seem an interesting proposition. I have often heard of people who were said to have slept their way to the top, so I knew the idiom that was in play here. (Notice, by the way, that the quoted sentence is trademarked. They are clearly serious about the exact wording of what they are saying.) It seemed most interesting to me that there would be some way in which one could ensure 20/20 vision by similar means. I looked over again at my attractive fair-haired Slovakian oculist, who had told me nothing about this possibility while we were alone together in a darkened cubicle doing close-up tests of my gaze alignment and visual field sensitivity. Past generations tended to be warned about recreational sexual practices that would make you go blind. But now, I was given to understand, appropriately directed promiscuity could contribute to good ocular health? Clearly the liberal, experimental, polymorphously perverse Golden State where I am so happy to live had yet more to teach me about holistic health and wellness practices. I took a copy of the brochure.

Now, notice (may I draw your attention to some grammar, since this is Language Log?) that sleep is standardly an intransitive verb. It's like crawl, laugh, cry, snore, etc., in that it doesn't normally take a direct object noun phrase: you don't sleep something, you just sleep; you don't cry something, you just cry; and so on. Yet in the idiom sleep one's way to something, it does take a direct object noun phrase (underlined). Although the definition of an intransitive verb is one that can be used without an object either overt or understood, it is actually a fairly regular fact about English that most intransitive verbs, even the strongly intransitive ones that aren't the slightest bit ambivalent about their status, are capable of taking an object under one of at least two special conditions:

Cognate object constructions. An object headed by a noun made from the same root as the verb itself, usually with some elaborating modifiers, can be used with lots of intransitive verbs: you can sleep the sleep of the dead; laugh a filthy laugh; and so on.
Resultative constructions. You can laugh someone off the stage (bring about the result that they leave the stage, by means of laughter), cry yourself to sleep (bring about the result that you are asleep, by means of crying)... or sleep your way to something (bring about the result of finding the way to something, by means of sleeping — a euphemism, of course: we all know that to sleep your way to the top you have to do a lot more in the beds of the powerful and influential than just doze off beside them).

Now, you may be sleeping your way through this post, having already had more syntax than you bargained for. When you started out, you thought you were going to get something saucy about improving your vision simply by screwing, didn't you? So did I. But no, I'm afraid the pamphlet turned out to be a boring description of corneal refractive therapy™. It does involve going to bed, but the idea is that you put on specially shaped contact lenses that sort of squidge your corneal topography so by the morning your eyes will focus without your needing glasses or contacts or anything. You talk to your Eye Care Professional about it and he or she contacts Paragon Vision Sciences if it is the right therapy for you. No sexual activity is relevant at all. Sorry. Sweet dreams.

Posted by Geoffrey K. Pullum at 01:21 AM

An odd little linguistic artifact

On Aug. 24, J.D. Lasica posted a blog entry noting that when he went to Google News and clicked on the link "John Kerry" under In the News, the first 35 results (other than those from mainstream newspapers and magazines) were from anti-Kerry rightwing sites, with at least a dozen of these appearing on the first page.

Lasica's post was spotted by the editor of the Online Journalism Review, who asked Lasica to write about it, which he did, in an article appearing on 9/23. The explanation for the effect was worked out by Ethan Zuckerman (see his post about it here), whom Lasica quotes in his OJR piece. Lasica discussed the sequence of events in this 9/23 blog entry.

Here's the proposed explanation:

"I think what you're seeing is an odd little linguistic artifact," said [Ethan] Zuckerman, former vice president of Tripod.com and now a fellow at Harvard's Berkman Center for Internet and Society who studies search engines. The chief culprit, he theorized, is that mainstream news publications refer to the senator on second reference as Kerry, while alternative news sites often use the phrase "John Kerry" multiple times, for effect or derision. To Google News' eye, that's a more exact search result.

A second possible factor, Zuckerman said, is that small, alternative news sites have no hesitancy about using "John Kerry" in a headline, while most mainstream news sites eschew first names in headlines. The inadvertent result is that the smaller sites score better results with the search engines.
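To see how such an artifact could arise, here is a deliberately naive Python sketch. It is not Google News' algorithm, which is not described anywhere in the article; it is a toy scorer that assumes a search engine simply counts exact occurrences of the query phrase. The function name `phrase_count` and the two sample texts are invented for the illustration.

```python
import re


def phrase_count(text: str, phrase: str) -> int:
    """Count case-insensitive whole-phrase occurrences of `phrase` in `text`."""
    pattern = r"\b" + re.escape(phrase) + r"\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))


# Invented sample: full name once, then last name only on later references.
mainstream = (
    "John Kerry spoke in Ohio on Tuesday. "
    "Kerry said the plan was sound. Kerry later left for Iowa."
)

# Invented sample: full name repeated on every reference.
alternative = (
    "John Kerry changes course again: John Kerry says one thing, "
    "then John Kerry says another."
)

if __name__ == "__main__":
    for label, text in (("mainstream", mainstream), ("alternative", alternative)):
        print(label, phrase_count(text, "John Kerry"))
```

Under this toy scoring the text that repeats the full name outranks the one that follows the usual second-reference convention, even though both are equally about the same person. Whether anything like this actually drives the real results is exactly the open question.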

Zuckerman gives some advice on how to game Google News, if you want to do it:

With an occasional exception, Weblogs are generally not found among the Google News results, so Zuckerman had some advice for aspiring political publishers who want to game the search engines: Don't blog -- start an alternative news network. Use terms like George Bush and John Kerry frequently, rather than their last names alone, in both your text and headlines. Publish new works frequently.

I'm not sure how smart Google's algorithms are -- it would be pretty easy to spin off a blog's RSS feed into something that looked like an "alternative news network".

In any case, whether because Google has changed its algorithms or because the pro-Kerry sources have changed their behavior, the results at Google News today seem to be more balanced than what Lasica reports from a month ago. At least, the first page of search results for "John Kerry" now includes AlterNet and Daily Kos (which is a blog, right?!?) as well as Useless-Knowledge.com; not to speak of the Socialist Worker and the Collective Bellaciao, who attack both Kerry and Bush.

This new-found balance is too bad, in a way, because to satisfy a linguist (well, at least to satisfy this one), we should look at some controls. Specifically, we should examine the treatment of a wider variety of political and non-political figures. Do the left-wing "alternative news networks" use George Bush's full name more often than right-wing ones do? What about Dick Cheney and John Edwards? What about down-ticket candidates, and other national political figures like Ted Kennedy or Tom DeLay? How about celebrities without significant partisan political associations, like Ray Charles or Dan Brown?

Zuckerman closes his blog entry this way:

Basically, I think it's an interesting, accidental linguistic artifact which demonstrates just how hard it is to get an AI to do something as complex as laying out a page of news stories. But stop listening to me and go read Lasica's excellent article.

It's clear from Lasica's results that there is some kind of "linguistic artifact" here, but without some further work in quantitative rhetoric, I'm not sure what it is. So my advice is to put on your pajamas and start counting.
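For anyone who does want to start counting, here is a minimal Python sketch of the tally involved: how often a text names a person in full versus by surname alone. The function name `name_usage` and the sample text are made up for illustration, and nothing here reflects how Google News actually ranks stories.

```python
import re

def name_usage(text, first, last):
    """Tally full-name vs. surname-only mentions of one person in a text."""
    full_pat = re.compile(rf"\b{re.escape(first)}\s+{re.escape(last)}\b")
    last_pat = re.compile(rf"\b{re.escape(last)}\b")
    full = len(full_pat.findall(text))    # e.g. "John Kerry"
    total = len(last_pat.findall(text))   # every mention of the surname
    bare = total - full                   # surname without the given first name
    share = full / total if total else 0.0
    return {"full": full, "bare": bare, "full_share": share}

# Hypothetical sample text, not taken from any real news source
sample = ("John Kerry spoke in Ohio. Kerry said the plan would work. "
          "Critics of John Kerry disagreed.")
print(name_usage(sample, "John", "Kerry"))
```

Run over a set of sources and a set of names, the `full_share` figures would give the controls asked for above. Note that this crude version counts "Senator Kerry" or another person with the same surname as a bare mention, so real use would need more careful matching.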

However, if it no longer affects Google News, then your motivation can only be an interest in the statistics of political rhetoric. And I suspect that interest in this topic is somewhat lower than interest in googlebombing. At least, some popular application like googlebombing would be needed to spur the efforts of most amateur quantitative rhetoricians, just as debunking Dan Rather inspired the interest of those who understand the details of typography.

One last point. Apparently, the folks behind Google News changed their algorithms within 24 hours of the publication of Lasica's article. I guess they might have started earlier, perhaps alerted by Lasica's phone call to Krishna Bharat while the story was in preparation. But whatever the exact time line, this looks like pretty fast work. It took CBS longer than that to decide to 'fess up about those forged memos, and they didn't even have to modify any software.

Posted by Mark Liberman at 12:41 AM

September 23, 2004

Which vs. that: integration gradation

A few days ago, I rashly took up a syntactic challenge issued by Geoff Pullum.

Here's the backstory.

First, Geoff took Sidney Goldberg to task for promulgating falsehoods about English grammar, and criticized the National Review for publishing his uninformed pontifications without any linguistic fact-checking. One of the three (out of three) wrong grammatical points in Goldberg's screed was an alleged distinction between which and that. Geoff demolished the notion that "integrated" relative clauses (also known as "restrictive" relatives) require that (and prohibit which) by observing that in six classic novels, the first integrated relative using which occurs on average about 3% of the way into the book.

Second, I drew Geoff's attention to a comment on a livejournal blog that said "okay, but it would be more fun to see stats on how often these Canonical Texts use each one in a ... restrictive way (and in what circumstances?), rather than flagging a single ... example from each text." Geoff responded with statistics from journalistic text, by Doug Biber and others, nailing his point beyond any reasonable doubt.

So far so good. But Geoff went on to argue that there's no point in "noting the thats and whiches and forming semantic hypotheses", because that "would amount to looking for a meaning difference that isn't there". Now, Geoff is a syntactician, and co-author of the monumental Cambridge Grammar of the English Language. I'm merely a phonetician who occasionally dabbles in practical text analysis. But my prejudice in such matters is that optional variants usually do have interestingly different distributions, and that "meaning" is usually part of the story, at least in a weak sense of the word.

There are two uncontroversial semantically-relevant distinctions between that and which in relative clauses in standard English. First, which can't be used with what CGEL calls "personal" referents -- "*the people which speak English" is not standard English. Second, that can't be used in "supplementary" (or "non-restrictive") relative clauses -- "her head, that was covered with a floppy straw hat" is unlikely if not impossible in contemporary standard English.

So I decided to look for what you might call ripples or echoes of those two distinctions, in contexts where that and which are both fully grammatical. I started by looking for evidence that the personal/non-personal distinction might have a non-trivial influence on the choice among that, which and who. I found several contexts where that is used much less often for "personal" referents than we would expect, based on the ratio of uses of who vs. which and similar considerations. This suggests, at least, that perhaps that has come to be tinged with a bit of "non-personal" meaning. I might venture (on no evidence whatsoever) to predict that this is an unstable situation, and that over time, we might find this tinge deepening and becoming categorical. At least, that's the sort of thing that sometimes happens in the history of syntax.

In this post, I'm going to take up the second idea, namely that perhaps which is tinged with a bit of "supplementarity", even in the context of integrated relatives. The idea here is to look at categories of "integrated" relative clauses that are in some sense more or less tightly "integrated", and see whether the difference in degree of integration affects the probability of using which (or who) vs. that.

Here's the idea that I started with. There are some kinds of relative clauses in which a quantifier or other operator binds the relative especially tightly to the interpretation of the syntactic head, e.g. "the only thing that trumps fear is greed". In contexts like this, which seems much less natural to me than that, though that still seems fully grammatical. Similar phrases without only seem somehow to bind the relative clause less tightly, and in consequence to be more amenable to which, e.g. "the thing that is really hard is giving up on being perfect."

Now, I can't offer any plausible logical analysis to cash in this intuitive impression of "binding more/less tightly". But it's easy enough to check the prediction about the relative probability of which and that in these contexts:

	*thing*	*things*	total	place	places	total	grand total
the only __ that	944,000	82,800	1,026,800	61,100	5,890	66,990	1,093,790
the only __ which	38,500	3,280	41,780	1,980	295	2,275	44,055
that/which ratio	24.5	25.2	24.6	30.9	20.0	29.4	24.8
the __ that	658,000	2,300,000	2,958,000	210,000	120,000	330,000	3,288,000
the __ which	66,300	201,000	267,300	70,200	12,500	82,700	350,000
that/which ratio	9.9	11.4	11.1	3.0	9.6	4.0	9.4

The table above shows counts for the words thing(s) and place(s) in the contexts "the only __ that/which" and "the ___ that/which". (Note that with very few exceptions, all of the relative clauses found would count as "integrated" by anyone's standard -- these results cannot be explained directly by the integrated/supplementary distinction). Across these cases, the ratio of that to which is 24.8 when only is present, and 9.4 when it isn't. Q.E.D.
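As a check on the arithmetic, the pooled ratios can be recomputed from the counts in the table. This is only a sketch: the dictionary layout and the name `pooled_ratio` are illustrative choices, and the counts are simply copied from the table above.

```python
# (that, which) hit counts copied from the table above
with_only = {"thing": (944_000, 38_500), "things": (82_800, 3_280),
             "place": (61_100, 1_980), "places": (5_890, 295)}
without_only = {"thing": (658_000, 66_300), "things": (2_300_000, 201_000),
                "place": (210_000, 70_200), "places": (120_000, 12_500)}

def pooled_ratio(cells):
    """Ratio of summed 'that' counts to summed 'which' counts."""
    that_total = sum(t for t, _ in cells.values())
    which_total = sum(w for _, w in cells.values())
    return that_total / which_total

r_only = pooled_ratio(with_only)
r_plain = pooled_ratio(without_only)
print(f"with only:    {r_only:.1f}")   # 24.8
print(f"without only: {r_plain:.1f}")  # 9.4
print(f"ratio of the two ratios: {r_only / r_plain:.1f}")
```

Pooling sums the raw counts before dividing, so high-frequency cells such as "the things that" dominate the result; averaging the per-word ratios instead would weight each word equally.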

This difference seems to be something particular about that vs. which. The other personal relative pronoun, who, doesn't seem to be affected nearly as much:

	*people*	*group*	*category*
the only __ that	83,500	15,200	2,320
the only __ who	381,000	2,590	10
the only __ which	320	1,640	301
that/who ratio	0.22	5.9	232
that/which ratio	260.9	9.3	7.7
the __ that	1,710,000	635,000	101,000
the __ who	7,740,000	118,000	585
the __ which	63,900	210,000	82,900
that/who ratio	0.22	5.4	173
that/which ratio	26.8	3.0	1.2

Nouns like people, group and category can have personal as well as non-personal referents, and so occur in reasonable numbers with who as well as which and that, as the above table shows. But the that/who ratio is only slightly increased by the presence of only (between 0 and 34% in these examples), while the that/which ratio is much more strongly affected (it is roughly 3 to 10 times higher with only in these examples).

The table below summarizes the effects of only on the that/which ratio of five different cases:

	thing(s)	place(s)	people	group	category
the only __ (that|which) [that/which ratio]	24.6	29.4	260.9	9.3	7.7
the __ (that|which) [that/which ratio]	11.1	4.0	26.8	3.0	1.2

As crude support for the idea that other sorts of quantification of the head have a similar effect, compare the following two tables. The first one looks at a variety of quantifiers with things as head and a definite article present, where the that/which ratios vary from 17.2 to 41.6:

	that	which	that/which ratio
the only things	82,500	3,280	25.2
all of the things	63,700	1,530	41.6
all the things	299,000	15,700	19.0
some of the things	217,000	7,960	27.3
few of the things	24,100	633	38.1
the few things	29,100	761	38.2
the three things	10,100	588	17.2

Now we look at "the things" (without additional quantification) as the NP in a variety of prepositional phrases, where the that/which ratios vary from 2.7 to 13.2:

	that	which	that/which ratio
for the things	53,500	9,400	5.7
to the things	42,800	7,990	5.4
from the things	18,800	6,860	2.7
with the things	24,800	1,880	13.2
by the things	21,400	6,930	3.1
because of the things	3,860	498	7.8
without the things	910	86	10.6

Again, nearly all of the examples in both tables are integrated relative clauses. But I think it's fairly clear that quantification of the head tends to predispose the choice away from which and towards that. At a minimum, I'd submit that this is a "semantic difference" that influences the choice between the two words, in contexts where both are fully grammatical. I hypothesize (without any evidence) that the influence arises because of some kind of psychological gradient of integration, where the process of interpreting the quantifier somehow binds the relative clause more tightly to its head, at least in processing terms, and therefore biases the choice away from which and toward that.
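One quick way to see how cleanly the two tables separate is to recompute the ratios and compare their ranges. The sketch below copies the counts from the two tables above; the variable names and the helper `ratios` are illustrative only, and a range comparison is of course no substitute for a proper statistical test.

```python
# (that, which) hit counts copied from the two tables above
quantified = {
    "the only things": (82_500, 3_280),
    "all of the things": (63_700, 1_530),
    "all the things": (299_000, 15_700),
    "some of the things": (217_000, 7_960),
    "few of the things": (24_100, 633),
    "the few things": (29_100, 761),
    "the three things": (10_100, 588),
}
prepositional = {
    "for the things": (53_500, 9_400),
    "to the things": (42_800, 7_990),
    "from the things": (18_800, 6_860),
    "with the things": (24_800, 1_880),
    "by the things": (21_400, 6_930),
    "because of the things": (3_860, 498),
    "without the things": (910, 86),
}

def ratios(table):
    """Map each context to its that/which ratio."""
    return {ctx: that / which for ctx, (that, which) in table.items()}

q = ratios(quantified)
p = ratios(prepositional)
print(f"quantified:    {min(q.values()):.1f} to {max(q.values()):.1f}")
print(f"prepositional: {min(p.values()):.1f} to {max(p.values()):.1f}")
# True when even the highest unquantified ratio is below the lowest quantified one
print("ranges disjoint:", max(p.values()) < min(q.values()))
```

With these counts the two ranges do not overlap at all, which is the pattern the argument above relies on.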

Posted by Mark Liberman at 11:36 PM

Inexpert and expert phishing spam

My friend Nathan Sanders has shown me a phishing spam that he got which purported to be from Citibank. It did very badly indeed on linguistic accuracy and thus was much easier than usual to spot as trickery. In fact it's a little lesson in grammatical and orthographic slip-ups all on its own.

From: Citibank Subject: ATTN: SafeGuard your account (Citi.com) MsgID# 80309245
Dear Customer: Recently there have been a large number of cyber attacks pointing our database servers. In order to safeguard your account, we require you to sign on immediately. This personal check is requested of you as a precautionary measure and to ensure yourselves that everything is normal with your balance and personal information. This process is mandatory, and if you did not sign on within the nearest time your account may be subject to temporary suspension. Please make sure you have your Citibank(R) debit card number and your User ID and Password at hand. Please use our secure counter server to indicate that you have signed on, please click the link bellow: http://219.138.133.5/verification/ !! Note that we have no particular indications that your details have been compromised in any way. Thank you for your prompt attention to this matter and thank you for using Citibank(R) Regards, Citibank(R) Card Department MsgID# 80309245
(C)2004 Citibank. Citibank, N.A., Citibank, F.S.B., Citibank (West), FSB. Member FDIC.Citibank and Arc Design is a registered service mark of Citicorp.

Those of you who are taking my distance-learning course in Forensic Syntax For Spam Detection should spend a moment listing the errors in this text. You should be able to find ten errors.

* * * * * * *

O.K., time's up. I'll just run through the correct answers.

"SafeGuard" in the Subject line has a spurious capital G. This word is not a trademark (at least, not here), it is just an ordinary English verb. The spammer was being too clever with capitalization.
The phrase "pointing our database servers" is not grammatical, or at least not meaningful. I'm not sure where that error comes from. "Targeting our database servers" would make more sense.
The phrase "personal check" would not normally be used to mean "check or test that you have to carry out personally", or "check or test to verify your personal information", because it is used instead to mean "check written by an individual as opposed to a corporation". It's not ungrammatical, but it's a sign of not being familiar with American English banking talk.
"This personal check is requested of you as a precautionary measure and to ensure yourselves that everything is normal..." has a badly chosen word, ensure. You ensure that something is done by either causing it to be done or checking that it has been done; you don't ensure a person. (You can insure a person, but you should be an insurance agent if you do this.) The spammer meant "assure", not "ensure".
"This personal check is requested of you as a precautionary measure and to ensure yourselves that everything is normal..." has another mistake. The message begins "Dear Customer" (singular). This makes the plural number on yourselves mysterious. It should be the singular, yourself.
The error in "if you did not sign on ... your account may be subject to temporary suspension" is beautiful and subtle, something to warm even the small and stony heart of a grammarian such as I. With all verbs except the copula (be), the preterite inflectional form is used to signal what the irrealis form were signals in the case of the copula. The Cambridge Grammar (chapter 3) calls this a modal remoteness use of the preterite. A particularly clear case of where you need it is in counterfactual conditionals: "If you did not sign on, your account could be temporarily suspended." That means that if a hypothetical world were to arise where you did not sign on (and may that day never come), your account could get suspended, in that world — but it won't in this one, we hope. However, it's crucial that the second part of such a sentence (the apodosis of the conditional) normally also has a modal preterite, often would or could or might, but not will or can or may. You get "If you did not sign on, your account would be suspended" for referring to a hypothetical situation and "If you do not sign on, your account will be suspended" to refer more forthrightly to a claim about what the future is going to be like if you don't sign on. The sentence in the email, "if you did not sign on ... your account may be subject to temporary suspension", should have been "if you do not sign on ... your account may be subject to temporary suspension".
The phrase "within the nearest time" is of course not idiomatic English. Perhaps "at your earliest convenience" was meant.
The phrase "secure counter server" is not known to me and gets no Google hits at all. The spammer meant "secure server", and I just don't know what "counter" was doing in there.
Actually the whole sentence "Please use our secure counter server to indicate that you have signed on, please click the link..." is ungrammatical. It seems to be a very bad run-on sentence with no comma splice: the spammer meant "Please use our secure counter server. To indicate that you have signed on, please click the link..."
In "please click the link bellow", the preposition below is misspelled. (Bellow is a verb meaning "emit a loud, deep, hollow, prolonged sound such as a bull might make, or to speak or shout in a manner reminiscent of this"; that's why a spelling checker wouldn't have caught the mistake.)

So this message is an illiterate, error-stuffed disaster, and the spammer who wrote it will only be stealing the bank account contents of particularly unobservant and linguistically uneducated people: poor people, immigrants, foreigners, semi-literate people, careless readers, not Language Log people at all. Alert Language Loggers are not likely to fall for this piece of junkware.

But beware: I got a message purporting to come from Citibank too, and unfortunately it's grammatically impeccable:

Dear Citibank valued customer,

Citibank is committed to protecting the security of our clients' personal information, including when it is transmitted online. Therefore our ATM services utilize advanced security technology to protect your personal financial information.

In order to be prepared for the smart card upgrade on Visa and MasterCard debit and credit cards and to avoid problems with our ATM services, we have recently introduced additional security measures and upgraded our software.

This security upgrade will be effective immediately and requires our customers to update their ATM card information. Please update your information here

© Citibank Customer Support Dept.

It ended with some invisible words written in white, probably a device designed (unsuccessfully in this case) to fool spam filters: "b 5 2141 arboretum preponderate seoul addle devolve salve bette remembrance loud countdown fascicle milk hook finesse lagging daedalus deanna bluish bonneville condemnate bar transmitted perennial Freddie 1 J rendezvous witt nina catalogue walden apologetic gaspee evacuate enol preferring giveth substantiate ladyfern shepard inclose gary contradistinction 638 65093358[0-255", it said, implausibly but also invisibly. (It wasn't invisible to me because I examine my suspected spam with Unix tools, not the brightly colored click-here tempting toyware that Windows programmers want me to use.)

The second example shows what can be done by literate guys who control the grammar and really know how to phish. Caveat browsor.

Posted by Geoffrey K. Pullum at 01:56 PM

September 22, 2004

Facts, theories, fetishes

On phonoloblog, Marc van Oostendorp quotes Stephen Anderson:

If a paper on ‘the morphosyntax of medial suffixes in Kickapoo’, bursting with unfamiliar forms and descriptive difficulties, is typical of American linguistics, its European counterpart is likely to be a paper on ‘l’arbitraire du signe’ whose factual basis is limited to the observation that tree means ‘tree’ in English, while arbre has essentially the same meaning in French.

Marc adds that "[t]his is obviously a caricature (of the way things were in the 1930s), and a funny one at that, but it is also accurate even to describe the current situation. [...] It is a mystery to me what explains this different academic and intellectual culture, especially since it seems to have been true for such a long time".

For discussion of some related issues, see Marc's extensive comment on my post on (some) Europeans' self-identified difficulty in "getting" the internet. And as a contribution to understanding the stubbornness of this difference between American and European intellectuals (maybe this should be Anglophone and Continental intellectuals?), I'll match Marc's Steve Anderson quote with one from Adam Gopnik's collection of essays Paris to the Moon (pp. 94-97):

My favorite bit of evidence of the French habit of pervasive, permanent abstraction lies in the difficulties of telling people about fact checking. (I use the English word usually; there doesn't seem to be a simple French equivalent.) "Thank you so much for your help," I will say after interviewing a man of letters or politician. "I'm going to write this up, and you'll probably be hearing from what we call une fact checker in a couple of weeks." (I make it feminine since the fact checker usually is.)

"What do you mean, une fact checker?"

"Oh, it's someone to make sure that I've got all the facts right, reported them correctly."

Annoyed: "No, no, I've told you everything I know."

I, soothing: "Oh, I know you have."

Suspicious: "You mean your editor double-checks?"

"No, no, it's just a way of making sure that we haven't made a mistake in facts."

More wary and curious: "This is a way of maintaining an ideological line?"

"No, no -- well, in a sense I suppose ... " (For positivism, of which New Yorker fact checking is the last redoubt, is an ideological line: I've lived long enough in France to see that move coming ...)

"But really," I go on, "it's just to make sure that your dates and what we have you quoted as saying are accurate. Just to be sure."

Dubious look: there is More Here Than Meets the Eye. On occasion I even get a helpful, warning call from the subject after the fact checker has called. "You know, someone, another reporter, called me from the magazine. They were checking up on you." ("No, no, really checking on you," I want to say, offended, but don't -- and then think he's right: They are checking up on me too: never thought of it that way, though.) There is a certainty in France that what assumes the guise of transparent positivism, "fact checking" is in fact a complicated plot of one kind or another, a way of enforcing ideological coherence. That there might really be facts worth checking is an obvious and annoying absurdity: it would be naive to think otherwise.

I was baffled and exasperated by this until it occurred to me that you get exactly the same incomprehension and suspicion if you told American intellectuals and politicians, post-interview, that a theory checker would be calling them. "It's been a pleasure speaking to you," you'd say to Al Gore or Mayor Giuliani. "And I'm going to write this up; probably in a couple of weeks a theory checker will be in touch with you."

Alarmed, suspicious: "A what?"

"You know, a theory checker. Just someone to make sure that all your premises agree with your conclusions, that there aren't any obvious errors of logic in your argument, that all your allusions flow together in a coherent stream -- that kind of thing."

"What do you mean?" the American would say, alarmed. "Of course they do. I don't need to talk to a theory checker."

"Oh, no, you don't need to. It's for your protection, really. They just want to make sure that the theory hangs together... "

The American subject would be exactly as startled and annoyed at the idea of being investigated by a theory checker as the French are by being harassed by a fact checker, since this process would claim some special status, some "privileged" place for theory. A theory checker? What an absurd waste of time, since it's apparent (to us Americans) that people don't speak in theories, that the theories they employ change, flexibly, and of necessity, from moment to moment in conversation, that the notion of limiting conversation to a rigid rule of theoretical constancy is an absurd denial of what conversation is.

Well, replace fact (and factual) for theory in that last sentence, and you have the common French view of fact checking. People don't speak in straight facts; the facts they employ to enforce their truths change, flexibly and with varying emphasis, as the conversation changes, and the notion of limiting conversation to a rigid rule of pure factual consistency is an absurd denial of what conversation ought to be. Not, of course, that the French intellectual doesn't use and respect facts, up to a useful point, any more than even the last remaining American positivist doesn't use and respect theory, up to a point. It's simply the fetishizing of one term in the game of conversation that strikes the French funny. Conversation is an organic, improvised web of fact and theory, and to pick out one bit of it for microscopic overexamination is typically American overearnest comedy.

Gopnik ignores one crucial asymmetry -- the French don't actually have theory checkers, as far as I know. And lord knows, some continental intellectuals could use one. (Though in fairness, I have to admit that many American news organizations seem to have delegated fact-checking to the pajamahadeen in the blogosphere...) But I think Gopnik makes a valid point, all the same. WCFCYA is a characteristically American boast.

[Update: Bill Poser emailed:

I heard a story once about [famous French linguist X]. He gave a talk on autosegmental phonology at Harvard, speaking in very abstract terms. At length, someone in the audience intervened and asked him if he could give an example. X was discomfited. He paced back and forth, scratched his head, and finally strode to the board and wrote: CVCVC.

After reading this anecdote, David Nash emailed:

That gave me a chuckle Bill, and makes me recall how I (eyewitness!) was in an MIT seminar by [famous French linguist X] (where X is undoubtedly the same X as above!), who said something like "Take a word of Mongolian (or whatever)" and wrote up "CVCVCV" and said stuff about it; then "Take a word of Tubatulabal (this name I do recall)" whereupon he erased the "CVCVCV" and wrote up "CVCVCV".

]

[Update: Hubert Truckenbrodt's advice about how to write papers is also relevant:

American-style (caricature)

Right after the introduction, the author makes it clear, what the claim of the paper is.

"In this paper, I present a solution to the old puzzle why eggs play no role in the reproduction of whales. I will show that whales lay eggs, which dissolve in salt water about one minute after they were laid. And that is the reason eggs are not used for reproduction with whales. ..."

After that, the arguments for the claim are presented, if possible comparing the new theory with earlier claims. State the strong arguments that you have. After that, conclusion, and shut up. You say what you have to say, and that's that.

Traditional German-style (caricature)

Did the Ancient Greeks have something to say about the topic? What has been written about it since? Extra credit if you find someone who has commented on it a few centuries ago, and has since been overlooked.

From there, the author builds up slowly and steadily. After about half of the paper, the specific question of the paper comes into view. First in a vague, general way, then somewhat more concretely. One starts to suspect that there may also be a claim made later on. A preference of the author becomes noticeable. At some point, however, the paper ends. Sorry, we are out of space. More next time.

(via Kai von Fintel at semantics etc.)

The trouble with this kind of cultural stereotyping is that it can seem accurate as well as funny, while in the end offering such a varied and diffuse set of stereotypes for overlapping groups that even quite incompatible sorts of behavior on the part of a denigrated or admired type can be accepted as evidence that "yes, they're all like that". Thus in commenting on Kai's post, Tony Marmo offers a stereotype of the Dutch as fact-ridden and uninterested in general points other than negative ones. This may strike some as valid -- I don't see it, myself -- but in any case it's the precise opposite of the characterization of Europeans in general that Marc van Oostendorp started with.

]

[Jeff Erickson at Ernie's 3D Pancakes writes:

We have another word for "theory checkers" in my line of work. We call them referees. On the other hand, we're a bit short (and distrustful) of fact checkers. Far too many algorithms papers claim "in practice, people do X" when in practice, people don't do X, or even anything remotely similar to X. Conversely, in the more practical parts of computer science, fact checkers are endemic, but theory checkers are rare.
So does that make les algorithmistes the French of computer science? Quelle horreur!

While you're visiting Jeff's blog, take a look at his related post on "'Applied' Papers at SoCG", which mutatis mutandis describes the situation in several other technical areas known to me. ]

Posted by Mark Liberman at 09:46 PM

Italian: to vowel or not to vowel

On Monday, Stacy Albin had an article in the NYT under the headline "You say Prosciutto, I say Pro-SHOOT, and Purists Cringe". I can tell you that linguists cringe too, though for different reasons. As Steve at Language Hat put it, "this being the Times, my hopes were not particularly high, and they were not fulfilled". Steve explains that "the article nods in the direction of actual linguistics ('In fact, in some parts of Italy, the dropping of final vowels is common') but basically wallows in the lowest sort of purist chauvinism ('As for the linguistically challenged, who mangle 'prosciutto'...')".

Since Stefano Taschini had emailed me recently on another topic, I asked him about this article, and he responded as follows:

I think that the standard reference is the monumental

Gerhard Rohlfs, "Historische Grammatik der italienischen Sprache und ihrer Mundarten"; (3 vols.) Bern: Francke 1949-54.

translated in Italian as

Gerhard Rohlfs, "Grammatica storica della lingua italiana e dei suoi dialetti"; (3 vols.) Torino: Einaudi 1966-69.

Volume 1 is dedicated to phonetics.

A good on-line reference is Princìpi e metodi di dialettologia italiana, which requires a free registration for full access.

If you just want to have a feeling for the several dialects, at http://userhome.brooklyn.cuny.edu/bonaffini/DP/index.html you can find poems in many dialects with translations in Italian and English (some also with audio.) The site has also an interesting collection of resources. And I cannot but recommend the site dedicated to and written in the dialect of my hometown: http://www.bulgnais.com/ .

Back to the NYT article, which I also saw mentioned on the phonoloblog, I have a couple of comments about it.

1. Gallo-italic dialects (Lombardo, Piemontese, Emiliano, and, in particular, Bolognese), have a rather different phonology from standard Italian. In these dialects many words end in a consonant but they cannot be seen as an apocope of an Italian word. The "fasul" [fa'zu:l], common to Gallo-italic dialects, Veneto and Friulano, is not immediately reconducted to the Italian "fagioli" [fa'ʤɔli] (Pasta e fagioli is a typical northern dish). Similarly, the Bolognese [par'sot] is not an apocope of "prosciutto" [pro'ʃut:o].

2. In neither northern nor central Italian dialects you can find the schwa. In the dialects of central and northern Italy, as well in the standard language, non-stressed vowel sounds are clearly pronounced and the rhythm is syllable based.

3. Leaving Rome and heading south or east, you find a tendency of shortening non-stressed vowels and reducing them to schwas. E.g., the word repeated in the refrain of the Neapolitan song Funicolì Funicolà (almost a trademark for many Italian-Americans, I believe) is [jam:ə], often transcribed as "iammo", closer to the Latin "eamus" (conj. of eo,is, if I correctly remember) than the Italian "andiamo". In some places, such as parts of Abruzzi and Puglie, this leads to the unvoicing of final vowels and ultimately to their disappearance. In these areas you also hear the tendency to follow a stress-based rhythm.

4. Sicilian, phonetically characterized by the presence of retroflexed consonants, not only keeps all the vowel sounds clear and loud but introduces an epenthetic [i] in some consonantic groups. This epenthesis often shows up even when Sicilians speak Italian, with terms like "psicologia" pronounced as [pisicolo'ʤia].

5. Schooling, internal migrations and sixty years of exposure to radio, TV, and movies, have made it such that, virtually, the entire Italian population can speak and understand the standard language, albeit with regional variations and dialect influences. However, using dialect forms in a conversation held in Italian is frowned upon and generally regarded as a "mistake".

6. The misconception that in standard Italian every letter is pronounced (repeated in the NYT article) is probably responsible for the typical English pronunciation of the name Giovanni as [ʤio'va:ni] instead of [ʤo'van:i]. The first i is not pronounced on its own, but simply forms with the preceding g a symbol pronounced [ʤ]. Even in Italy, you occasionally hear school children use the hypercorrected form [ʃi'enʦa]* instead of ['ʃenʦa] for the word "scienza" (science). In this case the "i" is completely mute: were the word spelled as "scenza"* the pronunciation would not change.

In some future equivalent of Albin's article, when NYT writers have the same excellent linguistic education as everyone else, "pro-SHOOT" will be rendered in IPA (I think with the medial palatal fricative voiced, at least if the pronunciations that I hear in the Italian Market in South Philly are representative), and the relevant variations of pronunciation, cheese-making and meat preservation will all be described in accurate and mouth-watering detail.

Posted by Mark Liberman at 09:01 PM

Autumn Day

The Bookish Gardener describes a recent performance of the Brahms B minor clarinet quintet, at which John Harbison read Rilke's poem Herbsttag ("Autumn Day") in German, with an on-the-spot English translation. She links to a Poetry Connection page that gives four translations of the Rilke poem, along with the German original, adding

And here's the thing: even if you have only a smattering of German, it's worth reading the original poem aloud—even without a contemporaneous understanding of the German—the better to appreciate the poem's "music".

She makes the same point with respect to the Basho haiku and its fourteen translations at Bemsha Swing, and the interlinear version that I gave a little while ago. I agree. But wouldn't it be nice (and easy!) to have a few links to digital recordings as well? Many people put their own writing on the net, and their pictures, and (less often) their own music or their performances of someone else's music. But I don't think I've ever encountered a link to a poetry reading, though I'm sure some are out there. And Google has a special site for searching for pictures, but no one has a site for searching for (spoken) audio.

Here's the first sentence of Herbsttag in German, courtesy of Poetry Connection:

Herr: es ist Zeit. Der Sommer war sehr gross.
Leg deinen Schatten auf die Sonnenuhren,
und auf den Fluren lass die Winde los.

Here are the four translations of this first sentence:

Lord: it is time. The summer was immense.
Lay your shadow on the sundials
and let loose the wind in the fields.

(Galway Kinnell and Hannah Liebmann, "The Essential Rilke")

Lord, it is time. The summer was too long.
Lay your shadow on the sundials now,
and through the meadow let the winds throng.

(William Gass, "Reading Rilke: Reflections on the Problem of Translation")

Lord: it is time. The huge summer has gone by.
Now overlap the sundials with your shadows,
and on the meadows let the wind go free.

(Stephen Mitchell, "The Selected Poetry of Rainer Maria Rilke")

Lord, it is time now,
for the summer has gone on
and gone on.
Lay your shadow along the sun-
dials and in the field
let the great wind blow free.

(John Logan, "Homage to Rainer Maria Rilke")

Here's a stab at the (missing) interlinear version...

Herbst tag
autumn day


Herr: es ist Zeit. Der Sommer war   sehr   gross.
lord  it is  time. the summer was   very   big.

Leg deinen Schatten auf die Sonnenuhren,
lay your   shadow   on  the sundials

und auf den Fluren  lass die Winde los.
and on  the meadows let  the winds loose

Now, how about some links to (legally downloadable) digital recordings of the poem in German?

[Update: Chan (the Bookish Gardener herself) emails that

Mark - thanks for the link! Here's a link to a Harper Audio site with audio of author readings, including some poets (notably Wallace Stevens and T.S. Eliot). There must be other similar sites out there, but you're right--the search engines are inadequate for finding them, and the Harper Audio one is the only one I've run across in my travels.
A small correction on a point which I may not have been clear about in my post -- what Harbison read was his English translation only, and he did not read the poem in German first (although it would have been very cool if he had).

I think there's quite a bit of stuff like this out there, but indeed, it's not as easy to find it as it should be.

For example, here is a page of poems by Assunta Finiguerra, who writes in the dialect of Basilicata ("a region bordering with Campania (West), with Puglia (North-East) and with Calabria (South), with a population of about 610.000 people scattered in one hundred and thirty towns"); each poem is given in Basilicata text, Standard Italian translation, English translation, and a link to an .mp3 of a reading by the poet herself. I found this link by following some references due to Stefano Taschini, about which more later.

I'm grateful for this much, and it's unreasonable to ask for more. But in such cases, even a little bit of analytical text (about the language, the context, the meter, whatever) would add a lot.

And someday, we'll have online editions of Rilke (and Dante and Pushkin and Homer), with translations and interlinear analysis and sound files and enlightening footnotes and...

]

[Oh, is my face red! The perceptive Dr. Margaret Marks of Transblawg did the obvious due diligence by typing {rilke mp3} into Google, and the first hit is this...

as she explained to me by email.

]

Posted by Mark Liberman at 06:19 PM

Lakoff hits the big time, blogwise

Coturnix at science and politics has a roundup (also here and here) of blogospheric reaction to George Lakoff's Moral Politics and Don't think of an elephant, with many links, most of which I haven't had time to follow. More on this later.

Amazon's sales rank for Don't think of an elephant is 15, which (according to the mode of calculation that we've referenced before) corresponds to approximately 100 copies per day (sold through amazon, that is). On amazon's bestsellers list, it's just behind Kitty Kelley's Bush family biography (at #14). Other political books in the top 20 include Jon Stewart's America at #1, Unfit for Command at #8, and the 9/11 Commission report at #12.

George is beating Dan Brown's Angels and Demons (#18), but losing to The Da Vinci Code at #10. I'm sure that yesterday's plug in The Daily Kos (with a current average of 283,289 visits a day) doesn't hurt sales -- "the best book this cycle", "you HAVE to get this book", "I'm absolutely smitten". Still, if #15 on amazon means a bit less than 100 copies a day, that's not a very high proportion of daily buyers among 283,289 Daily Kos readers.
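
For anyone who wants the proportion spelled out, here is a back-of-the-envelope sketch. It takes the two figures quoted above at face value and assumes, generously, that every amazon sale came from a Daily Kos visitor and that one visit equals one reader; neither assumption is likely to be exactly true.

```python
# Figures quoted in the post. Treating one visit as one reader, and crediting
# every sale to Daily Kos, are simplifying assumptions for illustration only.
copies_per_day = 100      # rough upper estimate for amazon sales rank #15
daily_visits = 283_289    # Daily Kos average visits per day

share = copies_per_day / daily_visits
print(f"{share:.4%} of daily visitors")
print(f"about 1 buyer per {round(1 / share):,} visits")
```

Even on those generous assumptions, the share comes out to roughly 0.035 percent.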

Posted by Mark Liberman at 10:46 AM

Three vs. four asterisks at Boondocks

The Washington Post has decided not to carry this week's Boondocks cartoons, which explore a fictional reality show called "Can a N***a Get a Job?" (Monday's cartoon, introducing the concept, is here). Specifically, the Post's online edition says that "The Post is not publishing this week's strip because of content issues." Instead, their online comics page is linking to last Sunday's strip, while the print edition is apparently re-running an older set of strips.

This week's Boondocks sequence has obviously posed a problem for papers around the country. At the St. Petersburg Times, Eric Deggans took the issue to his local barber shop, where the consensus seemed to be that "the strips weren't funny enough to justify the pain".

Deggans also called Greg Melvin, the editor at Universal Press Syndicate responsible for Boondocks, and writes that

Anticipating client concerns, the syndicator offered newspapers a choice: one version with the middle letters of the n-word dashed out, another version with the entire word replaced by symbols, and an older set of strips from last year on a different subject entirely. The Chicago Tribune and the St. Petersburg Times are among major newspapers that ran the dashed-out version, but the Washington Post chose to substitute the older strip.

"One paper called and said, 'Can you asterisk out the "a"?' - weird, hairsplitting stuff," said Melvin, who estimated he heard from "seven or eight" newspapers that planned not to run this week's strips, though other outlets could have decided to use the substitute strips without calling him.

It hadn't occurred to me before -- though it's obvious in retrospect -- that there's a sort of information-theoretic issue about how much of a hint about the identity of an offensive word is OK. Of course, in this case, the problems seem to have more to do with the content of the stereotypes being presented than with the simple identity of the word associated with them.
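
To make the information-theoretic point a bit more concrete, here is a minimal sketch that counts how many words remain consistent with a masked spelling and converts that count to bits. The word list is a made-up toy lexicon with a mild expletive standing in for the real thing, and the calculation assumes every matching word is equally likely; a serious estimate would need a full dictionary weighted by word frequency and context.

```python
import math
import re

# Hypothetical toy lexicon, chosen only for illustration.
LEXICON = ["darn", "dawn", "damn", "dorm", "dean", "barn", "burn", "born"]

def candidates(masked, lexicon=LEXICON, mask_char="*"):
    """Return the lexicon words consistent with a masked spelling."""
    pattern = "".join("." if ch == mask_char else re.escape(ch) for ch in masked)
    return [w for w in lexicon if re.fullmatch(pattern, w)]

def residual_bits(masked, lexicon=LEXICON):
    """Uncertainty left, in bits, assuming all matching words are equally likely."""
    n = len(candidates(masked, lexicon))
    return math.log2(n) if n else float("nan")

# Each extra asterisk leaves more candidates, and so more bits of uncertainty.
for masked in ["d*mn", "d**n", "****"]:
    print(masked, candidates(masked), residual_bits(masked))
```

On this toy list, masking one letter leaves a single candidate (no uncertainty at all), while masking everything leaves only the word length as a hint. The hairsplitting over three versus four asterisks is a choice of where to sit on that scale.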

As far as the realities of the situation are concerned, here's a relevant discussion from March 2004 (link by email from Kerim Friedman at Keywords).

[Q_pheevr at A Roguish Chrestomathy has a lovely set of observations about this issue. A sample paragraph:

Another popular approach is to blot out only the vowels. As anybody familiar with Semitic or telephonic writing systems can tell you, t's nt hrd t rcnstrct txt wrttn lk tht. Nor would it be terribly difficult for a determined and reasonably intelligent child to build his or her vocabulary with the help of such input: you just go around saying "Fack! Fick! Feck! Fook! Feek!" and so on until you find the one that makes your parent or guardian cringe. Much of the time, I think this kind of bleeping is done purely pro forma, as a gesture of respect to the reader; the writer refrains from printing the whole word, but there is no intention of hiding it from anyone, even the kids.

Q suggests reference to Fiengo & Lasnik (1972) [that's a work entitled "On Non-recoverable Deletion in Syntax"]; but as Kerim Friedman pointed out by email, Freud (1913) [that's "Totem and Taboo"] is just as relevant. ]

Posted by Mark Liberman at 09:39 AM

Phishing

The kind of email scam that Geoff Pullum recently discussed is called phishing in the trade -- see this anti-phishing website for more detailed information.

The basic etymology is simple and obvious -- the scammers are "fishing" for gullible customers. The orthographic substitution of ph for f is by analogy to "phone phreaking". I suspect that the band Phish may have been inspired to use the same f-to-ph substitution by the same analogy, but I haven't been able to confirm this.

From the Jargon File 4.4.7:

phreaking: /freek´ing/, n.: [from ‘phone phreak’]; 1. The art and science of cracking the phone network (so as, for example, to make free long-distance calls).; 2. By extension, security-cracking in any other context (especially, but not exclusively, on communications networks) (see cracking).; At one time phreaking was a semi-respectable activity among hackers; there was a gentleman's agreement that phreaking as an intellectual game and a form of exploration was OK, but serious theft of services was taboo. There was significant crossover between the hacker community and the hard-core phone phreaks who ran semi-underground networks of their own through such media as the legendary TAP Newsletter. This ethos began to break down in the mid-1980s as wider dissemination of the techniques put them in the hands of less responsible phreaks. Around the same time, changes in the phone network made old-style technical ingenuity less effective as a way of hacking it, so phreaking came to depend more on overtly criminal acts such as stealing phone-card numbers. The crimes and punishments of gangs like the ‘414 group’ turned that game very ugly. A few old-time hackers still phreak casually just to keep their hand in, but most these days have hardly even heard of ‘blue boxes’ or any of the other paraphernalia of the great phreaks of yore.

[Update: Adam Merton Cooper emails

I was raised in Philadelphia (apparently you work there), home of the baseball Phillies. Officially, the baseball mascot is the "Phanatic", & unofficially many of us call ourselves "phans of the Phightin' Phils". Less obvious is a phenomenon from my college days (1987), in which an acquaintance named Fred drunkenly spelled his own name "P-H-E-D". In the dorm, this spawned substitutions like "P-H-U-N: fun" before spinning out of control ("P-H-T-Y: party!").
All to say that f->ph can happen spontaneously & independently, I think.

Sure enough. It's pretty clear that "phishing" (in the telecom/email scam sense) is based on an orthographic analogy to "(phone) phreaking"; but the band Phish could easily be unconnected. Or connected only by virtue of slightly greater "ph" coolness in the countercultural zeitgeist when the band was being named, due to the phonetic penumbra of phreaking.]

Posted by Mark Liberman at 08:37 AM

Forensic syntax for spam detection

The spammers get cleverer all the time. The email I got from the address of my bank, Wells Fargo Bank, at a proper-looking commercial address ending wellsfargo.com, had the bank's official logo in the right colors (as you see it here: it appears to be served from a248.e.akamai.net/7/248/1856/bb61162e7a787f/ where there is a subdirectory called www.wellsfargo.com within which is a file with the relative pathname /img/header/logo_62sq.gif; the logo may be the actual genuine one, not an imitation as an earlier version of this post suggested). The email has the picture of the guys on the stagecoach and everything. The visual details are just about perfect. The message looked businesslike, it looked real. It appeared to even a fairly expert eye to come from my own bank. What it wanted was for me to visit a certain website where the bank's security system would just check a couple of details like my account number and mother's maiden name, and then it would confirm that things were now fine and I would be able to go on using my ATM card. The message began:

During our regular update and verification of the Wells Fargo ATM Service®, we could not verify your current information. Either your information has been changed or incomplete, as a result your access to use our services has been limited. Please update your information.

But the spammers messed up. Their syntax let them down. Did you spot the two slips? It's bad luck for those recipients who didn't, because they'll believe this is the bank talking, and in many cases they'll click the link, and they'll answer the questions, and in the morning their checking balance will be $0.00 and their money will be in Africa or Taiwan or Poland or somewhere. You need to be sharp on your grammar to spot the crooks these days.

Look at the second sentence:

"Either your information has been changed or incomplete, as a result your access to use our services has been limited."

First, that has an illicit reduction (they should have said "Either your information has been changed or it is incomplete"), and second it continues with a comma splice ("as a result..." should have been preceded by a big-league punctuation mark like the period, semicolon, or colon, but a wimpy little comma won't do it). Just enough in the way of syntactic slips to sound illiterate, and to convince me that foreign criminals wrote the text and Wells Fargo knew nothing about it and the last thing in the world I should do would be to visit their website and supply some updated security information. So don't ever tell me that being a grammarian doesn't have cash value! Thousands of people fall for these bank security-check scams every day (this one came decorated with a warning at the bottom that you could not initiate the process by calling their customer services line, it had to be initiated by them through email; that's to try and stop people calling the bank to check). Many people who clicked and answered the questions will find their bank accounts have been raided tomorrow. Syntactic analysis can save you real money.

[Note added September 22, 8 a.m.: The first version of this post asserted that no bank ever corresponds with customers about security matters by unsolicited email. But wolfangel told me by email, to my utter astonishment, that at least one bank (Wachovia Bank) did send unsolicited emails to its customers about updating their security information. So that clinches it: grammatical analysis is actually a better source of evidence about whether your bank is emailing you than is general knowledge about bank security practice. Got syntax? Take my course.]

Posted by Geoffrey K. Pullum at 12:58 AM

September 21, 2004

Medical Interpretation

Eric Bakovic's post about the lawsuit filed by a physician in San Diego opposing a federal requirement that federally-funded clinics and hospitals provide interpreters for patients with poor knowledge of English, and its successor, and Mark Liberman's response, all address the bizarre claim that this requirement violates physicians' freedom of speech. I take Mark's hypothetical example to show that in principle, yes, a requirement that a speaker provide translation of his speech could interfere with freedom of speech by making that speech impracticable, but I don't consider it applicable in this case, for two reasons.

First, in the medical situation, it isn't impracticable or even enormously burdensome. Second, in the medical situation, it isn't the kind of infringement of freedom of speech that concerns civil libertarians and the First Amendment. To take an even more extreme example than Mark's hypothetical, let's consider a law that required anyone speaking in public (for concreteness, let's say to three or more people in a public place) to provide translation into any language spoken by anyone in the audience. With current technology, this would impose an impossible burden and effectively make public speaking impossible. I think that we would all agree that this would be an intolerable infringement of freedom of speech and that such a law would be unconstitutional. Mark's example of university colloquia is comparable. The doctor's speech to a patient, however, falls into another category. I submit that it is commercial speech, which does not receive the same protection. Prohibiting deceptive advertising infringes in a way on the advertiser's freedom of speech, but this infringement is acceptable because commercial speech does not deserve the same protection as other speech and because there is a strong public interest in the restriction. Requiring translation of a doctor's speech to his patient is no more of a violation of his freedom of speech than is taking away his medical license for giving incompetent advice to patients.

The lawsuit also raises issues of liability for the physician if he fails to provide competent interpretation. I agree that that is potentially a valid issue. It would be unfair to require more of physicians than they can reasonably do. They don't have the time or, generally, the competence, to make a careful evaluation of an interpreter's ability. However, I don't think that this is a serious issue. To begin with, there doesn't seem to have been a spate of lawsuits against physicians on this basis. In fact, as far as I can tell, there haven't been any. Furthermore, as I read the Health and Human Services Guidance Memorandum, the regulations aren't very strict. All they say is that the physician should make an effort to use a competent interpreter. They explicitly provide that the interpreter need not be certified. A physician cannot be expected to give the interpreter a language examination, but he can be expected to ask basic questions, such as whether the person speaks English sufficiently well and whether the person's background makes him or her likely to be sufficiently familiar with medical terminology.

Reading the brief and other statements of the plaintiffs convinces me that there is another linguistic issue here, namely the quality of medical care that results when patient and doctor can't understand each other. I don't think that the plaintiffs understand the nature and extent of the problem. It should be obvious that medical care suffers when doctor and patient can't communicate. The patient can't convey to the doctor his or her symptoms and history, so the doctor can't diagnose adequately and choose appropriate treatment. The patient can't express his or her needs to the nursing staff. The patient can't understand the doctor's instructions. All of this is well known, but the lack of adequate interpretation continues to be a problem.

According to this Medserv Medical News article, the California Assembly is considering a bill that would require hospitals to have official adult translators on hand. It bans the now common practice of relying on immigrants' children as interpreters. According to the article:

The California Medical Association last month warned that patients with limited English skills posed a health threat because they are not able to read medicine and prescription labels, follow doctors' instructions or absorb advice about healthy lifestyles. "Communication between a patient and his or her physician is at the heart of medical care," said CMA President Dr. John Whitelaw.

Here's an article from yesterday's Hartford Courant, entitled Language Still A Barrier For Good Medical Care, according to which:

advocates for immigrants and refugees in Connecticut maintain that a dearth of competent interpreters continues to threaten the health of non-English-speaking residents.

A study by Dr. Dennis P. Andrulis [PDF file] of 4,100 patients in 12 states reported that:

Patients who did not get needed language assistance reported problems that touched nearly all aspects of their healthcare experience. ... the most disturbing finding was that more than one quarter of those unsuccessful in finding needed language services did not fully understand the prescription instructions they were given - a problem experienced by only 2% of the other patients.

The response of Dr. Clifford Colwell, one of the plaintiffs in the case, is that patients' family members should be relied upon as interpreters. That's exactly what the bill in the California legislature intends to ban. To begin with, just because you speak two languages doesn't make you a good interpreter. That's all the more true in a situation which may be stressful for the interpreter as well as the patient. And even if they are competent at ordinary conversation, untrained interpreters may well not be familiar with medical terminology and the medical system. Using a family member as interpreter also raises privacy issues - the patient may not want the interpreter to know about his condition or related matters, such as his sex life. An additional issue is that the family members used as interpreters are often children, since the children learn English more rapidly than their parents. Children, however, are particularly unlikely to understand medical terminology, and their parents may be reluctant to discuss some topics in their presence. Finally, some people have no family who live nearby or are able to come with them and who speak English. Relying on family members to interpret is a dreadful idea.

Another plaintiff in the suit is the Association of American Physicians and Surgeons. This is not just any medical association. It appears to devote its efforts to opposition to efforts that in its view infringe on the sanctity of the doctor/patient relationship. Its positions include opposition to mandatory vaccination of children and to Medicare. The AAPS has posted its comments on the interpreting regulations here, here, and here. What I find striking is that most of the AAPS comments are directed either at the potential liability problems for physicians if a patient decides that his or her problem was caused by inadequate interpretation, or at arguments in favor of requiring the use of English in various contexts, including the claim that it is improper "to shift the burden of understanding English from the listener to the speaker". The little that is said about communication between doctor and patient is disappointing. They suggest that patients may be more comfortable with a family member as interpreter than with someone they don't know. That's true: sometimes they will be. And other times they won't be. The regulations don't require patients to make use of non-family members as interpreters - they require that competent interpretation be made available, and forbid the clinic from requiring the patient to provide a friend or family member to interpret. The AAPS comments don't address the problems of interpreter competence or privacy issues, nor do they say what should be done when the patient doesn't have a family member to interpret. You would hope that a medical association would give more thought to its public policy positions.

The AAPS comments dwell on the need to make clear the status of English as the official language of the United States, and complain about immigrants who "choose" not to learn English. The idea that immigrants refuse to learn English is one of the chestnuts of the English Only movement, and is without foundation. Most immigrants do learn English, and most who don't would if they could. Some people aren't good at learning languages, or live and work in a situation in which they can't. Some people lose their ability to speak a second language when they become old and sick. What is more, there are many people in the United States who speak languages other than English who are not immigrants. It is estimated that 75% of the older generation of Navajos do not speak English. There are many people in New Mexico and Arizona, and still a few in California, who speak only Spanish. They and their ancestors were born in what is now the United States. They aren't immigrants who chose to come to the United States but perversely refuse to learn English.

Without qualified interpreters, patients with limited English cannot communicate adequately with their doctors. Reasonable people can disagree about who should pay for such services (a sensible proposal is that Medicare should add a provider category for interpreters so that the interpreting bill could be reimbursed as a cost of the visit) and whether not providing them is illegal discrimination (the basis for President Clinton's Executive Order), but no reasonable person can claim that interpretation isn't necessary for adequate medical care. For doctors to oppose the requirement that patients be provided with interpretation is for them to oppose giving adequate medical care. Physicians have an ethical and legal duty to provide competent medical care, which isn't possible if doctor and patient can't communicate. Some of the arguments of the plaintiffs in this case are just thoughtless and ignorant, but what the AAPS position statements suggest to me is that the main problem is that the plaintiffs are bigots with an impoverished understanding of their ethical obligations.

Posted by Bill Poser at 11:19 PM

Translation and freedom

In a Language Log post on September 2, Eric Bakovic asked the world to "explain to me how having your federally subsidized medical advice and instructions translated limits your freedom of speech". He got just one answer, a self-consciously cynical one suggesting that such a policy violates the medical profession's right to "[be] incomprehensible to ordinary mortals", a straw man that he effectively flattened in a post earlier today.

Now, I'm inclined to agree with Eric that medical professionals ought to be prepared to deal with patients with whom they don't share a language, though I haven't formed an opinion yet about Executive Order 13166 and the suit to overturn it, because I don't know enough about either the facts or the law. And I agree with Eric in opposing "Official English" legislation. However, I'm willing to take up Eric's challenge in a more serious way, acting in the role of advocatus diaboli.

To do so, I need to edit Eric's wording of the question, because his phrase "having your ... advice and instructions translated" is ambiguous. It could mean have in the sense of "to be subject to the experience of" -- "he had his nose broken in an accident" -- or have in the sense of "to cause something to be done" -- "he had his car waxed to impress his roommate". Although Eric seems to assume the experiencer sense, I'm going to make the causative sense explicit. This is partly because that's what my argument needs, but it's also because that's what the challenge to E.O. 13166 is actually about. According to the initial brief, the various HHS Policy Guidance regulations "[order] medical service providers and others to provide free translation services to limited English proficient (LEP) persons, [and] ensure the competency of the translation". It's also claimed that the same regulations "[expose] these providers to liability under both federal law and malpractice claims". This is not just a matter of letting someone translate for you -- you're required to provide the translation services, at your own expense, and to ensure their quality. If you don't do it, you expose yourself to legal liability, and also endanger your employer or the owner of the facilities that you use.

To see how this might create a first amendment issue, let's shift from medicine to higher education, and from professional services to intellectual discourse. This is a big shift, but not a legally or logically unreasonable one, since the law in question is the general Title VI "Prohibition Against National Origin Discrimination", which applies to schools just as much as to hospitals. In particular, it applies to any university that gets federal funds, and that certainly includes both UCSD, where Eric teaches, and Penn, my own institution.

Now we can create all sorts of mandated-translation scenarios. For example, should colleges and universities be required to provide simultaneous translation of lectures, on demand, for any students with "limited English proficiency" (LEP)? How about providing translations of textbooks and course notes? In a big undergraduate class, where there might easily be LEP students from half a dozen different language backgrounds, this would affect the economics of higher education in a pretty major way. There might be a first-amendment issue in there, since the result would be to radically restrict the range of courses that could be offered and the set of texts that could be used, but let's look at a different hypothetical case instead.

Consider the usual departmental colloquium series, like the one that the UCSD linguistics department offers, or the all-too-numerous series at Penn that I sometimes attend. In general, these talks are open to the public, and those that sponsor them are happy to welcome anyone who wants to come listen. But suppose that we had an affirmative duty to provide simultaneous translation of the speeches -- and also translation of all handouts and slides employed -- into any language that any member of the audience chose to request. This would again be rather expensive, and so it would certainly be an effective way to cut down on the number of talks that could be offered. But suppose in fact that no such talk could go forward, if such a request were for any reason not complied with.

"Sorry, folks, today's talk is cancelled, because the Ukrainian translation of the slides didn't arrive, and the Turkish translator got stuck in traffic".

"But there are 50 of us here who would like to hear the English, Spanish, Chinese and Arabic versions!"

"Sorry, no can do. Legal liability, you know -- the General Counsel's Office has been really strict about this Title VI stuff since Duke lost that big damage suit last year for using unqualified Spanish translators at their bioinformatics symposium."

Well, this is not going to happen. I hope. But what makes this a silly speculative fiction, as I understand the issues, is not legal logic, but rather the low demand in practice for LEP assistance in this context, combined with the high cost of providing it. So as a matter of logic, Eric, would you concede that there's a possible free speech issue in being required to "[have] your federally subsidized advice and instructions translated"?

Posted by Mark Liberman at 08:44 PM

Freedom of speech and obfuscation

In a recent post I challenged Language Log readers to:

explain to me how having your federally subsidized medical advice and instructions translated limits your freedom of speech

Kevin Russell takes up the challenge via e-mail.

Kevin writes:

The cynical answer is: Just as the freedom of religion includes the freedom not to have any religion at all, freedom of speech includes the freedom not to speak -- or in this case the freedom not to make yourself understood. The medical profession has a lot invested in the mystique of being incomprehensible to ordinary mortals. (Not that we linguists would ever dream of doing anything similar. Nah!) If the absence of a locutionary force is a deliberate part of your perlocutionary force, then having the government substitute its own perlocutionary force does indeed infringe on your freedom of speech. (Should government proofreaders be allowed to "correct" Jabberwocky or Finnegans Wake?)

Kevin's right -- that is a pretty cynical answer. My cynical rejoinder is: does this mean that medical professionals (henceforth MPs) are free to only accept patients who don't speak English natively, so that their advice and instructions are better obfuscated? Or to give their advice and instructions in a language that their patient does not speak? (I realize that this is kind of a red herring, but as long as we're being cynical, let's put that aside.)

Kevin continues "[a] bit more seriously, in the interests of reducing our levels of hypocrisy:"

Yes, it's a very good thing for government-sponsored communication to be understandable and understood. Such a good thing that I think it trumps the interests of professions in encouraging obfuscation and even usually the freedom of speech of its employees and sponsorees. But given governments' utterly abysmal success at accomplishing this in every other area, it's not too hard to understand why these doctors might see hit-and-miss attempts at imposing comprehensibility in just this one context as being a bit unfair.

I take issue with Kevin's notion that "translating" is the same thing as "imposing comprehensibility". There is no guarantee (or requirement, as far as I can tell) that a translation of medical advice and instructions from English to any other language will be any more or less comprehensible than the original English, to the patient or otherwise. True, a translator may request clarification from the MP in order to better do their job, but an English-speaking patient also has the right to request clarification directly from their MP -- and, presumably, MPs can in both cases continue to obfuscate if they so choose, at the risk of their patients not understanding them (and, hopefully, choosing to get their medical advice and instructions elsewhere).

To me, it all boils down to the playing field being as level as possible. All else being equal, (native) English speakers are at an advantage when it comes to receiving medical advice and instructions in English. To the extent that federal law sees this advantage as being unfair, translation of federally-funded medical advice and instructions should be provided to the patient free of charge. (As I insinuated in my original post, I think the point of the relevant lawsuit is that at least one of the plaintiffs is of the opinion that federal law should see the advantage of (native) English speakers as fair and just. If they succeed with this lawsuit, it will likely set a precedent that will be very difficult to ignore, much less overturn.)

Kevin concludes:

Most of me agrees with you in saying: Tough luck, then don't take federal money. But part of me asks: If it were a requirement of accepting a grant from a research council that your findings would be butchered by a clueless journalist, how many linguists would simply pass up federal money without complaint?

Personally, I might grumble a bit once I read how my findings have been butchered by the clueless journalist. (I would also cringe at any "corrections" to Jabberwocky.) But I would be consoled by the fact that the original material is still out there, available to anyone who gives a damn, and that I also have the opportunity to respond to any misrepresentations of my work -- whether they were required by the terms of my funding or not.


Posted by Eric Bakovic at 02:37 PM

Stress for Russian tennis players' names

Guest submission by Barbara Partee

All through the television coverage of the US Open tennis tournament this year, the names of many of the Russian women tennis players were pronounced incorrectly. I recently hunted around on the Internet for anything I could find about it, and found this article by Neil Schmidt in the Cincinnati Enquirer (August 18). The article includes a pronunciation guide, which is taken directly from the WTA's own pronunciation guide.

Amazingly, more than half the names are listed with the stress on the wrong syllable. Below is a list along with corrections provided by my husband Vladimir Borschev (Vlah-DEE-mir Bar-SHOFE), who is Russian and who watches tennis on Russian TV, where the players' names are pronounced by Russian sports announcers (and by the players themselves in interviews). Most of the names would be clear to any Russian speaker anyway, although sometimes the stress is unpredictable.

**Player Pronunciation**
Name	WTA Guide (from Schmidt's article)	Comments
3. Anastasia Myskina	Miss-KEE-nah	NO: MYSS-kee-nah
(English speakers are unlikely to be able to distinguish the vowel traditionally transliterated as [Y], so it would probably be acceptable to represent it as "MISS-kee-nah".)
6. Elena Dementieva	De-MENT-ye-vuh	OK
(My husband writes "De-MENT'-ye-vah", but again English speakers are unlikely to be able to distinguish the ordinary consonant [t] from the palatalized [t].)
8. Maria Sharapova	Sha-ra-POH-vuh	NO: Sha-RAH-pa-vuh
9. Vera Zvonareva	Zvon-a-RAY-vuh	NO: Zvah-na-RYO-vah
(If it's easier, it would be OK to write it as "Zvon-a-RYO-vuh" or even "Zvon-ar-YO-vuh". This is the Russian ë, pronounced "YO" and always stressed. But written Russian usually omits the umlaut, so non-Russians often aren't sure whether it's "YE" (stressed or unstressed) or "YO". By the way, if there hadn't been the umlaut, it would have been Zvahn-a-RYEH-vuh, not Zvon-a-RAY-vuh.)
10. Svetlana Kuznetsova	Kooz-NET-so-vuh	NO: Kooz-ne-TSO-vuh or Kooz-net-SO-vuh
14. Nadia Petrova	Pe-TROH-vuh	OK
(My husband writes "Pet-RO-vah", but American announcers are not going to distinguish exactly where the syllable break comes anyway.)
25. Elena Bovina	Bo-VEE-nah	NO: BO-vee-nah or BO-vee-nuh
(I wonder why they wrote 'vah' rather than 'vuh' this time for the last syllable. These unstressed final "a"s are all the same, namely schwa, and "uh" is a good way to represent it in English.)
26. Elena Likhovtseva	Lee-HOFF-she-vuh	NO: LEE-hof-tse-vuh
(All of the last three syllables are unaccented. The "tse" gets a slight secondary stress, but the only real accent is on the first syllable.)
41. Dinara Safina	Sa-FEE-nah	NO: SAH-fee-nuh
71. Alina Jidkova	YID-ko-vuh	NO: Yid-KOH-vuh

It's remarkable -- 8 out of 10 are seriously wrong. 7 of the 10 in the list are given with the accent on the wrong syllable, and an eighth one (Zvonareva) had the stressed vowel wrong in an important way. (The two that were correct are Dementieva and Petrova.) I know from studying Russian that pronouncing Russian names is not easy! But if someone was going to make up a pronunciation guide, shouldn't they have checked more carefully before telling everyone "here's the right way"?

Correspondence with Neil Schmidt, the author of the Cincinnati Enquirer article (who has, however, since left that paper), provided some clues. Schmidt reported to me that the WTA stands by its pronunciation guide. Its spokesperson suggested to him that many players might adopt Americanized pronunciations when they speak with foreign reporters. "Supposedly," Schmidt wrote to me, "the WTA Tour lists its pronunciations by what the players themselves submit".

Posted by Christopher Potts at 11:40 AM

The missing word is Société!

Chris Waigl of serendipity kindly emailed to explain everything. Well, everything about the grammar of "la Mexicaine de perforation", anyhow:

You are missing an ellipsis. La Française des jeux is the (state-controlled) French lottery operator; La Lyonnaise des Eaux one of the two dominant drink water providers (plus other activities). Insurance or financial services companies in particular, but not exclusively, are sometimes called La [feminine form of a toponymical adjective] de/s... . The missing word is société.

The same phenomenon as Le vieux Nice/Marseille/[town name]: Nice, feminine in Italian, is, like any town, of indeterminate gender in French. But vieux doesn't modify the town name at all, it modifies an elided quartier.

I figured there was some missing head noun, but didn't know which one. Nor did I know the pragmatic associations of the construction with company naming patterns.

Chris added more helpful information in a second email:

P.S.: About the definite article. The French news outlets all have (La) Mexicaine de Perforation. Not de la p.. I found this strange, but apparently company names vary.
Google finds:
- La Parisienne de Routage
- La Parisienne de Chauffage Urbain
- La Parisienne de Distribution

This naming scheme is common in companies that deal with infrastructure.
As for a good translation, it's bound to be a compromise.

But now to get a good translation, we need to think about the comparable naming schemes for infrastructure companies in the U.S. or the U.K.

Maybe "the Mexican Consolidated Tunneling Authority", or something like that. It'd take a bit of research and thought.

[Update: though I hesitate to question Mlle. Waigl, I notice that the website for what seems to be "La Parisienne de Chauffage Urbain" identifies itself in full as "Compagnie Parisienne de Chauffage Urbain", not "Société".

...and just as well, too, or there'd be a faux amis mistranslation waiting to happen: "the Parisian Society for Urban Warming". Their motto: "Change globally, warm locally"? (Of course, this is really a public utility that provides heat, I suppose by means of a network of steam pipes.) ]

[Update #2: Stefano Taschini emails to disagree with Chris in one particular, while supporting her account of the phrase originally discussed:

Nice is feminine in French, as can be seen on the pages of the tourist office of Nice where the city is referred to as "la belle Nice".

Many cities are obviously feminine in French: La Nouvelle Orléans, Andorre la Vieille.

In fact, I suspect that most cities are feminine, the only masculine ones being preceded by the masculine article: Le Mans (as in Les 24 Heures du Mans), Le Caire.

I do agree with Chris, though, when she says that masculine adjectives are used to refer to a specific quartier, as in Vieux Nice, and I agree with her explanation.

Stefano adds that

Regarding the absence of the article in the name "la Mexicaine de perforation", well, that conforms to the standard usage, as you would say "une maison de bois" or "Ecole de ski" (but "Ecole du Ski Français").

The extra article ("la M. de la P." instead of "la M. de P.") seems to have been an interpolation by Jon Henley in his Guardian piece of 9/8/2004, picked up from him by some other anglophone writers. As Chris Waigl observes, it does not appear in any of the French-language sources I've read. ]

Posted by Mark Liberman at 10:56 AM

A Mexican perforator by any other name

It was a wonderful story, rivaling the greatest inventions of Jules Verne or Thomas Pynchon. As Jon Henley wrote:

There are, at most, 15 of them. Their ages range from 19 to 42, their professions from nurse to window dresser, mason to film director. And in a cave beneath the streets of Paris, they built a subterranean cinema whose discovery this week sent the city's police into a frenzy.

The bar. The complex electrical circuits and multiple phone lines. The automatically triggered recording of a barking dog. The "couscoussière à deux étages d'où partaient des fils électriques". And by the time the police who discover it get back with the bomb squad, everything is gone, vanished down a hole a foot in diameter, except for a note reading "ne nous cherchez pas" ("don't look for us").

But there are a couple of minor linguistic mysteries here, along with the larger factual questions echoing down the Pynchonesque passages of this «gros coup de pub médiatique». In particular, what is the name of the responsible group? What is its analysis, its structure and meaning?

The 9/7/2004 story in Libération, by Frédérique Roussel and Ludovic Blecher, said that the responsible group "s'est même présenté, sur RTL, sous le nom de «Mexicaine de perforation»". And indeed the text on the RTL web site, framing radio interviews with Lazar Kunsman and Patrick Alk, says that "RTL a fait une enquête sur la 'Mexicaine de perforation'". (The RTL text is dated 9/8/2004, but I suppose that the interviews must have aired a day or two before that).

Jon Henley's 9/8/2004 Guardian story called them "the perforating Mexicans". His 9/11/2004 follow-up used the French term "La Mexicaine de la Perforation".

On 9/18/2004, Eleanor Beardsley on NPR called them "the Mexican perforation".

Now, everyone seems to agree about where the basic referential morphemes here come from. The "Mexicaine" part is due to the fact that the responsible groupuscule frequents a bar called "Le Mexico". The "perforation" part is said to relate to the word perforateur (or perforatrice), about which the Dictionnaire de l'Académie Française says that

Il s'emploie particulièrement au féminin, comme nom, pour désigner une Machine-outil, qui sert à creuser des trous dans la pierre, les roches, les matières dures. ("It is used especially in the feminine form, as a noun, to denote a power tool used to cut holes in stone, rocks (or) hard materials").

The more general sense for perforation given by the same dictionary is "action de percer quelque chose" = "action of piercing something"; but the original RTL interview is explicit that it's the rock-drilling sense that's at issue. Given this, both Henley and Beardsley seem to be guilty of a mistranslation: instead of perforating or perforation, they should be writing and talking about drilling or boring or something like that. Of course, "the Mexican rock-borers" or "the stone-drillers from the Mexico bar" don't quite have the éclat of "the perforating Mexicans" or "the Mexican perforation".

But I knew that much from the beginning. What still puzzles me is not lexical, but grammatical, and deals not with the translation, but the original French. I understand the basic contentful morphemes here, but everything about the way they're combined puzzles me.

First, why is Mexicaine singular and feminine? It refers to a group that is semantically plural and of unknown sex. If some singular group noun -- like groupe -- is assumed, what is it? Not groupe, which is masculine. In general, I'm not used to seeing a group of people referred to with a singular feminine form: la Française for "the French"? l'Américaine for "the Americans"? I don't think so.

Second, why the noun + prepositional phrase structure "la Mexicaine de perforation"? This is not NPR's "Mexican perforation", which would be "la perforation Mexicaine". It's not the Guardian's "perforating Mexicans", which would be "les Mexicaines perforatrices" (if they were women), or "les Mexicains perforateurs" (if they were male).

"The (female) Mexican of rock-drilling"? What's with that?

I'm exposing my ignorance here, but how else will I learn?

[Update: for the answer, see here. ]

[I can't resist adding my own hypothesis about this story, which is that these "Mexicans" are actually descendants of the Argentinian crew of the hijacked German submarine Der Aal, described in chapter 37 of Pynchon's Gravity's Rainbow. Here's Larry Daw's summary:

(37) Aboard a hijacked German submarine named Der Aal, the Argentine anarchists lazily plan a film version of Jose Hernandez's epic poem of the Argentine pampas, "Martin Fierro." Squalidozzi has been introduced to Gerhardt von Göll, also known by his nom de pègre, "Der Springer." He has sinister connections, through Spottbillingfilm AG in Berlin (another IG Farben outfit), from whom von Göll used to get cut rates on most of his film stock, especially the peculiar and slow-moving "Emulsion J," invented by Laszlo Jamf. Somehow, it was able to render human skin transparent, revealing the face just beneath the surface. It was used extensively in von Göll's immortal Alpdrücken. He also brought the Schwarzkommando to life in the Zone from out of a film for Operation Black Wing. One day they may shoot Squalidozzi's film on the Lüneberg Heath, where Rocket 00000 will be fired.

]

Posted by Mark Liberman at 07:14 AM

Garden paths at the Guardian

Perhaps it's only my new-world perspective, but this Guardian headline led me down not one, but two (and a half) garden paths:

French left torn in two in row over EU Constitution

First I pictured some poor French people, left torn in two. Maybe Chirac didn't get those hostages freed after all? But no, they were only left torn (though less drastically) two times in a row. No, wait a minute, it's the French political left. Who were torn in two metaphorically. And now I see, it's British row as in "spat".

Posted by Mark Liberman at 05:50 AM

A dishonest implicature

A few hours ago CBS News President Andrew Heyward put out a prepared statement saying, "Based on what we now know, CBS News cannot prove that the documents are authentic, which is the only acceptable journalistic standard to justify using them in the report. We should not have used them."

Not a false statement, yet less than candid. Human languages are tricky that way: you can state something true and simultaneously implicate, in the context at hand, something false. Of course CBS News can't prove the Killian memos are authentic. That's because they are completely obvious fakes. But that's what CBS still won't directly admit. Saying they "cannot prove that the documents are authentic" conversationally implies that authenticity is still a very reasonable hypothesis but they're just having a little trouble coming up with the solid evidence that those epistemologically truculent bloggers in their pajamas seem to need. A dishonest implicature. CBS News still hasn't won back my respect.

Posted by Geoffrey K. Pullum at 12:39 AM

September 20, 2004

Which vs. that: a test of faith

I agree completely with Geoff Pullum's views on the relationship between the which/that choice and the distinction between "integrated" and "supplementary" relative clauses. Copy-editors' strictures against using which in integrated relatives are an invention -- what in ordinary life we would call a lie -- with no basis in the facts of the English language. Specifically, that is no longer used in supplementary relatives; but in integrated relatives, both which and that continue to be in common use by all the best writers, as has been true for centuries.

However, I partly disagree with Geoff on one secondary question. He thinks that "reading a few books and noting the thats and whiches and forming semantic hypotheses" is not worth the trouble, because "it would amount to looking for a meaning difference that isn't there". This violates my belief -- maybe it should be called a prejudice or an article of faith -- that if there's a difference in form, there will generally turn out to be a difference in meaning, at least in one of the weaker senses of that protean word. These differences may be the lingering residue of a lost history -- of etymology or dialect or register -- or they may be an emerging association, engendered by compositional convenience, phonetic resonance or collocational accident. The differences are likely to be contextual and gradient. But my theology of linguistics, which is simple-minded but deeply felt, tells me that we'll find the differences if we look for them.

On the other hand, common grammatical morphemes like that and which are about as unlikely as any words can be to gather this sort of meaning-moss. So to test my faith, I decided to take up Geoff's challenge. It's likely that someone has already explored this area more thoroughly -- I didn't take the trouble to do a literature search -- but I'll present you with the fruits of a few minutes spent Googling.

I looked for evidence relating to two "semantic hypotheses", one having to do with humanity (or perhaps a more general hierarchy of animacy) and the other with (degree of relative clause) integration. I'll discuss the "humanity" finding in the rest of this post, and you can make up your own mind whether the results are worth the trouble. I'll take up the integration-gradation in another post.

It's well known that there's a contrast between which and who as relative pronouns -- CGEL (pp. 497-499) characterizes this as the difference between "personal gender" and "non-personal gender". The facts are interestingly complicated, but the main point is that who is used for humans except in certain special circumstances, and which similarly for non-humans. However, the word that is obviously available for relative clauses with both human and non-human referents: "the man that corrupted Hadleyburg"; "the dog that didn't bark"; "the land that time forgot".

This is a kind of "meaning difference" between which and that -- which requires "non-personal gender" while that imposes no gender constraints. But Geoff already knows this -- he wrote the book, literally.

However, the facts are not quite so simple. If we look at integrated relative clauses of the form "those that/which/who ...", we expect to find who used for persons, which used for non-persons, and that used freely for either one. And for who and which, that's just the way it works out. However, in the case of "those that...", there seems to be a strong overall preference (roughly 90%) for human referents. This is far from the 50/50 split that the lack of a personhood constraint might seem to predict. Is my simple faith rewarded? Not yet, as it turns out -- but read on...

Google finds 1,570,000 pages containing the string "those which". I checked three pages of ten instances each (numbers 1, 5 and 10 in my search); unsurprisingly, I found 26 instances of non-human referents, as in

Temperate bonsai are those which require cool winter temperatures.
They ought to regulate their decisions by the fundamental laws, rather than by those which are not fundamental.
This took the form of a questionnaire sent to all Anglican cathedrals in England, followed by individual visits to those which showed a particular interest in being involved in the project.

and no instance of human referents (the other four cases were irrelevant things like "those 'which are you' quizzes", or duplicate pages).

Google finds 15,400,000 pages containing the string "those who", and again, in a sample of 30 I found 29 instances of human referents, and no instances of non-humans.

Google found 6,220,000 pages containing the string "those that". When I checked my three pages of ten examples each, I found 26 with human referents, like these:

Program students are those that live in dormitories or group homes on Heartland property.
Look, Paul, let me put it another way, those that aren't with us are against us.
Will initial teacher training for those that are not yet qualified teachers be different to that done by those joining the programme as qualified teachers?

as opposed to 3 instances of nonhuman referents, e.g.

Great countries are those that produce great people.

So about 90% of the "those that..." examples refer to people. Is Geoff wrong? This looks like a meaning difference (other than the obvious one) influencing the which/that choice. I mean, when a choice is supposed to be completely unspecified, but 90% of the tests go one way, that looks like a pretty big effect.

But it isn't -- because there's a contextual bias. Remember that in the relevant cases where we can tell, personal gender ("those who") is about ten times commoner than non-personal gender ("those which"). 15,400,000 to 1,570,000 ghits, to be precise, or 9.8 times commoner. Combining these two cases, we have 15,400,000/(15,400,000+1,570,000) = 90.7% personal gender.

So it's hardly a surprise that in the case of "those that...", where personhood is ambiguous, 26/29 examples in my sample (89.7%) turned out to have human referents. This is exactly the sort of result we expect from an underlying random process that is biased to produce human referents 9.8 times more often than non-human ones. Chalk up a score for Geoff and the "meaning difference that isn't there."
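The arithmetic behind this comparison can be checked in a few lines. What follows is only a sketch: the hit counts and sample counts are the ones quoted above, while the function and variable names are invented for illustration.

```python
# Google hit counts quoted in the post.
WHO_HITS = 15_400_000    # "those who"
WHICH_HITS = 1_570_000   # "those which"

# Hand-checked sample of "those that" pages: 26 human, 3 non-human referents.
THAT_HUMAN, THAT_NONHUMAN = 26, 3


def personal_share(personal, non_personal):
    """Fraction of cases with a human ("personal gender") referent."""
    return personal / (personal + non_personal)


# Baseline: how often the unambiguous forms pick out humans.
baseline = personal_share(WHO_HITS, WHICH_HITS)
# Observed: how often the ambiguous "those that" sample picks out humans.
observed = personal_share(THAT_HUMAN, THAT_NONHUMAN)
# Number of humans the baseline would predict in a sample of this size.
expected_humans = baseline * (THAT_HUMAN + THAT_NONHUMAN)

print(f"who/which baseline:  {baseline:.1%}")
print(f"'those that' sample: {observed:.1%}")
print(f"humans expected among 29 under the baseline: {expected_humans:.1f}")
```

With a baseline prediction of roughly 26 humans out of 29, an observed count of 26 is about as unremarkable as a result can be.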

But let's continue a little bit further, and add a bit more context, in the form of a verb that selects subjects at the animate end of the great chain of being, like live. The string "those who live" gets 397,000 ghits, and "those which live" only 984; so in the "those (who|which) live" context, personal gender wins 99.8% of the time.

However, the string "those that live" gives me 27,300 ghits, and in a sample of 30 of these, 17 referred to humans and 12 were animals. Only 58.6% human. What gives?

And it gets worse: "those which live", in a sample of 30 (of 984), had 26 instances referring to animals, but 3, unexpectedly, referring to humans -- 10% "personal gender" where we expected none.

Here's a tabular summary of this case:

	whG	personal (of 30)	non-personal (of 30)	% personal
"those who live"	397,000	30	0	100%
"those which live "	984	3	26	10.3%
100*who/(who+which)				99.8%
"those that live"	27,300	17	12	58.6%

Looking at the 3 human heads that I found in my sample of "those which live", it's easy to come up with some possible explanations. In the first place, all three were from old texts, like Malthus' 1798 work "An Essay on the Principle of Population":

(link) The rest of the inhabitants might be 1200 naked miserable and despicable Arabs, like the rest of those which live in villages.

and a passage from a 16th-century work informatively entitled "THE TRUE PICTURES AND FASHIONS OF THE PEOPLE IN THAT PART OF AMERICA NOW CALLED VIRGINIA, DISCOVERED BY ENGLISHMEN sent thither in the years of our Lord 1585, at the special charge and direction of the Honorable SIR WALTER RALEIGH Knight Lord Warden of the stannaries in the duchies of Carenwal and Oxford who therein has been favored and authorized by her MAJESTY and her letters patents. Translated out of Latin into English by RICHARD HACKLVIT. DILIGENTLY COLLECTED AND DRAWn by JOHN WHITE who was sent thither specially and for the same purpose by the said SIR WALTER RALEIGH the year abovesaid 1585. and also the year 1588. now cutt in copper and first published by THEODORE de BRY at his own charges":

(link) The apparel of the chief ladies of that town differ but little from the attire of those which live in Roanoke.

and a sermon preached in 1658:

(link) Those which live in impiety, and depart in their iniquity, they which have here provoked the wrath of God, and goe hence with that wrath abiding on them, as they could create nothing to their relations but sorrow in their life, so must they necessarily increase it at their death.

In addition to being old, all three examples also refer to ethnically or morally subordinated people. Though the N is too small to be very confident about either of these explanations, both seem plausible, and could be explored further if this were a real piece of research and not just an hour's test of linguistic faith.

In any case, we still don't have any explanation for the shortfall in human instances of "those that live..." Here's another little contextual test where it looks like there is a similar problem. This time we'll use the word concern, which predisposes the construction towards a non-human referent:

	whG	personal (of 30)	non-personal (of 30)	% personal
"those who concern"	661	30	0	100%
"those which concern"	3,740	0	30	0%
100*who/(who+which)				15.0%
"those that concern"	4,790	1	29	3.3%

Here indeed the non-personal forms are overall much commoner than the personal ones (about 85% by the who/which test), but again, that is a lot less likely to be human than the who/which ratio in the same context would suggest.
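One rough way to put a number on "a lot less likely" is to ask how probable the observed human counts would be if "those that" simply followed the who/which ratio for the same verb. The sketch below does this with a binomial tail. The binomial model is an assumption added here, not part of the original test: it treats the sampled pages as independent draws, which search results are not. The counts come from the text and tables above (with 397,000 for "those who live", as in the running text), and the names are made up for illustration.

```python
from math import comb


def binomial_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


# verb: (who hits, which hits, humans found in the "those that" sample,
#        number of usable examples in that sample)
CONTEXTS = {
    "live": (397_000, 984, 17, 29),
    "concern": (661, 3_740, 1, 30),
}


def shortfall(who_hits, which_hits, humans, n):
    """Return (who/which baseline, observed human share, P(this few humans or fewer))."""
    baseline = who_hits / (who_hits + which_hits)
    return baseline, humans / n, binomial_cdf(humans, n, baseline)


results = {verb: shortfall(*row) for verb, row in CONTEXTS.items()}

for verb, (baseline, observed, tail) in results.items():
    print(f"{verb}: baseline {baseline:.1%}, observed {observed:.1%}, "
          f"P(<= observed humans) = {tail:.3g}")
```

Under these assumptions the "live" shortfall is far too large to be sampling noise, while the "concern" shortfall is much weaker evidence, which fits the caution about wanting more than two contexts before drawing a conclusion.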

What's going on here?

Is that always less likely to be human than predicted by who/which, as a sort of statistical version of the prescriptivist stricture that I ridiculed in an earlier post? Perhaps, but I'd want to look at more than two contexts before coming to this conclusion. Is that more likely to be omitted in introducing relative clauses when the head is human? Maybe, but subject relatives (where that is hardly ever omitted) form the bulk of these sets, so I don't think this can be the explanation for the effect.

On balance, I think my faith is upheld, though ambiguously and mysteriously.

Coming up: integration gradation.

Posted by Mark Liberman at 04:21 PM

Two bites of authors' remorse

Now that Rodney Huddleston and I have finished our book A Student's Introduction to English Grammar, submitted it to Cambridge University Press, and gone through the (shudder) copy-editing stage, I have finally noticed two things about it. (This always happens. It's the author's analog -- or in this case authors' analog -- of buyer's remorse.) Both of them are somewhat shocking, but I don't know if we can do anything about them.

The first is that the initials of our title form the acronym SIEG. That doesn't seem good at all. But I suppose it could have been worse. If we had added a subtitle like How English Illustrates Linguistics, the acronym would have been SIEG HEIL.

The other shocking thing is more substantive. I just did a global search of the entire electroscript and found that nowhere does the book make the slightest mention of the concept "split infinitive".

Now, I should stress that to us there is no such thing. The word sequences to which people apply the term "split infinitive" are phrases like to really be careful. But nothing is split here. To be is not a word, it's two words. (One of them occurs in I won't because I don't want to; the other occurs in I'm drunk now but tomorrow I won't be; they're completely independent of each other.) To really be careful is an infinitival verb phrase in which to is attached to a verb phrase that happens to begin with an adverb. The imperative Really be careful, now! shows you that this is O.K. A verb phrase is allowed to begin with anything it wants, subject only to the syntactic principles about the contents of verb phrases. The only thing that to cares about is that if it is to attach to a verb phrase to make an infinitival verb phrase, the verb should be in its uninflected plain form (in the case of the present example, the form be, not am or is or been). But that verb doesn't have to be the very first word in the verb phrase. There are six hundred years' worth of examples in good writing that prove it (a wonderful collection is gathered by George O. Curme in his classic volume Syntax). And Arnold Zwicky has carefully demonstrated that in some cases of infinitival verb phrases it is actually required that a modifying word be located right at the beginning of the verb phrase that has the to on it, hence following to and preceding the plain form of the verb.

There isn't anything the slightest bit grammatically wrong with a sentence like You should wear both a belt and suspenders to really be careful. The people who regard such sentences as in need of editing are loonies (or copy editors, which is often very much the same thing). Huddleston and I had almost forgotten that there were such people when we were writing our book.

But there are surely enough loonies out there that we should at least have mentioned the issue. We may have to add something. Perhaps we could just add an entry to the glossary:

Split infinitive: No such thing. Don't be a loony.

Posted by Geoffrey K. Pullum at 01:55 PM

CBS and typography, BBC and herpetology

Ray Girvan at the Apothecary's Drawer Weblog surveys Ra^thergate, and draws a parallel to the BBC's alleged three-headed frog.

Although the frog story probably involved some political motivations, it obviously doesn't have an impact like that of the Andrew Gilligan business or the more recent Ra^thergate. But the frog story, and the chatnanny story, and others like them, help us to understand why organizations like the BBC and CBS cling to indefensible positions long after a rational person might think that they should have let go. It's standard operating procedure. And, alas, it usually works for them.

Posted by Mark Liberman at 10:15 AM

September 19, 2004

Which vs that? I have numbers!

A user who signs in as eub comments here on the topic of my recent remarks about which vs that in relative clauses:

On the "that"/"which" rantlet: okay, but it would be more fun to see stats on how often these Canonical Texts use each one in a relative and a restrictive way (and in what circumstances?), rather than flagging a single ("which", restrictive) example from each text.

and Jason (jcreed) agrees with his remarks ("Yeah, I was thinking the same thing. I imagine that would take much more effort than just pulling up some project gutenberg texts and grepping a few times"). I feel my honor on the point of being besmirched by people who imply (don't get me angry, guys! you wouldn't like me when I'm angry!) that perhaps I am too lazy to get the calculator out of the drawer and do some honest counting.

But in fact those who think they might be interested in the statistics on using which vs that in integrated ("restrictive") relative clauses don't have to wait for me to tap on the numeric keypad for an hour or two; they can find the figures in print, in the Longman Grammar of Spoken and Written English by Douglas Biber and colleagues. I confess to not using this book very much, because it has only a rather hazy theoretical basis (it uses a sort of confused amalgam of early and late Quirk terminology and does nothing to improve on earlier descriptions or clear up residual Quirkian confusedness), but I do use it for this sort of thing. For what it is worth (not very much, IMHO), here is an example of what they provide: some figures (a restatement of what Biber et al. give on page 616, in approximate numbers of occurrences per million words) for relative clauses in American and British newspapers:

	AmE news	BrE news
integrated relatives with which	800	2600
integrated relatives with that	3400	2200
supplementary relatives with which	1400	1400
supplementary relatives with that	0	0

The only striking figure is the last one: you virtually don't get supplementary relatives with that any more — they occur very occasionally, but the ones Biber et al. cite in their text (page 615) look to me like integrated relatives that happen to be set off by commas; the true supplementary ones can be identified unambiguously when the head noun is one that (like a proper name or a uniquely referring definite NP) doesn't take integrated relatives at all, and although I have seen examples of that sort (here's one: His heart, that had lifted at the sight of Joanna, had become suddenly heavy), they are extremely rare.

But as regards the choice between that and which in integrated relatives, which is what eub was wondering about, although there is a clear frequency difference between the dialect groups, it's obvious that both that and which are grammatical in integrated relatives in both dialect groups, in accord with my earlier discussions.
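
To make the frequency difference concrete, the rates in the table can be turned into shares. What follows is a minimal sketch in Python, assuming only the approximate per-million figures restated above; the dictionary layout and the function name which_share are illustrative choices, not anything taken from Biber et al., and the calculation ignores relatives introduced by anything other than which or that.

```python
# Approximate occurrences per million words, as restated in the table above.
integrated = {
    "AmE news": {"which": 800, "that": 3400},
    "BrE news": {"which": 2600, "that": 2200},
}


def which_share(counts):
    """Fraction of which/that integrated relatives that use 'which'."""
    return counts["which"] / (counts["which"] + counts["that"])


shares = {variety: which_share(counts) for variety, counts in integrated.items()}

for variety, share in shares.items():
    print(f"{variety}: {share:.0%} which, {1 - share:.0%} that")
```

On these figures which accounts for roughly a fifth of the which/that integrated relatives in the American news sample and a little over half in the British one: a difference in preference, with both forms well attested on both sides.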

As for why you get the relativizer word that you get in each case, anyone is as capable as I am of reading a few books and noting the thats and whiches and forming semantic hypotheses, but I'm not going to do it, because it is my belief that it would amount to looking for a meaning difference that isn't there. Biber et al. speculatively attribute the difference to a cultural style thing: a greater "willingness to use a form with colloquial associations" among Americans. This is a speculation I would not endorse. One might just as well attribute it to a greater willingness on the part of Americans to accept (unwisely) the pronouncements in Strunk and White's The Elements of Style, that pox-ridden little pocketbook of pointless pontifications.

Posted by Geoffrey K. Pullum at 06:51 PM

Blogorama: part 1

It's getting to be pretty hard to keep up with the language-related blogosphere. Looking quickly through recent posts at the sites on our blogroll...

At Anggarrgoon, Claire Bowern has a post on words for people with deceased relatives in Bardi:

loomiyoon baawa (child who has lost a parent, = orphan; cf loomi baawa, neglected child)
gambaj(oo) (mother who has lost a child, now used as a swear word by Bardi men who don't know its original meaning)
algooyarr (father who's lost a child)
jilarr (man who has lost a brother, sister or cousin)
miiraj (woman who's lost a brother or sister)
galgarr (widow or widower)

At The Audhumlan Conspiracy, Ryan Gabbro writes about a Microsoft project to infer "sentiment" from text. For some reason, "If you know, to use an example he gives, whether or not your feedback contains an adverb followed by a pronoun followed by a preposition, you can classify it slightly more accurately". Ryan thinks that's "kinda neat", but I got stuck trying to think of some sentences with that sequence of parts of speech.

At bLing Blog, Marc Ettlinger tries to figure out the difference between East and West. I'm not sure quite what he means by the "disproportionate ratio of fervor to affect", though. Does he mean that West Coast vs. East Coast linguists -- Geoff Pullum and me, for instance -- relate with a lot of fervor but disproportionately little affect? or a lot of affect but disproportionately little fervor?

At Blogalization, blogalvillager re-blogs the old "Korean tongue surgery" (non-) story. This time the source is a story in the Scotsman, from the Oct. 18, 2003 edition, which in turn seems to have been a reprint of a story that went out on Oct. 17 on the Reuters wire, which I discussed here and here, and which was already discussed on blogalization last April, as I discussed here. I'm not quite sure why this has popped up again -- perhaps the Blogalization poster did a search on the Scotsman's site and didn't notice that the story was a year old?

At Blogos, Andrew discusses a plan to use machine translation at the European Patent Office, and quotes a member of the European Parliament as saying that "the average cost of a patent is about € 30,000, much higher than in the United States, and that is because 40% of those costs are taken up by language problems – the translation costs – and we are trying to get to grips with that problem."

C. Callosum has an interesting post on ventriloquism.

At Carob (a blog), Robin posted a link to a Translation Business Practices Report from the World Bank.

At Classics in Contemporary Culture, we get a link to Tony Ortega's gossipy take on the Naughty Nabob of Nazareth.

At Close Range, Marc Moffett discusses Geoff Pullum's discussion of content clauses and complement clauses.

And there's a new (as of 9/15/2004) language-related weblog, Curves and Corners. Welcome!

That's just through the letter C; but I need to go meet some friends for dinner.

Posted by Mark Liberman at 05:40 PM

The language faculty in a business school

Since language and speech are central to human experience, linguistics should be a normal part of all levels and aspects of modern education. That's my opinion, anyhow, but I recognize that not everyone understands this yet. So I was pleasantly surprised to find that the Copenhagen Business School has Fakultetet for Sprog, Kommunikation og Kultur ("Faculty of Language, Communication and Culture"), which has an Institut for Datalingvistik (translated on their web site as "Department of Computational Linguistics"), which in turn has a Center for Computermodeller af Sprog (translated as "Center for Computational Modeling of Language", and generally identified by the English-based acronym CMOL, rather than Danish CMAS or whatever).

I don't know of any business schools in the U.S. with serious linguistics programs, computational or otherwise. Presumably speech and language seem like more pressing business issues in a country like Denmark, where I suppose that most business activities involve other people's languages. Still, there are plenty of reasons for business education in the U.S. to cover language-related topics, both old and new.

The research projects at CMOL look interesting -- I especially want to find out more about "Discontinuous Grammar". I was also interested to find that the research goals at CMOL are framed primarily in psychological rather than engineering terms:

The general purpose of CMOL (Center for Computational Modelling of Language) is to develop computational models of human language processing (comprehension, production, and learning). This means that research funded by CMOL must aim to model:

the linguistic representations and processing mechanisms used by the brain;

the interactions between the linguistic processing systems and higher cognitive functions, such as reasoning, perception, and memory.

Thus, research at CMOL must aim at developing language formalisms and algorithms that accurately model human language processing, rather than taking an existing formal framework for granted and trying to analyze linguistic phenomena within it. It must also work towards the goal of modelling language users as intelligent agents with linguistic capabilities, inference capabilities, general knowledge, and communicative goals, which jointly determine how the agents communicate.

Nevertheless, there are no trained psychologists among CMOL's research staff. Most projects these days with that sort of staff profile -- computer science, mathematics, computational linguistics -- tend to frame their goals in less psychological terms. Designing airplanes rather than studying birds, so to speak.

A hundred years ago, it was generally regarded as obvious that the study of nature and the design of machines are closely connected. Up to the present moment, a tight and productive connection among psychologists, neuroscientists and engineers has continued in areas related to vision. However, in areas related to sound, the research culture has become much more balkanized, with surprisingly little communication among researchers in the various relevant areas of psychophysics, neuroscience, cognitive psychology, audio and speech technology, phonetics and phonology, sociolinguistics and so on. The reasons for this are complex -- at least I don't entirely understand them -- but it's clear that the negative influence of a handful of strong personalities has been a crucial factor.

Research in CMOL's area -- parsing, discourse analysis, corpus linguistics, lexicography, and the like -- has been somewhere in the middle between the tight federal integration of vision and the hostile and inward-looking city-states of audition. A cynic might (somewhat unfairly) describe the situation as a combination of federal rhetoric and independent action.

Anyhow, I learned about the situation at the Copenhagen Business School because Dan Hardt, a forskningslektor at CMOL, set up a reading group on "The Faculty of Language: What Is It, Who Has It, and How Did It Evolve?" The readings include the 2002 Science article by Hauser, Chomsky & Fitch; Pinker & Jackendoff's reply; the 2004 Science article by Fitch & Hauser; and a series of Language Log comments on Fitch & Hauser. This morning, I saw some people clicking through from Dan's site in our referrer logs, and found the context.

It's gratifying to see weblog postings participating in intellectual discourse within the academy as well as outside it. I know Dan Hardt, who is a Penn alum, and I knew he was in Denmark these days, but I didn't know that his affiliation is at a business school's Faculty of Language :-), and it's also nice to learn about that connection.

Posted by Mark Liberman at 08:42 AM

September 17, 2004

Sidney Goldberg on NYT grammar: zero for three

Sidney Goldberg at The National Review claims that the reputed 150 copy editors over at The New York Times are either illiterate or asleep. He fulminates; he positively foams at the mouth about it. Naturally, Language Log felt it had to investigate. And having had my rabies shots, I was handed this plum assignment. So let's take a look.

The article begins with grumbles that are entirely about spelling. The Times twice misspelled lectern as lecturn; once misspelled "took effect" as "took affect"; and often misspells the preterite form of the verb lead as lead. (It should be led. Nasty little point, that: the metal known as lead has the sound of led but the spelling of lead; and meanwhile the verb read has a preterite that rhymes with led but has the spelling read, which looks like lead, only led is not spelled lead... Are you confused? Then it shouldn't be you that casts the first stone.) I'm with Goldberg all the way on these: these are spelling errors, and you've just got to get your spelling right.

So at this point I was hoping for some grammar examples to get us into more serious territory, but instead Goldberg wanders off for a while into a strange tirade against The New York Times for ridiculing Dan Quayle, who long ago misspelled potato as potatoe when MC-ing a spelling bee (his flashcard was wrong), but not writing any jokey stories about how Chief Justice Warren Burger used to misspell homicide as "homocide" and Associate Justice Harry Blackmun (whose papers were recently released) used to circle the misspelling angrily when commenting on the Chief Justice's draft opinions.

However, Goldberg finally pulls himself out of this bitter rumination on political bias: "All of this concerns orthographic ignorance," he says; "But the Times commits innumerable errors in syntax and style as well." "Innumerable" you say? Aha! I'm all ears: I'm waiting for a long, juicy list of errors of syntax and style. Unfortunately, only three are supplied, and only one is illustrated from The Times itself.

1. That and which. The first charge is that the Times "consistently proves that it does not know the difference between ‘that’ and ‘which,’ greatly favoring the latter." There's only one thing he could be alluding to here: he's one of those people who believe the old nonsense about which being disallowed in what The Cambridge Grammar calls integrated relative clauses (the old-fashioned term is "restrictive" or "defining" relative clauses). Strunk and White perpetuate that myth. I've discussed it elsewhere. The notion that phrases like any book which you would want to read are ungrammatical is so utterly in conflict with the facts that you can refute it by looking in... well, any book which you would want to read. As I said before about which in integrated relatives:

As a check on just how common it is in excellent writing, I searched electronic copies of a few classic novels to find the line on which they first use which to introduce an integrated relative with which, to tell us how much of the book you would need to read before you ran into an instance:

A Christmas Carol (Dickens): 1,921 lines, first occurrence on line 217 = 11% of the way through;
Alice in Wonderland (Carroll): 1,618 lines, line 143 = 8%;
Dracula (Stoker): 9,824 lines, line 8 = less than 1%;
Lord Jim (Conrad): 8,045 lines, line 15 = less than 1%;
Moby Dick (Melville): 10,263 lines, line 103 = 1%;
Wuthering Heights (Bronte): 7,599 lines, line 56 = 0.736%...

Do I need to go on? No. The point is clear. On average, by the time you've read about 3% of a book by an author who knows how to write you will already have encountered an integrated relative clause beginning with which. They are fully grammatical for everyone. The copy editors are enforcing a rule which has no support at all in the literature that defines what counts as good use of the English language. Their which hunts are pointless time-wasting nonsense.
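
The arithmetic behind that list is easy to reproduce. Here is a minimal sketch in Python that recomputes the "how far through" percentages from the line counts quoted above; the variable and function names are hypothetical, and the sketch covers only the division step, since deciding which line holds the first integrated relative with which still takes a human reader. The percentages in the list are rounded, so the recomputed values differ slightly from them.

```python
# (total lines, line of first integrated relative with "which"), as listed above.
books = {
    "A Christmas Carol": (1921, 217),
    "Alice in Wonderland": (1618, 143),
    "Dracula": (9824, 8),
    "Lord Jim": (8045, 15),
    "Moby Dick": (10263, 103),
    "Wuthering Heights": (7599, 56),
}


def percent_through(total_lines, first_line):
    """How far into the text, in percent, the first occurrence falls."""
    return 100 * first_line / total_lines


percents = {title: percent_through(*counts) for title, counts in books.items()}
mean_percent = sum(percents.values()) / len(percents)

for title, pct in percents.items():
    print(f"{title}: {pct:.2f}%")
print(f"mean over {len(percents)} books: {mean_percent:.1f}%")
```

Run on these six books, the unrounded mean comes out in the low single digits, which is all the argument needs: a reader meets the construction very early in each text.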

But it's nonsense that Goldberg firmly believes in, you see. There will be no talking him out of it. He'll be about 3% into his copy of The New York Times and he'll see something like "the idea which they considered" and he'll spit coffee out into his muesli and splutter for his wife to bring him his red pen and he'll circle it furiously like Justice Harry Blackmun circling "homocide"; only the difference is that Blackmun was right, "homocide" is an error. Using which in an integrated relative clause is not, and nobody who has carefully studied the English language would think that it was.

2. What and which. The second of the three syntax points is that the Times "also repeatedly confuses ‘what’ with ‘which’: ‘What movie are you going to see tonight?’" Is there really a confusion here? This case is interesting (there's a beautiful discussion by Rodney Huddleston in The Cambridge Grammar, pages 903–904), but again Goldberg doesn't really know his stuff. The example he gives (which I think is made up) is grammatical. You see, there are differences between which and what, but I'd bet the mortgage money Goldberg couldn't characterize them.

The relevant difference here is semantic. Which is selective: it asks for a pick from a defined list. What doesn't care, and leaves a wide-open field of things to pick from. As a result, you need which in what is called the partitive construction, which makes the set to be picked from explicit: you say Which of these jackets is yours?, not *What of these jackets is yours?. Nobody gets that wrong, including The New York Times. Another consequence is that if you use a cardinal numeral, you'll need which rather than what: you say Which three people in this group photo have spent time in jail?, not *What three people... etc.

But when no range is made explicit, and it's just common sense that tells you what the range must be, both are OK: Which movie are you going to see tonight? is normal; the range to pick from isn't specified, but you can get it from the local paper. What movie are you going to see tonight? is also fine: it leaves the field of movies wide open, but again, the practical possibilities are limited to what's on this week. The difference between which and what doesn't matter in those contexts, and both are common.

Unless Goldberg has caught the Times saying something truly ungrammatical like *What of the candidates will win? (which seems unlikely), he is getting his underpants in a bunch over nothing at all.

3. Had to have been. Goldberg only has one other case. He caught an editorial saying: "By late 2002, you'd have had to have been vacationing on Mars not to know...". He harrumphs that this is a "monstrous construction". And once again he's wrong. He presumably thinks that the last occurrence of have is redundant, on grounds that You would have had to be vacationing on Mars not to know could be used instead (the "Omit needless words" mantra from Strunk and White's toxic little book of crap is doubtless ringing in his ears).

But unfortunately that would change the meaning. To say that in order to be ignorant you would have had to be vacationing on Mars is to say that it would have been necessary for you to be on Mars enjoying your vacation right at that point, the point of ignorance. Whereas to say that you would have had to have been vacationing on Mars is to say, in effect, that it would have been necessary for you to be recently back from a recent Martian vacation.

That is, to be vacationing on Mars (call that being in condition A) is to be there right now, hence out of the office and unavailable for comment. To have been vacationing on Mars (call that being in condition B) is to be back in New York after your two-year flight home showing photos of the Martian desert around at the office. The sentence Goldberg complains about was saying you would have had to be in condition B not to know: to have been vacationing on Mars in the past few months.

So Goldberg is fairly clearly mistaken on all three of the grammatical sins he mentions (only one of them actually illustrated). He's fairly clearly howling at the moon on two of the areas he alludes to, and the syntax employed by The New York Times is right in the only case where he gives a quote.

It is so often like that. The amateur language pontificators (Sidney Goldberg is a retired senior vice president for syndication who used to work at United Media) know very little of the subject they're pontificating about. They don't look anything up in serious grammars or dictionaries. They just shoot their mouths off. And of course (let's face it, politics is involved here), if they're criticizing the (reputedly way too liberal) New York Times, then the (thoroughly and angrily conservative) National Review will publish them without any fact-checking.

Nobody does fact-checking on stuff about language. You may recall the two spectacular cases (here and here) where Mark Liberman caught journalists (Cullen Murphy and John Powers) inveighing in print against "mistaken" word uses, and being wrong, in both cases, on all three cases out of three that they cited. "Can't anybody use a dictionary anymore?", asks Mark. It looks like the answer is no. And they don't know even about the existence of The Cambridge Grammar. Everyone just assumes that whenever a stern grey-haired male professional says somebody's grammar is wrong, the charge must automatically be correct and the accused guilty, and no facts need to be checked. Well, it's not so.

Mr Goldberg, now that you're retired, you can educate yourself. My elementary course on Modern English Grammar starts in just over a week, on September 27th; you have time to get your butt out here to California and sign up as a concurrent-enrollment student (it's filling up, but I'll save you a seat). It's not the editorial staff at The New York Times who need syntax lessons, it's you.

Posted by Geoffrey K. Pullum at 09:25 PM

A kind word for a new Microsoft product

It is a pleasure to be able to say a kind word about the Microsoft Corporation for once: their new product Microsoft Forger really does seem to be the right tool for today's election campaign needs. In one convenient package it offers all of the many necessary features (1971 typefaces, randomized inaccurate centering, built-in spotting and blurring, etc.) for fooling not only the expert CBS staff with their many checks and balances but also the sleazy layabout know-it-alls of the blogosphere sitting typing in their pajamas and flaunting their Photoshop and their font analysis and their excruciatingly detailed knowledge of the standard features of common word processors.

Posted by Geoffrey K. Pullum at 08:53 PM

You couldn't have a starker contrast

It's not quite up to the level of "Let them eat cake". However, among recent symbols of ancien-regime arrogance, it's hard to beat what Jonathan Klein said, on the Fox News Channel on September 9, in a debate with Stephen Hayes about the authenticity of the forged Ra^thergate documents:

"You couldn't have a starker contrast between the multiple layers of checks and balances [at 60 Minutes] and a guy sitting in his living room in his pajamas writing."

Klein used to be the CBS News VP in charge of 60 Minutes, so he knows what he's talking about. And indeed the contrast could not have been starker. The multiple layers of highly-paid journalists, producers and editors at CBS News, and the many layers of less well paid staff working for them, swallowed a set of crude forgeries, hook, line and sinker. In contrast, a small set of unpaid, self-motivated citizens were able to unmask the forgeries within a few hours. By now, anyone with any sense recognizes that the bloggers were right.

As a result of Klein's remarks, I have to confess that I'm generally underdressed for blogging at home, where my usual costume is a t-shirt and shorts; and overdressed elsewhere, wearing a shirt and trousers and shoes. But I'm thinking of buying myself a set of formal blogging pajamas.

The Ra^thergate flap reminded me, indirectly, of a strange, striking article by Lindsay Waters in the 8/30/2004 Village Voice. Here the issue is not the arrogance and ineptitude of (some of) the mass media, but rather the arrogance and ineptitude of (some of) the guardians of high culture. Waters, who is executive editor for the humanities at Harvard University Press, explains that he has "warned humanities scholars and publishers to prepare for a future when publishers ... would go from publishing too many books to too few". In the larger pamphlet from which his piece was taken, he gives the obvious explanation: "We have gone from selling a minimum of 1,250 books of each title in the humanities to 275 books in the past thirty years." This doesn't surprise me -- I can find a way to be interested in a lot of different things, but the fraction of such books in which I can find anything of interest has declined in a similar proportion.

Waters blames librarians, who "have not been protecting book budgets from rapacious commercial presses who gouge them on journals." He blames "the corporatist demand for increased productivity and the draining from all publications of any significance other than as a number." He does observe that this problem has something to do with the fact that humanistic scholars are writing and publishing books that no one much wants to read, but he blames this on "markets", which "generate the pressures that increase productivity". According to Waters' analysis, "when the dollar becomes the ultimate term, the sky closes in."

But no one is making many dollars from a book that sells 1,250 copies, much less one that sells 275. And presumably it's only some minimal regard for financial prudence that's keeping the sales as high as that. The problem is the product, not the market. It's not market pressures that have resulted in the product's lack of appeal, but rather the lack of market pressures, and the broader lack of motivation to reach an interested public. If the sky has closed in on scholarship in the humanities, this is surely not because humanists are out there maximizing the dollar return on their scholarly product. What Waters means by "the corporatist demand for increased productivity" is a system of tenure and scholarly evaluation that depends heavily on publication counts, weighted roughly by the publisher's prestige. I worked for 15 years in an industrial lab, and for 15 years in a university, and I can say with considerable confidence that there is nothing "corporatist" about that system. It's a purely academic invention, and academics should stop trying to blame others for any problems it may have.

What does this have to do with Ra^thergate? Well, just as there are political bloggers and technical bloggers (and gardening bloggers and knitting bloggers and whatnot), there are also humanistic bloggers, who write about novels and poetry and history and philosophy and music and languages and so on. Many people cross categories as well -- the blogosphere is not a great respecter of traditional disciplinary boundaries. The level of knowledge and ability is variable, and there's a lot of junk out there, but some people are very good, and most readers are able to recognize quality when they see it, and to recognize it independent of professional status. I don't know how big the humanistic end of the blogosphere is, but it must include at least thousands of writers and hundreds of thousands of readers.

I'm not trying to make an equation between weblog entries and serious, large-scale works of humanistic scholarship like the Cambridge Grammar of the English Language. But the popularity of many humanistic weblogs shows that there is an audience out there. If the humanities offerings from Harvard University Press aren't reaching them, Waters shouldn't blame Wall Street or Main Street either.

He's probably right that the current regime in academic publishing will not last much longer. Its replacement(s) will emerge from the ferment of experimentation with e-journals, e-print archives and so on. It would be nice if the blogosphere's openness, energy and popularity could be part of that future, whatever it is.

[Note: the illustration above is not, as some may think, a portrait of Geoff Pullum without his parrots. I took it (with some small modifications) from James Lileks' Bleat of Monday, 9/13/2004.]

[I should also add that it's unfair for Waters to blame librarians for failing to (try to) defend their budgets against the increasing costs of journal subscriptions -- they've tried very hard, and continue to try hard, as you can learn by reading back issues of Peter Suber's Open Access News.]

Posted by Mark Liberman at 10:45 AM

September 16, 2004

Catalan, Galician, whatever

Trevor at kaleboel offers evidence that the high-end concept restaurant biz has openings for linguists. Or ought to, anyhow.

Just to add to the confusion, there's the Galicia in Spain that Trevor writes about, and then there's the Galicia now split between Poland and Ukraine. The etymology of the two names is quite different: apparently

"Halicz (certainly from [Ukrainian] halka = 'jackdaw'), [was] formerly the capital of the Russian [Rusyn] land and the seat of the Halicz duchy, from which today's Galicia (Halicia) received its name."

while the Galicia in Spain is from Latin Gallaecia, which apparently has to do with the Gallaeci (also Callaeci, Callaici, Kallaikoi) "a people in western Hispania Tarraconensis". They were Celts (which I gather is what the various forms of their Latin names mean), allies of Hannibal, and finally defeated by Rome in 25 B.C.

Posted by Mark Liberman at 03:57 PM

The jello of bike locks

When Jon Currier of Belmont Wheelworks told the Boston Herald that Kryptonite is "the jello of bike locks", he seems to have meant that it's so well known that the brand has become a name for the generic object. But he also may have been influenced by the recent discovery that most Kryptonite locks can be opened in a few seconds by using the barrel of a Bic pen as a key. "The jello of bike locks" is not a slogan that, shall we say, frames the product in an ideal way.

[Update 9/20/2004: Alex Smolyar emails to say:

See, I would have said "the Kleenex of bike locks" although I suppose that isn't any more effective for marketing purposes.

]

Posted by Mark Liberman at 03:54 PM

Someone set us up the stress clash

Some terrific flyers from Rachel Shallit at a tear in the fabric of spacetime.

Unfortunately, the LSA's "Linguistics, Language and the Public" award is only given in odd-numbered years.

Posted by Mark Liberman at 10:51 AM

The Secret Sins of Academics

A couple of weeks ago, I complained that the latter-day pamphleteers at Prickly Paradigm Press should listen to Kerim Friedman and put their stuff on line, instead of leaving it to languish on an academic press backlist. At the same time, I sent an email to the Prickly Paradigmatics with the same suggestion.

I'm happy to say that their first five titles on their catalog page do now have links to .pdfs (scroll down to Prickly Paradigm #1 - #5). This has nothing to do with my prodding -- a PPP representative responded to me by email that "[o]ver the summer we have been preparing to launch the pamphlets on-line from our website, in partnership with Creative Commons". Unfortunately, Michael Silverstein's pamphlet on political rhetoric, which provoked my interest in the first place, is PP #6, so it's not available yet -- though I gather from Michael and from my PPP email contact that this is in the works, and will be done before too many more weeks have gone by. I guess it takes a while to crank out those .pdfs on their old hand-operated press.

Meanwhile, there are a couple of gems among the first five titles. I was especially taken with Deirdre McCloskey's The Secret Sins of Economics. It's a great pamphlet, engaging and fun to read and (at least for an outsider) quite convincing. The one thing I disagreed with was what she said in passing about linguistics. This worries me, as it should, but now you can easily read what she wrote and decide for yourself.

Here's how she starts:

What’s sinful about economics is not what the average anthropologist or historian or journalist thinks. From the outside the dismal science seems obviously sinful, if irritatingly influential. But the obvious sins are not all that terrible; or, if terrible, they are committed anyway by everybody else. It is actually two particular, nonobvious, and unusual sins, two secret ones, that cripple the scientific enterprise—in economics and in a few other fields nowadays (like psychology and political science and medical science and population biology).

Yet a sympathetic critic who says these things and wishes that her own beloved economics would grow up and start focusing all its energies on doing proper science (the way physics or geology or anthropology or history or certain parts of literary criticism do it) finds herself sadly misunderstood. The commonplace and venial sins block scrutiny of the bizarre and mortal ones. Pity the poor sympathetic critic, construed regularly to be making this or that Idiot’s Critique: “Oh, I see. You’re one of those airy humanists who just can’t stand to think of numbers or mathematics.” Or, “Oh, I see. When you say economics is ‘rhetorical’ you want economists to write more warmly.”

I tell you it’s maddening. The sympathetic critic, herself an economist, even a Chicago-School economist, slowly during twenty years of groping came to recognize the ubiquity of the Two Secret Sins of Economics (in the end they are one, deriving from pride, as all sins do). She has developed helpful suggestions for redeeming economics from sin. And yet no one—not the anthropologist or English professor or others from the outside certainly, but least of all the economist or medical scientist—grasps her point, or acts on it.

And here's how she ends:

Cassandra, you know, was the most beautiful of the daughters of Priam, King of Troy. The god Apollo fell for her and made her a prophetess. In exchange he wanted sexual favors, which she refused. So he cursed her, in a most malicious way. He had already given her the power of prophecy, to know for example what would happen to a science that refused to ask seriously How Much. His curse was to add that though she would continue to be correct in her prophecies, no one would believe her.

Cassandra [to Trojan economists proposing to bring the wooden horse into the city]: The horse is filled with enemy soldiers! If you bring it into the city, economics is lost! Please don’t!

Leading Trojan Economist: Uh, yeah, I see what you mean, Cassie. Good point. Enemy soldiers. Inside. City lost. Qualitative theorems useless for a science. Statistical significance without a loss function equally useless. Economics ruined. Thanks very much for your prophecy. Great contribution. Love your stuff.
[Turning to colleagues] Okay, guys, let’s bring that sucker in!

In the intervening 56 pages, she has sections on "Virtues Misidentified as Sins" (these are quantification, mathematics and libertarian politics); on "Venial Sins, Easily Forgiven" (this is mainly economics' "obsessive, monomaniacal focus on a Prudent model of humanity", so that "[e]verything, simply everything, from marriage to murder is supposed by the modern economist to be explainable as a sort of Prudence"); and "Numerous Weighty Sins Requiring Special Grace to Forgive But Sins Not Peculiar to Economics" (these are Institutional Ignorance, Historical Ignorance, Cultural Behaviorism, Philosophical Naivete, "a high-school version of ethical philosophy", "arrogance in social engineering", "candid selfishness" and "personal arrogance").

On p. 37, she gets to "The Two Real Sins, Almost Peculiar to Economics". In her view, these are proving qualitative theorems and testing statistical significance without a loss function. She argues that these secret sins are so debilitating that

The progress of economic science has been seriously damaged. You can’t believe anything that comes out of the Two Sins. Not a word. It is all nonsense, which future generations of economists are going to have to do all over again. Most of what appears in the best journals of economics is unscientific rubbish. I find this unspeakably sad. All my friends, my dear, dear friends in economics, have been wasting their time.

Her diagnosis is that the Two Sins are really two sides of the same coin: a way of "looking for machines to produce publishable articles", which of course is the Secret Sin of all academics, or at least their Great Temptation:

Economics has fallen for qualitative “results” in “theory” and significant/insignificant “results” in “empirical work.” You can see the similarity between the two. Both are looking for on/off findings that do not require any tiresome inquiry into How Much, how big is big, what is an important variable, How Much exactly is its oomph. Both are looking for machines to produce publishable articles. In this last they have succeeded since Samuelson spoke out loud and bold beyond the dreams of intellectual avarice. Bad science—using qualitative theorems with no quantitative oomph and statistical significance also with no quantitative oomph—has driven out good.

As she points out, the fact that some kinds of intellectual work are without (scientific) value doesn't mean that they're easy to do. Instead, "[t]hey are vigorous, difficult, demanding activities, like hard chess problems. But they are worthless as science".

I'm not really competent to evaluate her argument about the value of contemporary academic economics -- though I can still enjoy it. And in fact I think that there are some remarkably similar difficulties in contemporary academic linguistics, a point that might be worth taking up in some future post. However, I don't entirely agree with what Prof. McCloskey says specifically about linguistics:

... it is only fair to call both the sins of modern economics Samuelsonian. It is rather similar to the situation in linguistics: their Great MIT Leader is Noam Chomsky. Chomsky’s mechanical approach to grammar, fiercely denying pragmatics and therefore the main finding of the humanities in the twentieth century, blocks progress.

The "mechanical approach to grammar" strikes me like those "Venial Sins, Easily Forgiven" -- or even the "Virtues Misidentified as Sins" -- that McCloskey starts her pamphlet by removing from the list of complaints about economics. I do agree that focusing on linguistic form to the exclusion of research on language use is a mistake that blocks progress, but I also feel that Noam has plenty to answer for in the domain of grammatical mechanics.

Posted by Mark Liberman at 07:07 AM

September 15, 2004

Typography, truth, and politics

The documents that CBS, Dan Rather, and 60 Minutes presented as 1972 memos from the Texas Air National Guard, with their putative revelations that George W. Bush tried to wriggle out of his obligations, are crude forgeries. The evidence for this claim is basically linguistic. There are weaker points about style (a military officer writing a memo to file with "CYA" as the subject?) and abbreviatory arcana (OETR for OER), but the strong evidence has to do with technical topics often discussed on Language Log and fairly close to the business of many modern linguists: things like character sets, typographical details, and word processing technology. Enough so, anyway, that the story does merit a brief but rather serious discussion here, and a comment at the end.

The forger was too stupid (or careless) to realize that in order to forge a 1972 document it would be useful to get hold of a 1971 typewriter. The evidence from document analysis is discussed in minute detail on numerous blogs. A thorough summary of the bloggery can be found here. Dale Franks attempts a full compendium of the evidence here. There is a highly detailed typographical analysis by an expert here. I'm not a primary investigator in this, and I'm not even redoing any of this work (it doesn't need it; it will stand); I'll just discuss a few particularly strong points to give the flavor.

One small but telling observation of typography has to do with two characters (which I recently discussed in another context here):

'   ’

The first has the HTML code &#39; and is known as the apostrophe or tick or pock. The second has the HTML code &#8217; and also &rsquo; because it is a 9-shaped right single quote, used to match the 6-shaped left single quote. As I remarked, no font distinguishes the functions by consistent uses of these differently shaped glyphs. The Times Roman font standardly uses the ’-shaped character for both the apostrophe function and the single right quote function, though you can insert the '-shaped glyph if you want to for some special reason. One special reason might be that you wanted to simulate a typewriter: since their invention, typewriters have had only the ' glyph. You were supposed to use it for both left and right single quote functions as well as the apostrophe function. But many people do not seem to notice the difference in shape between these glyphs. And the alleged Bush memos have ’ (see the pictures given as part of the analysis here), the one never found on typewriters. These memos were not typed in 1972.
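For readers who want to check the two glyphs for themselves, here is a minimal Python sketch using only the standard library. The helper names `describe` and `has_typeset_quote` are my own, invented for illustration; the code points and entity names are the standard Unicode and HTML ones.

```python
import html
import unicodedata

APOSTROPHE = "\u0027"          # the straight glyph found on typewriters
RIGHT_SINGLE_QUOTE = "\u2019"  # the 9-shaped glyph used in typeset text


def describe(ch):
    """Return (code point, Unicode name, numeric HTML reference) for one character."""
    return (f"U+{ord(ch):04X}", unicodedata.name(ch), f"&#{ord(ch)};")


def has_typeset_quote(text):
    """True if the text contains the 9-shaped glyph (hypothetical helper)."""
    return RIGHT_SINGLE_QUOTE in text


print(describe(APOSTROPHE))
print(describe(RIGHT_SINGLE_QUOTE))
# The named entity decodes to the same character as the numeric reference.
print(html.unescape("&rsquo;") == html.unescape("&#8217;"))
```

A check like `has_typeset_quote` only works on text that already exists in character form, of course; the memo analysis described above had to be done by eye on scanned images.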

A second and even clearer giveaway feature is the appearance of small-font superscripts in words like 117^th. In 1972 these could hardly be done at all using office equipment. If you had a fancy typesetter, the IBM Selectric Composer, which would have cost you the 1972 equivalent of about $20,000, then if you knew how you could produce something like this effect, but it was a struggle, and involved stopping to adjust the paper position and change the type ball before and after the th (a blog called The Shape of Days gives the full details). But Microsoft Word's AutoCorrect feature and WordPerfect's QuickCorrect feature both automatically change 117th to 117^th as you type if you leave them with the default settings the way the programs come from the factory, unless you leave a space to break up the sequence, getting a thoroughly non-standard look (117 th). The alleged Bush memos have a mixture of 117 th (with a space) and 117^th (with a small-font superscript). They were typed using a modern word processor, like Word or WordPerfect, using the factory defaults. The forger was not careful enough either to switch off the automatic substitution, or to go back and remove the space in 117 th, or to go back and turn the superscript off in 117^th (any of which would have been fairly easy). These memos were not typed in 1972.
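The substitution behavior can be imitated in a few lines of Python. This is only a sketch of the observable effect, not the actual logic inside Word or WordPerfect, which I have not seen; the function name and the regular expression are my own assumptions, and the caret stands in for a real superscript as it does in the text above.

```python
import re

# An ordinal suffix counts only when it directly follows a digit
# and ends at a word boundary. (Assumed rule, for illustration.)
ORDINAL_SUFFIX = re.compile(r"(?<=\d)(st|nd|rd|th)\b")


def auto_superscript(text):
    """Mark ordinal suffixes with a caret, mimicking an as-you-type substitution."""
    return ORDINAL_SUFFIX.sub(r"^\1", text)


print(auto_superscript("117th"))   # suffix attached to the number
print(auto_superscript("117 th"))  # the space blocks the substitution
```

Under this rule the spaced form 117 th passes through unchanged, which is consistent with the mixture of forms seen in the memos.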

A third giveaway is the positioning of the date. It matches perfectly with one of the positions you get if you just tab across the page a bit using the factory tab defaults of Microsoft Word. In fact everything in the document does, as reported here, with screen shots: if you just retype in Word with default margins, default tabs, and default AutoCorrect substitutions, every line break comes in the same place, every line comes out to the same length, even the letter positions are essentially the same down to sub-millimeter levels. These memos were not typed in 1972.

An even clearer piece of evidence lies in something very simple: the centered address at the head of each memo. The memos are not printed on Texas Air National Guard preprinted stationery as you might have expected. The address at the head is typed in the same face as the content of the memo. But the forger made the terrible mistake of using the word processor's centering function, which did the center alignment perfectly. Word processors do such things to an accuracy of something like one twentieth of a point. Typists can only do it at all in a crude way after some careful measurement, and then can only get it to an accuracy of about one character width. The paper can roll into the machine with a few millimeters' difference either way, so it is very unlikely to find the same line typed with matching distances at right and left on two different pieces of paper. Yet the centered addresses at the tops of the alleged memos about Bush match up so perfectly that if you superimpose them you can't see that there's more than one (read a bit further in the reference I gave above to this site for a demonstration). These memos were not typed in 1972.

Mostly it is conservative bloggers who are making these points. A few liberal blogs are resisting the conclusions and some hair-splitting is going on about micro-details of line spacing and superscript heights. It's beside the point as far as I can see: I would say that the forgeries were subjected to repeated faxing and/or graphical scanning to make them look fuzzy and sort of old. Faxing a fax and then making a PDF from the faxed fax will play minor havoc with letter definition and apparent position. Still the stunningly stupid enormity of the forgery is perfectly clear. You really have to be pretty ignorant about word processing (as plenty of journalists and even some bloggers may be) to doubt this evidence.

Where does that put things for the current Presidential campaign? This is Language Log, and we don't get into politics much. If the textual and typographical evidence of the Texas memo forgeries were a mainly political topic it would not be discussed here. But I am going to allow myself to say one thing about political discourse. I do have a modest proposal about the present battle of competing allegations of wartime mendacity and neglect of duty that is afflicting the Presidential campaign. But before I present it, I must stray from linguistics into a neighbouring discipline for a moment, psychology, to bring the Swift Boat Veterans for Truth story into this.

Human memories from a time over thirty years ago (even if people had not been recently talking in prejudicial terms about the events) are worth nothing. Show students a film of a fender bender accident and then ask them for a guess at the speed of the cars "smashing into each other" and some of them will report seeing broken glass in the video when there was none. Call it "bumping into each other" and the speed estimates are lower and they don't have false memories of broken glass. Stage a brief struggle between a black man and a knife-wielding white man in front of a psychology class and get them to write reports and quite a few will report that the negro had the knife. And this is the state of memory reports from only minutes ago using observers who have nothing to gain or hide. The prospects for getting from a committed Republican veteran in 2004 a totally uncolored memory of a couple of minutes on the Mekong river in 1968 when the crux concerns what a certain fleeing Vietnamese man was wearing and the outcome of the Presidential election might hang on what happened? Zero. Nul. Nada. No chance. Not even given the very best intentions, which we probably do not have in this case. It's possible that even contemporary reports might get things wrong. Forget remembered reports by highly interested parties over three decades later.

The Swifties' stories about Kerry therefore align with the story of dereliction of duty told in the forged memos about Bush: none of this nonsense is worth a serious person's time. There is exactly one thing we can do about those stories that is rational: accept military records and actions as definitively settling the question, for both sides. Was Bush in effect a deserter? The records of the Texas Air National Guard say he was not; they gave him an honorable discharge. End of story. Was Kerry a minor hero? The records of the Navy say he was; they gave him a Silver Star Medal for gallantry in action, a Bronze Star Medal with Combat V device for heroic achievement, three Purple Hearts for shrapnel wounds to his arms, legs and buttocks, and an honorable discharge. End of story.

I'm not saying end of true story. I'm saying end of story. We can go no further than this with any hope of arriving at truth. Mark Liberman served in the Army in Vietnam and was discharged. George W. Bush served in the Texas Air National Guard and was discharged. John Kerry served in the Navy in Vietnam and was discharged. After all these years, we must just let the accepted official permanent records of such bygone military service stand, put a bilateral stop to this inexpert fiddling with Vietnam-era history, and turn to more pressing contemporary matters.

We'd better. Because there are political issues (they will not be discussed here) about which I need to hear some answers. Not just the stupid contentless political blather of which I wrote light-heartedly a few days ago, but actual answers to compelling economic and law enforcement and governmental and military policy issues on which the fortunes of my country are going to turn.

I get no answers though, because the conduct and content of the two main Presidential campaigns is dominated and driven by lying, forging, conniving, slandering, mendacious, frothing, snarling assholes who seem to think that spreading innuendo and forgery and calumny and fraud and rumor across the landscape will help to turn voters toward their favored guy even if he completely avoids substantive discussion of anything that could be of relevance. Well, they have profoundly misjudged at least one very angry voter.

We have free speech in this country, and access to a magnificently flexible and expressive language. This power of linguistic expression, granted to our species alone it would seem, is strong magic. We must be very careful what we do with it. This descent into slander and false memory recovery and document forgery and history denial and mutual accusations of cowardice and treachery is not the free discussion of political matters that the authors of the First Amendment envisaged for us. It is not political activity at all; it is the destruction of political activity.

Posted by Geoffrey K. Pullum at 11:18 AM

Voice writing

Yesterday afternoon I took the train up to New York, to participate in a panel that Caroline Henton organized at SpeechTEK entitled "Everything You Always Wanted to Know about Voice but were Afraid to Ask". I really enjoyed the other panelists' talks. Marc Moens talked about how to furnish synthetic voices with personality, attitude and emphasis; Sandy Disner talked about "voice stress analysis" and similar things; Judith Markowitz talked about speech-based biometrics; and Chad Theriot talked about the use of automatic speech recognition in real-time transcription, with a demonstration by Jennifer Smith, the President-Elect of the National Verbatim Reporters Association.

The whole experience made me sorry that I didn't plan to spend more time at the SpeechTEK meeting, which I haven't attended in many years. I'll post about the other talks I heard, sooner or later -- the only good thing about dropping in to an interesting meeting for a couple of hours is that there are only a few things to describe -- but I'll start by explaining what I learned from Chad and Jennifer, which complements what I wrote a few days ago about the technology of real-time transcription.

An alternative to the special chording keyboards that I wrote about earlier is "voice writing", a method originally developed by Horace Webb more than 60 years ago. The basic equipment is traditionally a two-track recorder, a microphone for picking up the proceedings that are being transcribed, and a special "stenomask" which the transcriptionist can use to "repeat everything that occurs during testimony" without being heard by others. These days, people use a laptop computer as the multi-track recorder, and they also often use automatic speech recognition software to create a draft of the transcript. The software analyzes the transcriptionist's shadowing of the proceedings, not the original signal -- this allows the (much) higher recognition rates that are possible when the program is adapted to the speaker, and the speaker is adapted to the program. The ASR software used is one of the standard systems, typically either IBM's ViaVoice or Dragon's Naturally Speaking.

Chad's company, Audioscribe, sells software and system packages for this application. Jennifer used this method to transcribe the panel presentations and discussions, with the results appearing in real time on a computer projection screen, in the format shown in the (promotional) screenshot below. The quality was very good, definitely in the range of the "95% correct or better" that is claimed -- which means several mistakes per average screenful, to be corrected in a proofreading stage later on.
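To see why 95% accuracy still means several mistakes per screenful, here is a back-of-the-envelope sketch in Python. The figure of 100 words per screen is purely my assumption for illustration; the real number depends on the display format.

```python
def expected_errors(words_on_screen, accuracy=0.95):
    """Expected number of misrecognized words among the words shown."""
    return words_on_screen * (1.0 - accuracy)


# Assumed screenful of 100 words: roughly five words to fix in proofreading.
print(round(expected_errors(100), 2))
```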

The approach has some problems, both for the human users and for the ASR systems. The human users need to learn to shadow others' speech accurately at high rates for long periods of time, while also entering the other sorts of information that a transcript requires. The ASR systems need to learn to deal with sotto voce or even whispered speech.

But the most difficult challenge for both is dealing with fast speech. ASR systems are not supposed to work past about 160 words per minute; but transcriptionists find that they need to keep up with people who are often talking in the range of 180-350 wpm. The speech recognition engines can "learn" to work at rates up to about 250 wpm, according to Chad and Jennifer, but above that rate, they break down in a serious way, even though the human "voice writers" can shadow accurately at up to 350 wpm. In order to deal with this problem, the transcriptionists create special "fast speech vocabulary" -- pseudowords for common words or word sequences spoken rapidly -- which the recognition engines can learn to map to the right transcription. This is apparently one of the more difficult aspects of learning to use this technique.

I was very interested to learn about voice writing, not least because the practice offers the possibility of getting a very large amount of interesting material for speech research. As I understand it, there are about 10,000 trained "voice writers", and about 2,500 who use the computer technology, each doing several hours of transcription per work day. (I gather that there are about 100,000 users of stenograph machines). While some of the recorded and transcribed material is confidential or otherwise limited in distribution, much of it is not, and so there are millions of hours of speech every year for which digital audio, digital audio of the "shadow" track, and a digital form of the (corrected) transcript are being created, and might in principle be used for speech research.
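The "millions of hours" figure follows from simple multiplication. In the Python sketch below, the head counts are the ones given above, while 3 hours per day (for "several") and 250 work days per year are my own assumptions.

```python
def annual_hours(writers, hours_per_day, work_days_per_year):
    """Total hours of transcribed speech produced per year."""
    return writers * hours_per_day * work_days_per_year


HOURS_PER_DAY = 3    # assumed reading of "several hours"
WORK_DAYS = 250      # assumed work days per year

computer_users = annual_hours(2_500, HOURS_PER_DAY, WORK_DAYS)
all_voice_writers = annual_hours(10_000, HOURS_PER_DAY, WORK_DAYS)

print(f"{computer_users:,} hours with computer technology")
print(f"{all_voice_writers:,} hours across all voice writers")
```

Even counting only the 2,500 who use the computer technology, the estimate comes to nearly two million hours a year under these assumptions.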

Jim Baker has recently speculated in interesting ways about what speech recognition research could do with millions of hours of speech. Anyone with a bit of imagination can also think of many ways that access to very large transcribed speech corpora could be used as an empirical foundation for scientific or lexicographic investigations of speech and language. The Linguistic Data Consortium and other organizations have found ways to get thousands of hours of transcribed audio that can be used as a shared basis for research in speech science and engineering. There are millions of hours of transcribed audio out there, but there are both practical and legal impediments to getting research access. My conversation with Chad and Jennifer suggests a route to a solution, since the members of the voice writing community have an active interest in fostering the development of better ASR technology.

Posted by Mark Liberman at 09:33 AM

Dialects without borders

Chris Weigl at serendipity writes thoughtfully in response to Bill Poser's question about "whether it is possible for ... a reading pronunciation to become so firmly fixed that subsequent intensive exposure to the spoken language does not correct it". Chris notes that "[t]here was a period of about ten years starting at the age of 16 ... during which I read voraciously in English and thereby improved my vocabulary and knowledge of English grammar, collocations and idiomatic expressions while having hardly any need to actually speak it". As a result, she developed a variety of "idiosyncrasies [that] proved relatively difficult to eradicate."

I'd like to point out that many monolingual English speakers have a similar experience in formal or technical registers, which may never come up in conversations with those around them, especially if their real-life environment is not an intellectual one. Geoff Pullum discussed this in his post on "Mispronunciation and Autodidacts".

As Chris and Bill both note, these reading pronunciations and similar mis-analyses are not random. They're the result of over-generalization of letter-to-sound relationships, or analogy to specific similar words, or intrusion of more natural phonological patterns. This interference can happen within a language (and its orthography) or across languages, but in either case, it's psychologically natural. Therefore, the resulting errors are usually not really idiosyncratic. As with eggcorns, if one person makes the mistake, it's likely that at least a few others have done so as well. This creates a sort of "dialect without borders".

At least one linguist of my acquaintance has used this fact to construct a justification, or perhaps rationalization, for his own (relatively small) set of stubborn spelling pronunciations and malapropisms. "That's my dialect", he says. When he occasionally hears someone else speak in a similar way, he takes this as confirmation.

This brings up some of the issues that Arnold Zwicky has discussed under the heading of "The Thin Line between Error and Mere Variation". All of us believe that sporadic mispronunciations of read words -- like the mis-stressing of attributive with penultimate stress that Chris cites -- are not really dialect differences; they're just mistakes. My "that's my dialect" friend is mis-using the concept of grammar as description. Every once in a while, such mistakes do get picked up by enough members of some speech community to reach the status of a genuine variant. And as Arnold has pointed out, ambiguous cases are common. But sometimes a mistake is simply and clearly just a mistake.

Posted by Mark Liberman at 07:21 AM

September 14, 2004

Purple advertising prose

I just received in the U.S. mail an unsolicited advertising brochure on stiff glossy card in full color, with text that opens in the following way:

DISCOVER THE NATURAL CONNECTION FOR YOUR SUCCESS

With each moment that passes, the sun's color shimmers and changes from startling purples and magentas to the soft hues of cornflower blue. Waves gently roll to shore in nature's own version of a childhood lullaby. And as the sun finally dips below the horizon, the mast of a sailboat—making its way back to harbor—dances on the horizon like a ballerina in a dream.

And people accuse me of over-writing!!

Posted by Geoffrey K. Pullum at 11:52 AM

"Protest" Banner

"Vietnam Veterans Protest Kerry," was the headline in today's Washington Times (courtesy of Scott Parker of American U.). That use of protest with an animate object is a bit strange; most people reserve the verb for objects that denote events or states of affairs. But the usage evokes the scene in First Blood, the first and best of the Rambo movies, where Sylvester Stallone famously says:

Nothing is over! Nothing! You just don't turn it off! It wasn't my war! You asked me, I didn't ask you! And I did what I had to do to win! But somebody wouldn't let us win! And I come back to the world and I see all those maggots at the airport, protesting me, spitting. Calling me baby killer and all kinds of vile crap! Who are they to protest me? Who are they? Unless they've been me and been there and know what the hell they're yelling about!

What are we dealing with here: synchronicity or allusion?

Posted by Geoff Nunberg at 12:55 AM

September 13, 2004

Uncorrectable Reading Pronunciation?

I just re-read Shibumi by Trevanian, probably better known for The Eiger Sanction. The protagonist, Nicholai Hel, teaches himself Basque while in prison using a bilingual dictionary and some other books, so he has no description of the pronunciation, much less actual Basque speech to serve as a model. He therefore has to guess at the pronunciation of the alphabet. For the most part, he does well since Basque follows fairly closely the usual conventions for Western European languages, but he guesses wrong for the letter x, which he takes to represent the voiceless velar fricative [x]. In fact, in Basque x represents [ʃ]. He later goes to the Basque country and is exposed to the living language. He becomes fluent in Basque, but never overcomes his erroneous decision to read x as [x] and so has an idiosyncratic pronunciation. I wonder whether this has happened in real life? Pronouncing a written language incorrectly happens all the time; what I wonder is whether it is possible for such a reading pronunciation to become so firmly fixed that subsequent intensive exposure to the spoken language does not correct it, even in a case such as this where the correct sound poses no problem for the speaker?

Posted by Bill Poser at 09:33 PM

Another ex-mistake

Seeing Cornell students actually use whom as James Thurber jokingly recommended, just to add a "note of dignity or austerity" to their flag desecration, Geoff Pullum recently concluded that we should "kiss whom goodbye".

Reading in Forbes today that

"The Beatles' company, Apple Corps., is involved in a legal battle with Jobs' Apple Computer, claiming the hardware manufacturer is in breach of a 1991 agreement that that forbids it from using the trademark for any application "whose principle content is music." The two companies have been involved in a number of court battles over the years involving the use of the Apple trademark."

the emphasized misspelling similarly reminded me that the principle/principal distinction is now orthographic roadkill.

I guess it's possible that the legal agreement between Apple Corps and Apple Computer actually spells the adjective as "principle"? I doubt it, but if so, this would give Jobs and Co. a way out. As one of the half-dozen people now living who still mostly remembers the distinction without looking it up, I'll be happy to act as an expert witness, if asked.

Of course, the same reporter and editor also describe Apple Computer as a "hardware manufacturer", so the usual rule in such cases applies: when in doubt, blame the journalist.

Posted by Mark Liberman at 04:10 PM

Translation and analysis

Jonathan Mayhew at Bemsha Swing has posted 14 different translations of a Basho haiku, under the heading "the wisdom of crowds in translation".

A few of the examples:

Britton:	A mound of summer grass: Are warriors' heroic deeds Only dreams that pass?
Sato:	Summer grass: where the warriors used to dream
Hamill:	Summer grasses: all that remains of great soldiers' imperial dreams
Rexroth:	Summer Grass where warriors dream.

Jonathan comments that

We might prefer or despise [a] particular version, but the best version is probably the sum total or average of all these. The more you have, the better. Any eccentricity or redundancy simply drops away. You don't need a mound of grass or a thicket of grass, just plain old natsugusa is fine.

I very much like the idea of looking at a large number of alternative translations, but as a linguist, I want to be able to see an interlinear analysis of the original as well. I don't know any Japanese, but Bill Poser knows a lot, so I asked him, and he was kind enough to supply one (with the Japanese written in romaji):


natsu	kusa	ya
summer	grass	lo!

tsuwa	mono	domo	ga
strong	person	PLURAL	GEN

yume	no	ato
dream	GEN	remains

Bill pointed out that the /k/ of /kusa/ becomes voiced in the compound, so the unparsed original is /natsugusa/. He also observed that

... two different genitive particles are used, /ga/ and /no/. In Modern Standard Japanese /ga/ no longer has this genitive usage, but it used to. The distribution is imperfectly understood but arguably is /ga/ with a human possessor, /no/ with non-human.

and he ended by mentioning that

The word /ato/ is interesting. Its most common usage is probably with the meaning "after", as in /ato de/ "later". It is however a noun and "after" clauses are nominal. As a noun it has the meaning "remains, relics" as in /shiroato/ "remains of a castle". Interestingly, it need not refer to physical remains in the usual sense. /ashiato/ are "tracks, footprints".

For people with even a minimal linguistic education, this kind of transcription, analysis and commentary is easy to assimilate, and adds a great deal to the appreciation of the work, even for those who don't know the original language at all.

Someone like Bill could also present an equally simple and equally interesting set of observations about the characters used in the original orthography, the calligraphy of a notable presentation, and the sound of a reading.

I don't know any sites on the web that offer this sort of access to poetry in other languages, but such things must exist at least in embryo. And there should be more of them, in my opinion. Haiku would be a particularly good subject for such analyses, since the originals are so short. However, you could do the same for famous selections from Sappho or Petrarch or Akhmatova. Or writers in less accessible languages, like the 19th-century Somali poet Raage Ugaas.

Posted by Mark Liberman at 12:56 PM

New Open-source speech code from IBM

According to an article by Steve Lohr in today's NYT, IBM is announcing today that it will donate some source code related to speech technology to two open-source software groups. Apache will get some software for dealing with spoken dates, times and locations, and Eclipse will get some "speech editing tools". The NYT article doesn't explain clearly what the software really is and does; in fact, what the article says is somewhat misleading.

There's an item on the Eclipse site today about a so-called "Voice Tools Project", according to which

The Voice Tools Technology Project will focus on Voice Application tools in the JSP/J2EE space, based on W3C standards, so that these standards become dominant in voice application development. ... Initially, Voice Tools will consist of editors for VoiceXML, the XML Form of SRGS (Speech Recognition Grammar Specification), and CCXML (Call Control eXtensible Markup Language). Implementations of other tools that implement W3C voice standards, such as the LexiconML (Pronunciation Markup Language), will be added as the standards solidify and the Voice Tools Eclipse community grows.

The same announcement mentions "committers" from SBC Communications and Voice Genie as well as IBM. As of 1:00 p.m. today, there will be a newsgroup which may have some more information. So far, though, this looks as if it will mainly be interesting to people who want to build interactive voice applications using open-source software for the framework controlling the interaction -- the available options for the component technologies (such as speech recognition and synthesis) are not changed. And if you're looking for open-source software for what you might think the NYT's phrase "speech editing tools" means, try Audacity, WaveSurfer, or Praat.

Here's the IBM press release. It gives a longer list of participants in the Voice Tools Technology Project: "Apptera, AT&T, Audium, Avaya, Cisco, Fluency, Genesys, Kirusa, Loquendo, Motorola, Nortel, Nuance, Openstream, ScanSoft, Siebel, Syntellect, Telisma, TuVox, V-Enable, Viecore, Vocomo, VoiceGenie, Voice Partners, and VoxGeneration".

It also explains that what IBM is donating to Apache is "Reusable Dialog Components (RDCs)". In more detail:

Pre-built speech software components, or "building blocks" that handle basic functions such as date, time, currency, locations (major cities, states, zip codes), RDCs are often-used functions in speech-enabled infrastructure applications. These allow a caller to, for example, book a flight using an auto-agent over the phone. Multiple reusable dialog components can be aggregated to provide higher levels of user functionality.
Developed by IBM Research, RDCs are Java Server Page (JSP) tags that enable dynamic development of voice applications and multimodal user interfaces. JSPs that incorporate RDC tags automatically generate W3C VoiceXML 2.0 at runtime -- providing a standard basis for speech applications. By providing familiar and standards-based programming models, J2EE developers can add voice interaction to Web applications. And by making the RDC framework available to the community, speech components built using it will work together, regardless of the vendor that created them.

So the Apache stuff is also oriented towards establishing standards for Voice I/O in call center applications and the like. There's nothing on the Apache web site yet about this.

Posted by Mark Liberman at 12:09 PM

Final periods and quotation marks: harder than you thought

There’s a punctuation rule that American publishers follow rather strictly though British publishers do not: when an expression contained in quotation marks falls at the end of a sentence, a following comma or period (though not a colon, semicolon, exclamation point, or question mark) should be moved leftward to fall inside the quoted string. You might have thought it was child’s play to enforce that by algorithm. It isn't. We’ll consider just the issue of single quotation marks and periods. (Single quotation marks are less common in American printed sources than double quotation marks, but I'll deal with that issue below.) Since it looks really confusing to try and mention punctuation marks in print so you can talk about them, I'll refer to the right single quote character as <RSQUO> (after its HTML code &rsquo;), and I'll call the period or full stop <PERIOD>. The rule for correcting to the American practice could be (you might think) simply this:

Change any occurrence of <RSQUO><PERIOD> to <PERIOD><RSQUO>.

But a single sentence in the latest New Yorker caused me to realize that it isn’t that simple; it can never be simple; it is extremely hard, about as hard as the whole enterprise of accurately parsing arbitrary English syntactic structure.

The reason is simply that <RSQUO> is ambiguous in function: it serves both as our right single quotation mark (which must be matched with a left one that occurs earlier) and as the apostrophe (which is really a 27th letter of the alphabet that occurs in the spelling of certain words like won’t and children’s and has nothing to do with quotation). No font distinguishes these. What caused me to see that this matters a great deal was the underlined part of the following (the context being a discussion of how everywhere Al Gore goes he has to put up with people expressing sympathy for him and also grief of their own over the Florida election in 2000):

He has to face not only his own regrets; he is forever the mirror of others’. A lesser man would have done far worse than grow a beard and put on a few pounds.

Here the <RSQUO> character is functioning as the apostrophe. It is part of the spelling of the regular genitive plural suffix, as in a phrase like several butchers’ aprons. Notice, the article is not saying that Al Gore is forever the mirror of others, i.e., other people; it is saying that he is forever the mirror of others’ regrets, i.e., other people’s regrets. But it would be perfectly possible to have a sentence like this (it doesn’t state a true claim, you understand, it’s just an example of a possible sentence; the bit inside the single quotes asserts, unlike the sentence quoted above, that he is the mirror of other people; and notice that I’m punctuating it wrongly according to the rule, to exhibit the contrast):

The New Yorker article said, ‘He has to face not only his own regrets; he is forever the mirror of others’.

That sentence would need to be changed under the American rule; it should be given like this:

The New Yorker article said, ‘He has to face not only his own regrets; he is forever the mirror of others.’
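The contrast between the two cases can be sketched in a few lines of Python. This is only an illustration of the naive rule stated earlier, not anything a publisher is known to run; the function name `americanize_naive` and the shortened example strings are assumptions made for the sketch.

```python
RSQUO = "\u2019"   # right single quotation mark, which also serves as the apostrophe
LSQUO = "\u2018"   # left single quotation mark
PERIOD = "."

def americanize_naive(text: str) -> str:
    """Apply the naive rule: change every <RSQUO><PERIOD> to <PERIOD><RSQUO>."""
    return text.replace(RSQUO + PERIOD, PERIOD + RSQUO)

# Here <RSQUO> really is a closing quotation mark, so the swap is wanted.
quoted = f"The article said, {LSQUO}he is forever the mirror of others{RSQUO}."

# Here <RSQUO> is the genitive plural apostrophe, so the swap is an error.
genitive = f"he is forever the mirror of others{RSQUO}. A lesser man would have done far worse."

print(americanize_naive(quoted))    # period correctly moved inside the quotation
print(americanize_naive(genitive))  # period wrongly moved in front of the apostrophe
```

The same string replacement produces the right result in the first case and a misspelling in the second, because nothing in the character sequence itself says which job <RSQUO> is doing.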

In case you're thinking that this won’t come up very much because usually we use double quotation marks for quotations, let me remind you first that this differs between publishers (the Linguistic Society of America style sheet requires single quotes), and second, more importantly, single quotation marks are used for quotations within quotations enclosed in double quotation marks. Consider this example:

Geoff Pullum writes on Language Log: “The New Yorker article said, ‘He has to face not only his own regrets; he is forever the mirror of others’. A lesser man would have done far worse than grow a beard and put on a few pounds.’ Here the <RSQUO> character is functioning as the apostrophe.”

Here the first period must not be moved, but under the American rule the second one must! [Nerd note: Sophisticated computational linguists will immediately see that there is an argument here, based on quote patterns alone, to the effect that no finite state device can ever successfully recognize all the contexts in which the order of <RSQUO> and <PERIOD> must be changed. I will not give the proof here, as the margin of this post is too small to contain it. End of nerd note.]

The bottom line: in order to tell whether you should change <RSQUO><PERIOD> to <PERIOD><RSQUO> you have to determine whether or not you’re inside a single-quoted sequence, and also determine whether the word before the period is a regular genitive plural. It’s non-trivial. There is no telling how long a passage in single quotes might be: the opening quote might be any number of sentences off to the left, and the closing quote might be any number of sentences off to the right, past any number of apostrophes. And the only way to tell whether you’re looking at a regular genitive plural is to grasp

the morphology (e.g.: does this noun take regular inflection?), and
the syntax (e.g.: is this noun in a structural position where genitive case is allowed?), and
the semantics (e.g.: is this sentence to be understood as making a reference to other people, or implicitly to other people’s regrets?),

all in full detail. Quite beyond the capacities of computational linguists at the moment.
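A rough Python sketch shows how far simple bookkeeping gets you, and where it stops. It assumes a hypothetical heuristic, `americanize_with_quote_state`, that swaps <RSQUO><PERIOD> only while a left single quote is still open; the example strings are shortened versions of the sentences above. It is an illustration of the difficulty, not a proposed solution.

```python
RSQUO, LSQUO, PERIOD = "\u2019", "\u2018", "."

def americanize_with_quote_state(text: str) -> str:
    """Swap <RSQUO><PERIOD> only while a left single quote is open.

    Heuristic only: inside an open quotation, any <RSQUO> directly followed
    by a period is assumed to be the closing quote, which is exactly the
    assumption a genitive plural apostrophe can violate.
    """
    out = []
    open_quotes = 0
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == LSQUO:
            open_quotes += 1
            out.append(ch)
        elif ch == RSQUO and open_quotes > 0 and text[i + 1 : i + 2] == PERIOD:
            out.append(PERIOD + RSQUO)  # move the period inside the quotation
            open_quotes -= 1
            i += 1                      # skip the period just emitted
        elif ch == RSQUO and open_quotes > 0 and out and out[-1].endswith(PERIOD):
            open_quotes -= 1            # already <PERIOD><RSQUO>: a closing quote
            out.append(ch)
        else:
            out.append(ch)              # ordinary character, or an apostrophe
        i += 1
    return "".join(out)

genitive = f"he is forever the mirror of others{RSQUO}. A lesser man would have done far worse."
quoted = f"The article said, {LSQUO}he is forever the mirror of others{RSQUO}."
nested = f"{LSQUO}He is forever the mirror of others{RSQUO}. A lesser man would have done far worse.{RSQUO}"

print(americanize_with_quote_state(genitive))  # left alone: no quotation is open
print(americanize_with_quote_state(quoted))    # period moved inside, as required
print(americanize_with_quote_state(nested))    # first period wrongly moved
```

Tracking open quotations rescues the bare genitive plural and still fixes the simple quotation, but in the nested case the apostrophe in others’ sits inside an open quotation and is mistaken for its close. Getting that case right needs the morphological, syntactic and semantic information listed above, which no amount of quote counting supplies.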

Everything’s so much harder once it’s been given a simple explanation by a linguist, isn’t it? Sigh.

[Revised a little on September 14. Thanks to Glen Whitman for an interesting observation that contributed to this expanded version.]

Posted by Geoffrey K. Pullum at 12:56 AM

September 12, 2004

What did Cheney say about Kerry and terrorism?

Did the Vice President, in remarks made at an Iowa campaign appearance, really say that electing John Kerry will cause new terrorist attacks? John Edwards thinks so, and called the comments "unAmerican". The Associated Press thought that was what he was saying too and reported his remarks under the headline "Cheney Warns Against Vote for Kerry". But conservative bloggers think otherwise; see Patterico's Pontifications, for example (Patterico provides all the necessary references to things like the White House transcript). Surely we can decide, with the help of linguistic analysis, given access to accurate transcripts, the simple matter of what was said, can't we?

Well, here's the quote, with the part that AP quoted picked out in boldface. Cheney is saying that the decisions that set up key international security systems after World War II were carried through and supported by both Democrat and Republican administrations, and he goes on:

We're now at that point where we're making that kind of decision for the next 30 or 40 years, and it's absolutely essential that eight weeks from today, on November 2nd, we make the right choice. Because if we make the wrong choice, then the danger is that we'll get hit again, that we'll be hit in a way that will be devastating from the standpoint of the United States, and that we'll fall back into the pre-9/11 mind set if you will, that in fact these terrorist attacks are just criminal acts, and that we're not really at war. I think that would be a terrible mistake for us. We have to understand it is a war. It's different than anything we've ever fought before. But they mean to do everything they can to destroy our way of life. They don't agree with our view of the world. They've got an extremist view in terms of their religion. They have no concept or tolerance for religious freedom. They don't believe women ought to have any rights. They've got a fundamentally different view of the world, and they will slaughter -- as they demonstrated on 9/11 -- anybody who stands in their way. So we've got to get it right. We've got to succeed here. We've got to prevail. And that's what is at stake in this election.

Naturally, you trust Language Log to provide a linguistic reading that will sort this out. Did he say what AP (quoting the boldface bit) says he said, or not?

I regret to say that it is extraordinarily hard to rule on this. The argument that he did would be as follows. Suppose an Olympic athlete is facing a decision about whether to go on steroids for a while to boost her performance, even though it's wrong, and her coach says: If you make the wrong choice, then the danger is that you'll get found out in a random drug test, and your medals will be taken away. It would be inconceivable that anyone could doubt that the causal inference, and the warning, was intended: make the wrong choice, they'll find you out and you'll lose your medals. It wouldn't make any difference what he went on to connect this to (depression, remorse, loss of endorsement contracts, poverty, suicide): the warning about the direct consequences is completely clear.

The argument that Cheney did not intend the direct causal inference goes as follows. Read to the end of the passage, and you'll see Cheney is connecting making the wrong choice to the possibility of a future in which the USA returns to treating attacks by terrorists merely in terms of the criminal law, not as justification for foreign military actions. Assume that he gets the bits and pieces slightly muddled up as regards the optimal order, and you can read him as saying something that would have been better phrased like this: If we make the wrong choice, then the danger is of a future in which, when we are hit again (as some day we surely will be, perhaps devastatingly), we will fall back into the pre-9/11 mind set where we take terrorist attacks to be just criminal acts, and not appreciate that we're really at war; and that would be a terrible mistake for us.

In other words, the Cheney defenders say that he was stumbling inexpertly through an argument that relates to the danger that Kerry will treat Islamic terrorism as an internal police issue rather than an issue of global war, and that it would be a terrible mistake for America to deal that way with whatever further terrorist strikes may come.

Perhaps the defense seems a little tortured. Certainly, it is harder to lend it credence in the midst of a furious election campaign. And even if we accept it, the drift does seem to be that there is at least an indirect causal chain: electing Kerry leads to the criminal-law-not-war position on terrorism, and that is perceived as weakness, and weakness draws new terrorist attacks down upon us.

You decide. Language Log is not a political blog and does not advise you on which candidate to vote for (though it may analyze linguistically interesting things that candidates say). The linguistic lesson I draw from looking at this issue is that even given a transcript, it can be astonishingly difficult to decide whether someone asserted a given proposition P or not. One needs to be sensitive to the linguistic possibilities; imaginative concerning what might have been intended; willing to countenance ambiguity and uncertainty; prepared to compare alternative interpretations; and even then, sometimes you have to go back to the source and ask some questions.

My preferred way of working on this would be (if I had the access) to go back to Cheney and ask him, "Did you intend to claim that electing Kerry and Edwards would in a direct way lead to new terrorist attacks that would not have occurred if Bush and you had been elected, or not? Yes, or no?" There's no way to ensure an honest answer, of course; but asking about the speaker's intention is sometimes all you can do, because the raw record of linguistic performance is simply not clear enough. People do sometimes structure their spoken paragraphs poorly and get the cart before the horse.

The cynic might say, Good luck with getting any politician, let alone one who's in hot water, to respond to a yes/no question with a one-word answer between now and November 2. But in fact we can now learn the results of my thought experiment, because Cheney has given an interview in which he clarified his meaning. He says he did not intend the meaning that AP attributed to him. So that settles that. Now it's back to whether you trust his report of his intention, or you think he's dishonestly backtracking because of the furore. Language Log cannot help you with that.

Posted by Geoffrey K. Pullum at 09:41 PM

The psychodynamics of grammatical correction

Being an optimistic sort of person, I've been thinking optimistic thoughts about new search and communications technologies creating a new era for discussions of grammar and usage. After all, any sensible person with internet access can now learn what the facts of usage are like. And anyone who pontificates ignorantly on the subject in public is likely to be ridiculed, in public and at length, by some of the people whose weblogs are listed on the Language Log front page blogroll. So logically, there ought to be a shakeout in the language maven business, weeding out those too careless or too bigoted or too dense to produce commentaries consistent with the basic facts.

But then, being a realistic sort of person, I had a more pessimistic thought. Some people pay others to make up rules and impose punishments for imaginary infractions, apparently because they derive pleasure from humiliation. The rules involved need not be rationally justified -- perhaps it's even more enjoyable if they aren't. Nor do careful description and accurate separation of fact from fancy seem to be in the job description of those who provide such services.

If this is really what's going on, then the grammar slammers of the world will continue to find customers eager to experience the sensations promised by the offer to "vanquish your language anguish", whether or not the slammers bother to get their facts straight. So I looked around a bit on the internet to see if I could find any evidence of grammar correction as a category of SM role playing, on a par with Klismaphilia, Retifism and the rest. Aside from things like jokey references to Lynne Truss as "the dominatrix of grammar", the main thing I found was an extensive network of attempts at porn-parlor Google bombing, full of text like this:

You owe it to your self to give Positioning Dominance Through Grammar and Swapping Cash your full un divided attention, chances such as Filipina Dominatrix Portland and Bisex Couple are very important. Life is too short not to give Justa Swinging Peppy or Aussie Christian Singles the chance that they deserve, learn how breakthrough ideas such as Illinois Free Adult Classifieds and certainly Cartilage Piercings have made a real difference.

(For more about pages of this kind, see this post from last winter).

My search also turned up this abstract from the September 2004 issue of Computers and Composition Online:

Jacqueline Rhodes, California State University, San Bernardino, Homo origo: The queertext manifesto.

Abstract: In a 56-point performance of what she calls “queertext,” Rhodes explicates the tensions between “The Word” and “queertext.” The Word, she writes, enacts its dominance through grammar and “extends its discipline” through a host of ills including “English-onlyism,” racism, heterosexism, and capitalism; queertext, on the other hand, resists textual dominance through its emphasis on “the material, erotic realities of our bodies.” Rhodes finds a unique space for queertexts online, claiming that the “hyperlink is an erotic textual moment, when idea and action collide.”

I believe that this is serious, though it's sometimes hard to tell.

My conclusion, in any case, is that grammar correction is not at all popular as a form of SM role play. So maybe there's hope for rational and honest grammatical discourse after all.

And I thought of closing with another "erotic textual moment", but decided to stick with the rigorous discipline of The Word.

Posted by Mark Liberman at 03:40 PM

That's not how that works

A few days ago, Geoff Pullum pointed out that journalists sometimes assert things about English usage that are easy to check, and turn out to be spectacularly false. In particular, he took to task an Australian who unwisely stated that "It's difficult to find a piece of writing in the mainstream press which mentions the word 'bisexual' without finding that it is immediately followed by the word 'chic'." As Geoff demonstrated, you can check these things. And these days, people do. So perhaps we can hope that the era of WCFCYA will eventually overcome the tendency of self-appointed language experts to make stuff up, at least if they want to keep charging money for their expertise.

The folks at English Plus offer the Grammar Slammer ("Deluxe with Spelling and Grammar Checkers", for $49). On their website, they also offer for free an alphabetized list of Common Mistakes and Tricky Choices, which is a mixture of useful guidance (e.g. on the difference between accept and except) and more dubious advice.

One of the pages on this site is entitled "Using That, Which, and Who as Relative Pronouns", and explains that

In modern speech, which refers only to things. Who (or its forms whom and whose) refers only to people. That normally refers to things but it may refer to a class or type of person.

Examples: That is a book which I need for the class.

These are the books that I need for the class.

He is the man who will be teaching the class.

They are the type of people who would lie to their mothers.

They are the type of people that would lie to their mothers.
(That is OK here because it is a class or type.)

There are two problems here. The first mistake is so trivial that it's shocking to find an apparently rational person asserting it, in a serious context, to be read by others who know the English language and have minimally intact powers of observation and memory. I'm referring to the claim that English relative clauses referring to humans can't be introduced by that, unless the reference is to a "class or type of person". Does anyone really think that "He's the man that will be teaching the class" is ungrammatical? Does anyone really believe that avoiding similar uses of that is a norm that writers and speakers of English aspire to?

If so, a few minutes reading the morning papers should set them straight. Searching the recent stories indexed by Google News, I found these examples in a couple of minutes:

There were three people that I thought about that morning.
And it will be coop care — so it will be controlled by the people that belong to it.
The other incident involved a three-year-old boy that was involved in the elementary school program
Police claim Johnson took a six-month-old girl that he believes he fathered and disappeared in July.
The man that Legan replaced, former state Sen. David Klarich, R-Ballwin, served for six months.
The woman that accused California governor Arnold Schwarzenegger of groping her has dropped her lawsuit against the politician.
Michelle Thompson was among agency workers that lined the ballpark walkway to greet volunteers and donors.
The occasion was a protest at the home store of Fabian Vera, the manager that has served as a full-time union buster at the location where Daniel works.
Omarr Conner is a player that we recruited, and we know what he did in high school.
The defense returns 9 players that started multiple games last year.
While the rain aided in some areas, those area farmers that were unfinished with harvest have suffered mightily from the precipitation.
Police said the farmer that shot a man in Malaekahana Valley had been stolen from before.

These relative clauses refer to specific individuals or sets of individuals, not the sort of hypothetical class or type ("the type of people that would lie to their mothers") in the English Plus example. And I can't see any reason to question the use of that in any of them.
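For anyone who wants to repeat this kind of check on their own text collection, a minimal Python sketch follows. It is not the search used above (that was done by hand through Google News); the noun list `HUMAN_NOUNS`, the function name `human_that_hits` and the four sample sentences are assumptions chosen for illustration, and the noun list is deliberately tiny.

```python
import re

# Hypothetical, very short list of human head nouns; a real survey needs many more.
HUMAN_NOUNS = ["people", "man", "woman", "boy", "girl", "player", "players",
               "farmer", "farmers", "manager", "workers"]

# A listed noun immediately followed by the word "that".
PATTERN = re.compile(r"\b(" + "|".join(HUMAN_NOUNS) + r")\s+that\b", re.IGNORECASE)

def human_that_hits(sentences):
    """Return (noun, sentence) pairs where a listed human noun directly precedes 'that'."""
    hits = []
    for sentence in sentences:
        match = PATTERN.search(sentence)
        if match:
            hits.append((match.group(1).lower(), sentence))
    return hits

sample = [
    "There were three people that I thought about that morning.",
    "Omarr Conner is a player that we recruited, and we know what he did in high school.",
    "Police said the farmer that shot a man in Malaekahana Valley had been stolen from before.",
    "These are the books that I need for the class.",
]

for noun, sentence in human_that_hits(sample):
    print(noun, "->", sentence)
```

A pattern like this only finds candidates. It will also match content clauses such as "told the man that he was late", so every hit still has to be read by a person before it counts as a relative clause.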

If recent journalism isn't the source of this would-be principle of usage, what about literature? Well, just among titles, we have Mark Twain's The Man that Corrupted Hadleyburg, Edgar Allan Poe's The Man that was Used Up, and Edgar Rice Burroughs' The People that Time Forgot.

Here's a piece of advice: if someone proposes a grammatical principle that is violated by the titles of two or more classic novels or stories, you should think twice before paying them money for further advice on grammar and usage.

Turning our attention to poetry, it's equally easy to find examples of that introducing non-"class-or-type" relative clauses:

679      ---All moveables of wonder from all parts,
680      Are here, Albinos, painted Indians, Dwarfs,
681      The Horse of Knowledge, and the learned Pig,
682      The Stone-eater, the Man that swallows fire,
683      Giants, Ventriloquists, the Invisible Girl,
684      The Bust that speaks, and moves its goggling eyes,
685      The Wax-work, Clock-work, all the marvellous craft
686      Of modern Merlins, wild Beasts, Puppet-shows,
687      All out-o'-the-way, far-fetch'd, perverted things,
688      All freaks of Nature, all Promethean thoughts
689      Of Man; his dulness, madness, and their feats,
690      All jumbled up together to make up
691      This Parliament of Monsters.

(Wordsworth, The Prelude, Book Seventh.)

36 The merriment of the twin-babes that crawl over the grass in the sun, the mother never turning her vigilant eyes from them,

(Whitman, Leaves of Grass, Spontaneous Me).

15        Half-drunk or whole mad soldiery
16        Are murdering your tenants there;
17        Men that revere your father yet
18        Are shot at on the open plain;

(Yeats, Reprisals)

The second mistake in the Grammar Slammers' analysis of "Using That, Which, and Who as Relative Pronouns" is a much more subtle one: in this context, that should probably not be called a "relative pronoun" at all.

In chapter 12 of the Cambridge Grammar of the English Language ("Relative constructions and unbounded dependencies"), the authors argue (pp. 1056-57) that

Traditional grammar analyses the that which introduces relative clauses as a relative pronoun, comparable to which and who, but we believe that there is a good case for identifying it with the subordinator that which introduces declarative content clauses.

They give four arguments, all simple and easy to understand -- for those who can keep track of the basic facts of standard English grammar, and don't feel the need to invent rules out of thin air.

Posted by Mark Liberman at 01:37 PM

Thoughts on the Pericu

A few comments on the report that DNA evidence suggests that the Pericú, an extinct tribe of Baja California, are more closely related to "the ancient populations of southern Asia, Australia, and the South Pacific Rim" than to other Native Americans and peoples of the North Pacific Rim.

It is true that this proposal will likely provoke criticism from some Native Americans. Some groups believe that they were created where they are now and consider it offensive to suggest otherwise, and some are concerned that any suggestion that they are themselves immigrants will undermine their claim to their territory. Of course, such groups aren't keen on the Beringian theory either, since it denies that they originated in situ. The idea that there was an earlier migration is worse, though, since it makes them not only immigrants but latecomers. However, there is considerable diversity both of tradition and opinion. Many tribes have no tradition of being created where they are now or even have traditions of migration. And many recognize that prior possession is a perfectly adequate basis for their claims to their territory and that it doesn't matter whether they have been there since the creation.

There is actually another interpretation that should be acceptable to Native Americans who consider themselves autochthonous. They could say that whereas they have always been here, the Pericú were merely the earliest immigrants into the Americas. After all, the DNA evidence itself does not establish that the Pericú were in the Americas before them.

The argument by Johanna Nichols to which Mark referred is based on a survey of typological features of languages, that is, features like "the verb follows its object" or "distinguishes between first person inclusive and exclusive". Such features are not traditionally considered probative of genetic affiliation because there are only a few possibilities, so the probability of two languages sharing them is rather high. Nichols argues that by choosing the features one uses carefully and looking at complexes of features rather than isolated features one can obtain evidence for a historical relationship between languages. She concedes that it isn't possible by this method to distinguish between common descent and diffusion, but suggests that that isn't a fatal flaw, because it is interesting to know who has been in touch with whom, even if we can't tell what the nature of the relationship was. According to Nichols, languages along the Pacific Rim share a number of features that suggest a historical relationship among the languages. This is an interesting idea, but it isn't clear whether or not it really works. Questions have been raised both regarding the validity of the method in general and regarding particular features.

The idea that the Pericú represent an earlier, more southerly migration by boat and/or along the coast to the Americas is quite plausible. For one thing, all of the very early humans found in the Americas seem more closely to resemble Austronesians and Ainu than later American Indians; a distinct migration would explain this. Secondly, it is now I believe conclusively established that the Clovis culture was not the first in the Americas, but it is Clovis that most plausibly reflects the Beringian migration. So the pre-Clovis peoples presumably reflect another migration. Thirdly, if everybody came via Beringia, we would expect to find a progression of sites from North to South. We don't. Indeed, there are very early sites, e.g. Monte Verde, in South America. This argument isn't as conclusive as it might be because we don't have an awful lot of early sites, and we can't date them with great precision, so if there were a progression but the movement were rapid we might not be able to resolve it. But if it is right that we don't see the progression we ought to, we have another fact that would be explained by one or more arrivals on the Pacific coast. Fourth, there is tons of evidence that it is possible to travel by fairly primitive boats between Asia and the Pacific Coast. In addition to planned voyages, there could have been many cases of people being swept to the Americas by storms.

The lack of old sites along the Pacific Coast is not a counterargument to this hypothesis because most of what would have been the coast at the time was submerged at the end of the last ice age. Some archaeologists think that there may be a lot of sites underwater. These sites are presumably much more difficult to discover than, say, Bronze Age sites - it's hard to observe a lithic scatter on the ocean floor.

Posted by Bill Poser at 01:52 AM

September 11, 2004

Text message novel

Last spring, I posted about the Japanese violinist and writer Senju Mariko, who is said to do her writing by sending herself text messages on her cell phone. Now, according to an article by Howard French in this morning's IHT, Qian Fuzhang has written a novel for others to read on their cell phones.

"Out of the Fortress" showed up on tens of thousands of mobile telephone screens on Friday. ... Weighing in at a mere 4,200 characters, "Out of the Fortress" is like a marriage of haiku and Hemingway, and will be published for its audience of cellphone readers at a bite-size, 70 characters at a time - including spaces and punctuation marks - in two daily installments.
Other "readers" may choose to place a call to the "publisher," hurray.com.cn, a short text-message distribution company, to listen to a recording of each day's story as it unfolds.

4,200 Chinese characters is roughly equivalent to 2,100 English words, which would make Out of the Fortress more like a short story than a novel. Is there a problem of terminology translation here? Or is Qian doing some explicit genre-bending? Or is this just a marketing thing, since the value of a "novel" will be perceived as greater than the value of a "story"?
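The arithmetic behind these figures can be sketched quickly. The total length, installment size and schedule come from the quoted article; the ratio of two Chinese characters per English word is just the rough equivalence assumed above, not a measured value.

```python
import math

TOTAL_CHARS = 4200           # length of the novel, per the article
CHARS_PER_INSTALLMENT = 70   # per the article, spaces and punctuation included
INSTALLMENTS_PER_DAY = 2     # per the article
CHARS_PER_ENGLISH_WORD = 2   # assumption: the rough rule of thumb used above

installments = math.ceil(TOTAL_CHARS / CHARS_PER_INSTALLMENT)
days = math.ceil(installments / INSTALLMENTS_PER_DAY)
english_words = TOTAL_CHARS // CHARS_PER_ENGLISH_WORD

print(installments, days, english_words)  # 60 30 2100
```

On these numbers the whole "novel" arrives in 60 messages over about a month, which is consistent with the short-story reading.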

Posted by Mark Liberman at 10:58 AM

Ex-words, ex-parrots and nominal tense

Monty Python fans will have recognized the peroration of Geoff Pullum's recent post on "The Coming Death of Whom"

This word is nearly dead. It is close to being no more. It has all but ceased to be. If it wasn't Magic-Markered onto a defaced flag from time to time it would be pushing up the daisies. This is almost an ex-word.

as homage to Monty Python's Dead Parrot Sketch.

A customer enters a pet shop to "register a complaint", because he's just bought a parrot that turns out to be dead. The shop owner insists that the bird is "just resting". After they poke the cage and argue about whether the bird moved, the customer "[t]akes parrot out of the cage and thumps its head on the counter. Throws it up in the air and watches it plummet to the floor."

C: Now that's what I call a dead parrot.
O: No, no.....No, 'e's stunned!
C: STUNNED?!?
O: Yeah!  You stunned him, just as he was wakin' up!  Norwegian Blues
   stun easily, major.
C: Um...now look...now look, mate, I've definitely 'ad enough of this.
   That parrot is definitely deceased, and when I purchased it not 'alf an hour
   ago, you assured me that its total lack of movement was due to it bein'
   tired and shagged out following a prolonged squawk.
O: Well, he's...he's, ah...probably pining for the fjords.
C: PININ' for the FJORDS?!?!?!?  What kind of talk is that?, look, why
   did he fall flat on his back the moment I got 'im home?
O: The Norwegian Blue prefers kippin' on its back!  Remarkable bird, innit,
   squire?  Lovely plumage!
C: Look, I took the liberty of examining that parrot when I got it home,
   and I discovered the only reason that it had been sitting on its perch in
   the first place was that it had been NAILED there.
 
(pause)
 
O: Well, o'course it was nailed there!  If I hadn't nailed that bird down,
   it would have nuzzled up to those bars, bent 'em apart with its beak, and
   VOOM!  Feeweeweewee!
C: "VOOM"?!?  Mate, this bird wouldn't "voom" if you put four million volts
   through it!  'E's bleedin' demised!
O: No no!  'E's pining!
C: 'E's not pinin'!  'E's passed on!  This parrot is no more!  He has ceased
   to be!  'E's expired and gone to meet 'is maker!  'E's a stiff!  Bereft
   of life, 'e rests in peace!  If you hadn't nailed 'im to the perch 'e'd be
   pushing up the daisies!  'Is metabolic processes are now 'istory!  'E's off
   the twig!  'E's kicked the bucket, 'e's shuffled off 'is mortal coil, run
   down the curtain and joined the bleedin' choir invisible!!
   THIS IS AN EX-PARROT!

English has plenty of adjectives for modifying the temporal properties of nouns -- former, recent, late, forthcoming, and so on. We also have a few derivational processes with similar functions, like bride-to-be and ex-mayor. But for nouns, Indo-European languages lack anything analogous to their elaborate system for indicating tense, aspect and mood on verbs, some fragments of which English retains.

As the semantics of former and ex- indicate, there's no logical reason for this, and recently there's been increased interest in the many languages of the world where tense-aspect-and-mood (TAM) can or must be marked on nouns. An excellent review can be found in "Nominal Tense in Cross-linguistic Perspective", by Rachel Nordlinger and Louisa Sadler, forthcoming in Language.

Posted by Mark Liberman at 10:35 AM

September 10, 2004

The coming death of whom: photo evidence

Rather clear evidence of the approaching death of the accusative form of the human-gender relative and interrogative pronoun who may be found in the following photograph (it was apparently snapped by some reactionary student at Cornell University during a Columbus Day demonstration and sent in to the right-wing paranoid organization AcademicBias.com, which offers prizes on its website for photographic or filmed evidence that commies are taking over American campuses):

Yes, seeing is believing. That's two occurrences of whom in subject function, right there on a single defaced American flag.

There is an error in the plural of thief, too, but that one is in the direction of regularizing the irregular (regular *thiefs for the irregular thieves). Using whom for who isn't regularization. It's a desperately insecure clutching after a form that people no longer know where to use or how to control. Whom is like some strange object — a Krummhorn, a unicycle, a wax cylinder recorder — found in grandpa's attic: people don't want to throw it out, but neither do they know what to do with it. So they keep it around, sticking an m on the end of who every now and then when it seems like an important occasion. Columbus Day, for example, or when trying to impress a grammarian or a maitre d'hotel (whom will be our waiter tonight?).

Kiss whom goodbye. It is rarely heard in conversation now, and just about never in clause-initial position. This word is nearly dead. It is close to being no more. It has all but ceased to be. If it wasn't Magic-Markered onto a defaced flag from time to time it would be pushing up the daisies. This is almost an ex-word.

Posted by Geoffrey K. Pullum at 01:16 AM

September 09, 2004

Report on Yukon Native Languages

Somewhat to my surprise, as these things rarely make the news, CBC news is reporting on the release [pdf document] of the Profile of Yukon First Nations Languages, a new survey of Yukon native languages by the Yukon Aboriginal Language Services and the Yukon Native Language Centre. The good news is that the survey was done by visiting households and individually assessing the language ability of each member. That's the only way to obtain reliable information. All too often speaker numbers are based on self-reporting or on some individual's subjective estimate, neither of which is very accurate. The bad news is that the native languages of the Yukon continue to decline, with two, Han and Tagish, nearly extinct. In particular, the report indicates that the current generation of parents have very weak language skills and generally do not use their native language at home.

The Yukon devotes a fair amount of effort to language maintenance activities, through school programs and the Yukon Native Language Centre, which trains language teachers and produces teaching and reference materials. As in British Columbia, such native-language-as-second-language programmes have, however, had little impact on language loss.

Posted by Bill Poser at 10:12 PM

"Wake" as predicted track

"Keys evacuated in Ivan's wake", says the USA Today headline, "Posted 9/9/2004 12:35 PM" and "Updated 9/9/2003 1:50 PM". The thing is, at 2:00 PM, Ivan's center was at latitude 14.8 north, longitude 72.0 west, about 580 km southeast of Kingston, Jamaica. The NOAA's current 5-day forecast has Ivan hitting Key West on Monday morning, around breakfast time, almost four days from now.

If I were in the Florida Keys, I'd certainly be following the evacuation instructions. A "category 5" hurricane is no joke. But surely we're talking about Ivan's path, not Ivan's wake.

I guess that the headline writer doesn't really know what a wake is. Or at least, (s)he didn't think about the trail that a boat leaves in the water when using the word, taking it instead to mean something like "track". This is a bit surprising, given the popularity of kayaking, windsurfing and jet skis, as well as more conventional water craft.

It's more obvious why such bleaching should have happened to all the words dealing with horse gear, as I discussed back in January. How many of us have recent experience dealing with unbridled equines? So it won't surprise me, as Ivan bears down on the U.S., to find a journalist or two writing about its "unbrided fury".

[Update: as of the 6:48 p.m. update of this story, the headline is the more coherent "Keys evacuated as Ivan approaches".]

Posted by Mark Liberman at 04:50 PM

More on Lakoff on framing

Yesterday, AlterNet posted selections from chapter 1 of George Lakoff's forthcoming book Don't Think of an Elephant, as well as the book's Introduction (by Don Hazen).

The posted selections feature George's story about the metaphorical content of the phrase tax relief, as well as his riff on the strict father vs. nurturant parent dichotomy.

Reading the parenting discussion, I was reminded of the now-conventional point that Satan gets all the best lines in Paradise Lost. As William Blake put it

The reason Milton wrote in fetters when he wrote of Angels & God, and at liberty when of Devils & Hell, is because he was a true Poet and of the Devil's party without knowing it.

I don't say this because I'm a fan of James Dobson (whom Lakoff discusses at length), or for that matter of Satan. It's just that George's account of the "strict father model" seems crisp and sharp, while his description of the "nurturant parent model" seems fuzzy and distant in comparison. Read what he wrote, and see what you think.

It starts with the words he chooses for the names of the models, which contrast in word frequency (strict with 6,430,000 web hits on Google, vs. nurturant at 8,130 whG) as well as syllable count. In his discussion as excerpted on AlterNet, there's also a contrast in scope: in 340 words, the strict father section lays out a view of the world, human nature, and the roles of father and child; in 720 words, the nurturant parent section mentions a contrasting idea about human nature, but otherwise focuses on the attitudes and actions of the parents. There's no clear characterization of the relation of the family to the world outside it, and no discussion at all of the child's role in the family.

(Previous Language Log posts on Lakoff's analyses of political ideas and political language are here, here, here, and here).

[AlterNet tip by email from Abnu at WordLab]

Posted by Mark Liberman at 10:40 AM

Type like a pirate day

It's only nine days until Talk like a Pirate Day. Linguists know that talking is primary, but blogging is mainly a textual form, so some of you may want to Type Like a Pirate instead, using your trusty old Corsair ergonomic keyboard:

The link I used when I blogged this last fall is dead. There are lots of copies out there on the net, including this one from 9/19/2003, but I don't know who's the original creative force responsible for this picture. If someone will tell me (myl at cis.upenn.edu) I'll give a proper attribution.

Pop-culture kitsch aside, real pirates were (and still are) a pretty reprehensible group. One particular set of pirates played an important role in the early history of the United States: the Barbary Pirates, who operated with (city-)state support out of Tripoli, Tunis, Morocco and Algiers. After an independent United States lost the protection of the British government -- which paid subsidies or tribute to the pirates as protection money -- U.S. shipping was at risk, and Congress allocated $80,000 as tribute in 1784. However, in 1785, two American ships were captured by the Algerians, who asked for $60,000 in ransom for their crews. An on-going sequence of threats, tribute and ransoms eventually led to nearly 15 years of intermittent war.

An interesting discussion of this history, written by Gerald Gawalt, the manuscript specialist for early American history in the Manuscript Division, Library of Congress, can be found here. Some quotes are below:

In his autobiography Jefferson wrote that in 1785 and 1786 he unsuccessfully "endeavored to form an association of the powers subject to habitual depredation from them. I accordingly prepared, and proposed to their ministers at Paris, for consultation with their governments, articles of a special confederation."... "Portugal, Naples, the two Sicilies, Venice, Malta, Denmark and Sweden were favorably disposed to such an association," Jefferson remembered, but there were "apprehensions" that England and France would follow their own paths, "and so it fell through."

Paying the ransom would only lead to further demands, Jefferson argued in letters to future presidents John Adams, then America's minister to Great Britain, and James Monroe, then a member of Congress. As Jefferson wrote to Adams in a July 11, 1786, letter, "I acknolege [sic] I very early thought it would be best to effect a peace thro' the medium of war." ... "From what I learn from the temper of my countrymen and their tenaciousness of their money," Jefferson added in a December 26, 1786, letter to the president of Yale College, Ezra Stiles, "it will be more easy to raise ships and men to fight these pirates into reason, than money to bribe them."

Jefferson's plan for an international coalition foundered on the shoals of indifference and a belief that it was cheaper to pay the tribute than fight a war. The United States's relations with the Barbary states continued to revolve around negotiations for ransom of American ships and sailors and the payment of annual tributes or gifts. Even though Secretary of State Jefferson declared to Thomas Barclay, American consul to Morocco, in a May 13, 1791, letter of instructions for a new treaty with Morocco that it is "lastly our determination to prefer war in all cases to tribute under any form, and to any people whatever," the United States continued to negotiate for cash settlements. In 1795 alone the United States was forced to pay nearly a million dollars in cash, naval stores, and a frigate to ransom 115 sailors from the dey of Algiers. Annual gifts were settled by treaty on Algiers, Morocco, Tunis, and Tripoli.

When Jefferson became president in 1801 he refused to accede to Tripoli's demands for an immediate payment of $225,000 and an annual payment of $25,000. The pasha of Tripoli then declared war on the United States. Although as secretary of state and vice president he had opposed developing an American navy capable of anything more than coastal defense, President Jefferson dispatched a squadron of naval vessels to the Mediterranean. As he declared in his first annual message to Congress: "To this state of general peace with which we have been blessed, one only exception exists. Tripoli, the least considerable of the Barbary States, had come forward with demands unfounded either in right or in compact, and had permitted itself to denounce war, on our failure to comply before a given day. The style of the demand admitted but one answer. I sent a small squadron of frigates into the Mediterranean. . . ."

The American show of force quickly awed Tunis and Algiers into breaking their alliance with Tripoli. The humiliating loss of the frigate Philadelphia and the capture of her captain and crew in Tripoli in 1803, criticism from his political opponents, and even opposition within his own cabinet did not deter Jefferson from his chosen course during four years of war. ... Jefferson was able to report in his sixth annual message to Congress in December 1806 that in addition to the successful completion of the Lewis and Clark expedition, "The states on the coast of Barbary seem generally disposed at present to respect our peace and friendship."

In fact, it was not until the second war with Algiers, in 1815, that naval victories by Commodores William Bainbridge and Stephen Decatur led to treaties ending all tribute payments by the United States. European nations continued annual payments until the 1830s.

[Update: more here.]

Posted by Mark Liberman at 08:49 AM

September 08, 2004

Headlines from heads

At the British Association for the Advancement of Science's Festival of Science 2004, geoarcheologist Silvia Gonzalez presented evidence that the Pericues, an extinct Baja California tribe, are genetically "closer to the ancient populations of southern Asia, Australia, and the South Pacific Rim" than to the northern Asian populations that other Native Americans have been thought to have come from. Gonzalez also suggested that the two oldest known Americans (Kennewick Man and Peñon Woman) might have been from a similar background.

Here's a press release from the Natural Environment Research Council, which sponsored the research, and a Discovery Channel report, which quotes Gonzalez as follows:

"... it is difficult to trace their point of origin as people 10,000 or 20,000 years ago did not look like their modern counterparts in many parts of the world, including Africa, Europe, and China.

"It is likely that southeast Asia 20,000 years ago was inhabited by people who more closely resembled present-day Polynesians or Australian aborigines so this could indeed be a source for the first Americans. They could have taken a coastal route to get there around the North Pacific Rim — it seems unlikely that they came directly across the Pacific."

Reuters pitches this as a political "hornets' nest", asserting that "[t]he claim will be extremely unwelcome to today's native Americans who came overland from Siberia and say they were there first". This seems like a pretty sweeping statement, predicting the reactions of a large and diverse group, some of whom might also turn out to be descended from groups like those that Gonzalez is studying. By what right does Reuters speak for all these people?

Even the Discovery article, which (as Claire Bowern says) is quite good, expresses Gonzalez' findings in a linguistically odd way:

DNA analysis of skulls found in Baja California that belonged to an extinct tribe called the Pericues reveal that the Pericues likely were not related to Native Americans and that they probably predated Native Americans in settling the Americas, according to an announcement Monday.

In order to make sense of this, you have to agree that the Pericues, though they are native Americans, and perhaps even descendants of the first hominids to settle in North America, are not in fact Native Americans. But everyone used to think that they were Native Americans, before these results came along. So what if it turns out that lots of other native Americans are not Native Americans either?

With respect to the larger NERC program this is a part of, Reuters quotes Clive Gamble as saying "We want to make headlines from heads. DNA will give us a completely new map of the world and how we peopled it." I think there's more to say about this. Among other things, there's the question of whether "we" should be described in terms of our DNA or our culture -- or both. There's also a pretty scandalous history of over-interpretation of phylogenies based on DNA and other biological markers (as well as some excellent work of the same general kind). And (as Reuters reminds us) there are important political resonances to these origin stories. So all in all, I for one am not very enthusiastic about the idea of "science by headline" in this area.

I'd like to hear about how Gonzalez' results connect to Johanna Nichols' ideas about peri-Pacific linguistic typology, from someone who knows more than I do about such things -- like Bill Poser?

[via Claire at Anggarrgoon]

Posted by Mark Liberman at 12:52 PM

Trapped

At a campaign stop in Poplar Bluff MO on 9/6, President Bush was reported to have pushed his stand on tort reform by complaining that "too many OB/GYN's aren't able to practice their love with women all across the country."

As this sound clip makes clear, that's pretty close to what he actually said:

"Too many good docs are getting out of business. Too many O B G Y Ns aren't able to practice their-- their love with women all across this country."

What W meant to talk about is a real issue -- I know several women who have had to find a new doctor because their old one has left medicine, or at least has left his or her former practice. I understand that it's getting hard to find an OB-Gyn in this area who'll take new patients. The docs who are quitting cite the rising costs of malpractice insurance as one of the key factors. There is considerable controversy about causes and cures, but a 2003 GAO report did find that "losses on medical malpractice claims ... appear to be the primary driver of rate increases in the long run".

In any case, the president got himself into a phrase that he couldn't find a good way out of. There really are doctors who are no longer able to practice their ... what? They're practicing medicine, but you can't say that they're practicing their medicine, and you wouldn't want to say that they're practicing their medicine on women. Practicing their business? Their craft? Neither one is quite right. These doctors often protest that they love their work; perhaps another version of the stump speech talks about how malpractice insurance costs are preventing doctors from doing what they love to do; anyhow, out slipped that word love.

Of course, that sort of thing really happens sometimes too, at least in the euphemistic sense.

The repetition of their and the pauses before and after love make it clear that W understood that he'd gotten himself into a linguistic trap. He didn't just blindly stick in the wrong word, he just couldn't think of the right one fast enough.

It would have worked, I think, to say that "too many OB-Gyn's aren't able to practice their profession with women all across the country", though this is a bit awkward, and it might have been better to leave the women out of it. It would have been better still to start with the women, and say "too many women all across the country are finding that their OB-Gyn's can't practice medicine any more", or something like that. But once launched into the sentence "Too many OB-Gyn's aren't able to practice their...", I doubt that I could have found my way out, in real time, any more fluently than George W. Bush did. I like to think that I wouldn't have said something so embarrassing. But I also like to think that I wouldn't have missed Mookie Wilson's grounder. And I'm probably wrong in both cases.

The MS-NBC announcer, interestingly, committed several disfluencies in introducing the Bush clip, including mispronouncing (and partially correcting) the name of the town where the speech took place.

[Update: Daniel Davies blogged this yesterday at Crooked Timber, and in the comments, someone named Robbo wrote:

The guy's an embarrassment to us all. Even his most ardent admirers, if they're honest with themselves, feel some level of embarrassment whenever Bush unleashes a statement like this on our ears. And it happens a lot. In the end, I think his inability to speak contemporanesouly gives us our best shot at getting rid of him.

The trouble with this topic is that I can't tell whether Robbo has introduced a misspelled malapropism for "extemporaneously" on purpose, to be ironical, or in the natural course of events, creating a different sort of irony. I'm leaning towards the second hypothesis. ]

Posted by Mark Liberman at 06:26 AM

Egghorn

I can't tell if this is a joke, or a meta-eggcorn. It's not a simple slip of the fingers, as it occurs twice:

Speaking of the Internet (good segue subject for a blog, huh?), I discovered another egghorn - keyholed for keel-hauled. Unf'ly I can't find it now (yes, "unf'ly" is used in tribute to someone - who read it out as "unf-ly" and then "un-fly". At no point did they seem to twig that it was in any way connected to "unfortunately").

Following the egghorn trail, and I'd just like to plug Pom du Cap - an English guy living somewhere [with authentic vegetation] near Cape Town.

Other examples of egghorn on the internet are mostly references to the mountain, although there is also a charming ghost story. The cited weblog entry (from Anyway at Anyhoo) includes an interesting recipe for "panhagglety" (with 14 alternative spellings).

Posted by Mark Liberman at 06:24 AM

Speaking Puerto Rican

Speaking of closed captioning stupidities (which I discussed here and Mark Liberman explained more fully here), Lance Nathan of MIT writes to say that he backs the idea of very uninformed and poorly trained humans being involved. He reminds me of a story of a moment from the Oscars a few years ago. A reporter caught Benicio del Toro outside the theatre, and asked him whether his family would be watching. Yes, he said, they were at home in Puerto Rico. She asked whether he would like to say anything to them, and he nodded and said something in Spanish. The closed captioning read: "[speaks Puerto Rican]".

Says Lance: "mistakes like that do leave me with images of 16 year old captioning slaves chained to desks."

Posted by Geoffrey K. Pullum at 01:11 AM

Neanderthal Historical Linguistics

According to a recent article in the New Scientist:

One of a Neanderthal baby's first words was probably "papa", concludes one of the most comprehensive attempts to date to make out what the first human language was like.

The article is a report on a paper delivered by Pierre Bancel and Alain Matthey de l'Etang of the Association for the Study of Linguistics and Prehistoric Anthropology in Paris at a conference on the Origins of Language and Psychosis held at Oxford in July. According to the reports in the New Scientist and The Telegraph, Bancel and de l'Etang surveyed 1000 languages for which they were able to obtain detailed information on kinship terms and found that 700 of them contained the word "papa" with the meaning "father" or "male relative on the father's side".

"There is only one explanation for the consistent meaning of the word 'papa': a common ancestry," Bancel says.

To be precise, we can break their conclusions down into four claims. One is that all human languages are descended from a common ancestor, which I'll call Proto-World. The second is that in Proto-World there was a word meaning "father" whose sound was something like [papa]. The third is that Proto-World was the first language spoken by human beings, which we can call Proto-Human. The fourth is that Proto-Human was also spoken by Neanderthals. All four claims are dubious.

The observation that words like mama and papa are widespread in human languages is not new. In fact, it was made in the 1950s by the anthropologist George P. Murdock. In 1959, in response to an appeal for an explanation by Murdock, Roman Jakobson (my academic "grandfather") published a paper entitled "Why 'mama' and 'papa'?", in which he offered the explanation that the mama and papa words come about through the wishful thinking of parents. Before babies start to speak, they go through a period of what linguists call babbling, in which they experiment with their vocal tracts and make lots of meaningless noises. Parents don't realize this, though, and, eager to hear their child speak, attempt to interpret their baby's vocalizations as words. Naturally, they are keen on the idea that the baby is addressing them, so they assign the meanings "mother" and "father" to the baby's first "words". It happens that certain consonants, such as [p], [t], [b], [d], [m], and [n], are among the sounds that babies produce frequently in the early stages of babbling, as are vowels like [a], so the early "words" perceived by parents are things like [papa], [mama], and [dada]. They aren't actually words, but the parents perceive them as such and assign them the meanings "father" and "mother".

I won't go into this in further detail because the late Larry Trask wrote a very clear and readable essay on this topic entitled "Where do mama/papa words come from?" which can be downloaded here [pdf document]. He explains Jakobson's proposal in more detail and shows how it is far superior to alternatives.

Jakobson's proposal is generally accepted by linguists. That doesn't guarantee that it is correct, but as far as one can tell from the news reports, Bancel and de l'Etang have not attempted to refute it or to address the additional arguments made by Trask. Indeed, the news reports cite my colleague Don Ringe, who told them of Jakobson's explanation, but mention no rebuttal by Bancel and de l'Etang.

Jakobson's proposal isn't just a plausible alternative to common descent as an explanation for the frequency of words like papa with meanings like "father"; it is much more plausible. Larry Trask's piece explains why in some detail. To mention just one reason, if the Proto-Human word for "father" were indeed [papa], it is very unlikely that it would remain so similar in sound in so many languages. Jakobson's proposal, on the other hand, explains why the "father" words are so similar in sound: they aren't inherited but are constantly recreated. In sum, there is no good reason to take mama and papa to be evidence that all human languages are related.

Turning to the second point, even if the mama and papa words established that all of the languages known are genetically related, it wouldn't establish that these words were present in that form in Proto-World. We'd have to reconstruct the forms of these words to make any reasonable claim about what they sounded like, and the authors haven't even attempted a reconstruction. They aren't in a position to because they haven't established sound correspondences among the languages they are working with, and without them there is no basis for reconstruction.

The third claim, that the ancestor of all currently known human languages, Proto-World, was the language first spoken by human beings, Proto-Human, assumes that no top-level branches are unknown to us. We can only reconstruct to the lowest common ancestor of the languages for which we have data. Suppose, for example, that the only Indo-European languages known to us were Germanic languages. The best we could do would be to reconstruct Proto-Germanic. To reconstruct anything above the level of Germanic, we have to have data from languages outside of Germanic. If any languages branched off from Proto-Human before the lowest ancestor of the languages we know and became extinct, as is quite possible, Proto-World would be a language separated, possibly by thousands of years, from Proto-Human.
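The lowest-common-ancestor point can be made concrete with a toy family tree. The sketch below is an illustration only: the tree shape and node names such as "Extinct-Branch" and "Other-Family" are invented for the example and make no claim about real subgrouping.

```python
# Toy family tree as child -> parent links. Node names and shape are
# illustrative assumptions, not a claim about actual language history.
PARENT = {
    "Proto-World": "Proto-Human",
    "Extinct-Branch": "Proto-Human",
    "Proto-Indo-European": "Proto-World",
    "Other-Family": "Proto-World",
    "Proto-Germanic": "Proto-Indo-European",
    "Latin": "Proto-Indo-European",
    "English": "Proto-Germanic",
    "German": "Proto-Germanic",
}


def ancestors(node):
    """Return the chain from node up to the root, node first."""
    chain = [node]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def lowest_common_ancestor(languages):
    """Return the deepest node that lies above (or is) every given language."""
    chains = [ancestors(lang) for lang in languages]
    shared = set(chains[0]).intersection(*chains[1:])
    # Walking up one chain, the first shared node is the lowest one.
    return next(node for node in chains[0] if node in shared)


# With only Germanic data, reconstruction stops at Proto-Germanic.
print(lowest_common_ancestor(["English", "German"]))
# Data from every surviving family still reaches only Proto-World.
print(lowest_common_ancestor(["English", "Latin", "Other-Family"]))
```

In this toy tree, only data from the hypothetical extinct branch would push the common ancestor up to Proto-Human, and such data is by assumption unavailable.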

The last claim, that Proto-Human was spoken by Neanderthals, is just plain weird. There is no basis for inferring anything about Neanderthal language from historical inferences about human language. Neanderthals were a different species. We aren't sure whether they had language, much less whether it was genetically related to the language or languages of modern humans.

Ironically, the earliest human beings to have a language probably did have words like "mama" and "papa", but what tells us this is not historical linguistics but Jakobson's argument. If Neanderthals had language, and if their articulatory apparatus and cognitive systems were sufficiently similar to ours, they very likely also had words like "mama" and "papa" for "mother" and "father", for the same reason.

Posted by Bill Poser at 12:37 AM

September 07, 2004

Being a linguist, doing linguistics

A colleague was approached some time ago by someone considering advanced degree programs in linguistics, someone with a master's degree in a related field but no actual knowledge of linguistics. My colleague's question: what books to recommend that could provide some inspiration, some sense of what the goal is like? In my reformulation: what books could give a potential linguist some sense of what it's like to be a linguist, to do linguistics?

I found this a surprisingly difficult question. Not-bad introductions to linguistics aren't hard to come by, and there are some pretty good surveys of what has (or, actually, had) been done in the field: some of the chapters in Shopen's set Language Typology and Syntactic Description and in Newmeyer's Cambridge Survey of Linguistics, for example. But such works present the product of doing linguistics, not the activity.

For a feel for what it's like to do syntax, maybe Green & Morgan's Practical Guide to Syntactic Analysis.

For a sense of what it's like to do fieldwork and to discover something about the structure of a language, the two Shopen volumes Languages and Their Speakers and Languages and Their Status.

And for thought-provoking reasonably brief essays, the two books that I most often give to non-linguist friends who are interested in language: Bauer & Trudgill's Language Myths and, especially, Pullum's Great Eskimo Vocabulary Hoax.

But I'm a professional linguist and don't entirely appreciate what it's like to come at the field from the outside. (My first assigned texts were Gleason's and Hockett's, from another era, and the first surveys I read were Sapir's and Bloomfield's, wonderful books that I return to, but even older than my first texts.) I'm hoping that some of the readers of Language Log can point to things they've read -- not necessarily books, of course -- that they've found illuminating.

E-mail me at the address below, and after a week or two has gone by I'll summarize the results here.

zwicky at-sign csli period stanford period edu

Posted by Arnold Zwicky at 08:20 PM

Dr. Doolittle's Delusion

Steve Anderson has a new book out, Dr. Doolittle's Delusion: Animal Communication, Linguistics, and the Uniqueness of Human Language. It's reviewed by Donald McNeil Jr. in today's New York Times. I'm not able to get a permanent link for the review -- at least yet -- so read it while you can. Although the review says that the book is "to be published in November", amazon's web page cites a publication date of September 30. It also says that the volume "usually ships in 8 to 10 days". It's not logically consistent to assert this, on September 7, when you also claim that the same work will be published on September 30. Of course, that might depend on the meaning of "published", not to speak of "on". I don't imagine that Koko is worried about the inconsistency.

Posted by Mark Liberman at 04:20 PM

The globalization of daring and originality

I agree with David Beaver. We should "not allow mere facts to stand in the way of good journalism", which makes such a crucial contribution to what H.L. Mencken called

the daily panorama of human existence, of private and communal folly--the unending procession of governmental extortions and chicaneries, of commercial brigandages, and throat-slittings, of theological buffooneries, of aesthetic ribaldries, of legal swindles and harlotries, of miscellaneous rogueries, villainies, imbecilities, grotesqueries, and extravagances

all of which, as he went on to say,

is so inordinately gross and preposterous, so perfectly brought up to the highest conceivable amperage, so steadily enriched with an almost fabulous daring and originality, that only the man who was born with a petrified diaphragm can fail to laugh himself to sleep every night, and to awake every morning with all the eager, unflagging expectation of a Sunday-school superintendent touring the Paris peep-shows.

Mencken's goal in the quoted essay was to explain why he chose not to join so many of his fellow Americans in emigrating:

Their anguish fills the Liberal weeklies and every ship that puts out from New York carries a groaning cargo of them, bound for Paris, London, Munich, Rome and way points--anywhere to escape the great curses and atrocities that make life intolerable for them at home.

But in this era of globalization, there's no longer any need to travel in order to verify that the rest of the world has caught up to America in the production of "rogueries, villainies, imbecilities, grotesqueries, and extravagances" -- if indeed there was ever an "imbecility gap" outside the perceptions of provincial intellectuals. Perhaps the foreigners have even surpassed us. I believe that if you search our weblog's archives for Reuters, the Guardian and the BBC, you'll find even more "fabulous daring and originality" than if you search for the Associated Press, the New York Times or the Washington Post.

And this is not because we treat domestic outlets differently from foreign ones. Here at Language Log, we try to share with others our own innocent delight at this parade of wonders, whatever the source. The world, and our weblog, would be a duller place without it.

Different people applaud in different ways, of course. When Geoff Pullum wrote that

The claim quoted is thus not just false but, staggeringly, overwhelmingly false. It is perhaps the falsest claim ever discussed on Language Log (though of course this is debatable; the BBC's science reporting constantly struggles to stay ahead in wild falsehoods). It is a very good example of the sort of claim we think people should stop making about language use. Difficult to find a case of "bisexual" that does not have "chic" after it is what he said. Utterly untrue. These things can be checked, often in under half a second of Google time.

he was celebrating this marvelous example of Australian daring and originality in his own special way. And when David Beaver wrote that "I couldn't find myself disagreeing more strongly with Geoff", he was agreeing in his own characteristic fashion. I, of course, agree with both of them.

Posted by Mark Liberman at 03:38 PM

No abstract concepts for them

Yes, the Pirahã, most recently reported on in Language Log here and here. The manglings and exaggerations of the story have now reached the endpoint: the claim that the Pirahã have no abstract concepts at all.

And the claim appears on sci.lang, a newsgroup intended for the discussion of language from a scientific point of view. Shame, shame. The exchange:

From: Christopher Koppler

Re: languages in Russia

Date: Thu Sep 02 22:42:23 PDT 2004

On Thu, 02 Sep 2004 19:12:51 -0400, Keith GOERINGER wrote:

In article <7df91bca.0409012331.2965ca84@posting.google.com>, tyusha@freemail.ru (Xenia) wrote:

So what is the basis for your confidence? Any dipolmat will tell you notorious stories how hard it is to write a treaty in Kazakh, because the nomadic language lacks any abstract concepts, let alone terms of international law.

No language springs from the lips of its speakers fully formed -- it evolves over time. And to say that a language "lacks any abstract concepts" is an absolute -- and it is one that is absolutely false, to boot.

To mingle with another recent thread here, the only language known that really seems to lack any abstract concepts (number, color, anything outside of personal experience) is the language of the Amazonian Pirahã.

Number and color are outside of personal experience? What about shape and size? Generation? Kin relationship? Goodness? (Feel free to extend this list.) And, no, Peter Gordon and Dan Everett shouldn't feel obliged to respond.

zwicky at-sign csli period stanford period edu

Posted by Arnold Zwicky at 01:32 PM

True to media type

It's surprising that something becomes news, sometimes. But once it does, the treatments are pretty much predictable. At least, they often tell you as much about the information sources as about the information provided.

Back on April 15, this article by Eric Weisstein appeared in MathWorld Headline News:

Russian mathematician Dr. Grigori (Grisha) Perelman of the Steklov Institute of Mathematics (part of the Russian Academy of Sciences in St. Petersburg) gave a series of public lectures at the Massachusetts Institute of Technology last week. These lectures, entitled "Ricci Flow and Geometrization of Three-Manifolds," were presented as part of the Simons Lecture Series at the MIT Department of Mathematics on April 7, 9, and 11. The lectures constituted Perelman's first public discussion of the important mathematical results contained in two preprints, one published in November of last year and the other only last month.

[...]

Stripped of their technical detail, Perelman's results appear to prove a very deep theorem in mathematics known as Thurston's geometrization conjecture. Thurston's conjecture has to do with geometric structures on mathematical objects known as manifolds, and is an extension of the famous Poincaré conjecture. Since Poincaré's conjecture is a special case of Thurston's conjecture, a proof of the latter immediately establishes the former.

Perelman's work had been reported in Science News back in June of 2003, Mark Kleiman blogged about it in December of 2003, the Boston Globe had a story on December 30, and Charles Kuffner blogged about it on January 2, 2004, among other things that you can find on the first page that Google returns for {Perelman Poincare}.

But yesterday, Keith Devlin talked about Perelman's proof at the British Association for the Advancement of Science's Festival of Science 2004 (main program here), and this (or the associated publicity) was taken by some journalists as an announcement of something newsworthy.

So Reuters ran a piece in its "Oddly Enough" category, exclaiming about how Perelman "has simply posted his results on the Internet and left his peers to work out for themselves whether he is right". Since the proof is not news, I guess some Reuters editor decided to treat it as a human interest story. At least, that's the charitable interpretation. Apparently the fact that Perelman traveled to MIT to give a series of lectures doesn't matter, perhaps because it happened last year, and in any case spoils the story line.

The Reuters story also indicated its reporters' and editors' deep appreciation of practical mathematics by adding an extra three orders of magnitude in the currency conversion process from dollars to pounds:

A reclusive Russian may have solved one of the world's toughest mathematics problems and stands to win $1 million (560 million pounds) -- but he doesn't appear to care.

Though perhaps they had the Lebanese pound in mind, in which case the error is only a factor of three.
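For anyone who wants to check the arithmetic, here is a minimal sketch. Both exchange rates are assumptions rather than figures from the Reuters story: 0.56 pounds sterling per dollar is simply what the story's own digits imply, and roughly 1,500 Lebanese pounds per dollar is an approximate 2004 rate supplied for illustration.

```python
# Rough check of the Reuters conversion. Both rates are assumptions:
# 0.56 GBP/USD is implied by the story's own digits, and ~1,500 LBP/USD
# is an approximate 2004 figure supplied here for illustration.
import math

PRIZE_USD = 1_000_000
REPORTED_POUNDS = 560_000_000

GBP_PER_USD = 0.56
LBP_PER_USD = 1_500


def error_factor(reported, rate_per_usd, amount_usd=PRIZE_USD):
    """Ratio of the reported figure to the correctly converted amount."""
    return reported / (amount_usd * rate_per_usd)


sterling_factor = error_factor(REPORTED_POUNDS, GBP_PER_USD)
lebanese_factor = error_factor(REPORTED_POUNDS, LBP_PER_USD)

# Orders of magnitude by which the sterling figure is too large.
print(f"sterling: too large by 10^{math.log10(sterling_factor):.0f}")
# How far short the figure falls if Lebanese pounds were meant.
print(f"Lebanese: too small by a factor of {1 / lebanese_factor:.1f}")
```

Under these assumed rates the sterling reading is a thousandfold too large, while the Lebanese reading is too small by somewhat less than three.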

At the Guardian, Tim Radford combined Perelman's work with Louis de Branges' alleged proof of the Riemann Hypothesis to predict doom and disaster for life as we know it, or at least for the internet. Some lesser outlets have picked up Radford's disaster-mongering, so in case you feel the urge to stockpile drinking water and check your ammunition supplies, here's what MathWorld said about de Branges a couple of months ago:

Riemann Hypothesis "Proof" Much Ado About Nothing
A June 8 Purdue University news release reports a proof of the Riemann Hypothesis by L. de Branges. However, both the 23-page preprint (from 2003) cited in the original release and a 124-page preprint (from 2004) cited in a back-dated modified release seem to lack an actual proof. Furthermore, a counterexample to de Branges's approach by Conrey and Li has been known since 1998. The media coverage therefore appears to be much ado about nothing.

The BBC manages to quote Keith Devlin as (apparently) stating that Poincare's conjecture for n=3 -- what Perelman seems to have proved -- is false:

"One of the odd things about this conjecture is that if you go even higher in dimensions - four, five, six manifolds, the Poincare Conjecture is true as it is for two manifolds (dimensions)," said Dr Devlin.

"But it fails for three manifolds. The one case that is really of interest in physics is the one case in which it fails."

What he meant, of course, was not that the conjecture fails for the case of n=3, but that it had remained open for n=3, and has now apparently been shown to be true in that case as for other values of n. Here's a detailed account, from MathWorld, some version of which Devlin no doubt explained to the BBC's reporter:

In the form originally proposed by Henri Poincaré in 1904 (Poincaré 1953, pp. 486 and 498), Poincaré's conjecture stated that every closed simply connected three-manifold is homeomorphic to the three-sphere. Here, the three-sphere (in a topologist's sense) is simply a generalization of the familiar two-dimensional sphere (i.e., the sphere embedded in usual three-dimensional space and having a two-dimensional surface) to one dimension higher. More colloquially, Poincaré conjectured that the three-sphere is the only possible type of bounded three-dimensional space that contains no holes. This conjecture was subsequently generalized to the conjecture that every compact n-manifold is homotopy-equivalent to the n-sphere if and only if it is homeomorphic to the n-sphere. The generalized statement is now known as the Poincaré conjecture, and it reduces to the original conjecture for n = 3.

The n = 1 case of the generalized conjecture is trivial, the n = 2 case is classical (and was known even to 19th century mathematicians), n = 3 has remained open up until now, n = 4 was proved by Freedman in 1982 (for which he was awarded the 1986 Fields Medal), n = 5 was proved by Zeeman in 1961, n = 6 was demonstrated by Stallings in 1962, and n >= 7 was established by Smale in 1961 (although Smale subsequently extended his proof to include all n >= 5).

The BBC also has Devlin saying that "manifolds" are "dimensions", but at least they left telepathy out this time.

People on Slashdot took the opportunity to argue about attitudes towards money and fame. There was also some discussion about the question of whether putting papers up on the internet constitutes "publication" (as required by the terms of the Clay prize) or not.

The recent story was also picked up by many other outlets, though not yet by many in the blogosphere -- perhaps because it's not really news. There are just four references in Technorati so far, all simply pick-ups of a wire service story.

Devlin's talk seems to have been part of a session entitled "Million Dollar Maths", introduced in the Festival's program this way:

Million dollar maths
The Clay Mathematics Institute's seven prize problems are providing exciting challenges for mathematicians in the new Millennium. This event looks at the problems: the Riemann Hypothesis, Navier Stokes equations, Poincare conjecture, Yang-Mills theory, P vs NP and the Birch and Swinnerton-Dyer conjecture which hold eternal fascination for mathematicians, and also for those who follow the story of their solution.

The speakers were Marcus du Sautoy on the Riemann Hypothesis (he doesn't mention de Branges' claimed proof in his abstract, but perhaps discussed it in his talk), Simon Singh on the Clay $1M challenge, and Keith Devlin on the Poincaré Conjecture.

[Update: If you're interested in the (very interesting) background of the Louis de Branges story, there's a July 2004 London Review of Books article by Karl Sabbagh, and a paper by Louis de Branges himself entitled "Apology for the Proof of the Riemann Hypothesis", dated 8/10/2004.]

Posted by Mark Liberman at 08:12 AM

Bisexual chic is back already

I'm afraid that I couldn't find myself disagreeing more strongly with Geoff, who criticizes the claim that the word bisexual is usually followed by the word chic in the mainstream press. Of course, the facts are on his side, as usual. There is, as it happens, a single word that follows bisexual over half the time in the news, and it ain't chic. Any idea what it is?

And.

Yes, and. If you do the search "bisexual and|&" in Google news (gee, I didn't know you could use disjunctions in the middle of Google string searches. Well, you can, and it's pretty damn useful. E.g. I've recently been looking at there-insertion using queries like "there s|is|are a|1..1000 * linguist|linguists", which matches e.g. "There are 8 indiginous linguists in Columbia". Gee, I didn't....) you'll find that simple conjunctions account for 498 of 990 hits on bisexual. Like so many frequency counts, this represents another dull triumph for function over content. So yeah, it's true that Geoff is factually correct. Totally. 100%. Again. But what a boring world this would be if we listened to Geoff all the time just because he's right. Pullumizing the article in question, the crucial sentence would read:

It's difficult to find a piece of writing in the mainstream press which mentions the word bisexual without finding that it is immediately followed by the word and or possibly transgender(ed), or, man/men, gay, people/person, woman/women, who, community/communities, but, character(s), or any of a few hundred more words or an item of punctuation.

Yawn. I stand with Bi-Victoria newsletter writers and BBC science commentators here: let us not allow mere facts to stand in the way of good journalism. Animals can talk, eskimos' brains are so warped by the gazillions of snow words in their heads that they have to take an extra long nap after lunch, and bisexual chic is in.

By the way, there is a charming footnote at the end of the newsletter piece Geoff linked to, which begins:

Footnote

* A reader of this article pointed out that monogamy and racism are not morally equivalent.

Nooooo? Really? Geoff, you nit-picker, will you please leave the poor journalist alone?

Posted by David Beaver at 04:38 AM

September 06, 2004

Bisexual chic: the facts

Many journalists seem to have the impression that the co-occurrence of certain word pairs is a good indicator of the way the culture is going, and so it may sometimes be. But one of the things the people posting on Language Log have tried to stress is that it is important to see that such claims are empirical: they could be wrong, they could be right, and you've got to do some work to determine which by looking at the way things actually are. What we are trying to fight against is this sort of thing:

It's difficult to find a piece of writing in the mainstream press which mentions the word 'bisexual' without finding that it is immediately followed by the word 'chic'.

This appeared in an article in a newsletter for bisexuals in Victoria, Australia, and was quoted by its author here.

Now, the claim is an empirical one of a particularly clear sort. So, using Google News, we check the figures: for "bisexual", 984 hits in currently indexed news articles in English. For "bisexual chic": zero. Turning to the entire web (not just news sources), for "bisexual" we get 2,390,000 hits; and for "bisexual chic", a mere 870.

Could the author have meant to refer only to Australia? Google can check that too. In Australia we have: for "bisexual", 36,600 hits; and for "bisexual chic", 3. Only the latter figure is an overcount by 50%, because one of the hits is for the above quote itself. There are, therefore, exactly 2 hits for "bisexual chic". For a number like that, I am prepared to do an exhaustive reading of the text for all hits. One appears to be on the personal website of a creative writing student at Macquarie University. The other is a discussion of (mostly American) TV programs at Queerplanet, a gay website in Australia. Hence the number of hits at Australian mainstream press sites is zero.
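Those figures are easy to turn into rates. The sketch below uses only the counts quoted in this post; treating a page-level hit count as a stand-in for how often one word follows another is a simplifying assumption, and the dictionary and function names are just illustrative.

```python
# Hit counts as quoted in this post (Google, September 2004).
# Hits are pages, not word occurrences, so the ratios are only a rough proxy.
counts = {
    "news": {"bisexual": 984, "bisexual chic": 0},
    "web": {"bisexual": 2_390_000, "bisexual chic": 870},
    "australia": {"bisexual": 36_600, "bisexual chic": 3},
}


def follow_rate(corpus):
    """Share of 'bisexual' hits that are also 'bisexual chic' hits."""
    c = counts[corpus]
    return c["bisexual chic"] / c["bisexual"]


for corpus in counts:
    print(f"{corpus}: {follow_rate(corpus):.4%}")

# One of the 3 Australian hits is the quote itself, leaving 2 genuine ones;
# 3 reported against 2 genuine is an overcount of 50%.
reported, genuine = 3, 2
overcount = (reported - genuine) / genuine
print(f"overcount: {overcount:.0%}")
```

On every one of these collections the rate comes out well under a tenth of one percent.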

The claim quoted is thus not just false but, staggeringly, overwhelmingly false. It is perhaps the falsest claim ever discussed on Language Log (though of course this is debatable; the BBC's science reporting constantly struggles to stay ahead in wild falsehoods). It is a very good example of the sort of claim we think people should stop making about language use. Difficult to find a case of "bisexual" that does not have "chic" after it is what he said. And that is utterly, outrageously untrue. These things can be checked, often in under half a second of Google time. Check them! I have spoken.

[Update 9/8/2004: Semantic Restructuring commented on this post, as an instance of "the growing reliance on google for statistical fact checks". I've added this link, at his request, because Trackback was not enabled at the time. --- MYL]

Posted by Geoffrey K. Pullum at 04:59 PM

Why I don't use A9 much

Back in April, John Battelle blogged about it (here and here), Cory Kleinschmidt reviewed it on Traffick, Pamela Parker wrote about it at ClickZ News, and so on. I'm talking about Amazon's A9 search spin-off, and the "search inside the book" facility it offers. A9 offers two kinds of search -- web search, which is just Google's results repackaged, and "search inside the book", which is the main value added as far as I'm concerned.

I was pretty excited about this when it first came out, and I still have some hopes for the enterprise. But as things have turned out, I haven't really been able to use it for much. The number of cases where it tells me something that I hadn't already learned from Google is small, and the number of cases where it tells me nothing of value at all is large.

There seem to be roughly three reasons for this:

First, there are no quoted strings.

On Google, you can search for "to be or not to be" and get 169,000 pages that actually include the quoted string. If you search A9 for a quoted string, you always get no results (in the books category) -- quoted strings just don't work.

For unquoted word sequences, A9's results ranking algorithm seems to give precedence to results in which the words are near one another in the same order. As a result, some such searches work. For example, after Edward Everett wrote to Abraham Lincoln that "I should have been glad if I could flatter myself that I came as near to the central idea of the occasion in two hours as you did in two minutes", Lincoln wrote back that "In our respective parts yesterday, you could not have been excused to make a short address, nor I a long one."

Of course, searching Google for "you could not have been excused to make a short address" returns 8 pages about Lincoln's letter.

Searching A9 for the same words (without the quotes) returns 26,735 pages, of which the top two are relevant:
p. 152 of The Civil War: Strange and Fascinating Facts, by Burke Davis; and
p. 67 of Talking Politics: The Substance of Style from Abe to W, by Michael Silverstein (where I took the quote from originally).
But after that, the results go bad in a hurry. The next few returns are to completely irrelevant pages of Adam Haslett's You Are Not a Stranger Here; Renee Rosenblum-Lowden's You Have to Go to School--You're the Teacher!; and Beverly Engel's Loving Him Without Losing You: How to Stop Disappearing and Start Being Yourself. As far as I can see, it doesn't get any better after that.

Searching for "to be or not to be", there's no cream to skim. The top four results (of 234,822) are:

1. Hugh Hewitt's If It's Not Close, They Can't Cheat: Crushing the Democrats in Every Election and Why Your Life Depends on It;
2. Behrendt & Tuccillo's He's Just Not That Into You : The No-Excuses Truth to Understanding Guys;
3. Woodall et al.'s What Not to Wear;
4. that classic of Shakespearean drama, NOT "Just Friends": Rebuilding Trust and Recovering Your Sanity After Infidelity, by Shirley Glass.

It didn't get any better on subsequent pages, at least not before my patience wore out.

Where do they get these links from? Believe me, they don't reflect a generalization of amazon's experience of my own recent personal book-buying history. Instead, the list seems to be some effectively random amalgam of bag-of-words hits and amazon sales rank.
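To see why bag-of-words matching gives such results, here is a minimal sketch of the difference between a quoted-string search and a bag-of-words match. It is not A9's or Google's actual algorithm, which I have no way of inspecting; the function names are hypothetical, and the second sample page is a made-up sentence in the spirit of the titles above.

```python
import re


def tokens(text):
    """Lowercase word tokens, ignoring punctuation."""
    return re.findall(r"[a-z']+", text.lower())


def phrase_match(query, page):
    """True if the query words occur contiguously and in order (a quoted search)."""
    q, p = tokens(query), tokens(page)
    return any(p[i:i + len(q)] == q for i in range(len(p) - len(q) + 1))


def bag_of_words_score(query, page):
    """Fraction of distinct query words found anywhere on the page, in any order."""
    q, p = set(tokens(query)), set(tokens(page))
    return len(q & p) / len(q)


query = "to be or not to be"
hamlet = "To be, or not to be, that is the question."
# Invented sample text, not a real page from any of the books listed.
self_help = "What not to wear, or how to be yourself."

for page in (hamlet, self_help):
    print(phrase_match(query, page), bag_of_words_score(query, page))
```

Both pages contain every word of the query, so a pure bag-of-words score cannot tell them apart; only the phrase test singles out Hamlet. With common words like these, something else (sales rank, say) ends up deciding the order.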

Second, only a limited subset of books are indexed.

If you search A9 for marthambles, you'll find the example on p. 244 of Dorothy Dunnett's The Ringed Castle that I blogged about (also here and here, if you're interested), and a reference to p. 186 of Dean King's Patrick O'Brian: A Life, but you won't be able to answer that burning question "where in O'Brian's novels does the word marthambles occur?", because his novels aren't indexed.

As of 10/2003, a story reports that amazon had indexed "over 33 million book pages from over 120,000 titles". Presumably more have been added since, though I can't find any more recent counts. However, there must be many more out there to index -- there appear to be about 120,000 (distinct) books published each year in the U.S.

Third, sales rank is rarely a good substitute for page rank.

Search Google for "animal communication" or "first rule of fiction", and you'll find some useful links on the first couple of pages. Skim a dozen or so of the links, and you'll get a pretty good sense of what's going on.

Now search A9 for the same word sequences. In the case of "animal communication", you'll find that three of the top ten results are about telepathic communication with animals, and four others are about how to communicate with your horse or your cat. Three are expensive scientific tomes in which you can't actually read anything except by purchase -- they've been found because animal and communication are in the title. Oddly, one of the links is to Norbert Wiener's classic Cybernetics, which I've read several times without really noticing the subtitle "Control and Communication in the Animal and the Machine". Everyone should read this book, of course, but if you ordered it with the idea that it would tell you anything about animal communication, you'd be sadly disappointed.

In the case of "first rule of fiction", you'll get four references to Terry Goodkind's Wizard's First Rule; a link to Ann Blakely's Never Wear Panties on a First Date and Other Tips; Daniel Magida's The Rules of Seduction; Heather Lewis' novel House Rules; and Ann Rule's novel Possession (!). Enjoy...

Posted by Mark Liberman at 02:55 PM

Postcards from eggcornea

Here in Eggcornea we're riding a wave of mail set in motion by earlier LL postings, here and there. The mail has brought us some old standards, the darlings of the usage dictionaries; some hidden eggcorns; a cross-language eggcorn almost as good as pre-Madonna; and a crop of nominees for the Internet Eggcorn Galleria. And we've been moved to muse about the line between eggcorns and creative spellings, plain ol' malaprops, and syntactic blends; to examine the (very close) connection between eggcorns and puns; and to defend eggcorns against the criticism that they erase history.

1. Old standards. The LL postings sparked discussion on the newsgroup soc.motss (8/27-28), including queries about born/borne, tack/tact, and eminent/imminent (type 2 diabetes mellitus is "imminently controllable"). These are old standards, most of which are covered (to some extent) in MWDEU, Garner, Paul Brians's Common Errors in English, and similar compendia.

It would be worth some trouble to assemble the "confusions" in these compendia. This would take some judgment, since not all of the linked items are eggcorns. Some are just very common (inadvertent) misspellings or Fay/Cutler malapropisms, with no reanalytic motive at all. Some are words of very similar, overlapping, meaning (partly/partially) that the compendia are trying to subtly discriminate (often in inventive ways).

2. Hidden eggcorns. Three candidates have come to my attention.

2.1. the die is cast. Keith Ivey wrote on 8/28 to say:

When I first heard the phrase "the die is cast", I thought it meant that a mold for stamping out coins (for example) had already been produced from molten metal and thus set and could not be changed. I later learned that it referred to throwing a gaming cube. Apparently I'm not alone in having had this misapprehension.

Ivey unearthed a page with a couple of paragraphs "correcting" the gaming-cube interpretation:

Perhaps you have heard the phrase 'the die is cast' or 'the die has been cast'. This has nothing to do with gambling or dice; instead, it refers to a mold (die) which has been cast (made).

Once the mold is made, everything which comes from it, will have the shape of the mold. 'The die is cast' thus states that a pattern has been laid down, and thus subsequent events will conform to the pattern. This phrase lends itself to assumptions about the future being predictable, once patterns are seen in the present.

See also the pages here and here. Lots of people have reanalyzed the die related to dice as die 'mold', and some of them are entirely sure they're right.

2.2. passion play, the passion of the Christ. Also on 8/28 -- a big day in Eggcornea -- Nikita Ayzikovsky wrote to remind me that many people have probably reanalyzed passion in these expressions, from 'suffering' to 'intense feeling'.

2.3. beg an answer/solution. More recently, I came across the following in a review of the tv series "Hawaii" by Alessandra Stanley, in the New York Times of 9/1/04, p. B6, and reported on it in ADS-L:

There are no female detectives at headquarters, just a sultry young policewoman who aspires to be one. She is the lust object for two young investigators. Their silent, narrow-eyed stare contests are so smoldering that they almost beg Sergio Leone theme music.

The story starts with the technical idiom beg the question. The alt.english.usage faq page describes the beginning of the development:

Many people unaware of the technical meaning of "to beg the question" in logic use it in one of two looser senses. The first of these, "to evade the question, to duck the issue", is attested since 1860 (WDEU). The second, "to invite the obvious question, (with an inanimate subject) to raise the question", is now the most commonly heard use of the phrase, although we have found no mention of it prior to The Oxford Guide to English Usage, 1st edition (1983), and it is not yet in most dictionaries.

What's going on here is a reanalysis of the technical verb beg as a closer and closer approximation to the ordinary verb beg (for) 'ask for'. A quick Google search shows that beg the issue has developed roughly the same range of meanings as beg the question. Beg with some other objects, like the possibility, seems to have only the 'invite' sense: "Finally, of course, a weak dollar begs the possibility of higher interest rates. Mr. Greenspan's refusal..." (209.157.64.200/focus/f-news/1056180/posts).

The end development is beg + object as straightforwardly involving the ordinary verb beg, and we get things like beg an answer and beg a solution: "Trinitarian theology begs an answer to the question: 'What on earth happened to the Holy Spirit?'. Who is the Holy Spirit?" (www.biblicalunitarian.com/html/modules.php?name=News&file=article&sid=84); "Unfortunately, there is one issue that still begs a solution, and provides a challenge to creating a mutually acceptable situation." (shipwreck.net/gsarticle04.html). And then we're in a position to beg 'beg for' just about anything, even Sergio Leone music. Well, some people are.

3. Cross-language eggcorns. We saw the marvelous pre-Madonna in my last posting. Now David Fenton, in soc.motss (on, yes, 8/28), recalls the pre-fix menu he encountered at a D.C. restaurant a few years ago. Well, the cost is fixed ahead of time, right? Slightly Frenchier is pre-fixe: "Dinner : $20 Weekly Pre-Fixe. For this week of August 30th-September 5th, 2004. ... With Pre-fixe Menu Only... Solano Grill & Bar, Inc." (www.solanogrillandbar.com/menus/prefixe.htm).

4. New candidates. Here are some fresh eggcorn candidates. Previous caveats and warnings still apply.

4.1. expatriate > ex-patriot. Margaret Marks suggested (on 8/28!) that Language Log must already have mentioned this one. Apparently not. Google brings up several sites, some offering medical insurance to expatriots or ex-patriots. There's even an expatriot.fazzle.com site.

4.2. nip in the bud > nip in the butt. Also on 8/28, Patrick Linehan, who works in a hospital Emergency Department, wrote to tell me about a patient who came in with a sore throat of only one day's duration. "She said she came in so soon because she wanted to 'nip it in the butt'."

4.3. god awful > god offal. Also from Nikita Ayzikovsky on National Eggcorn Day (8/28, mark your calendars): god offal. The rare offal for the common awful is something of a surprise, and might belong in the "creative spelling" category (below). But there are a fair number of examples from Google. For instance, the complaining consumer: "i just slapped on a god offal french accent, called the pringles company, told them that my cheesy pringles werent cheesy enough, and they are sending me..." (www.livejournal.com/users/bitchonheels/62927.html). And the vexed climber: "I've been at this belay station for over an hour. It's the most god offal uncomfortable belay I've ever been at. A cross between hanging and sitting..." (www.tumtum.com/climbing/stories/94-09-07-Dierdre).

4.4. behind the throne > behind the thrown. Earlier, I reported on unthrone > unthrown, which I took to be mere creative spelling. But now John McChesney-Young (8/31) writes to offer a letter his father wrote to the editor of the on-line (subscription) publication by stratfor.com: "In your Basic Global Intelligence Brief for 31 August, the mention of 'Jiang's plan to remain behind the thrown.' makes me wonder if this is somehow related to being behind the curve, if a curve ball has been throne!"

McChesney-Young goes on to report that he's "surprised how common it turns out to be: Google web finds 'behind the thrown' in about 675 pages (in Usenet about 276 postings) and 'power behind the thrown' at 81 unique web hits (Usenet 39 unique postings). The former phrase is sometimes used legitimately, e.g., 'Keep the carried ball BEHIND the thrown ball'."

4.5. be garbled > be gargled. On 9/1, "hondacivic@whoever.com" posted to soc.motss to complain: "What's with the gargled sounds on cfrb. For the last few weeks now, ads, little tunes, often sound gargled." David Fenton cried, "EGGCORN ALERT!!!!" On 9/3, Jed Davis followed up: "I actually read through the OP, and it seems to be taking issue with certain processed sound artifacts that, apparently, remind one of the sound of a person gargling. Thus, gargled sound." Meanwhile, a bunch of other Google hits for "gargled sound" (only a few of them about phonetics) and one for "sound was gargled" suggest that the expression has some currency, at least for sounds. (No hits on "message was gargled".)

Well, this is a tricky one. It depends on whether users of "gargled" distinguish it from "garbled", or whether some use it to cover the territory that the rest of us use "garbled" for.

4.6. pinecone > pinecomb. And now, on 9/5, the helpful John McChesney-Young passes on a posting to STUMPERS-L from Diane Rainaud:

While reading a book just now my daughter came across the expression "..the pinecomb doesn't fall far from the tree." She was surprised as she had always thought the "thing" that grows on a pine tree was a pinecone. Our dictionary does not include the word "pinecomb", and Googling turns up some odd references such as this caption under a picture in a medical article:

"Cystogram showing the typical appearance of a spastic neurogenic bladder. The findings are sometimes referred to as the "pinecomb" or "Christmas tree" appearance."

Can anyone shed any light on the word pinecomb?

Looks like an eggcorn to McChesney-Young and me.

5. Creative spelling. Some time back (on 7/6/04), Mark Liberman posted here about eggcorns that were "just non-standard spellings". Mark's first example was whittle > widdle. These, I think, don't deserve to be called eggcorns at all; there's no shift in the identification of parts of expressions.

Back on National Eggcorn Day (8/28), Anthony Jukes pointed me to a truly wonderful cross-language respelling, voilà > walaa, many many examples of which can be found by googling on "and walaa". For instance: "You can pad the pipe to your liking (I use very thin but dense foam & duct tape, a friend uses foam then covers it with hemp rope) and walaa!" (www.martialartsplanet.com/forums/search/topic/14816-1.html).

Ok, not an eggcorn. But delightful.

6. Plain ol' malaprops. Recall that eggcorns are reanalytic (classical) malapropisms. There are, of course, plenty of plain ol' malaprops around.

6.1. epidemic > epigrammatic. Back on NED (8/28), Mark Mandel posted to ADS-L about a Nigerian scam spam letter he'd received, which began:

I am the above named person from Ghana. I am married to Dr Alfred Williams who worked with Ghanaian embassy in South Africa for nine years before he died in the year 2002.We were married for eleven years without a child. He died after an epigrammatic illness that lasted for only four days. Before his death we were both Christians.

Over the next day or two, the ADS-Lers wrestled with what epigrammatic was intended to convey. The consensus was epidemic, which is phonologically rather distant, but then classical malapropisms sometimes range pretty far phonologically from their targets; a minority report from Doug Wilson came down for brief, which would be a semantic error rather than a classical malapropism. The discussion was taken to a new level by Africanist Herb Stahlke, on NED+1:

I suspect this may have been more than just the normal malaprop. In the Nigerian English of the sort who seem to send the emails there is a strong tendency towards what they sometimes call "fine talk", using the biggest, most learned sounding words they can find whether they make sense or not. It's the sound and overall impression they are going for. But you have to hear it from within Nigerian culture to appreciate it. There's a 1998 novel by Karen King-Aribisala titled Kicking Tongues, Canterbury Tales transplanted to Nigeria, that has some truly artful examples of this.

6.2. remnants > ruminants. Still on NED and on ADS-L, Dan Goodman noted that "A recent story at http://www.literotica.com began with the narrator shaking off the last ruminants of sleep." These are presumably the sheep that the narrator counted to get to sleep in the first place and now must be expelled.

7. Syntactic blends vs. lexical intrusions. Eggcorns could be viewed as "lexical intrusions", in which one element (morpheme, lexeme, word) is substituted for another in an attempt to have a larger expression "make more sense". On occasion, this sort of substitution can look rather like a syntactic blend (a topic I want to say more about soon). A few examples:

7.1. hunker down > bunker down. Last autumn, ADS-L spent some time on the expression bunker down. It started with an example offered by Seán Fitzpatrick on 10/10/03:

From "Jonestown for Democrats: Liberals follow Gray into the big nowhere", by Marc Cooper in the LA Weekly http://tinyurl.com/qgfm (emphasis added):

As the insurgency swelled, the best that liberal activists could do was plug their ears, cover their eyes and rather mindlessly repeat that this all was some sinister plot linked to Florida, Texas, Bush, the Carlyle Group, Enron, and Skull and Bones. By BUNKERING DOWN with the discredited and justly scorned Gray Davis, they wound up defending an indefensible status quo against a surging wave of popular disgust.

"Hunker down" mixed up with some such phrase as "go into the bunker with".

Immediately, Gerald Cohen, who has collected enormous numbers of putative syntactic blends (Cohen, Gerald Leonard. 1987. Syntactic blends in English parole. Frankfurt: Peter Lang.) and is inclined to see them everywhere, firmly rejected this offering: " 'Bunker down' is not a blend. It's merely 'hunker down' with the intrusion of 'bunker' (based both on phonetic similarity and the idea of hunkering down in a bunker)." And Clai Rice (10/13/03) offered up a collection of Google hits for bunker down, suggesting that this is (sometimes) not an inadvertent slip, but an eggcorn.

7.2. poke fun at > pick fun at. On NED+1, ADS-Ler Wilson Gray reported on pick fun at, which Larry Horn suggested might be a blend of poke fun at and pick on. But it could just be an "improvement" of poke by pick, that is, an eggcorn.

7.3. slugfest > slangfest. Right after this, on NED+2, Peter McGraw logged the following: "In a column about the Swift Boat attack ads and Kerry's response to them, David Gergen says: 'A quarter-century ago, they [the Swift Boat Veterans' allegations] would have faded away without much discussion. But in an age of slangfests on radio and cable news, it was inevitable that conservative hosts would blow up the story' (The Oregonian, 8/30/04)." McGraw, David Barnhart, and Grant Barrett googled up further examples. It would be possible to see slangfest as a blend of the amply attested slugfest and the somewhat more uncommon slanging match. Or, of course, slang could just be an improvement on slug, indicating that language was the medium of attack, and not fists.

8. A vexed note. I try to keep good files on the various kinds of "mistakes", inadvertent and not, phonological or semantic or morphological or syntactic, etc. etc. I have separate files for spelling errors, for eggcorns, for other classical malapropisms, for blends of various sorts, and so on. Most of them now have files appended that say, rather desperately, "see also..." Lord knows how I'd count any of this.

9. The pun connection. Just before the dawn of NED (on 8/27), Emily Bender mailed to remind me about the eggcorn-pun connection. Another way to look at eggcorns is as unintentional puns. Both puns and eggcorns turn on a doubleness of meaning for identical (or very similar) form. Bender's e-mail was actually about a kind of written pun in Japanese (which I'm not competent to write about, though I invite Bill Poser to say something about it), but her general point is an important one, I think.

In both cases, there are imperfect matches (eggcorn home > hone, pun "With fronds like these, who needs anemones?"), perfect phonological matches distinguished in spelling (most eggcorns and puns), and completely perfect matches (hidden eggcorns, puns like those below, from Geoff Tibballs (ed.), The mammoth book of humor (NY: Carroll & Graf, 2000)), in which both pronunciation and spelling are identical.

[5180] What did the farmer say to the goat who wouldn't reproduce? -- You must be kidding.

[5182] [with reference to Quasimodo] ... "I'm not sure of his name," said the woman, "but his face rings a bell."

[5183] What did one eye say to the other? -- Between you and me there's something that smells.

[5185] What do you call a witch who verifies her incantations? -- A spell checker.

[5187] What do you call Santa's helpers? -- Subordinate Clauses.

When the eggcorn topic hit soc.motss recently, it set off a wave of punning (not that this group needs encouragement to pun). The rain/reign/rein example immediately produced references to Prince's "Purple Reign" (and his "Purple Rein") and quotations like "The reign in Spain is mainly on the plain" and "Who'll Stop the Reign?" See, it's contagious: walaa!

10. The loss of history. I'm still mulling over the visceral objection to eggcorns that many people have, in particular the objection that things like free reign and hone in on, not to mention pre-fix and the venerable chaise lounge, are appalling because, in our ignorance of the history of these expressions, we erase that history by reshaping them.

Well, from the point of view of ordinary language users, history doesn't count for much. We mostly have no way of ascertaining that history, and when we know it, it's a kind of charming footnote: the important thing is how the pronunciations, meanings, and uses of expressions are linked. If you happen to know that some expressions arise from the technical vocabulary, metaphors, and metonymies of, say, card playing, sailing, baseball, fashion, horseback riding, the law courts, or music, that might deepen your appreciation of these expressions, and you might exploit those associations in your speaking or writing (if you know your audience), but this is lagniappe. Almost all of these figures are lost entirely or function subliminally.

One of the great lessons for me as a participant in ADS-L over the years has been the discovery of just how little even the experts know about the history of idiomatic and formulaic expressions, and how tremendously difficult these investigations are. We can speculate, and produce suggestive citations, but just an enormous amount of history is hazy, and some of it is probably unknowable. Even worse, things that "lots of people know" are just false; go back and look at the die is cast above. Mythetymologies abound.

It's not reasonable to ask ordinary people to be historians. Hell, it's hard enough for the specialists. I can't see why we should be insisting that ordinary people should be philologists. Let them find their own poetry; they're pretty good at it.

zwicky at-sign csli period stanford period edu

Posted by Arnold Zwicky at 02:15 AM

September 05, 2004

Blame Miles Bartholomew, Ward Stone Ireland and IBM

With respect to the Spanish closed-captioning for the recent Republican convention, Geoff Pullum contemplated the translation of Senator Olympia Snowe as "Senador Nieve de Olympia", and asked "Is it incredibly stupid 16-year-old human translation slaves that they have chained to desks at the captioning service office? Or machine translation software so dumb that even the armed services wouldn't pay for research into how to improve it any more so they had to go sell to the private sector?"

Well, it certainly seems as if some particularly stubborn transfer-based MT system might have been in the loop ("OK, we have an English phrase of the form MODIFIER NOUN, so that means we get a Spanish version in the form TranslationOf(NOUN) de TranslationOf(MODIFIER)"). But I suspect that the rest of the problem was probably not the fault of a human operator, but rather the consequence of a CAT ("Computer Aided Transcription") system.

If so, the (indirectly) guilty individuals were Miles Bartholomew, the "father of the stenograph", who patented the first American shorthand machine in 1879; Ward Stone Ireland, whose "high-speed keyboard [is] still in use today"; and a series of inventors funded from 1950 through the 1980s by the U.S. Defense Department and IBM, who created the technology for CAT.

In this system, the (human) transcriber uses a special keyboard, with a layout like this:

Some of the benefits are explained in this page, such as the fact that the word straight can be typed with one "stroke" (in which multiple keys are depressed), and the word centralization with three "strokes". Some details of how it works are explained here. A key point is that the primary coding scheme is based on pronunciation rather than on spelling. You can get an idea how this works from the stenotype output below, which represents the phrase "You should be able to read these short words":

Here's the basic stenotype "alphabet" -- remember that multiple keys are typically depressed simultaneously, coding a syllable or more at a time:

Note that not all sounds are represented directly on the keyboard, so that (for example) "gleam" is written by simultaneously chording the eleven keys TKPWHRAOEPL (which is what prints out on the tape), interpreted as TKPW = g, HR = l, AOE = long e, and PL = final m. Of course, (the toothpaste brand) "Gleem" would be chorded in just the same way, since the system is based on pronunciation.
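To make the pronunciation-based coding concrete, here is a minimal sketch in Python of how the "gleam" chord could be decoded. It uses only the four key groups named above. The table names, the function names, and the greedy longest-match strategy are assumptions made for illustration; this is not a description of any real CAT program, which would need a full stenotype theory plus a language model.

```python
# Key groups taken from the "gleam" example in the text.
# A real stenotype theory defines many more; these tables are illustrative only.
INITIAL = {"TKPW": "g", "HR": "l"}   # consonants before the vowel
VOWEL = {"AOE": "long e"}            # vowel keys
FINAL = {"PL": "m"}                  # consonants after the vowel
VOWEL_KEYS = set("AOEU")


def split_chord(chord):
    """Split a chord into (initial consonants, vowels, final consonants).

    Assumes the chord contains at least one vowel key.
    """
    start = next(i for i, key in enumerate(chord) if key in VOWEL_KEYS)
    end = start
    while end < len(chord) and chord[end] in VOWEL_KEYS:
        end += 1
    return chord[:start], chord[start:end], chord[end:]


def segment(keys, table):
    """Greedily match the longest known key group at each position."""
    sounds = []
    i = 0
    while i < len(keys):
        for size in range(len(keys) - i, 0, -1):
            piece = keys[i:i + size]
            if piece in table:
                sounds.append(table[piece])
                i += size
                break
        else:
            raise KeyError(f"no known key group at {keys[i:]!r}")
    return sounds


def decode(chord):
    """Turn one chord into a list of sounds."""
    initial, vowel, final = split_chord(chord)
    return segment(initial, INITIAL) + segment(vowel, VOWEL) + segment(final, FINAL)


print(decode("TKPWHRAOEPL"))  # ['g', 'l', 'long e', 'm']
```

The output is a list of sounds, not a spelling, so nothing in it distinguishes "gleam" from "Gleem". Choosing between them is left to whatever step comes next.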

The tape below shows a realistic example of such transcription, combining pronunciation-based sequences with other sorts of keypresses, which, according to the page where I got this, may be "unique to each reporter. In addition to the spoken word the reporter writes steno outlines to identify speakers; punctuate; insert parenthetical phrases, 'notes to self,' cues for computer translation. Some reporters invent new steno outlines 'on the fly' as needed".

In the old days, the machine just allowed the transcriptionist to keep up with a speaker in real time, but a human (typically the same transcriptionist) needed to go back later and transcribe the notes to normal text form. These days, the transduction to normal text is normally done by means of a computer program, which uses the same sort of "language model" that a speech recognition system does, in order to make appropriate guesses about how to make the translation. If there's time and/or money, a human editor may check the output, but if you want things cheaply and/or quickly, this may not happen.

I don't know exactly what combination of human and machine transcription and translation technologies was involved in producing the Spanish subtitles at the Republican convention, but if the transduction from "Senator Olympia Snowe" to "Senador Nieve de Olympia" involved a CAT step, then the loss of Senator Snowe's mute e was a small sample of the changes that in principle might have taken place, as this page explains.
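For the curious, the speculation in this post can be written down as a toy pipeline in Python: a sound-based step that cannot tell "Snowe" from "snow", followed by the MODIFIER NOUN rule quoted earlier. Everything here is hypothetical (the homophone table, the two-entry lexicon, the function names); it only shows that these two steps would be enough to produce the caption, not that the captioning service worked this way.

```python
# Hypothetical homophone table standing in for a pronunciation-based CAT step:
# a name and an ordinary word that sound alike come out as the same text.
SOUNDS_LIKE = {"snowe": "snow"}

# Toy English-to-Spanish lexicon with only the entries this example needs.
LEXICON = {"senator": "Senador", "snow": "Nieve"}


def cat_step(word):
    """Return the spelling a sound-based transcription might produce."""
    return SOUNDS_LIKE.get(word.lower(), word)


def translate_word(word):
    """Translate a word if the toy lexicon knows it, else pass it through."""
    return LEXICON.get(word.lower(), word)


def transfer_modifier_noun(modifier, noun):
    """Apply MODIFIER NOUN -> TranslationOf(NOUN) de TranslationOf(MODIFIER)."""
    return f"{translate_word(noun)} de {translate_word(modifier)}"


def caption(title, modifier, noun):
    """Caption a 'TITLE MODIFIER NOUN' name such as 'Senator Olympia Snowe'."""
    modifier, noun = cat_step(modifier), cat_step(noun)
    return f"{translate_word(title)} {transfer_modifier_noun(modifier, noun)}"


print(caption("Senator", "Olympia", "Snowe"))  # Senador Nieve de Olympia
```

Note that the damage is done before translation starts: once the mute e is gone, the lexicon lookup has no way to know it is looking at a surname.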

Posted by Mark Liberman at 10:59 PM

Los Senadores y El Presidente

Christopher Buckley says in the Sunday New York Times (page 9 of Week In Review if you have the hard copy; available online if you're signed up) that his TV got stuck with the Spanish closed-captioning on while he watched the Republican convention, and it showed Zell Miller with the words "Senador Molinero del Zell" at the bottom of the screen. Olympia Snowe was "Senador Nieve de Olympia" (does her final silent "e" count for nothing? "Snowe" is not "snow"), and almost unbelievably he saw several times a reference to Presidente Arbusto. Is it incredibly stupid 16-year-old human translation slaves that they have chained to desks at the captioning service office? Or machine translation software so dumb that even the armed services wouldn't pay for research into how to improve it any more so they had to go sell to the private sector?

Posted by Geoffrey K. Pullum at 08:58 PM

The perils of prescriptivism

In Jasper Fforde's third Thursday Next novel, The Well of Lost Plots, each chapter begins with a paragraph-length quotation from some (imaginary) other work dealing with the (imaginary) world of the Great Library, where the action takes place. Chapter 22 starts with a definition of "echolocator" from the Guide to the Great Library by Cat Formerly Known as Cheshire:

An artisan who will enter a book close to publication and locate and destroy echoed words in the work. As a general rule, identical words (with exceptions such as names, small words and modified repetitions) cannot be repeated within fifteen words as it interrupts the smooth transfer of images into the reader's mind. (See ImaginoTransferenceDevice user's Manual, page 782.)

The related issue of repeated "small words", or at least a particular instance of it, is taken up at a meeting of Jurisfiction operatives in chapter 23, where the seventh agenda item is "the had had and that that problem". Lady Cavendish reports that "[a]t the last count, David Copperfield alone had had had had sixty-three times, all but ten unapproved". She also flags a problem in Pilgrim's Progress "due to its had had/that that ratio".

Specifically, she explains that the problem is "[t]hat that had that that ten times but had had had had only thrice".

"Hmm," said the Bellman. "I thought had had had had TGC's approval for use in Dickens? What's the problem?"

"Take the first had had and that that in the book by way of example," explained Lady Cavendish. "You would have thought that that first had had had had good accasion to be seen as had, had you not? Had had had approval but had had had not; equally it is true to say that that that had had approval but that that other that that had not."

"So the problem with that other that that was that ... ?"

"That that other-other that that had had approval."

"Okay," said the Bellman, whose head was in danger of falling apart like a chocolate orange, "let me get this straight: David Copperfield, unlike Pilgrim's Progress, had had had, had had had had. Had had had had TGC's approval?"

That's a lot of set-up for the eleven-had sequence, which is a form of an old joke presented more simply here, as a puzzle

Ann while Bob had had had had had had had had had had had a better effect on the teacher.

which can be understood with a little punctuation and goodwill as

Ann, while Bob had had "had", had had "had had". "Had had" had had a better effect on the teacher.

In the Well of Lost Plots passage, though, I could use some help with that "other-other that that" business. On second thought, never mind...

Echoing the topic of Amazon sales rank, which has come up here recently, I note that the fourth Thursday Next novel, just published, has a current sales rank of 367, which apparently translates to sales of about 17 copies per day (though since books in this range are re-ranked every hour, this number may not be very stable). The Well of Lost Plots, published last year, is already down to a rank of 2,557, or around 4.5 copies per day.

Anyhow, it seems that this sort of thing has quite a few fans, though apparently they are thin on the ground at The Economist's letters-to-the-editor department.

Posted by Mark Liberman at 12:10 PM

September 04, 2004

Not all Bushisms originate with Bush

Some people have no ear for when a phrase sounds ludicrous. Even crashing incompatibilities, bathos, or undesired jingly phonetic similarities seem not to impinge on their consciousness. To observe that President Bush is one of these linguistically insensitive souls is not exactly a news item. And there could scarcely have been a better illustration, or so I thought, than his remark early today at an "ask President Bush" session at Brecksville-Broadview Heights High School in Ohio, to the effect that in the nightmare mass murders of Beslan, North Ossetia (southern Russia) we have seen "the horror of terror". The horror of terror. Surely no one but Bush could have slopped together such a ridiculous-sounding phrase — two virtually synonymous and phonetically similar abstract nouns fighting each other like two possums in a sack, I thought as I heard it. But to my surprise, this inept phrase gets over 190 prior Google hits. It is true that our president is not a gifted phrasemaker. But one must never forget how many other inexpert and unstylish users of English (which, after all, has between one and two billion people using it) have preceded him. Not all Bushisms originate with Bush.

Posted by Geoffrey K. Pullum at 02:15 PM

Words and ideas

Mark's post from earlier today about Lakoff, frames and messages links to his post from earlier this summer about Lakoff, ideas and words. This reminded me that I've been meaning to blog about the Republican National Convention.

OK, not really about the RNC. What I've been meaning to blog about is something from NPR's coverage of the RNC; specifically, a short (20-30 sec.) ad leading up to the actual live coverage.

(At this point we're relying entirely on my fallible memory; I haven't had time to see if I can find a clip of the ad online. Sorry. Please correct me if I get any of this wrong.)

The ad featured clips from presidential candidate nomination acceptance speeches (how's that for a compound) at different National Conventions. (Both parties were represented; I think I remember hearing JFK anyway.) There are a few voice-over words at the beginning, I think, and then it ends something like this:

Hear these same words spoken again during our live coverage of the Republican National Convention.

These same words? Ideas, concepts, frames, messages ... these are generally expressed with words, so why not let PITS conflate the distinction? I'd like to add a modest argument to Mark's call to honor the distinction. Consider the following pairs of phrases.

a man of many words
a man of many ideas
a man of few words
a man of few ideas

Which of these men would you vote for?

[ Comments? ]

Posted by Eric Bakovic at 12:22 PM

Frames and messages

George Lakoff's ideas about "the framing wars" have started to find their way into public discourse, though I've complained that the media (at least Bill Moyers) presents the issue as being about the choice of words rather than about the choice of ideas. You can find a clearer statement of George's perspective in this 9/1/03 article from the American Prospect, and in these two interviews from the UC Berkeley News. He thinks that the Republicans have "out-framed" the Democrats over the past couple of decades, and that seems to be the truth of the matter.

However, Lakoff is not talking about ideas as rationally-constructed opinions, convictions or principles, but rather about metaphors, images, and evoked scenarios with emotion-laden roles like victim, hero, villain, crime, strict father and so on. Although he has partisan goals, this mode of analysis is politically neutral, at least with respect to current American political parties. It emphasizes attention to emotion instead of substance, but you can apply that emphasis to promoting any goals you want.

I'm somewhat skeptical about the degree of separation between substance and presentation that this approach assumes. It's traditional in the advertising industry, and for that matter in politics. But it seems to me that political movements -- and for that matter, advertising campaigns -- succeed best when style and substance are integrated. That's certainly what Michael Silverstein suggests in his analysis of how Abraham Lincoln's rhetoric came to form part of America's "civil religion".

Like Lakoff, Silverstein is certainly partisan. His pamphlet Talking Politics contains a certain amount of bile directed at George W. Bush, as the author of Semantic Compositions discovered by reading a few pages via Amazon's "Read Inside the Book" feature. However, I think this misses the point. Like Lakoff, Silverstein is promoting a mode of analysis that applies to any human communication, political or otherwise. What he has to say about image, style and message can be applied to positive or negative evaluations of any politician, or for that matter to your relatives, your friends or yourself.

In order to give a sense of what we're talking about, I've typed in a highly elliptical version of Silverstein's introduction to what he means by "message" -- this represents about a fifth of the stretch of text from which the quotes are derived, with most of the elaborations, exemplifications and asides left out:

Those not attuned to politicoglossia may at first think that someone's "message" is the topic, or theme, or central proposition he or she is trying to communicate. ... You could paraphrase someone's "point" as a kind of assertion that such-and-such is the case about something-or-other.

You would be wrong. ... "Message," we can discern from the study of political communication, is really much more complicated than that. If successful, a person comes to inhabit "message" in the act of communicating.

... In order to understand "message," ... we have to think about the several different kinds of meaningfulness always present -- though not always recognized -- when language is used. ...

In our own intellectual tradition of understanding how people use language, the most salient -- the official -- "what" of communication lies in how words and expressions describe, or in technical terms, denote. ...

So officially we describe things and states-of-affairs so that others can also identify those things and states-of-affairs. ...

But additionally, there are principles based in developing information-structure itself, distinct from the grammar of sentences, that determine what expressions we can and do use at various points in communication ... while communication proceeds, sender and receiver can rely more and more on what has already been communicated about a topic, information about it that cumulates between them. ...

In this way, discourse is always being evaluated as description for how it achieves a kind of cumulative coherence as information. ... people can use language to construct collectively reached and collectively consequential knowledge, opinion, and belief about all manner of things. ...

But, having mentioned both grammar and information structure, is there anything else to communicative use of words and expressions? ... [I]n every discourse a large number of extra-verbal contextual factors leave their determinate traces in the forms we use -- what are termed in the trade indexical (pointing) traces. These traces inform us about, they point to, the who-what-where-when-why of discourse by subtle loadings of the "how", the actual forms, of discourse. ("Democratic" or "Democrat"? ...) ... Indexical values of language forms locate and identify the parties to the communication ... the way a good pantomime gives the impression of taking place in a comprehensible surround.

These indexical factors in language seem to crosscut the information structure always emerging via grammar and denotational coherence as speakers add to the words and expressions in a text. Masters of political "message," just like other users of languages, have intuitively known all along about the indexical power of the words they use, and especially about the cumulative indexical poetry of properly arranged words. Such masters have a knack for indexical design that has shaped each era's political communication -- at least as much as the descriptive content of it ... -- thus creating a true rhetorician's art form. ...

... In communicating we ... rely on social arrangements already in place, and the expectations we can then have about what form talk should take between two socially locatable individuals. But as well, each time we deploy specific forms of language we create social arrangements as consequences of using these forms; we bring new social arrangements into being.

... [T]he act of communication itself, that is, the emergence of certain indexically potent message forms, can always transform the intuitive classifications we apply to one another, new ones suddenly pointed to as now operative and consequential ...

What indicative signs and signals, for example, were you relying on in your aunt's talk when you concluded, the other day, that she was "stressed"? ... Again, how did I come to know that the prospective student in my office last week was gay? He did not announce this to me as a self-description, explicit or implicit. He just talked about -- described, in the sense I discussed earlier -- why he was interested in a particular educational degree program. These kinds of inferential processes go on constantly in interaction, as we all know, on the basis of indexical signals that work like gestures in pantomime.

In essence we continuously point to our own -- and, relationally, then, to our interlocutor's -- transient and more enduring identities. Interactions as events develop these relational identities as consequences of communicative behavior. The clarity of identities comes in phases, punctuated by shifts over interactional time: what-you-and-I-are in a moment of interaction strives to become what-you-and-I-will-be. ...

Over multiple indexical channels, then, there comes into being a kind of poetry of identities-in-motion as the flow of communicative forms projects around the participants complex patterns -- let's say "images" -- not onto Plato's cave wall, but onto the potentially inhabitable and then actually inhabited context. So there is image. There is style. There is "message". Image is not necessarily visual; it is an abstract portrait of identity ... Style -- the way image is communicated -- has degree and depth of organization... "Message", then, strategically deploys style to create image in a consequential way. ...

So being "on message" contributes to that consistent, cumulative, and consequential image that a public person has among his or her addressed audience. A really powerful "message" ascribes to me -- as opposed to describes -- my reality.

Leaving politics for a while, Silverstein points out that "these demonstrations of and inferences about identities" are central to human communication, but "have been largely out of the aware consciousness of communicators". And he observes that

[A]ll of the institutionalized technologies of language have cumulatively reinforced this intuitive difficulty of explicit recognition by concentrating on its descriptive functions. ... I mean everything from the writing and printing conventions to the personnel and paraphernalia of enforcing standard languages: dictionaries, thesauruses, grammars, manuals of style, and the people who create them and insist that they are authoritative. ... The biases, built into our institutional forms across the board, keep telling us to discount what is actually indispensable to normal and effective human communication.

Although I'm just as interested in "the poetry of identities in motion" as Silverstein is, I do think that there are some good reasons for the bias in favor of the "descriptive functions" of language. The key factor is the psychological phenomenon of word constancy. We can all pretty much agree on what words someone said, and to a lesser degree on how the words go together, and what states of affairs are consistent with them. As we get further away from that level of description, things tend to get fuzzier and fuzzier. And I'd contend that this is the cause, not the consequence, of the way that writing systems work.

However, Silverstein is absolutely correct to observe that we often

... hyperemphasize the use of language for descriptive purposes, sometimes foolishly and vainly attempting to disregard the inevitable, simultaneous use of language for inhabiting identities. And, of course, just such ways of fashioning inhabitable identities in communication give our messages whatever life-like appeal they may have.

Returning to the analysis of political discourse, it's worth observing that there is another way to think about these things, due originally to Aristotle: effective rhetoric is an amalgam of ethos (the character of the speaker), pathos (the emotions of the audience) and logos (the content of the argument).

Lakoff's "frame wars" are all about pathos: choosing metaphors that line the terms of the debate up with the emotions of the audience in a favorable way.

"Message", for Silverstein, "strategically deploys style to create image", and thus is basically about the projection of ethos.

Silverstein worries that there is an on-going "adjustment of the ratio of operative meanings ..., the denotational and the context-indicating", such that "[t]he key expressions are no longer experienced ... as signals of concepts with which we communicate denotational truth-and-falsity", but are merely "[conjuring] up a kind-of-'who' at a certain cultural 'where'." He associates this with "a corporate-standard language register".

It seems unfair to me to associate this kind of pointillistic display of identity-indexicals with modern corporations. They often use it, sure enough, but so do protest groups, NGOs and scientific societies, or musicians, computer programmers and university professors -- at least some of the time. At other times, the same organizations and individuals may achieve the integration of ethos, pathos and logos that Aristotle (and Lakoff and Silverstein) would recommend.

Posted by Mark Liberman at 10:59 AM

September 03, 2004

Pamphleteering, old and new

Would-be pamphleteers like Marshall Sahlins could learn something, not only from weblogs, but also from science fiction publishers and spammers.

In an earlier discussion of Michael Silverstein's Talking Politics, I estimated that its amazon sales rank of 634,034 means that it's selling about one copy a month. Here are some estimates for other recent Prickly Paradigm pamphlets, in reverse chronological order. In each case, I've given the current amazon sales rank, and the corresponding rate of sales estimated from the graph given on Morris Rosenthal's page here:

#15 Lindsay Waters. Enemies of Promise: Publishing, Perishing, and the Eclipse of Scholarship. Amazon sales rank 8,384 (5 copies per day).
#14 David Graeber. Fragments of an Anarchist Anthropology. Amazon sales rank 63,057 (0.4 copies per day).
#13 James Elkins. What Happened to Art Criticism? Amazon sales rank 82,485 (0.3 copies per day).
#12 Richard Price and Sally Price. The Root of Roots. Amazon sales rank 766,811 (0.02 copies per day).
#11 Magnus Fiskesjö. The Thanksgiving Turkey Pardon, the Death of Teddy's Bear, and the Sovereign Exception of Guantanamo. Amazon sales rank 1,246,227 (0.007 copies per day).
#10 James Clifford. On the Edges of Anthropology: Interviews. Amazon sales rank 650,291 (0.03 copies per day).

Overall, of course, we see the expected drop off in sales with time.

But even the high-end sales rate for recent Prickly Paradigm pamphlets -- 5 copies per day -- is not evidence of much success in reaching a large audience of intellectuals. In comparison, I believe that the relatively intellectual weblogs listed on ephilosopher all get between a thousand and ten thousand readers per day. You can see our recent statistics here. I'm sure that (for example) Crooked Timber has an order of magnitude more readers than we do.

The comparison is not a fair one. A weblog entry of 100-1,000 words is not the same as a pamphlet of about 30,000 words. And a significant number of weblog readers are regulars, whereas everyone who buys a paper pamphlet is a new set of eyeballs. All the same, if you're in the business of "short, edgy, critical, cantankerous" commentary, and you don't have a regular spot in a mass-market media outlet, on line content is much more widely read than paper pamphlets are.

And if you give free samples to a couple of thousand people a day, how could you not sell more than .02-.03 copies as a result? Even if you got only the kind of click-through that spammers count on -- on the order of 50 per million -- you'd still more than double your backlist sales... In fact, the experience of the Baen Free Library, discussed here with statistics by Eric Flint, suggests that you might be able to do a great deal better than that.
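The back-of-the-envelope arithmetic here can be checked with a short sketch. The figures are the ones quoted above (a couple of thousand readers a day, about 50 sales per million, backlist sales of .02-.03 copies a day); the function name, and the reading of "a couple of thousand" as exactly 2,000, are illustrative assumptions.

```python
def extra_copies_per_day(readers_per_day: float, conversions_per_million: float) -> float:
    """Expected additional sales per day if a fixed fraction of readers buys a copy."""
    return readers_per_day * conversions_per_million / 1_000_000


# Assumption: "a couple of thousand people a day" is taken as 2,000.
READERS_PER_DAY = 2_000
# Spam-style click-through quoted in the post: on the order of 50 per million.
SPAM_RATE_PER_MILLION = 50
# Estimated backlist sales quoted in the post, in copies per day.
BACKLIST_COPIES_PER_DAY = (0.02, 0.03)

extra = extra_copies_per_day(READERS_PER_DAY, SPAM_RATE_PER_MILLION)

for baseline in BACKLIST_COPIES_PER_DAY:
    # Ratio of (current + extra) sales to current sales.
    multiple = (baseline + extra) / baseline
    print(f"baseline {baseline:.2f}/day, extra {extra:.2f}/day, "
          f"new total is {multiple:.1f} times current sales")
```

On these assumptions the extra sales alone are larger than the current backlist rate, so "more than double" is, if anything, a cautious way to put it.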

[Update 9/22/2004: a .pdf of Silverstein's pamphlet is now available here. ]

Posted by Mark Liberman at 10:35 AM

Prickly Paradigms under a bushel?

It's interesting to contrast Kerim Friedman's recent Anthropology News article with this 2002 piece on Marshall Sahlins' efforts in connection with Prickly Paradigm Press. Professor Sahlins decided to become a publisher about four years ago. He had trouble finding a publishing outlet for his own pamphlet Apologies to Thucydides, which "strayed beyond his discipline, using baseball and the Elian Gonzalez affair as examples".

Sahlins is quoted as saying that

"Pamphlets are an important genre for academics who have something they want to get off their chests. It gives them freedom and encourages creativity. So many academics have a lot to say that they don’t want to write as a piece with scholarly apparatus, footnotes and a bibliography."

The article closes with Sahlins' hope that "the pamphlet genre will become popular with a general intellectual audience":

"There is a possibility that the short, edgy, critical, sometimes cantankerous pamphlet’s time has come."

Well, yes. Its time has come, and its place is on line, where hundreds of thousands of people every day read short, edgy, critical and sometimes cantankerous weblogs. They read longer documents as well, usually because a link is featured on weblogs and discussion forums.

I'm not against paper -- I own more books than I can fit into two houses and four offices, and I keep buying more of them. But if Professor Sahlins wants his pamphlets to reach a large audience -- and to sell enough copies to keep the press afloat financially -- it's time to start using the web.

He could start with Michael Silverstein's Talking Politics, which will have particular relevance over the next couple of months, and is now selling all of one copy a month or so via amazon.com. I suppose that Talking Politics was produced from some sort of digital document format, and in that case it could be on line in a few minutes. In the unlikely event that it was typeset by hand, it could still be scanned into PDF or DjVu format in an hour or so.

[Update 9/22/2004: a .pdf of Silverstein's pamphlet is now available here. ]

Posted by Mark Liberman at 08:29 AM

"Stop Yelling at the TV and Get Online!"

Kerim Friedman at Keywords has published an article in Anthropology News entitled "Stop Yelling at the TV and Get Online!" As Kerim explains in a weblog entry, "[t]his is the first of a series of articles [he is] writing and co-editing about the role that online publishing can play in anthropology".

Ironically, you can't read his article on the Anthropology News website unless you've paid your dues to the American Anthropological Association. However, Kerim has gotten permission (!) to post the same article on his wiki, where you can read it even if you're not an AAA member.

Posted by Mark Liberman at 07:37 AM

Bouma

Kevin Larson has posted a nice historical survey of "the last 20 years of work in cognitive psychology" on "word recognition" -- the recognition of printed English words, that is -- from the particular perspective of a "reading psychologist" speaking to the Association Typographique Internationale.

He got himself into this situation by taking a job at Microsoft, and winding up on the ClearType team, where he discovered that "the team believed that we recognized words by looking at the outline that goes around a whole word, while I believed that we recognize individual letters". Apparently many people involved in typography believe that the outline around words is important, and call this outline a bouma, named after H. Bouma, the author of a 1973 paper "Visual Interference in the Parafoveal Recognition of Initial and Final Letters of Words" (Vision Research, 13, 762-782).

Larson explains that "[i]n my young career as a reading psychologist I had never encountered a model of reading that used word shape as perceptual units, and knew of no psychologists who were working on such a model. But it turns out that the model had a very long history that I was unfamiliar with." His paper is an attempt "to review the history of why psychologists moved from a word shape model of word recognition to a letter recognition model, and to help others to come to the same conclusion".

It looks to me like he's right, both about the widespread Bouma-culture among typographers, and its lack of scientific support relative to alternative theories. Specifically, he argues that all of the results that have been taken to support word-shape models "make more sense with the parallel letter recognition model of reading than the word shape model".

It's common for discredited scientific theories to live on -- sometimes for hundreds of years -- in applied areas. Read Ray Girvan's weblog for an entertaining parade of examples. But unless I'm just being parochial, this kind of thing is more common in language-related fields than elsewhere.

[tip via email from Kerim Friedman]

Posted by Mark Liberman at 07:15 AM

September 02, 2004

Translation and free speech

A lawsuit was recently filed here in San Diego, by "[s]everal doctors and a group supporting English as the nation's official language" (the New York Times version of the AP story can be found here or here). The suit challenges "a Clinton-era executive order requiring federally funded hospitals, clinics and doctors to offer translation services for patients who speak limited English", and includes "claims that it is expensive and limits doctors' free-speech rights".

If someone can explain to me how having your federally subsidized medical advice and instructions translated limits your freedom of speech, I'd love to hear about it.

The suit was filed in San Diego because, I suppose, the lead plaintiff is a San Diego orthopaedic surgeon (whose bio can be found here). The "group supporting English as the nation's official language" is one I'd never heard of, ProEnglish ("English Language Advocates"), based in Arlington, VA. This same surgeon and English-only group "filed a similar lawsuit in Virginia in 2002 that was dismissed for, among several reasons, lack of standing and failure to prove the plaintiffs were harmed by the order."

I don't have much (else) to say about ProEnglish at the moment, but my suggestion to Dr. Colwell is simple: "stop taking federal money".

No really, I mean it. This is probably one of the worst times in recent history to be taking a stand for this particular side of an ethnically and culturally divisive issue, lowish-profile though it may seem to be. (Of course, ProEnglish probably thinks this is an ideal time to do this. I say ZEN-o-FEE-lee-(y)uh, they say ZEN-o-FO-bee-(y)uh.)

In case you think I'm overreacting, you should really visit the ProEnglish website. There's a very prominent link on their homepage saying DON'T SAY "ENGLISH ONLY!" READ WHY!. The first few paragraphs on this page sound halfway reasonable -- I had a flash of Geoff Pullum with his checkbook spreadeagled on the table before him -- but the last couple of paragraphs should give us pause (boldface in the original):

So, if all this is true what does the term "official English" mean? It means that a government has decided that in order for its actions, laws, and business to be considered authoritative, they must be communicated in the English language. It means that there can be no disagreement about which language is the controlling one for discerning the meaning that government intends. And it means that there must be a compelling public interest for government to use any language other than English.

It also has a symbolic meaning, which is very important. It sends a message to all those who want to participate as citizens in this great republic, that there are responsibilities as well as benefits for being here. And one of those responsibilities is learning to speak the language of our country--English. There is no reason why our expectations for non-English speaking immigrants today should be anything less than our expectations for the generations of immigrants that preceded them.

You have to wonder how this group defines things like "to be considered authoritative", "compelling public interest", and even "the English language" in the first paragraph. The second paragraph is the really disturbing one, though: the use of the phrases "a message to all those who ...", "the language of our country", and "our expectations for non-English speaking immigrants" strongly suggests that this group knows exactly not only who they are and who they represent, but also who their core audience is/will be. As a bilingual and bicultural child of proud immigrants to this country who both function perfectly in their adopted second language and culture, living in a state with a recently-elected (and very popular) English-only-supporting otherwise-moderate-seeming Republican governor with presidential aspirations, I'm frightened by this.

[ Comments? ]

Posted by Eric Bakovic at 01:55 PM

Here, take this stand I made

While writing up my next post, I stumbled for a few seconds on the phrase "taking a stand (for/against)", which I momentarily thought should be "making a stand (for/against)". Googling, I find that the "take" forms far outnumber the "make" forms, but both are relatively strong.

Here are the results for each of the forms. I split "__ against" and "__ for" forms up because the former seemed to be more informative (given that a stand is a physical object that one can make for something, as in "Making a stand for Wise Men 'dolls'").

	__ against	__ for	TOTALS
taking a stand	20,600	10,800	31,400
take a stand	56,800	30,500	87,300
takes a stand	8,900	2,840	11,740
took a stand	7,170	3,140	10,310

	__ against	__ for	TOTALS
making a stand	2,630	4,130	6,760
make a stand	8,020	7,520	15,540
makes a stand	895	839	1,734
made a stand	931	771	1,702
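The claim that the "take" forms far outnumber the "make" forms can be checked directly from the two tables above. The sketch below only re-adds the hit counts as listed; the dictionary and function names are illustrative, not part of any original analysis.

```python
# (against, for) Google hit counts copied from the two tables above.
TAKE = {
    "taking a stand": (20_600, 10_800),
    "take a stand": (56_800, 30_500),
    "takes a stand": (8_900, 2_840),
    "took a stand": (7_170, 3_140),
}
MAKE = {
    "making a stand": (2_630, 4_130),
    "make a stand": (8_020, 7_520),
    "makes a stand": (895, 839),
    "made a stand": (931, 771),
}


def column_totals(table):
    """Sum the 'against' and 'for' columns over all verb forms."""
    against = sum(a for a, _ in table.values())
    for_ = sum(f for _, f in table.values())
    return against, for_


take_against, take_for = column_totals(TAKE)
make_against, make_for = column_totals(MAKE)

print(f"against: take {take_against:,} vs make {make_against:,} "
      f"(ratio {take_against / make_against:.1f})")
print(f"for:     take {take_for:,} vs make {make_for:,} "
      f"(ratio {take_for / make_for:.1f})")
print(f"overall: take {take_against + take_for:,} vs make {make_against + make_for:,}")
```

The "take"/"make" ratio comes out noticeably higher in the "against" column than in the "for" column, which fits the suspicion that some "make a stand for" hits are about physical stands.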

For the example-oriented, here are each of the very first hits (except where otherwise noted) that I got in each case, almost all of them newspaper headlines.

(link) It's not easy taking a stand against crime and violence.
(link) Take a Stand Against the Recording Industry Association of America
(link) A CALIFORNIA COURT TAKES A STAND AGAINST SEXUAL HARASSMENT, MAKING A STATE LAW CLAIM EASIER TO BRING THAN A FEDERAL CLAIM
(link) Finally took a stand against the music industry.
(link) Making a stand against cancer

[This headline is clearly meant to be a pun; the first sentence of the article is "Eastside lemonade sales will help sweeten the national research pot." Whether or not this influenced the choice between make and take in this case is anyone's guess.]

(link) Clyde to make a stand against promotion laws.
(link) Ateneo de Naga Makes a Stand Against the CJ's Impeachment.
(link) In his mind, he had NOTHING to lose at all, and in the end, he made a stand against those who represent the greatest threat to all of us.

[The second link in this case is so much better, though: (link) France hasn't made a stand against war any more than a child who refuses to eat his vegetables has made a stand against veganism.]

(link) Taking a stand for moderate Islam.
(link) Take a Stand for Your Brand
(link) Comedy takes a stand for venue
(link) Time Christians took a stand for the Bible.
(link) Making a stand for world peace.

[The first hit in this case was the "Making a stand for Wise Men 'dolls'" example cited earlier.]

(link) Make a Stand for God, (YAHWEH). And Jesus (YESHUA), His Son.
(link) Dragulescu makes a stand for Romania
(link) Find out more about the way in which the early Christians made a stand for what they believed.

[ Comments? ]

Posted by Eric Bakovic at 01:53 PM

Reportedly versus Formally

Last night, the NYT's headline on this AP wire story was "Prosecutors to Reportedly Drop Charges Against Bryant". This morning, it reads "Bryant Charge Dropped; Civil Suit Looms." It changed because the passage of time changed the perspective, but perhaps it should have changed anyway, for grammatical reasons.

It's not the split infinitive that's at issue here. There's no logical or grammatical reason to forbid splitting infinitives, and sometimes it's even obligatory, as Arnold Zwicky and Geoff Nunberg pointed out here last spring.

Rather, I'm interested in the scope of the adverb. Jeff Erickson at Ernie's 3D Pancakes discussed this question in connection with his tenure letter, which said that "we invite you formally to indicate your acceptance". Jeff asked "Did they formally invite me to indicate my acceptance, or did they invite me to formally indicate my acceptance?", pointing out that these are different things.

It could happen, I guess, that some prosecutors might really form the intention of taking an action that could be described as reportedly dropping some charges -- and perhaps, by pragmatic implication, not really dropping them. But the headline "Prosecutors to Reportedly Drop Charges" is not describing such a Byzantine situation: it clearly and simply means that the prosecutors are planning to drop the charges, but have not yet formally announced it. In due course, the announcement was made, and the headline duly changed to "Bryant charge dropped".

Searching the web for headlines of the form [NP to reportedly V ...] or [NP reportedly to V ...], we can find plenty of both types:

Goldman to reportedly lead Lazard IPO
Thailand's Prime Minister to Reportedly Buy Stake in Liverpool
Mo. 'Pool' Businesses to Reportedly Save $3 Million from 3.7 Percent Workers' Comp Rate Cut

Pennington reportedly to become highest paid Jet
NIH reportedly to release guidelines allowing research on human embryo cells
Paris Hilton sex video reportedly to be offered via new Internet porn site

It's clear that both forms are meant to be interpreted in the same way, expanding the telegraphic headline by supplying a form of to be in front of the infinitive, and taking reportedly to modify the whole thing. Thus the fuller forms are things like "reportedly, Thailand's Prime Minister [is] to buy [a] stake in Liverpool", or "reportedly, Pennington [is] to become [the] highest paid Jet".

Although both forms occur, they're not equally common:

	N reportedly to	N to reportedly
Bush	21	2
Kerry	0	0
U.S.	121	5
China	55	4
France	5	0
U.K.	24	1
Canada	0	1
Germany	3	0
India	32	1
Japan	49	1
Clinton	18	1
Kobe	11	0
TOTAL	339 (95.5%)	16 (4.5%)

Note that many of these examples are not in headlines, and in some of them the noun is not the subject of the infinitive, though an analogous question about scope generally arises:

Dr. Jianli had traveled to China to reportedly investigate large-scale labor unrest and to meet with labor activists.

There may be some effect of copy-editor prejudice against split infinitives here, but I think it's mostly the effect of the (genuine) scope problem, as we can see by comparing what happens to the same list of nouns with the adverb formally, which normally modifies the verb phrase rather than the whole sentence:

	N formally to	N to formally
Bush	11	142
Kerry	4	51
U.S.	149	2,440
China	7	87
France	7	31
U.K.	17	55
Canada	1	161
Germany	2	8
India	10	41
Japan	20	86
Clinton	3	21
Kobe	0	0
TOTAL	231 (6.9%)	3,123 (93.1%)

This pretty much reverses the proportions found in the table for reportedly. Note that the table is polluted on the "formally to" side by examples like

The European Commission has decided to ask France formally to review its rules banning television advertising by the publishing and cinema.

where formally probably modifies ask rather than review. And in this table, an even smaller percentage of the examples are headlines of the [NP to V ... ] type, although that doesn't matter to the general point about scope.

The exact percentages would change a bit with more careful inspection, but the unchecked counts in these two tables give a strong indication that people are placing adverbs in a way that sensibly reflects their scope. When I was taken aback by the NYT headline "Prosecutors to Reportedly Drop Charges Against Bryant", the numbers were on my side.
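The proportions in the two tables follow from their TOTAL rows alone. As a minimal sketch (the dictionary layout and function name are illustrative assumptions; the four counts are the unchecked web totals given above), the shares can be recomputed like this:

```python
# TOTAL rows from the two tables: counts of "N <adverb> to" vs "N to <adverb>".
COUNTS = {
    "reportedly": {"adverb to": 339, "to adverb": 16},
    "formally": {"adverb to": 231, "to adverb": 3123},
}


def shares(counts):
    """Convert raw counts for the two word orders into percentages of their sum."""
    total = sum(counts.values())
    return {order: 100 * n / total for order, n in counts.items()}


for adverb, counts in COUNTS.items():
    pct = shares(counts)
    print(f"{adverb:>10}: "
          f"'{adverb} to' {pct['adverb to']:.1f}%, "
          f"'to {adverb}' {pct['to adverb']:.1f}%")
```

Because each pair of shares is computed from a single total, the two percentages in a row necessarily sum to 100, which makes this a quick consistency check on the tables.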

I'm not sure what's going on with the minority cases of wide-scope "to reportedly" and narrow-scope "formally to". As Arnold Zwicky has discussed in a series of four posts here, there's a thin line between error and mere variation. In the case of "to reportedly", it's possible that we're not seeing either simple scope errors or the birth of a new scope dialect, but rather the erratic application of a genre-specific pattern of superfluous and/or misplaced modifiers, which Arnold discussed under the heading of Journalists' Alleged Hedges. And in the (very rare) cases of narrow-scope "formally to" (such as the AP headline "Kerry formally to announce candidacy"), I suspect the influence of the pseudo-rule against splitting infinitives.

Posted by Mark Liberman at 08:51 AM

September 01, 2004

Tear down this wall

At least three or four times since Reagan's death, most recently tonight during radio reporting of the part of the Republican convention in which a commemorative video was shown, I have heard reporters and commentators giving misty-eyed reminiscences about hearing President Reagan say, "Mr Gorbachev, tear this wall down." He said no such thing, not ever. People sometimes say linguists fuss over trivia, but I can't believe anyone could see this point as trivial.

It's so strange that people should misremember, for two reasons. First, the correct original version still phonetically rings in my ears, unforgettably, and I would have thought that would be true for anyone who heard it (which would include even a 30-year-old junior reporter: the speech was in 1987); and second, the way he put it -- synonymous, but with very slightly different syntax -- is so much more compelling. The rhythm is better; the parallel with the preceding sentence ("Mr. Gorbachev, open this gate!") is better; and (let me get technical for just a second) the crucial direct object is positioned as last constituent in the verb phrase, not followed by the anticlimax of a particle that belongs with the verb, so the nuclear stress coincides perfectly with the final monosyllable which delivers the pragmatic punch, the key piece of new information conveyed by the final noun. Syntax, prosody and pragmatics in perfect harmony. What President Reagan said, very deliberately -- and they say it was audible over on the other side in East Berlin -- was: "Mr Gorbachev, tear down this wall!"

Posted by Geoffrey K. Pullum at 11:45 PM

Some Open Access advice for Michael Silverstein

Michael Silverstein is a talented intellectual who is a valued member of the departments of Anthropology, Linguistics and Psychology at the University of Chicago. He won a MacArthur "genius grant" in 1982, the second year in which the awards were given. He's a brilliant speaker, both in public and in private, and his former students are many and prominent.

Like any academic, he sometimes writes and talks in complex sentences that are hard to understand unless you're familiar with the disciplinary idioms involved. This description of his research interests, for example, includes phrases like "demonstrating the systematicity of 'indexical' meanings of various of the formal, distributional facts about language structure" and "developing an adequate account of what kinds of broadly 'textual' objects are developed in contextual realtime during the course of verbally-mediated interaction, whether with interlocutor(s) or with a text-artifact of some kind".

However, his most recent book Talking politics: the substance of style from Abe to "W" (March 1, 2003. Prickly Paradigm Press, Chicago) is accessible and fun to read, whether or not you've studied linguistic anthropology. There's (little or) no academic jargon here, and there are insights that you're likely to appreciate whether or not you share his political opinions.

The book starts this way:

No doubt about it. Abraham Lincoln gets the prize among United States presidents for the sheer concentrated political power of his rhetoric. When he set his -- actual, own -- mind to preparing his text, he could come up with gems such as his Second Inaugural and, of course, his 272-word "Dedicatory Remarks" at Gettysburg. Even his extemporaneous public and private talk, transcribed, shows great verbal ability. Now Mr. Lincoln had no Yale or Harvard degree as a credential of his education. But he understood the aesthetic -- the style, if you will -- for summoning to his talk the deeply Christian yet rationalist aspirations of America's then four-score-and-seven-year-old polity. Striving to realize this complex style, he polished it and elaborated its contours. He embodied the style. So much so, that Lincoln's great later text, like the late, great man himself, now belong to the ages. They form part of the liturgy of what Robert Bellah has termed America's "civil religion".

and ends like this:

Language used in the expository mode, used to create argument and therefore, at its most successful, to become the instrument of reason and rationality, is clearly not one of Mr. Bush's attributes. This is not Lincoln. This is not Kennedy. Neither Roosevelt. Whatever else we think of him, not Mr. Clinton. These were Presidents for whom language was both a renvoi, a hearkening back, to the experiences of literary imagination made concrete in words, and to systematic use of language for critical thought such as we do in science, in religion for narrative and theological investigation, etc. Whatever the field, Mr. Bush's is a phrasebook notion of political "message"-language, straight out of anxious corporate standard, in which saying the right terms, with luck in a poetically perfect arrangement, is all the message there is.

It's emphatically not the problem of "soundbites", as print journalists and their partisans in academia keep saying. This is just killing the media messenger. Short excerpts from longer texts can powerfully outline and encapsulate a message while not necessarily being only "message:" "Absolute power corrupts absolutely"; "E = mc²"; "Tune in, turn on, drop out."

[...]

In our politics, identity is "message" embodied. So listen to the language. Where, as Julia Ward Howe would have it, Lincoln, verbally embodied, would have Americans die for Freedom, Bush would have us die for Management. I'm not certain we're all, as they say in those parts, "on message."

In between, there are 130 pages mixing detailed analysis of particular speeches with general observations on history, culture, politics and language. I don't agree with all the analyses, whether linguistic or political, but the $8 that I sent amazon last year for my copy was money well spent.

Although Talking Politics has been reviewed here and there, it seems unlikely that very many people have read it. I don't know how many copies have been sold, but this book's current amazon sales rank is 634,034. This is truly abysmal. In comparison, Betsy Dyer's (excellent, but somewhat off-beat) Field Guide to Bacteria has a current amazon sales rank of 40,853, and Jacob Weisberg's little exercise in thought-free character assassination, Bushisms, has a current amazon sales rank of 7,978. Another recent book that (I happen to know) has sold around 3,000 copies overall has a current amazon sales rank of 4,788.

Based on the apparently well-informed discussion on this site, a sales rank of 600,000 or so means that about one copy is sold per month. Whatever the exact sales figures at amazon and across the whole marketplace, I think we can conclude that no one is now making any significant money out of Talking Politics. Not the author, and not the publisher. More important, only a handful of people are reading it.

So why not put the whole thing up on the web for free access, as a .pdf or in whatever other form comes easily? If the text was up there, I might try to get you interested in going through Silverstein's detailed analysis of the Gettysburg Address, or take up the question of corporate message-speak and its historical precursors. People on various sides of the current election campaign would take a look at the book and praise it or damn it, but anyhow quote it and think about it. Thousands of people -- maybe tens or even hundreds of thousands of people -- would read at least parts of it, and some of those people would be journalists or politicians or political scientists or other kinds of folk who don't normally buy stuff from Prickly Paradigm Press or read what linguistic anthropologists write.

And quite a few of those people would probably find $8 to buy a paper copy, as the Baen publishing company has found out with their Free Library. I'd be willing to bet the price of a good dinner for four that sales would increase rather than decrease -- maybe even enough for the royalties to cover the cost of dinner!

[Update 9/22/2004: a .pdf of Silverstein's pamphlet is now available here. ]

Posted by Mark Liberman at 11:41 AM

"No despicable part of their contemplation"

It seems wrong to me to assert, as Chuck Anesi does, that "[t]he Declaration of Independence... is nothing but a trite paraphrase of the leading ideas in John Locke's 1693 Concerning the True Original Extent and End of Civil Government". But it's true that Thomas Jefferson was an avid reader of John Locke's writing, and it's too bad that Jefferson didn't use his influence to promote Locke's ideas in education as he did in politics. If you're interested in the nature and use of language, and believe that this topic has too small a role in the current curriculum at all levels, you should join me in hoping that our societies some day consider Locke's advice seriously.

I specifically mean chapter XXI of book IV of An Essay Concerning Human Understanding, where Locke writes that "science may be divided into three sorts." By "science" he means "all that can fall within the compass of human understanding". And his three divisions are first "the nature of things, as they are in themselves, their relations, and their manner of operation", which he calls physica; second "that which man himself ought to do, as a rational and voluntary agent, for the attainment of any end, especially happiness", which he calls practica; and third "the ways and means whereby the knowledge of both the one and the other of these is attained and communicated", which he calls semeiotike.

The first of these divisions corresponds more or less to what we now call the natural sciences. The second would include ethics; political science, economics and most other aspects of the social sciences; and parts of psychology. But it's the third division that I find most interesting, the division that Locke calls "the doctrine of signs; the most usual whereof being words".

The business of this area "is to consider the nature of signs [that] the mind makes use of for the understanding of things, or conveying its knowledge to others". This is important "because the scene of ideas that makes one man's thoughts cannot be laid open to the immediate view of another, nor laid up anywhere but in the memory, a no very sure repository: therefore to communicate our thoughts to one another, as well as record them for our own use, signs of our ideas are also necessary: those which men have found most convenient, and therefore generally make use of, are articulate sounds".

Given this, Locke observes that "[t]he consideration... of ideas and words as the great instruments of knowledge, makes no despicable part of their contemplation who would take a view of human knowledge in the whole extent of it."

Locke uses an out-of-fashion syntactic structure in this phrase. He means "no despicable part of the contemplation of those who would take a view of human knowledge in the whole extent of it", but he gets there by connecting the relative clause "who would take..." to its head "their" across the noun "contemplation". And explaining how to construe a sentence spoils its effect, just as explaining the punch line of a joke does. As a result, it's harder than it should be to invoke the authority of Locke in support of the simple but important idea that "the consideration of ideas and words as the great instruments of knowledge" should form roughly a third of the curriculum at all levels.

If history had been different, I might only be wondering why Locke chose to write "no despicable part", instead of using a more straightforward phrase such as "a large part" or "an important part". Litotes was part of his style, but I still wonder why he uses it in some cases and not in others, and what he means to imply by choosing it here. However, as things have turned out, the part of the curriculum currently allotted to the contemplation of "ideas and words as the great instruments of knowledge" is indeed despicable, both in quantity and in quality, and I'd be happy enough to have that attribute negated.

Linguistics would still be an interdisciplinary field, under Locke's disciplinary taxonomy. Phonetics and parts of psycholinguistics would be physica; much of sociolinguistics, educational linguistics, language planning and so on might be practica; and the rest of course would be semeiotike. Hierarchical ontologies are rarely a very good fit to reality, even when they're devised by a hero of the Enlightenment. But universities, like other human endeavors, want hierarchies regardless of their philosophical validity, and I'd be happier with Locke's version than with the ones that we inhabit now. There would presumably be core subdisciplines within semeiotike into which most linguists would fit comfortably. Certainly I feel that I would.

[Update: Trevor at kaleboel is reminded of Habermas, but concludes that "in terms of clarity, Locke has his nose in front". ]

Here's the whole of Locke's chapter:

1. Science may be divided into three sorts. All that can fall within the compass of human understanding, being either, First, the nature of things, as they are in themselves, their relations, and their manner of operation: or, Secondly, that which man himself ought to do, as a rational and voluntary agent, for the attainment of any end, especially happiness: or, Thirdly, the ways and means whereby the knowledge of both the one and the other of these is attained and communicated; I think science may be divided properly into these three sorts:-

2. Physica. First, The knowledge of things, as they are in their own proper beings, their constitution, properties, and operations; whereby I mean not only matter and body, but spirits also, which have their proper natures, constitutions, and operations, as well as bodies. This, in a little more enlarged sense of the word, I call Phusike, or natural philosophy. The end of this is bare speculative truth: and whatsoever can afford the mind of man any such, falls under this branch, whether it be God himself, angels, spirits, bodies; or any of their affections, as number, and figure, &c.

3. Practica. Secondly, Praktike, The skill of right applying our own powers and actions, for the attainment of things good and useful. The most considerable under this head is ethics, which is the seeking out those rules and measures of human actions, which lead to happiness, and the means to practise them. The end of this is not bare speculation and the knowledge of truth; but right, and a conduct suitable to it.

4. Semeiotike. Thirdly, the third branch may be called Semeiotike, or the doctrine of signs; the most usual whereof being words, it is aptly enough termed also Logike, logic: the business whereof is to consider the nature of signs, the mind makes use of for the understanding of things, or conveying its knowledge to others. For, since the things the mind contemplates are none of them, besides itself, present to the understanding, it is necessary that something else, as a sign or representation of the thing it considers, should be present to it: and these are ideas. And because the scene of ideas that makes one man's thoughts cannot be laid open to the immediate view of another, nor laid up anywhere but in the memory, a no very sure repository: therefore to communicate our thoughts to one another, as well as record them for our own use, signs of our ideas are also necessary: those which men have found most convenient, and therefore generally make use of, are articulate sounds. The consideration, then, of ideas and words as the great instruments of knowledge, makes no despicable part of their contemplation who would take a view of human knowledge in the whole extent of it. And perhaps if they were distinctly weighed, and duly considered, they would afford us another sort of logic and critic, than what we have been hitherto acquainted with.

5. This is the first and most general division of the objects of our understanding. This seems to me the first and most general, as well as natural division of the objects of our understanding. For a man can employ his thoughts about nothing, but either, the contemplation of things themselves, for the discovery of truth; or about the things in his own power, which are his own actions, for the attainment of his own ends; or the signs the mind makes use of both in the one and the other, and the right ordering of them, for its clearer information. All which three, viz, things, as they are in themselves knowable; actions as they depend on us, in order to happiness; and the right use of signs in order to knowledge, being toto coelo different, they seemed to me to be the three great provinces of the intellectual world, wholly separate and distinct one from another.

Posted by Mark Liberman at 08:17 AM