Language Log: December 2003 Archives

December 31, 2003

Beware, corpus fetishists

Apropos of my recent transsexual pronominal reference story, let me just mention that the word transsexual gives us a truly frightening glimpse of the giant reservoir of error out there that Google keeps an index of. Google reported, the last time I checked, that the incorrect spelling transexual occurs on at least 1.87 million web pages, while the correct transsexual spelling occurs on only 1.37 million. Now, I take it that it is quite clear what is correct or incorrect in this domain: spelling is conventionalized and fixed in a way that grammar is not. This is not the time of the Paston letters, when spelling varied regionally and between families. So beware, corpus fetishists! The possibility of corpus research is a great asset to linguistics, and no one should try to work without corpus material; but there are major pitfalls for those who take the corpus to be the object of study. It is not the object of study. The language is the object of study. A corpus is just an assemblage of material through which we can study the language, and virtually any corpus is going to have errors in it. Possibly numbering in the millions, even outnumbering the correct forms. Deciding when some new expression type has become a part of the language and when we are simply dealing with a lot of people messing up is not an easy process.
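
For a sense of scale, here is a minimal sketch of the arithmetic behind those two figures. The page counts are the ones reported above; the variable names are purely illustrative, and Google's counts are themselves only estimates.

```python
# Page counts reported above, in millions of web pages (Google estimates).
hits = {"transexual": 1.87, "transsexual": 1.37}

total = sum(hits.values())

# Share of all pages (containing either spelling) that use each form.
share = {spelling: count / total for spelling, count in hits.items()}

# How many pages use the incorrect form for each page using the correct one.
ratio = hits["transexual"] / hits["transsexual"]

for spelling, fraction in share.items():
    print(f"{spelling}: {fraction:.1%} of pages")
print(f"incorrect-to-correct ratio: {ratio:.2f}")
```

On these numbers the incorrect spelling accounts for well over half of the pages, which is exactly why raw frequency cannot settle what is correct.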

Posted by Geoffrey K. Pullum at 09:48 PM

Such the surprise

Rosanne over at the X-bar asks

Has anyone encountered a such + definite NP construction in American (or, really, any) English? A pal of mine uses this frequently, with (1) and (2) being recent examples.

(1) She is such the smart girl (paraphrasable as "She is a very smart girl.")
(2) ...such the happy individual.

If you'd asked me cold for an unwired judgment, I would have said that "She's such the happy individual" isn't English. I certainly don't think I ever say or write things like that, and I don't have any memory of having heard or read them either. It sounds like a mistake for "She's quite the happy individual," which in turn sounds somewhat snooty and pretentious to me, FWIW.

However, given that I can ask the internet, I find that lots of people write things like I'm such the house wench, and Poor Homer... such the victim, and I am such the good son, and Wow. You're such the better man, and I am such the sheep lately, and OMG! Avril is such the coolest! and It's such the total love fest here and They are such the whores, man and so on. Live and learn.

[Update: Though I'm venturing out of my depth here, I wonder whether there is a connection to the much-ridiculed, much-imitated emphatic 'so'.]

[Update #2: while I was composing the previous sentence, Daniel Ezra Johnson wrote:

there's also the same construction with "so" in place of "such". here, there's the chance of confusion with the "so" meaning "too" (so a sentence like in "i am so the x" could have two readings) but google provides a lot of unambiguous examples too:

"i am so the consumer whore"
"i am so the tired one today"
"i am so the smitten kitten for you, raul"

i don't think future generations of linguists are going to have to make up sentences (at least for grammatical constructions). it is so easy to find incredible ones on google! the ones given above are just the first three, not even a selected group...
dan

p.s. "i am so the opposite. i LOVE pedicures!"

I agree about web-based exemplification, basically, with some caveats of the sort expressed here and here, and the additional observation that it's not possible to search the web for structures yet -- though in principle one could parse everything and search the results with appropriate tools.]
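
The difference between searching for strings and searching for structures can be sketched in a few lines. The pattern below is a hypothetical one of my own devising (a form of "be", then "such" or "so", then "the"), not anything Google or any corpus tool actually uses, and the sample sentences other than the quoted ones are made up for illustration.

```python
import re

# Hypothetical surface pattern: a form of "be", then "such" or "so", then "the".
# It matches word strings only; it knows nothing about syntactic structure.
PATTERN = re.compile(
    r"\b(?:am|is|are|'m|'s|'re)\s+(?:such|so)\s+the\b",
    re.IGNORECASE,
)

def find_hits(lines):
    """Return the lines that contain the surface string pattern."""
    return [line for line in lines if PATTERN.search(line)]

samples = [
    "I'm such the house wench",                  # quoted above
    "I am such the good son",                    # quoted above
    "i am so the tired one today",               # quoted above
    "She's quite the happy individual",          # different construction, no match
    "The reason is so the children can sleep",   # invented purpose clause: false hit
]

for line in find_hits(samples):
    print(line)
```

The last sample is matched even though it is not an instance of the construction at all, which is the kind of noise that a search over parsed structures could in principle avoid.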

[Update 5/24/2004: more discussion is here.]

Posted by Mark Liberman at 05:55 AM

mi etiam unico magis unicus

Jim at Uncle Jazzbeau's Gallimaufry shows that writing about degrees of uniqueness goes back to Plautus.

OK, now who can find an example in Gilgamesh?

Posted by Mark Liberman at 05:45 AM

Sex, ethics, pronouns, and culture (warning: adult situations)

The Chronicle of Higher Education's issue of December 19, 2003, page A17, has a news item about J. Michael Bailey, chair of the Department of Psychology at Northwestern University, who is alleged to have had sexual relations with a client and then used private details about that client and five others as research material for a book without getting informed consent from the people whose cases he discussed. But when I had finished reading the story I found I still needed to do quite a bit of inference to figure out what had gone on. The reason involves pronouns and sex. And from here on in I had better warn you that we get into what the movie warnings call "adult situations".

First I'll tell you what I could understand, and then I'll tell you what made it nonetheless baffling. The person who was later to complain about Bailey's unethical behavior apparently needed a letter from him in his professional capacity as part of a dossier relevant to the case for doing some elective surgery. In 1996 Bailey supplied this letter, which established a professional relationship between him and the complainant, a relationship falling under the scope of the American Psychological Association's rules about treatment of clients and research subjects. The surgery was later done, and in January 1997 the patient appeared as a guest speaker in one of Bailey's classes and talked about it. A bit over a year later, after some socializing at a nightclub on March 22, 1998, Bailey and the client had sexual relations at the client's apartment. And then five years afterward Bailey published a book which included a discussion of this client's case, having never revealed that he was planning this, and having never obtained consent for the information to be thus used.

Now I'll tell you what makes the story so hard to understand the way the Chronicle printed it. The surgery was sex reassignment. The letter was a recommendation that a sex change be performed on the patient. So now, suppose we ask ourselves a simple question, like whether the sex that took place was gay sex. Can you tell, from what I've said so far?

I carefully avoided using pronouns to refer to the patient. But if you do try to use them you find that our language just isn't properly equipped for a story like this. The only unproblematically singular third-person pronouns in English are strictly classified by gender: he for males, she for females, and it for (non-human) neuters. And it is tacitly assumed that the relevant characteristics are both knowable and immutable. Even if a woman's successful transition to being a man results in great self-awareness, you cannot describe it by saying *She eventually came to have a much better understanding of himself: it's simply not allowed by the grammar.

The way the Chronicle tells the story, the she pronoun lexeme is used for the patient throughout. The Chronicle says (and I will underline the gender indications): "The woman ... says that Mr Bailey, as a psychologist, gave her a letter she needed from a professional supporting her desire for sex-change surgery." And they write "She had the surgery in January 1997, was a guest speaker in one of Mr. Bailey's undergraduate classes the next month, and had sex with him at her apartment a year later..."

But you see the problem with this? If he gave the letter to a woman, and she desired sex-change surgery, then by the time she'd had it, at the guest appearance in the class, she'd have been a man, and it would be a bit odd to say he later had sex with her because she'd be a him. On the other hand, if Bailey had sex with a woman, then in 1996 she would have been a man, and it would be a bit odd to say he had given a letter to him and then had sex with him, because although he'd have given the letter to him he would have had the sex with her after he'd become a she. (I hope this isn't getting too confusing.)

The outline of the story that I think the Chronicle is relating could be made clearer if we told it like this. Bailey saw a male patient and gave him a letter recommending sex-change surgery. The patient took his letter to a surgeon, had his operation, and hence became a woman. Bailey asked her to come and speak about her operation in his class. Then a year later, he socialized at a nightclub with her, and went back to her home and had sex with her. After that he wrote a book and included details of her case in it, and she complained.

But that isn't how the Chronicle put it: they opted for she throughout, using the pronoun that would be appropriate for the complainant now as a way of referring to her at all stages of her life, so they had Bailey giving the letter to a woman, and they even refer to "her desire for sex-change surgery." That's more than a little confusing, you must admit.

This isn't the first time the issue has come up, of course. Master synthesizer keyboardist Walter Carlos, whose album Switched-on Bach was a huge best-seller, underwent a sex change later in his career and changed his name to Wendy Carlos. One wants to grant that Wendy should get some credit for Switched-on Bach, since after all, she is the person who (as Walter) played the music on it. Yet it seems very strange to say that she recorded it, because She recorded it entails that it was recorded by a female, and it was actually recorded by a male who at the time had not even announced an intention to one day be a female.

People say that language reflects culture, but one of the reasons it doesn't is that language can't change fast enough to keep up with culture. Sex reassignment surgery rapidly became a fact of our culture once the relevant techniques were worked out in the 20th century, but the English language couldn't change fast enough to keep up with what that meant. It'll be the same when gay marriage becomes a fact of our culture. Are the two guys bridegrooms? Is either of them a bridegroom? Should the minister say, "I now pronounce you husband and husband"? Here, for once, we have the kind of situation where the language really isn't equipped to talk about the new things we have to say. We'll work something out, but it's going to be rocky in the early years.

Posted by Geoffrey K. Pullum at 01:38 AM

December 30, 2003

cq

As Geoff Pullum points out, it isn't obvious how to pronounce Cquila in English. In the International Phonetic Alphabet, <c> represents a voiceless palatal stop, <q> a voiceless uvular stop. Neither occurs (as a distinct speech sound) in English. <c> is also used in some languages and by some linguists for /ts/ but English doesn't allow /ts/ at the beginning of words. (There are a few loanwords spelled with initial <ts>, such as tsar and tsunami, but they aren't pronounced /ts/ by ordinary English speakers.)

There is however at least one language that does have words beginning with the letters <cq>, namely Secwepmectsín, known in English as Shuswap. Shuswap is one of the Salishan languages. It is spoken by about 500 people in East central British Columbia. The big city in Shuswap territory is Kamloops, the site of the annual Kamloopa Powwow, which is a lot of fun. Some of the signs are in Shuswap, and the older Shuswap people speak it, but since this is a big league powwow, most of the MCs are imported and the announcements tend to be in Blackfoot or English.

In this case the <c> represents a voiceless velar fricative, the sound spelled x in the IPA. This is the sound at the end of German Bach. In Shuswap spelling <q> represents a voiceless uvular stop, as in the IPA. The sequence [xq] may take a little practice since you have to shift the back of your tongue from velar to uvular position while moving from the narrow constriction of the fricative to the complete closure of the stop, but it isn't all that hard, much less impossible. Here are some examples from the Shuswap dictionary:

cqénkem	to open up a fish, take out the guts
cqu7	bay
cqp'uxén	to have a broken leg
cqp'exw	crowded together
cqp'úgwi	younger brothers and sisters
cqp'ellp	tree, Subalpine Fir
cqellqíllsc	to keep someone awake
cqelmexwwílx	to recover from illness
cqp'élnem	to shoot
cqelqp'lélx	button

In case you're wondering, <x> in Shuswap spelling represents the voiceless uvular fricative. Consonants followed by an apostrophe are glottalized, <ll> represents the voiceless lateral fricative, and <7> represents glottal stop.
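
The spelling conventions just described can be sketched as a small lookup table. This is only an illustration, not a full transcription scheme: the table covers just the graphemes mentioned in this post, every other letter is passed through unchanged, and the choice of the modifier apostrophe to mark glottalization is my own assumption. The names GRAPHEME_TO_IPA and to_ipa are hypothetical.

```python
# Shuswap graphemes discussed in this post, mapped to IPA symbols.
GRAPHEME_TO_IPA = {
    "ll": "\u026c",  # voiceless lateral fricative
    "c": "x",        # voiceless velar fricative
    "q": "q",        # voiceless uvular stop
    "x": "\u03c7",   # voiceless uvular fricative
    "7": "\u0294",   # glottal stop
    "'": "\u02bc",   # glottalization of the preceding consonant (assumed notation)
}

def to_ipa(word):
    """Convert only the graphemes in the table; copy everything else as-is."""
    out = []
    i = 0
    while i < len(word):
        # Try the two-letter grapheme <ll> before single letters.
        for size in (2, 1):
            chunk = word[i:i + size]
            if chunk in GRAPHEME_TO_IPA:
                out.append(GRAPHEME_TO_IPA[chunk])
                i += len(chunk)
                break
        else:
            out.append(word[i])
            i += 1
    return "".join(out)

for entry in ["cqu7", "cqp'ellp", "cqp'exw"]:
    print(entry, "->", to_ipa(entry))
```

The conversion is done in a single left-to-right pass so that the [x] produced from <c> is not then mistaken for the letter <x>.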

Posted by Bill Poser at 09:32 PM

Does "Locklear" really rhyme with "cochlear"?

Richard Lederer, "the author of many books about language and humor", is the proprietor of the site Verbivore. He writes a widely-reprinted column Looking at Language, which I stumbled over while researching a recent post on presidential pronunciations. The piece that I found, oddly titled "A Nuke-Yoo-Lur Proliferation Theatise," was in the online archive of the journal Boomer Times and Senior Life ("A Monthly Magazine Serving Senior Adults of South Florida since 1990"), Feb. 2003, Volume 14, Number 2.

Mostly the article seems to be about how scary it is that our current president pronounces nuclear the same way that Dwight D. Eisenhower did. I find it somewhat comforting, myself, but whatever. I also pass over in silence the column's title, except to ask: "theatise?"

The point that I want to focus on is Lederer's contribution to the analysis of the stigmatized pronunciation of nuclear. He cites a NYT article by Jesse Sheidlower, the American editor of the Oxford English Dictionary, as follows:

"Sheidlower ... observes that the -ular combination is a common pattern in English -- circular, muscular, particular, vascular and the like -- while -lear is heard only in rare words such as likelier and cochlear (and, I'll add, Heather Locklear)."

Huh? I don't have any special expertise in the pronunciation of Heather Locklear's last name, but I've always assumed that it has two syllables and rhymes with "mock year", not with "cochlear" (which has three syllables). Awaiting confirmation or correction, let me suggest that Mr. Lederer, despite being known as "Conan the Grammarian," may be having trouble with the difference between spelling and sound.

If remediation is indicated, I suggest the exercise of rendering (passages from) Gerard Nolst Trenité's poem The Chaos into the International Phonetic Alphabet, as I ask the students in Linguistics 001 to do each fall.

Posted by Mark Liberman at 05:15 PM

Mispronunciation -- or prejudice?

YourDictionary.com's press release on Top Words of 2003 has sections on "Top Ten Words", "Top Ten Names", "Top Ten Youthspeak Words", etc. -- and an odd little section entitled "5 Top Mispronunciations by President Bush in 2003". I'm surprised that this got past yourDictionary's eminent "Advisory Council of Experts". As far as I can see, only one of the five cited examples is actually a genuine mispronunciation -- "Anzar" for "Aznar" -- and this is a mistake that Bush apparently made in 2001! The rest are either standard American pronunciation variants, or instances of regional variation in the treatment of reduced vowels around liquids.

1. a-MERR-ca
a-MER-i-ca (America)

2. NEW-cue-ler
NEW-clee-er (nuclear)

3. JU-ler-ee
JU-wel-ree (jewelry)

4. An-zar
Spanish Prime Minister Jose Maria Aznar

5. Ne-VAH-duh
Ne-VAE-duh (Nevada)

Taking the last one first: the stressed syllable in Nevada is given in both versions by Merriam-Webster and American Heritage (with the vowel of cot or the vowel of cat). Don't the folks at yourDictionary.com, who are in the online dictionary business, consult any dictionaries before putting this stuff out?

Two of the remaining three (nuclear and jewelry) are widespread variant pronunciations, not much more culpable than "Febyuary" for February, which Merriam-Webster gives as the first pronunciation. With respect to the pronunciation of nuclear, both M-W and American Heritage cite Bush's pronunciation as one of their variants, with a usage note (the American Heritage calls it "generally considered incorrect"). However, as Geoff Nunberg writes, "'nucular' is a choice, not an inadvertent mistake", and it is apparently more standard than not among people who deal routinely with nuclear weapons. With respect to jewelry, I'd like to look at a spectrogram of the offending pronunciation before categorizing it -- it might just have been the version that M-W represents as ['jül-rE]. A true schwa vowel between /l/ and /r/ is unlikely, but partial or complete vocalization of the /r/, if it happened, would be an instance of a widespread pattern.

While censuring the president for perhaps inserting an extra schwa in jewelry, yourDictionary slags him off for leaving one out in America. The question of what happens to fully reduced vowels, especially around liquids like /l/ and /r/, varies in complicated ways in different varieties of English. In the speech of the South Midlands, reduced vowels are often transformed by assimilation into lengthening of adjacent consonants or vowels, as can be heard in this example of speech from a woman from Tennessee (waveform and spectrogram below).

Note that the treatment of reduced vowels around /l/ or /r/ is at stake in three of the five cited cases. Even without getting into sociolinguistic studies of such things, we can be pretty confident that these are stigmatized changes in progress, or long-standing regional or class prejudices, just by reading how upset the language maven Dr. Richard Lederer, "the 2002 recipient of the Toastmasters International Golden Gavel Award," gets about the whole business.

So, let's sum it up. Depending on what Bush actually said for jewelry, one or two of the examples are normal variant pronunciations, two or three of the examples are widespread regionalisms (or other socially marked variants), and one is a genuine mistake in pronunciation -- which was made in 2001, well past the statute of limitations for mistakes of 2003!

Chill, yourDictionary guys -- lexicography shouldn't be prostituted to treat stigmatized class or regional speech as "mistaken". If you want to make a list of presidential regionalisms, fine -- but don't call them mispronunciations. And don't throw in variant pronunciations without checking them, just because they're different from what you say yourself.

[Update: two friends on yourDictionary's panel of experts, Stephen Anderson and Mark Aronoff, confirm that they didn't see this list before it went out, and agree in essence with my analysis of the cited "mispronunciations."

Mark observes that the "point is a very subtle one for non-linguists," which is certainly true, and that "Bush-haters will grasp at anything", which is also true, though I'm not sure whether it's a defense or a further criticism.

Steve notes that the yourDictionary site "does ... contain a considerable amount of useful information, generally correct as far as I've sampled it", a point that I agree with. He concludes that "I guess it's a bad thing for linguists to be associated with factually incorrect -- and overly normative --- observations about language, but otherwise I would consider your reaction a bit overwrought....." Well, OK, maybe so. But getting "a bit overwrought" is kind of a weblog tradition, after all. It's better than being underwrought, though of course here at Language Log we aspire to being exactly wrought enough.]

Posted by Mark Liberman at 03:17 PM

How to call Cquila's name

It almost seems like every new African American female of college age or younger that you meet these days has a completely unique invented name. The New York Times on Saturday, December 27, 2003, carried an article by Leslie Kaufman about changes in what is provided to the poor in food baskets. It was illustrated with a photo of a six-year-old girl named Cquila Singleton. Cquila is definitely unique, getting no Google hits at all as of today. (By tomorrow there may be one hit, but it will be this page.) And one can only guess at the intended pronunciation.

The invented names that black mothers bestow on their daughters are often rather beautiful phonetically, and generally fashioned to look and sound vaguely African. In the case of Cquila there is a distinct suggestion of the orthographies of Southern Bantu languages like Zulu. But in those orthographies Cquila would be the spelling of something completely unpronounceable. The letters c, q, and x are used for velaric ingressive stop consonants -- clicks, as they are more usually known. Roughly speaking, c stands for a dental click (made by sucking the tongue tip away from the back of the upper front teeth); q represents a deeper-sounding postalveolar click performed with rounded lip position and tongue pulled away from the front of the roof of the mouth (people use it to imitate the sound of a champagne cork coming out of the bottle); and x stands for the lateral one (a clicking at the two sides of the tongue used conventionally to gee up horses). The last of these occurs in the name of the language Xhosa; Peter Ladefoged has examples of the clicks in this language here, and lots of other fascinating material on the same site.

The click in Xhosa is aspirated, which means it is immediately followed by an h sound. But you can hardly follow a click by a click. It never happens in the Southern African languages that have clicks, anyway (though Julian Bradfield points out that the earlier version of this post was too strong: producing two clicks in quick succession is phonetically possible); cq couldn't ever be the beginning of a well-formed Zulu or Xhosa word.

It is possible for a dental click to be immediately followed by a uvular stop (like the last sound in the word Iraq when correctly pronounced). That happens in certain Bushman languages spoken in the Kalahari desert area, such as !Kung, as heard in the movie The Gods Must Be Crazy. But for those the usual spelling (invented by German missionaries) has the slash / for the dental click, the exclamation mark for the postalveolar one, and two slashes for the lateral one. In more recent proposals a k is prefixed to a voiceless click (and g for a voiced one and n for a nasal one). The (imaginary) word k/quila would be pronounceable in a Bushman language. But it's probably not the intended pronunciation of Cquila's name.

There are words ending in cq in at least some Romance languages; English gets a few of them in the form of foreign proper names like Domecq (the family name of Pedro Domecq, from Spain). But I don't think there are any languages in which words can begin with cq.

Except for post-1997 English, of course, if you count the name Cquila as an English word. Little Cquila (she is six) will have to tell everyone how she wants her name to be pronounced, because I can't even guess. People will probably make attempts sounding like keela, queela, ka-queela, sa-queela... She may end up being nicknamed Tequila. You don't know what you've started when you invent a name whose spelling doesn't indicate a pronunciation in any known human language.

Posted by Geoffrey K. Pullum at 03:04 PM

On the Road with Big Brother

Concern is expressed about the side effects of conversations with networked objects.

Posted by Mark Liberman at 01:18 PM

Neat stuff at NITLE

There's all kinds of interesting information at the NITLE Blog Census. The (top of the) surprising rank ordering of languages has not changed since it was discussed here back in October: English, Portuguese, Polish, Farsi, French. And onwards: Spanish, German, Italian, Chinese-big5, Catalan, Dutch, Icelandic, Indonesian. I'm still somewhat skeptical about what the TextCat language classification algorithm is doing: it's hard to believe that Russian is nowhere in the top 25, but Breton is... And that there are really almost 5 times more Icelandic bloggers than Japanese... However, the top of the list seems likely to be right.

See blogcount for more blogospheric information. Blogdex just now has yourDictionary.com's "Top Ten Words of 2003" as the sixth "most contagious information currently spreading in the weblog community".

Posted by Mark Liberman at 01:13 PM

More on the computational linguistics of smells

Fernando Pereira picks up the question of the computational linguistics of smells:

Surprising as it might seem to outsiders, this question is central to modern computational linguistics. One side will argue that without perceptual grounding, anything we glean from texts is a poor, fake proxy. The other feels that the grounding of much of the language we use, especially that pertaining to social and technical topics, is other language. ...

Current information-extraction techniques based on labeling a bunch of documents and learning pattern matchers from the examples take less advantage than we'd like from co-occurrence statistics. Some research ... suggests that one can do much better using lots of unlabeled data, but at present those techniques are a black art: sometimes they work, sometimes they don't, and it's not yet clear why. I think that part of the problem is that existing techniques focus on just one kind of entity and very superficial features, while the way we learn that CPEB may denote a kind of protein involves seeing the term used in relation to several other terms, themselves belonging to rich terminological networks of which we have some knowledge.

Read the whole thing, including Fernando's Proustian ruminations on mildew smells across time and space.

I conjecture that biomedical text may be the best initial testbed for the kind of research that Fernando describes (as he broadly hints in his note), since it's easy to get access not only to billion-word text corpora but also to a rich and varied universe of bioinformatic databases and ontology-attempts. The fact that the results may often be intrinsically worthwhile is another motivation to look in that domain first.

Posted by Mark Liberman at 12:30 PM

Combating lexicalist prejudice

People often write about language as if it were nothing but words, words, words. Language Log is therefore accepting nominations for X-of-the-year awards at other levels of linguistic analysis:

Allophone of the year. New ways of pronouncing (English) phonemes in context. This reaches the popular imagination occasionally through discussions of regional, class or other social-group pronunciations, like Valley Speak.

Affix of the year. New methods for making new (English) words out of old ones. An obvious recent example is -izzle.

Construction of the year. New ways of combining (English) words into phrases. The problem here is to find "new" syntactic usages that don't turn out to go back to the 18th century. The syntactic clock ticks rather slowly.

Word sense of the year. New meanings for old (English) words. For example, the OED's last quarterly update included new meanings for churn "... Change to a customer base; esp. a large and rapid loss (and replacement) of subscribers to a particular service. Also: turnover or reorganization of employees. Cf. CHURN RATE n." and fist (whose new definitions I won't quote in this PG-rated weblog).

Intonation of the year. New melodies for (English) utterances. This one reached the popular imagination in the early 1990s, via uptalk.

Rhetorical trope of the year. New structures or frameworks for arguments. Call the Rockridge Institute!

Logical form of the year. Is this one possible? Hilary Putnam once argued that logic is empirically testable and indeed must be revised based on the discoveries of 20th century physics -- but do the logical structures and interpretative principles of natural languages ever change?

There could be others -- Discourse marker of the year, Disfluency of the year, etc. -- but we'll leave it there for now.

The fact is that most linguistic innovations have a pretty long history by the time they are noticed. The main exception is a new phenomenon (or at least a new conceptualization of an old phenomenon) that is given a new name (like "TSE" or "prion") or assigned a new sense of an old word (like "embed" or "churn"). Even in most of those cases (like "spider hole"), the only real novelty is the new intensity of public interest in the topic.

Posted by Mark Liberman at 10:39 AM

Extreme bibliophilia

This should be a warning to all Erasmians -- though it seems to have been the magazines that did Mr. Moore in.

Posted by Mark Liberman at 08:44 AM

December 29, 2003

lexical change through consumer fraud

I'm not usually one to complain about how the language is going to the dogs, but sometimes using words correctly really matters. A few minutes ago I turned on the TV and watched the final question on Jeopardy. The answer was (I paraphrase) "A condiment eaten with sushi and also eaten at Passover". Since there is no condiment satisfying both conditions, you might think that the contestants all got it wrong. Two were way off: they responded "nori" and "ginger". The one who got it "right" responded "horseradish", which Alex Trebek explained is the same thing as wasabi. It isn't.

Horseradish and wasabi are different plants. They don't even belong to the same genus. Horseradish is Armoracia rusticana. Wasabi is Wasabia japonica, also known as Eutrema japonica. Anyone who has tasted real wasabi knows that it doesn't taste the same as horseradish. Another subtle clue is that wasabi is green; horseradish isn't.

wasabi, by the way, is written like this: 山葵. This is a nice example of a Chinese character idiom. The first character means "mountain". Its native Japanese reading is yama; its Sino-Japanese reading is san. The second character means "hollyhock" and has the native reading aoi and the Sino-Japanese reading ki. No matter how you try, you can't get wasabi from these components. The fact that these two Chinese characters together are read wasabi is morphologically arbitrary.

Now, to return to the point, why is it that the Jeopardy folks think that wasabi is horseradish? I think that this is an instance of lexical change through consumer fraud. Real wasabi is indigenous only to Japan and Sakhalin and people have succeeded in growing it only in a few other places, such as Oregon. The real thing is expensive, and for it to be any good, it has to be freshly grated from the root. The result is that as sushi has become more popular, more and more of the "wasabi" served in the United States has been fake. It isn't wasabi: it's horseradish with green dye. Many people don't know the difference between wasabi and horseradish because in their experience there isn't any.

Posted by Bill Poser at 08:42 PM

You can't be too careful

"I seriously doubt that anybody who publicly uses the word 'contretemps' can ever be elected president," Nicholas Kristof said a few weeks ago in a Times column enumerating all the reasons why Howard Dean is unelectable. Actually, Dean did use the word to describe the Confederate flag episode, but only in a conversation with the Times's own editors. Probably it would have been more accurate to say that no one can be elected president who uses "contretemps" in public, even over lunch in a 100- Zip code.

One way or the other, it's a new category of political gaffe. Which begs the question, can someone who uses "gaffe" publicly be elected president?

Posted by Geoff Nunberg at 07:39 PM

Words of 2003

Columnists at a loss for other topics are beginning to write year-end roundups, rating and ranking the year's contributions in areas from gadgets to celebrity scandals. Those writers who round up the year's words, unlike those who list celebrity scandals, seem to cite authorities: this piece in the Philadelphia Inquirer quotes Erin McKean, "senior editor for U.S. dictionaries at Oxford University Press in New York", and "linguist Wayne Glowka, ... who is chairman of the new words committee for the American Dialect Society", while this piece from Reuters quotes "Paul JJ Payack, president of YourDictionary.com" (dotless J's original).

The American Dialect Society will choose its "word (or phrase) of the year" for 2003 in a session starting at 5:30 p.m. on January 9th, 2004, at the Sheraton Boston. Oddly, there is no live TV feed. I haven't seen a list of nominations, but perhaps someone more closely involved with the ADS can supply one. I'll be at the LSA meeting in the same hotel at the same time. I've never been to one of the ADS "word of the year" sessions, but if the Sheraton Boston's "Wireless internet in the Lobby" reaches the meeting room, I'll report from the spot. As Geoff Pullum has pointed out, the LSA welcomes visitors with open arms (or at least benign indifference), and I imagine that the ADS does as well, so stop by if you're around.

[Update 12/30/2003: Many outlets picked up the yourDictionary press release, which came out on Christmas Day: ZDNet, CNN International, ABC Online Australia, Seattle Times, Daily Times (Pakistan), The New York Post, WNDU-TV, KLAS-TV, Web User UK, etc. I suspect that the ADS choices will come too late to get similar uptake, but we'll see.

The Calcutta Telegraph has a fascinating piece featuring words of Indian English, mostly humorously invented.

Here's an article from the SF Chronicle, which asks readers to "[h]elp us choose the Word of the Year".]

Somehow the lexicographical pundits that I've sampled have missed the new word (or phrase) FRT (for Fast Repetitive Tick), which is a sound made by "bubbles coming out of a herring's anus". As Dave Barry put it

Isn't modern technology amazing? A hundred years ago, if you had told people that some day there would be a giant network of incredibly sophisticated ''thinking machines'' that would allow virtually anybody, virtually anywhere on Earth, to hear a herring cut the cheese, they would have beaten you to death with sticks.

For subscribers, here is a note in Science about the same topic, which may possibly exhibit the first use of the word "farting" in that august publication (though I haven't checked). In any case, this is certainly another confirmation of the change in linguistic standards that John McWhorter recently described.

Here is a page on marine biologist Ben Wilson's site with a link to the .wav file that Dave Barry wrote about so movingly. The primary reference is: Wahlberg, M., H. Westerberg (2003). Sounds produced by herring (Clupea harengus) bubble release. Aquatic Living Resources 16: 271-275. The abstract is available, for those with the right access, but the full article does not seem to be on line, with or without subscription. Crucial information:

The characteristic sound made by herring during gas release is denoted as the pulsed chirp. This pulsed chirp is 32-133 ms long (N = 11) and consists of a series of 7-50 (N = 11) transient pulses with a continuous reduction of the frequency emphasis (centroid frequency of first pulse 4.1 kHz and of last pulse 3.0 kHz, N = 11). The source level of the chirp is 73 ± 8 dB re 1 μPa rms (root mean square) at 1 m (N = 19).

I observe in passing that the first scientist to document the comparable human sounds in similar detail is virtually guaranteed an Ig® Nobel Prize.

Posted by Mark Liberman at 03:57 PM

December 28, 2003

Blah

The linguist who invented Chicken seems to have taken a position at the Filipino men's magazine FHM. (from Ad Age, via Instapundit).

Posted by Mark Liberman at 11:47 AM

Bad words getting better?

John McWhorter writes in the WaPo on changes in standards for speech in broadcasts, in public and in private.

Posted by Mark Liberman at 11:06 AM

Advantage: Google

Six relevant mad cow acronyms: BSE (bovine spongiform encephalopathy); CJD and vCJD (Creutzfeldt-Jakob disease, variant Creutzfeldt-Jakob disease); CWD (chronic wasting disease); FSE (feline spongiform encephalopathy); TSE (transmissible spongiform encephalopathy).

Four on-line dictionaries: The Oxford English Dictionary (OED); Microsoft Encarta; Merriam-Webster OnLine; American Heritage.

All four online dictionaries have BSE. OED and Encarta have CJD, but Merriam-Webster and American Heritage don't. Only OED has vCJD. None of the four dictionaries has CWD or FSE. Only Encarta has TSE.

Score: OED 3 of 6, Encarta 3 of 6, Merriam-Webster 1 of 6, American Heritage 1 of 6.

Google, of course, scores 6 of 6.
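For anyone who wants to redo the tally, here is a minimal Python sketch. The coverage table simply transcribes the lookups reported above; the names ACRONYMS, COVERAGE, score and SCORES are mine, chosen for illustration.

```python
# Acronyms checked, and which of them each dictionary had,
# transcribed from the lookups reported in this post.
ACRONYMS = ["BSE", "CJD", "vCJD", "CWD", "FSE", "TSE"]

COVERAGE = {
    "OED": {"BSE", "CJD", "vCJD"},
    "Encarta": {"BSE", "CJD", "TSE"},
    "Merriam-Webster": {"BSE"},
    "American Heritage": {"BSE"},
}


def score(found, wanted=ACRONYMS):
    """Count how many of the wanted acronyms appear in the found set."""
    return sum(1 for acronym in wanted if acronym in found)


SCORES = {name: score(found) for name, found in COVERAGE.items()}

for name, n in SCORES.items():
    print(f"{name}: {n} of {len(ACRONYMS)}")
```

Adding another acronym to ACRONYMS, and to the sets of whichever dictionaries list it, re-scores everything without further changes.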

[Update 12/29/2003: Mark Worden points out that I overlooked one acronym, nvCJD (new variant Creutzfeldt-Jakob disease), which is another name for vCJD --

"New variant CJD (nvCJD) or variant CJD (vCJD): name given to a newly identified human TSE which is significantly different from other forms of CJD. The number of definite and probable cases is 153 people (143 in the U.K., six in France, one in Ireland, one in Italy, one in the United States, and one in Canada. Scientists have concluded that the patients in the United States and Canada contracted nvCJD in the U.K.)*22 (Both nvCJD and vCJD refer to the same entity. nvCJD is used throughout this information resource primarily and is preferred by many experts; however, vCJD is also commonly used.)"

Of the four dictionaries checked, only Encarta and OED have nvCJD. The OED is the only dictionary that has both vCJD and nvCJD, and it simply expands the acronyms, without indicating that the two terms are different ways of referring to the same thing.

Updated scores: OED 4 of 7, Encarta 4 of 7, Merriam-Webster 1 of 7, American Heritage 1 of 7, Google (i.e. the internet) 7 of 7.

It's a little surprising that on-line lexicography is not more up to date on these terms, since they seem to have been used in the specialist literature since about 1997 (CWD since the late 1960s), and they refer to various aspects of a matter of major public health and public policy concern. ]

Posted by Mark Liberman at 10:24 AM

Names of smells

As a grown-up version of Bertie Botts' Every Flavor Beans, Demeter offers perfumes in fragrances like dirt, crust of bread, sawdust and laundromat, as well as tobacco, condensed milk, Earl Grey tea and cranberry, and more traditional things like patchouli, honeysuckle and sandalwood. I don't quite get it. Do people buy these as a joke gift, for the incongruity of a fancy bottle of perfume labelled mildew -- that actually smells like mildew? Or do they buy them because they really want to go around emitting wafts of turpentine or lobster? Or is it because they get a proustian rush from privately uncorking their bottle of sticky toffee pudding or stable?

Anyhow, Demeter's list of currently available fragrances suggests a problem in computational linguistics: devise an automatic algorithm that analyzes a very large text corpus to derive a comparable list of "names of things with evocative smells". (In fact one should be able to do better, since Demeter's list is not really very long, systematically omits highly offensive smells like cat piss and rotten eggs, and includes some odorless oddities like holy water ... ) This problem in itself is not important, but it's an instance of an interesting class. It would be nice, for instance, to be able to process biomedical text so as to derive a list of names of structural proteins, or diseases of domestic pets, or insects implicated as disease vectors, or whatever.
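To make the problem concrete, here is a deliberately naive first pass in Python: look for a handful of cue phrases ("smell of X", "smells like X") and count what follows them. Everything in it is an assumption of mine for illustration: the cue list, the stopword list, and the three-word cutoff. A serious attempt would need a parser to find noun-phrase boundaries, a much larger set of cues, and some way to filter out figurative uses like "the smell of fear".

```python
import re
from collections import Counter

# Hypothetical cue phrases that tend to introduce the name of a smell.
CUE = re.compile(
    r"\b(?:smell|scent|odor|aroma|whiff|stench)s? of\s+"
    r"|\bsmell(?:s|ed|ing)? like\s+",
    re.IGNORECASE,
)

DETERMINERS = {"the", "a", "an", "some", "that", "this"}
# Words that usually mark the end of the noun phrase we are after.
STOP = {"and", "or", "in", "on", "at", "from", "that", "which", "is", "was"}


def smell_names(text, max_words=3):
    """Count candidate smell names following a cue phrase in text."""
    counts = Counter()
    for match in CUE.finditer(text):
        # Take the rest of the clause, up to the next punctuation mark.
        clause = re.match(r"[^.,;:!?()\n]*", text[match.end():]).group(0)
        words = clause.lower().split()
        # Skip leading determiners such as "the" or "a".
        while words and words[0] in DETERMINERS:
            words.pop(0)
        phrase = []
        for word in words[:max_words]:
            if word in STOP:
                break
            phrase.append(word)
        if phrase:
            counts[" ".join(phrase)] += 1
    return counts


sample = (
    "The room had the smell of fresh sawdust and old glue. "
    "It smells like wet dog in here. She loved the scent of honeysuckle."
)
print(smell_names(sample).most_common())
```

Run over a very large corpus, the interesting part would be the ranking: candidates that turn up after many different cues, in many different documents, are more plausibly names of things with evocative smells than one-off matches are.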

[link via join-the-dots]

Posted by Mark Liberman at 09:35 AM

December 27, 2003

Mad cow words

Here's some interesting biomedical stuff on prions and mad cow disease. I've added a bit of lexicography for the obligatory language link.

Researchers at Columbia and MIT have found a protein in sea slug neurons (cytoplasmic polyadenylation element binding protein, or CPEB) that appears to use prion-like alternative forms as part of a mechanism for encoding long-term memories (NYT article). This could be a big deal for two reasons -- it might help explain how memories are formed (or more generally, how synapse-specific long-term facilitation works), and it might help explain where prions come from, and why they seem to form spontaneously in pretty much all animals. If true, either of these would be important enough to elevate CPEB (or some other nickname for these proteins) into the general vocabulary. It's likely -- given nature's thriftiness with basic mechanisms -- that similar tricks are used for lots of cellular switching functions, and thus may also be involved in other disease processes, making the discovery even more important. Some more links are here and if you subscribe to Cell here, here and here.

A lot of attention has been paid in the media (e.g. here) to the fact that the animal recently diagnosed with BSE was a "downer", i.e. was too sick to walk on its own when it arrived at the slaughterhouse. I agree with the note from Dr. Weinstein in this posting at ProMED-mail (scroll down to item [2]): "it makes me more than a little nervous to find out that obviously sick animals are still sent for slaughter to enter the human food chain."

This additional information provided by the ProMED-mail editors is just as distressing:

"Cattle are humanely stunned with a captive bolt stunner that
penetrates or piths the brain rendering the animal unable to feel
pain. However, the animal is not dead. Depending upon the speed of
the slaughter plant the animal remains alive, but unable to
comprehend or feel pain, for an average of 2 to 7 minutes before the
throat is cut, exsanguinating the animal.

During that 2 to 7 minutes the neurological tissue that captive bolt
compressed into the brain and into the blood stream can circulate
throughout the body, as long as the heart beats. The prion is smaller
than a red blood cell. Therefore, it would appear that the prion
agent can be in muscle tissue. (The Lancet, Sep 14, 1996, Letter to
the Editor)."

I had (falsely) assumed that cuts of meat away from the bone are likely to be safe, based on the earlier regulations in the U.K., which claimed on "the latest scientific advice" that properly boned beef can be eaten "with complete confidence." It sounds like "captive bolt stunners" are a really bad idea from the prionic point of view (I just made up the word prionic, by the way, but according to google, at least 921 others have engaged in anticipatory plagiarism).

This discussion makes it seem that Kosher or Halal beef would be safer in this respect. Of course, testing all slaughtered animals for prions would be even safer. Or becoming a vegetarian.

Neither the OED nor Encarta nor Merriam-Webster nor American Heritage has an entry for "captive bolt stunner," and all think that "downers" are (only) sedative drugs or depressing things, not cows who can't walk. I bet that downer soon makes it into jokes on late-night TV, if it hasn't already. I'm not sure about captive bolt stunner -- it depends on how the public discussion develops. I didn't bother looking for cytoplasmic polyadenylation element binding protein -- not even the Enzyme Commission has that one yet. Contrary to what I wrote here earlier, mad cow disease itself makes it into the online versions of the OED, Encarta, American Heritage and Merriam-Webster -- if properly looked up.

Update: this article from the Financial Times contains a very interesting -- and reassuring -- quantitative comparison with the British BSE/CJD episode of a few years ago:

In contrast to the single infected cow in Washington state, the UK has had 180,000 confirmed BSE cases.
As many as 750,000 infected animals may have entered the British food chain before the disease was recognised and proper precautions taken.
Even now, 11 years after the BSE epidemic reached its peak, several new cases a week are reported in British herds.
The incidence of variant CJD, the fatal human disease linked to eating BSE-contaminated meat, peaked in 2000. The cumulative death toll from vCJD stands at 138.
Although statisticians say it is too early to be sure how many people will die, most expect the eventual total to be about 200 to 300 - assuming that there is no secondary epidemic spread by infected blood supplies.
On that basis, even a few hundred cows with BSE in the US would not be likely to cause any human disease.

Posted by Mark Liberman at 07:17 PM

Cullen Murphy draws the line

In the Atlantic, Cullen Murphy writes that "... surely there are a handful [of standards] on which we might all agree to hold the line—this far and no further, unto the end of days. To start this long-overdue public conversation, I'll propose ten." His #3 is

III. Notoriety does not denote "famousness," enormity does not denote "bigness," and religiosity does not denote "religiousness."

I agree with Murphy about the meaning of these words, personally, but the basis of his strictures in history and present usage is more tenuous than one might like for a standard that we are supposed to uphold "unto the end of days". Or to put it more bluntly, sez who?

Murphy's column is at best half serious, and much of his new decalogue could be charitably interpreted as playful recycling of mildly un-PC rectitudes -- for instance his #2 and #5 are:

II. "Women and children first" (except maybe Ann Coulter).

V. "Honey, you look great!" (still the only correct answer).

So maybe it's unfair to take him to task for bad judgment in picking linguistic examples. Still, I'm disappointed. I'd expect him to be able to find some obnoxious new usages to (playfully pretend to) hold the line against. Instead, he picks three cases where he's objecting to the retention of an earlier (often original) meaning of a word.

There's no general rule that the development of a more specific meaning must drive out an earlier, more general one. Sometimes it happens and sometimes -- probably more often -- it doesn't. The OED considers that the more general sense is obsolete in only one out of three of Murphy's examples.

Was Murphy too lazy to check, too insensitive to see the difference? This is hard for me to believe about someone who has written the comic strip Prince Valiant since the mid 1970's. Or is this aspect of his piece just a subtle tongue-in-cheek subversion of his own Language Maven schtick?

For those without easy access to the OED, here's a summary.

Notoriety. The OED's first sense for notorious is "Of facts: Well known; commonly or generally known; forming a matter of common knowledge." The cited examples make it clear that the well-known facts need not be negative ones:

1555 EDEN Dec. W. Ind. (Arb.) 198 His courage was such and his factes so notorious. 1586 SIDNEY Ps. XX. iii, Lett him [God] notorious make, That in good part he did thy offrings take. 1621 BP. R. MONTAGU Diatribæ 567 Why were not other Examples brought into practice, as notorious as that of Abraham paying Tithes? 1686 W. CLAGETT 17 Serm. (1699) App. 15 These testimonies were too notorious and publick to be gainsaid. 1705 STANHOPE Paraphr. II. 407 That Every one is bound..to..keep within his own Property..is too notorious to need a Proof.

Negative connotations don't come in until sense 4: "Used attributively with designations of persons which imply evil or wickedness: Well known, noted (as being of this kind)." No indication is given that the older, neutral sense (from Latin notus "known", notorium "knowledge", etc.) is obsolete.

For notoriety, the OED's first sense is "the state or condition of being notorious; the fact of being famous or well known, esp. for some reprehensible action, quality, etc." Examples without a negative connotation include

1575 N. HARPSFIELD Treat. Divorce Henry VIII (1878) 37 The notoritie of the manifest and open justice of our cause. 1749 H. FIELDING Tom Jones III. VIII. i. 146 The Credit of the former [historians] is by common Notoriety supported for a long Time.

Enormity. The OED agrees that the use of enormity to mean "bigness" -- its sense 3, which it glosses as "[e]xcess in magnitude; hugeness, vastness" -- is obsolete, and its citations for that sense are all from the late 18th or early 19th century:

1792 Munchhausen's Trav. xxii. 93 A worm of proportionable enormity had bored a hole in the shell. 1802 HOWARD in Phil. Trans. XCII. 204 Notwithstanding the enormity of its bulk. 1830 Fraser's Mag. I. 752 Of the properties of the Peak of Teneriffe accounts are extant which describe its enormity.

But if enormity could mean "enormousness" in 1830, who's to say that we have to hold the line "unto the end of days" against the return of that sense?

Religiosity. The OED's first meaning for religiosity is "1. Religiousness, religious feeling or sentiment", with citations from 1382 to 1887:

1382 WYCLIF Ecclus. i. 17 The drede of the Lord [is] religiosite of kunnyng. Ibid. 18 Religiosite shal kepen, and iustefien the herte. 1483 CAXTON Gold. Leg. 245/1 There is treble generacion spirituel of god, that is to saye, of natyuyte, religyosite, and of body mortalite. 1609 BIBLE (Douay) Ecclus. i. 17, 18. 1813 Edin. Rev. XXII. 222 Their disposition to religious feeling, which they call religiosity, is..a love of divine things for the love of their moral qualities. 1846 J. MARTINEAU Misc. (1852) 188 Our author argues from the religiosity of man to the reality of God. 1887 Z. A. RAGOZIN Chaldea iii. 149 Man has all that animals have, and two things which they have not -- speech and religiosity.

The OED gives no indication that this meaning should be withdrawn in favor of the more specific sense "1.b. Affected or excessive religiousness", for which its earliest citation is 1799. That's because the original, broader sense never died out -- it's easy to find a continuous pattern of uses of religiosity in this sense, from 1382 to the present day.

Posted by Mark Liberman at 02:41 PM

December 26, 2003

Have some word salad with that word soup

I recently visited a heritage village and found myself inside a reconstructed 1870s house browsing old books. I chanced upon an old grammar of English which contained a discussion of a sentence: "She said that that 'that' that that boy used was wrong." Later, a google search turned up thousands more sentences containing long sequences of identical words. Other repetitive sentences, like "police_NOUN police_VERB police_NOUN" and their longer variants "(those) fish (that other) fish (like to) fish, (themselves) fish (other) fish", are sometimes used to test natural language processing (NLP) systems. I'll refer to this genre as word soup, to distinguish it from another interesting category called word salad.

Word salad is the technical term for the result of randomly tossing words into a sentence, e.g. `The a are of I'. As Steve Abney and others have pointed out, it is often possible to come up with a plausible interpretation for such sentences. In this particular case, one has to know that an "are" is 100 square metres (one hundredth of a hectare), and that "a" and "I" are names. So we can interpret `the a are of I' just like: `the "a" section of paddock "I"'. In general, this kind of trick is easy to do, since every word can be used to name itself and is therefore a noun, and because just about any noun can be `verbified' (i.e. we can verb most nouns).

Another category we could call word minestrone, for obvious reasons: `That that that is is that that is not is not that that that is not is not that that is is is not that so.'

Posted by Steven Bird at 03:40 PM

December 25, 2003

Words and other lexical entries

On the question of the number of new English words per year, Language Hat writes:

Liberman rightly (in my opinion) discounts the trademarks, but I think he's too quick to dismiss the scientific terms. As rebarbative as "GDP-L-fucose synthase" may be, I don't see any principled way to distinguish it from the long line of terms that have preceded it, from atmosphere through phlogiston and quark. The OED has from the beginning tried to include scientific terminology, and although it's probably impossible by now to keep up with the details of every specialty, if they're used in the normal course of events by the specialists concerned, they're bona fide English words and deserve to be counted. Whether it's possible to do an accurate count, of course, is another matter altogether.

There's some truth in this, but for the sake of clarity, let me argue the other side for a while.

First, I don't entirely discount the trademarks, any more than dictionary-makers do. The OED's most recent update includes Bluetooth, Nomex, Norplant, Noryl and Swiss Army knife, among other trademarked words, and they were quite right to include these. Margaret Marks lists a small sample from the International Trademark Association's list, and many of her examples are plausible candidates for inclusion, if they're not already there (as Grand Marnier and Grape-Nuts are).

It's just that most of the 100,000 new trademarks registered in the U.S. every year (and I assume in other places as well) are simply names (of businesses, products, etc.) that someone happens to have registered according to a certain legal procedure. This legal registration doesn't privilege them lexicographically over the tens or hundreds of millions of new names created in the Anglosphere every year that aren't trademarked (like Perl, which also made the OED's most recent update, and has not been registered as a trademark). All names are lexical entries, in the sense that they are morphophonological patterns with a conventional (if sometimes very local) meaning, which is not predictable from the meaning of their parts (if any). My brother's childhood imaginary friend was named "Clocktho" (rhymes with "block know"); our current cat is named "Tickle"; I'm co-director of an outfit whose acronym is "IRCS" (often pronounced "irks"); I often eat at the "Class of 1920 commons" (often abbreviated as "1920 commons" or just "1920"). These are all part of my mental lexicon, and I share each of them with some other people as well; but none of them are in any general dictionaries of the English language, nor should they be. The OED's most recent update includes Nipmuc, referring to "several Algonquian-speaking North American Indian peoples formerly inhabiting parts of central Massachusetts and adjacent Connecticut and Rhode Island," who gave their name to several landmarks of my childhood such as the Nipmuc Trail. The difference between Nipmuc (which was long overdue to be included) and Clocktho (which never will be) is not narrowly linguistic but rather historical, sociological, and quantitative.

Second, there is a difference worth noting between scientific terms like quark and those like trimethylamine-N-oxide reductase. The latter is a kind of a phrase, composed according to a certain grammar or at least pattern, which lends itself to the construction of a very large number of additional strings that are not necessarily part of the scientific lexicon. In principle we could have dimethylamine or monobutylamine at the start, etc. The choice among instantiations of these linguistic patterns is then a matter of what chemical configurations are possible and which of them biology uses. Scientists need standard databases for what is known about these facts of chemistry and biology, and also for the associated linguistic choices, such as the acronyms, abbreviations and other nicknames for the chosen entities. The Enzyme Commission provides such a standard. But only a few of the names that it catalogues -- whether the full phrasal names or the nicknames -- belong in a dictionary.
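A toy illustration of the point, in Python: treat the name as a template with two slots and enumerate the fillers. The slot fillers below are my own small selection, and the template is only a caricature of real chemical nomenclature, so most of the strings it produces are well-formed by the pattern without necessarily naming any known enzyme.

```python
from itertools import product

# Hypothetical slot fillers; real nomenclature has far more options
# and far more constraints than this toy template captures.
MULTIPLIERS = ["mono", "di", "tri"]
ALKYL_GROUPS = ["methyl", "ethyl", "propyl", "butyl"]
TEMPLATE = "{multiplier}{alkyl}amine-N-oxide reductase"


def candidate_names():
    """Enumerate every string the toy template licenses."""
    return [
        TEMPLATE.format(multiplier=m, alkyl=a)
        for m, a in product(MULTIPLIERS, ALKYL_GROUPS)
    ]


names = candidate_names()
print(len(names), "strings from one template, for example:")
for name in names[:3]:
    print("  " + name)
```

Two small slots already give a dozen strings; a realistic grammar with more slots and more fillers gives vastly more. Deciding which of them deserve an entry is a matter of chemistry and biology, not of the pattern itself.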

This is not specific to scientific vocabulary -- in fact, it's a lot like the problem of street addresses. Ware College House, where I live, is now officially at 3650 Spruce Street. Three years ago, it was officially at 3700 Spruce Street; and then for a couple of years, it was officially 3615 Hamilton Walk. By "officially" I mean that the address was registered in those changing ways with the U.S. Postal Service (though the buildings have been in the same place since 1902). There are many similar strings -- e.g. "3615 Spruce Street" or "3650 Hamilton Walk" -- that are not valid addresses at all. These facts -- that 3615 Spruce Street isn't a valid address in Philadelphia, but 3650 Spruce Street is, and furthermore that as of 2003 it is the address of Ware College House -- are not facts about the English language, exactly. They're facts about (the U.S. Postal Service's official view of) the way we've decided to use the English language to talk about Philadelphia. You can look such facts up in an appropriate reference, but (except perhaps for a few like 221B Baker St.) the appropriate reference is not a dictionary.

Streets and buildings exist independently of how we choose to address them, but the question of which streets in which cities have which numbering schemes, and which institutions and buildings are officially designated with which street addresses, is to a large extent a question about linguistic convention (I understand that street numbers in some Japanese cities are assigned in the order of building construction!). However, the kind of linguistic convention involved is not one that we usually regard as being part of the responsibility of dictionary makers. The same thing can be said about the question of how to form complex chemical names, how to abbreviate these names or otherwise form shorter and more convenient versions, etc. It's a good thing that we have efforts like the Enzyme Commission to keep track of specific areas of scientific terminology, just as it's a good thing that the U.S. Postal Service keeps track of U.S. street addresses. Both are lexicographical enterprises, in some sense; but ...

My only real conclusion here is that the terms "new", "English" and "word" are too vague in ordinary use for the question "how many new English words are there each year" to have a well-defined answer. And in fact we've only scratched the surface of the kinds of vagueness that would have to be remedied in order to give a meaningful answer :-)...

Posted by Mark Liberman at 03:18 PM

December 24, 2003

'Twas the night before Christmas

Little did I realize, when I wrote a scholarly reflection on the semiotics of clothing and the transitivity of identity, that I was about to turn Language Log into a player in the Holiday Porn Industry.

We may be writing pieces about Turkic vowel harmony, the failures of foreign language instruction, and the nature of the passive voice -- but today, most of our new readers are finding us by searching for "santa sex" (because my little essay is #5 on yahoo for that), "sex with santa" (where it is #1 on yahoo), "sex santa" (where it is #5 on google), and so on.

So welcome, Santa fetishists all! I'm afraid this site is not what you're looking for. But before you vanish back up the chimney, here's a piece of free lexicographical advice, featuring a new word that has not made it into the OED, Encarta or Merriam-Webster. Have you considered searching for "Santa Claus" slash fiction? Of course as soon as Google does another indexing pass, what you'll get is this page you're reading now :-)...

Posted by Mark Liberman at 01:18 PM

Songs of rationalism and empiricism

This weblog plays no epistemological favorites: having quoted Blake's critique of rationalism (from Plate 8 of Visions of the Daughters of Albion), fairness compels us to quote his critique of empiricism, from Plate 6 of the same work:

With what sense is it that the chicken shuns the ravenous hawk?
With what sense does the tame pigeon measure out the expanse?
With what sense does the bee form cells? have not the mouse & frog
Eyes and ears and sense of touch? yet are their habitations.
And their pursuits, as different as their forms and as their joys:
Ask the wild ass why he refuses burdens: and the meek camel
Why he loves man: is it because of eye ear mouth or skin
Or breathing nostrils? No. for these the wolf and tyger have.
Ask the blind worm the secrets of the grave, and why her spires
Love to curl round the bones of death; and ask the rav'nous snake
Where she gets poison: & the wing'd eagle why he loves the sun
And then tell me the thoughts of man, that have been hid of old.

Posted by Mark Liberman at 12:26 PM

Dragging the chain of life in weary lust

I seem to be one of about six entities in cyberspace who have missed the recent story of how philosopher of language Peter Ludlow was bounced from "Sims Online" for exposing its seamier side, and in particular for uncovering the practice of child cyber-prostitution.

In case you're one of the other five, here is Peter Ludlow's blog the Alphaville Herald, the slashdot thread, the Salon.com article, and an interview with Peter Ludlow at gamespot.

As modest amends for being so completely out of the loop, let me quote William Blake's introduction of the character Urizen, from whom I suppose Ludlow must have taken his nom de sim:

Lo, a shadow of horror is risen
In Eternity! Unknown, unprolific!
Self-closd, all-repelling: what Demon
Hath form'd this abominable void
This soul-shudd'ring vacuum?---Some said
"It is Urizen", But unknown, abstracted
Brooding secret, the dark power hid.

Simulated worlds certainly offer a different perspective on Blake's critique of rationalism:

What are his nets & gins & traps. & how does he surround him
With cold floods of abstraction, and with forests of solitude,
To build him castles and high spires. where kings & priests may dwell.
Till she who burns with youth. and knows no fixed lot; is bound
In spells of law to one she loaths: and must she drag the chain
Of life, in weary lust! must chilling murderous thoughts. obscure
The clear heaven of her eternal spring?

Posted by Mark Liberman at 10:53 AM

December 23, 2003

Counting new words: is there a lexicography gap?

Geoff Pullum is absolutely right to observe that Don Watson's notion of 20,000 new English words a year is probably an example of the well-known fact that 57% of all quoted statistics are made up on the spot, while another 34% are an inflated quotation of someone else's extemporaneous fabrication. People do this because it sounds better than saying "quite a few, I don't know how many".

Geoff is also right to observe that an accurate count of how many new English words come up every year is almost impossible to define in any useful way, since the meaning of the terms "word" and "English" in such statements is so vague. Nevertheless, it's easy to come up with some specific numbers that are not completely devoid of interest.

The OED's four most recent quarterly updates (through Dec. 11 2003) added 487 new "out of sequence" entries (leaving aside the much larger number of new-edition words in designated alphabetical ranges, such as the most recent batch Nipkow disc-nuculoid, since these have presumably been in preparation for a longer time). Even so, the great majority of the past year's 487 out-of-sequence additions were words that have been around for a while, but had previously been missed. These are not just stuffy old formal-language words, though -- the list includes backassward, digerati, fuckwit, gang-bang, infoholic, perl, Queer Nation, studmuffin, Thinsulate and Wonderbra. If there are really 20K new words a year, the OED's lexicographers are almost two orders of magnitude short of keeping up -- they'd be falling behind by more than 1.9 million words per century, the poor saps. But perhaps we should give them credit for all the new-edition entries -- adding 545 of the Nipkow disc-nuculoid batch in the last quarter alone, plus about a hundred new sub-entries in the same range. Along with relative newcomers like Nomex and nitrox, this would include definitely older words such as non-abelian and nonadditive; but it's all arguably part of the same lexical ledger, so let's give full credit for all the additions. If we do that, then I guess that the OED is adding about 2500-3000 new items per year -- and only falling behind Watson's estimate by some 17,000 per year, or 1.7M words per century :-).
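The back-of-the-envelope arithmetic can be checked with a few lines of Python. The figures are the ones quoted above; the one assumption I've added, flagged in the comments, is that the last quarter's new-edition batch is typical of every quarter.

```python
WATSON_PER_YEAR = 20_000        # Don Watson's claimed new words per year
OUT_OF_SEQUENCE_PER_YEAR = 487  # OED out-of-sequence additions, last four quarters

# New-edition entries in the last quarter, plus "about a hundred" sub-entries.
NEW_EDITION_PER_QUARTER = 545 + 100


def gap_per_century(added_per_year, claimed_per_year=WATSON_PER_YEAR):
    """Shortfall over 100 years if the claimed rate were real."""
    return (claimed_per_year - added_per_year) * 100


# Stingy accounting: out-of-sequence additions only.
narrow_gap = gap_per_century(OUT_OF_SEQUENCE_PER_YEAR)

# Generous accounting. Assumption: every quarter looks like the last one.
generous_total = OUT_OF_SEQUENCE_PER_YEAR + 4 * NEW_EDITION_PER_QUARTER
generous_gap = gap_per_century(generous_total)

print(f"narrow:   behind by {narrow_gap:,} per century")
print(f"generous: about {generous_total:,} added per year, "
      f"behind by {generous_gap:,} per century")
```

On the stingy accounting the shortfall comes to a bit over 1.9 million per century; on the generous one, the yearly total lands near the top of the 2500-3000 range and the shortfall near 1.7 million.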

The 2001 edition of Microsoft's Encarta Dictionary advertises "over 5000 new words", presumably relative to the 1999 edition. This would be 2500 new words per year; but there is no reason to think that these are all novel words, as opposed to older words that the editors decided on reflection to include. If they were indeed all new, and if there really were 20,000 new words a year to keep track of, Encarta would be falling behind roughly at the same rate as the OED :-).

I'm sure that there are lexicographers out there who can give a more exact account of the number of apparently novel English coinages or borrowings they observe per year, independent of the number that they decide to include in their published dictionaries. I'll be somewhat surprised if those estimates are higher than 5,000 words a year, if they are that high; and I'll be very surprised if there really is a "lexicography gap", in the sense that the profession is falling behind by millions of bona fide words per century.

On the other hand... At the other end of several scales, the USPTO's TESS "contains more than 3 million pending, registered and dead federal trademarks" (as of 10 November 2003), whereas when it was started on Feb. 14, 2000, "TESS [allowed] the public to search ... the 2.6 million plus pending, registered, abandoned, cancelled or expired trademark records found in PTO’s X-Search system." This is about 400,000 added in 3.75 years = >100,000 added per year.
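
A rough sketch of that rate calculation follows. It treats the quoted "2.6 million plus" and "more than 3 million" as exact counts, which is an assumption, since both are really lower bounds, so the result is only indicative.

```python
from datetime import date

START = date(2000, 2, 14)    # TESS launch date quoted above
END = date(2003, 11, 10)     # date of the "more than 3 million" statement

RECORDS_START = 2_600_000    # "2.6 million plus" (lower bound, assumed exact)
RECORDS_END = 3_000_000      # "more than 3 million" (lower bound, assumed exact)

years = (END - START).days / 365.25
added = RECORDS_END - RECORDS_START
rate = added / years

print(f"{years:.2f} years, roughly {rate:,.0f} records added per year")
```

Under those assumptions the interval is just under 3.75 years and the rate comes out a little over 100,000 records a year, in line with the estimate in the text.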

A lot of these are things like FUSION WAKEBOARD TOWERS AND ACCESSORIES or ROCK WAX SLAM'N HAIR WAX THAT ROCKS -- but wakeboard tower really is a word, and so is hair wax. The three-letter acronym IED has been trademarked 13 times, and none have anything to do with Improvised Explosive Devices :-). So there might well be more than 20,000 new company and product names invented every year, not to speak of semi-compositional complex nominals like "wakeboard tower," but I suspect that this is not what Don Watson was talking about.

In various areas of science and technology, there are many new terms of art added every year, and in some of these areas, some more or less official group keeps track. The Enzyme Commission's Enzyme Nomenclature Supplement 9 for 2003 includes around 200 new items, each of which may involve several new "words" (if we take the registered terms to be "words") -- thus EC 1.1.1.271 is

Common name: GDP-L-fucose synthase.
Other name(s): GDP-4-keto-6-deoxy-D-mannose-3,5-epimerase-4-reductase
Systematic name: GDP-L-fucose:NADP+ 4-oxidoreductase (3,5-epimerizing).
The cross-listed NiceZyme entry gives another "alternative name" GDP-fucose synthetase.

If each of these variants is a different word, and if this entry is typical, then there might have been 800 or more new enzyme names registered officially in 2003. From my recent experience in biomedical information extraction, I can say that many "names" of enzymes (and genes and structural proteins and ...) are used without being officially registered. These are names, not words in the general sense, though the shorter variant names of a few of them might come into general use from time to time (like caspase-9 or topoisomerase 1).

If we looked across all the different sub-areas of science and technology, there would probably be many more than 20,000 (durable and generally-recognized) new names coined every year -- new genes, new species, new stars, new algorithms, whatever -- but I don't think that's what Don Watson had in mind either.

I also admit that there's lots of stuff going on under the lexicographical radar of all these monitors. Neither the OED nor Encarta nor the USPTO nor the Enzyme Commission has glemphy, craptacular, or Falluja. My personal guess is that craptacular (with 16,500 Google hits) will make it into the dictionaries before long, and that the other two won't (because glemphy won't ever be generally used, while Falluja will fall back into the category of foreign-language place names that are not really part of the general English vocabulary, even though they once might have been, like Qui Nhon and Echternach); but I'm skeptical that the list of also-rans as plausible as these is anywhere near as big as 20,000 a year.

Without spinning out the obscurities any further, it's clear that there are meanings of "new", "word" and "English" under which you could argue that there are 20,000 new English words per year, or even more -- but these meanings are pretty loose and even unreasonable ones. A more plausible guess, closer to the core interpretation of the terms by working lexicographers, seems to be in the range of the two or three thousand items that the OED and the creators of Encarta seem to be adding (though I look forward to hearing other numbers from people in a better position to know).

Like Geoff Pullum, I haven't read what Don Watson has to say about the globo-downfallization of language, because Watson's book is not available here. Maybe some reader down under can take a look? If Watson shows any evidence of having thought at all about what it means to say that "there are X new English words every year", rather than just blurting out some implausibly large estimate because he didn't want to say "a whole bunch", I'll buy a round of drinks at the LSA for anyone who cites his evidence or his arguments.

[By the way, we can't answer the question just by looking at the growth over time of the list of lexical tokens in some very large electronic corpus, because after a while, most of the new tokens are typos or mis-spellings. In addition, this method doesn't find new words that happen to be written with internal white space. One can imagine a variety of ways to deal with both of these problems, and people have tried some of them, but that's another story, or at least another post :-)].
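
To make the two problems concrete, here is a toy sketch of the naive method: split a text on white space and count how many previously unseen tokens turn up in each successive chunk. The sample text and the function name `new_types_per_chunk` are invented for illustration; a real corpus study would need far more careful tokenization.

```python
def new_types_per_chunk(tokens, chunk_size):
    """Count previously unseen token types in each successive chunk."""
    seen = set()
    counts = []
    for start in range(0, len(tokens), chunk_size):
        chunk = tokens[start:start + chunk_size]
        fresh = {t for t in chunk if t not in seen}
        counts.append(len(fresh))
        seen |= fresh
    return counts

# Toy text containing a typo ("teh") and a spaced compound ("hair wax")
text = "the cat sat on the mat the cat saw teh mat and a hair wax"
tokens = text.lower().split()

print(new_types_per_chunk(tokens, 5))  # [4, 3, 4]
```

The typo "teh" is counted as a brand-new type, while "hair wax" shows up only as two separate tokens, so this kind of count both over-reports and under-reports new words.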

[While we're on the subject, I need to very gently correct Geoff's statement that "the 5 exabyte mistake about word tokens uttered in human history [is] much repeated but known to be completely false." It's not completely false, it's just off by a factor of 8 thousand or so :-).]

[A few other relevant sites:
www.wordspy.com (adds one new word a day)
The most commonly misspelled words on the web -- 2.86M cited for transexual, Google now says 4.47M...
The Dictionary Forum ]

Posted by Mark Liberman at 05:31 PM

Twenty thousand new words a year

I don't know whether the book by Don Watson that Mark Liberman recently mentioned contains anything at all to justify its hysterical claims that the English language "is being mangled by the globalising forces of obfuscation." But if it does, it is puzzling that nothing that could begin to justify such claims is quoted or mentioned in the article about it in Melbourne's newspaper The Age. We get a couple of noun phrases with hyphenated compound prenominal attributive modifiers like outcome-related, real-world, and whole-of-organisation, and that's just about it. The rest is all frothing and flaming about the noble English language being done to death, desecrated, doomed. The article makes it look like a more ridiculous and extreme demise-of-the-language polemic than any I've ever seen.

Only one thing caused a flicker of interest for me: an actual figure is given for the likely number of new words added to English in the course of a year. The figure cited is 20,000. I'm wondering what the source was.

Don Watson's book (which I have not seen) may not tell us. The article about him says he has no wish to keep the language static: "The genius of English is the way it updates itself every day, with 20,000 new words a year, Watson read somewhere." He read it somewhere? Thanks a lot, Don; that narrows it down a bit.

Watson, of course, like just about every non-linguist who ever writes about language, presupposes that a language is just a big bag of words. Barbara Scholz and I have attacked that idea (in Nature 413, 27 September 2001, p.367), but it's not that we think anyone will listen or anything will change. Everybody thinks that the key thing about a language is which words it has -- and above all, how many. Now, Scholz and I think that the answer is that it's inherently and profoundly indeterminate, for a very deep reason: we think natural languages do not have closed lexicons at all. (This is an idea due to Paul Postal; there is a discussion of it in Chapter 14 of Arc Pair Grammar by Paul Postal and David Johnson, Princeton University Press, 1980.) Natural languages are much better thought of as systems of conditions on the structure of expressions (words, phrases, sentences). Some of the well-established conditions apply to word-sized units (it really is well established that dog denotes Canis domesticus, and that the is the only acceptable form for the definite article), but the constraints do not entail a roof on the number of words or prescribe which ones are genuinely in the language.

This makes neologisms (brand-new coinages of words) important: while closed-lexicon models of language would suggest that sentences containing new words are not part of the language and cannot possibly be understood, so the introduction of new words should be a rare and tricky business, Scholz and I (like Postal) are saying that there is absolutely nothing linguistically wrong with sentences containing novel words, and that sort of suggests they would occur often, perhaps every day, all the time. And that seems right: if you really take note of everything linguistic that happens to you today, the chances that you will not come across a word you hadn't ever seen or heard before are very low, and you may even encounter a word that no one had ever used before (though that's harder to check).

But even Scholz and I did not think the evidence of lexical openness would be as bountiful as 20,000 words a year. That's really a lot. It's 55 a day. That means two or three new words becoming established every hour, day and night. It could be true. We'd sort of like to know whether it is, and if so, what definition of word is being used. (To make the question interesting, you need to make sure you don't count words in a silly way. For example, since we talk about RS232 ports and Intel 80486 chips and the year 2004 and the Boeing 767 and so on, you could count all digit strings as words, which immediately tells you there must be a countable infinity of them. But that can't be what we mean if we're talking about adding 20,000 new words each year.)
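
The per-day and per-hour figures are easy to check. This is only the arithmetic on the claimed 20,000 figure, assuming a 365-day year and coinage spread evenly around the clock:

```python
WORDS_PER_YEAR = 20_000  # the figure attributed to Watson

per_day = WORDS_PER_YEAR / 365
per_hour = per_day / 24

print(round(per_day), round(per_hour, 1))  # 55 2.3
```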

Just about all we know right now is that Don Watson read it somewhere. Give us a source, Don. I mean something checkable. The closest I've got is that I've seen the 20K words claim attributed to the New York Times in a PowerPoint presentation from the University of Kentucky's journalism school that I found on the web, but I'm looking for something more specific than just the name of a newspaper. Because of course the claim could be just another urban legend, like the 5 exabyte mistake about word tokens uttered in human history, much repeated but known to be completely false.

Posted by Geoffrey K. Pullum at 02:20 PM

Amazoodling

I'd seen it, but I didn't know what to call it: amazoodling. For instance, the (now 833) comments on The Best of David Hasselhoff.

Posted by Mark Liberman at 08:29 AM

December 22, 2003

Anti-Effle

John McWhorter recently posted here about the kind of expressions that are all too often missing from foreign language courses. Linda Seebach sent email about a different kind of anti-effle:

I saw your post about effle on language log and would like to comment on a dual phenomenon; things that someone really did say spontaneously in a real conversation that are the exact opposite of the formulaic conversations on the Test of English as a Foreign Language. The name dates from the sabbatical year we spent in Shanghai, where one of my classes was TOEFL.

I'm not sure that TOEFL itself deserves the rap -- the language of these practice questions seems close enough to English as she is spoke these days, at least in the somewhat formal register that seems appropriate for such a test. I bow to Linda's first-hand knowledge of TOEFL-preparation classes in Shanghai, but I'll substitute anti-effle as a general term.

Here are her examples:

What do seals have to do with beer?
(My son, on a Yangtze River boat, overhearing his parents wondering aloud why the local brew we took on at the last stop had a seal logo -- we were almost 2,000 miles from the ocean.)

Did I tell you about the resurrection of my poinsettia?
(Me to my son -- I'd thought for weeks it was dead but when I went to throw it out it had new leaves)

It's so funny to see a cat umbrella in the garden.
(one of our editors to his daughter, whose umbrella had a cat on it)

Thumbs are pretty smart.
(Our editor, explaining why his Blackberry was easy to use)

We're all standing around looking at Pilar's mouse.
(Pilar is our editorial assistant, and she had just spotted a mouse, mammalian not computer, under the printer. This drew a crowd.)

How do you weaponize the mosquito?
(Me to Alan Leshman of AAAS, who visited here last week for the AAAS convention; he'd just said worry about publishing information about the genetics of the malaria parasite was overwrought)

They're taking the guinea pig to the beauty parlor because she's depressed about the other one.
(My editor; one of his daughter's guinea pigs had died and his wife was trying to cheer her up)

I am practicing triage on my cheese.
(me to my son; I had a package of cheese cubes that was getting moldy, and the question was whether the water I used to wash the cheese would have been better used for something else)

These findings predict a level of gymnastic ability that is rare among proteins.
(I saw that in Nature, but I seem not to have saved the date.)

A surprising amount of real language has similarly high (but not too high!) entropy. This is the secret behind googlism: if we ask it about "Geoff", we find (with some editorial selection) that Geoff is "a life member of the australian black and white artists club", "one of Canada’s most energetic and inspirational fitness leaders", "the director of IT at Walford Anglican School for Girls", "the guy who introduced me to the 5 and 6 string bass in 1983", "grateful for the help that the nightclub has provided as a sponsor and its general support of his strongman pursuits", "an expert on demand chain internet solutions", "indifferent between outcome B and a lottery in which outcomes A and C are possible", and "my hero and I got one of his decks", among many other things.

I won't say that you can't make that stuff up, but it's not easy for art to imitate the texture of real-life text.

Geoff Pullum is absolutely correct that "it's not insincerity in example sentences that is the great enemy of language teaching, but boring the crap out of learners." However, effle is not always boring, and therefore I think we have to admit that effle may sometimes be effective pedagogy, if only by virtue of the mnemonic value of strong emotion. Many of the examples on this page, said to be from Langenscheidts Konversationsbuch English-Deutsch, would provoke notable hilarity in any high-school language class (via Desbladet):

Hier sind Schlüpfer in verschiedenen Farben. Sind sie haltbar?
"Here are panties in different shades. Are they durable?"

This is certainly effle, but even the slowest student in German class that day would emerge with the vocabulary items Schlüpfer and haltbar etched indelibly into his hippocampus.

In the same vein, let me cite some memorable effle from an Armenian phrase book contributed by Margaret Marks, who started the whole thing:

Let me eat the fish and throw the mouse, let me throw the mouse and eat the fish.

I'll never eat khorovatz again without remembering that example -- I wish I knew the Armenian.

[Update: Linda Seebach writes on this topic at the Rocky Mountain News.]

Posted by Mark Liberman at 10:15 AM

A tale of two Dons

Down in Australia, The Age reviews Don Watson's Death Sentence, The Decay of Public Language, which "charts how 'managerial language' has infiltrated the English of politics, business, bureaucracy, education and the arts. The book is about the rise of core strategies and key performance indicators, and the death of clarity and irony and funny old things called verbs. It is about a new language that Watson calls sludge and clag and gruel."

Well, O.K. then (or should I say, "good on him"?). But I have to wonder whether Don Rumsfeld is in the index, and if so, what he's cited for. Rumsfeld's language makes an impression on people. For people who don't agree with his politics and don't sympathize with high-ranking bureaucrats, the reaction is often negative (as in the case of the egregiously undeserved "Foot in Mouth" award) or sarcastic (as in the case of Slate's piece on "The Poetry of D.H. Rumsfeld"). Rumsfeld's transgression seems to be that he tries to discuss complicated things, in a difficult political context, in language that is plain, simple and as clear as circumstances permit. Some of the people who have trouble with his politics have double trouble with his language. They react as if he came to a news conference wearing a dress.

Watson was a speech-writer for Paul Keating, who rates a website just for his recorded insults. These seem to be cases where he chose to talk in public the way people normally talk in private:

"Now listen mate," [to John Browne, Minister of Sport, who was proposing a 110 per cent tax deduction for contributions to a Sports Foundation] "you're not getting 110 per cent. You can forget it. This is a fucking Boulevard Hotel special, this is. The trouble is we are dealing with a sports junkie here [gesturing towards Bob Hawke]. I go out for a piss and they pull this one on me. Well that's the last time I leave you two alone. From now on, I'm sticking to you two like shit to a blanket."

You can get plenty of this stuff from public figures if you want it -- Molly Ivins has been filling columns for years with colorful quotes from Texas politicians. Rumsfeld's public language is different. It lacks the profanity and the colorful colloquialisms and even most of the informal discourse markers. He just explains complicated things carefully in plain words. Apparently that weirds people out.

[via A.L.D]

[Update 12/23/2003: the word clag, used in the review quoted above, was new to me. The OED says that it's "north. dial." for "[t]he process or product of clagging; a sticky mass adhering to feet or clothes, entangled in hair, or the like; a clot of wool consolidated with dirt about the hinder parts of a sheep, etc.". Unfortunately Watson's book is apparently not (yet?) available in the U.S., nor does amazon.co.uk find it, though it's published by Knopf and available in Australia.]

Posted by Mark Liberman at 06:57 AM

December 21, 2003

Another Humboldt heard from

A reader from Japan suggested that Humboldt University in California, which is named after Humboldt County where it's located, is another university named for the linguist Wilhelm von Humboldt (even if indirectly).

Close, but not quite -- Humboldt County is named for the explorer Alexander von Humboldt, Wilhelm's kid brother.

So this suggests another academic trivia question -- are there any other cases where each of two siblings has a university named after him or her (even through indirect geographical attribution)?

Posted by Mark Liberman at 07:50 PM

Harmonic rhythm in Turkic

In his post on Dating Indo-European, Bill Poser observed that "lexical replacement is a small part of language change". This naturally raises the question of whether it is possible to use other aspects of language change as "clocks" for assigning dates to (unobserved but reconstructed) stages of language history.

The short answer is "no, at least not yet; and maybe never..." However, there is certainly some interesting recent work on the dynamics of other kinds of language change besides lexical replacement. These include changes in the overall sound system (phonology), in the patterns of word formation (morphology) and in the ways of putting words together into phrases (syntax). None of this work (as far as I know) has the explicit goal of defining a clock that could be applied to dating in linguistic reconstruction. Instead, the goals are simply to understand large-scale long-term historical change better -- a topic that has interested researchers for hundreds of years -- and also to see how well different theories about language structure and use stand up in a historical context.

"Language change has at various times been seen as linear -- that is, languages are progressing or decaying monotonically -- or cyclical -- that is, languages pass through a life cycle of birth, maturity, death and rebirth. However, modeling language change in a formal way has led to a recognition that it is a complex dynamical system (Lass 1997): the interaction of individual speakers leads to emergent, global population characteristics of a language that are neither linear nor cyclical."

That is a quote from a recent paper (Emergent Behavior in Phonological Pattern Change. Artificial Life VIII. MIT Press, 2003) by Mark Dras, David Harrison and Berk Kapicioglu, who have surveyed the history of (loss of) vowel harmony in various Turkic languages over the course of the past millennium or so, and have also explored ways to model the development and loss of vowel harmony as emergent properties of interactions in an artificial speech community. An earlier paper by the same authors gives more details: Agent-based modeling of vowel harmony, Proceedings of NELS 32 (2002).

Reading those papers, and other recent works on both empirical and theoretical aspects of the dynamics of long-term language change, I come to two provisionally negative conclusions about the prospects for finding reliable "clocks" in such changes. First, there are not nearly enough data points for us to be able to say much about the distribution of rates of change for various kinds of linguistic phenomena. Second, language change may turn out to be like climate change (and many other non-linear dynamic systems), in that trends can operate on a remarkably wide range of time scales.

The first problem will be solved as more research is done. The same new research will help determine whether the second problem is a show-stopper or not.

Let me stress that Harrison, Dras and Kapicioglu are not looking for a clock, and I am not criticizing them for not finding one! Rather, I'm asking whether work of this kind might lead to temporal estimates for processes such as harmony loss, in cases where the existence of the process can be inferred (e.g. from a set of related contemporary languages with partial harmony, and a confidently reconstructed ancestor with fuller harmony).

Posted by Mark Liberman at 12:09 PM

December 20, 2003

"a select company of curious men"

The Economist on the internet/coffeehouse analogy:

The coffee-houses that sprang up across Europe, starting around 1650, functioned as information exchanges for writers, politicians, businessmen and scientists. Like today's websites, weblogs and discussion boards, coffee-houses were lively and often unreliable sources of information that typically specialised in a particular topic or political viewpoint.

The analogy is an idée reçue by now, but the article is full of interesting history.

For a gloomier perspective on aspects of 18th-century coffee-house society, see Samuel Johnson's Rambler 177.

[From Vivaculus:] ... I hasted to London, and entreated one of my academical acquaintances to introduce me into some of the little societies of literature which are formed in taverns and coffee-houses. He was pleased with an opportunity of shewing me to his friends, and soon obtained me admission among a select company of curious men, who met once a week to exhilarate their studies, and compare their acquisitions. ... Every one of these virtuosos looked on all his associates as wretches of depraved taste and narrow notions. Their conversation was, therefore, fretful and waspish, their behaviour brutal, their merriment bluntly sarcastick, and their seriousness gloomy and suspicious. ...

[Response:] It is natural to feel grief or indignation when any thing necessary or useful is wantonly wasted, or negligently destroyed; and therefore my correspondent cannot be blamed for looking with uneasiness on the waste of life. Leisure and curiosity might soon make great advances in useful knowledge, were they not diverted by minute emulation and laborious trifles. It may, however, somewhat mollify his anger to reflect, that perhaps none of the assembly which he describes, was capable of any nobler employment, and that he who does his best, however little, is always to be distinguished from him who does nothing. Whatever busies the mind without corrupting it, has at least this use, that it rescues the day from idleness, and he that is never idle will not often be vicious.

Read the whole thing.

Posted by Mark Liberman at 11:36 AM

The mother of all universities named after a linguist

On the topic of universities named after linguists, Elihu M. Gerson writes that "Humboldt University in Berlin is named after Wilhelm Humboldt".

He's right:

"Founded in 1810 according to the concept of Wilhelm von Humboldt, the Humboldt-Universität zu Berlin was the 'Mother of all modern universities'".

As the wikipedia entry for Wilhelm von Humboldt explains, he wore many hats, including Prussian minister of education. Given the connotations of the word "Prussian", some may be surprised to learn the strength of von Humboldt's libertarian beliefs. Here are a few translated quotes from his 1791 work "The Limits of State Action":

"The very variety arising from the union of numbers of individuals is the highest good which social life can confer, and this variety is undoubtedly lost in proportion to the degree of State interference."

"If it were possible to make an accurate calculation of the evils which police regulations occasion, and of those which they prevent, the number of the former would, in all cases, exceed that of the latter."

'[W]hatever labour "does not spring from a man's free choice, or is only the result of instruction and guidance, does not enter into his very nature; he does not perform it with truly human energies, but merely with mechanical exactness"; when the labourer works under external control, "we may admire what he does, but we despise what he is."' [From an amazon.com review]

It's a bit surprising that neither the original German nor any English translation of this work appears to be available in digital form.

With respect to von Humboldt's ideas about language, here is a selection of translated passages from the von Humboldt chapter of Lehmann's Reader in Nineteenth Century Historical Indo-European Linguistics. Let me add another one, from von Humboldt's posthumous essay On Language, which I've used as the basis for an exam question:

"The articulated sound, the foundation and essence of all speech, is extorted by man from his physical organs through an impulse of his soul; and the animal would be able to do likewise, if it were animated by the same urge."

I must confess that my most vivid personal association with von Humboldt is a feeling of dread. Shortly before I finished graduate school, a certain punctilious German language instructor at my PhD institution failed a native speaker of German on her German language exam, by requiring her to translate a passage from Über die Verschiedenheit des menschlichen Sprachbaues. Legend had it that he was a Humboldt fanatic, who cared deeply and personally about every nuance of von Humboldt's famously dense prose. I spent a discouraging couple of days trying to persuade myself that I might do better, after which I asked to take my second language exam in Latin rather than German. Morris Halle agreed, which is one of the smaller reasons for which I'm indebted to him.

[Note: the wikipedia entry cited above says that "Humboldt is credited with being the first linguist to identify human language as a rule-governed system, rather than just a collection of words and phrases paired with meanings." This is (almost) a self-proving sentence, since it performatively assigns such credit to Humboldt at least by implication; but the credit is surely not due. Panini antedated Humboldt by more than two millennia, and many others had the same idea in the intervening time].

Posted by Mark Liberman at 07:42 AM

December 19, 2003

Wars of Words

Geoff Pullum's wish that all battles could be fought with nothing more cruel than mocking phrases and cutting epithets is regrettably not likely to be fulfilled, but there are some interesting precedents. In the Byzantine Empire, the ultimate power was vested in the Emperor, but there were nonetheless political parties, which exerted their power through popular opinion, bribery and, sometimes, violence. These parties were known by colors, the major ones being the Blues and the Greens. Since there were no elections, there was no campaigning of the sort with which we are familiar. Instead, the parties put forward their positions in the stands of the Hippodrome. Representatives of the parties would debate each other across the stadium. This debate was conducted entirely in spontaneously composed verse!

Posted by Bill Poser at 11:30 PM

Universities named after linguists

Mark Liberman recently remarked, having only just learned that Barcelona's Universitat Pompeu Fabra was named after a great Catalan linguist: "I can't think of another major university named after a linguist, but perhaps someone will inform or remind me."

But of course. There are quite a few. It's surprising that Mark, who seems to know so much about so many subjects, was not aware of them.

[Note added much later: Access records reveal that at least some people have come to this page looking for real information. This is terrible, because the whole of the rest of this page was written just to give Mark a giggle. Absolutely nothing below that pertains to university naming is true, despite the mentions of three real people (Partee, Ball, and Emonds). Sorry! Me and my deadpan humor... --GKP]

To begin with the most celebrated instance, Harvard University was named after Sir Walter Montmorency Belgrave Harvard, who in 1689-1691 traveled by donkey through much of what is now western Massachusetts and parts of upper New York State, recording food terms in the languages of the local Indians. (He died after failing to take note of a critical phonemic distinction: q'opuhi 'asparagus' vs. q^hopuhi 'species of deadly poisonous asparagus-shaped fungus'. So let's practice distinguishing ejective from aspirated stops, okay class?)

In the west, the University of California, Santa Barbara, was named after the great semanticist and genuinely wonderful person Barbara Hall Partee, a UCLA faculty member at the time UCSB was founded (later a distinguished member of the faculty at the University of Massachusetts at Amherst). Barbara was not officially canonized at that time, but the status was foreseen, and indeed, she is reported to be definitely on the current pope's next list, and quite rightly so.

There are many other examples. I am surprised that Mark didn't think of them straight away. Ball State University in Muncie was named for the computational linguist Cathy Ball, in honor of her service as program director for linguistics at the National Science Foundation; Emory University in Atlanta was named after the fine syntactician Joseph E. Emonds (this one shows how careful one should be about arranging such matters through handwritten notes -- but at least it didn't come out as Edmonds University, with that added "d" that is the common error with Joe's name); and Rutgers University was named after Henry "Reefer" Rutger, a little-known generative semanticist who published virtually nothing, but who once, in the early 1970s, remarked sarcastically at a Christmas party in front of two very young and impressionable future linguists that "What with all these deep structure constraints and surface structure constraints and stuff, pretty soon people are going to have to assign some kind of rank ordering to them all just so we'll know which overrides which!" The rest is history.

Posted by Geoffrey K. Pullum at 08:43 PM

Let the witty put-downs begin

In news from the Middle East, it is rare that we ever see events of violence and death replaced, even for just a few minutes, by outbreaks of spontaneous wit and harmless linguistic repartee. But news services reported yesterday that in Tikrit about 700 people had rallied to protest the arrest of Saddam Hussein. They chanted, "Saddam is in our hearts, Saddam is in our blood!", and back came an answering chant, rising from the soldiers and police that they were chanting at: "Saddam is in our jail, Saddam is in our jail!"

If only all battles could be fought with nothing more cruel than mocking phrases and cutting epithets. According to Alex Ross in this week's New Yorker, Oscar Wilde once made a prescient prediction about how wars of the future would be fought: "A chemist on each side will approach the frontier with a bottle." If only it could be a comedian from each side approaching the frontier armed only with stingingly witty epigrams and snappy verbal put-downs. Weapons of mass detraction.

Posted by Geoffrey K. Pullum at 08:06 PM

Effleville: Why is "I have a cold" an "idiom"?

The posts on Effle bring me back to the boredom I suffered through in foreign language classes in my school days. From an early age I was frustrated as my fascination with other languages was doused by drills in sentences one would never use. Who cares whether "my uncle is a lawyer but my aunt has a spoon"?

It is too little acknowledged in language pedagogy that really, knowing the words for KNIFE and DRESS is about as useful as knowing the words for LIVER or AMBIGUOUS -- one can pick up words for silverware as time goes by, but what about EVEN as in "I even had a purple one" or SMELLS LIKE?

It always ticked me off that after God knows how many years of French classes I had no idea how to say THAT TASTES LIKE CHICKEN, GET YOUR FEET OFF OF THERE, or STICK OUT YOUR TONGUE. Often we are told that these things are "idioms," but they actually simply require learning usages of certain nouns or verbs that do not line up with how they are used in English. No one considers it a distraction to teach students similar cases like JE M'APPELLE for "My name is" in French or ME GUSTA for "I like" in Spanish. But for my money, drills should be constructed that put students through similarly everyday necessities like expressing PICK THAT UP, PUT THAT DOWN, GO ALL THE WAY TO THE END, COME DOWN FROM THERE, STICK IT IN THERE and HE LEFT RIGHT IN THE MIDDLE OF THE MEETING (ask someone who claims to "know" a language how to say these things and you easily separate the men from the boys!).

The occasional teaching book that takes a chance and gives learners real sentences heartens me that language pedagogy can improve here. Lewis Glinert's marvelous MODERN HEBREW kicks right off with sentences like HEY, BENNY, IS THAT YOU? IT'S ME AGAIN and COME OVER TO THIS LINE, IT'S MOVING. Wonderful! The little-known ASSIMIL book series is also good with this -- spend the half-hour a day they recommend and you actually come out able to lope along with shambling effectiveness in real conversations, because they try their best to give students words and constructions they will need (but beware the Arabic books, which are a sad exception -- back to the aunts and spoons).

I have always thought that I might make a late-life career out of fashioning some language teaching books that took this approach. For my money, language teaching should focus on 1) vocabulary 2) grammar and just as much, 3) making sure students come out knowing how to render basic concepts like MIGHT AS WELL, SUPPOSE WE... and YOU'LL GET OVER IT. Years ago I made out a long list of hundreds of sentences full of things like this by taking them down while watching TV over a year's time, and I find that once one has mastered these 500 or so sentences in a language, one is in a place far beyond what any class or textbook bothers to provide.

I might add that the example par excellence of "Effle-plus" is the famous The New Guide of the Conversation in Portuguese and English by one Pedro Carolino, written in 1869 but most often encountered in an 1883 edition with a marvelous introduction by Mark Twain, generally reprinted in abridged form as English As She Is Spoke. Carolino apparently just rendered the French in a French-English guide word-for-word into "English," which he clearly neither spoke nor even read. (First year language classes should teach students how to render the JUST that I used in the last sentence!)

Break this book out at parties and watch guests laugh till tears roll down their cheeks, as Carolino regales us with one "English" sentence after another, often in dandy "dialogues." One of my favorites is the immortal "The Fishing," in which the protagonist exclaims "That pond it seems me many multiplied of fishes. Let us amuse rather to the fishing." Elsewhere we get "idiotisms" (IDIOTISME is "idiom" in French) such as "The stone as roll not heap up not foam" (I'll leave you to guess what this is supposed to be) and Carolino's so authentically English version of "to cast pearls before swine," "to make paps for the cats." I was reminded of this last one searching for a web hit to provide on the delightfully tantalizing Carolino book:

"Silence! There is a superb perch!"

"You mistake you, it is a frog! Dip again it in the water."

Posted by John McWhorter at 07:15 PM

The dread hand of Effle... and boredom

It's a useful concept, Effle -- meaningless English from English as a Foreign Language (EFL) textbooks, with their awkward and stupid-sounding English example sentences made up without regard for whether anyone would ever want to say anything like that. And Norma Mendoza-Denton's story (a native non-standard English speaker from a Spanish-speaking home forced to sit in class all day listening to This is a pencil and A man has a dog, arranging to get himself thrown out and sent to the principal's office just for relief) is heartbreaking. But I'm not at all sure I can agree with everything on the Effle page.

On a minor point, applied linguist Pit Corder is cited there as the source of The farmer kills the duckling, but in fact I think Edward Sapir, the great anthropological linguist, was responsible for that dimwitted piece of example construction (uncharacteristic for Sapir).

But more centrally, surely we cannot accept the maxim, "As soon as we encourage or force learners to say what they do not mean, we break links that will be all the harder to mend afterwards." It's fine to have learners say things they don't mean; that's what playwriting and acting in skits is all about, and it can be tons of fun. Heaven forfend that we should require students to learn entirely by speaking the truth about their feelings. The objection to making students say things like Her food is eaten by her (said to have been actually found in an exercise on the passive) is not that they don't mean it; it's that no one talks that way, and no one would want to say that anyway.

My guess would be that it's not insincerity in example sentences that is the great enemy of language teaching, but boring the crap out of learners. Once a learner loses the feeling that being able to say things in this language is going to be really neat, and that the learning process can be a lot of fun, the language teacher has lost the battle. She might as well just pack up and go home to wherever her food is eaten by her.

Posted by Geoffrey K. Pullum at 01:00 PM

Franks, French, Freedom

Trevor has an enlightening piece here on the early history of Arabic words derived from Frank. Read the whole thing -- and the rest of his weblog too, it's a treat.

In passing, he makes an important point:

"ATILF (Analyse et Traitement Informatique de la Langue Française) has accumulated and created what look like some wonderful online French language resources. Unfortunately most of these are only available to other institutions, not citizens; this excludes me, since the only institution I am likely to enter in the next few decades will probably not have internet access. I just don't get it: does the Republic want to promote French to the world or not? Go on, let me in, even if it's just for Christmas!"

Worldwide, there are many enthusiastic and accomplished people like Trevor. The history of scholarship, linguistic and otherwise, is full of examples of important contributions by individuals without academic affiliation. The decipherment of Linear B by Michael Ventris is probably the best known recent case, but there are many others. I've written here earlier about the linguistic investigations of Thomas Jefferson, and there is a lot to say about the contributions of Hermann Grassmann, among many others who were amateurs in the best sense of the word.

It's now a commonplace observation that the internet allows people, in the academy or out of it, to find others with similar interests and to communicate within social networks that work in the way that such networks always have, on a human scale. This is part of the dynamics of free software, for example. But Trevor's problem with ATILF and similar resources exemplifies a real (and I think deepening) "digital divide". People without the right institutional connections are excluded from lots of cyberspace, and especially from key parts of the distributed digital library that scholars increasingly use in place of the physical research libraries of major universities. The excluded are not only individuals, but also researchers at institutions in many poorer countries, and sometimes even researchers at wealthy institutions that can't justify the subscription fees for a particular resource.

This is not a new problem, though it has new aspects and there are technical possibilities for new solutions. It's not only a problem of uncooperative, narrow-minded or greedy owners and administrators of online resources, though there are plenty of those. It's really a complicated set of related problems, not a single problem that is likely to have a single solution. However, it's worth trying to make things better in this area. This is partly for the benefit of people like Trevor, who get to satisfy their curiosity more fully, but mostly it's for the benefit of all the rest of us, who get to read the results of their research.

[Note: in fairness to the ATILF, I should note that some of its key digital resources, such as the TLFi, seem to be freely available to all online comers. Here, for example, is the entry for Franc. Trevor's general point stands, though.]

Posted by Mark Liberman at 09:02 AM

Computers singing in Barcelona

I've recently been reading some papers by Xavier Serra, Jordi Bonada and other researchers at the IUA (Institut Universitari de l'Audiovisual) at the Universitat Pompeu Fabra. According to its web site, this university was founded in 1990, and named after Pompeu Fabra, a linguist who "laid down the standards of the modern Catalan language". I can't think of another major university named after a linguist, but perhaps someone will inform or remind me.

The papers that I've been reading are about techniques for analysis, modification and resynthesis of sounds in general and speech in particular (note that to see the last link, you may have to press cancel on an annoying little pop-up that wants you to go to their home page and then navigate through four layers to the abstract in question). These are the techniques involved in Yamaha's Vocaloid system for synthesis of the singing voice, due to be released in January. I teach a course in digital signal processing for outsiders (here is last spring's web site), and I'm planning to put together a module on these new techniques for singing synthesis. They're fun to play with as well as interesting and useful, and they're an excellent illustration of many basic DSP concepts and techniques.

I'm withholding judgment on the Yamaha system until I can try it out interactively. Their demos are impressive, but for evaluating speech synthesis, pre-prepared demos are not very meaningful. They're more or less like screen shots of an interactive program -- they tell you something, but not much. The underlying signal processing techniques are very interesting, though, and should in principle be able to support what Yamaha promises. It's just that there are a lot of other steps where problems could arise.

If you're interested in learning more about the fundamentals, there's a (slightly sketchy) tutorial for the (free software) CLAM system from IUA, in which you should look especially at sections 8 and 9 on SMS analysis and synthesis.

Posted by Mark Liberman at 07:32 AM

December 18, 2003

Glemphy will not be Word of the Year

In an earlier post about a made-up word, Geoff Pullum asked "Funny, isn't it, how you can just look at a word and know immediately that it is not going to catch on?"

In this recent interview at SciFi.com, Joss Whedon comments on Fox's decision to cancel his series Firefly:

"They were so wrong that we may have to create a new word that means wronger. .... The new word for wrong that we're going to come up with -- they were glemphy. They were just completely glemphy."

I sympathize, I really do. I can think of several situations in my own recent experience where "they were just completely glemphy" is the perfect description, and I don't even work in the television industry, which by reputation is significantly glemphier than academia. I'll try to do my bit for this infant coinage over the coming months. But if glemphy winds up in any semi-reputable dictionaries, or even in the American Dialect Society's Words of the Year awards, the drinks are on me.

These messages from 1999 suggest that in fact "glemphy" was not a spur-of-the-moment coinage, but rather a word from Whedon's past -- if there were a lexicographic plot here, it would be thickening :-). But glemphy didn't catch on in 1999, and it probably won't in 2003. Despite the I Fear the Glemphy Man stickers circulated four years ago by the Stand up for Buffy Committee, the web was subsequently glemphyless until Whedon's recent interview, and alas, glemphyless it is likely to remain.

It's not because there's anything wrong with glemphy. Compared to advective, the coinage that Geoff was dissing, glemphy is a triumph of the word-maker's art. It offers a G-rated substitute for FUBAR and similar useful but slightly salty terms; it's got good mouth feel and decent phonetic symbolism; it comes out of not one but two pop subcultures. It's just that creating a successful new word is hard.

Posted by Mark Liberman at 10:02 PM

Fieldwork Effle

In reading Mark's post about Effle, I immediately thought of a Salvadoran teenage boy I interviewed in my '94-'97 fieldwork among Latina/o youth in a high school in Northern California. I was investigating what some scholars (e.g. Guadalupe Valdes) call "linguistic resistance", that is to say identity-based resistance to language learning (in his case, he was a native speaker of African-American English but had been placed in English as a Second Language (ESL) classes, clearly a case of his Spanish-speaking home background trumping his oral language skills). When I interviewed him and asked why he had failed to be promoted out of beginning ESL after three years of being in the class, he told me that he couldn't stand sitting around reading and listening to nothing but Effle-language, and would rather get kicked out of class. Ok, he didn't use the word "Effle" but he gave me something close to the definition on the Effle page.

Can you imagine being stuck for 8 hours a day in a classroom where people only spoke in Effle?

Posted by Norma Mendoza-Denton at 11:27 AM

Vintage Effle

Margaret Marks at Transblawg points us to the Effle Page, which introduces a useful word for the pseudo-language of many phrase books (and some linguistics examples), and claims that Ionesco's Bald Soprano was written (in French) as an imitation of effle sentences in the books from which he learned English.

My favorite source of effle used to be a thin Vietnamese-English phrase booklet that I bought at a Pleiku road-side stand in 1970. It was written by someone who was not a native speaker of English, printed very cheaply, and was apparently intended for the bar girl market, since the English side ran to things like

I am grateful for you to buy another bottle of champagne.
You have mistaken me, sir, I am a girl of good born.

Indeed, with some stage directions and a bit of good will, the whole thing could easily have been passed off as a one-acter from some second-rank absurdist playwright. My copy wandered off at some point, so someone else will have to arrange the premiere.

Some similarly evocative dramatic fragments can be found in the brief (about 100 lines) English/Harari "dialogues and sentences" in appendix II (Grammatical Outline and Vocabulary of the Harari Language) of Richard Burton's First Footsteps in East Africa, which I recently re-read. I'm not sure this counts as effle, sentence by sentence, but the overall impression created by the sequence is similar. Here's an illustrative sample:

Come in and sit down.
What is thy name?
Come here (to woman).
Dost thou drink coffee?
I want milk.
Where goest thou?
I go to Harar.
Send away the people.
I love you.
What is thine age?
Don't laugh.
Raise your legs.
Don't go there.
This man is good.
He is a great rascal.
I don't want you (woman).
Leave my house.

Depending on the staging, this phrase list/dialogue might accompany several different stories, all more or less piggish. Whatever events one might imagine, they seem likely to be Burton's fantasies rather than facts, since he spent his ten days in Harar "so closely watched that it was found impossible to put pen to paper", and compiled his Harari grammatical sketch, after fleeing the city, during a few days spent in the Galla country to the east of Harar while equipping a caravan for the journey to Berbera on the coast.

"The literati who assisted in my studies were a banished citizen of Harar; Sa'id Wal, an old Badawi; and Ali Sha'ir, "the Poet", a Girhi Somal celebrated for his wit, his poetry and his eloquence.... Our hours were spent in unremitting toil: we began at sunrise, the hut was crowded with Badawi critics, and it was late at night before the manuscript was laid by. On the evening of the third day, my three literati started upon their feet, and shook my hand, declaring that I knew as much as they themselves did."

[Update: some excellent effle is now available at desbladet.

On reflection, I'm not satisfied with the cited definition (from the effle page):

"Effle is grammatical English which could never be uttered because it has little meaning and could never be put into a sensible context."

The examples are mostly meaningful enough, it seems to me. But they have a sort of artificial feeling, like not-quite-real computer-generated movie scenes. As in the case of such scenes, it can sometimes be difficult to put your finger on exactly what's wrong -- though of course sometimes it's pretty obvious. Anyhow, it's interesting that this sense of unnaturalness can arise in a purely textual environment, since in the case of CG scenes, it's likely that the problems are mostly due to the lack of real physics and physiology in the causal chain leading to the signals.

The analogous issues in the case of synthetic speech are especially interesting. I've been learning about recent innovations in that area, and may have something to say on the subject in coming weeks.]

Posted by Mark Liberman at 08:01 AM

December 17, 2003

Wedding Vowels R Us

Public Service Announcement: If you've come here because you're interested in solemn promises of faithful attachment in marriage, and you've searched for "wedding vowels", you really should make this search for "wedding vows" instead. A vow is "a solemn engagement, undertaking, or resolve, to achieve something or to act in a certain way." A vowel is "a speech sound produced by the passage of air through the vocal tract with relatively little obstruction, or the corresponding letter of the alphabet", usually contrasted with consonant. Your vows will need to contain both vowels and consonants. I wish you all the best in your ceremony and in your life together!

As Geoff Pullum explained, I occasionally look over our server logs to see who is reading Language Log and how they find us. Geoff described this as "like having little radio transmitters attached to all of you so that we can track your positions, and keep track of where you go and where you came in from and what pages you like and so on." I think of it more like walking through the bar to see who's there, what they're eating and drinking, and whether they seem to be having a good time.

Anyhow, this afternoon around 4:30, someone from the Alamo Community College District found this Language Log post by asking Google to search for renewing wedding vowels, a search that returns 203 results of which we are number 5.

I can't tell, of course, whether this reader was interested in learning about linguistic errors or in planning a ceremony of interpersonal rededication. The fact that they read just that one page may be a clue.

Keeping the marriage ledger balanced, a different reader found this post by searching for "adultery laws in new hampshire divorce", which returns 5,470 pages, of which we are number 9. This visitor also did not hang around long.

By comparison, ten people today have found this post by asking Google for "meaning of sic" (or similar patterns such as "sic meaning"). This returns 380,000 pages, of which we are number 2. I suspect that what these readers found was very close to what they were looking for; and perhaps for that reason, most of them stuck around and read more stuff. One of them read more than 30 other posts...

Something like 200 people a day find us via search engines -- I hope that most of them find something to their taste, even if it's not what they expected.

By the way, my own guess is that our guests from treas.gov who scoped out Geoff's posts on his Las Vegas "fieldwork" were just looking for something to stick up next to the water cooler at work. But then, I haven't seen them around the place recently -- though of course they're always welcome to stop in for some quick intellectual refreshment :-).

[Update 1/26/2004: A reader has pointed out that the word avowal is no doubt part of the pattern that results in the vow/vowel confusion. (myl)]

Posted by Mark Liberman at 05:50 PM

Women and discourse markers

I just recently came across a study by M. R. Mehl and J. W. Pennebaker ("The sounds of social life: a psychometric analysis of students' daily social environments and natural conversations", manuscript, 2002; findings summarized in J. W. Pennebaker, M. R. Mehl and K. G. Niederhoffer [2003], "Psychological aspects of natural language use: our words, our selves", Annual Review of Psychology, 54:547-77), where the authors rigged up 52 University of Texas undergraduates with electronically activated recorders and found that (a) males use 4 times more swearwords than women, (b) women use more discourse markers, (c) women use more first-person pronouns, and (d) women use more modals.

I wanted to ask LanguageLoggers what they think of this. Any explanations?

Posted by Norma Mendoza-Denton at 05:36 PM

Passive voice and bias in Reuters headlines about Israelis and Palestinians

The organization Honest Reporting recently released a study of bias in Reuters news agency headlines about events in Israel and Palestine. The part of the study on "Verb selection" claims that the choice between active and passive voice is being used to make Israeli violence more overt and apparent and Palestinian violence less so. The report says:

Violent acts by Palestinians are described with "active voice" verbs in 33% of the headlines.

Violent acts by Israelis are described with "active voice" verbs in 100% of the headlines.

Unfortunately, whatever the validity of the data on which the claims are based, their linguistic analysis is wrong two-thirds of the time in the examples that they give.

Here are their three examples:

Example 1:

"Israeli Troops Shoot Dead Palestinian in W.Bank" (July 3)
Israel named as perpetrator; Palestinian named as victim; described in active voice.

vs.

"New West Bank Shooting Mars Truce" (July 1)
Palestinian not named as perpetrator; Israeli not named as victim; shooting described in passive voice.

Example 2:

"Israel Kills Three Militants; Gaza Deal Seen Close" (June 27)
Israel named as perpetrator; Palestinians ("Militants") named as victims; described in active voice.

vs.

"Bus Blows Up in Central Jerusalem" (June 11)
Palestinian not named as perpetrator; Israelis not named as victims; described in passive voice.

Example 3:

"Israeli Tank Kills 3 Militants in Gaza - Witnesses" (June 22)
Israel named as perpetrator; Palestinians ("Militants") named as victims; described in active voice.

vs.

"Israeli Girl Killed, Fueling Cycle of Violence" (June 18)
Palestinian not named as perpetrator; killing described in passive voice.

The evidence of bias may seem clear enough (I won't be evaluating that here), but this is Language Log, and -- forgive me for being a pedant, but it is part of my job description -- I have to point out that only one out of the three examples here actually illustrates the passive voice.

Example 1. "New West Bank Shooting Mars Truce" is entirely active: the main verb is mars. The subject noun phrase is new West Bank shooting. The word shooting here is a nominalization -- a noun derived from a verb root (notice, you can talk about two shootings: it actually takes the plural marker -s like any other noun). Nominalization is one way to avoid reference to the agent of an action (here, who did the shooting), but it's not the same as using the passive voice.

Example 2. "Bus Blows Up" is indeed a strange way to describe an incident in which a human being straps explosives to himself, gets on a crowded bus in a city street, and kills 13 people by detonating his payload, clearly intending to murder as many Jews as possible at one go. However, there is no passive construction here. The predicate is active and intransitive. ("Bus is Blown Up" or "Bus Blown Up" would have been a passive.) What's weird is that a reference to the bus is used as the subject of this intransitive predicate. Reuters describes the event as if the bus had just exploded all on its own. But not with a passive.

Example 3. The third example is the only one with a passive verb: "Israeli Girl Killed" has the past participle killed used as the verb of a passive verb phrase: in a fuller (non-headline) form the sentence would be "Israeli girl is killed". There is no by-phrase following, so there is no reference to who did the killing; this is the point that Honest Reporting complains about in Reuters headline phrasing. The thing about the passive construction that makes it convenient for suppressing reference to perpetrators is that preposition phrases with by are almost always optional: you can leave them out without the result being ungrammatical. If you use a tensed active verb it's not so easy to suppress the identity of the actor, because subjects are obligatory in tensed clauses: Palestinian gunman kills Israeli girl would be grammatical, but *Kills Israeli girl would be ungrammatical. (A few newspapers do use subjectless tensed headlines -- I've seen it in the Chicago area -- but most do not.)
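For readers who like to see such distinctions made mechanical, here is a minimal sketch of the classification applied to the three contested headlines. It is a toy, not a parser: the two word lists are hand-picked assumptions covering only the verbs in these examples, and the function name `voice` is invented for illustration. The rule it encodes is the one described above: a tensed active verb makes the clause active, while a bare past participle (with the headline's usual omission of "is") makes it passive.

```python
# Toy illustration only: these word lists are hand-picked for the headlines
# discussed here and are an assumption, not a general-purpose grammar.
FINITE_ACTIVE = {"shoot", "mars", "kills", "blows"}  # tensed active verbs
PAST_PARTICIPLES = {"killed", "blown"}               # participles heading a passive


def voice(headline: str) -> str:
    """Label a single-clause headline 'active', 'passive', or 'unknown'."""
    words = [w.strip(",.;:\"'").lower() for w in headline.split()]
    if any(w in FINITE_ACTIVE for w in words):
        return "active"
    if any(w in PAST_PARTICIPLES for w in words):
        # Headlines drop "is"/"are", so a bare participle counts as passive.
        return "passive"
    return "unknown"


for h in ("New West Bank Shooting Mars Truce",
          "Bus Blows Up in Central Jerusalem",
          "Israeli Girl Killed, Fueling Cycle of Violence"):
    print(f"{voice(h):8} {h}")
```

Only the third headline comes out passive. Note that the nominalization shooting never enters into the decision at all: it is a noun, and the verb of that clause is mars.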

Honest Reporting is claiming that Reuters uses active and passive verb phrases differentially in its headlines, often suppressing facts of Palestinian agency in violent acts, but literally never suppressing the fact of agency when Israelis or the Israeli state are involved. If their analysis of the data is accurate, this deserves explanation. There ought to be no gross nationality difference in the frequency with which constructions making reference to the agents in acts of violence are used -- certainly not a difference as staggeringly large as 33% versus 100% according to whether Palestinian or Israeli violence is involved. But this sort of propaganda analysis would be best done by people who have a clear grasp of basic traditional grammar, so that when they refer to the use of passive voice they know what they are talking about and can give examples that do indeed show passive clauses.

Credibility is everything in studies of this kind. Honest Reporting cannot possibly claim to be non-partisan: they are avowedly devoted to the cause of righting what they see as a shocking anti-Israeli bias in the western media. So we can only trust that they are living up to the first word of their name if they are scrupulously accurate when they do their deliberately pro-Israel advocacy and analysis. When we find that they can only identify a passive verb 33% of the time, in an analysis that is explicitly about how many times the passive voice is used, it shakes our confidence in the accuracy of other aspects of their analysis too (perhaps quite wrongly).

Footnote added later: There could be other factors accounting for the numerical discrepancies, of course. Anthony Hope has pointed out to me that when the Israeli state does something the identity of the agent is known immediately, but Palestinian-initiated acts of violence are often hard to attribute to a specific person or group in the first few hours. Chris Potts points out a linguistic issue: the word "Palestinian(s)" is longer than the word "Israel(i)" by a factor that would be nontrivial in headline composition, where every millimeter of column width counts. Both these factors could be in play. I'm not suggesting otherwise when I observe that Honest Reporting's data needs explanation. By the way, they discuss many different kinds of bias on the part of Reuters, not just choice between actives and passives.

Posted by Geoffrey K. Pullum at 12:30 AM

High jinx

Apparently the legendary Philadelphia middle school jinx masters are just the start of the story. Jinx lore, it seems, is a sort of lexicographic Drosophila melanogaster, with many existing variants, and new mutations forming and recombining before our eyes.

Greg Urban sent me a pointer to this Texas folklore page, describing a cooperative (?) jinx-avoiding ritual consisting of saying "jinx, you owe me a coke." Other sources cite the formula "pinch, poke, owe me a coke", which is more euphonious and also sounds somewhat less cooperative, and the response "wearing blue, you owe me two."

Laurie and Winifred Bauer, of the School of Linguistics and Applied Language Studies at the University of Wellington, have a fascinating and extensive site documenting a two-year project on NZ Children's Playground Vocabulary. They provide this discussion of jinxes, in which they observe that

The main finding is that practices vary considerably from school to school, and that the same words used in setting up the jinx will not necessarily involve similar penalties or clearance procedures. It is clear that there is a good deal of invention in making jinxes harder to clear, and harsher to incur, and that there are basically no fixed understandings of how jinxes will work. This is an area of potential difficulty for children who move schools: there were 57 different forms of jinx reported, and a variety of penalties and clearance procedures. ...

The commonest wordings of the jinx were personal jinx (118 reports) and jinx (61). They are often differentiated in terms of who can clear the jinx: if you say personal jinx, then only the jinxer can clear the jinx, while if you say jinx, then anybody can clear the jinx. However, it is clear that in some schools, personal jinx functions like jinx as described above, and private/master jinx or personal jinx padlock functions like personal jinx above. ... Sometimes a longer formula is required, e.g. jinx, jinx, personal jinx, reported 15 times. ... Other long formulae include personal private personal jinx, jinx personal personal jinx. Double jinx was reported 8 times, and during school visits, this was said to mean that the jinxer and one other person could clear the jinx. There were a host of one-report-only variations: banana personal jinx (which incurs the penalty of being hit 100 times), commander jinx (where the jinxer can command the jinxee to do anything they fancy), infinity jinx, golden jinx, smelly jinx, caller jinx, unbeatable jinx, ...

Here is a (non-serious!) story from a Bath (UK) student newspaper, describing "jinx gangs", and a "jinx king" who "openly boasted about extorting hundreds a week from other children at his comprehensive school. By singling out victims who were easy targets, such as those singing, or telling well known jokes, he was able to jinx up to forty pupils per day." The most affecting story is this one:

After last week, the name of Little Hussock will go down in history as a place of tragedy. During a school service at the local church, a pupil of the local primary school shouted ‘jinx’ immediately after the Lord’s prayer had been said. 342 children, staff and parents were struck dumb in one cruel blow.

Normally, this would have been just an inconvenience. However the boy, hoping to escape punishment, ran from the church and out onto a busy main road. Driven by the thrill of jinx, he failed to look where he was going and was promptly flattened by a lorry.

With the jinx-er unable to say the names of those jinxed, they may remain unable to speak for the rest of their lives. Scientists are currently looking into cloning technology in an attempt to recreate the boy, although experts doubt that this will meet the stringent criteria of jinx removal. Others have simply said that those jinxed should ignore the speaking restrictions, but what do they know?

I wonder what the jinx culture of non-English-speaking countries is like, and whether there is any international effort to establish best jinx practices and harmonize jinx standards :-)...

Posted by Mark Liberman at 12:00 AM

December 16, 2003

Don't read this, it will creep you out

I am going to tell you two things that will creep you out. I mean they will creep you out. What is more, if you are unwise enough to read them both at one sitting you will realize that they are conceptually connected, and that will creep you out even more. So it would be best not to read any further. It really would. Don't read on. Just stop here and save yourself a sleepless night.

* * * * * * * * * * * * * *

All right, you did read on, fool. So here they are. I will just relate them in order, the first one (because it is the first) will be first, and then after that, second, in the number two spot, will come the second one.

1. In the wild country south of Carmel on the Big Sur coast, the Santa Cruz campus of the University of California has a piece of land that was donated to it for use as a nature reserve. Various biologists have done scientific studies there. One studied the habits of the mountain lions (cougars) that live on the land. The scientists attached little radio transmitters to some of the animals so they could track their positions, and then they kept a log of where they went and what they did for a few months. What the scientists found was that there was a single thing that was clearly the absolute favorite thing for mountain lions to do with their time between April and October. There was just one place they all liked to hang out when the weather was good. Down where this piece of wild land meets the coast is a beach where families come to swim, bringing their little children. Overlooking the beach is a high and inaccessible ledge all covered in vegetation. The mountain lions turned out to be spending eighty percent of their free time just lying in hiding on that ledge, watching the children below. Not doing anything about it, like moving down to the beach to try and sink their fangs into one of those perfect little butts and haul a tasty snack back in the forest with its little bare legs kicking and struggling or anything like that, but just watching from a place of concealment, thinking they were unobserved.

[Note added much later (1/15/04): It is somehow inexorably relevant here that (as noted on Agoraphilia) Michael Jackson's new rented Beverly Hills mansion, now that he has moved out of the Neverland Ranch forever, is described thus: "The hillside property, which also includes indoor and outdoor swimming pools, a huge tennis court, home theater, and ballroom, overlooks a kiddie park in Coldwater Canyon . . ." Oh, no!]

2. In the wild desert country east of Southern California lies the interesting speech community of Las Vegas. A few weeks ago I posted some notes on Language Log about some data that I gathered on a linguistic field trip to that community: a rare singular they example with proper name antecedent, and a nice clear case of an endocentric noun-noun compound with regularly inflected plural non-head. I commented in another post about how unfair it would be if anyone were to cast aspersions on the genuineness of my tax deduction claims for bona fide business travel devoted to field trips of this kind to places such as Nevada. Now, it happens that Mark Liberman does various kinds of scientific research on Language Log and its readership, and has installed software for tracking the people who view pages on Language Log. Not that we're snooping on you, of course: Mark just has an interest in what audience we are reaching with our ruminations, where they are in Internet land, and how and by what links they tend to find out about us. It's like having little radio transmitters attached to all of you so that we can track your positions, and keep track of where you go and where you came in from and what pages you like and so on.

Well, Mark told me something about what he found. Just after my Las Vegas posts he spotted in the log files unmistakable evidence that at least two people (we don't know their names, of course) had been reading my stuff from a machine in the domain treas.gov, which is registered to the United States Department of the Treasury. Not doing anything about it, like calling the Fresno IRS center here in California to see about sinking their fangs into my butt and hauling me off kicking and struggling for an audit or anything like that, but just watching from a place of concealment, thinking they were unobserved.

I mean... like... hello? Does the parallel here fucking creep you the fuck out or what.

Let me just add a word before I conclude about what a fine job our government servants do in ensuring uniform compliance with all applicable laws and regulations, and what an honor it is to know that they may be reading my modest linguistic contributions (albeit apparently on government time), and what a pleasure it is to be paying into the treasury of this great country every penny of the Federal taxes that I owe. If only income taxes hadn't been so dramatically reduced recently, it would have been a delight to pay even more.

Posted by Geoffrey K. Pullum at 01:47 PM

Pickle jinxed blogging?

I'm wondering whether jinxing, even pickle jinxing, applies to blogging as opposed to talking. Hope not, because Language Log would be a sadder and a quieter space without Mark. But look, if it does apply, surely the jinx can be dissolved by emailing the victim's name to him? Mark! Mark! Mark!

Perhaps you can even stockpile tokens of your name and keep them in files to open later. Or have your name said to you once per second by a shell script for permanent immunity:

      #!/bin/csh
      while ( 1 == 1 )
        echo 'Mark!'
        sleep 1
      end

I don't think jinxing is going to be the same in the Information Age.

[Mark: "Sorry, Geoff, it's only naming by the jinxer that dissolves the jinx! The role of the jinxee's friends is to trick the jinxer into saying the name. Opinion is divided about whether the jinx extends to writing -- the crucial precedents were apparently established in kindergarten, when hardly anyone could write anyhow."]

Posted by Geoffrey K. Pullum at 01:25 PM

Pickle jinx

When I was a kid, there was a playground rule (mostly obeyed by girls) that if two people said the same thing at the same time, both speakers were supposed to

Link pinkies, touch blue,
And don't speak 'til you're spoken to.

My seven-year-old son and his friends have more asymmetrical and complex rules for this situation, involving a silence jinx imposed on the participant who is slower to react with a prescribed incantation.

There are two ways to impose the jinx, according to what I was told as we were walking to school this morning. One of the two people involved can say "jinx personal jinx," which imposes a silence jinx on the other, slower participant. In that case, the jinxee can't speak for the rest of the day until spoken to, except that (s)he can say "ebbs" and dissolve the jinx (though apparently some kids don't know that). However, you can also say "jinx pickle jinx". If you do that, the jinx is much stronger. The jinx lasts for a month, and the jinxee can't escape until ~~someone~~ the jinxer says his name (I'm not sure whether girls have the same rules here, so I'll stick with masculine pronouns). The jinxee can also say "ebbs" to dissolve the jinx, but this only works once a month with a pickle jinx. If the jinxee speaks before the jinx is dissolved, in principle everyone else is entitled to punch him, though I gather that social censure is usually enough, especially if teachers are watching. There is quite a bit more to it, apparently, including a legend of some middle schoolers who are said to know other jinx-dissolving methods.

This seems to be a local and perhaps recent invention, since "pickle jinx" is one of those few pairs of words that are not found in Google's index. However, here is a discussion of some (British) jinx rules that are similar in spirit though less complex.

I should also mention that I myself am now in the state of having used "ebbs" to dissolve a "pickle jinx", which means that if I get jinxed again within a month, I may be uncharacteristically silent for a while.

[Update: Here is some further jinx lore.]

Posted by Mark Liberman at 11:09 AM

Chicken

Chicken. (via Trevor)

Also Shoe.

Posted by Mark Liberman at 10:02 AM

Yankees and Bostonians

Mark Liberman's post on the use of Frank as a term for European, like that of Yankee for Americans in general, calls to mind another situation in which Yankees have come to be representative of a larger group. Sailors out of Boston were prominent in the maritime trade in the Pacific Northwest, as a result of which Boston came to mean American in Chinook Jargon, the trade language used along the coast, which later spread up the Fraser River and saw use in the interior of British Columbia as far north, at least, as Takla Landing. Variants of Boston came to mean "American" in Carrier, and in some dialects came to mean, and still do mean, "white person" in general.

Posted by Bill Poser at 12:55 AM

Were the French the Yankees of Medieval Europe?

The OED's primary definition for Frank is "[a] person belonging to the Germanic nation, or coalition of nations, that conquered Gaul in the 6th century, and from whom the country received the name of France." The first citation is from Beowulf, "In Francna fæðm" ("in the grasp of the Franks"). The second sense for Frank is "[a] name given by the nations bordering on the Levant to an individual of Western nationality." For example, Burton observed that the inhabitants of Harar barred Europeans from their city because they "read Decline and Fall in the first footsteps of the Frank". I suppose that Europeans came to be called Franks at the time of the crusades, since the crusaders were more French than not, in the same way that Americans came to be called Yankees at a time when New Englanders seemed to the rest of the world to be the prototypical Americans.

The OED cross-references the extended sense of Frank to "Feringhee", which is defined as "[f]ormerly, the ordinary Indian term for a European; in 19th c. applied esp. to the Indian-born Portuguese, and contemptuously to other Europeans."

Presumably this is the lexicographic inspiration for the Star Trek species the Ferengi, though the person who imagined the Ferengi language here was certainly not patterning it on French, ancient or modern.

Tim Buckwalter wrote to me that in his experience, this frozen synecdoche ("Frank" for "European") is found in Egyptian but not in Levantine Arabic:

"I found that Levantines and Egyptians made use of "French/Frank" differently. Although both used "fransaawi" (or MSA "faransi") for "French", only the Egyptians used "farangi" to denote "European foreigner". (Levantines would have pronounced it "faranji" if they had used it, but I don't know what meaning they would have assigned to it. ...). But the Levantines had an interesting use of "Frank" in the term "franko-arab", which they used for designating bilingual Arabic-European language talk typical of university educated people ... "

[Update: Trevor writes:

I don't know much about Yankees, but the term ifrang (caron on the g) is already used pre-crusades in C8th Andalusian and Maghrebi Arabic to refer to heathens from the north (ie tripartite division: Andalusians, Jews, Franks). Now that we all belong to ancient nations desirous of independence, ifrang tends to get translated as 'Catalan' or 'Basque' or whatever, depending on the translator's paymaster and/or party membership. According to Miquel's Géographie humaine du monde musulman jusqu'au milieu du 11e siècle (1975) the word was originally used by the Arabs to distinguish western Christians from the Byzantines (which may explain why it doesn't show up in pre-modern Levantine Arabic). Kfr certainly made it over the land route to parts of India in the first Muslim rush, but I suspect that Feringhee arrived in the subcontinent on dhows from the Gulf. Various other words were used by mediaeval Arabs in this part of the world to describe unbelievers, including kafir (see Byron's Gavour), rum (macron on the u), 'ilg and nasrani (I hate diacritics).

So my guess about the crusades as the source of "Frank = European" is apparently wrong; and it seems that the OED is also wrong in assigning the origin of the (various Arabic transliterations of the) term to the Levant.]

[Update 2: Lameen Souag writes:

Interesting post on Ferengis and Franks... I could add that: "franko-arab" is a straightforward loan of the French coinage Franco-Arabe; and http://www.emich.edu/~linguist/issues/4/4-492.html gives an impressively long list of languages using farang, including most of Southeast Asia and Ethiopia. I had always assumed the word was initially spread by the medieval Arab geographers; compare the semi-mythical country Waq Waq (either Madagascar or Japan in the medieval geographies), which in parts of Algeria is still a popular site for fairy tales. Note that, while it may not be used in the Levant per se now, it was copiously used there in the time of the Crusades - see Amin Maalouf's The Crusades through Arab Eyes, where he quotes relevant Arab chroniclers of the period - so your hypothesis may ultimately be correct... In Algeria we don't have the term faranj, but do still call the French and any other Westerners "Rumi" - Roman, or really Byzantine - or "gawri" < Turk gavur < Arabic kafir - which from Turkish gave English the word "Giaour".

]

Posted by Mark Liberman at 12:12 AM

December 15, 2003

Like we used to could

Close listeners to President Bush's press conference today will have heard some delicious Texas phonetics, for example when he said of letting non-coalition partners get rebuilding contracts in Iraq that it "was sump'n I wadn't gonna do", with [d] for [z] in wasn't. There was also some classic Texas syntax, e.g. in "...like we used to could." In Standard English you can never use a modal (can, could, must, ought, shall, should, will, would...) in its plain form (like in an infinitival clause), but in lots of southern dialects you can (you find they might should, we used to could, and various other combinations). Writers who comment on Bush's speech tend to take his most startling departures from Standard English ("Is our children learning?" and so on) as evidence of brain damage. Those much-quoted subject-verb agreement errors certainly are remarkable failures of sentence planning; they're ungrammatical in every dialect. But things like sump'n I wadn't gonna do and like we used to could aren't a sign of inarticulacy or ignorance or anything of the sort. They're regional and informal, but they don't provide any basis for inferences about anyone's intelligence or competence in other domains. Most Americans seem to think regional speech varieties automatically and unfailingly signal low status and low IQ, as if there was some definite connection there. I'm not tempted in that direction, because the department chair who hired me when I joined the University of California, Santa Cruz, is one of the smartest syntacticians I ever met in my life, Professor Jorge Hankamer, who proudly speaks a pure southern Texan English that you could cut into cubes with a Bowie knife and make chili with. So I tend to associate Texan dialect (rightly in some cases, and doubtless wrongly in others) with high prestige and high intelligence. You have your linguistic prejudices, I have mine.

Posted by Geoffrey K. Pullum at 11:51 PM

French and Postmodernism

Mark Liberman's reference to French naturally made me think of Postmodernism, the French revenge for Jerry Lewis. One of the odder varieties of natural language technology is random text generation, and one of the more interesting examples of this is the Postmodernism Generator, which I urge you to check out. A description can be found in A. C. Bulhak's paper "On the simulation of postmodernism and mental debility using recursive transition networks", Technical Report CS 96/264, Department of Computer Science, Monash University, Melbourne, Australia, 12pp., which can be downloaded, in gzipped Postscript, here. The title is perhaps a bit redundant, but it's interesting.

Posted by Bill Poser at 07:45 PM

News from the big word book

The December issue of the OED Newsletter is out. If you read it, you'll learn what bastard and bat-man have in common, and what the OED's Artist in Residence is thinking.

Posted by Mark Liberman at 01:05 PM

He used all his French

According to the 12/15 New York Times story on Saddam Hussein's conversation with four members of the Iraqi Governing Council, Mowaffak al-Rubaie asked "When they arrested you why didn't you shoot one bullet? You are a coward."

The response, according to Mr. al-Rubaie, was that "... he started to use very colorful language ... Basically he used all his French."

Mr. al-Rubaie went on to say

"I was the last to leave the room and I said, `May God curse you. Tell me, when are you going to be accountable to God and the day of judgment? What are you going to tell him about Halabja and the mass graves, the Iran-Iraq war, thousands and thousands executed? What are you going to tell God?' He was exercising his French language."

According to Arabic speakers that I've asked, "French" is not an expression for cursing in Arabic. Presumably al-Rubaie's remarks are a generalization of the English idiomatic meaning of "French" described by the OED as follows:

B.1.b. Used euphemistically for ‘bad language’, esp. in the phr. excuse (or pardon) my French!

1895 [see DURNED]. 1909 J. R. WARE Passing Eng. 171/1 Loosing French, violent language in English. 1922 JOYCE Ulysses 446 Bad French I got for my pains. 1936 M. HARRISON All Trees were Green II. 104 A bloody sight better (pardon the French!) than most. 1940 S.P.E. Tract IV. 181 Excuse my French! (forgive me my strong language). 1955 M. MCCARTHY Charmed Life (1956) ii. 52 ‘Damn fool,’ he said, vehemently, ‘pardon my French.’ 1961 J. O'DONOVAN Middle Tree ii. 12 A kick in the arse, smartly administered... Excuse my French! 1966 A. LA BERN Goodbye Piccadilly xxv. 220 Well I'll be buggered. Excuse my French.

The citation list suggests that this usage is not much more than a century old. The possible birth of a new idiomatic meaning of the word French was documented in this space a couple of months ago.

Posted by Mark Liberman at 11:21 AM

December 14, 2003

Life at the speed of blog

On December 11, Anders at Phluzein noted, a wee bit testily, that "Language Log has at last contributed to the body of criticism aimed at the journal Nature for its publication of an article (previously posted about here) ..." I guess that we do need to apologize to our readers for the fact that Bill Poser's review was not posted until December 10, nearly 13 full days after the Nature article appeared on November 27. Our standing offer to refund the fees of any dissatisfied subscriber applies here, of course.

Seriously, it did seem like a long time, though by the conventional standards of formal intellectual discourse, Bill might have been expected to publish a review in Language some time around the middle of next year (and maybe he will, anyhow). Meanwhile, Geoff Pullum has improved on our record for timely response by posting a scoop about the linguistic aspects of Saddam Hussein's capture, a mere 6 hours or so after Bremer's news conference in Baghdad announcing the event.

Posted by Mark Liberman at 02:15 PM

We got 'im!

On the central coast of California where I live, we catch up on international news when we wake up before dawn by turning on the bedside radio and switching back and forth between KAZU in Pacific Grove, which relays National Public Radio from Washington DC, and KSPB at Stevenson School in Pebble Beach, which runs a feed from the BBC World Service. It was an interesting experience this morning to hear a genuine difference in dialect and style in play. During the small hours of this morning, Saddam Hussein was captured in his hiding hole (we have so far heard it referred to as a spider hole, a squirrel hole, and a rat hole, by the way: the media have yet to standardize on a specific vermin metaphor). Ambassador Bremer decided on resolutely informal style for the opening of his speech announcing the event: his first paragraph was, in full, "We got 'im!" He gave it that American English flapping and voicing of [t] between vowels that makes "got 'im" sound like what could be written phonetically as ['gadm]. But within minutes the BBC were reporting what he had said as "We got him", in their educated southern British dialect and rather formal style, with the "t" of got sounding like [t], and the [h] of him clearly audible. There is a real question about whether this (largely involuntary) style and dialect switch reported the content of Bremer's utterance correctly. Beside the phonetic point, there is a syntactic and semantic one: "We got 'im" in American English can be present tense -- the equivalent in British of "We have got him." But in British English, "We got him" can only be preterite tense, the equivalent of "We did get him."

Posted by Geoffrey K. Pullum at 01:38 PM

Billy Bass has learned to read TIMIT!

Look here at section III, "Creating Your Own Sound Bites," and get the source here. The project uses TIMIT transcription format to "synchronize lip motion with sound"!

Somebody should manufacture these things for sale. It would make a good intro phonetics project -- if you had two of them, you could orchestrate a dialog -- but not everyone may feel comfortable hacking actual physical objects with soldering irons and dremel tools and things.

Posted by Mark Liberman at 12:35 PM

Strange scrambling Alphabets

Here's a follow-up to Bill Poser's post on Unicode. Knowing the appropriate Unicode character codes (in whatever UTF-? presentation) is only the first step towards being able to enter, display, edit and print documents in many of the world's languages. To give a fuller picture of the issues, I'll add a quick set of commented links on the problem of "rendering", that is, creating a correct visual representation for the character code sequences representing words, phrases, paragraphs and so on.

To see why rendering is a problem, look at this page on Examples of Complex Rendering and this one on Challenges in publishing with non-Roman scripts. I'll leave issues of entering and editing multi-lingual text for another time.

As for where the digital world is now on this problem, it's really complicated. Things are a lot better than they were just a couple of years ago, when Unicode was nearly useless because there were hardly any applications that could actually do anything with Unicode text, even for simple cases of "complex" rendering like single diacritics on Roman text, much less Arabic or Hindi. Today, nearly any reasonably up-to-date browser should be able to handle Arabic Unicode mixed with English, as would be required for Appendix II of Burton's First Footsteps. However, there are still lots of holes, inconsistencies and incompatibilities. You can find a relatively recent overview of the problem and (some of) the range of partial solutions here, though that page does not mention Pango or Qt (about which more below), or the progress recently made in the Java world.

The most extensive practical progress on this problem has been made by Microsoft. Internet Explorer and Microsoft Word, on recent Windows XP platforms, appear to offer the easiest context for dealing with the widest range of scripts. Here is the Introduction to a series of pages on Windows Glyph Processing that illustrate Microsoft's approach to rendering.

There are some open-source projects for rendering complex scripts that are very well designed and have great promise for the future. Here are the home pages for SIL's Graphite and Gnome's Pango projects. Unfortunately, the results have not been very thoroughly integrated into other programs and toolkits yet, so (for example) if you want to craft new interactive programs that involve rendering Arabic or Indic scripts using open-source software, it appears that your best bet at present is the (semi-open) Qt toolkit. This toolkit takes a less general approach -- it lacks a programmable rendering engine of the type represented by Graphite and Pango -- and you have to pay to develop with it for Windows platforms, but it offers a set of useful widgets that do the job well for some specific scripts right now.

There's a lot more to say -- some of it hopeful, some of it depressing -- but these links should be enough to get you started in the right direction if you're interested. One well-earned piece of advice: if you have a project that depends on scripts with complex rendering, don't believe what anyone says about their software's capacities until you see it with your own eyes, doing the kind of thing you want to do, in your own operating environment. This is not mainly because people are dishonest, though of course they sometimes are. Rather, it's because people (including me!) are often incompletely informed -- not to say ignorant -- and the situation is very complicated.

[The title of this piece is a quote from Beaumont's Psyche. More of its context can be found here.]

Posted by Mark Liberman at 09:20 AM

Unicode

In a recent post, Mark Liberman referred to Unicode as a way of reproducing Arabic characters. Unicode is probably unfamiliar to a lot of people, so I thought I'd talk about it a little bit.

Like everything else, written characters are represented on computers by patterns of bits, ones and zeroes. These patterns of bits have a more conventional interpretation as numbers, so we generally talk about the representation of text as if characters were represented by numbers. What bit pattern (or number) represents what character is perfectly arbitrary. In the old days, when computer terminals had character encodings built in, it was a matter of what electrical signals would cause a certain character to appear on the screen. Nowadays it is a matter of what bit pattern causes a certain picture to be drawn. A mapping from bit patterns or numbers to characters is a character encoding. For example, in the American Standard Code for Information Interchange (ASCII) code the letter a is represented by the number 97, or really, the bit pattern 01100001.
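To make the mapping concrete, here is a minimal sketch in Python (my choice of language for illustration; nothing in this post depends on it). It uses only the built-in functions ord, chr and format to move between the character, its number, and its bit pattern.

```python
# A character encoding maps numbers (bit patterns) to characters.
# ord() returns the number assigned to a character.
code = ord("a")

# format(..., "08b") writes that number as an eight-digit bit pattern.
bits = format(code, "08b")

# Going the other way: read the bit pattern as a number, then look up
# the character that the encoding assigns to it.
back = chr(int(bits, 2))

print(code, bits, back)  # 97 01100001 a
```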

ASCII is by far the most commonly used character encoding because it suffices for normal English text and English has long been the dominant (natural) language used on computers. As other languages came into use on computers, other sets of characters, with different encodings, came into existence. Indeed, there is usually more than one encoding for a particular writing system. All in all, there are hundreds of different character encodings.

This proliferation of character encodings causes a lot of problems. If you receive a document from someone else, your software may not be able to display it, print it, or edit it. You may not even be able to tell what language or writing system it is in. And if you need to use multiple writing systems in the same document, matters become much worse. Life would be much simpler if there was a single, universal encoding that covered all of the characters in all of the writing systems in use.

Unicode is a character encoding standard developed by the Unicode Consortium to fulfill this need. The current version of the Unicode standard contains almost all of the writing systems currently in use, plus a few extinct systems, such as Linear B. More writing systems will be added in the future. A list of the current character ranges can be found here. In some cases Unicode lumps together historically related writing systems (for example, what it calls the Canadian Aboriginal Syllabics is not a single writing system), so to find out if your favorite writing system is included, and where the characters are, you may have to spend some time exploring the standard.

Here is a screenshot of the Yudit Unicode editor displaying a sampling of writing systems. This text is all encoded in Unicode.

Unicode originally intended to use two bytes, that is, 16 bits, to represent each character. That would be sufficient for 65,536 characters. Although this may seem like a lot, it isn't really quite enough, so full Unicode makes use of 32 bits, that is, four eight-bit bytes. That's enough for 4,294,967,296 characters, which should hold us for a while. In fact, the Unicode Consortium has agreed that for the foreseeable future only the first 21 bits of the available 32 will actually be used. Text encoded in this version of Unicode is said to be in UTF-32.
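The counts in this paragraph are just powers of two, which is easy to check. The sketch below is in Python, chosen only for illustration; sys.maxunicode is Python's own name for the highest code point it supports, and I am assuming a reasonably modern Python 3.

```python
import sys

two_byte_capacity = 2 ** 16         # distinct values in 16 bits
four_byte_capacity = 2 ** 32        # distinct values in 32 bits
twenty_one_bit_capacity = 2 ** 21   # distinct values in 21 bits

print(f"{two_byte_capacity:,}")         # 65,536
print(f"{four_byte_capacity:,}")        # 4,294,967,296
print(f"{twenty_one_bit_capacity:,}")   # 2,097,152

# How many bits does the highest code point Python accepts actually need?
bits_needed = sys.maxunicode.bit_length()
print(hex(sys.maxunicode), bits_needed)
```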

One problem with UTF-32 is that every character requires four bytes, that is, four times as much space as the ASCII characters and other single-byte encodings. In order to save space, a compressed form known as UTF-8 is usually used to store and exchange text. UTF-8 uses from one to four bytes to represent a character. It is cleverly arranged so that ASCII characters take up only one byte. Since the first 128 Unicode characters are the ASCII characters, in the same order, a UTF-8 file containing nothing but ASCII characters is identical to an ASCII file. Other characters take up more space, depending on how large the UTF-32 code is. Here are the encodings of some of the characters shown above. The 0x indicates that these are hexadecimal (base 16) values.

UTF-32	UTF-8	Name
0x00041	0x41	Latin capital letter a
0x00570	0xD5 0xB0	Armenian small letter ho
0x00BA4	0xE0 0xAE 0xA4	Tamil letter ta
0x04E09	0xE4 0xB8 0x89	Chinese digit 3
0x10024	0xF0 0x90 0x80 0xA4	Linear B qe
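The table above can be reproduced with a few lines of Python, offered here as an illustrative sketch rather than as the way the table was actually made. The helper name hex_bytes is my own invention; str.encode with the "utf-8" and "utf-32-be" codecs is standard Python.

```python
# Code points taken from the table above, with their names.
code_points = [
    (0x00041, "Latin capital letter a"),
    (0x00570, "Armenian small letter ho"),
    (0x00BA4, "Tamil letter ta"),
    (0x04E09, "Chinese digit 3"),
    (0x10024, "Linear B qe"),
]

def hex_bytes(data):
    """Format a byte string as space-separated hexadecimal values."""
    return " ".join(f"0x{b:02X}" for b in data)

rows = []
for cp, name in code_points:
    ch = chr(cp)
    utf8 = ch.encode("utf-8")        # one to four bytes
    utf32 = ch.encode("utf-32-be")   # always four bytes
    rows.append((f"0x{cp:05X}", hex_bytes(utf8), len(utf32), name))

for row in rows:
    print(*row, sep="\t")

# A file of pure ASCII text is byte-for-byte the same in UTF-8.
same_as_ascii = "plain text".encode("utf-8") == "plain text".encode("ascii")
```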

One source of resistance to using UTF-8 in some countries is that it seems to privilege English and other languages that can be written using only the ASCII characters. English only takes one byte per character in UTF-8, while most of the languages of India, for instance, require three bytes per character. By the standards of today's computer processors, storage devices and transmission systems, text files are so small that it really doesn't matter, so I don't think that this is a practical concern. It's more a matter of pride and politics.
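To see the size difference for yourself, here is a small Python sketch. The two sample strings are my own choices for illustration (five ASCII letters and five Tamil code points), and size_in is a hypothetical helper name. UTF-16 and UTF-32 are included for comparison; the "-le" codec names are used only to keep Python from adding a byte-order mark to the count.

```python
english = "hello"
# Five Tamil code points (sample text chosen for illustration).
tamil = "\u0ba4\u0bae\u0bbf\u0bb4\u0bcd"

def size_in(text, encoding):
    """Number of bytes needed to store text in the given encoding."""
    return len(text.encode(encoding))

sizes = {}
for label, text in (("English", english), ("Tamil", tamil)):
    sizes[label] = {enc: size_in(text, enc)
                    for enc in ("utf-8", "utf-16-le", "utf-32-le")}
    print(label, len(text), "characters:", sizes[label])

# English 5 characters: {'utf-8': 5, 'utf-16-le': 10, 'utf-32-le': 20}
# Tamil 5 characters: {'utf-8': 15, 'utf-16-le': 10, 'utf-32-le': 20}
```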

If we don't need the extinct writing systems and other fancy stuff outside of the Basic Multilingual Plane, we could all be equal and use UTF-16. English and some other languages would take twice as much space to represent, but other languages would take the same space that they do in UTF-8 or even take up less space. At least from the point of view of those of us who aren't English imperialists, this might not be a bad idea, if not for the fact that UTF-8 has another big advantage over UTF-16: UTF-8 is independent of endianness.

What is endianness? Well, whenever a number is represented by more than one byte, the question arises as to the order in which the bytes are arranged. If the most significant byte comes first, that is, is stored at the lowest memory address or at the first location in the file, the representation is said to be big-endian. If the least significant byte comes first, the representation is said to be little-endian.

Consider the following sequence of four bytes. The first row shows the bit pattern. The second row shows the interpretation of each byte separately as an unsigned integer.

bit pattern	00001101	00000110	10000000	00000011
decimal value	13	6	128	3

Here is how this four byte sequence is interpreted as an unsigned integer under the two ordering conventions:

Big-Endian	(13 * 256 * 256 * 256) + (6 * 256 * 256) + (128 * 256) + 3	218,529,795
Little-Endian	(3 * 256 * 256 * 256) + (128 * 256 * 256) + (6 * 256) + 13	58,721,805
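The two readings of the four-byte sequence can be reproduced with Python's built-in int.from_bytes, which takes the byte order as an explicit argument. This is just a sketch for checking the arithmetic; the variable names are made up for illustration.

```python
# The four bytes from the bit-pattern row, in the order they are stored.
data = bytes([0b00001101, 0b00000110, 0b10000000, 0b00000011])

# Each byte read on its own as an unsigned integer.
print(list(data))

# Big-endian: the first byte (13) is the most significant.
big = int.from_bytes(data, byteorder="big")

# Little-endian: the first byte (13) is the least significant.
little = int.from_bytes(data, byteorder="little")

print("big-endian:   ", big)
print("little-endian:", little)
```

With the first byte, 13, treated as most significant, the value is 218,529,795; with it treated as least significant, the value is 58,721,805.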

Most computers these days are little-endian since the Intel and AMD processors that most PCs use are little-endian. Digital Equipment machines from the VAX through the current Alpha series are also little-endian. On the other hand, most RISC-based processors, such as the SUN SPARC and the PowerPC, as well as the IBM 370 and Motorola 68000 series, are big-endian.

By now you've probably forgotten the point of all this. Well, UTF-16 is subject to endianness variation. If I write something in UTF-16 on a little-endian machine and you try to read it on a big-endian machine, it won't work. For example, suppose that I encode the Armenian character հ ho on a little-endian machine. The first byte will have the bit pattern 01110000, conventionally interpreted as 112. The second byte will have the bit pattern 00000101, conventionally interpreted as 5. That's because the UTF-32 code, 0x570 = 1392, is equal to (5 * 256) + 112. Remember, on a little-endian machine, the first byte is the least significant one. On a big-endian machine, this sequence of two bytes will be interpreted as (112 * 256) + 5 = 28,677 = 0x7005, since the first byte, 112, is the most significant on a big-endian machine. Well, 0x7005 isn't the same character as 0x570. It's a Chinese character. So, if you use UTF-16 you have to worry about byte order. UTF-8, on the other hand, is invariant under changes in endianness. That is a big enough advantage that most people will probably continue to prefer UTF-8.
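The byte-order problem can be seen directly by encoding ho in both flavors of UTF-16 and then deliberately reading the little-endian bytes as if they were big-endian. This is a minimal sketch using Python's standard "utf-16-le" and "utf-16-be" codecs; the variable names are made up for illustration.

```python
ho = "\u0570"  # Armenian small letter ho

# UTF-16 written on a little-endian machine: least significant byte first.
le = ho.encode("utf-16-le")
# UTF-16 written on a big-endian machine: most significant byte first.
be = ho.encode("utf-16-be")
print("little-endian bytes:", [hex(b) for b in le])
print("big-endian bytes:   ", [hex(b) for b in be])

# Reading the little-endian bytes with the wrong byte order
# yields a different code point, hence a different character.
misread = le.decode("utf-16-be")
print("misread code point:", hex(ord(misread)))

# UTF-8 is a sequence of single bytes, so there is no byte order to get wrong.
utf8 = ho.encode("utf-8")
print("UTF-8 bytes:", [hex(b) for b in utf8])
```

In practice, UTF-16 files often begin with a byte-order mark so that a reader can tell which order was used, but that is exactly the kind of worry UTF-8 spares you.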

The terms big-endian and little-endian were introduced by Danny Cohen in 1980 in Internet Engineering Note 137, a classic memorandum entitled "On Holy Wars and a Plea for Peace", subsequently published in print form in IEEE Computer 14(10).48-57 (1981). He borrowed them from Jonathan Swift, who in Gulliver's Travels (1726) used them to describe the opposing positions of two factions in the nation of Lilliput. The Big-Endians, who broke their boiled eggs at the big end, rebelled against the king, who demanded that his subjects break their eggs at the little end. This is a satire on the conflict between the Roman Catholic church and the Church of England and the associated conflict between France and England. Here is the relevant passage:

It began upon the following occasion.

It is allowed on all hands, that the primitive way of breaking eggs before we eat them, was upon the larger end: but his present Majesty's grandfather, while he was a boy, going to eat an egg, and breaking it according to the ancient practice, happened to cut one of his fingers. Whereupon the Emperor his father published an edict, commanding all his subjects, upon great penalties, to break the smaller end of their eggs.

The people so highly resented this law, that our Histories tell us there have been six rebellions raised on that account, wherein one Emperor lost his life, and another his crown. These civil commotions were constantly fomented by the monarchs of Blefuscu, and when they were quelled, the exiles always fled for refuge to that Empire.

It is computed, that eleven thousand persons have, at several times, suffered death, rather than submit to break their eggs at the smaller end. Many hundred large volumes have been published upon this controversy: but the books of the Big-Endians have been long forbidden, and the whole party rendered incapable by law of holding employments.

During the course of these troubles, the emperors of Blefuscu did frequently expostulate by their ambassadors, accusing us of making a schism in religion, by offending against a fundamental doctrine of our great prophet Lustrog, in the fifty-fourth chapter of the Brundecral (which is their Alcoran). This, however, is thought to be a mere strain upon the text: for their words are these; That all true believers shall break their eggs at the convenient end: and which is the convenient end, seems, in my humble opinion, to be left to every man's conscience, or at least in the power of the chief magistrate to determine.

Some aspects of Unicode have come in for criticism, and there are some alternative proposals, but at least for now it is by far the most widely adopted universal encoding.

Your web browser can probably handle Unicode, provided that you have the necessary fonts installed. And you may have to explicitly tell your browser that the text is in Unicode - sometimes it can't tell. You can get a free TrueType font called Code2000 that includes just about everything here. Some text editors and word-processors can handle Unicode, but a lot of software, including the software that runs this blog, still can't, not directly and conveniently. That's why I used an image above.

Here's a little more Unicode, inserted the hard way. If your browser doesn't display it properly, you can click on it for an image.

سالاام

Posted by Bill Poser at 12:59 AM

Dangling etiquette

Rich and creamy, your guests will never guess that this pie is light.

"Does this fall under the no dangling modifier prescription?", asks Rosanne, in a post on The X-bar. Yes, Rosanne, it does. Like participles, adjectives and also some idiomatic preposition phrases, when used as adjuncts, need an understood subject (or, it might be better to say, a target of predication) to be filled in if they are to be understood. The prescriptive tradition says that the subject filled in must be the one obtained from the subject of the matrix clause. Here that would be your guests, which makes a nonsense reading, so the sentence cited would be treated as an error.

But the prescriptivists have a problem. Sentences of this kind, which call for you to fill in the understood subject from somewhere else (here, the subject of is light in the subordinate clause), are so common that when I and several friends have spent some time picking new ones up from print and radio sources, we get them at a rate of as many as one per day. That's in edited sources, where grammatical errors have almost entirely been screened out. This just cannot be syntactic error. It's too frequent.

I definitely think that sentences that make you twist this way and that, hunting for the intended subject, are ill-written and discourteous. But it simply isn't reasonable to say that they are syntactic errors. We follow our syntactic rules so much better than we follow this principle of courtesy. The syntax of English says (for example) that the subject should precede the predicate in a normal declarative: The cat wants to go out rather than *Wants to go out the cat. Ever seen anyone get that wrong? I thought not. People mostly know their syntax. Dangling modifier cases fall down on simple courtesy. It's manners, not grammar, that's what I think.

Posted by Geoffrey K. Pullum at 12:39 AM

December 13, 2003

Language relationships: families, grafts, prisons

When I recently read a reviewer's assertion that linguists borrowed from Charles Darwin the notion of a "family tree" as a way to describe and explain similarities among languages, it surprised me. The true direction of historical influence is mostly the other way, though of course linguists have been inspired in turn by Darwin's ideas. So I've been attuned to the ways that people think in pre-Darwin discussions of linguistic relationships.

In the post linked above, I quoted from Thomas Jefferson, who in 1781 clearly understood that affinities among contemporary languages should be seen as the residue of descent with modification. The metaphor of family trees for the relation among languages is a commonplace during the centuries before Darwin. However, not everyone who used that metaphor was as clear a thinker as Jefferson.

In 1855, Sir Richard Burton used both "family" and "tree" (or maybe "vine") metaphors in one short sentence when he wrote

"The Harari appears, like the Galla, the Dankali, and the Somali, its sisters, to be a Semitic graft inserted into an indigenous stock."

From the context, it appears that what Burton means by the "grafting" metaphor is a milder version of what creolists would mean by talking about a Semitic lexifier on an "indigenous" substrate. However, it seems that Burton's evidence for this idea here is just a combination of recent borrowings from Arabic into these languages, along with (what I think are) some inherited cognates that he attributes to earlier borrowings into an originally-unrelated African language. That is, he sees the four cited languages as "sisters", but does not see their relationship to Arabic in terms of the slightly more distant "family connection" of Afro-Asiatic, as more recent scholars would, but instead sees only contact effects at different time depths.

Burton is also wrong (by modern standards) in calling Harari, Galla, Dankali and Somali "sisters":

Harari is ethnologue code HAR, classification Afro-Asiatic, Semitic, South, Ethiopian, South, Transversal, Harari-East Gurage
"Galla" is Oromo, ethnologue code GAX, classification Afro-Asiatic, Cushitic, East, Oromo.
"Dankali" is Afar, ethnologue code AFR, classification Afro-Asiatic, Cushitic, East, Saho-Afar.
Somali is ethnologue code SOM, classification Afro-Asiatic, Cushitic, East, Somali.

Thus Oromo, Afar and Somali really are sisters, but Harari is more of a (not very close) cousin, if the ethnologue classification is accurate.

In a footnote, Burton quotes a passage from the 1850 Swahili grammar of the Anglican missionary Johann Ludwig Krapf, which presents a rather different metaphor for relations among languages, that of linguistic divergence as escape from captivity:

"In the Abyssinian language, especially in the Ethiopic (or Ghiz), and in the Tigre and Gurague, its dialects, we find the Semitic element is still predominant; the Amharic manifests already a strong inclination of breaking through this barrier. The Somali and Galla languages have still more thrown off the Semitic fetter..."

This metaphor may have been suggested to Krapf by the pervasiveness of the slave trade in East Africa during his travels there, and the key role of Arab slave traders. Burton does not seem to notice any contradiction between Krapf's escape from servitude and his own grafting: both are just metaphors for some kind of mixed situation, and he does not seem to be thinking about either one of them very precisely.

The quoted passages come from Appendix II of Burton's two-volume work First Footsteps in East Africa, or, An Exploration of Harar. A handsome hyperlinked e-text version, which according to its footers was "rendered into HTML ... by Steve Thomas for the University of Adelaide Library", has recently appeared on the web. Unfortunately this version omits appendix II, the Grammatical Outline and Vocabulary of the Harari Language, "because of the large number of Arabic characters it contains, which makes it impossible to reproduce accurately." Though it is unreasonable to complain about the quality of a free good, I do want to point out to the University of Adelaide Library that there are several alternative methods for accurate HTML reproduction of Arabic characters, including Unicode.

The story of Burton's trip is an interesting one in itself. When he entered the walled city of Harar on January 3, 1855, it was considered a "forbidden city", closed to Europeans on pain of death. Burton attributed this exclusion not only to a superstition that would "read Decline and Fall in the first footsteps of the Frank," but also to the fact that "at Harar slavery still holds its head-quarters, and the old Dragon knows well what to expect at the hand of St. George."

The ancient metropolis of a once mighty race, the only permanent settlement in Eastern Africa, the reported seat of Moslem learning, a walled city of stone houses, possessing its independent chief, its peculiar population, its unknown language, and its own coinage, the emporium of the coffee trade, the head-quarters of slavery, the birth-place of the Kat plant, and the great manufactory of cotton-cloths, amply, it appeared, deserved the trouble of exploration.

Harar hosted another famous European later in the century. Between 1880 and 1884, and again from 1888 to 1891, the poet Arthur Rimbaud lived there, as a storekeeper and trader. He was apparently not a fan, writing shortly after he arrived for the first time in Harar that "je ne compte pas rester longtemps ici, je n'ai pas trouvé ce que je présumais et je vis d'une façon fort ennuyeuse et sans profits." ("I do not intend to stay here very long. I have not found what I was expecting and I am living in a very boring and unprofitable fashion".)

Now you can buy coffee from the Harar region on the internet. You can find several web sites devoted to the city of Harar, including one from UNESCO and one oriented to prospective tourists. There are web sites for Harari communities in Dallas, the Bay Area, Toronto, Atlanta and elsewhere, and a RealAudio feed for a Harari radio program from Melbourne, Australia. We live in wondrous times.

Posted by Mark Liberman at 05:44 PM

Missing links

The only (somewhat) language-related items in this year's New York Times Year in Ideas were ways to Offload Your Memories in voice and/or text form, the business about Proving You're Human by reading obfuscated text, the idea of Quiet Parties (at which talking is forbidden), and the fad for pseudo-scientifically reading the Body Language of celebrity photos. This is out of 67 or so. And I'm stretching to cite those four. Well, maybe I should add the Pod Car, which promises some interaction with its driver via speech. Is this the best that language science, the language industries, and language culture (high and low) can do?

Posted by Mark Liberman at 04:51 PM

No, appearances of paradox were deceptive

Brian Weatherson's idea about why one should blog one's thoughts instead of just ruminating on them privately is basically that to make them public is inherently stimulative of intellectual progress. Sure enough, as soon as I put down my idea for Brian to reflect on, a brief emailed question from David Beaver and a very short puzzled remark by Brian began to make me see that my example:

(1) Appearances are not deceptive; it only seems as if they are.

is not paradoxical, it's just contradictory. Its problem is not that its truth implies its falsity and conversely, as I confusedly thought; its problem is just that it is false because its second half contradicts its first half. I think that might be all that makes it so mind-bending. While appearing to say something about appearance and reality, it actually says one thing and then takes it back by saying the opposite. That's all the analysis it really needs.

Now why hadn't I seen that before? Because I didn't put down what I was thinking (or trying to think) in writing and expose it to public view. The effort of doing so led rapidly to my being able to see that I was confused. I think that is what Brian was saying in the interview on normblog about why he blogs his thoughts. And he's quite right.

Posted by Geoffrey K. Pullum at 03:19 PM

This is not Middle Earth

I was a big fan of J.R.R. Tolkien when I was a kid. I've enjoyed reading the LOTR books out loud to my seven-year-old, and listening to him read them to me -- especially his Elvish and Orcish accents, which he rightly believes to be much better than my own.

Tolkien's invented languages are the framework on which he built his world. However, there is something about the way he designed the languages of Middle Earth that is both very natural and very wrong.

Tolkien was a philologist specializing in the history of the English language, and Professor of Anglo-Saxon at Oxford University. By the time he was twelve years old, in 1904, he was making up languages for fun. Later in life, he began to invent a world for those languages to fit into, and adventure stories about things that happened in that world.

Three books were crucial in developing Tolkien's linguistic imagination: Joseph Wright's Primer of the Gothic Language; C.N.E. Eliot's Finnish Grammar; and John Morris-Jones' Welsh Grammar.

From his Gothic primer, Tolkien learned about reconstructing Indo-European linguistic history through the comparative method, as it had been developed by 19th-century scholars. His encounter with Finnish opened his mind to the exotic structures of a non-Indo-European language, which he described as "like discovering a complete wine-cellar filled with bottles of an amazing wine of a kind and flavour never tasted before." As a boy, he had seen in Welsh place-names "a flash of strange spelling and a hint of a language old and yet alive", and when he won an English prize as an undergraduate at Oxford, he spent his prize money on a Welsh grammar.

Throughout his life, he imagined new languages and along with them, new systems of writing, new linguistic histories, new literatures, and a new world. All of these were inspired by his study of real-world languages, histories and literatures, where he had the credentials of a serious scholar.

In Tolkien's fantasy world, different languages generally belong to different races -- elves, dwarves, men, hobbits, orcs and others -- who are very different from one another in every other way as well. They look different, they live in different habitats, they do different kinds of work, they are interested in different things, they have different life-spans (elves in particular are immortal), they have different preferred weapons, they live in different kinds of houses and wear different sorts of clothes, they have different sex lives (dwarvish females have beards, are less than 1/3 as numerous as males, and never appear in public), and so on.

Thus Tolkien's races are radically different from one another in biology, language and culture; and across these races, biology, language and culture are well correlated. This was the predominant 19th-century view.

However, even 19th-century scholars in fact knew that biology and language correlate badly if at all. For example, the great American linguist William Dwight Whitney wrote in 1864 that

One of the first considerations which will be apt to strike the notice of any one who reviews our classification of human races according to the relationship of their languages, is its non-agreement with the current divisions based on physical characteristics.

Furthermore, scientific examination of human physical, linguistic and cultural variation generally does not produce well-defined and well-separated bundles of characteristics corresponding to the categories of "race" and "language", but rather complex geographical and social patterns of graded variation in statistical frequencies of traits.

So human reality does not divide us cleanly into dwarves, elves, men, hobbits and orcs, not biologically and not even linguistically. Furthermore, if we insist on the common folk categories of race and language, or do our best to make up new ones with better scientific grounding, we find that biological, linguistic and cultural traits often do not line up.

Nevertheless, people find it very hard to avoid thinking like Tolkien did. It's easy and natural to imagine that intelligent beings divide into well-defined biological subgroups, and that members of these groups tend to have different personalities, different strengths and weaknesses, different cuisines, different ways of talking, and so on. Tolkien's Middle-Earth is not the only imaginary world based on this idea: think about Star Trek, and Star Wars, or even the world of Pokemon.

Francisco Gil-White (of Penn's Psychology Department) has argued, based on his cross-cultural research, that

humans process ethnic groups (and a few other related social categories) as if they were “species” because their surface similarities to species make them inputs to the “living kinds” mental module that initially evolved to process species level categories.

This is not Tolkien's Middle Earth. But most people still think it is, at least sometimes. 19th-century anthropology resurfaced many times as 20th-century fantasy, and the 21st century is continuing the tradition. If Prof. Gil-White is right, this is because 19th-century anthropology simply recapitulates the folk-ontology of ethnicity.

[Note: This post is made from recycled materials, specifically on-line lecture notes from a course on Biology, Language and Culture that I (as Language) taught two years ago with Alan Mann (as Biology) and Greg Urban (in the role of Culture).]

[Update: Here is some serial and reciprocal troll-baiting about the many alternative political interpretations of this aspect of Tolkien's work.]

Posted by Mark Liberman at 07:54 AM

Eating people is wrong

Margaret Marks has a fascinating series of posts on Transblawg (here, here and here) commenting on various legal and linguistic aspects of the Armin Meiwes trial.

Posted by Mark Liberman at 07:14 AM

December 12, 2003

The apparent deceptiveness of the world

This one is for Brian Weatherson, who has some interesting remarks about why a philosopher like him should blog, in a recent interview on normblog (where, incidentally, he rates Language Log in his top three favorite blogs; right back at ya, Brian!). He explains that blogging improves his work by making him do some writing and have some on-line discussion every day and thus stimulating the process of coming up with philosophical ideas. So here is a further idea for him to think about -- an idea that feels to me like it might have a certain amount of philosophical interest, though I have so far made nothing of it, so it is time to turn it over to a working philosopher who will appreciate it and give it a proper home.

Let me explain.

Consider the following statement:

(1) Appearances are not deceptive; it only seems as if they are.

Clearly, if this is true, then it has to be false, and if false, it must be true. Yet it is not a standard liar-paradox sentence like This statement is false, or Everything I tell you is a lie, including this. It does not mention truth or falsity, or refer to itself. It is a metaphysical claim, as far as I can see. It speaks not about language or truth but about the nature of reality. It says (contrary to the old proverb) that reality does not present itself in a way that deceives our senses, and any perception we may have to the contrary is incorrect.

Compare (1) with the famous and much quoted claim of G. K. Chesterton's (from his book Orthodoxy) in (2):

(2) "The real trouble with this world of ours is that it is nearly reasonable, but not quite. Life is not an illogicality; yet it is a trap for logicians. It looks just a little more mathematical and regular than it is; its exactitude is obvious, but its inexactitude is hidden; its wildness lies in wait."

Chesterton's is a coherent metaphysical claim (a very beautiful one). It might well be true. But (1) is an incoherent metaphysical claim, and slams us straight into the brick wall of paradox.

Of course, there may be a way to reduce (1) to a simple liar paradox instance buried deep down inside it, if you analyze it enough, in the right way; but that can hardly be said to be clear. Can such an analysis be convincingly given? That's where the philosophy comes in. I know when I'm out of my depth. Over to you, Brian.

Posted by Geoffrey K. Pullum at 04:00 PM

Good glottochronology

Again and again, the world is presented with another example of the same old drama: a fascinating hypothesis about deep history is bravely advanced and defended by a few methodologically adventurous souls, who come under relentless attack from a posse of hidebound linguistic nay-sayers firing footnotes from their hiding places in the library stacks.

At least that's how some people seem to see the story. The facts, as usual, are more complex and more interesting, even when it's biologists as hypothesizers and linguists as critics. See the recent Language Log posts about Forster and Toth here, and Gray and Atkinson here, here, and here.

Well, this is how science is supposed to work: hypotheses have to be criticized and defended, not just appreciated, for genuine knowledge to advance. But the story is misleading: in fact linguists are far more likely to be intrepid hypothesizers than critical snipers, whether the subject is deep history or social dynamics or epistemology. And I might also need to point out that linguists as a class are unusually open to the interdisciplinary trade in ideas, both as importers and as exporters.

In particular, it's not as well known as it should be that some ethnohistorians routinely use a variety of linguistic methods, including lexicostatistics and glottochronology, and that professional linguists by and large approve and even collaborate.

Edda Fields' dissertation on Rice Farmers in the Rio Nunez Region works out the history and chronology for the migration of various ethnic groups and the development of wet rice cultivation in the mangrove swamps of coastal Guinea, during a period from roughly 2000 BCE to 1880 CE. Edda uses all available sources of evidence, but in this case, much of the evidence turns out to be linguistic. As she writes, "[i]n coastal Guinea, ... [t]here is no viable method for dating oral narratives .. [a]nd ... there is no other historical data that pre-dates the 15th century Portuguese accounts." She also faced current "absence of direct carbon, pollen, climate change and archeological evidence for the early history of the Rio Nunez region." As one method to find historical patterns, she examines the vocabulary associated with the material culture of rice farming, looking to see which languages borrowed terms from which other languages, as evidence about the sources of various innovations. In order to assign dates, classical glottochronology (based to a large extent on word lists she gathered herself in the field) was her main tool. Are the resulting estimated dates exact? Certainly not. Are the estimates worth having? Absolutely.

I think this is terrific work, and would argue that its application of lexicostatistics and glottochronology is entirely appropriate, given the usual caveats about interpretation of the results (which Edda is careful to express).

I should also mention that this work has a vital connection to American history, because of the key role of West African slaves in adapting their wet rice farming methods to the plantations of coastal Carolina in the late 17th century. These were the first commercially successful plantations on the North American mainland, and played a significant role in the early economic development of the British colonies here.

Posted by Mark Liberman at 08:40 AM

December 11, 2003

Ticks and tocks of glottoclocks

As a footnote to Bill Poser's lucid discussion of the recent Nature article by Gray and Atkinson, Language-tree divergence times support the Anatolian theory of Indo-European origin, I thought I'd put up a few details from earlier work on rates of vocabulary retention.

Note, by the way, that I don't mean any of this as a refutation of Gray and Atkinson's work. Their (very brief) paper and their bibliography make it clear that they are quite familiar with the relevant literature and have in some way taken account of these issues in their modeling. Their basic innovation is the application of a "model-based bayesian framework," which they assert "allowed us to ... estimate divergence times without the assumption of a strict glottoclock." I can understand in principle how this might work, but the problem, as Bill explained, is that from their paper it's not possible to determine in any detail just what they did, and so it's hard to evaluate their conclusions. Like Bill, I look forward to learning more about this work. From their references we can learn a bit more about the class of bayesian models they used (whose equations they do not provide in the published paper), but it's clear that a fuller evaluation of their time estimation methods will have to wait on a fuller publication of their research and/or similar explorations by others.

I should note also that Nature posts on its web site "supplementary info" about papers that it publishes, and in this case it would have been very appropriate to supply a fuller account of the work, since the web site presumably does not suffer from the severe space limitations of the paper journal. Unfortunately in this case the supplementary information is just a table of the "[a]ge constraints used to calibrate the divergence time calculations, based on known historical information". This table makes it clear that the authors worked hard to be careful in establishing dates to provide a sort of scaffolding for their bayesian framework, and provides another indication that this is a serious and interesting work, very different in its flavor from the work of Forster and Toth recently published in PNAS. However, it still doesn't tell us what model they actually fit!

Swadesh (1952) estimated 14% change (86% retention) per millennium in his defined vocabulary. Lees (1953) estimated 20% change (80% retention) per millennium:

Thirteen sets of data, presented in partial justification of these assumptions, serve as a basis for calculating a universal constant to express the average rate of retention k of the basic-root morphemes:
k = 0.8048 ± 0.0176 per millennium,
with a confidence limit of 90%.

Here's some of Lees' data:

Language	Years	Words	Cognate	Rate (per KY)
English	1000	209	160	.766
Latin/Spanish	1800	200	131	.790
Latin/French	1850	200	125	.776
German	1100	214	180	.854
Middle Egyptian/Coptic	2200	200	106	.760
Greek	2070	213	147	.836
Chinese	1000	210	167	.795
Swedish	1050	207	176	.853
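The rate column appears to be the observed retention scaled to a 1000-year span. Here is a minimal Python sketch under that assumption, namely that rate = (cognate / words) raised to the power 1000 / years. The function name retention_rate is made up for illustration, and this formula is my reading of the table rather than something quoted from Lees; it reproduces several rows to three decimals, but not necessarily all of them.

```python
def retention_rate(years, words, cognates):
    """Per-millennium retention rate implied by one row of the table.

    Assumption (not quoted from Lees): the fraction of cognates retained
    after `years` is rescaled to 1000 years by exponentiation.
    """
    observed = cognates / words          # fraction retained over the whole span
    return observed ** (1000.0 / years)  # rescaled to one millennium

# (language, years, words, cognates) copied from the table above.
rows = [
    ("English", 1000, 209, 160),
    ("Latin/French", 1850, 200, 125),
    ("German", 1100, 214, 180),
    ("Greek", 2070, 213, 147),
    ("Chinese", 1000, 210, 167),
]

for language, years, words, cognates in rows:
    print(language, round(retention_rate(years, words, cognates), 3))
```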

Here are some estimated per-millennium retention rates for "more retentive" languages, taken from Bergsland & Vogt (1962):

Language	100-word list	200-word list
Icelandic (rural)	.990	.976
Icelandic (urban)	.980	.962
Georgian	.965	.899
Armenian	.978	.940

Here are some less retentive ones, cited in Guy (1994):

Language	Source	Time period	Retention	Rate per 1KY
East Greenlandic	Bergsland & Vogt	600 years	.722	.34
Muyuw (Woodlark Island)	David Lithgow	"one generation"	~.80	~.06

I'm not sure how well documented the last data point (suggesting 20% vocabulary replacement in one generation) is, but it certainly seems that a fairly wide range of rates can be observed empirically.

However, note that because the assumed model is stochastic, and the numbers (of words) are small, some variation is to be expected given a constant underlying rate, and in principle any actual replacement rate could be observed. The half-life of oxygen-15 is 2 minutes, but if we start with only two atoms, it's by no means certain that after two minutes, exactly one atom will have decayed and one will be left. That is the most likely single outcome (probability 1/2), but it is just as likely that something else will have happened: both atoms will have decayed (probability 1/4) or neither will have (probability 1/4). Of course, as the number of atoms increases, the likelihood of substantial deviations from 50% total decay after 2 minutes goes down.
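The two-atom case is small enough to enumerate exactly. The following Python sketch lists every possible outcome after one half-life and adds up the probabilities; the function name decay_distribution is made up for illustration, and each atom is assumed to decay independently with probability 1/2.

```python
from fractions import Fraction
from itertools import product

def decay_distribution(n_atoms, p_decay=Fraction(1, 2)):
    """Exact probability of each possible number of decayed atoms.

    Hypothetical helper: enumerates all 2**n_atoms outcomes, assuming
    each atom decays independently with probability p_decay.
    """
    dist = {k: Fraction(0) for k in range(n_atoms + 1)}
    for outcome in product([True, False], repeat=n_atoms):
        prob = Fraction(1)
        for decayed in outcome:
            prob *= p_decay if decayed else 1 - p_decay
        dist[sum(outcome)] += prob  # sum(outcome) counts the decayed atoms
    return dist

for n in (2, 10):
    print(n, "atoms:", decay_distribution(n))
```

With two atoms, "exactly half decayed" has probability 1/2. With ten atoms the distribution already bunches more tightly around half, in the sense that outcomes far from 50% become rare.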

What kind of variation of word loss rates over a millennium do we expect, given a "strict glottoclock"? Well, there are 50 20-year "generations" in a millennium, so if a word has probability R of being retained for a millennium, it should be retained with probability R^(1/50) per generation. I ran a trivial little simulation, in which each of 200 words has a fixed chance of being lost in each of 50 generations, and got an empirical distribution of outcomes like this:
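A simulation along these lines can be sketched as follows. This is a minimal reconstruction under stated assumptions, not the original script: the function names, the number of trials, the random seed, and the choice of R = 0.8 (roughly Lees' figure) are all illustrative.

```python
import random


def simulate_retention(n_words=200, generations=50,
                       rate_per_millennium=0.8, rng=None):
    """Return the fraction of words surviving one simulated millennium.

    Each word is independently retained in each generation with
    probability R ** (1 / generations), as described in the text.
    """
    rng = rng or random.Random()
    p_generation = rate_per_millennium ** (1.0 / generations)
    retained = 0
    for _ in range(n_words):
        # A word survives the millennium only if it survives every generation.
        if all(rng.random() < p_generation for _ in range(generations)):
            retained += 1
    return retained / n_words


def empirical_distribution(trials=500, seed=0, **kwargs):
    """Run the simulation `trials` times; return the sorted retention rates."""
    rng = random.Random(seed)  # assumed seed, for repeatability only
    return sorted(simulate_retention(rng=rng, **kwargs) for _ in range(trials))


if __name__ == "__main__":
    rates = empirical_distribution()
    n = len(rates)
    print("mean retention:", sum(rates) / n)
    print("5th percentile:", rates[int(0.05 * n)])
    print("95th percentile:", rates[int(0.95 * n)])
```

With 200 words and a true rate of 0.8, the simulated per-millennium rates should cluster around 0.8 with a spread of a few percentage points, which is the sense in which a strict clock still predicts some variation.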

A distribution like this is quite consistent with the values in Lees' sample (as he well understood), but both the more retentive and less retentive examples from e.g. Bergsland and Vogt are rather unlikely to come from such a source. That's the essential basis for the conventional skepticism about the validity of the dates emerging from classical glottochronology: if two languages A & B share 80% of the Swadesh 200 list, then if the underlying rate is .8 retention per millennium, A and B probably separated about 1000 years ago; but if the underlying rate is the .34 retention per millennium documented for East Greenlandic, then A and B most likely separated about 210 years ago; and if the underlying rate is the .976 retention per millennium documented for rural Icelandic, then the most likely time for the separation of A and B is about 9200 years ago. These look like pretty big uncertainties; and the number of cases for which we have good calibration of historical "glottoclock rates" is not very large; and there are almost certainly significant effects of speech community size, extent of contact, type of social organization and so on, which are not very well varied in our sample of calibration cases.
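The arithmetic behind those three dates can be checked with a few lines of code. The sketch below follows the simplification used in the paragraph above, treating the shared fraction as retention along a single line of descent, so that shared = rate ** (years / 1000). The function name is hypothetical.

```python
import math


def separation_years(shared_fraction, retention_per_millennium):
    """Years needed for retention to fall to `shared_fraction`.

    Solves shared_fraction = rate ** (years / 1000) for years.
    """
    return 1000.0 * math.log(shared_fraction) / math.log(retention_per_millennium)


# Rates quoted in the text; the labels are for display only.
for label, rate in [("Lees-style average", 0.8),
                    ("East Greenlandic", 0.34),
                    ("rural Icelandic", 0.976)]:
    years = separation_years(0.8, rate)
    print(f"{label}: rate {rate} -> about {years:.0f} years")
```

The same 80% overlap thus maps to dates that differ by a factor of more than forty, depending only on which calibration rate is assumed.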

[Note that large differences in the "glottoclock" rate along different branches of a language family may also cause relative cognation proportions to be incongruent with the true historical descent structure. As Bill Poser pointed out to me, in Bergsland and Vogt's work this led to the lexicostatistical conclusion that the biggest split among Eskimo languages is between East and West Greenlandic, which is not plausible on other grounds.]

Some very smart people have worked on ways to get around these problems or at least to quantify them carefully. One of these whom I know well is Joe Kruskal, who made significant contributions in many areas of applied mathematics, and who was always very scrupulous in his attention to the facts of the various applications areas he worked in, and very careful in the claims that he made for the results he achieved. My own evaluation has been that the past attempts to rescue glottochronology have produced results that are interesting but remain subject to great uncertainties of interpretation, and I don't think that Joe would disagree with that. I'll be very interested to see how far the new ideas of Gray and Atkinson -- or perhaps it's better to say their application of ideas that have been developed in the biological modeling literature -- go towards reducing those uncertainties.

[Note: there is a useful discussion and bibliography on "classical" lexicostatistics and glottochronology on Paul Black's web site here.]

Posted by Mark Liberman at 09:53 PM

December 10, 2003

Irresponsible Punditry

The paper "Language Tree Divergence Times Support the Anatolian Theory of Indo-European Origin" discussed in a previous posting was the subject of an article by Boston Globe staff writer Gareth Cook in the Thanksgiving Day issue (p. A16). The title, "A new word on birth of Western languages", is a little odd since the Indo-European languages include not only most of the languages of Europe but most of the languages of such non-Western countries as Iran (Persian), Armenia (Armenian), Afghanistan (Dari, Pashto), Pakistan (Panjabi, Urdu), India (Sanskrit, Hindi, Gujarati, Bengali, Assamese, Marathi, Oriya), Nepal (Nepali), Bangladesh (Bengali), and Sri Lanka (Sinhalese), as well as Kurdish, spoken in Turkey, Iran, and Iraq, and Tocharian A and B, once spoken in Chinese Turkestan, but the article itself is pretty good. It does, however, contain one irritating bit:

Gray was trained as a biologist, not a linguist, which some scientists said could explain the generally cautious reception yesterday's paper was greeted with among linguists. "Partly, I think they are irritated", said Luigi Luca Cavalli-Sforza, who is a leading expert on historic population migrations and a professor emeritus at Stanford Medical School. "It is a very good paper."

Cavalli-Sforza is indeed a distinguished geneticist, whom I first encountered via his book Cultural Transmission and Evolution: A Quantitative Approach which I read with pleasure many years ago and still own. But as far as I can tell Cavalli-Sforza has no reason whatever to think that the cautious reaction of linguists to the paper was based on anything other than legitimate scientific issues. There are some, discussed in my previous posting. I know of no evidence that anyone's reaction was based on irritation. He's just blowing smoke.

For the record, here are the comments that I sent Gareth Cook when he was writing the piece in question. It seems to me that they make a few technical points, are in many ways positive about the paper, and withhold final judgment until I can find out more about what exactly the authors did. It's fair to characterize them as cautious, but I don't see any irritation. You can judge for yourself.

The paper by Gray and Atkinson is a serious paper. It shows familiarity with the literature and attempts to address the known problems with glottochronology and methods of dating based on lexical turnover. And they used a reasonably reliable source of data and information about cognation. They have also taken a number of precautions to ensure that their results are not the result of chance and to see that their assumptions are not influencing the results. So it compares quite favorably with the junk that we sometimes see in which people apply a technique from another field to a problem that they don't really understand, often with poor data sources.

The main question that this paper leaves me with is whether their technique adequately addresses the fact that the rate of lexical replacement is known not to be constant. They acknowledge the issue and say that "the assumption of a strict clock can be relaxed by using rate-smoothing algorithms to model variation across the tree." The reference they give is to what appears to be the manual for a piece of software. I'm not familiar with this, so on short notice I simply can't tell whether it adequately addresses the problem.

The other problem, pointed out by Don Ringe, is that it isn't clear what exactly they have done with their cognate sets. Dyen et al. contains Swadesh 200 word lists for 95 languages. They excluded 11 languages that Dyen et al. did not code, which leaves 84. Then they added Hittite, Tocharian A, and Tocharian B. So they should have 200 cognate sets across 87 languages. If they were using methods of the sort I am most familiar with, each cell in the matrix would have a value indicating either "for this lexical item this language retains a reflex of the reconstructed Proto-Indo-European etymon" or not. But that can't be what they have done since they talk about 2,449 cognate sets. So they've apparently split each gloss into multiple cognate sets, and they don't explain how.

I have an idea of what they might have done, but it's just a guess. Perhaps they have used each subset of cognate words as a "cognate set". For example, the PIE word for "bear" is believed to be the ancestor of Latin ursus, Greek arktos, Sanskrit rkshas, Welsh arth (as in the name Arthur) etc. However, this doesn't show up in Germanic and Balto-Slavic. Germanic languages have words like English "bear", German baer, Old Norse bjorn - evidently they referred to bears as "the brown ones". In Slavic you get words like Russian medved, literally "honey eater". Presumably, this reflects taboo-ing of the original word for bear. Anyhow, in a case like this they might have treated cognates of ursus as one cognate set, cognates of bear as another cognate set, and cognates of medved as a third cognate set. There's nothing wrong with that, as far as it goes. But "has a cognate of ursus", "has a cognate of bear", and "has a cognate of medved" are not independent - e.g., if a language has a cognate of ursus as its word for "bear", it doesn't have a cognate of medved. So if your technique assumes the independence of the characters, you can't do this.

It's quite possible that whatever they've done is not problematic - I can't tell because they don't give sufficient detail.

A minor comment is that it is a little odd to use the Romance languages when they are known to descend from Latin, which of course is well attested. Using the daughters rather than the ancestor can only add noise. Presumably they didn't use Latin because Dyen et al. don't give Latin data.

We expect scientists to provide objective commentary based on a knowledge of the subject, not insinuations about the alleged motives of those who disagree with them. We can leave that to the Postmodernists in the literature departments. So you might think that Cavalli-Sforza's remark was merely an addendum to a discussion of the scientific issues and suppose that the newspaper is at fault for reporting only the fluff. That probably isn't what happened though: this wouldn't be the first time that Cavalli-Sforza has substituted unfounded, ad hominem remarks for intelligent commentary.

Cavalli-Sforza is a staunch defender of the late Joseph Greenberg, whose 1987 book Language in the Americas is generally considered by historical linguists to be worthless, partly because its methodology is invalid, and partly because Greenberg's handling of the data is so appallingly bad. Cavalli-Sforza hasn't made any attempt to defend Greenberg's data, and his attempts to defend Greenberg's methodology contain nothing of substance. Let's take an example. In his book Genes, Peoples, and Language he says (pp. 137-138):

...some anti-Greenberg linguists believe it is impossible to posit a quantitative relationship between any two languages. By disallowing reliable measurements, and by limiting the relationship between two languages only to "related or not related", the American linguists opposing Greenberg have ruled out the possibility of hierarchical classification, an essential prerequisite to taxonomy.

Now, this is perfect nonsense. I think it is fair to say that all of the linguists who have criticized Greenberg's work believe in degrees of relationship, that is, that some languages are more closely related to each other than to other languages. I have never heard ANY linguist express the view described by Cavalli-Sforza. Virtually every book and paper on historical linguistics assumes a hierarchical classification. To claim that historical linguists are critical of Greenberg because they don't believe in degrees of relationship is like claiming that biologists are critical of Lysenko because they don't believe in evolution.

It is also striking that such an amazing claim is supported by no evidence. Cavalli-Sforza doesn't even name any of the linguists who allegedly hold this amazing view, much less supply quotations from their work or references to it. That's because there isn't any supporting evidence.

Just to be sure, I asked Cavalli-Sforza if he could offer any support for his claim:

From wjposer Sat Feb  1 13:26:19 2003
To: cavalli@stanford.edu
Subject: degrees of relationship
Content-Length: 698
Status: RO

Dear Professor Cavalli-Sforza:

In your book Genes, Peoples, and Language at pp. 137-138 you say:

      ...some anti-Greenberg linguists believe it is impossible to
      posit a quantitative relationship between any two languages.
      By disallowing reliable measurements, and by limiting the
      relationship betweeen two languages only to "related or
      not related", the American linguists opposing Greenberg have
      ruled out the possibility of hierarchical classification, an
      essential prerequisite to taxonomy.

I wonder if you could supply the names of the linguists who
take this position and references to publications in which
they have done so. Thank you.

Bill Poser

Here is his reply:

From cavalli@stanford.edu  Sun Feb  2 03:15:05 2003
Return-Path: 
Received: from smtp-roam.Stanford.EDU (smtp-roam.Stanford.EDU [171.64.14.91])
	by unagi.cis.upenn.edu (8.10.1/8.10.1) with ESMTP id h128F4D23142
	for ; Sun, 2 Feb 2003 03:15:05 -0500 (EST)
Received: from smtp-roam.Stanford.EDU (localhost [127.0.0.1])
	by smtp-roam.Stanford.EDU (8.12.6/8.12.6) with ESMTP id h128F3gG029435
	for ; Sun, 2 Feb 2003 00:15:04 -0800 (PST)
Received: from cavalli.stanford.edu (DNab42a421.Stanford.EDU [171.66.164.33])
	(authenticated bits=0)
	by smtp-roam.Stanford.EDU (8.12.6/8.12.6) with ESMTP id h128EpM5029427
	(version=TLSv1/SSLv3 cipher=DES-CBC3-SHA bits=168 verify=NOT);
	Sun, 2 Feb 2003 00:14:58 -0800 (PST)
Message-Id: <5.1.1.5.2.20030202001114.01aa9378@localhost>
X-Sender: cavalli@localhost
X-Mailer: QUALCOMM Windows Eudora Version 5.1.1
Date: Sun, 02 Feb 2003 00:14:47 -0800
To: William J Poser 

From: "L. Luca Cavalli-Sforza" 
Subject: Re: degrees of relationship
Cc: anca.ruhlen@forsythe.stanford.edu
In-Reply-To: <200302011826.h11IQKH06074@unagi.cis.upenn.edu>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"; format=flowed
Status: RO
Content-Length: 1246

Dear Dr. Poser,
My understanding of the errors by American linguists who criticized 
Greenberg is mostly derived from Greenberg's 1987 book on Lnguage in 
Americas, Stanford University Press. You may get a better response from 
Dr.Merritt Ruhlen, to whom I am cc-ing this letter.
Sincerely
                         Luca Cavalli-Sforza

He provides no support for the claims in his book, no references, no names. In fact, he admits that he doesn't have any firsthand knowledge of what he is talking about and has taken his views from Joseph Greenberg, the very person the critics are criticizing. Caveat lector.

Posted by Bill Poser at 10:33 PM

Dating Indo-European

The journal Nature (vol. 426, 27 November) contains a paper entitled "Language Tree Divergence Times Support the Anatolian Theory of Indo-European Origin" by Russell D. Gray and Quentin D. Atkinson that has attracted a good deal of interest. The paper dates the initial divergence of the Indo-European language family to 8700 years ago, with Hittite as the first language to split off. This they take to support the theory that Indo-European originated in Anatolia and that Indo-European languages arrived in Europe with the spread of agriculture. They take this to argue against the alternative "Kurgan hypothesis", according to which the "Kurgan Culture" of the steppes was Indo-European speaking, though they say that it is consistent with the view that the Kurgan people represented a branch of Indo-European.

If it is really possible to obtain accurate dates for linguistic divergence from linguistic data, that would be very nice. It would provide a useful new tool for the study of prehistory. However, the reactions of historical linguists to this paper have generally been skeptical. I'll explain why.

Languages change in a number of ways: words are replaced by entirely different words, a word shifts in meaning, one grammatical construction is replaced by another. Much language change is systematic: a certain sound, in a certain context, changes into another sound in every word in which it occurs in that context. This is known as sound change, and the rules that describe the changes are known as sound laws. For example, Latin /k/ became French /sh/ (spelled <ch>) before the vowel /a/. Thus, Latin castellum became French chateau, Latin campus became French champs, Latin captivus became French chetif and so forth. To take another example, Japanese used to allow /y/ before /e/, as in yen, the unit of money, yedo, the old name for Tokyo, and yezo, the old name for Hokkaido, which shows up in the scientific neo-Latin adjective yezoensis "of or pertaining to Hokkaido", as in Porphyra yezoensis, the scientific name for susabinori, one of several species of the seaweed you eat wrapped around sushi. (Incidentally, susabinori and its relatives have a fascinating life history, which you can learn about here.) However, /y/ disappeared before /e/, so these words are pronounced /en/, /edo/, and /ezo/ in modern Japanese. Sound change plays an important role in working out the family trees of related languages. Languages that have undergone the same sound changes are likely to have been a single language at the point at which they underwent it. Interactions among sound changes can tell us the order in which they occurred.

Although sound change is the main way in which words change over time, it is also possible for a word to be replaced by an entirely different word. For example, the Proto-Indo-European word for "dog" was something like *kuon. (The star indicates that this is a hypothetical form.) We reconstruct this form from attested (actually recorded) forms like Greek kuon, Sanskrit shvan, and German hund by asking what proto-form would yield the attested forms after undergoing the sound changes observed in the various languages, and also taking into account changes in word-formation. The direct descendant of this word in English is hound. But at some point the common Germanic word for "dog" took on a more specialized meaning and was replaced, as the general term, by dog, a word whose origin we do not know.

Although we can do a reasonably good job of reconstructing the way in which languages are related to each other, the standard techniques only tell us the order in which the splits occurred; they don't give us dates. The main approach to assigning dates to linguistic divergence events is known as glottochronology or lexicostatistics, proposed in the early 1950s. Glottochronology was based on the idea that words are replaced by entirely different words at a constant rate, just as radioactive atoms decay at a constant rate. To apply the technique, you take a list of basic vocabulary known as the Swadesh list after Morris Swadesh, the linguist who proposed glottochronology, and you translate it into the languages you are working with. You then figure out which words are cognate. For example, if you were to compare English and German, you would record that English "foot" and German "fuss" are cognate while English "dog" and German "hund" are not. When you're done, you count up the number of cognates and compute the fraction of words that are cognate. You then plug this into an equation that allegedly gives you the number of years of separation between the two languages.

The equation is basically the inverse of the equation for radioactive decay, with a time constant based on the observed rate of lexical replacement in a number of languages whose history we know fairly well, primarily the Romance languages.
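Written out, the relationship looks roughly like this. The symbols are illustrative rather than taken from any particular source: c is the fraction of the list that is cognate, r is the assumed retention rate per millennium, and t is the time depth in millennia. The factor of 2 in the second form reflects the usual assumption that both languages have been replacing words independently since the split.

```latex
% Decay along a single line of descent:
c = r^{t} \quad\Longrightarrow\quad t = \frac{\ln c}{\ln r}

% Two languages diverging from a common ancestor:
c = r^{2t} \quad\Longrightarrow\quad t = \frac{\ln c}{2 \ln r}
```

Everything in the resulting date therefore depends on the value chosen for r.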

There are a number of variants of glottochronology, using vocabulary lists of different lengths, different rates of lexical change, and so forth, and a variety of difficulties in applying the technique, but the central problem is that the lexical replacement rate is not constant. The rates observed in languages with a known history vary considerably. For example, studies show that English preserved only 68% of its basic vocabulary over a 1,000 year period, while Icelandic preserved 97%. Time depths calculated using the "standard" rate proved to be far off the mark in a number of test cases. As a result, most historical linguists consider glottochronology to have been discredited. (Further discussion of glottochronology, including problems not mentioned here, can be found in Lyle Campbell's textbook Historical Linguistics: An Introduction at pp. 177-186.)

Gray and Atkinson used an existing database of words compiled by linguist Isidore Dyen (an advocate of glottochronology) and colleagues and used techniques and software developed for work in genetics to construct a family tree and assign dates to it. Their approach is similar to glottochronology in that it makes use exclusively of information about lexical replacement. It differs from glottochronology in the methods used to construct the tree and compute the dates.

This paper avoids many of the problems that frequently arise in work of this type. It shows familiarity with the literature and awareness of some of the problems with glottochronology and related methods. It also uses a reasonably reliable source of data and information about cognation.

Nonetheless, we can't accept these results at face value. One reason is that we're generally skeptical about any sort of purely lexical method such as this because we know that lexical replacement is much more subject to cultural influence, external and internal, than other aspects of language change. It's a little hard to believe that something as peripheral and unsystematic as lexical replacement provides sufficient information not only to reconstruct a realistic family tree but to date the splits. Keep in mind that the DNA sequence that serves as the input for tree construction and dating in genetics contains all of the information about biological change, whereas lexical replacement is a small part of language change.

More specifically, there is the question of whether their technique really deals adequately with the fact that the lexical replacement rate is not constant. We have to keep in mind that we're not talking about just a little bit of variation. As the examples of English and Icelandic show, the range of variation of lexical replacement rates is pretty large. (Unfortunately, the number of languages for which the rate has been determined is not large, so we don't have a good knowledge of the statistical distribution.) The paper does address this. They say that:

the assumption of a strict clock can be relaxed by using rate-smoothing algorithms to model variation across the tree.

but they give only a brief description of the approach, and the only reference is to the manual for the software that they used. The manual can be downloaded from the r8s website, but it isn't all that helpful. It looks like understanding this approach will require reading the papers referred to in the r8s manual and, very likely, experimenting with the program as well. It is possible that using r8s adequately deals with the problem of lexical replacement rate variation, but at this point we can't tell.

Their treatment of the data also raises a red flag. Their data source contains Swadesh 200 word lists for 95 languages, but cognation information is omitted for 11 languages, so they reasonably enough left them out. Then they added data for three languages not contained in the Dyen et al. database: Hittite, the best attested of the Anatolian languages, and Tocharian A and B, the two Indo-European languages attested from Chinese Turkestan in the 6th through 8th centuries C.E. So they should have 200 sets of words across 87 languages. Each cell in the matrix would have a value indicating either "for this lexical item this language retains a reflex of the reconstructed Proto-Indo-European etymon" or not. But that can't be what they have done since they say that they used 2,449 cognate sets. They have somehow split each set of 87 words into an average of about 12 subsets.

They don't say how they did this, but we can guess. What they may have done is to take each subset of cognate words as a "cognate set". For example, the PIE word for "bear" (not on the Swadesh list, just a convenient example) is believed to be the ancestor of Latin ursus, Greek arktos, Sanskrit rkshas, Welsh arth (as in the name Arthur) etc. However, this doesn't show up in Germanic and Balto-Slavic. Germanic languages have words like English "bear", German baer, Old Norse bjorn - evidently they referred to bears as "the brown ones". In Slavic you get words like Russian medved, literally "honey eater". What they may have done is treat cognates of ursus as one cognate set, cognates of bear as another cognate set, cognates of medved as a third cognate set, and so forth.

This is a perfectly reasonable way of describing the data, but you can't use binary characters based on such cognate sets as the input for clustering algorithms because characters like "has a cognate of ursus as its word for 'bear'" are not independent. If, for example, a language has a cognate of ursus as its word for "bear", it doesn't have a cognate of medved.
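The dependence is easy to see if the guessed-at coding is written out. The sketch below is only an illustration of that guess, not the authors' actual procedure: it uses the "bear" example from above, labels each cognate set by one representative word, and assumes each language contributes exactly one word for the meaning.

```python
# Cognate class of the word for "bear" in each language, per the example
# in the text. The class labels ("ursus", "bear", "medved") are just names.
bear_words = {
    "Latin": "ursus",
    "Greek": "ursus",
    "Sanskrit": "ursus",
    "Welsh": "ursus",
    "English": "bear",
    "German": "bear",
    "Old Norse": "bear",
    "Russian": "medved",
}

cognate_sets = sorted(set(bear_words.values()))  # ['bear', 'medved', 'ursus']


def binary_characters(classes, sets):
    """Code each language as one 0/1 character per cognate set."""
    return {lang: [1 if cls == s else 0 for s in sets]
            for lang, cls in classes.items()}


matrix = binary_characters(bear_words, cognate_sets)

print("language".ljust(12), cognate_sets)
for lang, row in matrix.items():
    print(lang.ljust(12), row)

# Under the one-word-per-meaning assumption, every row contains exactly
# one 1, so a 1 in any column forces 0 in all the others.
print(all(sum(row) == 1 for row in matrix.values()))
```

Knowing that a language scores 1 on the ursus character tells you its scores on the other two, which is exactly what independent characters would not allow.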

As I said, this is just a guess as to what they did. It seems pretty likely, since it is the obvious, non-arbitrary way to split up sets of semantically equivalent words, and it would probably produce the right number of cognate sets. But we don't know for sure, and until we do know exactly what they did, we can't decide how much credence to give their results.

The way in which this work was published also raises some issues. Why was this work published as a letter? Clearly, exactly what the authors did, and how what they did addresses the known problems, requires a lengthier discussion than they were able to provide. Letters to Nature are appropriate for announcing important new results using well-understood techniques that don't require lengthy discussion. And why was this published in Nature, a journal with no expertise in historical linguistics? If you're proposing a new technique, you would normally publish a full length paper in a journal with expertise in the area. That way, problems are likely to be caught at the review stage, and unclear points will be clarified prior to publication. A full length publication is much more likely to provide the reader with sufficient information to evaluate the paper, and publication in a journal in the appropriate specialty makes it more likely that people with relevant expertise will read the paper. A letter to Nature makes a nice splash, but it isn't the best way to put a new idea into play and let people kick it around.

We also have to ask how a paper that clearly doesn't contain sufficient information to allow it to be evaluated got published in Nature. It seems to be part of a pattern in which journals that don't routinely deal with linguistics fail to obtain referees with appropriate expertise. Mark Liberman discussed another instance of this in a previous posting about a review in the American Scientist Online of a paper, "Toward a phylogenetic chronology of ancient Gaulish, Celtic, and Indo-European" that appeared in the Proceedings of the National Academy of Sciences. PNAS is equally culpable for publishing a weak paper in the first place; they too failed to obtain appropriate advice.

[Although I alone am responsible for its final form, this note reflects discussion with Morris Halle, Jay Jasanoff, Don Ringe, Sally Thomason, and Tandy Warnow.]

Posted by Bill Poser at 05:20 PM

December 09, 2003

Sic sic sic

The Latin word "(sic)" (meaning "thus") is a device in formal written English to indicate that the foregoing part of a quotation really is accurate, it's not a typo, that's really what the original said. Scholars use it with relish to quote passages from their enemies that contain revealing errors.

But it may not always be so simple to figure out whose error is being pointed out. A review by Edmund S. Morgan in the latest New York Review of Books (December 18, 2003, p.26) quotes Gore Vidal saying this about President Lincoln:

With his centralizing of all power at Washington this "reborn" (sic) union was ready for a world empire that has done us as little good as it has done the world we have made so many messes in.

But as I looked at that "(sic)", I realized I didn't know how to interpret it in this case.

Did it mean that Morgan was telling me that Vidal really said that, he really put "reborn" in scare quotes? Or was it in the original by Vidal, a sign put there by Vidal to say that Lincoln really did use the word "reborn"? (And for you reading this, the above instance of "(sic)" could conceivably have a third possible meaning: that Pullum is telling you it really does have "reborn" in quotes at that point in what Morgan wrote in the New York Review.)

One can only guess at the meaning, because "(sic)" is not used recursively. English does not provide for something like "With his centralizing of all power at Washington this "reborn" (sic) (sic) union was ready . . " to mean that Morgan vouches for the fact that Vidal really did interpolate "(sic)" in order to signal that he (Vidal) vouched for the fact that Lincoln really did use the word "reborn". You could invent a special notation along those lines and explain to the reader what you're doing (one "(sic)" for each quotation level, perhaps), but it isn't there in the structure of the language right now. I've never seen an iterated "(sic)", and would have trouble figuring out how to interpret such a sequence if I did see one.

The lesson is that while computer programming languages are designed for full explicitness about everything they can express, modern standard written English is not. There are limits to the extent to which you can avoid ambiguity, even given all of the context. But you knew that already.

Posted by Geoffrey K. Pullum at 05:47 PM

It's yankees all the way down

Well, for four levels at least.

Over the past month, I've contributed a few posts to the on-going discussion about hierarchical ontologies and the semantic web, while David Beaver recently explored the recursive identity of a beaver played by a beaver played by... A couple of days ago, a foreign student asked me to explain what "yankee" means, and I responded with a traditional jokey definition that makes the word into a sort of semi-recursive identity ontology all its own:

For foreigners, a "yankee" is an American. For American southerners, a "yankee" is a northerner. For northerners, a "yankee" is somebody from New England. For New Englanders, a "yankee" is somebody from Vermont. For Vermonters, a "yankee" is somebody who eats apple pie for breakfast.

You can find versions of this definition on the internet in places as diverse as a recipe for Tuna Roll-ups and the Sylvia Plath forum. The fractal deconstruction of yankeehood generally ends with the pie-eating Vermonters, though I've heard variants that also mention a lack of indoor plumbing. In my experience, the definition is roughly true, though the sociolinguistic details are naturally more complex.

When I was a child in rural eastern Connecticut, it was understood that only some of the people in our village were called "yankees" (which of course had nothing at all to do with the hated baseball team of the same name). Later on, I learned that these people were the descendants of the English immigrants who had settled the area in the late 17th century, but when I was six or so, the characteristics that I associated with "yankees" included keeping a few farm animals on the side, trapping to earn a little extra money from furs, making hooked rugs from old socks, and shooting at garden pests rather than merely cursing at them. Although I participated in such activities with friends and neighbors, mine was certainly not a Yankee family in the local sense, and so it still takes me aback when I realize that some Texan or Virginian regards me as a Yankee.

I suppose that the hierarchy of Yankee significations must have arisen through successive layers of part-for-whole reference, combined with the distillation of prototypical characteristics in the mode of "real programmers" jokes. Both of these are common processes in the history of word meanings, but I can't think of any other word that has achieved so many well-defined contextual layers.

The OED suggests that the oppositions north/south and New Englander/other were implicit from the beginning, since the first two citations by date are:

1765 Oppression, a Poem by an American (with notes by a North Briton) 17 From meanness first this Portsmouth Yankey rose. Note, ‘Portsmouth Yankey’, It seems, our hero being a New-Englander by birth, has a right to the epithet of Yankey; a name of derision, I have been informed, given by the Southern people on the Continent, to those of New-England: what meaning there is in the word, I never could learn.
1775 J. TRUMBULL McFingal I. 1 When Yankies, skill'd in martial rule, First put the British troops to school. Editor's note, Yankies, a term formerly of derision, but now merely of distinction, given to the people of the four eastern States.

The etymology is "unascertained", according to the OED, which nevertheless goes on to say that

[t]he two earliest statements as to its origin were published in 1789: Thomas Anburey, a British officer who served under Burgoyne in the War of Independence, in his Travels II. 50 derives Yankee from Cherokee eankke slave, coward, which he says was applied to the inhabitants of New England by the Virginians for not assisting them in a war with the Cherokees; William Gordon in Hist. Amer. War states that it was a favourite word with farmer Jonathan Hastings of Cambridge, Mass., c 1713, who used it in the sense of ‘excellent’. Appearing next in order of date (1822) is the statement which has been most widely accepted, viz. that the word has been evolved from North American Indian corruptions of the word English through Yengees to Yankees ...
Perhaps the most plausible conjecture is that it comes from Du. Janke, dim. of Jan John, applied as a derisive nickname by either Dutch or English in the New England states (J. N. A. Thierry, 1838, in Life of Ticknor, 1876, II. vii. 124).

In my personal childhood experience, Yankees mostly had sausage, eggs and toast for breakfast. However, the idea of apple pie for Yankee breakfast is historically well founded, as this page on "The American Apple Heritage" explains:

In the primitive colonial American farmhouse, apples were a primary staple of the family diet. Apples would be served as part of a main course, at breakfast, lunch or dinner. During winter months, many households relied heavily on apples for sustenance. ... Apples could be stored longer than other fruits, some for more than six months. Fruit was stored in a Dutch cellar where it never froze under ground. The cellar was constructed at the foot of a rising ground, about 18 feet long and six feet wide. It was walled up about seven feet from the ground and had a strong sod covered roof. The door always faced the south. They buried the apples in fine white sand or covered them with straw on the cellar floor.

It must be in one of those Yankee cellars that Emily Dickinson's Apple stayed snug:

Like brooms of steel
The Snow and Wind
Had swept the Winter Street,
The House was hooked,
The Sun sent out
Faint Deputies of heat---
Where rode the Bird
The Silence tied
His ample, plodding Steed,
The Apple in the cellar snug
Was all the one that played.

Posted by Mark Liberman at 06:48 AM

December 08, 2003

Naked languages

Hey fellow bloggers and assorted fans: a question, taking advantage of this wonderful tool called the internet. Can you identify for me languages that have neither 1) inflections nor 2) tones used to distinguish lexical items or encode grammar?

I first started asking people this question in 1996, and since then, I have found that there are four kinds of language like this: 1) Polynesian 2) some languages of Southeast Asia 3) a few Mande languages in West Africa and 4) some creoles.

Obviously this homology is very rare. But how rare? Are there languages like this in South America, by chance? Does anyone know of any other Niger-Congo languages up on that northerly coast that are like this, contrasting with the better-known Twis and Mendes and Igbos?

If you ask me, a natural language that has become neither like Greek nor like Thai after all this time is one that has some skeletons in its closet. Most creolists find too little interest in the whole issue itself to be inclined to even search for such languages -- but I suspect other linguists might be differently inclined.

Have I missed any? Languages with just a whisper of inflection or a mere handful of tonally-distinguished minimal pairs are okay.

Posted by John McWhorter at 06:00 PM

Let It Snow Words

Besides those 88 English snow words, there's another interesting thing about English words for precipitation and its aftermath -- something that occurred to me a while back in the middle of a lecture on different ways in which languages carve up lexical space. Having a revelation in mid-class isn't always a terrific pedagogical move, but it's fun.

I was explaining words for "snow" in Montana Salish (a Salishan language spoken in Montana...as the name suggests). The language has different roots for snow depending on whether it's in the air (as in "it's snowing") or on the ground; that is, once it hits the ground, it's a different thing altogether. Weird, right? I mean, snow is snow, regardless of whether it's more or less vertical and moving or just lying there. So it seems exotic for a language to make a distinction like the Montana Salish snow distinction.

And then it struck me that English does in fact distinguish precipitation from the stuff after it hits the ground. But only unfrozen precipitation: rain is rain only when it's coming down. Once it's stopped falling, and is on the ground or your roof or your hat, it's no longer rain, even if it's splashing around: it's water, puddles, part of a lake -- but it's not rain. And this certainly isn't exotic; it's just normal (to English speakers). Conclusion: English is just like Montana Salish, only less consistent. Well, just like Montana Salish except for those four pharyngeal consonant phonemes and the glottalized resonants and the 8-consonant syllable onsets and the special terms for in-laws after the death of the connecting relative and...

Posted by Sally Thomason at 03:00 PM

Streamlined cognition?

For me, Bill Poser's examples of Carrier's many words about beavers are well-taken -- while I assume none of us regard these words as evidence of any peculiar acuity on the part of Carrier speakers, it can certainly be interesting to see where languages split hairs, to remind us of the diverse ways of being human.

A favorite example of mine here is Tzeltal, which though having only about 3000 roots in its vocabulary (with morphology recruiting them as various constituents, of course), has a neat array of words for EAT. The general word is TUN, but then there is LO' for eating bananas and soft things, K'UX for beans and crunchy things, WE' for tortillas and bread, TI' for meat and chilis, TZ'U' for sugarcane, and UCH' for corn gruel and liquids. (This is from an article by Penelope Brown.) Just think -- we only break out words like CRUNCH and GNAW for narrative extravagance, while to a Tzeltal speaker this seems downright crude.

But in a follow-up to my musings on how many interpret such findings, I can't help imagining someone assuming that this meant that Tzeltals value eating more than other people. I remember a linguist who had worked on an Austronesian language of Oceania marvelling that the language she had studied had lots of words for motion, and that for her this seemed tied to the fact that they were always "going, going." But then what about the famously subtle distinctions that Slavic languages make in this same area? Are Russians especially well known for being nomadic, or for "going, going" more than we do? Have they ever been?

As it happens, I have been reading up lately on a current of Whorfian works from the past six years or so that suggest that the old guy was not COMPLETELY out of his mind. Thank you to Mark Liberman for his references, for example, to a paper by Lera Boroditsky et al. showing that speakers of languages with grammatical gender marking do appear at some level of consciousness to process objects according to sex. How fascinating to find that someone whose language marks TABLE as feminine is more likely to imagine a table depicted in a cartoon as speaking in a woman's voice, for instance. This squares with a late-night question I once asked a Francophone linguist, "Do you think of tables as girls?" He, as skeptical of the extremes of Whorfianism as I am, readily said "Actually, yes."

But one thing worries me about the idea that "language channels thought" in any SIGNIFICANT way. I suppose being a creolist helps nudge my questions in this direction. Namely, if we assume that assorted doodads in a grammar color and enrich speakers' worldviews to an extent worthy of regular citation in anthropology textbooks and presentation to undergrads, then we risk a certain implication for some peoples.

Take, for example, a language like the Native American Hokan variety Atsugewi. To render "The soot flowed into the creek," it would have, in literal translation, "it / by falling / dirt-move / into-liquid / FACTUAL / nominative / soot / creek-to," according to Len Talmy's 1972 dissertation. (My God, do people actually speak these languages????) Atsugewi attends to the fact that dirt was doing the moving, despite the sentence nicely informing us of this later by referring to soot itself. It has a path satellite telling us that the event devolved into the drink, although we could easily get this from the reference to the creek itself, complete with a directional marker. The factual marker establishes truth conditions that an English speaker would consider of little import to signal overtly. And then nominativity.

Atsugewi is dripping with these overzealous path satellites, for example, and some might be tempted to see this as evidence of the speakers' sensitivity to the processes of nature. But then what about languages like Riau Indonesian, as chronicled by David Gil? Here is a natural language with neither inflection nor tone, no articles, barely any tense marking, and a third person pronoun that is neutral to both gender AND number. Overall, Riau Indonesian takes the telegraphic, context-dependent tendency of Southeast Asian languages to a stunning extreme.

For example, AYAM means CHICKEN and MAKAN means EAT. The sentence AYAM MAKAN can mean "The chicken is eating," "The chicken ate," "The chicken will eat," "The chicken is being eaten," "The chicken is making somebody eat," "Somebody is eating for the chicken," "The chicken that is eating," "Where the chicken is eating," "When the chicken is eating," "How the chicken is eating," etc.

No persnickety path satellites, no tables with dulcet voices, no funky uphill/downhill markers of spatial relations, no paradigms of numeral classifiers -- really, not much of anything compared to an Atsugewi, an English, or even many creoles.

Which leads to a question -- are we ready to propose that Riau Indonesian conditions its speakers to be less attentive to the nuances of the world around them overall than Atsugewi did? Some languages attend to far fewer baroque distinctions than others. How would Whorf have handled this?

For example, for all of the attention to Africans' contributions to plantation creole grammars, the fact remains that Saramaccan creole's grammar is an abbreviation of the West African language Fongbe's, not a reproduction of it. Saramaccan has its quirks, as well as the mots justes and idioms that any language has. But one could acquire its basic grammar with a month's instruction, while Serbo-Croatian's would require two years. Yet I itch at the notion of supposing that my best Saramaccan informant has a duller perception of the world than Slobodan Milosevic.

Comparisons like this help me assume that the bells and whistles of more elaborated grammars exert little meaningful influence on how people think. If my erstwhile Russian girlfriend really did think of the tables we ate at as men and the windows we looked out of as hermaphrodites, this gave no evidence of coloring her essence as a human being in any way worthy of sustained attention.

Nevertheless, it is still pretty darned cool that some people distinguish beavers and their ways, or crunching from chewing, as assiduously as we distinguish varieties of computer.

Posted by John McWhorter at 02:01 AM

December 07, 2003

Discourse: tangle or branch?

About a month ago, I posted a discussion of approaches to describing discourse structures. The occasion was the web availability of a paper by Florian Wolf and Ted Gibson, arguing that "trees do not seem adequate to represent discourse structures," and the forthcoming publication of a "treebank" (annotated corpus) embodying their own proposal for representing the network of relationships among phrases in a coherent text. Wolf and Gibson specifically argue against earlier work in Rhetorical Structure Theory (RST), including the RST Discourse Treebank by Lynn Carlson, Daniel Marcu and Mary Ellen Okurowski, which was published last year.

Now Daniel Marcu has written a thoughtful and thought-provoking response to the Wolf/Gibson paper. Anyone interested in these questions should read Marcu as well as Wolf and Gibson. And if you're interested in language, trust me, you should be interested in this stuff.

As I wrote last month, this whole situation is wonderful. Just a few years ago, although we had many interesting theories about structure and meaning above the sentence level, there was no model for discourse coherence that was defined in enough detail, and exemplified extensively enough, that someone like me could figure out how to apply it to new texts with reasonable confidence. Now we have two! And the authors of these different approaches are using their extensive descriptive work to try to address fundamental questions in an empirically responsible way, combining methods from linguistics, psychology and engineering as appropriate. They are also engaging one another's work seriously, respectfully and creatively. This is rational investigation of language as it should be done.

It's clear that neither of these approaches has all the answers, and it's quite likely that they haven't yet even found all the questions. However, this is the kind of investigation that has a chance to solve the problems in the end, while bringing further enlightenment along the way.

Posted by Mark Liberman at 10:03 PM

88 English words from snow

We've just had our first nice snow in the northeast U.S., and so we linguists are bracing ourselves for the inevitable avalanche of observations about the many Eskimo words for frozen precipitation. Geoff Pullum is especially likely to get a few journalistic inquiries. In order to point this old conversation in new directions, let me suggest that everyone is missing a chance here. True, the role of the concept X in a given culture can to some extent be seen in its stock of words for X in various contexts and subtypes, its words for X. But at least as many clues can be found in a culture's use of metaphor and metonymy based on X to name other things, its words from X.

This is part of what John McWhorter was doing in his seminal remarks on the many English words from stand, and what Bill Poser did in his response on Carrier beaver vocabulary, further echoed by my post on the MIT beaver-related lexicon, and (in a different vein) David Beaver's skit on beaver names and beaver identities. When a language resists borrowing vocabulary for borrowed concepts, as Sally Thomason writes about Montana Salish, metaphor and metonymy get worked especially hard; cars wind up being things with wrinkled feet, rather than just being called by the Montana Salish pronunciation of "car". Far from having any problems with new vocabulary, English has famously "on occasion ... pursued other languages down alleyways to beat them unconscious and [rifle] their pockets for new vocabulary". However, English still derives plenty of words by metaphor and metonymy, because no matter how many words you have, no matter how many more you can borrow from your neighbors, it seems that there are always new things to make words for.

As meteorologically relevant evidence, I offer an initial list of 88 English words from snow. I've limited myself to compound words or lexicalized phrases containing the morpheme snow, gleaned from the OED, Webster's 3rd and a few minutes of idle thought. All the words for actual (kinds of) snow have been removed, and I'm ignoring the extensive polysemy of snow and many of its derivatives.

Snow White, glory-of-the-snow, snow angel, snow apple, snow azalea, snow bear, snow berry, snow blindness, snow board, snow boots, snow bunny, snow bunting, snow burn, snow cam, snow castle, snow cat, snow cone, snow crab, snow crash, snow day, snow drop, snow fence, snow finch, snow flea, snow fly, snow fort, snow goggles, snow goose, snow grass, snow grouse, snow guard, snow gum, snow hare, snow job, snow knife, snow lemming, snow leopard, snow lichen, snow light, snow lily, snow machine, snow man, snow mobile, snow mold, snow mosquito, snow pants, snow partridge, snow pea, snow pear, snow pheasant, snow pigeon, snow plow, snow pudding, snow pusher, snow quail, snow report, snow roller, snow rose, snow scald, snow sculpture, snow shoe, snow sleigh, snow snake, snow static, snow suit, snow sweeper, snow thrower, snow tire, snow train, snow trillium, snow vine, snow vole, snow wreath, snow-in-summer, snowball fight, snowshoe hare, snowshoe siamese, snowy campion, snowy egret, snowy lemming, snowy orchid, snowy owl, snowy plover, snowy tree cricket, the Snow Queen, to be snowed in, to snow under.

Many of these words carry big loads of cultural baggage: snow angel, snow man, Snow Crash, Snow White, snow job, and so on.

The variable internal punctuation may fool you into feeling uncertain that these compounds and phrases are really "words": snow suit, snow-suit, snowsuit. But just consider the variable yet specific semantic contributions of the morpheme snow to the meaning of the compound words snow angel, snow man, snow shoe, snow cone, and snow job: "a winged figure imprinted in the snow by lying on the back and waving the arms up and down"; "a humanoid figure made by piling up large snowballs and decorating them"; "flat frames strung with cords and worn on the feet so as to be able to walk in deep snow"; "a confection of shaved ice and syrup, reminiscent of flavored snow"; "an elaborate deception, an overwhelming blizzard or snow drift of insincerity". You couldn't automatically calculate this stuff just from the core meanings of the morphemes snow, angel, man, shoe, cone and job. You have to live in a culture where people use these words to talk (or write) about these things, a culture with shared snow-experiences that are diverse and salient enough to create and maintain so many words from snow.

I suspect that there are at least as many English words from ice -- but not now...

Now, it's time to go make a snowman!

As Grete Tartler wrote (translated from the Romanian by Fleur Adcock)

How sensually we shall relish, under our camouflage of snow,
the simplification of vocabulary!

Posted by Mark Liberman at 07:11 AM

Economist follows Language Log

The Economist (December 6th, 2003, print edition, p.28) has now followed your alert Language Log correspondent in asking why the Plain English Campaign saw fit to award Donald Rumsfeld the Foot In Mouth award. "Sorry, but what is so muddled about that?", they ask. The Guardian too, it reports, has described the Rumsfeldian remark as far from foolish, indeed, "a complex, almost Kantian thought". They are basically right, of course (even if The Economist doesn't seem to know existentialism from epistemology), but a little late. You read it here first.

Posted by Geoffrey K. Pullum at 12:56 AM

December 06, 2003

Verb-modifying "far from" in 18thC pornography

I am pleased to rise to Mark Liberman's challenge: "I'm waiting for someone to point out to me that adverbial far from was used by Winston Churchill, Jane Austen, William Shakespeare and even the author of Beowulf :-)...". I am going to ignore the smilicon at the end here, and take it that he means he would welcome a good example of far from + finite verb in classic works of English literature.

I happen to have a not inconsiderable shelf of fine works of literature from the Victorian period and earlier in my bedroom, and it did not take me long to find this example in John Cleland's 1749 feminist novel of one woman's path to personal discovery, Fanny Hill: Memoirs of a Woman of Pleasure.

Our heroine, after two days alone in Charles' apartment, feeling unwell, and longing to see dear Charles again, decides to ask unsympathetic landlady Mrs Jones downstairs to go out and find him:

The third day my impatience was so strong, my alarms had been so severe, that I perfectly sicken'd with them; and being unable to support the shock longer, I sunk upon the bed and ringing for Mrs. Jones, who had far from comforted me under my anxieties, she came up.

The quotation (my regrets to those of you who were hoping it might include at least some dirty bits) can be seen in context either in my bedroom or at http://eserver.org/fiction/fanny-hill/03.html. It also contains a nice 18thC dangling participle not controlled by the matrix clause subject ("ringing for Mrs Jones") and an instance of strong verb variation (sunk for sank). (I have said this before and I will say it again: the grammarian who looks into a corpus usually learns at least one or two new things from the first relevant sentence, other than the thing that was initially being investigated.)

Anyway, since David Beaver's main point (despite all the distraction of his Google-based arithmetical estimation) was simply that far from is already well established as a pre-modifier of finite VPs, I would say he wins a point here.

My perusal of Beowulf, the immortal bard of Avon, the fiction of Jane Austen, and the non-fiction of Sir Winston continues, though for some reason works of this kind seem not to be so well represented on the bookshelf by the king-size four-poster.

[Note added a bit later] Oh, and by the way, I'm sure you're wondering whether The Cambridge Grammar of the English Language covers this construction? Not exactly, but on page 1132 it comes extremely close. In the first subsection of 4.5 in chapter 13 you will see that very similar modifiers derived from comparative adjective phrases are discussed: example block [32] cites This more than compensated for the delay, etc. A sentence like This far from compensated for the delay is a natural antonymous structure.

Posted by Geoffrey K. Pullum at 02:37 PM

Reichenbach on university and society

Most linguists know Hans Reichenbach through his work on logic, and especially on the semantic representation of tense and aspect. I used to have George Lakoff's copy of Reichenbach's 1947 work Elements of Symbolic Logic, with George's enthusiastic marginal notes. I borrowed it when I was a student, and at some point I lent it to a student in turn. I hope that he or she enjoyed reading it as much as I did. Reichenbach's simple little system of relations among Speech, Event and Reference times was a brilliant success -- I remember reading it and thinking: "Oh. Yes. Now tense and aspect makes sense." It's enough to make anybody believe in the feasibility of linguistic semantics, at least for a while.

Steven Gimbel's recent article "If I Had A Hammer: Why Logical Positivism Better Accounts for the Need for Gender and Cultural Studies" features Reichenbach prominently in another role, as the author of a 1918 essay “Die Socialisierung der Hochschule”. Gimbel "argue[s] that its central arguments are easily and naturally extended from opening the universities of post-World War I Germany to culture studies in the contemporary state of the academy." He also asks why "[i]f ... Nietzsche and Heidegger were the darlings of the most repressive and powerful, and the positivists were those people actively risking their lives and well-being for the emancipation of humans from tyranny...[w]hy did the logical positivist mode of valuation branch off from the fashionable left and become considered its opposite?”

Gimbel frames an argument for "gender and culture studies" based on Reichenbach's distinction between "community" and "society" and his 1918 call for "all academic rights be made independent of class, party, church, sect, race, sex, or citizenship." I'm not sure that I follow all the steps in Gimbel's logic. But I retain enough respect for Reichenbach's Logic to pay some attention to his politics. So read the whole thing!

[via A.L.D]

Posted by Mark Liberman at 02:27 PM

Dictionaries on parade

YiLing Chen-Josephson has an interesting consumer-oriented evaluation of collegiate dictionaries at Slate.

The methodology is simultaneously quantitative and arbitrary. The author allocates up to 12.5 points for "usage guidance" but 25 points for "enjoyment", and rates the dictionaries subjectively on these dimensions. The score (up to 25 points) allocated for word stock is defined more objectively, in terms of the number of words found from a certain list, but the list itself is idiosyncratically defined:

... words that I knew but wanted to understand better (like regret, jealous, and overdetermined); words with disputed usages (including aggravate, disinterested, fortuitous); words with potentially interesting etymologies (e.g., chauvinism, juggernaut, lagniappe); neologisms and slang (e.g., blogger, booty, yay); anything friends had looked up recently (e.g., Panglossian, condominium, alembic); as well as the words I didn't know in the last book I read, J.M. Coetzee's Elizabeth Costello.

An interesting set, but I wonder how the test would come out if someone else tried the same thing.

If I had a horse in this race, I'd hope for a poll of users to determine the importance of various dimensions of evaluation, ratings by a sample of users to award points on those dimensions, etc. Still, Chen-Josephson's approach produces a review that seems informative, even if I don't believe that the apparently quantitative rankings are likely to be stable over similar exercises by other evaluators, or even by the author in another life stage.

One surprising omission: there is no evaluation of on-line or other digital access (though the Merriam-Webster Collegiate's web site and CD-ROM are mentioned). Surely most students among Slate's readership now do most of their word lookup digitally? When I ask (a non-scientific but quite random sample of) Penn students about this, I find that nearly all of them have paper dictionaries -- often bought for them by parents or other relatives -- but few of them use them. When they want or need to know something about a word, they usually look it up on line. So maybe a more careful survey of actual dictionary users and dictionary usage would have been beside the point from the perspective of paper-dictionary market research.

Posted by Mark Liberman at 12:32 PM

Google-sampling: avoiding pseudo-text in cyberspace

Neat! David Beaver uses google-sampling corpus linguistics to argue that "far from" has already become an accepted pseudo-adverb, and that it occurs in Google's sample of the web at a rate of about 1 per 10 million words (roughly as often as Hammurabi or Frege, for example).

Now, I'd already learned (by asking) that younger Americans find nothing at all wrong with phrases like "he far from fulfilled his promise". I could come to like this innovation. We used to be able to say "they nearly succeeded" but not, alas, "*they farly succeeded". Now we can say "they far from succeeded": big deviations get equal adverbial time! Mere syntactic coherence is a small price to pay.

However, I want to warn you aspiring google-samplers to be careful. There are some mean texts out there, kiddies. In particular, you need to watch out for the textual wiles of gambling dens and porn parlors, who create big networks of interlinked web pages in order to boost their google score. Google tries to ignore obvious examples of this sort, so the bad guys hire renegade computational linguists to write programs that churn out pages full of searchable stuff looking enough like real text to fool Google. Stuff like "For example, a progressive jackpot indicates that a tablet a cosmopolitan hoofer. Another oed hestitates, because an ungraciously blindfold optimist a quodlibet of another progressive jackpotistry. When you see the modiste, it means that a restroom hides."

These linguistic grifters (and some other less criminal effects, such as Google's habit of indexing sequences across punctuation) have polluted David's samples to the point that his estimates are off by a factor of about 14. This doesn't invalidate google-sampling as a technique. But you have to watch out!

David used the following reasoning, in my reconstruction:

1) According to Google's index, appropriately filtered, sentences of the form "They far from FiniteVerb ..." are about 10 times commoner than sentences of the form "They ungraciously FiniteVerb ...".
2) The word ungraciously occurs about 10,000 times in Google, "most of which come from 'ungraciously + finite verb'".
3) Therefore (given a few other assumptions that need to be checked!), there are about 10*10,000 = 100,000 occurrences of 'far from + finite verb' in Google's index.

This is an excellent example of creative google-sampling analysis, in form. But the content has a problem -- the samples weren't carefully enough filtered.

Looking at the very same data more carefully, it appears that a better estimate of the count of 'far from + finite verb' in Google's index would be 7,250, not 100,000 (see below for details). If Google indexes a trillion words, roughly, then the frequency of this construction is roughly one in 140 million, not one in 10 million as David estimated.

Of course, if we were serious about this question, we'd want to try some other approaches. For example, we might try inspecting a sample of occurrences of "far from" directly, to see what fraction precede finite verbs. This is harder, as I learned when I was writing my original piece on adverbial far from. Google returns 6.95 million pages for this string, and it's clear that only a very small fraction of these are adverbial uses, as you can see if you look yourself. I checked the first 150 hits and found none. On David's estimate of 100K total "far from" pseudo-adverbs, roughly 1 in 70 should be adverbial, while on my estimate of 7,250, roughly 1 in 1,000 should be. In order to get an accurate enough estimate of the rate of occurrence of a phenomenon like that, we'd have to check a sample of ten thousand pages or more. I'm sure that's why David took the more indirect approach of comparing far from to another word in a particular context where the adverbial ore is enriched, and then trying to scale the results in proportion ... So, the truth is clearly out there, but perhaps we've got enough of it now. Or more than enough; though I'm waiting for someone to point out to me that adverbial far from was used by Winston Churchill, Jane Austen, William Shakespeare and even the author of Beowulf :-)...
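For aspiring google-samplers who like to see the arithmetic spelled out, here is a minimal Python sketch of the comparison above. The figures are the ones quoted in this post; the function names are made up for illustration, page counts are treated as if they were occurrence counts, and the zero-hit probability assumes that the hits checked behave like independent random draws, which Google's ranked results certainly are not.

```python
# Figures quoted in the post. Treating page counts as occurrence counts
# is a simplifying assumption.
FAR_FROM_PAGES = 6_950_000   # Google pages containing "far from"
DAVID_ESTIMATE = 100_000     # David's estimate of adverbial "far from"
REVISED_ESTIMATE = 7_250     # the lower estimate argued for in this post


def one_in(estimate, total=FAR_FROM_PAGES):
    """Return N such that roughly 1 in N "far from" hits is adverbial."""
    return total / estimate


def expected_hits(estimate, sample_size, total=FAR_FROM_PAGES):
    """Expected number of adverbial uses in a sample of the given size."""
    return sample_size * estimate / total


def prob_none_in_sample(estimate, sample_size, total=FAR_FROM_PAGES):
    """Chance of finding zero adverbial uses in the sample.

    Assumes independent random draws (hypothetical; real search
    results are ranked, not random).
    """
    rate = estimate / total
    return (1.0 - rate) ** sample_size


if __name__ == "__main__":
    for label, est in (("David", DAVID_ESTIMATE), ("revised", REVISED_ESTIMATE)):
        print(label,
              "1 in %.0f" % one_in(est),
              "expected in 10,000 pages: %.1f" % expected_hits(est, 10_000),
              "P(none in 150): %.2f" % prob_none_in_sample(est, 150))
```

Under those assumptions, finding no adverbial uses in 150 hits would happen only about 11% of the time if David's figure were right, and about 86% of the time on the lower estimate. That is suggestive, but a sample of 150 is far too small to pin down the rate.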

At this point, most of you readers who are still with me will want to turn your attention to something interesting, like this. But for you aspiring google-samplers, here are the details...

SUMMARY:

Google finds 9,670 pages containing ungraciously, sure enough. But only 15% are human-generated uses of the form "ungraciously+finite verb".

If this sample is typical, then a better estimate of the google count for "ungraciously + finite verb" is actually .15*9670, or about 1450.

The next stage of David's analysis involves "they far from". He suggests that about 200 of 481 google hits for this sequence involve pseudo-adverbial modification of a finite verb. This sequence gets 479 google hits for me (Google gives slightly different results on different trials, for various reasons!). I checked a sample of 40 (pages 1, 5, 13, and 18 of the google hits) and found that 25% (10/40) were genuine pseudo-adverbial examples (see below for analysis of the rest). Thus "they far from" produces .25*479, or about 120 cases.

Finally, there is the count for "they ungraciously". Google gives me 24, all of which seem to be pre-finite-verb cases, as David indicated.

So David's 200*10000/23 = 86,956 should be 120*1450/24 = 7250, or about 7% of the 100K that he rounded up to.
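For google-samplers who want the arithmetic spelled out, here is a minimal sketch in Python of the scaling estimate. The function and variable names are illustrative assumptions; the numbers are the ones quoted in this summary, and the result is only as good as the small samples behind them.

```python
def scaled_estimate(context_hits, word_total, word_in_context_hits):
    """Scale hits for the target in one context up to an estimated total.

    Assumes "far from" + finite verb and "ungraciously" + finite verb
    follow "they" at the same relative rate.
    """
    return context_hits * word_total / word_in_context_hits


# David's figures: 200 "they far from", 10,000 "ungraciously", 23 "they ungraciously".
david = scaled_estimate(200, 10_000, 23)

# Revised figures: sampled fractions applied to the raw Google counts.
they_far_from = round(0.25 * 479)       # 10 of 40 sampled hits were genuine
ungraciously = round(0.15 * 9670, -1)   # 3 of 20 sampled hits, rounded to the nearest 10
revised = scaled_estimate(they_far_from, ungraciously, 24)

print(int(david), int(revised), round(revised / 100_000, 2))
```

The revised figure is sensitive to the sampled fractions: one more or one fewer relevant hit among the 20 "ungraciously" pages would move it by a third.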

Further details and examples are below...

THEY FAR FROM:

25% (10 out of a sample of 40) were genuine pseudo-adverbials like this:

I am sad to report that I am not a huge Dryspell fan. They far from suck or anything, they are just not my cup of tea.

The rest were punctuation-spanning:

...they, far from being stupid, are actually hundvísir "most wise"...

auxiliary-inverted:

They ... emphasized that not only were they far from areas where mercenaries operated, but ...

or copula-deleted:

yo bwoi u mite wanna fix up ur spellings blood, they far from desired. no offence, just relax and type when u is redy, innit man?

UNGRACIOUSLY:

I checked a sample of 20 -- pages #2 and #11 of the google hits, with the sample from page #2 reproduced in full below. Only 35% of them (7 out of a sample of 20) are even human uses of the word "ungraciously" at all! And only 15% (3 out of 20) occur before an active (1) or passive (2) verb in a finite clause.

Another 4 are in non-finite clauses or are post-verbal uses that are not relevant to "far from" and the like, such as "..., she said ungraciously". No one is yet starting to write things like "*..., she said far from", so we can ignore these. The other 13/20 instances of ungraciously in the sample are dictionary entries, word lists -- and especially, on-line gambling pseudo-text pages (like this one for Best Betting), generated by program to fool google and similar search engines.

2nd page of google hits for ungraciously:

conscience would permit, rather ungraciously perhaps, the indulgence of a number of carefully selected desires.

simply ignores smoker simply ignores returned
ungraciously speaking returned ungraciously speaking
returned ungraciously speaking returned ungraciously speaking
parent powers

Future citizenship manner. Chinese children
hoarse groan. Gurgle man who being
murdered Mercy ungraciously late July.
Secretary acknowledged made threats
Thai Post staff.

Summary:-
ungraciously - gracelessly, ungracefully, without graciousness, woodenly

If an exudation behind a durum a stringy derby, then the immoderation beyond the sovietism self-flagellates. When you see a stitchwort, it means that an ungraciously nescient fennel feels nagging remorse. Furthermore, a sympatric fulcrum daydreams, and a consoling wingman phylogenetically a vista.

Trelawny showing Campo Santo settled Life villa Goethe work flew grand spacious Life villa Thy mountains seas vineyards ungraciously rendered gift less ungraciously rendered gift less towers bent dun faint ethereal gloom precious implanting fatal trait representation scene passion

> Ive noted that in Soc. Motts one of our users has been rather
> ungraciously badgered by a number of individuals for an occurance that
> was beyond his control.

Furthermore, the superficies returns home, and the redeeming scientist another farmland. When a cocklebur is hypermetropic, an earlier play as the dealer ungraciously a taffy over the merging. Now and then, an osteoclasis about a pompano alternatively a tangible moniliales.

"Mabbe," observed Jimmie Dale, as ungraciously as before, "mabbe dere's some more t'ings youse don't know!"

Posted by Mark Liberman at 05:41 AM

December 05, 2003

How far from the madding gerund? (100kG)

In Far from the Madding Gerund, Mark Liberman noted Edward Skidelsky writing:

Unfortunately, Lessons of the Masters far from fulfils the promise of its subject.

Mark says this is syntactically odd, since the highlighted part has "no plausible syntactic analysis." He goes on to observe a clever way in which it may be right after all, or at least may become so. He notes that Skidelsky's sentence probably involves the first stages of a language change, specifically a reanalysis of far from as an adverb. I've got news for Mark: you're right. Except the reanalysis has already happened.

The question is whether far from is already an adverb for many speakers (a use in which its spatial connotation has been completely superseded by a function as a modal degree modifier, following a common path of grammaticalization). Let's let Google weigh in, with searches "far from fulfills", "far from fulfils" and "they far from". Here are some of the hits:

Democracy far from fulfills the illusions that drive it, yet, in Winston Churchill's immortal turn of the phrase, it's the worst political system save the alternatives.

The role of this type far from fulfils the required role of such a division.

While they far from guarantee a successful and stress free implementation, they at least put the developer on the right path.

A name and a color scheme are essentials of an army, but they far from complete an army.

The Reds had the start they far from wanted, with Mick Godber having to leave the field for treatment after just 40 seconds after a Vauxhall defender had followed through. But he was back on within three minutes.

They far from fail him when he translates his feelings into images of nature.

So just how common is "far from" + finite verb? Well, "they far from" gives 481 hits, and a quick scan indicates that very roughly 50% of these are of the right sort, say about 200 Googles. We can compare this to another low frequency adverb: "they ungraciously" gives 23 Googles. Now "ungraciously" itself returns about 10 kiloGoogles, most of which come from "ungraciously" + finite verb. Ignoring multiple occurrences of a pattern in the same document, we can make a very rough and ready estimate of the web frequency of "far from" + finite verb: +/- 100kG (i.e. 200G * 10kG / 23G).

How many kG should convince me that some pattern is grammatical? The answer must be complicated, presumably depending on pattern length and abstractness. Then again, I'm surrounded by heretics who suggest grammaticality may be gradient. If grammaticality is a function of frequency, then the one question becomes two: what are the units of grammaticality, and what is the function? Well, heck, these are tricky questions, but let's just suppose the function is identity, and measure grammaticality directly in Google hits. Then we know just how grammatical "far from" + finite verb is. Yup, that's 100kG of grammaticality. I'll be darned if it ain't 30kG more grammatical than madding. An easier question is: why did Mark get all uppity about far from? Why hadn't he already reanalyzed far from as an adverb?

From what I know of Mark, I'm guessing Google's corpus gives a pretty representative sample of the language he encounters. Google scans about 3 billion documents, including a lot of non-text and non-English. I'll call it a nice round billion. Let's say the documents have an average of 1000 words. (I've no idea if this is right.) Then the Google corpus is about a trillion words of English text, and contains about a trillion trigrams. 100,000 of these trigrams are "far from" + finite verb, so this pattern has a frequency of about 1 per 10 million. Now I'm guessing that 10 million is within an order of magnitude of the number of trigrams Mark has ever encountered. So it doesn't surprise me that he just encountered the "far from" + finite verb pattern, but it also wouldn't surprise me if it's his first time.
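The back-of-the-envelope calculation above can be written out as a few lines of Python. This is only a sketch: every input is one of the admitted guesses in the paragraph, and the variable names are made up for illustration.

```python
documents = 1_000_000_000    # "a nice round billion" English documents
words_per_document = 1_000   # a guess, flagged as such in the text
pattern_count = 100_000      # estimated hits for "far from" + finite verb

corpus_words = documents * words_per_document

# A document of N words contains N - 2 trigrams, so at this scale the
# trigram count is treated as roughly equal to the word count.
corpus_trigrams = corpus_words

trigrams_per_occurrence = corpus_trigrams // pattern_count
print(f"about 1 in {trigrams_per_occurrence:,} trigrams")
```

Swapping in the smaller estimate of 7,250 for `pattern_count` makes the pattern correspondingly rarer, which only strengthens the point that a reader might go a long time without meeting it.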

Congratulations Mark! And don't worry - it's much less painful the second time around.

Posted by David Beaver at 11:59 PM

Lexicographical collages

Andrew Radford (Transformational Grammar: A First Course, Cambridge University Press, 1988) sides with the view (it's the wrong view, actually) that the difference between adjectives and adverbs is so slight and so syntactically determined that you can collapse them into one part of speech, the difference between variant forms like satisfactory and satisfactorily being no more significant than the difference between satisfy and satisfies. But that new part of speech embracing both of them needs a name. So Radford proposes the word advective.

Funny, isn't it, how you can just look at a word and know immediately that it is not going to catch on?

It certainly hasn't caught on. Not a single scholar to my knowledge has adopted the term, or even used it a single time, though there are plenty of candidates -- syntacticians like Mark Baker who still maintain the wrong analysis that collapses adjectives and adverbs together. (The word "advective" does get thirty thousand Google hits, but only because of various irrelevant uses of the same letter sequence in senses that have a genuine Latin etymology, in fields ranging from management consulting to earth sciences and biology.)

My friend Jennifer invented a special cocktail for Thanksgiving day this year. It had cranberry juice, vodka, some kind of orangey liqueur... Quite nice. She served it to everyone, and it raised the merriment and affection level of our whole gathering. She invented a name for it, too: She called it a cranberry turktail. Geddit? Not the tail of a cock, but the tail of a turkey, because it was Thanksgiving and we were eating turkey rather than chicken, and there was cranberry sauce, you see, and... That name isn't going to catch on either, is it?

I don't think so, anyway. I confidently predict that ten years from today no syntactician will be talking about advectives and no bartender will be mixing turktails around the last Thursday of November. Mark what I say. Ten years from today you can check Google (or its hypergoogolic successor) on your cell phone (or wrist watch or subcutaneous patch or whatever we're using by then), and you'll see that I'm right. I don't know what makes words catch on, but I know you can't just stick bits of other words together, however ingeniously, and expect an awed speech community to take your lexicographical collage to its bosom.

Posted by Geoffrey K. Pullum at 02:09 PM

Far from the madding gerund

Edward Skidelsky's review of George Steiner's Lessons of the Masters contains this sentence:

Unfortunately, Lessons of the Masters far from fulfils the promise of its subject.

It's clear what this means, and the rest of Skidelsky's text makes a good case that it's true. But it's syntactically odd in an interesting way.

The origin must be the construction "to be far from fulfilling [something]", which is syntactically normal. "Far" is an adjective, and "from fulfilling" is a prepositional phrase. The whole thing is structurally just an adjective with a PP complement, like "full of promise", "equal to the challenge", "hot to the touch", "ready for use", and plenty of others. Like the other examples that I've given, "far from fulfilling" happens to be a cliché or at least a fixed expression, but of course the same construction can just as easily be used in novel ways: "full of cold tapioca", for example, which has not occurred within the ken of google.

What's odd about Skidelsky's sentence is that "far from" has no plausible syntactic analysis. It seems intended to function more or less like an adverb, as in "scarcely fulfils" or "never fulfils". I suppose that the writer got there by transforming "is far from fulfilling the promise" to "far from fulfils the promise", on the model of "is scarcely fulfilling the promise" transformed to "scarcely fulfils the promise". But you can't do that! at least not in general.

What's interesting is that he almost gets away with it. Skidelsky is obviously a good writer, and he missed it. I imagine that the New Statesman, where the review appeared, has editors and even copy editors, and they missed it. I myself read right past it, and got halfway through the next paragraph before an obscure sense of oddness brought me back.

This is a good example of two processes, one a general fact about language change and the other a specific fact about the recent history of the English language or more properly the culture of those who write formal English.

In general terms, this is just structural re-analysis, of the kind that frequently results from the forces created by clichés and fixed expressions of various sorts. When people start using "is far from VERB-ing" as a common way to say "definitely doesn't VERB", the rhetorical effect inevitably creates a sort of shadow analysis in parallel with the original syntax, and it's only a matter of time before the shadow takes over and licenses examples like "far from VERBs". This usually just creates a new lexical item, in this case an adverb "far from", like the vernacular pseudo-adverb sort of in "he sort of fulfils the promise", or the regionalism near to in "I near to died" (google finds 8 instances of "near to died"). In some cases, the result can be the leading edge of a new morphological or syntactic pattern, so perhaps at some point we'll see enough English adverbs of the form adjective+preposition or noun+preposition to trigger a general "rule" for such formations.

This kind of change is common and inevitable. It's one of several forces that tend to create complexity and irregularity in natural language form-meaning relationships, in opposition to other forces that tend to regularize those relationships. I conjecture that explicit instruction in grammatical analysis tends to damp (in formal writing) the effect of these "forces of disorder", limiting them to gradual leakage from patterns that have become well established in the vernacular (where formal instruction is irrelevant). Now that grammatical instruction has been abandoned for several generations, at least in the American educational system, we are likely to see a new era of change within the culture of formal writing. "X far from fulfils the promise of Y" is not a vernacular construction -- nobody talks like that. It's a written-language "mistake" -- or let's say "change" -- characteristic of someone who is very well read and who writes a lot, and who hasn't been trained to parse.

In case the reader is one of those whose education has not provided them with this essential skill, here's a quick lesson:

Q. Please explain how to diagram a sentence.
A. First spread the sentence out on a clean, flat surface, such as an ironing board. Then, using a sharp pencil or X-Acto knife, locate the "predicate," which indicates where the action has taken place and is usually located directly behind the gills. For example, in the sentence: "LaMont never would of bit a forest ranger," the action probably took place in a forest. Thus your diagram would be shaped like a little tree with branches sticking out of it to indicate the locations of the various particles of speech, such as your gerunds, proverbs, adjutants, etc. [Dave Barry, a.k.a Mr. Language Person]

Update: I've asked a few people for their judgments about "far from" and similar sequences as pseudo-adverbs. My provisional conclusion is that there is on-going lexicalization of some particular adjective+preposition sequences, especially those associated with degree modification of scalar predicates. It is also pretty clear that this lexicalization is not stigmatized or marked as vernacular by those who exhibit it. The judgments in the table below should not be taken too seriously, as they represent only my memory of the answers given by perhaps half a dozen informants, all of whom were American students or faculty.

Sentence	Younger speakers	Older speakers
This book far from fulfils its promise.	fine even on reflection, "nothing wrong with it"	bad, especially on reflection
This book close to fulfills its promise.	fine even on reflection, "nothing wrong with it"	bad, especially on reflection
This book distant from fulfills its promise.	obviously bad	obviously bad
This book near to fulfills its promise.	bad, maybe regional dialect	bad, maybe regional dialect
This book sort of fulfills its promise.	OK but informal only.	OK but informal only.
This book kind of fulfills its promise.	OK but informal only.	OK but informal only.

Posted by Mark Liberman at 08:33 AM

Re-naming and Necessity

A blog-entry formerly known as: The Beaver's Second Lesson

[Somewhere in Metropolis, a roving reporter bumps into a big furry guy with buck teeth, who poses for photos and offers the chance of an exclusive interview.]

Beaver: Hey, reporter, I'm famous, wanna take some photos? Or perhaps an exclusive interview?

Clark Kent: Well, I would like to ask you what your reaction was to news of the replacement of the name "Beaver College" with "Arcadia University?"

Beaver: I have an incredibly busy schedule, putting in cute but industrious appearances at fund raisers and sports events all over the continent. So the fact that Beaver College has been renamed just means I get a little more time on my paws.

Clark Kent: Nonsense!

Beaver [aghast]: What do you mean nonsense? I am so busy you wouldn't believe it. I'll have you know that I am the mascot at Babson College and MIT in Mass., Bemidji State University in Minnesota, Bluffton College in Ohio, Buena Vista University in Iowa, California Institute of Technology (a role in which I gained a lot of publicity, as part of the greatest college prank ever) and the nearby Los Angeles Trade-Technical College, Champlain College in Vermont, The City College of New York, Minot State University in North Dakota, Oregon State University, and the University of Maine at Farmington, as well as being the symbol of New York District, and of course, Canada (I even did the Olympics there back in '76). And let's not forget my trans-Atlantic engagements - I'm also the mascot at the London School of Economics. I can tell you, I earn a lot of air miles. Plus some of those double beaver matchups, like the CalTech-MIT games that people like to talk about, are truly exhausting - all done with mirrors. The fact that Beaver College has changed its name...

Clark Kent: Bah! You idiot!

Beaver: Say, you look familiar. Didn't we meet at the CUNY game last week? I was the big brown furry guy jumping up and down right in front of the home crowd.

Clark Kent: Well, apart from my reporting job, I do have a little side position in the Graduate Center at CUNY.

Beaver: I know you. You're no mild mannered reporter. Let's see what's behind that shirt...

[Button popping rip, revealing a giant K on Kent's chest.]

Beaver: Oh so you are Kent after all. [Looks crestfallen.] Hey wait...

[Rips further, revealing the full insignia: SK]

Beaver: [Gasps] You couldn't be [dramatic chords] Super Kripke!

Super Kripke: Au contraire - I must be!

Beaver: Hey man, I'm a real fan of your completeness stuff. And, well, frankly, the fact that you're you changes everything. If you say that it's not the same me who's at all the football games every week, then who am I to argue. You are after all the master of identity. It must have all been one big dream of mine. [Pauses, in deep thought.] But, no, Super Kripke, this time you're wrong.... you forgot about the air miles: I can't possibly have imagined them!

Super Kripke: My dear furry friend, the fact that I'm me is surely no surprise, and far be it from me to argue with your well documented poly-iconicity. As you say, the Beaver is the mascot of all those schools, you are the Beaver, ergo you are at all the games: I see no contradiction there. No, I was simply questioning your contention that Beaver College has been renamed, which is nonsense.

Beaver: Huh? You said so yourself.

Super Kripke: I most certainly did not. There is no such college.

Beaver: Oh, I see. You mean Beaver College no longer exists.

Super Kripke: Beaver College has never existed!

Beaver: [Gasps.] You mean, the name change is retroactive?

Super Kripke: Of course not. The governors of the University are not so powerful as to be able to change history. However, it is no longer appropriate to refer to Arcadia University as "Beaver College". So while Beaver College has never existed, Arcadia University, as the sign at its entrance correctly and proudly proclaims, has been around for 150 years and counting. You see, if it's made of wood and has big black letters on it, there's a good chance it's a rigid designator.

[Super Kripke would gladly have talked till next day,
But he felt that the lesson must end,
And he wept with delight in attempting to say
He considered the Beaver his friend.

While the Beaver confessed, with affectionate looks
More eloquent even than tears]

Beaver: Once again, I have learned in ten minutes far more than all books would have taught me in seventy years. We have sorted out the puzzle of your identity, and the puzzle of Arcadia's. But before we part I have a puzzle for you. You may be you; indeed, I accept that you must be. But I am not me.

Super Kripke: Impossible - let us see who you really are!

[SK starts to rip open beaver suit, revealing... another beaver suit.]

Inner Beaver: You see, I'm just pretending to be me.

Super Kripke: Nonsense again! There must be someone in there doing the pretending.

Inner Beaver: You're very clever, young man, very clever. But it's Beavers all the way down.

Posted by David Beaver at 04:15 AM

December 04, 2003

American names

Mark Liberman's reference to David Beaver's story about the origin of his family name reminds me of a story told by my grandmother. Her family knew a family of Russian Jews originally named Chorny "black". After immigrating to the United States, or more accurately, New York, and somehow having kept their name intact at Ellis Island, they decided to Americanize it. Since their idea of a good American was a German Jew, they changed their name to Schwartz.

Posted by Bill Poser at 09:47 PM

Beaver vocabulary from another culture

As Bill Poser's fascinating post on Carrier beaver words reminded me, both Bill and I are alumni of an institution whose totem has been the beaver ever since 1914. In that year, "Lester D. Gardner 1898 presented the idea to MIT president Richard C. Maclaurin at the annual dinner of the Technology Club of New York." The official reconstruction of Gardner's argument runs like this:

"We first thought of the kangaroo, which, like Tech, goes forward by leaps and bounds. Then we considered the elephant. He is wise, patient, strong, hard working, and like all those who graduate from Tech, has a good tough hide. But neither of these were American animals. We turned to [William Temple] Hornaday's book on the animals of North America and instantly chose the beaver. The beaver not only typifies the Tech [student], but his habits are peculiarly our own. The beaver is noted for his engineering, mechanical skills, and industry. His habits are nocturnal. He does his best work in the dark."

Not surprisingly, MIT students and staff have developed an extensive beaver-related lexicon over the intervening 89 years, just as fascinating in its own way as the Carrier beaver vocabulary. On Bill's account, the Carrier beaver words mostly refer to actual beavers or to aspects of their life and death. By contrast, nearly all the MIT beaver terms appear to refer to symbols or rites of various cultic groups, which are known to be thick on the ground at that estimable institution.

A few examples are given below, as they would be pronounced in the archaic Building 20 dialect, using the pronlex phonemic transcription conventions. (I'll try to add a field for IPA transcriptions as soon as I have time to figure out how to get Unicode effectively past the cross-product of MovableType with the usual range of browsers, operating systems and font inventories...)

+br@s'r@t	ceremonial beaver-shaped ornament of initiates
'biv.R+kcl	ritual chant of female athletes (see below)
+tIm.Dx'biv.R	persona of ritual beaver costume
+biv.R'ban+Spil	traditional call and response ceremony
'biv.R+t@g	system for determining ritual "death"
+biv.R'ri+lez	ceremonial foot races
+stAft'biv.R	ritual gift to newborn children of faculty
+biv.R'kAp	traditional contest with rival claimants of beaver totem

If you don't come from the place where Mass Ave crosses the Charles river, you probably don't know any of this stuff. Heck, even if you do, you probably don't. And you may not have time to click on all the links above. So I'll close with the informative and inspiring ritual chant of the MIT Women's Track and Field and Cross Country squad:

The MIT Beaver Call

I'm a beaver. You're a beaver. We are beavers all.
And when we get together, we do the beaver call!

E to the u du dx,
E to the x, dx.
Cosine, secant, tangent, sine,
3 point 1 4 1 5 9.
Integral, radical, mu, dv
Slipstick, sliderule, MIT!

Go Tech!

[Update: Language Hat and some of his commentators add information about the academic beaver cultures of the West Coast. Their discussion ranges further afield, extending for instance to the pronunciation of "geoduck" (whose derivation from Chinook Jargon is not mentioned) and the sexual practices of slugs, a topic whose ethnography and lexicography no doubt offer many interesting features.

I should also point out that one Philadelphia-area academic institution is not proud of its castorian heritage. As of July 16, 2001, Beaver College became Arcadia University.

And taking my inspiration from David Beaver's Third Maxim of Blog ('Digress.'), I will point out that according to the OED, the word "beaver" itself derives from "OAryan *bhebhrú-s, reduplicated deriv. of bhru- brown", and has given rise to many combined forms in the general vocabulary, including

6. Comb., chiefly attrib., as beaver-fur, -intellect, -kind, -pond, -skin, -wool(= fur); beaver-like adj. Also beaver-board, a trade-mark (U.S.) for a kind of wood-fibre building board; beaver cloth (cf. sense 4); beaver-dam, a dam made by beavers; beaver-eater (see quot. 1771); beaver finish, a finish giving a resemblance to beaver fur; hence, a finish in which the fibres are all laid in one direction; so beaver-finished adj.; beaver lamb, lambskin cut and dyed to resemble beaver fur; also attrib.; beaver-poison U.S., the water-hemlock, Cicuta sp.; beaver-rat, the musquash or MUSKRAT; beaver-root N. Amer., a pond-lily, Nymphoea odorata; beaver-stones, the two small sacs in the groin of the beaver, from which the substance 'castor' is obtained; beaver-tail, the tail of a beaver; also transf.; beaver-tree, Magnolia virginiana, the sweet or white bay of the U.S.; beaver-wood, (a) the hackberry tree of the U.S., Celtis occidentalis; (b) the beaver-tree; the wood of this tree.

while David Beaver has written that his own name "is supposedly a case of very poor translation by English officials helping my ancestors anglicize their Polish family name, 'Kaczka', which means 'duck' ". ]

Posted by Mark Liberman at 12:00 PM

Mis-spelling "grammar"

If I were a curmudgeon, I would point out, in reply to Geoff Pullum's complaint about the mis-spelling of "grammar", that the spelling with an <e> has an honorable history, going back to Middle English "gramere". And if I were more of a curmudgeon, I would point out that the correct spelling of "grammar" is easily remembered if one knows that it goes back to Greek grammatike, which is derived from grammat-, the passive participle of graph- "to write". And of course in Greek, the letters <a> and <e> were pronounced quite differently, so there's no possibility of confusion. And of course anyone who knows the elements of the derivation of Greek from Proto-Indo-European would know that the /a/ of such participles was originally a syllabic /n/, which is even more memorably distinct from /e/. If only everyone still learned Greek and the rudiments of Indo-European, we wouldn't have such errors.

But I'm not, so I won't.

Posted by Bill Poser at 01:56 AM

Hammer, jammer, slammer, stammer, grammar

The local newspaper in my home town, out here on the edge of the Pacific tectonic plate, printed a little story just before Thanksgiving about how a book of which I was co-author, The Cambridge Grammar of the English Language, has won the Linguistic Society of America's Leonard Bloomfield Book Award. (The authors are just tickled pink about that, I should tell you. Bloomfield is one of the true greats of 20th-century American linguistics, and being even vaguely associated with his name made our toes curl with delight.) But to my dismay, the headline in the Santa Cruz Sentinel (11/26/03, uncorrected even in the paper's web archive) was this:

Grammer compendium garners recognition

This misspelling of grammar is incredibly common. And here it got past an author and all the subeditors and printers and appeared in a headline, in a university town, in a story about a grammarian. How could this happen? Nobody misspells hammer as hammar. What's going on?

Taking a lesson from Mark Liberman, I think we should consider quantitative factors. From the elementary 25,000-word dictionary found in /usr/dict/words on most Unix systems (which sometimes contains local modifications), it appears that the number of distinct words in English that end in -er is about five times the number that end in -ar. Interestingly, this ratio of about 5 to 1 stays the same for -mer versus -mar, and for -mmer versus -mmar, and for -ammer versus -ammar. And by the time we are looking at words ending in -mmar we are down to 1. Grammar is an unusual-looking word.
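Anyone who wants to try the count on their own machine could start from something like the Python sketch below. The helper names are invented for illustration, the word-list path is the one mentioned above and may live elsewhere on a given system (often /usr/share/dict/words), and the tiny sample list is made up purely so that the sketch runs anywhere; it says nothing about the real ratios.

```python
def suffix_counts(words, suffixes):
    """Count distinct words (ignoring case) that end in each suffix."""
    distinct = {w.strip().lower() for w in words if w.strip()}
    return {s: sum(1 for w in distinct if w.endswith(s)) for s in suffixes}


def load_words(path="/usr/dict/words"):
    """Read a one-word-per-line list; the default path is an assumption."""
    with open(path, encoding="utf-8", errors="ignore") as handle:
        return handle.read().split()


# Suffix pairs to compare, from shortest to longest.
PAIRS = [("er", "ar"), ("mer", "mar"), ("mmer", "mmar"), ("ammer", "ammar")]

# Made-up sample, NOT the real dictionary. Replace with load_words() to
# run the comparison on an actual word list.
SAMPLE = [
    "hammer", "jammer", "slammer", "stammer", "grammar", "summer",
    "dimmer", "farmer", "sugar", "cellar", "Hammer",
]

counts = suffix_counts(SAMPLE, [s for pair in PAIRS for s in pair])
for er, ar in PAIRS:
    print(f"-{er}: {counts[er]}   -{ar}: {counts[ar]}")
```

Results will vary with the word list used, since such lists differ between systems and sometimes contain local modifications.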

So my guess would be that the frequency of the letter string at the right hand end dominates everything else. Phonologists would doubtless think that people might compare the sound of the word grammar with the sound of the word grammatical to get a clue that it must be an a after the mm (notice, grammatical doesn't rhyme with alphabetical, it rhymes with sabbatical). But people apparently don't do that, despite the fact that it would help so much with spelling if they did. It appears that for human beings, systematic comparison of related word forms is hard, but being sensitive to the frequencies of sequences you have been exposed to is easy.

Posted by Geoffrey K. Pullum at 01:24 AM

cool words

John McWhorter's reminder that we shouldn't go overboard in our admiration for the fact that indigenous languages have highly elaborated and specialized vocabularies for things of interest to them is well taken, but there's no reason that we shouldn't find it fascinating that people know and care so much about things that most of us don't.

Carrier, the native language of a large portion of the central interior of British Columbia, is spoken in an area full of lakes and rivers, in which there are a great many beaver. Not surprisingly, Carrier has a much more extensive vocabulary for beaver than English does. My favorite is k'onih'azi "newly mated beaver couple". Here is some more beaver-related terminology from the Stuart/Trembleur Lake dialect:

tsa	beaver
tsatsul	beaver of mid-sized variety
tsayaz	beaver of small variety
tsati	beaver of large variety
tsachenisboo'	beaver kit
tsata'	adult male beaver
tsa'at	female beaver
tsadiya	mother beaver
tsacho	male beaver that is the boss of a whole area
tsaken	beaver lodge
'utsut	runway from lodge of beaver or muskrat to land
lht'azutnai	pair of beaver lodges built close together behind one dam
'udats'un	beaver harpoon
tsambilh	beaver snare
'ulhtusti	trail over beaver dam
tsata'ti	beaver channel under the ice

The reason that there is a word for "beaver channel under the ice" is that the warm air released by a beaver as he swims causes the ice over the channel to be a little thinner than the ice elsewhere; a hunter or trapper can detect this by tapping the ice and wait or set his traps near the end of the channel.

I bet that many of the people reading this don't know what a "beaver kit" is. That's the English term for the young. But if you don't come from northern North America, and aren't a backwoods person, this probably isn't part of your vocabulary.

Posted by Bill Poser at 01:17 AM

December 03, 2003

Punctuation: the rest of the story

If you are very good, I will tell you the joke that gives the inspiration for the title of the humorous punctuation book Eats, Shoots and Leaves, which Mark Liberman recently mentioned on Language Log. But you have to promise to be good. You also have to be over 13 or accompanied by an adult; the joke failed to receive a rating as suitable family entertainment because of strong language and extreme violence.

A giant panda goes into one of those expensive and pretentious restaurants serving French/Asian fusion cuisine and takes a table for one. The surprised waiter for that table explains unctuously that his name is Marcel, he will be your server tonight, and we 'ave a number of specials (he is French), etc., etc. The panda listens impassively to the list of $27 chili-pepper encrusted swordfish specials and so on, and then orders a delicately flavored dish of young bamboo tips and mixed greenery served with steamed jasmine rice. On finishing his meal, the panda gets up, reaches into his fur for a handgun, brings down the waiter with one shot, and calmly heads for the door.

The head waiter is near the door and exclaims in shock, "Oh, monsieur, what 'ave you done? You 'ave killed Marcel! Why 'ave you done zis, monsieur? You 'ad some problem? Ze service was not acceptable?"

The panda scowls at him and says, "I'm a fucking panda. Go look it up." He stalks out into the night.

The baffled staff huddle round the compact encyclopedic dictionary that they keep on the premises, and turning to Panda, giant, they read this:

Panda, giant. Large bear-like animal, Ailuropoda melanoleuca, with distinctive black and white markings, related to raccoon family. Rare; found only in bamboo forests of Tibet and western China. Eats shoots and leaves.

[(comment by Liberman:) Aha! I knew this joke -- though it's always a pleasure to hear an old, good joke well told -- but I failed to consider its implications. In retrospect, it's clear what "The Book of Bunny Suicides" has to do with punctuation: fur, strong language, pretentious French waiters, gunshots ... I'm still not clear what the connection between semicolons and "Crap Towns" is, though.]

Posted by Geoffrey K. Pullum at 11:47 AM

December 02, 2003

Temples of memory

[Update 12/15/2003: I've placed this update before the rest of this post rather than after it, as I usually do, because it calls the content of the post into question in a fundamental way. I have not edited the post itself.

Lameen Souag has pointed out to me that Dr. Ziedan has posted a series of notes asserting that some of the quotes attributed to him by al-Usbu' were "fabricated", and stating clearly his view that the Protocols of the Elders of Zion is "a racist, silly, fabricated book." Lameen also pointed out that al-Usbu' seems to have withdrawn the article in question from its web site. To his credit, Ziedan links to many of the outraged comments from around the world, including this one. However, he also implies that these comments represent some sort of organized effort against him, speaking of "comments that surfaced unexpectedly on numerous web pages launched by many (most probably zealous Jews)", asking "What is the secret behind this deliberate provocation in the wording chosen by that al-Usbu‘ to cover the news?", and referring to "the striking synchronization of publishing the article in the two Israeli papers" and "the systematic attack launched by websites on the following day".

I was unable to determine from Dr. Ziedan's posts what he thinks the "secret behind this deliberate provocation in the wording chosen by that al-Usbu'" might be. Whatever hidden hand might be directing things at al-Usbu', I'd like to assure Dr. Ziedan personally that what I wrote here on Dec. 2 was not part of any "systematic" effort. No one suggested to me that I write it, or even told me about the facts of the case. I learned about the al-Usbu' article by reading MEMRI, as I do from time to time, as one of many news sources from a wide variety of perspectives. As a matter of fact, I was initially skeptical that a manuscripts expert could believe that the Protocols might be genuine, but after reading the article "WWW and the Informatics Plexus" (quoted in the MEMRI page), I concluded (apparently wrongly) that al-Usbu' was no more misleading about Dr. Ziedan's views in this case than journalists usually are in quoting those that they interview. Dr. Ziedan seems to have removed this article from his website -- though if I have simply been unable to find it, I apologize for not looking effectively enough.

In the third of his notes, Ziedan writes that "what was mentioned in the article is groundless and with no proof. I will return to the question of furnishing proofs in detail in my next argument about the whole Jewish issue." This is a reference to the fourth of his notes, which is listed as "coming soon". I look forward to reading it.]

The Internet Sacred Text Archive (recently cited by Language Hat for the Complete Corpus of Anglo-Saxon Poetry) has just added The Vampire Codex, "a document written by occultist and psychic vampire Michelle Belanger for use as the instructional text of House Kheperu", and The Songs of Bilitis, "a clever forgery by Pierre Louys" published in Paris in 1894, which "purports to be translations of poems by a woman named Bilitis, a contemporary and acquaintance of Sappho." The Internet Sacred Text Archive is even more open-minded than these selections might indicate, as their long list of available texts includes the Mahabharata, Tao the Great Luminant, and the Unicode Greek New Testament.

Meanwhile, in news from another galaxy, the Egyptian weekly Al-Usbu' reported on November 17 (via MEMRI) that the manuscript museum at the new Alexandria Library has "added The Protocols of the Elders of Zion to the display case of the holy books of the monotheistic religions, next to a Torah," because, according to the museum's director Dr. Yousef Ziedan, "this dangerous book ... has become one of the sacred [tenets] of the Jews, next to their first constitution, their religious law, [and] their way of life. ... It is only natural to place the book in the framework of an exhibit of Torah [scrolls]."

The Alexandria Library has been built with funds from the Italian government and UNESCO, among others, totalling something like $100 million.

On November 1, Umberto Eco gave a talk at this same Alexandria Library, entitled "Vegetal and Mineral Memory: the Future of Books." He is quoted as describing libraries as "temples of vegetal memory."

"They were, and still are, a sort of universal brain where we can retrieve what we have forgotten and what we still do not know," he told the assembled crowd at the library’s conference center. "We have invented libraries because we know that we do not have divine powers, but we try to do our best to imitate them."

Or something.

[Update 12/3/2003: Let me try to state my own viewpoint on this, in case it's not clear from reading between the lines above.

I find it strange that John B. Hare, the man behind the Internet Sacred Text Archive, thinks that The Songs of Bilitis is a sacred text at all, or that The Vampire Codex should be given essentially parallel billing with the King James Bible and the Rg Veda on his site. However, it's Hare's right to follow his own inspiration as he pleases, especially because his site is funded only by himself and the contributions of those who find it useful. And many of the texts he posts seem to represent a truly valuable service, while all of them are likely to be interesting at least to some people. His descriptions of the texts (those that I've read) also seem fair-minded and completely lacking in malice. So at worst, it seems to me, one could describe the site as heart-warmingly nutty, though I can see that some readers might be offended by the implied parallelism between sacred texts that they sincerely revere and things like Louys' forged erotica or Belanger's creepy Vampire manifesto.

By contrast, Dr. Yousef Ziedan, director of the manuscript museum at the Biblioteca Alexandrina, which is supported by something like $100 million from UNESCO and various governments, is promoting a malicious political agenda by placing the notorious anti-semitic forgery The Protocols of the Elders of Zion in a small display case next to a Torah scroll, and describing the text in at least one published interview as if it were a genuine document that "has become one of the sacred [tenets] of the Jews." Such statements have become routine in the (government-supported) Egyptian media, but it is shocking to see that the infection has penetrated to an internationally-oriented institution that pretends to want to "became the matrix for a new spirit of critical inquiry". I'd describe Dr. Ziedan's views as chillingly nutty, and the Egyptian social context that validates them as appalling.

The new Alexandria library has established ties not only with governments, politicians and NGOs from around the world, but also with major public intellectuals like Umberto Eco and Brewster Kahle. They're drawn by the vision, the publicity and I suppose also partly by the money, which can be used to promote good works such as Kahle's Internet Archive. I don't mean to suggest that such people are complicit in Ziedan's gesture, nor even that they've been aware of it. But it will be interesting to see how widely the see-no-evil approach of the European Union will be duplicated in this case.]

[Note: Peter Erwin has pointed out to me that Umberto Eco's book Foucault's Pendulum "contains, amidst numerous other displays of erudition, a brief discussion of the Protocols' true history -- not just its concoction by elements of the Tsarist secret police, but further back to a chain of plagiarism and borrowings originating in mid-19th Century political screeds against Napoleon III and the Jesuits. So Eco is perfectly aware -- indeed, far more than most people -- of how much of a hoax the Protocols actually are." Erwin also drew my attention to this essay by Eco, which closes with the words "... it is enough to visit certain racist websites, or to follow anti-Zionist propaganda in the Arab countries, to see that anti-semites have still found nothing better to do than to recycle, yet again, those ludicrous Protocols." And finally, Erwin pointed out a link to the full text of Eco's November 1 talk at the Biblioteca Alexandrina.]

[Update 12/9/2003: according to Agence France Presse (here reprinted in Al-Jazeera), library director Ismail Siraj al-Din has ordered the book removed from the exhibit, claiming that it had been displayed "as a curiosity" only. The AP Story (here from the SF Chronicle) also cites a letter "questioning the display" from Koichiro Matsuura, director general of UNESCO, and quotes Ziedan as saying that "My professional view is that it is a silly book ... Its only significance is that it is the first Arabic edition of the book that has influenced the Arab mentality to a great extent."

But Ziedan's quoted remarks in the Al-Usbu' piece translated on the Memri site offer a different picture of his views. Either he distinguishes between his professional view (that it is a "silly book") and his personal view (that it "has become one of the sacred [tenets] of the Jews, next to their first constitution, their religious law, [and] their way of life"), or he was mis-quoted by Al-Usbu', or he lacks the courage of his convictions, saying different things in different contexts. What's your guess?

Matsuura also sent a letter to the participants in a seminar on the 100th anniversary of the Protocols, coincidentally held in Venice last weekend.]

Posted by Mark Liberman at 11:11 PM

No foot in mouth

I'm as much in favor of good plain writing as the next grammarian writing about Standard English, and no particular fan of Defense Secretary Donald Rumsfeld, but let's just take another look at this news item about the Plain English Campaign giving Rumsfeld its foot in mouth award for "the most baffling statement by a public figure". Here's the paragraph that got him cited:

Reports that say that something hasn't happened are always interesting to me, because as we know, there are known knowns; there are things we know we know. We also know there are known unknowns; that is to say we know there are some things we do not know. But there are also unknown unknowns -- the ones we don't know we don't know.

Now read that carefully. Because my question is, What the hell is supposed to be wrong with it?

The quotation is impeccable, syntactically, semantically, logically, and rhetorically. There is nothing baffling about its language at all.

Now, admittedly, I don't know whether it's true, but that's a very different matter. What it says is completely straightforward: he pays special attention to negative reports because he's conscious of the possibility of areas of ignorance that are not currently recognized as such. His reminder in passing that there are also (i) areas of knowledge that we are aware of possessing and (ii) areas of ignorance that we are aware of seems to allude to a familiar old Persian apothegm:

He who knows not, and knows not that he knows not, is a fool; shun him.
He who knows not, and knows that he knows not, can be taught; teach him.
He who knows, and knows not that he knows, is asleep; wake him.
He who knows, and knows that he knows, is a prophet; follow him.

It also echoes (perhaps unwittingly) the title of Sylvain Bromberger's collection of philosophy papers, On What We Know We Don't Know. Bromberger saw it as an interesting epistemological fact that we can be aware of our lack of knowledge in some domain. Rumsfeld is drawing attention to the importance of our unacknowledged areas of ignorance -- what we don't know we don't know. So what? Is this the best that can be done to identify baffling utterances by public figures, in a world where Judith Butler remains at large, and Michael Jackson's lawyer can say in his client's defense "if these charges were true I assure you Michael would be the first to be outraged"? I don't get it. Hate Rummie if you want for political reasons, but don't try to get grammar or logic on your side. There is nothing unintelligible about his quoted remark, linguistically or logically.

Posted by Geoffrey K. Pullum at 09:22 PM

Not a pedant, but a stickler

Eats, Shoots and Leaves: the Zero Tolerance Approach to Punctuation, by Lynne Truss. Sounds like fun. The title and author might have been made up by a writer at the Onion, but this review (via A.L.D.) makes it all sound so real. Unfortunately neither amazon.com nor bn.com have heard of the title or the ISBN (1861976127), but a trip to amazon.co.uk confirms that the book exists, and can be had by those who live in the UK or are willing to pay the freight.

Best lines in the review:

... in spite of the reference in the title to zero tolerance, Lynne Truss remains utterly good-natured throughout. She says she is not a pedant, but a stickler - which is a description that many of us would be happy to adopt. She does say that people who put an apostrophe in the wrong place, when they ought to know better, deserve to be struck by lightning, hacked up on the spot and buried in an unmarked grave, but it's probably mostly in fun.

You may not get the same set of "closely related" books from Amazon.co.uk that I do, but if my list is typical, their system is either slyly inspired or totally FUBAR. Amazon suggests some expected stuff like Between You and I and The Adventure of English 500 AD -- 2000, but then they throw in The Book of Bunny Suicides ("yep, it's definitely bunnies committing suicide", says Clive Jones from Cambridgeshire) and Crap Towns: the 50 Worst Places to Live in the U.K. ("wish I'd had the idea as these people are going to make a fortune over Christmas", says "a reader from London"). Maybe the whole thing is just a bunch of virtual domains inside http://www.theonion.com, after all ...

Or maybe the list is not the fault of some misdesigned or undertrained textual relatedness algorithm, but rather reflects amazon's marketing priorities?

Posted by Mark Liberman at 07:59 PM

More on "Samarra"

I have some new information about "Samarra."

First, the Iraqi blogger Omar, whom I emailed to ask about the stress pattern, kindly replied to say

the word is : Sa----ma---rra.
so the stress is on the last syllable (iraqi way)
and in formal arabic, it's: Sa--ma--rra'

Thus Iraqi colloquial deletes the final glottal stop, but the stress remains final, just as Mohamed Maamouri suggested (he is Tunisian, but he knows a lot about Arabic linguistics and has visited Samarra). I believe that Omar is talking about the kind of Arabic spoken in Baghdad (what Ethnologue calls Mesopotamian Spoken Arabic, code ACM).

Second, Mohamed Maamouri wrote again:

I was not pleased yesterday that even though I thought of a possible etymological link to /samar/ the name /samarraa/ did not have a good morphological grounding. When I woke up this morning, I remembered one of my high-school lessons which gives the explanation of the exceptional word-formation ('morphosyntactic amalgam' ?!) of /samarraa/ which comes from /sarra man ra?a:/ '/delights/cheers (he/she) who sees [it]''. This etymology is attested in the literature of the period.

Finally, Tim Buckwalter added:

You are absolutely right, and I just found it in the ALECSO dictionary, although they give it in the passive: sur:a man ra:?a (i.e., the town of "happy-is-he-who-sees-it"? )

and Tim noted as a postscript that

This Google is very productive: "surra man ra'a"

One of the more interesting historical links is this one, which says that

The ancient toponyms for Samarra are: Greek: 'Souma' (Ptolemy V c. 19, Zosimus III, 30), Latin: 'Sumere', a fort mentioned during the retreat of the army of Julian the Apostate in AD 364 (Ammianus Marcellinus XXV, 6, 8), and Syriac 'Sumra' (Hoffmann, Auszüge, 188; Michael the Syrian, III, 88), described as a village.

[. . .]

The caliph's city was formally called Surra Man Ra'a ("he who sees it is delighted"). According to Yaqut (Mu`jam s.v. Samarra), this original name was later shortened in popular usage to the present Samarra. It seems more probable, however, that Samarra is the Arabic version of the pre-Islamic toponym, and that Surra Man Ra'a, a verbal form of name unusual in Arabic which recalls earlier Akkadian and Sumerian practices, is a word-play invented at the Caliph's court.

Posted by Mark Liberman at 03:42 PM

Clear Thinking Campaign gives "fogged spectacles" award to John Lister

The Plain English Campaign is in the news today for giving its "foot in mouth" award to Donald Rumsfeld. However, the campaign's spokesman, John Lister, needs to clear up the thinking behind his own rhetoric.

According to this press release:

"You won't need a degree in linguistics to hire a room at the University of Warwick" so says John Lister from the Plain English Speaking Society.

Mr. Lister is praising Warwick for re-writing its Terms and Conditions document to eliminate "legal jargon" and "gobbledygook." The new document is certainly clearer and better than the old one, but what does this have to do with degrees in linguistics? We linguists don't offer our students any instruction in understanding badly-written documents, nor do we expect them to develop such skills on their own.

I'm sure that Lister didn't really think this through. He's just using a thoughtless stereotypical turn of phrase, a "phrase for lazy writers in kit form," as Geoff Pullum put it in an earlier Language Log post. People often say "you (don't) need a degree in X to do Y", where the connection between X and Y is loosely associative at best. Ask Google about "need a degree in" and you'll find people writing that "you don’t need a degree in cultural studies to notice that Western society doesn’t have too many worthwhile heroes anymore," and "you don’t need a degree in mechanical engineering to drive a car with an automatic transmission," and "you practically need a degree in Botany to grow anything in this area," and "with some cell phones you practically need a degree in rocket science to operate the darn things," and on and on.

I have no idea what Lister's own academic training is -- it's not relevant at all -- but let me say that he shouldn't need a degree in philosophy in order to think about the content of his pronouncements as well as their form.

[Update: the reader should also refer to Geoff Pullum's argument that the "Foot in Mouth" award to Rumsfeld is based on a quotation that is "impeccable syntactically, semantically, logically, and rhetorically", and thus must have been selected politically.]

Posted by Mark Liberman at 12:20 PM

Talkin' metric to decent folk

Bill Poser's posting on Pope Gregory's grammatical guilt (or shame?) reminds me of Gamble Rogers' character Narcissa Nonesuch, the community organizer who riled the Florida panhandle town of Snipes Ford by "talkin' metric to decent folk". I believe that this story also may involve her, or at least a close relative.

Here is a short selection from one of Rogers' CDs, in which he discusses the lexical distinction between dawgs and dogs. I've used this bit in the past as the basis for a course assignment on dialect, formality, class and gender. You could think of it in those terms, or you could just enjoy it as a rant about those obnoxious little yappy dogs. It's your call :-).

Posted by Mark Liberman at 08:11 AM

December 01, 2003

religion and grammar

With Geoff Pullum regaling us with naughty tales of his frolics in the fleshpits of Nevada and illicit noun-noun compounds, it seems that a dose of old-time religion is in order. Bertrand Russell, in "Has religion made useful contributions to civilisation?", notes:

Pope Gregory the Great wrote to a certain bishop a letter beginning: "A report has reached us which we cannot mention without a blush, that thou expoundest grammar to certain friends." The bishop was compelled by pontifical authority to desist from this wicked labor, and Latinity did not recover until the Renaissance.

Curiously, according to the biography in The Catholic Encyclopaedia, Pope Gregory excelled in grammar as a young man, before he became a monk.

Posted by Bill Poser at 11:57 PM

The rigors of fieldwork trips

I am shocked, shocked, at what Mark Liberman clearly insinuates in his recent Language Log entry. He notes that hundreds of compounds with plural non-head elements, counterevidence to the (originally Kiparskian) generalization studied by Peter Gordon (`Level ordering in lexical development', Cognition 21 [1986], 73-93), could have been found on the web site of my own institution, not only without leaving my desk but without leaving the ucsc.edu Internet domain. Clearly he is implying that therefore the rigorous field trip I undertook to the barren desert state of Nevada, during which I tracked down and recorded the crucial compound counterexample activities center, must be suspect as far as tax deductibility is concerned. I am deeply disturbed lest any reader, whether working for the Internal Revenue Service or not, might be inclined to follow him in doubting the genuineness of my business travel schedule.

It is essential for the empirical health of linguistics that the investigator should not just sit at home in (say) his comfortable Philadelphia apartment with his high-speed Internet connection and his radio-equipped laptop using Google as a substitute for fieldwork, but should be prepared to travel to far-away places, sometimes with harsh and arid climates and strange customs, to observe language use in its natural context and setting. It is true that dozens of examples like activities center are already recorded in the 2002 paper by Scholz and myself, and many more are on UCSC web sites, including my own; but one can never have too much data, and dedicated linguistic scientists must be prepared to get out of their comfort zone and undertake travel to distant parts to gather more of it. Serious empirical inquiry demands no less.

Since my ticket stub entitled me to a free cocktail at The Orleans' casino bar, I had an opportunity for further mingling with the local area speech community, in Spring Valley (The Orleans, too, falls outside the Las Vegas city limits) after the Frankie Valli concert. Thus in a single evening I had juxtaposed opportunities for objective observation of both the lithe bodies of the four young male Californian singer/dancers backing up Frankie Valli's show and those of the young female drinks servers at The Orleans (whose uniform consists of fishnet tights over a thong-style body stocking and little else but high heels). The combined aesthetic effect would suggest the idea of bisexuality to anyone whose mind wasn't utterly closed to new experiences. And the mind of the scientific investigator must never be closed to new experiences. I didn't actually gather any useful linguistic data that particular evening, but my mind was open. I think any fair-minded tax inspector would agree that the social science fieldworker must be prepared to get out there in the speech community and interact as a participant observer, not just sit around playing with his Google.

Oh, I nearly forgot: the foregoing paragraph contains another relevant noun-noun compound with a regularly inflected plural noun as first element; I put it in just to test the reader's alertness.

Posted by Geoffrey K. Pullum at 07:54 PM

Activities centers in Paradise and Santa Cruz

I'm deeply impressed by Geoff Pullum's devotion to linguistic science during his recent field trip to Las Vegas (or rather, as he tells us, Paradise NV). However, there is one aspect of one of his reports which puzzled me, namely the part where he cites a children's "Activities Center" in Paradise as a counterexample to the claim about regular plurals not being allowed as the first element of a noun compound in English. It's not Geoff's viewpoint that I found puzzling -- he's sensible as always, and right as usual. It's the claim itself, recently associated with Peter Gordon (though he is not the originally guilty party).

Using a Google search of ucsc.edu, it's easy to find many counterexamples on Geoff's home campus itself. Searching for "activities center," for example, we find two instances, e.g. "the Student Activities Center A-Frame". Looking down a few pages of the list of hits for "foods", we can find "a gourmet wild foods salad", "the organic and natural foods industries", "the Organic Foods Production Act", and "the Center for Agroecology and Sustainable Foods Systems (CASFS)". Searching for "goals setting" nets three instances, e.g. "goals setting and career advising". And searching for "publications list" finds 18 examples on the ucsc.edu site, including one on Geoff's own home page.

I'll leave it as an exercise for the reader to do the same thing on the web site for Peter Gordon's home campus, though I can't resist mentioning the "Dean E. Smith Student Activities Center". The point is, you don't have to go to Paradise to find regular plurals as non-head members of compounds. There's something to be said about when you get such plurals and when you don't -- but those who maintain that you never get them are going through life (never mind Paradise) with their eyes and ears covered up.

Of course Geoff Pullum knows how easy it is to find such counterexamples -- and is aware of the many counterexamples to the claim that have previously been published, some by him. But whenever I read about this, I have to marvel all over again. There are so many genuine and genuinely interesting generalizations about speech and language, and yet sensible linguists like Geoff continue to have to argue against the surprising number of researchers who continue to propose "explanations" for easily-disproved "facts".

Posted by Mark Liberman at 04:55 PM

Postcard from Vegas, 3: regularly-inflected plurals exclusion? I don't think so

One other piece of linguistic data gathered during my trip to Vegas and I think it will become clear to any tax inspector that the entire trip should be allowed as a tax deduction.

Peter Gordon has argued (in `Level ordering in lexical development', Cognition 21 [1986], 73-93 -- and at least one psycholinguistics course features a whole slide show on this paper) that children have an innate understanding of a key feature of how noun-noun compounds are formed in English: they know that regularly inflected plurals cannot occur as the first (non-head) component of a compound, though irregular plurals can. Thus a monster that eats mice may be called a mice eater, but a monster that eats rats cannot be referred to as a rats eater.

Well, there is doubtless much to be learned about how children learn compounds, but while looking around at the Fairfield Grand Desert Resort in Las Vegas I happened to pass a room full of young children doing educational things, and over the door it said ACTIVITIES CENTER. That's a compound with a regularly inflected plural as the first of its two elements, and it supports the position argued by me and Barbara Scholz in Empirical assessment of stimulus poverty arguments (The Linguistic Review 19, 9-50): whatever the right answer may be, Gordon cannot explain children's language acquisition by reference to innate universal knowledge of his alleged principle, because it's not true.

Posted by Geoffrey K. Pullum at 03:50 PM

Postcard from Vegas, 2: syntactic data collection on the strip

For the grammarian (and that is what I am, though I am also fun enough to go to Vegas for a wild weekend), data is all around us.

Oh, all right, prescriptive pedants: data are all around us.

And in Las Vegas, right on the strip, on the way to a show, I gathered a beautiful example which definitively settled a question I had regarded as either open or possibly closed in the other direction: whether a proper name can be a fully natural antecedent of a singular they. Let me explain.

It is a familiar myth from bad usage books that sentences like Everyone does what they are told are grammatically incorrect. The claim, let me stress immediately, is absolute nonsense. The pronoun they (in its various inflectional forms: they, them, their, theirs, themselves) has been used with a singular antecedent for hundreds of years. It occurs in Chaucer, Shakespeare, Milton, Austen, Wilde... it is natural, idiomatic, fully grammatical English for every native speaker who has not had their brain completely warped by bad usage books like Strunk & White's disgusting little atavistic compendium of falsehoods The Elements of Style.

What the bad usage books say about an example like Everyone does what they are told is that they is a plural pronoun but everyone is a singular noun phrase (notice the singular agreement on does), and you can't be both plural and singular, so it's wrong. The claim is a stupid mistake -- it depends on confusing the partially semantic system of choice for person, number, and gender on anaphoric pronouns with the mostly syntactic system of subject agreement in person and number marked on verbs. Nowhere did God say they have to line up in some simple way, and indeed they don't. The usage grouches are just flat wrong about the history and structure of English. I could go on for some time about this, but I won't, because I already did once, on Australian radio, in a talk called "Anyone who had a heart (would know their own language)", and you can read the script at http://www.abc.net.au/rn/arts/ling/stories/s546929.htm, or listen to the program itself by visiting http://www.abc.net.au/rn/arts/ling/lfranca_040502.ram.

But the example I found in Vegas corrects something I said in the radio talk. I drew a distinction between referring pronouns (as in John and Mary called to ask if they can meet with you, where they just refers to John and Mary) and bound pronouns (as in No one called to ask if they can meet with you, where they is a variable bound by the quantifier no one: the sentence means "No one is an x such that x called to ask if x can meet with you"), and I ventured the claim that singular they "really is ungrammatical" with a singular name as antecedent. I pointed out that if you know someone called Chris was here and left a pen behind, then even if you don't know whether Chris is a man called Christopher or a woman called Christine, you can't say *Chris left their pen.

I still think that sentence sounds terrible. But the question arises of whether there could possibly be a singular name that in some way manages to have the sort of denotation that would allow a singular they to refer back to it. And in Las Vegas, right on the strip, I finally heard a real live native speaker say such an example, and to my ear it was perfectly grammatical and natural. I got on a bus at around 6 p.m. to ride south a mile or so from up by the Stardust down to Bellagio (to see Cirque du Soleil). The traffic was a disaster. The bus moved at slower than walking pace -- this was the worst $2 I ever spent on transportation. And as the bus inched its way south toward the Mirage, a blaze of light showed up on the right, and the driver said:

If you look to the right, Treasure Island's having their show right now.

The Treasure Island hotel was running its free pirate show (men with eye patches and head scarves and swords climbing up rigging and being shot and falling off into the water nightly at 6, 8, and 10 p.m., for you to watch for free from the street). Notice the singular agreement: Treasure Island's is the colloquial reduction of Treasure Island is. But what gender is a hotel, in the sense not of a building but of an entity that can perform a show? Not clear. So the bus driver used singular they. And quite right too.

In the light of this evidence, I would now say that although *Chris left their pen still sounds dreadful for some reason (perhaps because whoever Chris is, he or she really does have a gender), nonetheless it is possible to have a singular they with a singular proper name antecedent. This is actually not excluded by what it says in The Cambridge Grammar of the English Language (chapter 5, section 17.2.4, pp. 491-495, esp. p. 494), which only says cases of referential antecedents are rare. The Cambridge Grammar has it right, and the claim in my radio talk is slightly too strong. You heard it here first, and I heard the crucial evidence in real life -- or what passes for real life in Vegas.

[Note added later: Chris Culy has pointed out to me that at least one other example can be found using Google:

"Profit-sharing, career training, creative child-care solutions, lactation centers and developmental opportunities add to the many ways Principal helps their employees create a healthy work/life balance." (From http://www.latinastyle.com/2002list.html; underlining added.)

"Principal" is the Principal Financial Group, picked by Latina Style as one of their list of the best companies to work for if you're a Latina. (Ooh! There's another example! Their list!)

Some might unkindly suggest that this observation of Culy's means my trip to Vegas should not be a deductible expense; but I have commented on that mean-minded suggestion elsewhere.]

Posted by Geoffrey K. Pullum at 03:21 PM

Postcard from Vegas, 1: twilight-zone semantics

I spent the long Thanksgiving weekend in Las Vegas. (I may be a grammarian, but as grammarians go, I am a very sexy super fun wild and crazy guy, as I believe I have previously mentioned.) The linguist in Las Vegas will rapidly learn that this is a town with a twilight-zone semantics. Superlatives everywhere (the greatest, the finest, the best, the most chances of winning...); Elvises who aren't Elvis; signs saying things that are outrageously false (like the ones saying you can book to see Siegfried and Roy at the Mirage); free things not really free; even the term "Las Vegas" doesn't mean what you thought (all of the strip, from the Four Seasons and Mandalay Bay all the way up to Circus Circus and the Sahara, is in a place called Paradise, and is outside the boundaries of the city of Las Vegas, which most visitors never actually enter); nobody means the things they say. I told the server at a restaurant that I'd like the check now please, and the reaction, with a beaming smile, was, "Fantastic!" Now how could it be fantastic for me to want the check? Isn't such a request included in virtually every restaurant transaction script? I was baffled. Had a wonderful time, though. Best steak: The Steak House, at Circus Circus. Best music: by good luck, Frankie Valli and the Four Seasons were in town, at the Orleans, and did an extraordinary show -- Frankie still has the wonderful voice heard on Sherry more than 40 years ago. Best show: Cirque du Soleil's show entitled "O", at Bellagio, is beyond belief -- fully worth the 3-digit ticket price.

My server's phraseology may have seemed to indicate that Vegas has a low fantasticness threshold, but some of the things in Vegas are... well, just fantastic.

And I gathered some interesting data too... but that's a different postcard.

Posted by Geoffrey K. Pullum at 02:48 PM

Stress and death in Samarra

Because of the ambush and subsequent firefight in central Iraq yesterday, the news has been full of mentions of the Iraqi town whose name is spelled "Samarra" in English. In the context of this serious event, I hate to bring up the relatively trivial matter of pronunciation, but one way or another, we have to say the words...

This morning on NPR, Bob Edwards said [sam'ara] but Carl Kasell said [s'æmara] (where the single quote marks the main-stressed vowel, and I'm ignoring the details of the quality of the unstressed vowels).

I asked Tim Buckwalter how this word is pronounced in Arabic, and he responded:

The word sAmar~A' has two long vowels (/sa:mar:a:?/) so the stress should fall on the last long vowel and all preceding ones get shortened. However, names that end in /a:?/ tend to drop the glottal stop, and stress shifts to the nearest preceding long vowel. A good example of this is "Sinai": /si:na:?/ in MSA, but /si:na/ in colloquial (and sloppy MSA). So, I suspect that this is how he got /sa:mar:a/. But since I don't know Iraqi, maybe I got it all wrong.

[Note: for the interpretation of Tim's transliteration sAmar~A', see this table]. According to Tim's answer, the correct formal pronunciation in Modern Standard Arabic would have final-syllable stress (which neither NPR announcer used), whereas the colloquial pronunciation (at least in the Levantine Arabic that Tim knows best) would have initial-syllable stress, as in Carl Kasell's pronunciation. If I understand the transliteration right, the vowel quality would also be closer to American English cat than cot.

There are several colloquial Arabics spoken in Iraq, so I guess there could be additional answers, but my guess is that Bob Edwards' pronunciation [sam'ara] is just the default American-English stress rule for foreign words: "if it ends in a vowel, use penultimate stress", along with the default American-English idea about how to pronounce orthographic "a" in foreign words ("use the vowel in cot, not the vowel in cat"). This is certainly how I always thought the word Samarra should be pronounced in English. And maybe Bob Edwards and I were right -- this is English, after all, not Arabic -- but the version with initial stress and a fronter vowel is apparently closer to the colloquial Arabic while remaining well within the phonetic space of our native American English.

Then again, Mohamed Maamouri supports the final-stress pronunciation that neither announcer used: "According to what I know, the pronunciation is /samar~A'/ with the stress on the last long vowel and with possible deletion of the final glottal stop." Mohamed has visited Samarra (in the 1970s), and so he has some direct personal evidence. He also mentioned that the name comes from an Arabic form meaning 'have an evening of entertainment.'

The first time that I ever had occasion to pronounce this word, if only to myself, was when I was 12 or 13, reading John O'Hara's novel Appointment in Samarra. Amazon.com gives it a blurb to die for by Ernest Hemingway: “If you want to read a book by a man who knows exactly what he is writing about and has written it marvelously well, read Appointment in Samarra.”

The book's action actually takes place in Pottsville, Pennsylvania. But the version that I read as a kid was an old-fashioned paperback edition with a trashy-looking cover, and I found it in a stack of mystery novels and suspense stories in my mother's sewing room. So I thought it was a spy thriller, and kept waiting vainly for that part of the story to start...

The title comes from a passage by W. Somerset Maugham:

DEATH SPEAKS: There was a merchant in Baghdad who sent his servant to market to buy provisions and in a little while the servant came back, white and trembling, and said, Master, just now when I was in the marketplace I was jostled by a woman in the crowd and when I turned I saw it was Death that jostled me. She looked at me and made a threatening gesture; now, lend me your horse, and I will ride away from this city and avoid my fate. I will go to Samarra and there Death will not find me. The merchant lent him his horse, and the servant mounted it, and he dug his spurs in its flanks and as fast as the horse could gallop he went. Then the merchant went down to the market-place and he saw me standing in the crowd and he came to me and said, Why did you make a threatening gesture to my servant when you saw him this morning? That was not a threatening gesture, I said, it was only a start of surprise. I was astonished to see him in Baghdad, for I had an appointment with him tonight in Samarra.

Posted by Mark Liberman at 11:50 AM

Where have all the inflections gone?

A bird's eye exercise, if we may. I will address, for non-linguist readers, inflections, most familiar as the noisome declensional and conjugational suffixes that bedevil English-speaking learners of, seemingly, most foreign languages we encounter. For the record, worldwide inflections can just as easily be prefixes as suffixes.

Based on Latin shedding so many of its inflections in becoming the Romance languages, and English's being such an inflection-shy sister in the Germanic family compared to, most strikingly, grand old Icelandic, linguists are taught that it is "natural" for languages to "molt" as a matter of course. Some languages stay heavily inflected, like Russian, while some descendants of the same ancestor just take it all off.

The common consensus is that English's paucity of inflections just "happened," a mere by-product of the syllable-initial stress tendency in Germanic, an unremarkable step beyond the denuded reality that Scandinavian and Dutch hide behind their nostalgic orthographies.

But would linguists find these developments so unremarkable if linguistic science had happened to develop (I beg your indulgence) among hunter-gatherers?

I present this as a genuine question: where is the streamlined, inflection-free North American Native American language? The linguist is accustomed to attending talks on these languages and encountering bristling paradigms of prefixes and suffixes, indicating the obviative, the inverse and God knows what else, complete with portmanteau morphemes (that is, where one prefix or suffix carries two meanings, such as "me plus him"). Is it just by chance that I have never heard of a language of this area that has shed most of its inflections and relies largely on pronouns and free words along the lines of WILL? Where is the Algonquian "French"?

In the same vein, which Australian language is as inflection-shy as French or English? Has Australia witnessed the language that happened to wend its way into streaking about naked, like some Western European languages have? Or -- where is the Bantu (as opposed to Bantoid) language like this? After four thousand years, if it is so unremarkable for phonology and happenstance to shear off a grammar's affixes, then surely we would not expect that 500 Bantu languages recapitulate the familiar copious noun classes of this group. Where is the Bantu "English"?

I ask these questions because my research increasingly suggests to me that for a language to shed its inflections, rather than consistently replace or even retain them, is less business as usual than the unexpected case. From a global perspective, languages appear to usually do this as the result of widespread acquisition by adults, whose ossified language organs tend to clear away languages' "junk."

Thus the inflection-shy nature of Romance compared to Latin -- and Romanian's remnants of case-marking are dishwater compared to Polish, Greek or Lithuanian -- was due to imperfect renditions of that language being passed on in the context of invasion and imposition from the outside. English is the only Indo-European language in Europe with no gender marking on articles or nouns -- ever notice that? -- because of Vikings' approximation of Old English starting in the eighth century. It is presumably no accident that Persian, with its low inflection and gender-neutral third person pronoun, has been a lingua franca par excellence throughout much of its history.

On-the-ground accounts of these "changes in progress" detailed enough to engage the sociolinguist are lost to history. But to require that French and English and Persian "just happened" leaves questions. Where is the inflection-free Slavic language? Yes, Bulgarian lost the nouns' declensions, but get a load of the verbs!! Why are the only Semitic languages that entirely bury the family's triconsonantal verb roots and their attendant vowel changes and affixations the few born amidst heavy non-native acquisition, like Juba and Nubi Arabic? Why is it that a Bantu variety as inflection-wary as English is the brand of Lingala spoken non-natively as a lingua franca?

The evidence suggests that the post-Neolithic "punctuations" that Bob Dixon describes in human languages' timelines have often sheared away a degree of languages' "mess" as they were imposed on adult speakers and passed down in abbreviated form to succeeding generations. Mandarin's mysteriously compact four tones would be a similar case, it having been adopted by Mongol invaders while Cantonese and the rest of the brood mutated uninterrupted, developing the eight and nine tones typical of the "card-carrying" Sinitic language.

Isn't it time that structural reduction played as much a part in theories of language change and contact as mere tradings of grammar?

Postscripts:

A: Re animals and music, my little cat Lara displays little interest in natural language grammars' loss of inflection, but is given to rolling ecstatically around on the floor when I play my CD of Bach organ pieces -- the fat bass notes seem to occasion especial enthusiasm.

B: My first response on reading Guy Bailey's account of Texan r-lessness in the New York Times was certainly "Why in God's name did he feed them that?" But then I realized that a scholar of Bailey's stature could not possibly have presented such claims as scholarly wisdom, and recalled from my own experiences with the media that it is almost unavoidable that journalists single out the casual parenthetical as isolated "fact." Almost superhuman feats of vigilance in on-line composition of utterance would be necessary to avoid misrepresentation of this kind.

Posted by John McWhorter at 01:50 AM