Language Log: March 2004 Archives

March 31, 2004

Convenience for the wealthy, virtue for the poor

Warning: this is a rant. I don't do it very often, but after editing a grant proposal for a few hours this morning, I felt like indulging in one. So bear with me, or move along to the next post. (Now that I think of it, I did indulge in a similar rant just ten days ago. Well, you've been warned...)

I was pleased to find, via Nephelokokkygia, this page by Nick Nicholas on Greek Unicode issues. In particular, he gives an excellent account, in a section entitled "Gaps in the System," of a serious and stubborn problem for applying Unicode to many of the world's languages. He sketches the consortium's philosophy of cross-linguistic generative typography, showing in detail how it applies to classical Greek, and explaining why certain specific combinations of characters and diacritics still don't (usually) work.

Given the choice between the difficult logic of generative typography and the convenient confusion of presentation forms, the Unicode consortium has consistently chosen to provide convenient if confusing code points for the economically powerful languages, but to refuse them systematically to weak ones. As a result, software providers have had little or no incentive to solve the difficult problems of complex rendering.

This reminds me of what Churchill said to Chamberlain after Munich: "You were given the choice between war and dishonor. You chose dishonor and you will have war." The problems of reliable searching, sorting and text analysis in Unicode remain very difficult, in all the ways that generative typography and cross-script equivalences are designed to avoid -- due to the many alternative precomposed characters (adopted for the convenient treatment of major European and some other scripts), and the spotty equivalencing of similar characters across languages and scripts (adopted for the same reason). At the same time, it's still difficult or impossible to encode many perfectly respectable languages in Unicode in a reliable and portable way -- due to the lack of complex rendering capabilities in most software, and the consortium's blanket refusal to accept precomposed or other "extra" code points for cloutless cultures. I'm most familiar with the problems of Yoruba, where the issue is the combination of accents and underdots on various Latin letters, and of course IPA, where there are many diacritical issues, but Nicholas' discussion explains why similar problems afflict Serbian (because of letters that are equivalent to Russian Cyrillic in plain but not italic forms) and Classical Greek (because of diacritic combinations again).
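To make the searching problem concrete, here is a minimal Python sketch using the standard library's unicodedata module. The sample strings are an assumption chosen for illustration (they are not taken from Nicholas' page): three code point sequences that should all display as a Yoruba-style e with an underdot and an acute accent. Naive string comparison treats them as three different strings; canonical normalization (NFC) reconciles them, but because Unicode has no single precomposed code point for this letter, the normalized form still contains a combining mark and so still depends on the renderer getting the stacking right.

```python
import unicodedata

# Three ways of spelling "e with dot below and acute" (illustrative choices).
spellings = [
    "e\u0323\u0301",  # e + COMBINING DOT BELOW + COMBINING ACUTE ACCENT
    "e\u0301\u0323",  # e + COMBINING ACUTE ACCENT + COMBINING DOT BELOW
    "\u00e9\u0323",   # precomposed e-acute + COMBINING DOT BELOW
]


def nfc(text):
    """Return the canonical composed (NFC) form of text."""
    return unicodedata.normalize("NFC", text)


# Without normalization, a search for one spelling misses the other two.
distinct_raw = len(set(spellings))

# After normalization, all three spellings collapse to a single string.
normalized = {nfc(s) for s in spellings}
canonical = next(iter(normalized))

# Plain e-acute has a precomposed code point; the underdotted one does not.
e_acute_length = len(nfc("e\u0301"))
underdot_acute_length = len(canonical)

print(distinct_raw, len(normalized), e_acute_length, underdot_acute_length)
for ch in canonical:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))
```

The point of the comparison is the last two lengths: the "wealthy" letter normalizes to one code point, while the Yoruba letter normalizes to a precomposed underdotted base plus a combining accent.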

I'm in favor of Unicode -- to quote Churchill again, it's the worst system around "except all those other forms that have been tried from time to time." However, I think we have to recognize that the consortium's cynical position on character composition -- convenience for the wealthy, virtue for the poor -- has been very destructive to the development of digital culture in many languages.

There is a general issue here, about solving large-ish finite problems by "figuring it out" or by "looking it up." While in general I appreciate the elegance of "figure it out" approaches, my prejudice is always to start by asking how difficult the "look it up" approach would really be, especially with a bit of sensible figuring around the edges. My reasoning is that "looking it up" requires a finite amount of straightforward work, no piece of which really interacts with any other piece, while "figuring it out" suffers from all the classical difficulties of software development, in which an apparently logical move in one place may have unexpectedly disastrous consequences in a number of other places of arbitrary obscurity.

I first argued with Ken Whistler about this in 1991 at the Santa Cruz LSA linguistic institute. At the time, he asserted (as I recall the discussion) that software for complex rendering was already in progress and would be standard "within a few years". It's now almost 13 years later, and I'm not sure whether the goal is really in sight or not -- perhaps by the next time the periodical cicadas come around in 2117, the problems will have been solved. Meanwhile, memory and mass storage have gotten so much cheaper that in most applications, the storage requirements for text strings are of no consequence; and processors have gotten fast and cheap enough that sophisticated compression and decompression are routinely done in real time for storage and retrieval. So the (resource-based) arguments against (mostly) solving diacritic combination and language specificity by "look it up" methods have largely evaporated, as far as I can see, while the "figure it out" approach has still not actually succeeded in figuring things out in a general or portable way.

There are still arguments for full decomposition and generative typography based on the complexities of cross-alphabet mapping, searching problems, etc. But software systems are stuck with a complex, irrational and accidental subset of these problems anyhow, because the current system is far from being based on full decomposition.

In sum, I'm convinced that the Unicode designers blew it, way back when, by insisting on maximizing generative typography except when muscled by an economically important country. Either of the two extremes would probably have converged on an overall solution more quickly. But it's far too late to change now. So what are the prospects for eventually "figuring it out" for the large fraction of the world's orthographies whose cultures have not had enough clout to persuade the Unicoders to implement a "look it up" solution for them? As far as I can tell, Microsoft has done a better job of implementing complex rendering in its products than any of the other commercial players, though the results are still incomplete. And there is some hope that open-source projects such as Pango will allow programmers to intervene directly to solve the problems, at least partially, for the languages and orthographies that they care about. But this is a story that is far from over.

Posted by Mark Liberman at 11:56 AM

Bluffhead

In the April 2004 Scientific American, Dennis Shasha's Puzzling Adventures column discusses the game of Bluffhead. See this post for links to other entertaining discussions of dynamic and epistemic logic.

Posted by Mark Liberman at 07:47 AM

A natural boost to the immune system

Here is a medical footnote to Rosanne's researches on booger anaphora.

I feel the need to quote Dave Barry again:

Isn't modern technology amazing? A hundred years ago, if you had told people that some day there would be a giant network of incredibly sophisticated ''thinking machines'' that would allow virtually anybody, virtually anywhere on Earth, to hear a herring cut the cheese, they would have beaten you to death with sticks.

Just substitute "to read a Pakistani newspaper report about an Austrian doctor's speculation that eating snot is good for you" -- or some other amazing example of internet information transmission -- for the phrase in red. The original Ananova report is here, but the Pakistani version has a higher stick factor, in my view.

Posted by Mark Liberman at 07:45 AM

Chatnannies debunked

More bad science reporting, well exposed at waxy.org and Ray Girvan's blog. The over-credulous media this time included New Scientist, BBC News (again!), and Reuters, among others.

Posted by Mark Liberman at 07:41 AM

Postcard from Peking

Er, Beijing. Um, Peiping. Anyhow 北京.

Some people get to go to Las Vegas for business trips. Others of us (in this case me, Richard Sproat, and Chilin Shih) look elsewhere for our linguistic insights, specifically the Northern Capital of the Central Flowery Mountain. Which, now that I think about it, was also built by people with a lot of money and power at ridiculous expense in a location with really awful weather near a desert. Although the weather just now, thank you for asking, is really quite lovely, spring having arrived, all plum blossoms and willow buds and other Asian cliches.

I can tell you that the variety of food is much better now in Beijing than in my student days (my vague memories of that period seem to involve a lot of watermelon. Watermelon, cabbage, gruel, dumplings. And watermelon. And did I mention the gruel? Not to imply that I'm not fond of gruel, I am, very much, but in those days it was the really boring kind of gruel, not, say, the nice Hong Kong kind with the dried scallops and pig parts.) Anyhow, this isn't watermelon season, but I did get my fill of dumplings, which were quite excellent; I can especially recommend the fennel dumplings (hui xiang jiao zi 茴香饺子). In the last five years or so, it seems, Sichuan food has become very hip in the capital, and Richard and I ate (and saw signs everywhere else for) the well-known "shui zhu yu" 水煮鱼, fish which is poached and then marinated and served in really astonishingly "numbing and hot" ("ma la" 麻辣) oil, numbing by means of massive quantities of Sichuan pepper ("huajiao", Xanthoxylum piperitum, fagara pepper), import of which has, I gather, been recently banned in the United States, which makes replicating the recipe (especially the "massive quantity" part) difficult here in the States, and indeed, may cause the gastronomic semantics of "Sichuan restaurant" in the US to change wildly in the next decade.

There. I got the word "semantics" into that last sentence, which makes this a legitimate language log post. Besides, as further evidence of linguistics at work (albeit linguists at play) we visited some products of what might be called "Ming Dynasty Applied Speech Science": the famous Echo Wall at the Temple of Heaven park, the Three Sounds Stone and the amplifying platform on the Round Altar, presumably all cases of architectural acoustics designed to give a little magical extra to whatever it is that Emperors say upon ritual harvest occasions.

But what makes it even more legitimate is the following tidbit, which arises from a visit that Richard, Chilin, their daughter Lisa, and I made to what is now called Prince Gong's residence. (This is one of the very many estates that claim, in a sort of Chinese version of "George Washington Slept Here", the honor of inspiring what many, including yours truly, consider The Greatest Novel Ever Written, Cao Xue Qin's Story of the Stone. For those of you who have somehow managed to miss this, I recommend the really astonishingly unfaithful but nevertheless incomparably wonderful translation by David Hawkes and John Minford).

Where was I? Oh yes. In the Qing dynasty. Now as legend has it (and I checked this on the web, so it must be true), it came to pass when the great Kang Xi emperor 康熙 (1622ish) was only sixteen that his grandmother fell ill. Kang Xi thereupon got brush, ink, and paper, and drew a large (2-foot-ish high) character, the word 'Fu' 福 , "fortune, well-being", and sent it to her. This was no ordinary Fu 福. No, Kang Xi managed in the cursive Fu-swirls to build in the character for "long life" (shou) as well, and indeed later scholars have identified in its lovely brush-strokes the characters for "child" (zi), "long life" (shou), "fields" (tian), "money" (cai), "more" (duo), plus a dot (dian) hence carrying the hidden meaning "More children, more money, more land, more life, more Fu, and a little more". As soon as she received this magical Fu 福, Kangxi's grandmother's health improved, whereupon Kang Xi knew that his calligraphy had magical powers. He therefore commanded that a large stone be brought (note to confused readers: this stone has nothing to do with the Story of the Stone mentioned above), and that a copy of his "Fu" 福 calligraphy be carved into the stone. Kang Xi died, and the stone was forgotten for two generations, until He Kun, the evil prime minister of the Qianlong emperor, heard of the magical powers of the "Fu 福", and managed to steal the stone from the court. (yes, yes, the old "evil court minister with magical powers" story. But He Kun was specially evil, and may be the origin of many evil court ministers in a whole bunch of really excellent wuxia (武俠; swordsman/knight-errant/martial arts) novels, such as my favorite, Louis Cha's (Jin Yong) The Deer and the Cauldron (鹿鼎記), also translated by John Minford).

But we digress. To hide the magic Fu 福, He Kun built a special cave in his gardens at his estate north of the palace, and placed the stele there in this special cave. It is not known what magical use He Kun made of the Fu 福 but eventually he died, and his mansion and gardens passed on to other inhabitants, and to make a long story, well, still pretty long, He Kun's estate is none other than Prince Gong's residence, and thus you may guess that the stone has since been found and was seen in person by Richard, Chilin, Lisa, and yours truly.

Zhou Enlai, the premier of China, later called this Fu 福 "the greatest Fu福 in China". According to some souvenirs that Richard, Chilin and I bought, it's in fact "the greatest Fu 福 in the world (天下第一福)", but between you and me, I suspect that this may just be marketing hype.

Here's a really ugly gaudy velveteen souvenir scroll of Kang Xi's magical Fu (yes, yes, this is a picture of a souvenir I actually paid money for, but I promise the real Fu, which is just stone, is much more beautiful, but I couldn't find a picture on the web). If you look really carefully, you can see the "greatest Fu 福 in the world (天下第一福)" part on the right.

A final linguistic tidbit about Fu 福. As all you Chinese speakers out there know, Fu is the character that you often see around New Years, on doors throughout China and Hong Kong, upside down, like this. This is because the word "dao4" means both "upside-down" (written 倒) and "arrives" (written 到), so the visual image of an upside-down Fu would be described verbally as "Fu2 dao4" which would then mean both "upside down Fu" and "fortune arrives". A nice example of a visual-verbal bimodal pun.

p.s. Anyhow, as Richard points out, if nothing else, our trip and this post have together clearly raised the bar on fu.

Posted by Dan Jurafsky at 03:49 AM

March 30, 2004

The Huntington Challenge

Robin Arnette has a long, thoughtful response (on the AAAS's MiSciNet) to Samuel Huntington's "Hispanic Challenge" article from Foreign Affairs. MiSciNet also provides links to six other rebuttals: Daniel Drezner, James Joyner, David Adesnik, The Economist, eRiposte, and the L.A. Times. It's interesting that four of the six are weblogs, and that those four are generally more informative and interesting than the two standard media treatments. No list of pro-Huntington weblogs is provided, though Russell Arben Fox can be found dusting off Herder for the occasion over at Wäldchen vom Philosophenweg.

Alleged Hispanic resistance to learning English is one of Huntington's central claims. Arnette argues against this view, as do most of the other rebutters cited, but it would be nice to see someone take Huntington to task in more factual detail, especially in terms of the alleged contrast between today's Hispanic immigrants and earlier generations of immigrants (this is a hint to Geoff Nunberg, who has composed a post answering this description, but has not yet pulled the trigger...). [Update: his post is here.]

Last month, I cited the contrast between liberal Democrat Huntington and conservative Republican Brooks on this issue. This seems to be one of the many questions on which it's hard to predict views based on location in a one-dimensional political subspace.

Though some things are predictable: Arnette bolsters her argument against Huntington's claims about language with a link to a Boston Globe article hosted on freerepublic.com, despite the fact that the following comments section is a sort of sewer of national stereotypes, nativist prejudices and curious linguistic misconceptions, with a few sensible observations bobbing in the flood.

Posted by Mark Liberman at 07:32 PM

Sunday's Garfield doesn't count

It has occurred to me that people who are prepared to accept the legend from a Garfield cartoon as respectable printed prose (which is plausible enough) might send me Sunday's Garfield strip, which had the eponymous feline glutton saying (over several panels):

I'm so hungry I could eat and eat and eat and eat and eat and eat and eat and eat and eat and eat and eat and eat and eat and eat and eat... But why stop there?

That might appear to be 15 coordinates, a super example to submit in response to my earlier musings.

Unfortunately, this doesn't count. It isn't true coordination. This is coordinative reduplication. The meaning is intensificatory: notice that I could eat and I could eat is just a redundant way to say I could eat, but I could eat and eat means more than that, it means something like "I could eat a whole lot." So I can't count that one.

Posted by Geoffrey K. Pullum at 06:33 PM

Hunting for multiple-coordinate coordination constructions

I'm working with Rodney Huddleston on a textbook-size introduction to English grammar, and I recently came to a passage where we make and illustrate the point that coordinate structures don't appear to have any grammatical limit on the number of coordinate subparts. You get not just two coordinates (Starsky and Hutch) or three (The Good, the Bad, and the Ugly), or four (Bob and Carol and Ted and Alice), but any number. The temptation here is to show this by simply inventing boring examples with larger numbers of coordinates: We invited Bob, Carol, Ted, Alice, and Bruce (5 coordinates), and so on, and we were on the point of doing that, but it seemed to me it would be much better to illustrate with real examples. And it didn't take long to find a source with some real beauties.

You must understand, I'm not leaning toward corpus fetishism, the perverted insistence on using only real examples from a corpus of texts for your illustrations, no matter how much space that might waste. I just thought it would be livelier here to have some real, over-the-top examples of four, five, or six coordinates. And it was not hard to find them. For some reason, remembering some rich, ripe use of the English language, I took down from my shelf Lawrence Levine's The Opening of the American Mind. A quote I saw there led me to take down the book next to it, the one Levine is responding to: Allan Bloom's long, gloomy, preposterous jeremiad on everything wrong with American students, The Closing of the American Mind (1987). I really hit paydirt there. The extended polemic against rock music turned out to be particularly rich. These examples are all from pages 74 to 78:

There is room only for the intense, changing, crude and immediate. [4 coordinates]
People of future civilizations will wonder at this and find it as incomprehensible as we do the caste system, witch-burning, harems, cannibalism, and gladiatorial combats. [5 coordinates]
Nothing noble, sublime, profound, delicate, tasteful or even decent can find a place in such tableaux. [6 coordinates]

Great stuff. When Bloom gets going, he really loses it, the old fool. His excess of rhetoric is as masturbatory as the state he claims rock music gets young people into. How did his ridiculous book ever become a best-seller? I don't know. But I cherish it as a fund of examples.

I'm now wondering if I could find Bloom using a 7-coordinate example. And I'm wondering about what might be the largest number of coordinates ever recorded in an attested example from broadly respectable printed prose.

Gosh, if I muse aloud like this, people may start emailing them to me. All they have to do is realize that my login name is probably pullum and that I'm well known to be at UCSC.edu — not that I'd ever reveal that on the web for fear of spambots.

Posted by Geoffrey K. Pullum at 06:31 PM

In memoriam Larry Trask

We are deeply saddened to report that Larry Trask, a distinguished historical linguist and student of Basque, has passed away after a long illness. He made a strong and positive impression, not merely intellectual but personal, even on those who knew him only through his writing and correspondence. His Basque Language page contains much information about this often misunderstood language, including an excellent section on Prehistory and connections with other languages.

Here is an obituary written by his colleague Richard Coates at the University of Sussex, and here is an obituary in the newspaper Euskadi en el Mundo. Here is an interview with him published last summer in The Guardian.

Posted by Bill Poser at 06:08 PM

The first self-writing weblog

Check out R. Robot ("He's the only columnist I'll read" -- Ann Coulter), and then the many fine links (and ideas!) in Cosma Shalizi's post on the topic. While you're there, scroll down for Cosma's recipe for miwa naurozi to celebrate the Afghan new year.

I tried the interactive feature, supplying "Geoff Pullum" as the requested name, which yielded this post (though permalinks don't seem to work on the site), beginning "Just what was Geoff Pullum trying to say yesterday?" and ending "There's Geoff Pullum at the Commonwealth Club in San Francisco, making such inexplicable and execrable claims as, "Maybe we could get Iraq straightened out first," as he put it last week, and suggesting (with the internecine insouciance and contemptibly vile treachery that is his trademark, wont and fashion) that George Bush's moral leadership is for the purpose of votes."

I also recommend Newt Gingrich's memo "Language: a Key Mechanism of Control", which R. Robot cites as a source of inspiration and word lists. The link on R. Robot's index page appears to be broken, and for some reason the only copies I could find on line are on anti-Republican sites, who seem to find the memo more inspirational than GOP partisans do. Or maybe they don't need it anymore, I don't know.

Posted by Mark Liberman at 02:42 PM

Ten leading results in 20th century linguistics?

Lauren Slater's new book "Opening Skinner's Box", as described in this review by Peter Singer, sounds interesting:

The idea behind Lauren Slater's book is simple but ingenious: pluck 10 leading experiments in 20th-century psychology from the pages of the scientific journals in which they were first published, dust off the painfully academic style in which they were written up, add some personal details about the experimenters and retell them as intellectual adventures that help us to understand who we are and what our minds are like.

Now, it's clear that there are some issues about the actual content here. Slater has been accused of misunderstanding or misrepresenting some of the research she discusses, as well as some of her interviews with psychologists. See this Guardian review for some discussion, and look here for letters of complaint to her publisher from several of the psychologists whose interviews she described in the book, and here for an extended critique of a recent Guardian piece by Slater presenting material from one of the book's chapters. And according to this story, Deborah Skinner is suing over the way her upbringing (by B.F. Skinner) and its consequences are depicted in the book, for reasons she discusses in a Guardian piece entitled "I was not a lab rat." It sounds like psychology is not more reliably depicted by its popularizers than linguistics is.

I'm also not wild about the overall slant of Slater's choice of experiments (as described in the reviews -- I haven't read the book). She focuses on clinical issues, especially psychological damage allegedly due to bad parents and other authority figures. I don't have any problem with her choices taken individually -- all are interesting at least in a sociological sense, and most are scientifically interesting too. But her interest in mental health problems excludes neat (though less fraught) stuff like Fitts' Law (relating time, distance and target size for aimed movements), or the Rescorla-Wagner model of classical conditioning. This is a matter of taste, and her tastes are no doubt more popular than mine would be.

Anyhow, I like the "ten great experiments" concept. Not the "top ten" -- it's silly to try to map everything onto a single dimension of evaluation -- just a limited set of especially interesting and important things. As I was walking back from class this morning, I spent a few minutes thinking about what I'd pick as ten leading pieces of work in 20th-century linguistics.

I had no trouble coming up with a list -- the biggest problem is to trim it to ten -- and I'll tell you what it is in a later post. I'd be curious to hear what other people's suggestions are as well, so feel free to send me your ideas by email.

[Update 4/13/2004: The NYT has noticed the fuss about Slater's veracity, after Peter Singer totally missed it in his 3/18/2004 review. It's odd that he did so, since he himself notices that "Slater makes some errors that made me wonder about her accuracy in areas with which I am not familiar." The information was easy to find on the web a month ago. I guess he may have written the review before Deborah Skinner's 3/12/2004 Guardian piece appeared, but was it before Ian Pitchford's 3/2/2004 posting on psychiatry-research, or the late-February weblog posts by folks like Rivka? As a professor at Princeton, Singer doubtless knows how to research a subject; as a best-selling author, I bet he has assistants who can do it for him; this is supposed to be an area of expertise for him; I found everything cited here just by idly googling "Laura Slater"; was this really "due diligence"?]

Posted by Mark Liberman at 01:56 PM

Cartoons of the day

A Gricean evergreen, Pirates vs. Philosophers, accent and identity, and lexical innovation from "HER! Girl vs. Pig".

Posted by Mark Liberman at 08:47 AM

Jeniffer afficionados

Continuing the discussion of English orthographic gemination, Bill Poser observes that he sometimes finds himself writing "Jeniffer". This is not an experience that rings a bell for me, but Bill is clearly in tune with the zeitgeist, or anyhow the Jennifergeist:

	f	ff
n	869,000	481,000
nn	15,400,000	120,000

Keith Ivey emailed to point out that "[t]wo accepted variant spellings of words borrowed from Spanish provide examples of an added geminate and a lost one: afficionado [and] guerilla." And notice that the result in each case is consistent with the orthographic pattern seen in the contingency tables for Attila, Karttunen and Jennifer: a preference for a single consonant paired in an adjacent syllable with a double one, in either order.

Qov emailed to say that "The ones I have to watch for are parallel, accelerate and tomorrow. I don't quite understand how the numbers in the tables prove your thesis, but Google finds many more tommorows than tomorows and many more paralells than paralels."

Indeed, and also consider the relative paucity of "tommorrows". Here is the contingency table for tomorrow, which shows basically the same pattern that we've seen before:

	r	rr
m	67,700	14,300,000
mm	228,000	189,000

The case of variants for "parallel" is somewhat different, because there are apparently three different consonants involved to some extent in the confusions, and two of them are L's:

FORM	ghits
paralel	162,000
paralell	65,700
parallel	13,800,000
parallell	94,500
parralel	8,700
parralell	2,200
parrallel	31,200
parrallell	475

The analysis here is a bit more complicated -- maybe later, I have a grant proposal to write. I'll also see if I can find another, more accessible way to come at the explanation of the statistical analysis of contingency tables, to supplement the one I provided here.
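In the meantime, one rough way to read the two-by-two tables above (Jennifer and tomorrow) is the odds ratio. The Python sketch below is only one standard summary, not necessarily the analysis referred to in the earlier post, and the function name odds_ratio is a hypothetical helper introduced for illustration. Google hit counts are also not independent observations, so the figures should be taken as descriptive rather than as a significance test. An odds ratio well below 1 means that doubling one consonant goes with not doubling the other, which is the "one single, one double" pattern.

```python
import math


def odds_ratio(table):
    """Odds ratio of a 2x2 table given as ((a, b), (c, d)).

    Rows: first consonant single / doubled.
    Columns: second consonant single / doubled.
    """
    (a, b), (c, d) = table
    return (a * d) / (b * c)


# Google hit counts copied from the tables in this post.
jennifer = ((869_000, 481_000),      # n:  f, ff
            (15_400_000, 120_000))   # nn: f, ff
tomorrow = ((67_700, 14_300_000),    # m:  r, rr
            (228_000, 189_000))      # mm: r, rr

jennifer_or = odds_ratio(jennifer)
tomorrow_or = odds_ratio(tomorrow)

for name, value in (("Jennifer", jennifer_or), ("tomorrow", tomorrow_or)):
    # The log odds ratio is negative when the two doublings repel each other.
    print(f"{name}: odds ratio {value:.4f}, log odds ratio {math.log(value):.2f}")
```

Both tables come out far below 1, so in each word the spellings with exactly one doubled consonant are heavily over-represented relative to what independent doubling would predict.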

Posted by Mark Liberman at 08:19 AM

Saskatoon

Writing about the activities of the University of Saskatchewan Library reminded me of a joke. Since Geoff hasn't posted any bad linguistics jokes in quite a while, and most of our readers probably don't get much exposure to Canadian humour, I thought I'd tell a Saskatchewan joke.

Two Canadians, sick of the rat race, went to a travel agent and asked her to book them to the remotest place she could get them to by commercial air. Twenty-four hours later, they staggered off a plane in Alice Springs, Australia. Tired and thirsty, they headed for the nearest pub. It was obvious to the locals that they had come from somewhere distant, which led to much speculation. Finally, one of the locals said: "Let's settle this. I'll go over and ask them". He went over to their table and asked: "Where are you folks from?" They answered, "Saskatoon, Saskatchewan". When the local returned to his table, the others asked him: "So where are they from?" He answered: "I couldn't find out. They don't speak English."

Posted by Bill Poser at 01:10 AM

March 29, 2004

The Kamloops Wawa

The University of Saskatchewan Library recently acquired a full run of the Kamloops Wawa, a newspaper published primarily in Chinook Jargon between 1891 and 1923 in Kamloops, British Columbia. The information about the exhibit that the library put on to celebrate the new acquisition contains images of several pages.

Chinook Jargon is a pidgin based primarily on Chinook and Nuuchanulth (Nootka) that served as a trade language throughout the Pacific Northwest. Very few settlers learned the native languages, such as Secwepmectsín (Shuswap), the native language of the area around Kamloops, so Chinook Jargon played a major role in communication between settlers and native people.

The Kamloops Wawa was published in a French shorthand known as the Duployé shorthand, which the Oblates of Mary Immaculate had decided was the easiest way to write the various native languages that they dealt with in Southern British Columbia. They used this writing system not only for Chinook Jargon but for English, French, Latin, Lillooet, Secwepmectsín (Shuswap), and Nlaka'pamux (Thompson). Here is the first page of the Shushwap Manual or Prayers, Hymns and Catechism, in Shushwap published at Kamloops in 1906.

Duployé shorthand was a good writing system for the languages whose sound systems it was designed for, such as English. Indeed, because it was easier to write English in Duployé shorthand, which had no arbitrary spellings, than in the usual English spelling with which we are still encumbered, the Oblates encouraged settlers to learn it as a stepping-stone to English literacy. It was a less than adequate way of writing the native languages since it did not provide enough letters for all of their sounds.

Posted by Bill Poser at 06:44 PM

The perils of degemination

In response to my recent post on conservation of gemination, Stefano Taschini has sent a stunning message that weaves together the themes of phonology, art, religion and female genitalia.

The point of my original piece was that English speakers sometimes seem to remember that a word like Attila has a double consonant in it somewhere, but get confused about just where it is. Stefano gives several other examples of the same sort, observing for example that "the differential equation studied by Jacopo Francesco Riccati registers about a thousand Google hits as 'Ricatti equation' (which is particularly disturbing, considered that 'ricatti' is the Italian for 'blackmail')". He also brings up some unexpected intrusions of (orthographic) gemination, asking how it happened that "the italian word 'regata' entered English as 'regatta'", and noting that there are 12,900 Google hits for "Gallileo."

The truly shocking news (for those of us who don't know Venetian slang) is at the end of his note:

A case of its own is the famous painting by Leonardo da Vinci, allegedly portraying a certain Monna Lisa (where Monna is the contraction of Madonna, i.e. My Lady) and known in the English-speaking world as Mona Lisa. Now, in the whole north-east of Italy, including Venice, "mona" is a rather obscene word denoting female pudenda, and, not unlike similar words in English, can be used by synecdoche to denote a woman. Referring to "Mona Lisa" in Venice can attract rather amused (or shocked) looks.

The Italian Wikipedia page for Monna Lisa includes the geminate, but the Dutch, German, Swedish and Hebrew Wikipedia pages on the same topic have only one N (or equivalent letter). French and Romanian of course have La Joconde and Gioconda respectively.

This merits a closer look, I can see. More later.

[Update: as for regatta, the OED blames it on the Italians, giving the etymology [It. (Venetian) regatta (and regata) 'a strife or contention or struggling for the maistrie' (Florio): hence also F. régate.] The earliest citation is late 17th century: 1652 S. S. Secretaries Studie 265 The rarest [show] that ever I saw, was a costly and ostentatious triumph, called a Regatto, presented on the Grand-Canal.

It's true that there is no regatta in contemporary Italian; was the 17th-century borrowing Regatto just a mistake?

I should also note that some northern Italian varieties don't have phonological geminate consonants at all, as I understand it. But perhaps those are the northwestern dialects.]

[Update 3/30/2004: Des Small emailed this additional information about geminates in Venice:

I went to Venice last year for a conference (and accordingly saw approximately none of its glorious patrimony) and I took the Lonely Planet Italian phrasebook with me, so I could buy bus tickets (which is slightly non-trivial, as they are sold only in tobacconists' shops and never on buses), and I seem to remember it saying that Venetian dialect had _no_ geminates.

Given that I knew then exactly enough Italian to buy bus tickets and know less now, and I am not by any means a phonologist, that's also what I heard on the Venetian streets.

But the Internet agrees with me;
http://www.netaxs.com/~salvucci/ITALdial.html says:

"Double consonants are to some extent singularized in Venetian: el galo (il
gallo), el leto (il letto); note also the use of the masculine article el
(il)."

while http://www.veneto.org/language/index.asp says

"[...] Venetian (spoken in Venice, Mestre and other towns along the coast).
It has 24 phonemes, seven vowels and 17 consonants; original Latin
plosives are softened and voiced and often disappear entirely; no double
consonants can be found;"

Maybe Venetians reverentally resort to deobscenifying diglossia in cases of artistic appreciation; it can surely hardly be that the Internet is wrong!

So if Stefano is correct that "[r]eferring to 'Mona Lisa' in Venice can attract rather amused (or shocked) looks" -- and surely he must know -- then perhaps the references in question are in writing; or perhaps Des is right about facultative diglossia for aesthetic purposes; or perhaps the Venetians would be just as amused (or shocked) by references to 'Monna Lisa', if they should happen to hear any.]

Posted by Mark Liberman at 03:19 PM

Perl dictionary hacking

There's an interesting-looking article at perl.com by Sean Burke on how to render a dictionary represented in Shoebox format. I think that Burke's introduction rather exaggerates the general cluelessness of field linguists, many of whom are capable programmers themselves, or have previously teamed up with programmers to do similar things; but the article (which I haven't had time to read carefully yet) looks like it offers a good tutorial on how to use HTML or RTF to render a simple dictionary database for printing or on-screen reading.

As some of the (many available) examples of prior (and perhaps better) art, take a look at Bill Poser's lecture notes on extracting fields from Shoebox dictionaries using AWK (which unlike Burke's program, handles the case where there are repeated tags within an entry), or his paper "Lexical Databases for Carrier", or his "Poor man's Web Dictionary", which provides a working example of a simple pure HTML (no CGI, no database) lexicon generated automatically from a Shoebox database, together with the code necessary to generate it. Although simple, it includes audio and images.

Posted by Mark Liberman at 12:46 PM

Google's print edition

If you have plenty of bookshelf space, you may be interested in Google's print edition.

The quantitative side of the ad is a bit under-researched, even for a joke. In particular, the claim that "Google's 36,795 volumes will be ten times larger than the unabridged Oxford English Dictionary" seems simultaneously to attribute far too few volumes to Google and far too many to the OED.

Posted by Mark Liberman at 06:57 AM

Conservation of (orthographic) gemination

Lauri Karttunen once remarked to me that Americans, who misspell his last name a lot, render it as "Kartunnen" more often than as "Kartunen". That is, rather than just omitting the doubled letter T, they substitute a doubled letter N instead. This is not a mistake that any native speaker of Finnish is likely to make, but non-Finns seem to remember that there's a double letter in there somewhere, even if they aren't very sure where it is.

I thought of this the other day, because in a post about Attila the Hun, in which the name "Attila" occurred half a dozen times, I misspelled it once as "Atilla". I noticed the error and corrected it, even before Geoff Pullum did. But meanwhile, David Pesetsky had emailed me with important movie lore. He first copied my error, and then immediately corrected himself: "Did I really just spell Attila with one T and two L's? I do know better." Well, both of us do, but our pattern of typos still exhibited Lauri's hypothesized conservation of gemination.

Despite Lauri's many contributions, I feared that the name Karttunen would not occur often enough on the internet to check his intuition statistically. But Attila is another matter.

When I queried Google a few days ago, I got the following page counts:

String	Ghits
"atila the hun"	989
"attila the hun"	43,300
"atilla the hun"	9,400
"attilla the hun"	2,400

I didn't go any further with the issue then, but this evening I'm riding Amtrak from Washington to Philly, and so I have a few minutes to play with the numbers.

Arranging the counts in a 2x2 table, and giving the row and column sums as well as the overall total, we get:

	l	ll
t	989	9,400	10,389
tt	43,300	2,400	45,700
	44,289	11,800	56,089

One sensible way to view this set of outcomes is as the results of two independent choices, made every time the word is spelled: whether or not to double the T, and whether or not to double the L. After all, every one of the four possible outcomes occurs fairly often. This is the kind of model of typographical divergences -- whether caused by slips of fingers, slips of the brain, or wrong beliefs about what the right pattern is -- that underlies most spelling-correction algorithms.

In the case of the four spellings of Attila, we can represent the options as a finite automaton, as shown below:

There are four possible paths from the start of this network (at the left) to the end (at the right). Leaving the initial "A", we can take the path with probability p that leads to a single "t", or the alternative path with probability 1-p that leads to a double "tt". There is another choice point after the "i", where we can head for the single "l" with probability q, or to the double "ll" with probability 1-q. In this simple model, the markovian (independence) assumption means that when we make the choice between "l" and "ll", we take no account at all of the choice that we previously made between "t" and "tt".
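The four paths and their probabilities can be enumerated mechanically. Here is a minimal Python sketch of the two-independent-choices model; the function name and the values p = 0.2 and q = 0.8 are purely illustrative assumptions, not estimates from the Google counts.

```python
from itertools import product

def spelling_probabilities(p, q):
    """Probability of each spelling when the two choices are independent.

    p is the probability of a single "t", q the probability of a single "l".
    """
    t_choices = {"t": p, "tt": 1 - p}
    l_choices = {"l": q, "ll": 1 - q}
    return {
        f"a{t}i{l}a": p_t * p_l
        for (t, p_t), (l, p_l) in product(t_choices.items(), l_choices.items())
    }

# Illustrative parameter values only.
probs = spelling_probabilities(p=0.2, q=0.8)
for spelling, prob in probs.items():
    print(f"{spelling}\t{prob:.2f}")
```

Each path's probability is just the product of the two branch probabilities, which is exactly what the independence assumption amounts to.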

But are these two choices independent in fact? If Lauri was right about the "conservation of gemination", then the two choices are not being made independently of one another. Writers will be less likely to choose "ll" if they've chosen "tt", and more likely to choose "ll" if they've chosen "t".

There are several simple ways to get a sense of whether the independence assumption is working out. Maybe the easiest one is to note that in the model above, the predicted string probabilities for the four outcomes are

	l	ll
t	pq	p(1-q)
tt	(1-p)q	(1-p)(1-q)

This makes it easy to see that (if the model holds) the column-wise ratios of counts should be constant. In other words, if we call the 2x2 table of counts C, then C(1,1)/C(2,1) (i.e. atila/attila) should be pq/((1-p)q) = p/(1-p), while C(1,2)/C(2,2) (i.e. atilla/attilla) should be (p(1-q))/((1-p)(1-q)) = p/(1-p) also. We can check this easily: atila/attila is 989/43,300 = .023, while atilla/attilla is 9,400/2,400 = 3.9.

The same sort of thing applies if we look at the ratios row-wise: C(1,1)/C(1,2) (i.e. atila/atilla) should be pq/(p(1-q)) = q/(1-q), while C(2,1)/C(2,2) (i.e. attila/attilla) should be ((1-p)q)/((1-p)(1-q)), or q/(1-q) also. Checking this empirically, we find that atila/atilla is 989/9,400 = .105, while attila/attilla is 43,300/2,400 = 18.0.

Well, .023 seems very different from 3.9, while .105 seems very different from 18.0. But are they different enough for us to conclude that the independence assumption is wrong? or could these divergences plausibly have arisen by chance?
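For anyone who wants to redo the arithmetic, here is a small Python sketch that recomputes the column-wise and row-wise ratios from the four Google counts, and also the counts we would expect if the two choices really were independent (row total times column total, divided by the grand total). The function names are my own; only the four counts come from the table above.

```python
# Google page counts, keyed by (t-choice, l-choice).
counts = {
    ("t", "l"): 989,      # atila
    ("t", "ll"): 9400,    # atilla
    ("tt", "l"): 43300,   # attila
    ("tt", "ll"): 2400,   # attilla
}

def column_ratios(c):
    """atila/attila and atilla/attilla; equal under independence."""
    return c[("t", "l")] / c[("tt", "l")], c[("t", "ll")] / c[("tt", "ll")]

def row_ratios(c):
    """atila/atilla and attila/attilla; equal under independence."""
    return c[("t", "l")] / c[("t", "ll")], c[("tt", "l")] / c[("tt", "ll")]

def expected_under_independence(c):
    """Expected cell counts given the observed row and column totals."""
    total = sum(c.values())
    row = {t: sum(v for (ti, _), v in c.items() if ti == t) for t in ("t", "tt")}
    col = {l: sum(v for (_, li), v in c.items() if li == l) for l in ("l", "ll")}
    return {(t, l): row[t] * col[l] / total for (t, l) in c}

print(column_ratios(counts))
print(row_ratios(counts))
for cell, value in expected_under_independence(counts).items():
    print(cell, round(value, 1))
```

Comparing the expected counts with the observed ones shows the direction of the deviation: the two spellings with exactly one doubled letter are over-represented, and the spellings with zero or two doubled letters are under-represented.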

The exact test for this question is called "Fisher's Exact Test" (as discussed in mathworld, and in this course description for the 2x2 case). If we apply this test to the 2x2 table of "attila"-spelling data, it tells us that if the underlying process really involved two independent choices, the observed counts would be this far from the predictions with p < 2.2e-16, that is, less than about once in 4.5 quadrillion times. In other words, the choices are not being made independently!
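Here is a rough, standard-library-only Python sketch of a two-sided Fisher's exact test for a 2x2 table. It is not the software I actually used, and the function names are invented for illustration. Because the probabilities involved are far too small for ordinary floating point, the sketch works with logarithms throughout and returns log10 of the p-value rather than the p-value itself.

```python
from math import exp, lgamma, log

def log_choose(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)

def log_table_probability(a, row1, row2, col1):
    """Log hypergeometric probability that the top-left cell equals a,
    given fixed row totals (row1, row2) and first-column total col1."""
    return (log_choose(row1, a) + log_choose(row2, col1 - a)
            - log_choose(row1 + row2, col1))

def fisher_exact_log10_p(table):
    """log10 of the two-sided p-value: the summed probability of every
    table with the same margins that is no more probable than the observed one."""
    (a, b), (c, d) = table
    row1, row2, col1 = a + b, c + d, a + c
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    log_observed = log_table_probability(a, row1, row2, col1)
    tail = [
        lp
        for lp in (log_table_probability(x, row1, row2, col1)
                   for x in range(lowest, highest + 1))
        if lp <= log_observed + 1e-7  # small tolerance for rounding error
    ]
    peak = max(tail)
    # log-sum-exp, so that tiny probabilities do not all underflow to zero
    return (peak + log(sum(exp(lp - peak) for lp in tail))) / log(10)

attila = [[989, 9400], [43300, 2400]]
print(fisher_exact_log10_p(attila))
```

A result far below log10(2.2e-16), which is about -15.7, agrees with the reported bound; the bound itself tells us only that the true value is smaller than the software was willing to print.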

The direction of the deviations from the predictions also confirms Lauri's hypothesis -- writers have a strong tendency to prefer exactly one double letter in the sequence, even though zero and two do occur. Given that the two-independent-choices model is obviously wrong, there are other questions we'd like to ask about what is right. But with only four numbers to work with, there are too many hypotheses in this particular case, and not enough data to constrain them very tightly.

However, there's a lot of information out there on the net, in principle, about what kinds of spelling alternatives do occur, and what their co-occurrence patterns look like. The key problem is how to tell that a given string at a given point in a text is actually an attempt to spell some specified word-form. We've solved that problem here by looking for patterns like "a[t]+i[l]+a the hun" (not that Google will let us use a pattern like that directly, alas). In other cases, we would have to find some method for determining the intended lemma and morphological form for a given (possibly misspelled) string in context. This is not impossible but the general case is certainly not solved, or spelling correction programs would be much better than they are.
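Google won't run such a pattern, but on a local collection of text it is simple enough to apply. The following Python sketch uses the standard `re` module to find the variants and tally them by whether each letter is doubled; the sample sentence and the names `ATTILA_VARIANT` and `tally_spellings` are made up for illustration.

```python
import re
from collections import Counter

# Same idea as "a[t]+i[l]+a the hun", with capture groups for the two runs.
ATTILA_VARIANT = re.compile(r"\ba(t+)i(l+)a the hun\b", re.IGNORECASE)

def tally_spellings(text):
    """Count matches, keyed by (t is doubled, l is doubled)."""
    tally = Counter()
    for match in ATTILA_VARIANT.finditer(text):
        t_run, l_run = match.group(1), match.group(2)
        tally[(len(t_run) > 1, len(l_run) > 1)] += 1
    return tally

# Invented sample text, not real web data.
sample = "Attila the Hun, Atilla the Hun, attila the hun and atila the hun."
tally = tally_spellings(sample)
print(tally)
```

The hard part, as noted above, is not the pattern matching but knowing that a given string was meant as the word in question; here the following "the hun" does that work for us.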

[Update: I was completely wrong about the possibility of checking this idea with web counts of the name Karttunen and its variants. We have
Karttunen 57,500
Kartunnen 3,330
Karttunnen 156
Kartunen 628
or in tabular form

	n	nn
t	628	3,330
tt	57,500	156

There is a small problem: many of these are actually valid spellings of other people's names (even if historically derived from spelling errors at Ellis Island or wherever), rather than misspellings of Karttunen. Still, the result also supports Lauri's hypothesis, and I have no doubt that it would continue to do so if the data were cleaned up.]
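One compact way to compare the two tables is the odds ratio, (C(1,1) x C(2,2)) / (C(1,2) x C(2,1)), which would be about 1 if the two doubling choices were independent. The Python sketch below is only an illustration of that calculation, with a function name of my own choosing; values far below 1 mean that spellings with exactly one doubled letter are favored.

```python
def odds_ratio(table):
    """Odds ratio of a 2x2 table [[a, b], [c, d]]: (a*d) / (b*c)."""
    (a, b), (c, d) = table
    return (a * d) / (b * c)

# Rows: single vs. double t; columns: single vs. double l (or n).
attila = [[989, 9400], [43300, 2400]]
karttunen = [[628, 3330], [57500, 156]]

print(f"Attila:    {odds_ratio(attila):.5f}")
print(f"Karttunen: {odds_ratio(karttunen):.5f}")
```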

Posted by Mark Liberman at 12:58 AM

March 28, 2004

Speech Accent Archive

The Speech Accent Archive at George Mason University is a neat idea. But it seems to be based on the premise of providing one exemplar of each place of origin -- for countries where there are multiple speakers, each one is identified as being from a unique city or town. This makes it less than optimally useful for studies of variable phenomena, for studies of things that depend on level of experience with English, and so on.

It'd also be nice to be able to get the audio in a convenient form, for further analysis. The QuickTime .mov format in which they're stored is not among the more widely recognized formats, at least by audio analysis programs.

Posted by Mark Liberman at 04:54 PM

Bad named entity algorithms at the Gray Lady?

The first paragraph of a story in today's NYT by David Carr, entitled "Casting Reality TV becomes a Science", reads, in the online version, like this:

In a suite high above Columbus Circle, Rob LaPlante is looking for next season's breakout television star. There is no agent hovering nearby, no technical crew, just Mr. LaPlante, his assistant and a digital video camera, auditioning Laura Fluor, a car saleswoman from Monmouth County, N.J.

The hyperlink on Laura's last name "Fluor" leads to a page about the Fluor Corporation on the NYT business site, giving us the standard NYT "Company Research" treatment: share price and price history information, a thumbnail description of the company's business ("The Group's principal activities are to provide professional services on a global basis in the fields of engineering, procurement, construction and maintenance...") a list of the latest insider trades, and so on. A similar page is available for any company traded on the major stock exchanges.

There is absolutely nothing in the original Carr article to lead us to believe that Laura Fluor has anything at all to do with the Fluor Corporation. I can't imagine that the writer, an editor or even any human hyperlinker would think that this link was appropriate. So either someone is having a little joke, or the NYT's online site is running some company-name-recognition software that needs work. The state of the art for "entity tagging" is far from perfect, but it's better than this.

Posted by Mark Liberman at 04:21 PM

And yet.

How many times does a word or phrase need to be repeated in order to seem characteristic of a speaker or author? I think that the answer is "not very many times, maybe only once or twice, if the use in context is salient enough".

If this is true, then the kind of statistical stylistics that David Lodge worried about will not be adequate to uncover these associations. Raw frequencies certainly will not work, since these words or phrases may only be used a couple of times, or at least will only have been used a couple of times at the point where we start to associate them with the writer or speaker. Simple ratios of observed frequencies to general expectations will not work either, because at counts of two or so, such tests will pick out far too many words and phrases whose expected frequency over the span of text in question is nearly zero. As readers and listeners, we mostly ignore these cases, attributing them to the influence of topic or to random noise in the process. To model human reactions in such cases, we need to be able to discount the effects of topic, and perhaps also to understand better, in some other ways, what makes the use of a word or phrase stylistically striking or salient.

I recently came across an example of this phenomenon in the climactic scene of Jennifer Government, a satirical SF novel by Max Barry. This is the end of chapter 84, p. 313, where the book's eponymous heroine Jennifer Government arrests the arch-villain, her former lover John Nike:

... "John Nike, you are under arrest for the murder of Hayley McDonald's and up to fourteen other people."
         "What? What?"
         "You will be held by the Government until the victim's families can commence prosecution against you." She hauled him up and marched him towards the escalators. He was a pain to move. His legs kept slipping out from under him, as if he was drunk.
         "You're arresting me? Are you serious? I don't belong in jail!"
         "And yet," she said.

When I read this passage, I recognized that "and yet" -- as a phrase by itself, with the continuation left unspoken -- was an expression characteristic of the character "Jennifer". I couldn't remember any specific instances from earlier in the book where she had used the expression, though I did feel that one of them had been in a conversation with her four-year-old daughter Kate.

Courtesy of amazon.com's search function, I can easily find out how often the expression occurs elsewhere in the book. The answer is "twice". The first is indeed part of Jennifer's effort to get her daughter up in the morning, on p. 170:

Kate's eyes opened, then squeezed closed. "I'm tired..."
"It's time to get ready for school."
"I don't want to."
"And yet," she said.

The second instance is in the context of a government raid on General Motors' London headquarters (p. 195):

In a way, Jennifer felt bad, busting into such a nice place in full riot gear and scaring the crap out of everybody. But in another, more accurate way, she enjoyed it a lot. She collared a scared-looking receptionist and read out her list of target executives. "Where are they?"
         "They're--different floors. Four, eight and nine."
         "Three teams!" Jennifer said. "I'll take level nine. Meet back here."
         "You can't go up there!" the receptionist said, horrified. "This is private property! You can't!"
         "And yet," Jennifer said. She hit the stairs. She found her target by striding down the corridor and barking out his name: when a man popped his head out of an office, she cuffed him. It was much easier than she'd expected.

It seems fairly easy to explain post hoc why the phrase "and yet" as a sentence in itself should trigger our linguistic novelty detectors -- the words in this case are clearly free of topic-specific content, and the bigram "and yet" at the end of sentence, written without continuation dots, is much rarer than would be predicted given its overall frequency and the frequency of sentence-ends. However, I suspect that a scan for bigrams with quantitatively similar properties would turn up lots of unremarkable examples, and that other examples of passages evoking a similar psychological reaction might not yield as easily to simple frequentistic analysis, even post hoc.

This reminds me of Josh Tenenbaum's analysis of generalization from very small sets, down to sets of size one. It would be interesting to try an analogous approach here. A more strictly analogous problem would be inferring the 'sense' of a word or phrase from a single use in context. This is related to the point under discussion here, I think, since in many cases we seem to identify a speaker or writer's lexical habit from a couple of uses, in part by concluding that those uses constitute a novel (or at least unusual) sense.

I should point out that Max Barry (the author of Jennifer Government) tries to salt the mine, so to speak, by having the little girl in the first passage cited above respond "Mommy, I hate it when you say 'And yet.'", thus trying to clue us in overtly to his intentions. I don't think this is necessary or even effective -- I don't think it had any effect on my reactions in this case.

And yet.

No, the context isn't quite right for this to be a valid instance of Jennifer's little verbal tic, as established by the three examples in the novel. In fact, I think that any one of those examples would probably do as an adequate basis for lexicographic generalization, suggesting that my use in the preceding paragraph, though plausible enough, is not the same sense. In some sense.

Posted by Mark Liberman at 02:55 PM

Searching for Santa Cruz

A new service for searching language archives has just been set up on the LDC website. Enter a language name like Warlpiri, to find 41 results in 7 different language archives, ranging from a bunch of primary resources in the Australian Studies Electronic Data Archive, to a paper in the ACL Anthology on "Parsing a Free-Word Order Language." If you use a variant or incorrect spelling of the language name (e.g. Walbiri), the service will direct you to the correct version, thanks to Ethnologue's list of alternate language names, approximate string matching, and various other tricks. Enter a country name to find resources for languages spoken in that country. Search for Santa Cruz (a language of the Solomon Islands) and find Voorhoeve and Wurm's recordings held in the Pacific And Regional Archive for Digital Sources in Endangered Cultures. Now try the same search using Google, to discover a host of irrelevant sites (like the UCSC homepage) and realize the value of having this new service which searches a union catalog of major language archives. Visit LINGUIST List for a more fine-grained interface for searching within the same collection. All this is made possible by OLAC, the Open Language Archives Community...
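To give a flavor of how a variant spelling might be redirected, here is a toy Python sketch that first consults a table of alternate names and then falls back on approximate string matching with the standard library's `difflib`. This is emphatically not the service's actual code: the tiny tables and the function name are hypothetical stand-ins for Ethnologue's much larger list.

```python
from difflib import get_close_matches

# Hypothetical miniature tables; the real service draws on Ethnologue.
ALTERNATE_NAMES = {"walbiri": "Warlpiri"}
CANONICAL_NAMES = ["Warlpiri", "Yoruba", "Santa Cruz"]

def resolve_language_name(query):
    """Return the canonical language name for a query, or None."""
    key = query.strip().lower()
    if key in ALTERNATE_NAMES:
        return ALTERNATE_NAMES[key]
    by_lowercase = {name.lower(): name for name in CANONICAL_NAMES}
    if key in by_lowercase:
        return by_lowercase[key]
    # Fall back on approximate matching for misspellings.
    close = get_close_matches(key, list(by_lowercase), n=1, cutoff=0.6)
    return by_lowercase[close[0]] if close else None

print(resolve_language_name("Walbiri"))    # found in the alternate-name table
print(resolve_language_name("Warlpirri"))  # found by approximate matching
```

The cutoff of 0.6 is simply `difflib`'s default; a real service would presumably tune this and combine it with the "various other tricks" mentioned above.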

Back in October Mark Liberman wrote: "One thing I'd like to understand better is the relationship to the Open Archives Initiative and the Open Language Archives Community. Steven?" (Another scientific revolution?). Later Mark gave OLAC some more air-time: "The OLAC Metadata set is a modest set of extensions to the Dublin Core, useful for cataloguing language-related archives of various types" (Borges on metadata). Let me take this as my cue to tell you some more about OLAC.

In December 2000, an NSF-funded Workshop on Web-Based Language Documentation and Description, held in Philadelphia, brought together a group of nearly 100 language software developers, linguists, and archivists responsible for creating language resources in North America, South America, Europe, Africa, the Middle East, Asia, and Australia. The outcome of the workshop was the founding of the Open Language Archives Community, with the following purpose:

OLAC, the Open Language Archives Community, is an international partnership of institutions and individuals who are creating a worldwide virtual library of language resources by: (i) developing consensus on best current practice for the digital archiving of language resources, and (ii) developing a network of interoperating repositories and services for housing and accessing such resources.

Today OLAC has over two dozen participating archives in seven countries, with 26,656 records describing language resource holdings. Anyone in the wider linguistics community can participate, not only by using the search facilities, but also by documenting their own resources (providing data), or by helping create and evaluate new best practice recommendations (sign up for OLAC mailing lists, starting with OLAC General).

OLAC is built on two frameworks developed within the digital libraries community by the Dublin Core Metadata Initiative and the Open Archives Initiative. The DCMI provides a way to represent metadata in electronic form, while the OAI provides a convenient method to aggregate metadata from multiple archives.

"Metadata" is structured data about data - descriptive information about a physical object or a digital resource. Library card catalogs are a well-established type of metadata, and they have served as collection management and resource discovery tools for decades. The OLAC Metadata standard defines the elements to be used in descriptions of language archive holdings, and how such descriptions are to be disseminated using XML descriptive markup for harvesting by service providers in the language resources community. The OLAC metadata set contains the 15 elements of the Dublin Core metadata set plus several refined elements that capture information of special interest to the language resources community. In order to improve recall and precision when searching for resources, the standard also defines controlled vocabularies for descriptor terms covering language identifiers, linguistic data types, discourse types, linguistic fields, and participant roles. You can see three of these vocabularies in use by searching for Pullum and picking the record for Pullum & Derbyshire's paper Object-initial languages.

I'm indebted to Gary Simons, along with dozens of institutions and individuals for helping to build and support OLAC.

Posted by Steven Bird at 05:46 AM

March 27, 2004

Onion entropy

Just when I thought the Onion was getting predictable, we get this.

Slyly carrying on the joke, Classics in Contemporary Culture observes that "someone has the rudiments of Greek grammar, but doesn't know about final sigmas...", but one of the commenters suggests that "The text was probably created with Symbol, which IIRC doesn't include a final sigma."

Posted by Mark Liberman at 06:13 AM

X are from Mars, Y are from Venus

Back in November of 2003, this weblog post (by journalist Gavin Sheridan) accused author John Gray of exaggerating his educational credentials. Well, to be more precise, it called him a "fraud" for claiming a PhD from "Columbia Pacific University", which was shut down by the state of California in 2000 for "award[ing] excessive credit for prior experiential learning to many students; fail[ing] to employ duly qualified faculty; and fail[ing] to meet various requirements for issuing Ph.D. degrees." Gray apparently had some lawyers issue a threatening letter, which included the additional information that his B.A. and M.A. are from "Maharishi European Research University". The result has been to publicize the questions about Gray's credentials much more widely, since the story was picked up by Glenn Reynolds (here and here) among others.

Gray is the author of the "Men are from Mars, Women are from Venus" series, popularizing a version of the "two cultures" (or in this case perhaps "two planets") theories about inter-gender communication, originated in an academically more serious form by Deborah Tannen and others. That's a topic for another post -- the only new thing that I've learned about it from reading the blog entries cited above is that Gray's own communications skills are apparently so finely tuned that he's been able to talk his way through eight marriages. Here I'm just registering another sign of his success as a communicator, namely the spread of the "X are from Mars, Y are from Venus" snowclone.

A bit of internet searching turns up X/Y pairs from many domains, including

suppliers/buyers
customers/suppliers
buyers/brokers
distributors/manufacturers

media/scientists
scientists/journalists
students/teachers
teachers/pupils
scientists/educators
physical scientists/mathematicians
pathologists/clinicians
developers/testers
lawyers/doctors
directors/actors

Republicans/Democrats
Americans/Europeans
Germans/Italians

Nikes/Reeboks
web searchers/browsers
PCs/Macs

Dogs/Cats
mandrills/lemurs
bulls/cows
humans/monkeys

As far as I know, this formula is original to Gray (unless it was suggested by some anonymous editor or editorial lackey). I'm not convinced that theories of inter-cultural communication have significantly improved relationships between any of the X's and Y's in the list, but I could be wrong.

Posted by Mark Liberman at 05:10 AM

March 26, 2004

What that wooden stake is really for

Those who have been following the vampire language saga (here and here) on Prentiss Riddle's blog will want to take a look at this report from Walachia, south of Transylvania. Apparently in this culture, vampires only prey on their families:

"That's the problem with vampires," said Doru Morinescu, a 30-year-old shepherd who, like many in the village, has a family connection to the current case. "They'd be all right if you could set them after your enemies. But they only kill loved ones. I can understand why, but they have to be stopped."

The methods for dealing with vampires are also not quite as Bram Stoker depicted them.

"Before the burial, you can insert a long sewing needle, just into the bellybutton," he said. "That will stop them from becoming a vampire."

But once they've become vampires, all that's left is to dig them up, use a curved haying sickle to remove the heart, burn the heart to ashes on an iron plate, then have the ill relatives drink the ashes mixed with water.

"The heart of a vampire, while you burn it, will squeak like a mouse and try to escape," Balasa said. "It's best to take a wooden stake and pin it to the pan, so it won't get away."

I'm not sure that I see just how to pin something to an iron plate with a wooden stake. I can see why the reporter didn't ask more about this -- it's the sort of thing that's easier to demonstrate than to describe effectively. Whatever the proper technique, it's illegal in Romania, where local authorities are threatening to file charges against the relatives of alleged vampire Toma Petre.

"What did we do?" pleaded Flora Marinescu, Petre's sister and the wife of the man accused of re-killing him. "If they're right, he was already dead. If we're right, we killed a vampire and saved three lives. ... Is that so wrong?"

[news tip from John Bell]

Posted by Mark Liberman at 07:29 PM

March 25, 2004

Diamond geezer?

Among the "over-used phrases" that the Plain English Campaign has cited as "a barrier to communication" is diamond geezer. This one is so far from being over-used, at least in the circles that I inhabit, that the obstacle it poses to communication is that I've never heard of it and have no idea what it means.

A Google search turns up 23,000 pages, of which the first few include a weblog featuring London sitcoms, restaurant reviews, railway security and "street cries of old/new London;" a jewelry store "tirelessly scouring the world to match you with your perfect diamond"; a site that identifies the term diamond geezer with people who wear brightly-colored harlequin trousers in support of a rugby team; and a music promoter offering expertise in "2 Step, Bhangra, Bungati, Charts, Dance, Disco, Drum and Bass, Dub, Funk, Garage, Hard House, Hip Hop, House, Indie, Jungle, Live Music, Miami Base, Old Skool, Old Skool (Drum & Bass), R&B, Ragga, Reggae, Rock, Salsa/Latin, Soul, Swing, Techno, Trance".

At this point, my ability to form natural classes has already been stretched beyond its limits. This phrase is not an irritatingly overused and tired cliché, it's a complete f***ing mystery. I see confirmation here for my original conjecture that the whole Plain English Campaign thing is an elaborate Pythonic joke. Can someone offer a clue?

[Update: many clues have been offered. Anders suggests that I "have a butcher's" at this page, which glosses diamond geezer as "A really wonderful man, helpful and reliable; a gem of a man. A commonly heard extension to 'diamond'. [Mainly London use]".

John Kozak explains that

It's East End slang. "geezer" = "man", in a "one-of-us" sort of way. Here, "diamond" is approbatory, so the overall sense is "a good sort". There's a slight overlay to all this in that most people's exposure to this term is via an interminable set of films sponsored by the public lottery about East End gangsters, so most would situate it more narrowly in that context.

John goes on to ask "In the US, 'geezer' = 'old person', doesn't it? Wonder how that came about? "

And Des Small writes that

This is Cockney/London slang for "great bloke". Since you obviously can't go around believing random stuff that people tell you, here's a link to a source:

http://www.LondonSlang.com/db/d/
"""
diamond geezer - - a good 'solid' reliable person.
"""

It's on the Internet, so it must be true!

Thanks to all!

]

[Update 2: The OED glosses geezer as "A term of derision applied esp. to men, usu. but not necessarily elderly; a chap, fellow. " Its first citation is from 1885. Of the ten citations, four (including the first) explicitly say "old" in association with geezer, and I think that all are British sources. One of the citations is "1893 Northumbld. Gloss., Geezer, a mummer; and hence any grotesque or queer character. " This suggests the equation grotesque = old as the source of the association with old age. In (my intuitions about) American usage, this association has become part of the core meaning of the word, and to use geezer for a child or youth would have to be a joke or other special effect.]

Posted by Mark Liberman at 11:20 PM

That queerest of all the queer things in this world

Last November, I suggested that ambient cell phone conversations are distracting and annoying not because they're loud, but because they're one-sided and therefore frustrating to try to follow. In 1880, Mark Twain wrote a "comic sketch" about the experience of listening to one side of a "telephonic conversation" in which he makes a similar point.

I handed the telephone to the applicant, and sat down. Then followed that queerest of all the queer things in this world—a conversation with only one end to it. You hear questions asked; you don’t hear the answer. You hear invitations given; you hear no thanks in return. You have listening pauses of dead silence, followed by apparently irrelevant and unjustifiable exclamations of glad surprise or sorrow or dismay. You can’t make head or tail of the talk, because you never hear anything that the person at the other end of the wire says.

He goes on to give a complete transcript of his end of this particular conversation. Some aspects of the piece are dated -- the interaction with the central office, the need to shout to be heard down an unamplified phone line, and Twain's casual display of sexist stereotypes, which today is permitted in our better publications only when directed at men. But the experience is basically¹ the same today as it was 124 years ago.

¹The Plain English Campaign thinks that basically is "irritating". I think it's the right word in this context, meaning (as the American Heritage Dictionary tells us) "In a basic way; fundamentally or essentially".

Posted by Mark Liberman at 10:54 PM

Sapir-Whorf alert

The April Scientific American has a feature on Paul Kay, discussing his research before and after Basic Color Terms in 1969. An interesting quote:

"Two key questions must always be kept separate," Kay adds. "One is, do different languages give rise to different ways of thought? The other is, how different are languages?" It is possible, he says, that the respective answers are "yes" and "not very."

We've discussed related issues in the past (here and here, for example). A current controversy has to do with differences in spatial reference -- the relative role of cultural, linguistic and situational factors is debated, with different experiments pointing in different directions (so to speak). More on this soon.

Posted by Mark Liberman at 10:31 PM

More on the McGurk Effect

The McGurk effect to which Sally Thomason refers, whereby someone presented with a video of a person saying [ga] and simultaneous audio of someone saying [ba], perceives [da], is indeed interesting, and has been exploited in various ways to get at aspects of speech perception. You can find out more about it from this web page at Haskins Laboratories, which includes this link to a demonstration of the effect.

Posted by Bill Poser at 12:37 PM

Irritating cliches? Get a life

The Plain English Campaign is not just an amiable bunch of British eccentrics, says Mark (here); they are humorless hypocrites, "short on judgment, common sense and consistency", and their pronouncements, themselves laden with clichés, are not to be taken seriously. I agree, of course. (Don't just listen to me about the Campaign's indefensible citation of Defense Secretary Donald Rumsfeld for an allegedly confusing pronouncement; listen to The Economist, which loves to mock Americans and word-manglers, but agreed with me on this.)

The Campaign's list of the most irritating clichés in the English language does include some clichéd phrases that I can imagine people being irritated by. Their number one, the (largely British) phrase at the end of the day — which I understand to have a meaning somewhere in the same region as after all, all in all, the bottom line is, and when the chips are down — may shock people by its complete bleaching away of temporal meaning. As I understand it, users of this phrase would see nothing at all peculiar in a sentence like It's no good saving money on heating if it means having a cold bedroom, because at the end of the day, you've got to get up in the morning.

The second-ranked at this moment in time might annoy people by being a six-syllable substitute for the monosyllabic now — though this has happened before: Colonel Potter in the TV series MASH used to say WW2, a seven-syllable abbreviation for the three-syllable full-length version World War Two.

However, some of the other items on the list are surely just incorrectly classified: as I understand what a cliché is, many of these aren't clichés at all. They're just words some people have taken an irrational dislike to. That's very different. A few examples follow:

The adverb absolutely.
The adjective awesome.
The adverb basically.
The noun basis.
The adverb literally.
The adjective ongoing.
The verb prioritize.

A cliché is a trite, hackneyed, stereotyped, or threadbare phrase or expression: spoiled from long familiarity, worn out from over-use, no longer fresh. But if the Plain English Campaign is going to claim the right to say that about individual words that its correspondents suddenly take a disfancy to, surely most of the words found in smaller dictionaries will have to go. Many of the words we use -- like every single one of the words in this sentence -- have been around and in constant use for several hundred years. What on earth is the Plain English Campaign suggesting we should do with its list of pet hates? Is it recommending word taboos on the basis of voting out, a kind of lexical Survivor?

And what is getting the poor loser words voted off the island? Why, for instance, should a persistent problem be permitted to persist while the ongoing use of an ongoing problem is condemned? Of the two, persistent is the older, hence presumably the staler.

But the Campaign can't really be worried about staleness. Another of their picks is just one of the half-dozen uses of like. The unpopular use is of course the one where it is a hedge meaning something like "this may not be exactly the right word but it gives the general impression." I discussed it here, and later discovered that it is actually used by God. An odd choice indeed as a cliché: the one thing everyone agrees on is that it is fairly new in the language. I figured that was why it was hated so much. What's supposed to be wrong with these condemned items: are they too old or too new?

I don't understand these wordgripers and phrase disparagers. If I may borrow a phrase that genuinely is hackneyed and familiar (immortalized in William Shatner's wonderful Saturday Night Live Trekkies sketch and none the worse for its frequent affectionate requotation): people, get a life.

Posted by Geoffrey K. Pullum at 10:45 AM

Baba vs. Dada

Back in the days when I taught a Phonetics class (because I was in a department that had no genuine phonetician, the kind of person who is not a technophobe and can introduce students to the wonders of phonetics software), I used to give my students an emphatic warning: when you work on your term project, I told them, do tape-record your consultant pronouncing a 200-word Swadesh list of basic vocabulary, but don't use those tapes as a substitute for face-to-face elicitation and checking of data. The reason is that seeing your consultant pronounce the sounds helps you hear them better and identify them correctly. Yesterday I began to doubt the complete wisdom of this advice when my colleague Pam Beddor showed a video in which a lecturer illustrated the McGurk effect. Probably all my fellow bloggers already know about this remarkable demonstration, but I'll describe it anyway.

The speaker announced that she would pronounce a nonsense word, baba. She instructed her audience to close their eyes and listen. Sure enough, with your eyes closed, you could tell that she was saying baba. No surprise there. Then the audience was told to listen again with open eyes. This time the video showed the speaker apparently pronouncing dada -- no lip closure at all, though I couldn't actually see much of what was going on behind the teeth. And in fact I heard dada. No matter how hard I tried, knowing that she was actually saying baba, I could not hear baba. True, it sounded like a slightly odd version of dada, or at least I imagined that it sounded oddish, but I couldn't even imagine baba while watching her. Moral (?): in a clash between eyes and ears, the eyes have it.

[Update by Mark Liberman: Sally is right to be impressed by the McGurk effect -- it's a stunning demonstration of the power of "sensory fusion" in speech perception. However, her description of the details is a bit different from the way in which the standard effect is usually demonstrated. The standard McGurk effect involves seeing a video of [ga] while listening to a synchronized audio of [ba] and perceiving [da], unless you close your eyes. It feels like you're controlling the playback with your eyelids.

There's an excellent McGurk page here.]

Posted by Sally Thomason at 07:34 AM

Bored of

A recent post on wordorigins discusses "bored of" as opposed to "bored with". This one strikes me just like "worried of" (discussed here and here) and "eligible of" (discussed here) -- in other words, ungrammatical.

However, Google gets 162,000 hits for "bored of". Lots are "Bored of the Rings" and such-like bad puns, but quite a few are things like "If you are bored of your computer, Desktop Studio can help you." The search also turned up a year-old article entitled "Unnatural Language Processing", by Michael Rundell, that treats this very topic. Rundell observes that

When the British National Corpus (BNC) was assembled in the early 1990s, there were 246 instances of 'bored with', but only 10 hits for 'bored of' -- and most of these came from recorded conversations rather than from written texts. The bored of variant would still, I suspect, be regarded as incorrect by most teachers, but a search on Google finds 112,000 instances of this pairing, as against 340,000 examples of bored with. It is always a bad idea to make predictions about language, but bored of seems to be catching up with bored with, and may well end up being recognized as an acceptable alternative.

It would be neat if this were true, though I'm afraid that Rundell may have been fooled by the "Bored of the Rings" and "Bored of Ed" jokes. It's not totally impossible, though -- "bored of it" now gets 25,400 ghits, whereas "bored with it" gets 48,500, barely 1.9 times more. All the more reason to look carefully at verb/preposition associations across time, space and genre. Human Social Dynamics, yo.
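For anyone who wants to check the arithmetic, here is a minimal Python sketch that recomputes the with/of ratios from the counts quoted in this post. The function name and the source labels are my own inventions, and web hit counts are document counts that include the puns, so treat the output as rough.

```python
# Counts quoted in the post: (source, hits for "bored with", hits for "bored of").
counts = [
    ("BNC, early 1990s", 246, 10),
    ("Google, per Rundell", 340_000, 112_000),
    ('Google, "bored with it" vs. "bored of it"', 48_500, 25_400),
]


def with_to_of_ratio(with_hits: int, of_hits: int) -> float:
    """How many times more common the 'with' variant is than the 'of' variant."""
    return with_hits / of_hits


for source, with_hits, of_hits in counts:
    print(f"{source}: {with_to_of_ratio(with_hits, of_hits):.1f}x")
```

The three sources are not strictly comparable (a balanced corpus versus two different web searches), so the ratios show a direction rather than a measured rate of change.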

[Update 3.29.2004: "bored of it" in a cartoon here.]

Posted by Mark Liberman at 12:01 AM

March 24, 2004

Big of a deal

Mary at eyes.puzzling.org asks "Is "big of a deal" as in 'it's not that big of a deal' a US usage, or am I just missing out on a trend?"

Kenneth Wilson discusses this in The Columbia Guide to Standard American English:

of a occurs more and more frequently in Nonstandard Common and Vulgar English in uses such as It’s not that big of a deal; She didn’t give too long of a talk; How hard of a job do you think it’ll be? All these are analogous to How much of a job will it be?, which is clearly idiomatic and Standard, at least in the spoken language where it most frequently occurs. It is possible, therefore, that the first three could achieve idiomatic status too before long, despite the objections of many commentators.

Another possible source is suggested by an observation attributed to Groucho Marx:

Outside of a dog, a man's best friend is a book. Inside of a dog, it's too dark to read.

It's interesting that Wilson's examples all involve positive-end scalar predicates: big, long, hard. The adjectives from the other ends of such scales show up less often in this construction, both absolutely and in proportion to the frequency of each particular adjective itself. The numbers below are Google hits (which are document counts rather than word or phrase counts, but they'll do):

	ADJ of a	ADJ
big	161,000	133M
small	18,500	116M
hard	10,300	89.5M
easy	4,020	79.6M
far	9,120	62.1M
near	624	48.8M
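The "in proportion" part of the claim can be checked directly from the table above. The following Python sketch divides each "ADJ of a" count by the count for the bare adjective and expresses the result per million; the variable and function names are mine, and the numbers are simply the Google counts listed above.

```python
# Google document counts from the table: adjective -> ("ADJ of a" hits, bare ADJ hits).
hits = {
    "big": (161_000, 133_000_000),
    "small": (18_500, 116_000_000),
    "hard": (10_300, 89_500_000),
    "easy": (4_020, 79_600_000),
    "far": (9_120, 62_100_000),
    "near": (624, 48_800_000),
}

# Antonym pairs, positive end of the scale first.
pairs = [("big", "small"), ("hard", "easy"), ("far", "near")]


def per_million(adj: str) -> float:
    """'ADJ of a' hits per million hits for the bare adjective."""
    of_a, bare = hits[adj]
    return of_a / bare * 1_000_000


for pos, neg in pairs:
    print(f"{pos}: {per_million(pos):.1f}   {neg}: {per_million(neg):.1f}")
```

For each of these three pairs the positive-end adjective comes out with the higher rate, which is the pattern described above.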

The case of long and short is a problem, because "short of a" has another meaning that is very common, as in "one can short of a six pack" or "just short of a miracle". We can avoid this by checking "too long of a" and "too short of a", which show the same effect, as do heavy and light:

	too ADJ of a	ADJ
long	12,400	152M
short	5,830	69.2M
heavy	1,540	27.5M
light	690	77.2M

Posted by Mark Liberman at 11:17 PM

Cuteness

Rachel Shallit posts here and here about an interesting new morphological fad: "X + ness = X, which I am trying to be funny or cute about". This has something to do with the cutesy snowclone "crunchy X goodness", as in "CSS, XSLT, XUL, HTML, XHTML, MathML, SVG, and lots of other crunchy XML goodness", or "Shoggoth.net is filled to the brim with crunchy Cthulhu goodness" or "Now with more crunchy sarcastic goodness in every bite", or "this week, I have nearly 300 pages of crunchy Economist goodness to read." For some crunchy goodness from nearly 30 years ago, look here.

Posted by Mark Liberman at 07:33 PM

Fed up with "fed up"?

The Plain English Campaign has "surveyed its 5000 supporters in more than 70 countries" and determined that "'at the end of the day' [is] the most irritating phrase in the language," with 30-odd additional phrases listed as runners-up. Many publications and broadcasters have picked up the P.E.C. press release, including Reuters and the BBC World Service, where I heard it discussed this morning.

I need to begin my comments with a confession. It's hard for me to take anything that Robin Lustig says seriously, because whenever he opens his mouth, I think that I'm listening to a Monty Python skit. This is pure associative prejudice, I know, like the view that "technology doesn't sound nearly as impressive when it is discussed in a booming hick drawl", but I can't help it. So hearing Robin Lustig discuss this on the radio started me off with a feeling that the whole thing was some kind of high-entropy ironic foolishness.

Checking out the Plain English Campaign's press release confirmed and strengthened this feeling. They quote Orwell's dictum "Never use a metaphor, simile, or other figure of speech which you are used to seeing in print". But then, in the space of a few short sentences, they use the phrases "fed up", "pressure group", "barrier to communication", "tired expressions" and "tuning out", among other metaphors and figures of speech that I'm sure they are as used to seeing in print as I am. Google has seen these particular metaphors and figures of speech 835,000 times, 356,000 times, 20,340 times, 2,048 times, and 171,160 times, respectively. I don't have any objection to these phrases, myself, but it's definitely Pythonesque to strike a pose about avoiding commonplace metaphoric phrases in a document that uses two or three of them in every paragraph.

As a point of comparison, "blue sky thinking", which is one of the cliches we are told to shun, gets 3,660 Google hits. Can "blue sky thinking" possibly be a more offensive metaphorical expression than "fed up" or "tune out"? No, this has to be some deadpan English joke.

Alas, it isn't. The Plain English Campaign is the same outfit that gave its "foot in mouth" award to Donald Rumsfeld's plain-spoken exploration of epistemic logic. They're serious. They're just short on judgment, common sense and consistency.

Moving down the page from "fed up with cliches" to their previous press release, we find that its title alone deploys two commonplace metaphorical phrases: "From head to toe - medical consent company makes it crystal clear". These get 402,000 and 1,660,000 ghits respectively. The press release goes on to say:

Chrissie Maher, founder director of the Campaign, praised EIDO's achievement. 'Expecting patients to sign a consent form they can't understand is nothing short of a cruel joke. EIDO have shown that, no matter what the medical or surgical procedure is, you can produce clear information that truly allows patients to understand what they are agreeing to. By achieving plain English in every document, EIDO have become a guiding light for the entire healthcare industry.'

39,000 Google hits for the "cruel joke" simile, 145,000 for the "guiding light" metaphor.

3 And the scribes and Pharisees brought unto him a woman taken in adultery; and when they had set her in the midst,
4 they say unto him, Master, this woman was taken in adultery, in the very act.
5 Now Moses in the law commanded us, that such should be stoned: but what sayest thou?
6 This they said, tempting him, that they might have to accuse him. But Jesus stooped down, and with his finger wrote on the ground, as though he heard them not.
7 So when they continued asking him, he lifted up himself, and said unto them, He that is without sin among you, let him first cast a stone at her.
8 And again he stooped down, and wrote on the ground.

Posted by Mark Liberman at 06:12 PM

It wasn't Lexus, it was Lexis!

Communications from several members of the now defunct company I talked about in this post have established that it was in error. Despite some conflicting reports, it now seems that it was not the Lexus division of the Toyota corporation that wrote a threatening letter to a company that wanted to call itself Lexeme. I'm lucky Lexus didn't send ugly guys round to break my legs. No, it was (at least, so I am now told) the Lexis-Nexis legal database corporation that sent that letter. (If anyone did; there are some who say that there never was any such letter, there was just an evil scheming boss who wanted to put his stamp on the company and give it a new name of his own devising.)

You might say it makes a bit more sense that a company in the info biz might be concerned at Lexeme's first syllable. But a lawyer friend of mine has told me a bit more about why even Lexus might have been concerned. Now, he does not want to be named or quoted, because lawyers have to be so careful not to get on the record with critical remarks about the law, in case something is worded incorrectly, or in case it touches upon some case where they have privileged information or a case still in progress. But my friend didn't forbid me to describe what he said. I will do that here; but keep in mind that from here on, for the next dozen paragraphs, I am not presenting ideas of my own. I am skating along the blurry line between paraphrase and plagiarism, relying on an unnamed legal source and reproducing a lot of it unchanged.

Trademark law used to be mainly about consumer protection. Trademarks served to identify the source of products and services so the consumer wouldn't be fooled into buying fakes. Protecting the "good name" of the seller or trademark owner was a secondary consideration, and largely a byproduct of the primary purpose of the laws. The "likelihood of confusion" in the mind of the relevant consumer was the cornerstone of trademark protection: a second-comer was prevented from using a particular mark only if it could be shown that consumers might be "confused" about the source or origin of the goods.

But what counts as likely to cause confusion? There are a bunch of factors, sometimes known (because of a case involving Polaroid) as "the Polaroid factors":

1. the strength or distinctiveness of the plaintiff's mark;
2. the proximity (or similarity) of the goods or services (which may be evaluated in terms of the target audience to which products or services are marketed);
3. the degree of similarity between competing marks or designations;
4. evidence of actual confusion;
5. the similarity in the marketing channels (or advertising media) used (or the manner in which competing products are marketed);
6. the type of goods (including quality), classes of prospective purchasers and the degree of care likely to be exercised by the purchaser (alternatively characterized as the sophistication of the purchasers);
7. the defendant's intent in selecting its own mark; and
8. the likelihood of expansion of the parties' respective product lines (alternatively phrased as the likelihood that the plaintiff will "bridge the gap" between its market or business and that of the defendant).

Many problems arise from this list, but one of the most problematic is factor 3, the degree of similarity between competing marks or designations. What it means is that if there are lots of somewhat similar marks out there, then the plaintiff has to make a stronger showing that this particular defendant is creating a likelihood of confusion. In other words, if Starbucks says that everyone associates the name "Starbucks" with coffee and pastries from a distinctive source, and their opponent, an upstart company called Starbacks, is able to point to a whole bunch of similarly named establishments from which coffee, pastries or similar goods can be obtained, then it becomes much harder for Starbucks to argue that their name is so distinctive that consumer confusion will result. That is, if there's already a Star Bach's Restaurant and a Sta-Brucks Coffeehouse and a Star Bucky's and a Starbukes Pastries'n'Beer out there, then factor (3) starts to weigh against Starbucks when they go after Starbacks -- because Starbucks is now in what those in the trademark biz call a "crowded field".

In other words, you have an easier case against infringers if your name is unique in the field. McDonald's would have had a much harder time preventing people from using the name McDharma's for vegetarian Indian fast food (which actually happened here in Santa Cruz County) if there had already been a McDreamer's and a McGoogle's and a MacGonigle's serving burgers down the street.

The bottom line is that corporations and their lawyers are now forced to go after similarly-sounding-named businesses in related areas, even if the particular "infringement" is arguably not a problem. If anything a bit similar to your business name is used by someone else, your corporation will have a much harder case to make against a clearer case of infringement down the line, because you have permitted erosion of the distinctiveness of your mark.

It means that Starbucks simply must move against Star Bock, right now, because if that amusing pun is permitted, behind it may come StarBlech's joke imitation vomit and StarBic ballpoint pens and StarBickie cookies... And eventually will come the Starbacks coffee house, a clear ripoff that really does threaten Starbucks' business, with lawyers ready to argue that the Starbucks mark is not protectable any more because it is no longer distinctive: it's located in a crowded part of the phonetosphere within which Starbucks has permitted numerous other companies to use similar names, some of them in the food service industry, and the public no longer associates that corporation with all foods and drinks that come from something with a name similar to "Starbuck".

But it now gets worse. There is a relatively new cause of action known as trademark dilution. It is based on the idea that if you have a really famous and distinctive mark, you can prevent people from using your mark, or similar marks, for anything at all. Kodak has a famous mark. "Victoria's Secret" was recently found to be a famous mark by the US Supreme Court. It is fairly clear that Starbucks has a famous mark, and Lexis-Nexis could probably make that case as well.

The principal philosophical difference between dilution and traditional trademark law is that the purpose of dilution law is not to protect the public from confusion. Dilution protects "famous" marks when there is no likelihood of confusion at all, merely because they are famous. In other words, its purpose is to protect trademark owners against offense caused by someone using "their" word in a way not approved by them. It functions to provide a new and broad right of protection to those who are already the most successful and rich, a species of protection that is not available to anyone else.

Dilution law has been used to squash parody. Those t-shirts and posters that said "Enjoy Cocaine", and had a logo remarkably similar to the Coca Cola logo, are no longer available -- that poster was judged to be capable of tarnishing the Coca Cola mark. Debbie Does Dallas (which featured an enthusiastic girl named Debbie wearing a Dallas Cowboys Cheerleaders outfit, apparently) is no longer distributed: although nobody could possibly think that the Dallas Cowboys had anything to do with this film, their trademarks were "tarnished" by it.

For linguists, a further scary thing is that there are in fact people who are prominent in the trademark community who believe that you ought to be able to bring a lawsuit against anyone who tarnishes your name. This would include dictionary publishers. For example, McDonald's recently made noises about a cause of action against Merriam-Webster, because their latest edition of Merriam-Webster's Collegiate Dictionary defines a "McJob" as "low paying and dead-end work" (there is a short MSNBC story about it here).

The lawyer friend from whose email I pillaged the above thinks that under dilution law as it presently exists, McDonald's could well win on such a claim. But as yet no dictionary maker has actually been hauled into court for listing words people use. It has not proved feasible to stop dictionaries listing the uses of the numerous trademarks that have turned into lower-case nouns and verbs, clearly diluting them. British English speakers talk about hoovering the carpet with an Electrolux hoover, and Hoover gnashes its corporate teeth in frustration but doesn't win cases against dictionary compilers who record the facts. American speakers talk about xeroxing a document on their Canon copier, and the Xerox Corporation bristles at such incorrect talk about xerocopying, but they haven't yet tried to break a lexicographer's legs. Sorry! I said something beginning with lex...

Posted by Geoffrey K. Pullum at 01:57 PM

"Under God," Hapax Legomenon

As Bill Poser notes in the previous post, the inclusion of "under God" in the Pledge has long been controversial, but the interpretation of the phrase poses a particular linguistic problem, since as I noted once in a "Fresh Air" piece, the phrase is actually a hapax legomenon in this context.

"Under God" was taken from Lincoln's Gettysburg Address, but there it's used as an adverbial: "...this nation, under God, shall have a new birth of freedom." But in the Pledge, the phrase is used adjectivally, to modify nation. As best I can tell, this is the only context in English where "under God" is used in this way, which leaves its meaning up for grabs. Is it like "under orders," "under a monarch," or "under heaven"? But then vagueness is probably what commended the phrase in the first place -- what better way to signal the doctrinal neutrality of the state?

Posted by Geoff Nunberg at 01:54 PM

Leading Pigeons to the Flag

A long-standing ritual in the schools in the United States is the Pledge of Allegiance, in which the children are called upon to recite the words:

I pledge allegiance to the flag of the United States of America and to the Republic for which it stands, one Nation under God, indivisible, with liberty and justice for all.

The pledge has been the subject of much controversy and litigation over the years. Members of the Watchtower Bible Society object to saying the pledge on the grounds that it constitutes idolatry. Many people have refused to recite the pledge as a form of political protest. This was not uncommon during the Vietnam War. The latest controversy surrounds the inclusion of the words under God, which were added in 1954 after a campaign by the Knights of Columbus. In a 2002 decision reported here the 9th Circuit Court of Appeals held that the Pledge is unconstitutional because these words violate freedom of religion. The case is shortly to be heard by the Supreme Court.

The case now before the Supreme Court is a narrow one, concerning only whether the inclusion of the words under God violates the freedom of religion of those who do not believe in God or who do not consider the United States to be a nation under God. That this is the case is so plain that I am stunned that any rational person can argue otherwise. What else could the words possibly mean? The Knights of Columbus didn't want these words added because they improved the poetic quality of the pledge. They added them in an effort to impose their religion on schoolchildren.

The larger controversy has been over forcing children to recite the Pledge in any form. The Supreme Court upheld the right of children to decline to participate in West Virginia State Board of Education vs. Barnette (1943) and reaffirmed it in Tinker vs. Des Moines Independent School Board (1969) where it held that students

do not shed their constitutional rights to freedom of speech or expression at the schoolhouse door

Nonetheless, schools have frequently acted in defiance of the Court and the Constitution. When I was a junior high school student during the Vietnam War, my school tried to force students to recite the Pledge, which was reintroduced in an effort to suppress anti-war sentiment. A 14-year-old should not have to remind a school principal of Supreme Court decisions. According to this report from 2002, the Walker County, Alabama school board requires students to recite the pledge. In the incident reported, a student was beaten for refusing to recite the pledge and silently holding up his clenched fist in protest. The school board claimed he was punished for "disrupting class" rather than for refusing to recite the pledge. How did he disrupt class? By refusing to recite the pledge and holding up his clenched fist. Where I come from, the school board's disingenuousness is called lying.

If any further evidence is needed that the purpose of the Pledge of Allegiance is to inculcate mindless loyalty to the state, it can be found in the fact that many children clearly do not understand what they are saying. This can be seen in the eggcorns that they construct. My mother tells me that as a little girl she believed that there was a thing called a legiance that she was pledging to the flag. She didn't know what it was. In today's column in the New York Times, entitled Of God and the Flag, William Safire reports that as a little boy he thought that the Pledge began "I led the pigeons to the flag". In a roundabout way, I think he understood it all too well.

Posted by Bill Poser at 08:52 AM

Copyfight

Copyfight is a new (since March 14) group weblog that promises to "explore the nexus of legal rulings, Capitol Hill policy-making, technical standards development and technological innovation that creates--and will recreate--the networked world as we know it".

You can learn there that BioMed Central has adopted a Creative Commons license, and that Creative Commons has launched its "Science Commons" branch. You can also learn that Night of the Living Dead is now available for download from archive.org (if you missed the same news on boingboing.net).

If you're interested in intellectual property and marketplace issues in scientific and scholarly communications, you should also check out Anoop Sarkar's thoughts and links at Special Circumstances.

Posted by Mark Liberman at 08:39 AM

Soundboards

John Pasden at Sinosplice has a nice Flash soundboard exhibiting Shanghainese vs. Mandarin words and phrases. As John points out, soundboards are best known for their role in generating the fake side of prank phone calls (here or here, for example), or other jokes such as this Mohamed Said Sahaf soundboard. His application shows that jokes are not the only use for such things. Jokes are fine, of course -- I myself am looking forward to the conversation between the George Lakoff soundboard and the Noam Chomsky soundboard.

Anyhow, to facilitate the more serious use of the technology -- in language instruction or in speech generation for the handicapped -- it would be nice to write a "soundboard generator" that would let people create such programs easily: Flash ActionScript has become a decent programming language, but this class of applications could easily be generated by a much simpler "little language", since it is mainly just a matter of text layout and audio links.
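To make the "little language" idea concrete, here is a rough Python sketch of such a generator. Everything in it is an assumption made for illustration: the one-entry-per-line `label | audio file` format, the function names, the placeholder file names, and the choice to emit a plain HTML page with audio players rather than a Flash movie.

```python
import html


def parse_spec(spec: str) -> list[tuple[str, str]]:
    """Parse lines of the form 'label | audio file', skipping blanks and # comments."""
    entries = []
    for line in spec.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        label, _, audio = line.partition("|")
        entries.append((label.strip(), audio.strip()))
    return entries


def render_html(title: str, entries: list[tuple[str, str]]) -> str:
    """Lay out each entry as a caption followed by an audio player."""
    rows = "\n".join(
        f'  <p>{html.escape(label)} '
        f'<audio controls src="{html.escape(audio, quote=True)}"></audio></p>'
        for label, audio in entries
    )
    head = f"<html><head><title>{html.escape(title)}</title></head>"
    return f"{head}\n<body>\n{rows}\n</body></html>"


spec = """
# hypothetical entries; the file names are placeholders
hello | audio/hello.mp3
thank you | audio/thanks.mp3
"""
page = render_html("Demo soundboard", parse_spec(spec))
print(page)
```

The point of the split is that the person building a soundboard only ever edits the spec; layout decisions live in one rendering function that could be swapped for a different output format.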

Posted by Mark Liberman at 08:04 AM

Why We Don't Understand When the Fat Lady Sings

If you ever listen to opera, you have probably found that sopranos are very difficult to understand. Of course, if you don't understand Italian or German or whatever the language of the opera is, everybody is hard to understand, but even if you do know the language, sopranos are particularly difficult. There's an article in Physics Today that explains why.

Posted by Bill Poser at 01:00 AM

March 23, 2004

Verbs and prepositions

Over at Transblawg, the estimable Margaret Marks has posted a sort of quiz about verb/preposition associations in English. Her examples are all from the context of legal translation, but most of them apply more widely.

I've recently been musing about unexpected associations of this type, especially "worry of" (here and here). These norms about complementation have a lot of interesting practical and theoretical properties. They're syntactically and semantically quasi-regular, for one thing -- a mixture of predictability and idiosyncrasy, and therefore presumably a mixture of "figure it out" and "look it up" strategies. They're somewhat variable across individuals, dialects and times. And they're relatively easy to study by string-search methods -- such searches don't in themselves produce reliable counts, because of the variable structure of the results, but they yield samples that can be humanly checked to produce accurate rates. One of the things that I've come to realize is that there is more low-level variation in the "meme pool" for such constructions than one might think. Actually, anyone who grades student papers will have learned this -- college students, even quite literate ones, often produce unexpected verb/preposition combinations.

For all these reasons, verb/preposition (and noun/preposition) combinations should provide a good domain in which to study what you might call the population memetics of grammar. And having accurate statistics for complementation would be useful for parsing purposes, anyway. So I've been thinking about how to design and implement large-scale studies of this sort of thing.
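As a toy illustration of the string-search approach, here is a rough Python sketch. The function names and the short preposition list are my own assumptions, and the sample text mixes one made-up sentence with the two "eligible of" quotations discussed below. The raw counts are not reliable rates; the second function pulls out each hit with some context so that a human can check it.

```python
import re
from collections import Counter

# A short, illustrative list; a real study would need a fuller inventory.
PREPOSITIONS = ("for", "of", "to", "in", "about", "from")


def preposition_counts(text: str, head: str) -> Counter:
    """Count which preposition immediately follows `head` in `text`."""
    pattern = re.compile(
        rf"\b{re.escape(head)}\s+({'|'.join(PREPOSITIONS)})\b",
        re.IGNORECASE,
    )
    return Counter(m.group(1).lower() for m in pattern.finditer(text))


def samples(text: str, head: str, prep: str, width: int = 30) -> list[str]:
    """Return each hit with surrounding context, for checking by hand."""
    pattern = re.compile(
        rf"\b{re.escape(head)}\s+{re.escape(prep)}\b", re.IGNORECASE
    )
    return [
        text[max(0, m.start() - width): m.end() + width]
        for m in pattern.finditer(text)
    ]


SAMPLE = (
    "He will be eligible for parole next year. "  # made-up sentence
    "Who is eligible of the S.A.F.E. escort service? "
    "In order to be eligible of the Advanced Language Study Abroad Grant"
)

if __name__ == "__main__":
    print(preposition_counts(SAMPLE, "eligible"))
    for hit in samples(SAMPLE, "eligible", "of"):
        print(repr(hit))
```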

As another very small-scale exploration of the area, here are a few observations on some of Dr. Marks' examples.

1. to be eligible for parole

This is a pretty strong norm -- but you can find some examples, apparently produced by native speakers, using eligible of instead of for. These uses seem just as wrong to me as "worry of it" does, but I'm disposed to treat them as low-frequency variants in the meme pool rather than as production errors:

Wright State University's Police Department has a page about its S.A.F.E. escort service that includes the heading "Who is eligible of the S.A.F.E. escort service?"

Emory's study abroad site includes a document explaining that "[i]n order to be eligible of the Advanced Language Study Abroad Grant students must..." have four stipulated properties.

2. to sentence someone to a term of imprisonment

The preposition for is an alternative (at much lower frequency of occurrence) for to in this context, despite the potential confusion between crime and sentence.

This Florida appeals court decision quotes a trial record as finding that

It will be the judgment and sentence of this court that Russell Lee Yates be adjudicated guilty, and that he be sentenced for a term of years not exceeding 30 years.

I'll quote at greater length from an article in the Cornell Daily Sun, because it happens to feature Wayles Browne, an excellent linguist, in a non-linguistic role:

"The punishment doesn't fit the crime [under the Rockefeller Laws]," said Prof. Wayles Browne, linguistics, who spoke at the meeting. He cited the case of 17-year-old Angela Thompson, who was sentenced for 15 years to life as a first-time offender. Ten years later Browne, who reviewed the case, finally won clemency for the girl after two tries in the appeals court.

The lede of an Irish Examiner story says that

A woman scarred for life by a former lover told a court prior to the man being sentenced for six years yesterday that she just wants to feel safe again.

A California appeals court ruling explains that

The trial court sentenced him for a term of life imprisonment, as an habitual offender and imposed a $100,000 fine.

4. to make money from dealing in heroin

I believe that "in" is the preposition that Dr. Marks has in mind here, but a direct object would work as well, as in an article from the Chesterton Tribune asserting that "Michael V. Higi ..., was charged with dealing heroin, a Class B felony punishable by a term of six to 20 years in prison". In fact, "dealing heroin" gets 753 ghits, while "dealing in heroin" only gets 570. Language Hat suggests that "dealing heroin" is the American version, but this Dublin Sinn Fein site has "a number of tactics used by a very small element of the anti-drugs movement - of targeting young addicts who dealt heroin to feed their habits - failed and failed miserably".

And "dealing of heroin" gets 31 ghits. These mostly strike me as fine, like the sentence "Some small scale street dealing of heroin and cocaine also occurs in this area" (from this article), or this article's discussion of "a vehicle thought to be involved in the dealing of heroin".

However, "dealing" is likely to be a noun in all of these cases (it's a noun in the ten that I checked), and as in the case of "worry", the noun version of a predicate often reverts to "of", at least optionally.

[Update 6/22/2004: Wayles Browne writes:

I found my name in the Language Log in connection with an example from the Cornell newspaper. But they got it wrong, and I wrote a letter to them disclaiming the credit. Let me disclaim it to you too. Really what I said at the meeting was that ten years later a retired judge, who reviewed the case, finally won clemency for the girl. I wish it had been me, but I haven't got the legal skills.

Wayles added: "Obviously I said she was sentenced TO 15 years to life." Well, that was the reporter's preposition, I think, and apparently no more reliable than the quotation. ]

Posted by Mark Liberman at 08:17 AM

untitled

The untitled project (via locussolus).

Posted by Mark Liberman at 07:29 AM

Is 30 the new 42?

Via a recommendation at Infomusings, I've just read a paper by Marcia Bates that introduced me to the "Resnikoff-Dolby 30:1 Rule" (originally proposed in publications from 1971-72). Bates summarizes this idea as "suggest[ing] that human beings process information in such a way as to move through levels of access that operate in 30:1 ratios... Something about these size relationships is natural and comfortable for human beings to absorb and process information. Consequently, the pattern shows up over and over again."

Quoting (with ellipses) from Bates' paper:

Howard Resnikoff and James Dolby researched the statistical properties of information stores and access mechanisms to those stores... Again and again, they found values in the range of 28.5:1 to 30:1 as the ratio of the size of one access level to another...

•A book title is 1/30 the length of a table of contents in characters on average
•A table of contents is 1/30 the length of a back of the book index on average
•A back of the book index is 1/30 the length of the text of a book on average
•An abstract is 1/30 the length of the technical paper it represents on average
•Card catalogs had one guide card for every 30 cards on average. Average number of cards per tray was 30^2 or about 900.
•Based on a sample of over 3,000 four-year college classes, average class size was 29.3
•In a test computer programming language they studied, the number of assembly language instructions needed to implement higher-level generic instructions averaged 30.3.
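To see what the first three ratios imply when chained together, here is a small Python sketch. The 30-character title is a hypothetical starting point that I picked for illustration, not a figure from Bates or from Resnikoff and Dolby.

```python
RATIO = 30  # the Resnikoff-Dolby ratio between adjacent access levels

LEVELS = ["title", "table of contents", "back-of-book index", "full text"]


def level_sizes(title_chars: int, ratio: int = RATIO) -> dict[str, int]:
    """Size of each access level if every step up multiplies by `ratio`."""
    return {name: title_chars * ratio**i for i, name in enumerate(LEVELS)}


if __name__ == "__main__":
    # Hypothetical 30-character title.
    for name, size in level_sizes(30).items():
        print(f"{name}: {size:,} characters")
```

Under that assumption the table of contents comes to 900 characters, the index to 27,000, and the text to 810,000, so three 30:1 steps separate a title from its book by a factor of 27,000.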

Once you start looking for this kind of thing, you can find it all over the place. I conjecture that written English sentences probably average about 30 morphemes in length. I haven't ever measured this directly, nor seen any distributions, but mean sentence lengths in texts of various types tend to be about 15-25 words, and if we split compounds, regular inflections and compositional derivational morphemes this is likely to add about 5-10 tokens per sentence. You could get a wide range of numbers, depending on the mix of text types and writing styles, but the average is probably not far from 30 morphemes.

There's a considerable danger of confirmation bias in this sort of thing. We can find confirmation in the fact that military platoons average about 30 soldiers in size, but we could have picked on squads, companies or battalions instead. For sentence length, we could count syllables, morphemes or words; we can pick conversational transcriptions or text types of various kinds; we could have looked at clauses or paragraphs instead of sentences. For some combinations of choices, we're pretty sure to come out with a number close to 30.

Still, I'm prepared to believe that Resnikoff and Dolby are on to something. The main thing that makes me skeptical is precisely that I haven't heard of this idea before -- and that's a sad sort of argument.

This recent PowerPoint presentation by Ian Rowlands (with the snowclone title "30 is the new 42") closes with an appropriate set of questions:

how valid (or useful) is the 30:1 ‘rule’?

if it’s valid, what is the underlying explanation?

is it just a structural feature of print, or can it be extended to the e-world? (or HCI or map scales or visualisation compression ratios?)

is the report by Resnikoff-Dolby a citation sleeping giant or a dodo?

The title is an allusion to the passage in Douglas Adams' Hitchhiker's Guide to the Galaxy, in which Deep Thought gives the answer to the Ultimate Question of Life, the Universe, and Everything as "42".

Posted by Mark Liberman at 12:05 AM

March 22, 2004

DVDs

I resisted DVDs for a long time, thinking they were yet another ephemeral bit of consumer fluff. I bought my first one last year, and can only play them in my computers. But I have to say that DVDs are actually quite nice from a linguistic point of view. I recently rented a movie called Musa The Warrior. The title is some sort of mistake. musa is just a transliteration of Korean 무사, which in Chinese characters is 武士. It means "warrior", so the English title should be just "The Warrior". I don't know why they doubled it up this way. It makes it sound like it is about an Arab warrior named Moses. But I digress.

The neat thing about this movie, which was a joint Chinese/Korean production, is that you can listen to it in either Korean or Cantonese, and you can get subtitles in Korean, Traditional Chinese, Simplified Chinese, or English. DVDs have the huge advantage over videotape of allowing the viewer to choose from multiple sound tracks and subtitles. This not only caters to a linguistically diverse audience, it is terrific for language learners. In this case, the variety is a result of the joint production together with the recognition that adding English will greatly increase the market. I hope the studios will take the trouble to provide a variety of languages more often now that the technology exists.

P.S. It's a good movie. It's about a Korean embassy to Ming China that is not accorded diplomatic status but instead is imprisoned in Western China. The Koreans attempt to escape and get back to Korea. It features authentic-looking costumes and scenery, martial arts, a beautiful Chinese princess... What more could you want?

Posted by Bill Poser at 10:59 PM

They can kiss her grits

Ginger Stampley describes her reaction to reading the list of 100 most mispronounced words that's been making the rounds lately:

“If they think there’s something wrong with the dialect pronunciations ‘bidness’, ‘bob wire’ (aka ‘bob wahr’) and ‘yolk’ pronounced without an obvious l, they can kiss my sweet Texas grits.”

and to reading my exchange with Dr. Language himself, Robert Beard:

“They can definitely kiss my grits, because this U/Non-U pronunciation crap chaps my hide. I speak Texan, not Received Midwestern Broadcast.”

My correspondence with Dr. Beard began because I complained about another piece of overreaching linguistic moralizing from yourDictionary.com, one that focuses specifically on anti-Texan prejudices, namely their list of "5 Top Mispronunciations by President Bush in 2003". These lists have a pervasive anti-regional bias that I would find distasteful even if I weren't married to a Texan. But the list-maker's desire to feel right while putting others in the wrong puts nearly every American on the "wrong" side of the line: plenty of Midwestern broadcasters say Nevada with the vowel of cot rather than cat; and a professional lexicographer emailed in response to my exchange with Dr. Beard, a couple of months ago, to say that "as for February, I think my chances of ever hearing a native speaker of English utter the word \feb-roo-ary\ in an unself-conscious way are about as good as finding the Loch Ness monster."

So they can kiss our collective apple pie, while they're at it.

[Update 3/24/2004: apparently grits are chic at the moment.]

Posted by Mark Liberman at 10:16 PM

You can't look up everything

A student essay in aesthetics that my philosopher partner was reading as I looked over her shoulder begins thus: In the art world there is the age-old ad itch "Beauty is in the eye of the beholder."

Indeed there is such a proverbial saying. But the word the student meant to use for such sayings is adage.

The error is, of course, of the type that here on Language Log we long ago decided to call an eggcorn, a kind of word creation due to a mishearing that a glance at the written form would normally have corrected. (There are now so many posts about eggcorns that I am not going to attempt to make a list, but here is a recent observation of Mark's on the topic.)

It would be so easy to dismiss eggcorns as signs of illiteracy and stupidity, but they are nothing of the sort. They are imaginative attempts at relating something heard to lexical material already known. One could say that people should look things up in dictionaries, but what should they look up? If you look up eggcorn you'll find it isn't there. Now what? And you can't look up everything; sometimes you think you know what you just heard and you don't need to look it up. Someone says something about "the Oxford/Cambridge boatrace" and you just assume that Oxford and Cambridge hold a race that involves boats of some kind (correctly, as it happens). You don't go rushing to the dictionary to look it up, to make sure they didn't say bone trace or beau treize rather than boatrace. You're an intelligent native speaker; you have a right to just trust your ears and your brain sometimes. And sometimes in consequence an eggcorn is born.

Posted by Geoffrey K. Pullum at 07:21 PM

Homesteading the phonetosphere

Following up on a sequence of posts about organizations laying claim to portions of what Geoff Pullum has taken to calling the "phonetosphere", I have a small collection of stories about similar cases. Here's one, which shows that it's not always big companies that are the ones laying claim to chunks of sound space. Sometimes they're on the receiving end.

In 1993, the British conglomerate ICI "demerged" its pharmaceuticals interests into a new entity named Zeneca. As I understand it, this was a completely made-up name, supplied by one of the consulting firms specializing in such things. Shortly afterwards, I got a phone call from a lawyer representing Zeneca (or so he said), asking if I would be willing to provide advice and perhaps expert witness testimony. According to my caller, the issue was a threat from the Seneca Nation of Indians to sue for intellectual property infringement over the company's new name. I declined to serve, but offered for free my opinion that /s/ and /z/ are generally regarded as distinct phonemes in English, and therefore capable of distinguishing one word from another. In fact, I continued, this capability is not merely theoretical but is realized in practice, in minimal pairs such as sue and zoo, sip and zip, or peace and peas. Getting carried away with the story, I noted that these different words denote entirely unrelated concepts.
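For readers who have not met minimal pairs before, the test is mechanical enough to sketch in a few lines of Python. The tiny lexicon below is hand-transcribed by me in ARPAbet-style symbols purely for illustration, and the function names are made up. Two words count as a minimal pair for /s/ and /z/ if their transcriptions have the same length and differ in exactly one position, where one has S and the other has Z.

```python
from itertools import combinations

# Hand-made broad transcriptions (ARPAbet-style symbols); illustrative only.
LEXICON = {
    "sue": ("S", "UW"),
    "zoo": ("Z", "UW"),
    "sip": ("S", "IH", "P"),
    "zip": ("Z", "IH", "P"),
    "peace": ("P", "IY", "S"),
    "peas": ("P", "IY", "Z"),
}


def single_difference(a, b):
    """Return the one differing pair of segments, or None if there isn't exactly one."""
    if len(a) != len(b):
        return None
    diffs = [(x, y) for x, y in zip(a, b) if x != y]
    return diffs[0] if len(diffs) == 1 else None


def minimal_pairs(lexicon, contrast):
    """List word pairs that differ only by the two phonemes in `contrast`."""
    pairs = []
    for (w1, p1), (w2, p2) in combinations(lexicon.items(), 2):
        diff = single_difference(p1, p2)
        if diff is not None and set(diff) == set(contrast):
            pairs.append((w1, w2))
    return pairs


if __name__ == "__main__":
    print(minimal_pairs(LEXICON, ("S", "Z")))
```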

To my surprise, the lawyer seemed to find this informative. He asked me to spell "phoneme", and to repeat the minimal pairs slowly, so that he could write them down. After a bit more Q & A, he asked again if I'd like to be a consultant. I again declined, mainly on the grounds of hassle avoidance. I once agreed to testify in an intellectual property dispute involving a pronouncing dictionary, and the lawyers for a certain large electronics firm subpoenaed me to provide them with a copy of all notes, papers, correspondence and other records in my possession dealing with word pronunciation, speech technology and related topics. I learned that it is possible to "quash" a subpoena when it's obviously just a form of harassment, as this one was; but you have to pay a lawyer to ask the court in the right way, and the courts put the boundary between legitimate discovery and totally ridiculous harassment in a different place than I had imagined. I have friends who do quite a bit of expert witnessing, including one who does it for a living, so I'm sure it's possible to adapt to that universe, but as far as I'm concerned, life is too short to deal with hostile lawyers unless some worthwhile principle is at stake.

Anyhow, Zeneca didn't have to change its name, and survived to merge in 1999 with Astra into AstraZeneca, which seems still to be doing business. I don't know if they beat back the challenge from the Seneca, or settled with them, or what. "Zeneca" is a little more similar to "Seneca" than "Star Bock" is to "Starbucks", but then pharmaceuticals are perhaps more different from Indian nations than beer is from coffee, I don't know.

You'd think that an intellectual property lawyer working for Big Pharma on a naming case would be familiar with a bit of basic practical phonology, but apparently not. This may be because phonology itself is less relevant than psycholinguistics, according to my limited understanding of the law of trademarks. The U.S. Lanham Act, for example, defines "infringement" as the "use in commerce [of] any reproduction, counterfeit, copy, or colorable imitation of a registered mark in connection with the sale, offering for sale, distribution, or advertising of any goods or services on or in connection with which such use is likely to cause confusion, or to cause mistake, or to deceive."

I don't think that U.S. federal intellectual property law would apply to the Zeneca-Seneca issue, because it requires names to be registered to be protected. If the Seneca Nation have registered their name with the USPTO, searching for "Seneca" in the "Full Mark" field of TESS did not turn it up among the 46 registrations, though some of these are owned by relevant Indians, such as a wholesale cigarette company formed in 1999 by the Sac and Fox Nation. There are also quite a few gambling-industry trademarks, such as "Seneca Alleghany Casino" registered to the Seneca Nation of Indians in 2003. The rest of the registrations show that non-Indian companies have been using the "Seneca" name for nearly a hundred years. I suppose that the "goods and services" aspect of the Lanham law makes it inappropriate for a group like the Seneca to register, and therefore offers them no protection independent of particular commercial activities that they might carry out under their name. However, state law in the U.S. depends on a common-law tradition that doesn't require registration, and this may have been the premise for the challenge allegedly raised to Zeneca.

I wonder if any psycholinguists have gotten involved in quantifying the degree to which particular names are "likely to cause confusion, or to cause mistake". There seem to be a few examples of psychologists doing brand-confusion experiments in a general way, but the ones that I've heard about aren't focused on linguistic aspects of the problem. And legal practice still seems mainly to depend on dueling assertions by lawyers: there don't seem to have been any psycholinguistics experiments in the Star Bock or Lexeme cases. Maybe it's time for a bit of pro bono psycholinguistics in this general area, to protect the phonetosphere from inappropriate encroachment? Of course, this would do no good without parallel legal efforts.

Surely the sound space of human languages is the most crucial single part of the public domain. The right to name new enterprises without excessive constraint is essential to individual freedom, and is just as important to civil society as the right to own such names. If trademark lawyers trying to expand their clients' claims are the only voices in the debate, we risk having the whole sound space of English taken over by the penumbra of owned names, as Geoff Pullum has warned.

[Update: James Gleick has an article on this topic entitled "Get out of my namespace" in yesterday's NYT magazine.(via Rosanne at the X-bar)]

Posted by Mark Liberman at 09:27 AM

March 21, 2004

Language and Politics

Our recent discussion of the relationship between case-marking and military prowess reminded me of a humorous piece by J. Moore and K. Wohlmut entitled Towards a New Word Order on the relationship between politics and word order. It begins:

During the cold war, the balance of power seemed equally divided between the West, dominated by SVO speakers, and the Soviet Bloc, dominated by Scrambling speakers. However, with the end of the cold war, and the emergence of the West as the surviving political force, there emerged the possibility of a new dominant word order. As scrambling Russia fell into turmoil, the SVO West waxed. A major victory for SVO forces came with the reunification of Germany, where the East Germans threw off the yoke of their scrambling masters, and were able to assert their fixed-order heritage. Seizing the opportunity, the United States declared the emergence of a New Word Order: SVO.

You can read the rest here.

Posted by Bill Poser at 11:58 PM

Don't say "lexeme" or we'll break your legs

David Elworthy, a natural language processing engineer who keeps his journal of Massachusetts life here, writes to me with the astonishing story of how linguist James Pustejovsky and others set up a company in 1997 to build technology based on ideas developed in Pustejovsky's book The Generative Lexicon, and called the company Lexeme, and got threatened out of it by an already existing corporation. You might like to try and guess which corporation.

Lexeme is a fairly familiar technical term in linguistics (especially morphology). It stands for the abstract entity that correlates with separate dictionary entries. Notice that in most dictionaries you can't separately look up take, taken, takes, taking, and took; they are treated as just inflectional forms of a single entity take. That is the lexeme. The forms that are not predictable are listed at the beginning of the entry for the lexeme. The Cambridge Grammar of the English Language uses the distinction between word forms and lexemes throughout, and distinguishes the two notationally: word forms like taken in plain italics and lexemes like take in bold italics. It is hard to see how we could have written the book without the distinction in question.

So which company used its lawyers to threaten Pustejovsky's fledgling company for trying to register this anodyne and unambiguous lexical item as a name? One might have thought it would be Lexis-Nexis, who at least have business in the area of handling text. But no: the legal challenge came from Lexus, a division of Toyota making luxury cars. And they assert a right to all names that begin with the letters l e x, apparently.

Rather than waste money on a legal fight when it wasn't even born yet, Lexeme caved, and changed its name to LingoMotors. The company once held the URL www.lingomotors.com, but that seems no longer to exist (maybe the Lexus company still wasn't satisfied with the name change and hired some ugly guys to come round and threaten them completely out of business). But what a piece of insanity. Why on earth should a luxury car company have the right to prevent a small natural language processing and search technology company from adopting a technical term of linguistics as its company name? Aren't the owners of the English language ever going to rise up against greedy corporations like Lexus and Microsoft and Starbucks who lay claim to whole regions of the phonetosphere as if their financial power gave them arbitrary dominion over any set of possible words they take a fancy to, when no use of their trademark is being made and no possible confusion threatens?

Posted by Geoffrey K. Pullum at 08:39 PM

Case and Military Prowess

Mark Liberman has pointed out that the languages of the "barbarians" who defeated Rome were well endowed with cases, which demolishes the proposition that it was the case system of Latin that did in the Romans. The same point can be made the other way round. The Roman Empire not only had an end, it had a beginning. The Romans, with seven cases, began their expansionist career by defeating speakers of other Italic languages, such as Faliscan, Oscan, and Umbrian, all of which had the same or fewer cases. As they expanded out of Italy, they overcame, among others, speakers of Greek, with five cases, Punic, Aramaic, Arabic and other Semitic languages, with a maximum of three cases, and Egyptian, with no case distinctions. The case system of Latin doesn't seem to have enfeebled them in the least.

We might also enquire about the inflection of the languages of other great military powers. The greatest empire the world has ever known was that of the Mongols. Classical Mongolian had seven cases, all clearly distinguished, in contrast to Latin: nominative, accusative, dative, genitive, ablative, instrumental, and comitative. Somehow this impediment didn't stop them from conquering most of Eurasia, including the caseless Chinese.

Posted by Bill Poser at 05:00 PM

Them old diacritical blues again

Depending on your browser, you may have noticed some oddities in the Chuvash endings cited in this recent Language Log post about Attila the ~~Goth~~ Hun. The privative and benefactive suffixes should have vowels (a and e) written with underdots. Since there are no economically important languages that use vowels with underdots, the Unicode Consortium in its wisdom has determined that such characters must be handled in the virtuous fashion, by composition of character features, rather than in the convenient and workable fashion, using pre-composed characters such as those provided for the major European and East Asian languages. For the same reason, software writers have been lackadaisical at best about supporting character composition. This creates a catch-22 of global proportions: diacritic-heavy languages like Yoruba don't have the clout to force Unicode to include pre-composed variants, as for instance Korean did; but they also don't have the clout to make software writers render the relevant combining-character sequences correctly.

As I've mentioned in an earlier post, the problems of complex character rendering can be very complex indeed. However, putting a dot under a vowel is not exactly rocket science, and you'd think that people could agree about how to do that much, and then implement that agreement in a consistent way. Of course, you'd be naive and foolish to think that.

In order to get around the fact that not all browsers (and/or browser character encoding and font settings) deal with raw Unicode correctly, I dutifully transformed Unicode 0323 "COMBINING DOT BELOW" into its html character-entity form &#803; (changing the code point number from hex to decimal and wrapping it with &#___;). And then I put this abominable string after the vowel to be underdotted, like so:

-sa&#803;r

which should produce "-sạr" with a nicely underdotted "a".
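The hex-to-decimal transformation just described is easy to get wrong by hand, so here is a short Python sketch of it. The function names are mine; `html.unescape` from the standard library is used only to confirm that the entity decodes back to the vowel followed by U+0323.

```python
import html

COMBINING_DOT_BELOW = "\u0323"  # hex 0323 is decimal 803


def numeric_entity(char: str) -> str:
    """Return the decimal HTML character reference for a single character."""
    return f"&#{ord(char)};"


def underdot(base: str) -> str:
    """Return HTML source for `base` followed by a combining dot below."""
    return base + numeric_entity(COMBINING_DOT_BELOW)


# HTML source for the privative suffix, with the entity after the vowel.
SOURCE = "-s" + underdot("a") + "r"

if __name__ == "__main__":
    print(SOURCE)
    # What a conforming parser hands to the renderer:
    print([f"U+{ord(c):04X}" for c in html.unescape(SOURCE)])
```

Note that this only shows what the markup says; whether a given browser and font then draw the dot under the right letter is a separate question.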

But in fact it produced a bewildering variety of different outcomes in the different circumstances that I've tried so far.

In Internet Explorer on my windows laptop, depending on what font I've selected, it produces "sar" with a single dot below the "a", "sar" with a pair of side-by-side dots below the "a", or "sar" with an empty square glyph (indicating a missing character), either after the "a" or superimposed on the "a" (I don't know which settings are responsible for the last difference).

In Mozilla/Netscape/Firebird/Firefox, depending on what font I've selected, it produces the single or double dots NOT under the (preceding) "a" BUT RATHER under the following "r". As I understand the Unicode and HTML specs, this is wrong. Of course, mozilla is also happy to produce versions with empty boxes in various locations, if the font is missing the combining diacritics, as many fonts are.

Macromedia's Dreamweaver doesn't even try to do any composition in such cases. I haven't had the heart to check Opera or Safari or Java's HTML rendering classes or any of the other options.

The empty boxes are just a matter of fonts lacking the diacritic glyphs -- OK, that's just a residue of history. The double underdots are apparently a particular font that has gotten 0323 and 0324 mixed up -- OK, that's just a little mistake. But not being able to agree on whether combining diacritics combine to the right or the left?

Come on, people, this is pathetic. It's a gratuitous, ongoing insult to the hundreds of millions of people around the world whose languages are normally written in a Latin alphabet with diacritics that Unicode doesn't happen to provide in precomposed form. And since Microsoft often takes a few lumps in the blogosphere, let's specify that it's the Beast of Redmond that did the right thing here, and Mozilla that gets it wrong.

What I actually decided to do on the page in question was to put the &#803; character entities in the wrong place (before rather than after the vowels) so that the whole mess renders correctly in the mozilla-based browsers that I usually use. If I set the font right.

Here it is, so you can see what happens in your environment: "-ṣar".

Of course, the result is that the page renders incorrectly for the 55% of our readers who use some form of Internet Explorer. Sorry, folks. What I'm supposed to do, I guess, is to put in some javascript code figuring out what browser people are using, and then select different stretches of html, depending. But I won't.

[If one of you can tell me how to do underdots in html in a reliably portable way, I'll buy you a very good dinner the next time we're in the same city.]

[Update: Tenser, said the Tensor finds underdot e and a in Unicode! He spies them "in the Latin Extended Additional range, which I believe is the 'tricked out Latin characters for use in Vietnamese' range." He has a few other useful suggestions as well, all of which I'll try out when I next have a few minutes. Meanwhile, I believe that I owe a dinner, and will make arrangements to pay up.

One of the reasons that I didn't find these characters is that the index of character names given seems quite incomplete. None of the obvious index points seems to turn up e.g. LATIN SMALL LETTER A WITH DOT BELOW (such as LATIN or SMALL or A or DOT or BELOW).

In any case, what I said about combining diacritics still stands -- for example, to handle Yoruba, you need to be able to combine underdotted vowels with acute and grave accents (for tone).]
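Both halves of this update can be checked with Python's standard `unicodedata` module, as in the sketch below (the variable names are mine). NFC normalization turns a plain vowel plus U+0323 into the precomposed character, and the character-name lookup finds it directly. But an underdotted e with an acute accent, as needed for Yoruba, still comes out as two code points, because no fully precomposed form exists.

```python
import unicodedata

# Plain "a" + COMBINING DOT BELOW composes to a single precomposed character.
a_dot = unicodedata.normalize("NFC", "a\u0323")

# The same character can be found by its official name.
by_name = unicodedata.lookup("LATIN SMALL LETTER A WITH DOT BELOW")

# Underdotted "e" with an acute accent: the dot composes, the accent does not.
e_dot_acute = unicodedata.normalize("NFC", "e\u0323\u0301")

if __name__ == "__main__":
    print(f"U+{ord(a_dot):04X}", unicodedata.name(a_dot))
    print([f"U+{ord(c):04X}" for c in e_dot_acute])
```

So the Latin Extended Additional characters solve the single-diacritic case, while the stacked cases still depend on software combining the remaining mark correctly.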

Posted by Mark Liberman at 12:28 PM

Attila the Gothic dad

What language did Attila the Hun speak? Well, Hunnic, of course. But no one really knows what kind of language Hunnic was, which is odd considering what a big splash Attila and the Huns made across Asia and Europe in the 5th century AD. This question came up because of a silly joke about the allegedly enfeebling effects of Latin noun cases. Hunnic has an Ethnologue code (XHC), but that's just a placeholder with no real information associated with it. So I asked Don Ringe, who knows more about language history than anyone else of my acquaintance, and he wrote:

"I think the prevailing opinion is that they were probably speakers of some Turkic language--and probably not of the 'nuclear' branch, which should still have occupied a compact area in western Mongolia at the time, so possibly something more closely related to Chuvash?? But so far as I know, this is all speculation."

OK. Chuvash (otherwise known as "Bulgar") is reported to have eight nominal case forms:

nominative: -
genitive: -Vn
dative/accusative: -(n)a/-(n)e
locative: -ra/-re, ta/te
ablative: -ran/-ren, tan/ten
instrumental: -pa/-pe, pala/pele
privative: -ṣar/-ṣer
benefactive: -šan/-šen

as well as first, second and third person possessive affixes in both singular and plural flavors.

If Hunnic was indeed Chuvash-like, this profusion of inflectional options, though greater than those available in Latin, doesn't seem to have interfered with the Huns' martial vigor.

Don added something that I found very interesting:

The guy's "name", of course, is Gothic--it means "Dad"--and it's obviously what his Gothic troops called him.

This may come as news to those of you whose Gothic is as weak as mine. But it does accord with the generally multicultural and therefore polyglot practices of the Huns. Here is the Roman diplomat Priscus, reporting on his visit to the Hunnic leader Onegesius:

When I arrived at the house, along with the attendants who carried the gifts, I found the doors closed, and had to wait until some one should come out and announce our arrival. As I waited and walked up and down in front of the enclosure which surrounded the house, a man, whom from his Scythian dress I took for a barbarian, came up and addressed me in Greek, with the word Xaire, "Hail!" I was surprised at a Scythian speaking Greek. For the subjects of the Huns, swept together from various lands, speak, besides their own barbarous tongues, either Hunnic or Gothic, or--as many as have commercial dealings with the western Romans--Latin; but none of them easily speak Greek, except captives from the Thracian or Illyrian sea-coast; and these last are easily known to any stranger by their torn garments and the squalor of their heads, as men who have met with a reverse. This man, on the contrary, resembled a well-to-do Scythian, being well dressed, and having his hair cut in a circle after Scythian fashion. Having returned his salutation, I asked him who he was and whence he had come into a foreign land and adopted Scythian life. When he asked me why I wanted to know, I told him that his Hellenic speech had prompted my curiosity. Then he smiled and said that he was born a Greek and had gone as a merchant to Viminacium, on the Danube, where he had stayed a long time, and married a very rich wife. But the city fell a prey to the barbarians, and he was stript of his prosperity, and on account of his riches was allotted to Onegesius in the division of the spoil, as it was the custom among the Scythians for the chiefs to reserve for themselves the rich prisoners. Having fought bravely against the Romans and the Acatiri, he had paid the spoils he won to his master, and so obtained freedom. He then married a barbarian wife and had children, and had the privilege of eating at the table of Onegesius.

Gothic had only five cases. However, this should still have left the military "balance of inflections" roughly equal, since Latin's historical seven cases (with only four or five really distinguished by the fifth century AD) strike a rough average between the Hunnic eight (or nine, if they had not merged accusative and dative yet) and the Gothic five.

The life story of Priscus' Greek Hun is consistent with Gibbon's speculations about the multicultural attitudes and practices of pastoral invaders:

In all their invasions of the civilised empires of the South, the Scythian shepherds have been uniformly actuated by a savage and destructive spirit. The laws of war, that restrain the exercise of national rapine and murder, are founded on two principles of substantial interest: the knowledge of the permanent benefits which may be obtained by a moderate use of conquest, and a just apprehension lest the desolation which we inflict on the enemy's country may be retaliated on our own. But these considerations of hope and fear are almost unknown in the pastoral state of nations. The Huns of Attila may without injustice be compared to the Moguls and Tartars before their primitive manners were changed by religion and luxury; and the evidence of Oriental history may reflect some light on the short and imperfect annals of Rome.... in the cities of Asia which yielded to the Moguls, the inhuman abuse of the rights of war was exercised with a regular form of discipline, which may, with equal reason though not with equal authority, be imputed to the victorious Huns. The inhabitants who had submitted to their discretion were ordered to evacuate their houses and to assemble in some plain adjacent to the city, where a division was made of the vanquished into three parts. The first class consisted of the soldiers of the garrison and the young men capable of bearing arms; and their fate was instantly decided: they were either enlisted among the Moguls, or they were massacred on the spot by the troops, who, with pointed spears and bended bows, had formed a circle round the captive multitude. The second class, composed of the young and beautiful women, of the artificers of every rank and profession, and of the more wealthy or honourable citizens, from whom a private ransom might be expected, was distributed in equal or proportionable lots. 
The remainder, whose life or death was alike useless to the conquerors, were permitted to return to the city, which in the meanwhile had been stripped of its valuable furniture; and a tax was imposed on those wretched inhabitants for the indulgence of breathing their native air. Such was the behaviour of the Moguls when they were not conscious of any extraordinary rigour.

Gibbon calls the Huns "Scythians" because they had come to occupy the region previously inhabited by that group. The Scythians spoke an Indo-European language (a conclusion based on the handful of Scythian words recorded by Herodotus), and there were doubtless lots of speakers of Scythian dialects in Attila's multicultural army. They had at least as many noun cases to contend with as Latin speakers did, as well.

[Update: Bill Poser emails:

I have seen studies ... of the recorded names of Huns, mostly military officers, and the great majority are Germanic, which is consistent with the view that although the Huns may have had at their core a Turkic-speaking group, they absorbed all manner of other peoples.

]

Posted by Mark Liberman at 11:17 AM

Imprecational Categories

I did a "Fresh Air" piece last week on profanity, which led (the preterite of lede?) by mentioning Bono's "really fucking brilliant" remark on the Golden Globes last year (though on NPR, of course, that came out as "effing brilliant"). The FCC had refused to sanction NBC for the remark, on the grounds that their guidelines limit indecency to "material that describes or depicts sexual or excretory organs or activities," whereas Bono had merely used fucking as "an adjective or expletive to emphasize an exclamation." (Since then, FCC Commissioner Michael Powell has announced that he would be reconsidering the Bono ruling.)

Several commentators were unable to resist observing that the agency had gotten the sentence's grammar wrong -- since fucking modifies brilliant, they said, it must be an adverb, not an adjective. But in the piece, I said I wasn't entirely convinced by that argument -- true, fucking isn't an adjective here, but if it were really functioning as an adverb, shouldn't it have been fuckingly?

The next day I received an email from an English teacher who remonstrated with me for making that suggestion. "Of course the word "f*%#ing," as used by Bono, was an adverb!" she wrote, "Why confuse the general public, by implying that f*cking might not be an adverb because it doesn't end in the ending -ly? Maybe I'm old school, but I don't think we can afford to suggest that only words that end in -ly are adverbs."

Well, but not so fast.

True, not all adverbs end in -ly (that was my little joke) -- you wouldn't want to say that very wasn't an adverb in very brilliant. But fucking doesn't behave the way real adverbs do:

1. How brilliant was it? Extraordinarily (so).
2. How brilliant was it? Very.
3. *How brilliant was it? Fucking (so).

In fact, if you say that fucking is an adverb in "fucking brilliant," then aren't you committed to saying it's also an adverb when it appears as an infix in "in-fucking-credible"? And while we're at it, it seems odd to analyze fucking as an adjective in a phrase like no fucking way -- after all, it certainly doesn't modify way, nor does it pass the ordinary tests for adjectives, as in *The test seemed fucking, etc. (Note that fucking can be applied to just about any idiom chunk, however resistant to modification it otherwise is -- cf. He kicked the fucking bucket, They shot the fucking breeze, etc.)

So maybe we should think of fucking as an emphatic particle (whatever that is) in all these uses. But in response to that English teacher's question, is this something we can afford to suggest to the general public? Not on Michael Powell's watch.

PS -- I note that Salon's Scott Rosenberg has gotten this one right.

Posted by Geoff Nunberg at 02:42 AM

Attila, Honoria and nominal inflections

Hugh Reilly writes, in a piece entitled "May the Latin language requiescat in pace":

As a quidnunc schoolboy, I am delighted with the demise of Latin. No longer will kids have to grasp more cases than a Heathrow airport baggage-handler. Forget decadence; the reason for the collapse of the Roman Empire was that while Marcus et al had their heads up their anuses dealing with datives, ablatives and nominatives, Attila rode in and implemented the rather nihilist diktats of the Hun town-planning department.

I hate to puncture a joke with mere facts, but there are three problems with this passage: first, if Hunnic was Turkic or at least Altaic, it probably had about as many noun cases as Latin did; second, when Attila invaded the Latin-speaking western Roman empire, the fighting on both sides was done by alliances of dozens of tribes, mostly Germanic, who also spoke languages with roughly the same number of cases as Latin; third, Attila's first invasion ended in his defeat, and the second one ended in his death, and in neither invasion did he flatten many Latin-speaking cities.

On the positive side, the whole Attila thing is a great story, though not one in which case endings play much of a role.

Here are a few links: Priscus on Attila's court; Jordanes on Attila in person; Gibbon on the Huns, and on Attila's invasions of Gaul and Italy; Arther Ferrill's essay "Attila the Hun and the Battle of Chalons".

From Gibbon's description of Attila's invasion of Gaul in 450 A.D., giving a sense of the Battle of Chalons as World War minus I, more or less:

The kings and nations of Germany and Scythia, from the Volga perhaps to the Danube, obeyed the warlike summons of Attila. From the royal village in the plains of Hungary his standard moved towards the West and after a march of seven or eight hundred miles he reached the conflux of the Rhine and the Neckar, where he was joined by the Franks who adhered to his ally, the elder of the sons of Clodion...

Theodoric ... declared that as the faithful ally of Aetius and the Romans he was ready to expose his life and kingdom for the common safety of Gaul. The Visigoths, who at that time were in the mature vigour of their fame and power, obeyed with alacrity the signal of war, prepared their arms and horses, and assembled under the standard of their aged king.... The example of the Goths determined several tribes or nations that seemed to fluctuate between the Huns and the Romans.... the troops of Gaul and Germany, who had formerly acknowledged themselves the subjects or soldiers of the republic, but who now claimed the rewards of voluntary service and the rank of independent allies; the Laeti, the Armoricans, the Breones, the Saxons, the Burgundians, the Sarmatians or Alani, the Ripuarians, and the Franks who followed Meroveus as their lawful prince....

The nations from the Volga to the Atlantic were assembled on the plain of Chalons; but many of these nations had been divided by faction, or conquest, or emigration; and the appearance of similar arms and ensigns, which threatened each other, presented the image of a civil war.

That was a battle that Attila's forces lost.

The Roman princess Honoria played a curious role in Attila's invasions of the western Roman empire -- I'm surprised that no movie has yet been made about her. Here is how Gibbon describes her initial role in the drama:

When Attila declared his resolution of supporting the cause of his allies the Vandals and the Franks, at the same time, and almost in the spirit of romantic chivalry, the savage monarch professed himself the lover and the champion of the princess Honoria. The sister of Valentinian was educated in the palace of Ravenna; and as her marriage might be productive of some danger to the state, she was raised, by the title of Augusta, above the hopes of the most presumptuous subject. But the fair Honoria had no sooner attained the sixteenth year of her age than she detested the importunate greatness which must for ever exclude her from the comforts of honourable love: in the midst of vain and unsatisfactory pomp Honoria sighed, yielded to the impulse of nature, and threw herself into the arms of her chamberlain Eugenius. Her guilt and shame (such is the absurd language of imperious man) were soon betrayed by the appearances of pregnancy: but the disgrace of the royal family was published to the world by the imprudence of the empress Placidia, who dismissed her daughter, after a strict and shameful confinement, to a remote exile at Constantinople. The unhappy princess passed twelve or fourteen years in the irksome society of the sisters of Theodosius and their chosen virgins, to whose crown Honoria could no longer aspire, and whose monastic assiduity of prayer, fasting, and vigils she reluctantly imitated. Her impatience of long and hopeless celibacy urged her to embrace a strange and desperate resolution. The name of Attila was familiar and formidable at Constantinople, and his frequent embassies entertained a perpetual intercourse between his camp and the Imperial palace. In the pursuit of love, or rather of revenge, the daughter of Placidia sacrificed every duty and every prejudice, and offered to deliver her person into the arms of a barbarian of whose language she was ignorant, whose figure was scarcely human, and whose religion and manners she abhorred. 
By the ministry of a faithful eunuch she transmitted to Attila a ring, the pledge of her affection, and earnestly conjured him to claim her as a lawful spouse to whom he had been secretly betrothed.

After Attila's defeat in Gaul, he regrouped and invaded Italy in 452:

The Italians, who had long since renounced the exercise of arms, were surprised, after forty years' peace, by the approach of a formidable barbarian, whom they abhorred as the enemy of their religion as well as of their republic. Amidst the general consternation, Aetius alone was incapable of fear; but it was impossible that he should achieve alone and unassisted any military exploits worthy of his former renown. The barbarians who had defended Gaul refused to march to the relief of Italy; and the succours promised by the Eastern emperor were distant and doubtful.

So the Romans decided to agree to let Honoria marry Attila:

The Western emperor, with the senate and people of Rome, embraced the more salutary resolution of deprecating, by a solemn and suppliant embassy, the wrath of Attila. ... The Roman ambassadors were introduced to the tent of Attila, as he lay encamped at the place where the slow-winding Mincius is lost in the foaming waves of the lake Benacus, and trampled, with his Scythian cavalry, the farms of Catullus and Virgil. The barbarian monarch listened with favourable, and even respectful, attention; and the deliverance of Italy was purchased by the immense ransom or dowry of the princess Honoria. The state of his army might facilitate the treaty and hasten his retreat. Their martial spirit was relaxed by the wealth and indolence of a warm climate. The shepherds of the North, whose ordinary food consisted of milk and raw flesh, indulged themselves too freely in the use of bread, of wine, and of meat prepared and seasoned by the arts of cookery; and the progress of disease revenged in some measure the injuries of the Italians.

However, Attila didn't live to hook up with Honoria:

Before the king of the Huns evacuated Italy, he threatened to return more dreadful, and more implacable, if his bride, the princess Honoria, were not delivered to his ambassadors within the term stipulated by the treaty. Yet, in the meanwhile, Attila relieved his tender anxiety, by adding a beautiful maid, whose name was Ildico, to the list of his innumerable wives. Their marriage was celebrated with barbaric pomp and festivity, at his wooden palace beyond the Danube; and the monarch, oppressed with wine and sleep, retired at a late hour from the banquet to the nuptial bed. His attendants continued to respect his pleasures or his repose the greatest part of the ensuing day, till the unusual silence alarmed their fears and suspicions; and, after attempting to awaken Attila by loud and repeated cries, they at length broke into the royal apartment. They found the trembling bride sitting by the bedside, hiding her face with her veil, and lamenting her own danger, as well as the death of the king, who had expired during the night. An artery had suddenly burst: and as Attila lay in a supine posture, he was suffocated by a torrent of blood, which, instead of finding a passage through the nostrils, regurgitated into the lungs and stomach.

[Reilly article via Classics in Contemporary Culture]

[Update: David Pesetsky emailed to point out that in fact a movie has been made about Attila and Honoria. David says that it's "not too good, but not impossibly terrible, either." (That's a poster blurb you don't see too often -- "Not impossibly terrible!") He explains that "[t]he movie is about Attila, technically, but a somewhat messed-up version of the Honoria/Attila story is the movie's main event."]

Posted by Mark Liberman at 12:10 AM

March 20, 2004

Starbucks and Haidabucks

Geoff Pullum has pointed out that Starbucks' threat to sue the purveyors of Star Bock Beer for trademark infringement is likely to founder on the fact that Starbucks and Star Bock do not rhyme and are not all that likely to be confused. You might think that Starbucks would also have difficulty persuading a court that they are in the same business as the folks making beer; trademarks are only applicable within a particular type of business, and the coffee and beer trades are generally considered distinct. However, to my surprise, a check of the US Patent and Trademark Office database shows that the Starbucks Coffee people have also registered a trademark in the alcoholic beverages category, so it looks like it is only the confusability of Starbucks and Star Bock that will determine the matter.

Although Starbucks and Star Bock don't seem all that confusable to us, the Starbucks people have made legal threats in at least one case in which the allegedly infringing term was even more different from Starbucks' trademark. Last year Starbucks threatened to sue Haidabucks Café, a small café in Masset, British Columbia run by four Haida Indians. They called their place Haidabucks because buck is a local term for "young Indian man" and they are young Haida men. Haida and Star don't sound much alike. Furthermore, to quote Haidabucks' characterization, Haidabucks is:

A small café located in NW Canada - on an island, in a village of 700 inhabitants

that

Serves a full menu of tasty food and beverages at reasonable prices.

while Starbucks is a:

Publicly traded, global conglomerate with locations in metropolitan areas

that

Serves high-priced coffee, tea, and pastries.

The odds of anybody confusing the two are pretty small.

In this case, Starbucks eventually backed down as the result of the awful publicity it generated coupled with the support of Joseph Arvay, QC of Arvay Finlay, one of British Columbia's top lawyers, and Baldwin & Baldwin Business Solutions, whose owner was so outraged that he set up Haidabucks' web site gratis. I think there are two lessons here. One is that, as Geoff said, greedy corporations are trying to take over whole regions of the phonetosphere. The other is that sometimes the little guy does win, when people pull together to fight the kleptocrats.

Posted by Bill Poser at 11:04 PM

St@rb*cks claims a whole chunk of the phonetosphere

Rex Bell, known as "Wrecks" to his buddies, is a bar owner in Galveston, Texas. His bar is a laid-back place called the Acoustic Cafe. One night last February a customer asked for a "Lone Star... uh... make that a Shiner Bock." And Bell said, for some reason, "I'll give you a Star Bock." And then he thought, hey, that's a good joke, and a good name. The Lone Star state with its own bock beer, Star Bock. So he set up an arrangement with Brenham Brewery to ship a version of its high-rated Brenham Bock to the Acoustic Cafe labeled as Star Bock.

And of course (this won't surprise readers of Language Log) Starbucks have instantly issued legal threats (as widely reported, e.g. here). Bell is being ordered to abandon his Federal trademark registration efforts, to "immediately [their emphasis] cease any and all use of the Starbock Beer and/or Starbock mark" and to "destroy any signage, menus or other materials bearing the Starbock Beer and/or Starbock."

But of course, linguists will note, the vocalism is different here: bock does not rhyme with buck. Starbucks surely can't lay claim not just to their own trade name but also to everything that begins with star and ends with anything that might remind anyone of bucks: star box, star pox, star fox, star backs, star blacks, star Macs... where's this commercial takeover of whole regions of the phonetosphere going to end? (You've seen this issue before on Language Log, of course, quite recently, here.)

Posted by Geoffrey K. Pullum at 09:43 PM

A picture is worth...

The pensive primate on the right was sitting at a roadside stand in south India when I shot this picture last August 16. The original is a 1704x2272 jpeg which has been occupying 1,187,028 bytes on my laptop's disk since I uploaded it in Bangalore shortly after taking it. A few minutes ago, I checked the Language Log backups, as I do occasionally. The compressed archive of all the text we've posted since I took that picture (and there were only 8 posts before that) occupies 987,890 bytes.

See the archives for September, October, November, December, January, February and March for a more qualitative experience of the text whose information content is a bit less than that of the digital photograph of this monkey. Well, 1,593,104 bits less, as long as we're counting.
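As long as we're counting, here is the arithmetic behind that figure as a quick Python check. The variable names are mine; the byte counts are the ones quoted above.

```python
# Byte counts quoted in the post.
photo_bytes = 1_187_028    # the 1704x2272 jpeg
archive_bytes = 987_890    # compressed archive of all the posted text

# Difference in bytes, converted to bits (8 bits per byte).
difference_bits = (photo_bytes - archive_bytes) * 8
print(f"{difference_bits:,} bits")
```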

I was going to say more, but I think I'll leave it at that.

Posted by Mark Liberman at 03:35 PM

Linguistics in the ecology of academia

The author of the blog Tenser, Said the Tensor writes that

I got mentioned in a post on Language Log and my traffic more than doubled. That's right, I'm now solidly into the double digits. Boo-yah! It's the linguistics blogosphere equivalent of an Instalanche—it's a Logalanche!

Instalanche is defined here as

A sudden influx of thousands of hits that threatens to crush your server, brought on by a link from Glenn Reynolds at Instapundit.com.

This is flattering until you consider the numbers. As of this morning, the sitemeter stats for Language Log are an average of 969 visits and 2,345 page views per day, whereas the same numbers for Glenn Reynolds' Instapundit are 82,500 visits and 109,999 page views per day. Both numbers vary, since these are week-sized running averages of time series that have meaningful variation on time scales from hours to years. In both cases, the overall trend is positive -- Glenn's total visits in February were about 25% higher than in November, and ours more than doubled over the same period. Details aside, I think it's fair to say that politically-oriented sites like Instapundit get two orders of magnitude more traffic than language-oriented sites like Language Log. So a Logalanche is a pretty small slide, alas.
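For the curious, the "orders of magnitude" comparison can be checked with a few lines of Python. This is only a sketch over the daily averages quoted above, and the dictionary names are hypothetical.

```python
import math

# Daily averages quoted in the post (sitemeter figures).
language_log = {"visits": 969, "page_views": 2_345}
instapundit = {"visits": 82_500, "page_views": 109_999}

# Ratio of Instapundit traffic to Language Log traffic for each measure.
ratios = {key: instapundit[key] / language_log[key] for key in language_log}

for key, ratio in ratios.items():
    # log10 of the ratio expresses the gap in orders of magnitude.
    print(f"{key}: {ratio:.0f}x, about {math.log10(ratio):.1f} orders of magnitude")
```

The ratios come out at roughly 85 for visits and 47 for page views, so "two orders of magnitude" is a rounded-up figure.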

This reminds me of something a publisher told me a few years ago. In the U.S., the number of students each year who take an introductory psychology course at the college level is about 1.5 million, while the number of students who take an introductory linguistics course is about 50 thousand.

It may seem like a natural feature of the intellectual landscape that 30 times more college students should learn psychology than linguistics, but from a historical perspective, it's nuts. Through early modern times, the foundation of all education was presumed to be grammar, logic and rhetoric; one of the greatest intellectual accomplishments of the 19th century was philology and the reconstruction of linguistic history; through much of the 20th century, linguistic anthropology was central to the social sciences, and the "linguistic turn" loomed large in philosophy; after 1950, formal grammars came to play a central role in the development of computer science, and issues in the psychology of language were at the core of a conceptual revolution in psychology.

How the field of linguistics squandered these natural advantages over the past century is a long, sad story. However, we can still be optimistic about the future. Discarding history in favor of intellectual zero-based budgeting, there are plenty of good reasons that basic linguistics should be taught in high school and that every college student should take an introductory linguistics course. Human experience and human nature are based on language and language use: you can't understand the human mind, human society or your own life without learning about what language is and how it works. You need basic linguistics to understand public policy issues like how to teach reading or how to inculcate language standards while respecting linguistic variation. It's essential for making informed decisions about the remediation of reading disabilities or speech impediments, planning the education of a deaf child, or dealing with the aftermath of a stroke. It's relevant for understanding and evaluating language-related technology, and necessary for developing it. It's helpful in teaching and learning any language-related skill, from spelling to rhetoric. The concepts and skills involved are useful in other fields of study, from sociology and history to physics and molecular biology. And, of course, it's fun.

It took a long series of unfortunate accidents and bad decisions, over several centuries, to bring the study of language to its current all-time low point in the ecology of academia. It would take a sustained effort over several decades to reverse this trend. It could happen, though -- and then being mentioned in a popular linguistics communications channel might really overload someone's cortical implant, or whatever the 2025 measure of high-tech popularity turns out to be.

[A note to the side: I compared enrollments in linguistics and psychology because those are the numbers I happen to know, and I'm taking college-level course registrations as a proxy for general intellectual mindshare. I believe that political science, economics and similar fields have total enrollments comparable to those in psychology, anyhow within a small multiplier in one direction or the other. But unless I'm missing something, there are relatively few psychology weblogs, and even fewer by academic psychologists or psychology students (though there are some, for instance Jonathan Baron's). The psych blogs that exist (even the clinical ones, like John Grohol's) don't seem to have especially large influence, at least to the extent that I can measure this through e.g. technorati link counts. I don't have an explanation for this, though I have a hypothesis. ]

Posted by Mark Liberman at 10:40 AM

National Character Writ Large?

A recent Letter from Asia in the New York Times by Norimitsu Onishi entitled Japan and China: National Character Writ Large advances the proposition that the way in which Japanese writes foreign words reflects the strong separation that the inward-looking Japanese make between things foreign and things Japanese. Onishi writes:

Of all languages in the world, Japanese is the only one that has an entirely different set of written characters to express foreign words and names. Just seeing these characters automatically tells the Japanese that they are dealing with something or someone non-Japanese.

Onishi contrasts the Japanese with the outward-looking Chinese, who have no special way of writing foreign words but create Chinese character spellings for them. He's right about the difference in national character, but I am doubtful of the relationship he suggests between national character and writing.

Japanese is written in a mixture of three sets of characters. One set consists of 漢字 [kanji] "Chinese characters". Most 漢字 are of Chinese origin, though as I've previously mentioned, there are some 国字 [kokuji] "national characters", which were created in Japan. The other two sets of characters are ひらがな [hiragana] and カタカナ [katakana], each of which by itself constitutes a basically phonological, moraic writing system. Except for certain details hiragana and katakana differ only in the shapes of the letters e.g. hiragana な vs. katakana ナ [na]. Japanese can in principle be written entirely in hiragana or entirely in katakana, though this is rarely done in practice. What Onishi refers to in the passage quoted is the fact that foreign words are usually written in katakana. This is true, but it isn't true that Japanese "has an entirely different set of written characters to express foreign words and names".
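The parallelism between the two kana systems is visible even in how they are encoded on a computer. Here is a minimal Python sketch, relying on a fact about the Unicode layout rather than anything in Onishi's article: each basic katakana letter sits exactly 0x60 code points above its hiragana counterpart. The helper name hira_to_kata is hypothetical.

```python
import unicodedata

KANA_OFFSET = 0x60  # distance between matching letters in the Hiragana and Katakana blocks

def hira_to_kata(text: str) -> str:
    """Rewrite hiragana letters (U+3041..U+3096) as katakana; leave everything else alone."""
    return "".join(
        chr(ord(ch) + KANA_OFFSET) if "\u3041" <= ch <= "\u3096" else ch
        for ch in text
    )

# The two letters for [na] carry parallel names in the Unicode character database.
print(unicodedata.name("な"), "/", unicodedata.name("ナ"))

# The same word written in either system.
print(hira_to_kata("かえる"))
```

Since the mapping is one-to-one for the basic letters, a text in one system can be rewritten mechanically in the other, which is consistent with the point that neither is inherently a "foreign" script.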

Historically, there is no association at all between katakana and foreign words. Originally, Japanese was written entirely in Chinese characters, where the characters were sometimes used for their meaning and sometimes for their sound. Not just any character could be used for its sound: for each syllable a certain set of characters could be used, up to about a dozen. This writing system is called 万葉仮名 [man'yoogana] "10,000 leaf kana", after the 万葉集 Man'yooshuu "collection of 10,000 leaves", the great anthology of poetry compiled in 752 C.E., which was written in this writing system. Over time, the redundant characters were eliminated, so that each syllable was represented by a single character, and the characters were simplified, which had the effect of differentiating them from Chinese characters. For instance, the katakana letter ナ [na] is a simplification of the Chinese character 奈.

This systematization and simplification of the 万葉仮名 took place twice, resulting in hiragana and katakana. Hiragana came to be used particularly by women, katakana (together with Chinese characters) by men. Prior to the Second World War, katakana were routinely used to write native Japanese words. When European words first began to enter Japanese in the latter half of the sixteenth century, in many cases Chinese character spellings were created for them, as I've mentioned before. There was no special way of writing them.

Two relatively recent developments give rise to the impression that katakana are for writing foreign words. The earlier of the two is the shift away from maximizing the use of Chinese characters. This resulted in most of the old Chinese character spellings for European words being abandoned, and in the cessation of the creation of Chinese character spellings for newly introduced foreign words. The more recent of the two is the postwar shift to hiragana as the default phonological writing system. Together, these resulted in foreign words being written phonologically, and in the use of katakana becoming special.

Even if katakana were not developed for the purpose of writing foreign words, are they now used exclusively for this purpose? No. Katakana are also used in a number of other situations:

to write the common names of plants and animals in scientific text, e.g. カエル [kaeru] "frog";
to write certain female given names e.g. エミ Emi and マリ Mari;
to write slang words such as インチキ [inchiki] "fake";
to write onomatopoeic words such as ワンワン [wanwan] "bow-wow";
to spell out a person's name so that the reader will be sure to pronounce it correctly;
to write any word when it is desired to emphasize it, as italics are used in English.

Until recently, telegrams were always written in katakana. However, in 1988 it became possible to use hiragana, katakana and roman letters in telegrams, so the default writing system for telegrams shifted to hiragana as it had for other text.

So we see that even now katakana are by no means used exclusively for foreign words. The real principle at work is that hiragana is the default, while katakana is marked. When you want to mark something as special, you use katakana, rather like italics and scare quotes are used in English. The fact that Japanese usually write foreign words in the marked writing system may reflect a particularly intense interest in what is foreign and what is Japanese, but it isn't really very different from the English practice of writing words and expressions still perceived as foreign in italics, such as ad hoc and force majeure.

Posted by Bill Poser at 03:27 AM

March 19, 2004

The A-er the B, the C-er the D

In response to my observation that "the X of it all" is a phrasal template without any content words, Bert Cappelle emailed to point out that there is at least one other English pattern of the same kind, which he characterizes as

The X-er (...), the Y-er (...)
where "(...)" can, but does not need to, be filled with clausal material.

Bert sends along an intriguing example from The Simpsons (By the way, I've been told that The Simpsons has now taken over from Shakespeare and the Bible as the largest single source of quotations and allusions in English-language text. I'm not sure who measured this, or how, or when. Most likely someone just made it up, like 87% of all cited statistics. However, it might well be true...):

The older they get, the cuter they ain't.

As Bert points out, this isn't quite grammatical (entirely apart from the use of ain't -- substituting aren't doesn't change things), nor is it quite semantically compositional, but the meaning is clear.

Bert notes that "The X-er (...) the Y-er (...)" patterns are called "correlative conditionals" in Huddleston and Pullum's grammar -- I don't have my copy at hand, so I'll have to check later to see what they say.

A common form of this pattern is the verbless "The A-er the B, the C-er the D", as in the proverb "The nearer the bone, the sweeter the meat." This month's Atlantic Magazine has a poem by Samuel Hazo (the first state poet of Pennsylvania!) whose first 15 lines are of this form, with the final line "The longer you live, the fewer your years."

[Update 3/20/2004: Tenser, said the Tensor responds with an apt description of Ray Jackendoff's recent ideas about how to establish a formal continuum from fixed phrases to phrase structure rules, passing through cases like those discussed here.

And Semantic Compositions suggests a Talmudic model for a similar construction, as well as correcting a typo in the (original version of) this post.]

Posted by Mark Liberman at 01:58 PM

Yesterday's technology tomorrow

Slogans are small pieces of language designed to be catchy and to provoke thought. Often they succeed, sometimes but not always in the way intended. Today I saw on a truck (I don't even know which company's truck) the slogan

TOMORROW'S TECHNOLOGY TODAY

And I suddenly realized that is the exact opposite of what I want. I don't want tomorrow's technology today. I want yesterday's technology tomorrow. I want old things that have stood the test of time and are designed to last so that I will still be able to use them tomorrow. I don't want tomorrow's untested and bug-ridden ideas for fancy new junk made available today because although they're not ready for prime time the company has to hustle them out because it's been six months since the last big new product announcement. Call me old-fashioned, but I want stuff that works.

Shall I tell you how The Cambridge Grammar of English was prepared? (I am not changing the subject; trust me.) The book is huge: 1,859 printed pages. The double-spaced manuscript was about 3,500 pages (yes, it actually had to be printed out and written on by a copy editor the old-fashioned way). It took over ten years to write. And it was done using WordPerfect 6 for DOS. Rodney Huddleston chose to upgrade to that around 1989, wrote a couple of hundred complex macros, and stuck with it. I learned the WP DOS macro language in order to collaborate on the project.

WordPerfect was basically in its final, completed form before Clinton first ran for office. It works. The file format is fine for authors, and records everything we need to record. Rodney and I are still using WP6 file format today to write our planned student's introduction to English grammar. In all the years since the late 1970s, WordPerfect has not altered the file format: all the largely pointless upgrades in the program have been backward compatible. The format really does the job. But things are different with the WordPerfect program itself. The progress has largely been backward.

The things we have noticed about version differences are minor, but they all tell in the same direction: every upgrade is a downgrade. Version 5.1 is widely acknowledged by WP fans to have been superb except for not having graphics screen WYSIWYG capability. WP6-DOS added that graphics capability, but at least one neat thing that Rodney needed to do with a macro for generating packages of unique example-numbering labels (don't ask me to explain) turned out to be no longer possible with version 6. I had version 6.1 for Windows, a fine program except that it requires the use of the dreadful Windows OS. When I started using WP version 8 recently for greater compatibility with a newer (and worse) version of Windows that I had to get because I needed compatibility with DSL software, it was with reluctance. Well-founded reluctance: I found that version 8 always crashed if I ever used the spelling checker, and its file management system is much worse than that of 6.1. The spell-checker bug is deep, apparently (I tried standard published fixes and reinstalling of files). So version 8 is worse than version 6.

I did try version 10 for a day, because it came with the new Windows machine, but it became corrupted during its first attempt to access a printer driver and never worked again. No help from the machine's manufacturer (HP) or the software company (Corel) could fix things. I had to remove version 10 completely (and couldn't re-install, because of a new improved system that has a second hard disk with a copy of the system as it came from the factory, and you can wipe your disk and re-install with that, but everything you ever did is then wiped out).

Here's what I'm telling you (you may have lost the drift). The upgrade from WordPerfect version 5.1 to 6.0 lost some functionality for Rodney. Upgrading from version 6.1 for Windows to version 8 for Windows lost me the use of the spelling checker and made the file management worse. Upgrading to version 10 lost even the ability to print and crashed so badly that the machine had to be rebooted. Every upgrade has lost some important functionality. No upgrade since version 6 for DOS (which added screen graphics) has added anything I needed. Every upgrade is a downgrade.

WordPerfect still struggles on. They have announced version 12. It will add features that version 11 didn't need, and will contain bugs version 11 didn't suffer from.

Recently I noticed that Adobe's Acrobat Reader was warning me that I really should ditch version 5 and upgrade to version 6 as soon as possible and would I please just click here to do so. Thank heaven I asked my expert friend Adam before doing so. Adam told me that version 6 is a catastrophe for his work: it runs slow, the Find mechanism has been redesigned and is now hopeless... He had to reinstall version 5 to be able to go on doing his work.

Notice, I'm no Luddite. I don't reject technology. I depend on it. I'm deeply versed in many kinds of technology, information-age and other. What I'm talking about is an insatiable urge on the part of the people designing it to upgrade it to something worse. I could illustrate from any aspect of my technological life. For example, the new integrated Inter-Library Loan (ILL) system was introduced today at my university and there is lost functionality for the user (they are completely explicit about it: "On March 22, 2004, as part of a UC system-wide change, UCSC ILL will be migrating its current ILL management software into a system-wide management system. As a result, there will be noticeable changes to our on-line services... Unfortunately, the new system does not yet have a patron self-service interface. Beginning 3/22/04, the Patron Self Service (Manage Your Requests) Interface, which includes renewals, re-requests, ILL request tracking and cancellations, will need to be done over the phone or at the ILL service desk..."). We have a new email server too. It's terrible; migration to it has been temporarily halted. Unknown numbers of messages have been completely lost.

I could go on. I could grip you by the arm like the Ancient Mariner and tell you about non-information technology: about bathtubs and basins that won't hold the water in because instead of a rubber disc on a chain they have a $200 chrome assembly with hidden moving parts that leaks slowly when closed and won't drain well when open. I could tell you about electric kettles that used to switch themselves off when the water boiled but now with the new improved electronic sensor it doesn't work and they carry on boiling until your kitchen is full of steam. I could tell you about...

What? This is Language Log? Oops. Sorry. I thought I was writing on the new wide-open blog for curmudgeons, http://www.RandomFlames.org. (You may have a little trouble connecting to their site; they just migrated to a new and improved server.)

Posted by Geoffrey K. Pullum at 12:52 PM

Putting the X in Y

Bert Cappelle of K.U. Leuven sent in his collection of "we put the X in(to) Y" snowclones. Bert explains that "I'm not a native speaker but do watch the Simpsons a lot." With respect to Geoff Pullum's question "where did you first hear this pattern?", Bert cites a Simpsons episode:

We put the spring in Springfield. (Dancing girls from Springfield's controversial "Maison Derrière")

The rest of Bert's collection:

Individuality is yours alone and we put the YOU in YOUnique.
Deliverease 2001. Where we put the ease in deliveries
We put the Cool in Afterschool
We put the "funk" in DysFUNKtional
We put the k in kwality
We put the sin in Cinema (or in business, or in Wisconsin)
We put the "OH" in "Ohio"!
We put the ass in Massachussetts
We put the sex in Sussex (or in Essex)
Welcome to California, where we put the mock in democracy
"Springfield Christian Academy: Where we put the FUN in FUNDAMENTALIST DOGMA!"
We put the fun in fundraising.
We put the fun in funeral.

Geoff observed that the Wall Street Journal corpus seems to be devoid of examples of this particular pattern, but the internet (being larger and less staid) yields quite a few. Various groups, for instance, claim to put the fun in:

dysfunctional, fundamental science, fundamentalist extremism, fungus, funky, phonetics, phonology, profundity, unfunded, Funchal, your function

Others put the fu in fun, the dumb in fandom, the ho in holiday, the eek in geek, and even the A in "hoasting." The only relevant thing that gets put in blog is "blah" -- come on, guys, what about "lo" and "log"?

[Update: John Kozak emails:

There must be vast numbers of these. The first I know of is an old (1940s or earlier?) advertising slogan from a tea vendor:
"Typhoo puts the T in Britain"

Others:

"X puts the mental in fundamentalist" (which needs the cockney sense of "mental" (psychotic) to work)
"X puts the scatology in eschatology" (1970s Cambridge student mag)
"X Jones puts the jones in cojones"

In the spirit of "On the shoulders of Giants", I wonder what the earliest use of this trope was? Perhaps Uncle Jazzbeau can track down a case in Plautus...]

Posted by Mark Liberman at 09:29 AM

March 18, 2004

The backpack of it all

In a series of posts (here, here, here, here, here, here) we've discussed phrasal templates such as "X is the new Y" and "I for one welcome our new X overlords", for which Glen Whitman proposed the term snowclone. While exploring a strange usage of the verb worry (here and here), I stumbled on a template that is (I think) unique in being composed entirely of closed-class words: "The X of it all". An article, a preposition, a slot for a singular noun, a pronoun, and a quantifier. A template without any specific noun, verb, adjective or adverb.

The pattern "The * of it all" gets 714,000 ghits (Google hits), and looking through the first few pages, we find that X can be quite a few different things:

absurdity, antiquity, beauty, beginning, bleakness, bottom, burden, centre/center, core, cost, crux, drama, end, economics, ethics, fatness, finality, fun, futility, gall, heart, horror, hub, hypocrisy, insanity, irony, key, love, madness, magic, meaning, midst, miracle, niceness, object, physics, pity, point, queerness, reality, rest, root, shame, sociality, sound, source, speed, splendor, start, stigma, stupidity, sum, thrill, weirdness, why, wonder

and so on -- there must be thousands of different substitutions within the 714,000 examples. Of course, not every kind of singular noun is equally likely to occur: outside of this template, backpack is about twice as frequent as broccoli, which in turn is about twice as frequent as crux, but "the crux of it all" gets 608 ghits, while "the broccoli of it all" gets 1, and "the backpack of it all" gets none. This follows from the meaning of the template, of course -- whatever exactly that is.

We can push the pattern further: "the * and * of it all" gets 16,300 ghits, including on the first few pages the substitutions:

action and challenge, audio and transcripts, bright and dark, chills and thrills, creativity and spontaneity, good and bad, groans and agony, heart and center, heart and soul, heart and source, highs and lows, immensity and beauty, ins and outs, joy and beauty, joy and glory, joy and malice, length and brutality, long and short, magic and wonder, merits and thruths, mess and brilliance, misery and pointlessness, nuts and bolts, prose and poetry, rhyme and rhythm, rise and fall, shock and awe, short and long, sound and fury, stress and tragedy, strong and weak, structure and symmetry, ups and downs, vagueness and remoteness, wisdom and glory, wonder and amazement

Again, there must be several thousand distinct strings in this set, and this is one of many patterns where there would be something to be learned about synonyms, antonyms and word associations by compiling the list.

There are only 336 ghits for "the * and * and * of it all":

lights and sounds and mystery, rush and excitement and newness, fun and thrill and joy, voices and emotion and spectacle, fun and excitement and newness, drama and sadness and heartache, etc.

Fans of Andrei Andreyevich Markov will be amused to note that 714000/16300 is approximately equal to 16300/336 (43.8 vs. 48.5).
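The arithmetic is easy to check. Here is a minimal sketch in Python that uses only the three hit counts quoted above; the variable names are mine, and the counts themselves are just Google's estimates on the day I looked.

```python
# Hit counts quoted above for one, two and three coordinated nouns
# in "the ... of it all".
ghits = [714_000, 16_300, 336]

# Ratio between each count and the next one down.
ratios = [a / b for a, b in zip(ghits, ghits[1:])]

for r in ratios:
    print(f"{r:.1f}")   # prints 43.8, then 48.5
```

A roughly constant ratio is what you would expect if each extra "and X" were added with about the same probability, whatever came before it, though three data points are hardly enough to say so with confidence.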

As evidence that "the X of it all" is a kind of templatic phrase, consider how relatively rare it is to make any changes in the pattern (even though all the variants are grammatical and interpretable):

the point of it all	10,600
a point of it all	26
her point of it all	0
the points of it all	10
the point of some of it	36
no point of it all	2
the beauty of it all	12,200
a beauty of it all	0
her beauty of it all	0
the beauties of it all	2
the beauty of some of it	0
no beauty of it all	0

Compare "the account of it all" (7), "an account of it all" (127), which is not the same thing at all.

Construction grammar, yo. (And have you noticed that idiom creeping into general usage?)

Unless someone can provide a (compositional) pragmatic or semantic account for the facts here?

Posted by Mark Liberman at 11:10 PM

Crouching Tiger, Hidden Dragon, Extra Cat

When I finally got around to seeing Crouching Tiger, Hidden Dragon, long after everybody else saw it, I was struck by one oddity. There's a scene where Governor Yu's daughter is secretly visited by her lover from out West. While the two are talking in her bedchamber, the attendant hears something, approaches the curtains, says that he thought he heard something and asks if everything is all right. What Governor Yu's daughter answers in Chinese is just: "It was nothing.", but the English subtitles translate what she says as: "It was just the cat." In the grand scheme of things it doesn't make much difference, but I keep wondering how the cat got in there. I know of nothing about the English language or culture that calls for the insertion of a cat in this context. The only plausible explanation I've heard so far is that the line was originally "It was just the cat", that the script was subsequently changed, and that the subtitles were based on the earlier version of the script. That seems plausible enough, but then I know nothing about the movie business. Is it likely that the subtitles would be based on a non-final version of the script? If anybody knows the basis for this discrepancy, I'd be interested.

Posted by Bill Poser at 05:56 PM

Neck Reading

According to this news item, NASA is developing a system that recognizes subvocal speech (speech in which articulatory motions are made but no sound is actually produced, as many people do when reading silently) using sensors placed under the chin and on each side of the Adam's apple. The sensors pick up the electrical signals in the nerves that control muscles used in speech. They hope that the system will be useful on space missions and for the handicapped.

Posted by Bill Poser at 11:42 AM

March 17, 2004

No snow

Laputan Logic, informative as always, presents the Inuit legend of Sedna, after whom the newly-discovered planetoid 2003 VB12 has tentatively been named. In the Inuit narrative, there is no snow or ice: "Soon they arrived at an island. Sedna looked around. She could see nothing. No sod hut, no tent, just bare rocks and a cliff."

There's no snow in these two Inuit folk tales either. I wonder whether the widely-presupposed centrality of snow in Inuit culture might be just as exaggerated as the widely-asserted numerousness of their snow words.

Posted by Mark Liberman at 11:44 PM

On beyond ghoti

Q. Pheevr brilliantly imagines George Bernard Psschaughal collecting his Nobel prize. The punch line is particularly good.

Here is an alt.usage.english discussion of the ghoti back-story. And just for fun, Gerard Nolst Trenité's The Chaos.

Posted by Mark Liberman at 07:23 PM

Early Writing Gets Earlier

Another article in yesterday's New York Times, about human sacrifice in ancient Egypt, has an intriguing comment on Egyptian writing:

"In recent years...German archaeologists have...found, among other things, evidence of early forms of hieroglyphs from about 3200 B.C. If that date is correct, this would seem to show an earlier Egyptian writing than anything previously known, putting its origins at about the same time as that of the Mesopotamian cuneiform."

3200 B.C.! As early as the earliest Sumerian writing! And, if true, this is another piece of evidence supporting the claim that the Egyptians invented their writing system on their own, not via stimulus diffusion from the Sumerians.

Posted by Sally Thomason at 08:54 AM

Multilingual in Urumqi

Americans tend to be much impressed with somebody who knows more than one language, provided that that somebody is a native speaker of American English. Bilingual and multilingual Native Americans and foreigners are merely normal and ordinary, because often the second or third or nth language is English, and speaking English is obviously the norm (in the opinion of native-English-speaking Americans). It's easy for us to forget that most of the world is bilingual, and much of it is multilingual. Yesterday's New York Times has a reminder of this linguistic characteristic of the rest of the world, in an article about Urumqi in China.

The reporter quotes Ahat Imam, who "operates a bank of telephones on the sidewalk." "You've got to be able to speak a little bit of a lot of languages in this work," he told the reporter, because Urumqi is multiethnic. "I can speak Uighur, a little Kazakh, a little Uzbek..."

True, these are all closely-related Turkic languages; but Dutch and English are also closely related, and yet English speakers can't understand Dutch without studying it first. Well, O.K., Ahat Imam's three Turkic languages are probably more closely related than English and Dutch -- maybe even as close as Modern English and Chaucer's Middle English. But Modern English speakers also have to study Middle English in order to read Chaucer, so the point still holds. What's ordinary everyday life with languages to Ahat Imam is exotic to Americans, who live in what may be the only country in the world where most people can afford to be monolingual. For those of us who think that learning languages is a deeply enriching activity, this seems a shame.

Posted by Sally Thomason at 08:51 AM

March 16, 2004

Problems with pictographs

Spurred by SC's passing observation that "he expects no sympathy from readers of pictographic alphabets", I can't resist linking to Rudyard Kipling's Just So Story about why pictographic writing is Not a Good Idea. "Pictographic alphabet" is an oxymoron of sorts, and for reasons that Rudyard's story explains, the orthographic systems sometimes called "pictographic" (writing pictures) or "ideographic" (writing ideas) are usually in fact "logographic" (writing words), and generally specify words (or morphemes) using some mixture of semantic (meaning-based) and phonological (sound-based) features. See this set of lecture notes on Reading and Writing for more details, if you're interested. As I expect SC will remind us, his monicker stands for "Semantic Compositions" and not "Orthographic Compositions" or "Morphophonemic Compositions." But still.

[Update: Bill Poser suggests that "morphographic" would be a more accurate term than "logographic" for the writing system of Chinese and Japanese, since (where there is a difference) the characters represent morphemes rather than words. This is entirely true, and maybe we should join Bill in trying to get "morphographic" into general usage in place of "logographic". However, we'll have to fight off the biologists, who think that "morphographic" means "pertaining to morphography", which in turn means "the scientific description of external form; descriptive morphology."]

Posted by Mark Liberman at 10:19 PM

Making Email Eight-Bit Safe

While we're on the topic of improvements in the infrastructure for writing systems other than the Roman alphabet, I thought I'd mention that the email system seems to have become eight-bit safe. What that means is that it is safe to include in email messages bytes whose high (most significant) bit is set. The original network architecture assumed that all email would consist of ASCII characters. ASCII character codes range from 0 through 127, which is to say, from 00000000 through 01111111. The eighth or high bit has the value 128 and so is set (has the value 1) only in bytes whose value ranges from 128 through 255. If you sent a byte with the high bit set through the email system, you couldn't be sure what would happen to it.
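To make the high-bit test concrete, here is a small Python sketch. The function names are my own, chosen for illustration; the only thing it relies on is that the high bit of a byte has the value 128 (0x80).

```python
def high_bit_set(byte: int) -> bool:
    """True if the eighth (most significant) bit of the byte is 1."""
    return byte & 0x80 != 0


def is_seven_bit_clean(data: bytes) -> bool:
    """True if every byte is in the ASCII range 0 through 127."""
    return all(not high_bit_set(b) for b in data)


print(is_seven_bit_clean("hello".encode("ascii")))      # True
print(is_seven_bit_clean("caf\u00e9".encode("latin-1")))  # False: e-acute is byte 0xE9
```

Anything for which the second function returns False is the kind of data that the old email system could not be trusted to pass through unchanged.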

As long as all that people wanted to send was ASCII text, this was all right, but soon people began to want to send data other than text, such as images and programs. In order to get such non-textual data through the email system, it had to be encoded in such a way that what was sent over the network consisted entirely of bytes with values between 0 and 127. The most common way of doing this was by means of uuencoding. uuencode originally stood for Unix-to-Unix encode, but the same encoding technique soon spread to non-Unix systems. Uuencoding distributes the information in three eight-bit bytes (24 bits) over four bytes all of which have their seventh and eighth bits unset. Essentially, abcdefgh ijklmnop qrstuvwx is reencoded as 00abcdef 00ghijkl 00mnopqr 00stuvwx, where each letter of the alphabet represents a 0 or a 1. On Macintoshen, the program that performs the same function is called binhex. In both cases binary data was encoded, sent over the network, and then decoded back into its original form. When you send a file as an attachment, the same sort of encoding and decoding is performed.
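The three-into-four regrouping can be sketched in a few lines of Python. This shows only the bit shuffling described above, not a full uuencode: the real program also adds an offset of 32 to each six-bit value so that the result is printable, and wraps the output in lines with length prefixes. The function names are hypothetical.

```python
def regroup_3_to_4(chunk: bytes) -> bytes:
    """Spread 24 bits over four bytes whose top two bits are zero."""
    if len(chunk) != 3:
        raise ValueError("expected exactly three bytes")
    n = int.from_bytes(chunk, "big")            # abcdefgh ijklmnop qrstuvwx
    # Peel off four six-bit groups, most significant first.
    return bytes((n >> shift) & 0x3F for shift in (18, 12, 6, 0))


def ungroup_4_to_3(chunk: bytes) -> bytes:
    """Inverse of regroup_3_to_4: recover the original three bytes."""
    if len(chunk) != 4:
        raise ValueError("expected exactly four bytes")
    n = 0
    for b in chunk:
        n = (n << 6) | (b & 0x3F)
    return n.to_bytes(3, "big")


packed = regroup_3_to_4(b"\xff\x00\xff")
print([f"{b:08b}" for b in packed])
# ['00111111', '00110000', '00000011', '00111111']
print(ungroup_4_to_3(packed) == b"\xff\x00\xff")   # True
```

Every output byte is at most 63, so none of them can have its high bit set, which is the whole point of the exercise.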

This system wasn't too inconvenient as long as people were willing to write in plain ASCII, but it becomes tedious if you want to write in a writing system whose encoding requires more than 128 codepoints, such as one of the ISO-8859 encodings for European languages, Arabic, and Hebrew, or in Unicode, since many bytes in Unicode have their high bit set. For example, the UTF-8 Unicode encoding of the Korean word 한글 [hangul] (the name of the Korean alphabet) looks like this when written out in binary:

11100001	10000100	10010010
11100001	10000101	10100001
11100001	10000110	10101011
11100001	10000100	10000000
11100001	10000101	10110011
11100001	10000110	10101111

Every byte has its high bit set.
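If you want to reproduce the table, here is a Python sketch. Note one assumption: the six three-byte rows above correspond to the word spelled out as six individual jamo (U+1112, U+1161, U+11AB, U+1100, U+1173, U+11AF), so the sketch first decomposes the text with NFD normalization in case it was typed as two precomposed syllables.

```python
import unicodedata

word = "한글"

# Decompose any precomposed syllables into their component jamo.
jamo = unicodedata.normalize("NFD", word)
encoded = jamo.encode("utf-8")

# Each jamo in this range takes three bytes in UTF-8.
rows = [encoded[i:i + 3] for i in range(0, len(encoded), 3)]
for row in rows:
    print("\t".join(f"{b:08b}" for b in row))

# Check the claim that every byte has its high bit set.
print(all(b & 0x80 for b in encoded))
```

Without the normalization step, precomposed text would encode to six bytes rather than eighteen, though every one of them would still have its high bit set.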

A lot of software still isn't eight-bit safe. One such program is Movable Type, the software that runs Language Log. You can enter UTF-8, for example, but your post will be truncated at the first byte with its high bit set. In order to get non-ASCII characters in, therefore, you have to use HTML numeric character entities, which represent Unicode codepoints using ASCII characters. For instance, the phonetic symbol ʤ has the Unicode code 0x02A4, which comes out as 0xCA 0xA4 in UTF-8. But it won't work to enter these two bytes directly into Movable Type since both have their high bits set. So instead, I enter the eight-byte sequence &#x02A4;. Each of these eight characters is an ASCII character, so Movable Type is happy, but your browser knows that such a sequence is to be interpreted as "the Unicode character whose hexadecimal value is 02A4", and if you have a suitable font installed, displays the correct character.
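The conversion is mechanical, so it is easy to automate. Here is a minimal Python sketch; the function name is hypothetical, and the four-digit zero-padded hexadecimal form is just the convention I happen to use (browsers accept other widths and decimal references too).

```python
import html


def to_numeric_entities(text: str) -> str:
    """Replace every non-ASCII character with an HTML hex character reference."""
    return "".join(
        ch if ord(ch) < 128 else f"&#x{ord(ch):04X};"
        for ch in text
    )


symbol = "\u02a4"                      # the phonetic symbol discussed above
print(symbol.encode("utf-8").hex())    # caa4: two bytes, both with the high bit set
entity = to_numeric_entities(symbol)
print(entity, len(entity))             # the reference for U+02A4, eight ASCII characters
print(html.unescape(entity) == symbol) # True: a browser-style decoder recovers the symbol
```

The output of the function is pure ASCII, so it survives software that is not eight-bit safe.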

The email system, however, seems now to be eight-bit safe. A little while back, I tried sending myself unencoded email in Unicode, and somewhat to my surprise, it worked. Since then, I've corresponded successfully in Korean with a friend at Rutgers. There may still be parts of the network that are not eight-bit safe, but it looks like things are looking up.

Posted by Bill Poser at 09:35 PM

Panting up the Hill of Significance

Steve at Language Hat writes today:

"That poor meme is panting and sweating, but it just can't drag the load up the Hill of Significance."

A nice way to describe the widespread misapprehension that a culture's degree of interest in a topic can be accurately measured by counting the number of words it has for relevant concepts. Steve is critiquing a New York Times Op-Ed piece by Simon Montefiore on Russia, which applies this idea to the concepts of "law" and "personal connections". As Steve shows, the vocabulary metric indicates that "Russia must be twice as legal a culture as any English-speaking one", and that in the "personal connections" area the two cultures are more or less in a dead heat, thus refuting Montefiore's thoughtless little sentence: "There are few words in Russian for the Western concept of 'law,' but there are legions of words for connections, helping people from one's neck of the woods."

One reason for this meme's resilience is that it's based on a valid generalization: people develop the concepts that their lives and their livelihoods require, and find or invent words or word-senses or fixed phrases to express those concepts. As a result, for example, the Carrier have lots of beaver words, and the Somali have lots of camel words, and not vice versa. This is neither arbitrary nor surprising.

Why is the Eskimo snow-words meme so often wrong, then? Because it's usually a cheap rhetorical device, masking facile cultural stereotyping and little or no actual linguistic analysis.

As an expert in Soviet history, Montefiore doubtless knows that Russians have plenty of experience with legal codes, both civil and ecclesiastical; and as a writer with six or seven books from American and British publishers, I bet he knows that English speakers have plenty of experience with social networking. There may well be differences between the cultures in these areas -- there are certainly profound historical differences in the civil societies -- but it's naive to think that these differences will show up in word counts, or even in the counts of culturally identified concepts. Montefiore wants to draw attention to the role of personal networks in Putin's (and Stalin's) Russia. Fair enough, but it's one thing to assert that personal power and patronage play a bigger role in Russia than in (say) France, and a completely different thing to believe that Russians make a larger number of conceptual and lexical distinctions among types of personal and social influence. It might be true -- though I doubt it -- but some evidence should be required, not just a facile assertion that relies on the reader being too ignorant or too uninterested to check.

And evidence is just what we almost never get. Such claims are usually asserted without any careful comparative linguistic analysis, and often without any linguistic analysis at all. The infamous Eskimo snow words meme is spread by people who don't know about Inuit or any related language. Montefiore doubtless knows quite a few Russian words for the form, contents and use of social networks, but he clearly has not tried to compare this part of the Russian lexicon systematically -- even by the blunt instrument of counts -- to the comparable vocabulary of other languages. His rhetoric needed a piece of evidence about the role of relationships vs. laws in Russian politics, and this assertion about numbers of words is brief and clear. And probably wrong, at least in its implicit comparison to languages such as English and French, but you can't have everything...

Posted by Mark Liberman at 09:00 PM

Alphabets, somewhat unscrambled

Pango-1.4.0 has been released, along with Glib-2.4.0 and GTK+-2.4.0. This is good news -- or at least a step towards good news -- for those who need or want to display, enter, edit, search or browse in a wide variety of orthographies, especially those that require complex rendering. See the links in this post for details on why the issues are far from trivial. Pango seemed to have been stalled for a long time, so I'm really happy to see forward motion.

This part is especially good news:

Bidirectional editing and interface flipping improvements

GTK+ now automatically determines the base direction for label and text-entry widgets based on their contents, rather than requiring it to be specified by the application; this gives a much better user experience when editing mixed right-to-left and left-to-right text. Support for user-interface mirroring in right-to-left locales has now been extended to cover virtually all widgets.

Unicode 4.0 is also now supported. There's still a ways to go before (for instance) the text widgets in standard scripting languages support entry and editing in arbitrary orthographies. But this looks like a step forward.
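To give a feel for what "determining the base direction from the contents" can mean, here is a sketch of one common heuristic: take the direction of the first strongly directional character. This is an illustration in Python, not GTK+ or Pango code, and I am not claiming it matches what those libraries do in every detail; the function name is my own.

```python
import unicodedata


def guess_base_direction(text: str):
    """Return 'ltr' or 'rtl' from the first strong character, or None if there is none."""
    for ch in text:
        bidi = unicodedata.bidirectional(ch)
        if bidi == "L":                 # strong left-to-right (e.g. Latin letters)
            return "ltr"
        if bidi in ("R", "AL"):         # strong right-to-left (Hebrew, Arabic letters)
            return "rtl"
    return None                         # only digits, spaces, punctuation, etc.


print(guess_base_direction("hello \u05e2\u05d5\u05dc\u05dd"))   # ltr
print(guess_base_direction("\u05e9\u05dc\u05d5\u05dd world"))   # rtl
print(guess_base_direction("123 !?"))                           # None
```

When no strong character is found, an application would presumably fall back to the direction of the surrounding widget or locale.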

Posted by Mark Liberman at 07:39 PM

Arabic writing innovations

Bill Poser posted an interesting note on a modified Arabic alphabet that is easier to render than traditional Arabic script. For the idealistic among us, another very interesting alphabet variant has been proposed by Nizar Habash: his Semitish alphabet is a cross between the Arabic and Hebrew alphabets, designed to be (with a little practice) intelligible to readers of both.

See also his Palisra page for artistic imaginings of a land not torn apart by nationalism, as well as his work on platform-independent support for Arabic.

Posted by Philip Resnik at 04:43 PM

Nicholas Wade on Gray and Atkinson

Today's New York Times contains a piece by Nicholas Wade based on the paper by Gray and Atkinson on the dating of the divergence of Indo-European that both I and Mark Liberman commented on a while back. It doesn't add anything new to our previous discussion, so I'll defer further comment on Gray and Atkinson and more generally the problem of subgrouping and dating to another occasion. What I'd like to comment on now is a much larger confusion that pervades the piece.

Wade begins with a comment on how great it would be if we could reconstruct the family tree of all human languages. He then writes:

Yet in the view of many historical linguists, the chances of drawing up such a tree are virtually nil and those who suppose otherwise are chasing a tiresome delusion.

Languages change so fast, the linguists point out, that their genealogies can be traced back only a few thousand years at best before the signal dissolves completely into noise: witness how hard Chaucer is to read just 600 years later.

But the linguists' problem has recently attracted a new group of researchers who are more hopeful of success. They are biologists who have developed sophisticated mathematical tools for drawing up family trees of genes and species. Because the same problems crop up in both gene trees and language trees, the biologists are confident that their tools will work with languages, too.

This shows a fundamental misunderstanding of what is at issue.

There are two aspects to classifying languages. One is showing that they are related at all. We don't know a priori that human languages are all related to each other. In fact, if you include the signed languages, we know for certain that they are not. But for the oral languages, we just don't know. The other aspect consists of determining how languages are related, once you know that they are. This is known as subgrouping since it consists of determining what the subgroups of the language family are, and in turn what the subgroups of the subgroups are, and so forth, ultimately resulting in a family tree. Each branch of the family tree represents the divergence of two or more languages, an event that took place at some time in the past. Ideally, we'd like to be able to assign dates to those events.

The problem that Wade refers to in the passages cited is the first problem, that of establishing that all (oral) languages are related. The mainstream view of historical linguists is that this has yet to be demonstrated and probably never will be, even though it may be true. To see why, let's review what is involved in showing that languages are related. We need to show that the languages exhibit similarities that can only be explained by the hypothesis of common descent. To do this, we need to show:

That the similarities are not attributable to innate linguistic universals;
That the similarities are unlikely to be due to chance;
That the similarities are unlikely to be due to diffusion.

The first point is easy. If there are properties common to all human languages that are due to the way our minds and bodies work or to the way signal channels work, they are explained by a hypothesis other than common descent and therefore provide no evidence for common descent. The second point reflects the fact that it is fairly easy to find meaningless random similarities between languages, especially if you allow yourself to go fishing among a lot of languages. The third point reflects the fact that in addition to innate universals and common descent, languages may be similar because they have borrowed from each other. The fact that English has words such as zen and samurai which closely resemble Japanese words is not evidence that English and Japanese are genetically related because we know that these words are fairly recent loans.
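To put a rough number on the "fishing" problem: if each individual comparison has some small probability of producing a spurious resemblance, the chance of at least one such resemblance grows quickly with the number of comparisons. The sketch below assumes the comparisons are independent, and the 1% figure is purely illustrative, not an estimate for any real pair of languages.

```python
def chance_of_spurious_match(p_single, comparisons):
    """Probability of at least one chance resemblance.

    Assumes `comparisons` independent tries, each with probability
    `p_single` of looking like a match by accident.
    """
    return 1.0 - (1.0 - p_single) ** comparisons

if __name__ == "__main__":
    # Illustrative assumption: a 1% chance per word pair compared.
    for n in (10, 100, 1000):
        print(n, round(chance_of_spurious_match(0.01, n), 3))
```

With a hundred comparisons a chance resemblance is already more likely than not, and with a thousand it is close to certain, which is why a handful of look-alikes picked out of many languages and many words carries so little weight.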

Thus far, claims of very large-scale genetic relationship, such as the Amerind, Indo-Pacific, Eurasiatic, Nostratic, and Vasco-Dene language families, or the even stronger claim that Proto-World has been reconstructed, have not been generally accepted. One reason for this is not, strictly speaking, methodological. A good deal of this work is based on extremely poor data. Some of the leading proponents of such hypotheses have been found to have been incredibly slipshod in their handling of the data. The phonetic form of words cited is often wrong, or the meaning is incorrect. Sometimes the cited words don't come from the language they are supposed to come from. Very frequently, words are given incorrect or unjustifiable morphological analyses. In theory, of course, large-scale comparisons can be done competently, and some are, but a good deal of this kind of work has been invalidated on the grounds of inadequate data handling.

A second reason for skepticism about the cases that have been made thus far is that they don't pass the statistical test. By and large, they involve similarities so few and vague overall that we are not persuaded that they are not attributable to chance. We're also skeptical about the possibility of more persuasive arguments being made in the future because, as Wade mentions, languages change sufficiently fast over time that the "signal" of relationship at great time depth is likely to be very weak and overridden by the "noise". It's important to understand that nobody is claiming that there is an absolute limit. We can't say: "languages change at such a rate that the remotest relationship that could be demonstrated is X thousand years ago. Any claim beyond that is bunk a priori." We're just saying that the only satisfactorily demonstrated relationships go back no more than about 10,000 years, and that since human language probably goes back at least 50,000 years, there is a large gap between the date of Proto-World, if there was such a language, and anything thus far demonstrated. It is possible that by chance some evidence may have survived long enough; if so, we'd love to see it. But until somebody provides convincing evidence of genetic relationship at great time depth, there's no case.

The third reason for skepticism is that, thanks to research over the past twenty years or so (much of it by our own Sally Thomason), we know a lot more about language contact than we used to. In particular, we've learned that massive borrowing does occur, that grammatical structures can be borrowed, and that borrowing of basic vocabulary is more common than we thought. This means that we have to be more concerned than we used to be about the possibility that non-chance similarities between languages are due to borrowing rather than common descent. This problem becomes more severe the farther back we go, both because the total amount of evidence becomes smaller and because the farther back we go the less likely we are to know anything about the external history of the languages, that is, who was in contact with whom and what the nature of the contact situation was. As a result, at great time depth we are in a poor position to distinguish genetic affiliation from diffusion. Of course, borrowing can also skew subgrouping, so our improved knowledge of language contact phenomena poses a problem there too.

The problems that Wade's opening alludes to are the problems of determining whether languages are related at all. The problem addressed by Gray and Atkinson, and by related work, is the other problem, that of subgrouping and dating. Their method presupposes not only that we know that the languages in question are related, but that we have reconstructed the details of that relationship so that we can determine which words in the daughter languages are cognate. So, even if Gray and Atkinson's approach works, it will only provide a new and better means of subgrouping and dating languages known to be related. It won't help in the slightest to demonstrate relationship in the first place.

Posted by Bill Poser at 01:39 PM

Trees of life and language

If you've seen this 3/16/2004 NYT story by Nicholas Wade, "A Biological Dig for the Roots of Language", which is based on the 11/27/2003 Gray and Atkinson article in Nature, "Language Tree Divergence Times Support the Anatolian Theory of Indo-European Origin", you might want to look back at this review by Bill Poser, with follow-ups here, here and here.

If you have a Nature subscription (and most university libraries will offer access to students, faculty and staff through their web site), you can read David Searls's excellent background article from the 11/27 issue here, and the original Gray and Atkinson article here.

Here is a discussion of a much less serious effort to apply computational methods to reconstruct linguistic phylogeny, in an earlier 2003 PNAS paper by Forster and Toth, 'Toward a phylogenetic chronology of ancient Gaulish, Celtic, and Indo-European.'

Posted by Mark Liberman at 12:50 PM

A New Arabic Alphabet?

A while back Mark Liberman commented on the problems of rendering alphabets in which the graphical order is not the same as the phonological order or in which letters take different forms in different contexts. Arabic is one such alphabet.

For instance, here are the forms that the letter ghayn takes in isolation, in initial position, in medial position, and in final position: ﻍ ﻏ ﻐ ﻎ
To see how complicated this is, look at the words to the right. In the first line I've written /is/ so that you can compare it with the word /ism/ "name" in the second line. You might think that adding an /m/ to the end of the word (which is the left edge since Arabic is written right to left), would result in /is/ being a graphical suffix of /ism/. It doesn't, because the final form of /s/ is ﺲ but the medial form is ﺴ. In the third line I've written /al-ism/ "the name", and in the fourth, /al/ "the" by itself. Here again, /al-ism/ is not a simple concatenation of /al/ and /ism/.
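For readers who want to poke at these forms directly: Unicode happens to encode the contextual shapes as separate "presentation form" characters, each with a compatibility mapping back to the abstract letter. The sketch below uses Python's standard unicodedata module to show that the four ghayn shapes above are all the same letter underneath, and likewise the final and medial forms of /s/. The helper name contextual_form is made up for the example, and this says nothing about how any particular rendering engine picks the shapes.

```python
import unicodedata

# The four shapes of ghayn shown above: isolated, initial, medial, final.
GHAYN_FORMS = "\ufecd\ufecf\ufed0\ufece"
# Final and medial shapes of seen (/s/).
SEEN_FINAL, SEEN_MEDIAL = "\ufeb2", "\ufeb4"

def contextual_form(ch):
    """Return (position, base_letter) for an Arabic presentation form.

    Hypothetical helper: reads the compatibility decomposition, which for
    these characters looks like "<final> 063A".
    """
    tag, base = unicodedata.decomposition(ch).split()
    return tag.strip("<>"), chr(int(base, 16))

if __name__ == "__main__":
    for ch in GHAYN_FORMS:
        position, base = contextual_form(ch)
        print(position, unicodedata.name(base))
    # Compatibility normalization folds both shapes of /s/ to one letter.
    same = (unicodedata.normalize("NFKC", SEEN_FINAL)
            == unicodedata.normalize("NFKC", SEEN_MEDIAL))
    print("final and medial seen are the same letter:", same)
```

Text stored as abstract letters leaves the choice of shape to the renderer, which is exactly the work that a one-shape-per-letter alphabet would make unnecessary.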

The New York Times contains a report on a proposed modification of the Arabic alphabet that avoids these difficulties, though for a different reason. Saad D. Abulhab, an Iraqi-American, encountered resistance when he tried to teach his six year old daughter to read and write in Arabic. She did not like the fact that Arabic is written right to left since she had already begun to learn to read English left to right. In response, Mr. Abulhab developed a modified version of the Arabic alphabet in which each letter has a single form and in which letters can be written separately rather than linked together in the usual cursive style. Among other things, this allows it to be written equally well right to left or left to right. His modified alphabet is intended primarily as a transitional system, to make it easier for children like his daughter to learn to read and write in Arabic, but he reports that adults accustomed to the traditional system are able to read his modified alphabet with little difficulty.

This new system should be much easier to render, and is probably easier to learn to read and write, so one can imagine it eventually replacing traditional Arabic writing. But I doubt that it will. Traditional Arabic writing is so much a part of the cultures in which it is used and so tied up with Islamic tradition, that even if the new system has great advantages there will be enormous resistance to change.

Posted by Bill Poser at 02:03 AM

Thesauri, SKOS and terminology variation

There's a lot of activity in Semantic-Web-Land these days. I've been skeptical about the prospects for this work (e.g. here and here), but I'll be happy for any success that these folks manage to achieve, and I try to stay current. Here's a note on some SW doings, which you may find interesting if you're in the same sort of boat that I am.

SKOS-core 1.0 is "an RDF schema for representing thesauri and similar types of knowledge organisation system (KOS)", being developed by SWAD-Europe. Here's the current version of the SKOS-core 1.0 rdf file. Its "sister vocabulary" SKOS-mapping "allows you to assert mappings between concepts from different schemes". This is a picture of the "meta-model" of SKOS, showing two schemata of concepts with a partial mapping between them and labels for some of the concepts:

The SKOS-Core Guide says that

SKOS-Core is intended as a complement to OWL. It does provide a basic framework for building concept schemes, but it does not carry the strictly defined semantics of OWL. Thus it is ideal for representing those types of KOS, such as thesauri, that cannot be mapped directly to an OWL ontology. SKOS is also easier to use, and harder to misuse than OWL, providing an ideal entry point for those wishing to use the Semantic Web for knowledge organisation.

Here's a recent W3C press release about OWL, in case you're not up on current Semantic Web acronyms.

My first point of reference for this stuff is a practical one -- how can I use it in projects that I'm involved with? For biomedical information extraction, terminology and terminology variation is a big issue, and so is connection of referents across different ontologies and ontology-like databases. So the issues that SKOS is addressing are relevant ones for some of the work that I do.

But so far, I'm not convinced that the SWAD-Europe work -- or any of the Semantic Web work -- is engaging these problems in a helpful or realistic way. The only example of terminological variation that I can find so far on their pages is

<rdf:RDF
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:skos="http://www.w3.org/2004/02/skos/core#">

<skos:Concept rdf:about="urn:swad-e:example/concept/0001">
<skos:prefLabel>Bangers and mash</skos:prefLabel>
<skos:altLabel>Sausage and mash</skos:altLabel>
<skos:altLabel>Sausage and mashed potato</skos:altLabel>
<skos:inScheme rdf:resource="urn:swad-e:example/thesaurus"/>
</skos:Concept>

</rdf:RDF>

Of course this is not intended to be anything more than a toy example to show how the system works, but it helps make clear what SKOS is offering: the ability to designate a string as a preferred label for a concept, and a set of strings as alternative labels. However, real-world terminology variation usually looks like something other than just a list of alternative strings. Real-world terms are often complex phrases with free variation among alternatives in several different locations, with variable phrasing and ordering, drawn from a large and apparently open-ended set. Experience suggests that it's hard to get adequate coverage just with a set of strings, even with a quite large set of strings. It's not entirely clear what the best long-term approach will be, but a plausible way to get reasonable performance is to apply a statistical pattern recognition algorithm, trained on a set of examples in context and perhaps provided with a more general model of terminological variation. I'll give a simple example below in support of this view.
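Before turning to that example, here is roughly what "a preferred label plus a set of alternative labels" buys you in practice. The sketch parses a well-formed copy of the toy concept with Python's standard ElementTree and builds a lookup table from label strings to concept URIs. The function name label_index and the case-folding are choices made for this illustration, not anything SKOS prescribes.

```python
import xml.etree.ElementTree as ET

NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "skos": "http://www.w3.org/2004/02/skos/core#",
}

# A well-formed copy of the toy thesaurus entry.
RDF_XML = """<rdf:RDF
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:skos="http://www.w3.org/2004/02/skos/core#">
  <skos:Concept rdf:about="urn:swad-e:example/concept/0001">
    <skos:prefLabel>Bangers and mash</skos:prefLabel>
    <skos:altLabel>Sausage and mash</skos:altLabel>
    <skos:altLabel>Sausage and mashed potato</skos:altLabel>
    <skos:inScheme rdf:resource="urn:swad-e:example/thesaurus"/>
  </skos:Concept>
</rdf:RDF>"""

def label_index(rdf_xml):
    """Map each lower-cased label string to the URI of its concept."""
    index = {}
    root = ET.fromstring(rdf_xml)
    for concept in root.findall("skos:Concept", NS):
        uri = concept.get("{%s}about" % NS["rdf"])
        for tag in ("skos:prefLabel", "skos:altLabel"):
            for label in concept.findall(tag, NS):
                index[label.text.strip().lower()] = uri
    return index

if __name__ == "__main__":
    index = label_index(RDF_XML)
    for phrase in ("sausage and mash", "bangers & mash", "sausages with mash"):
        print(phrase, "->", index.get(phrase, "no match"))
```

The listed strings are found and everything else is not, however close it may be, which is the limitation at issue in what follows.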

Note that this leaves aside several more difficult questions: the relationships among referents vs. the structure of the ontology, the problems of metonymy and synecdoche, elliptical variants of terms, etc. I'm talking about the easy case where there is a single well-defined referent and a bunch of strings that are clear and complete references to it. We can find some (relatively easy) examples of this kind by scanning the MEDLINE corpus for examples of explicitly defined acronyms. A typical source sentence containing three explicitly-defined acronyms:

Luteinizing hormone/chorionic gonadotropin (LH/CG) receptor complementary DNA (cDNA) isoforms were amplified using pseudopregnant rat ovarian total RNA as a template and the primers reaching over the coding regions at both ends in a reverse transcriptase-polymerase chain reaction (RT-PCR).

These should be cases where terminology is "on its best behavior", so to speak.
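As a rough illustration of how such explicitly defined acronyms can be pulled out of running text, here is a small sketch loosely modeled on the character-matching heuristic of Schwartz and Hearst's abbreviation finder: walk backwards through the acronym, matching each character against the words before the parenthesis, and insist that the first character start a word. This is not necessarily the method behind the counts in this post; the function names and the window size are assumptions made for the example.

```python
import re

# A parenthesized short form such as (LH/CG), (cDNA) or (RT-PCR).
PAREN = re.compile(r"\(([A-Za-z][A-Za-z0-9/-]{1,9})\)")

def best_long_form(short, candidate):
    """Match short-form characters right to left inside the candidate text."""
    s, l = len(short) - 1, len(candidate) - 1
    while s >= 0:
        ch = short[s].lower()
        if ch.isalnum():
            # Skip left until the character matches; the first character of
            # the short form must also sit at the start of a word.
            while (l >= 0 and candidate[l].lower() != ch) or (
                s == 0 and l > 0 and candidate[l - 1].isalnum()
            ):
                l -= 1
            if l < 0:
                return None
            l -= 1
        s -= 1
    start = candidate.rfind(" ", 0, l + 1) + 1
    return candidate[start:]

def find_acronym_definitions(sentence):
    """Return (acronym, defining string) pairs found in one sentence."""
    pairs = []
    for match in PAREN.finditer(sentence):
        short = match.group(1)
        words = sentence[:match.start()].split()
        # Assumed search window: a few more words than the acronym has characters.
        window = min(len(short) + 5, len(short) * 2)
        long_form = best_long_form(short, " ".join(words[-window:]))
        if long_form:
            pairs.append((short, long_form))
    return pairs

if __name__ == "__main__":
    sentence = (
        "Luteinizing hormone/chorionic gonadotropin (LH/CG) receptor "
        "complementary DNA (cDNA) isoforms were amplified using pseudopregnant "
        "rat ovarian total RNA as a template and the primers reaching over the "
        "coding regions at both ends in a reverse transcriptase-polymerase "
        "chain reaction (RT-PCR)."
    )
    for short, long_form in find_acronym_definitions(sentence):
        print(short, "=", long_form)
```

Counting the defining strings that come back for each acronym over a whole corpus is then just a matter of feeding the pairs into a histogram.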

It's easy to recognize these patterns and map the acronyms onto the corresponding strings. When we look at the resulting sets of defining strings for a given acronym, they turn out to be remarkably diverse. The top of the histogram for definitions of RT-PCR is given below, with each example preceded by the count of occurrences (in our slightly-out-of-date local copy of MEDLINE). I haven't folded case or eliminated hyphens, but with or without such normalization, there are a lot of variants. This is just the head of the list -- there are several times as many more to come, though with lower counts -- and one suspects that there are other variants in principle "out there", that didn't happen to come up in MEDLINE's billion words.

2191 reverse transcription-polymerase chain reaction
1627 reverse transcriptase-polymerase chain reaction
731 reverse transcription polymerase chain reaction
683 reverse transcriptase polymerase chain reaction
273 Reverse transcription-polymerase chain reaction
216 reverse-transcriptase polymerase chain reaction
211 reverse-transcription polymerase chain reaction
178 Reverse transcriptase-polymerase chain reaction
159 reverse transcription-PCR
123 reverse transcription and polymerase chain reaction
84 Reverse transcriptase polymerase chain reaction
80 Reverse transcription polymerase chain reaction
76 reverse transcriptase PCR
56 reverse transcription PCR
56 reverse transcriptase-PCR
26 Reverse transcription-PCR
25 reverse transcription followed by polymerase chain reaction
24 Reverse-transcription polymerase chain reaction
18 reverse-transcription-polymerase chain reaction
18 reverse transcription and the polymerase chain reaction
17 Reverse-transcriptase polymerase chain reaction
15 reverse-transcriptase-polymerase chain reaction
15 reverse-transcribed polymerase chain reaction
15 reverse transcribed polymerase chain reaction
13 Reverse transcriptase-PCR
11 Reverse transcriptase PCR
11 reverse transcribed-polymerase chain reaction
10 reverse transcription coupled to polymerase chain reaction
9 reverse transcription-polymerase chain reactions
9 Reverse Transcription-Polymerase Chain Reaction
9 Reverse transcription PCR
9 reverse-transcription PCR
9 Reverse transcription and polymerase chain reaction
9 reverse-transcriptase PCR
9 reverse transcriptase-linked polymerase chain reaction
8 reverse transcriptional polymerase chain reaction
8 reverse transcriptase-polymerase chain reaction

Let me make it clear that these particular instances are not problematic, since a constant acronym RT-PCR is given adjacent to them. We're interested in the problem of how to recognize instances of the "same" term in general, and this list just represents a convenient way to get a lot of examples of alternate complete renditions of a well-defined term in a well-controlled context. I'm not trying to insist that no finite list could possibly cover such cases adequately. However, a complete enough list would be quite long and quite hard to compile -- and probably the easiest way to compile it would be to use a generative model for terminological variation, effectively equivalent to the pattern-recognition approach that I've suggested as an alternative.
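To make both halves of that concrete, the sketch below takes nine of the strings from the list, folds case and hyphens, and then checks the survivors against a small hand-written pattern. The pattern is only a crude stand-in for a model of terminological variation (a trained statistical model is what I actually have in mind), and everything about it, including the choice of connectives, is an assumption made for illustration.

```python
import re
from collections import Counter

# (count, string) pairs copied from the list above.
VARIANTS = [
    (2191, "reverse transcription-polymerase chain reaction"),
    (1627, "reverse transcriptase-polymerase chain reaction"),
    (731, "reverse transcription polymerase chain reaction"),
    (273, "Reverse transcription-polymerase chain reaction"),
    (159, "reverse transcription-PCR"),
    (123, "reverse transcription and polymerase chain reaction"),
    (25, "reverse transcription followed by polymerase chain reaction"),
    (15, "reverse-transcribed polymerase chain reaction"),
    (8, "reverse transcriptional polymerase chain reaction"),
]

def normalize(term):
    """Fold case and treat hyphens like spaces."""
    return re.sub(r"[-\s]+", " ", term.lower()).strip()

# A tiny hand-written description of the variation (illustrative only).
PATTERN = re.compile(
    r"reverse transcri(?:ption|ptase|bed|ptional)"
    r"(?: (?:and(?: the)?|followed by|coupled to|linked))?"
    r" (?:polymerase chain reactions?|pcr)"
)

folded = Counter()
for count, term in VARIANTS:
    folded[normalize(term)] += count

if __name__ == "__main__":
    print(len(VARIANTS), "raw strings,", len(folded), "after normalization")
    for term, count in folded.most_common():
        covered = "covered" if PATTERN.fullmatch(term) else "missed"
        print(count, term, "[" + covered + "]")
```

Normalization merges only a few of the nine strings, whereas a single compact pattern covers all of them along with many combinations not listed at all. That is the sense in which a generative description is the easier thing to write down.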

There's nothing wrong with providing a standard XML method for giving a thesaurus list of alternative strings for an item in an ontology. However, I think it's naive to suppose that this will go very far towards solving the problem of recognizing "entity mentions" in texts and connecting them to standard referents, even in the simplest and most straightforward cases such as the one described above.

It's fair to respond that the authors of SKOS are trying to solve a different problem, namely how to let people who are putting explicit semantics in their web documents do so in a way that allows for variable concept labels and partly-related alternative conceptual schemata. Fine -- but some people may think that this will help to represent the content of the ordinary-language documents that ordinary folk write, especially when the documents are scientific or technical in character. But it won't.

Posted by Mark Liberman at 12:23 AM

March 15, 2004

Dihydrogen Monoxide

According to this AP report, the city of Aliso Viejo, California nearly banned styrofoam cups on the grounds that dihydrogen monoxide, which is used in their manufacture, is a substance that could "threaten human health and safety". The gaffe was apparently the result of a paralegal being misled by a prank website that described dihydrogen monoxide as a tasteless, odorless chemical that can be fatal if accidentally inhaled. The description is perfectly true but misleading: dihydrogen monoxide is water.

At first I thought that this was just another depressing example of how little most people seem to learn, or retain, about science, but on reflection, I think it is more interesting. The fact that "dihydrogen monoxide" refers to water is not a fact that they should have learned in science classes because scientists don't normally refer to water that way. The problem lies in their ignorance of the system by which chemists refer to chemicals.

Every compound has a chemical formula and scientific name. The name is generated by a set of rules from the chemical formula. A compound may also have a common name. In the case of salt, for example, we have the chemical formula NaCl, the scientific name sodium chloride, and the common name salt.

I suspect that the problem is that most people know that scientists have exotic names for chemical compounds but that a lot don't understand the difference between chemical names and chemical formulae and that the names are systematically related to the formulae. The result is that, on hearing "dihydrogen monoxide" they don't even try to translate it into a chemical formula because they don't realize that there is a procedure for such translations, or if they do know, they don't remember how to do it. Furthermore, it doesn't occur to them that "dihydrogen monoxide" might be water because they think they already know the scientific name for water: H₂O, which they don't realize is a formula, not a name. So on encountering an unfamiliar term like "dihydrogen monoxide", they assume that it must be an unfamiliar chemical.
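The translation procedure is simple enough to sketch in a few lines, at least for two-element compounds named with the Greek counting prefixes. The tables below cover only a handful of elements and the function names are made up for the illustration; ionic names like sodium chloride, which carry no prefixes, follow different rules and are not handled.

```python
# Counting prefixes used in systematic names of binary covalent compounds.
PREFIXES = {"mono": 1, "di": 2, "tri": 3, "tetra": 4, "penta": 5,
            "hexa": 6, "hepta": 7, "octa": 8, "nona": 9, "deca": 10}

# A deliberately small table of element words and their symbols.
ELEMENTS = {"hydrogen": "H", "carbon": "C", "nitrogen": "N", "sulfur": "S",
            "phosphorus": "P", "oxide": "O", "chloride": "Cl",
            "fluoride": "F", "sulfide": "S"}

def parse_word(word):
    """Split one word of a name into (count, element symbol)."""
    if word in ELEMENTS:
        return 1, ELEMENTS[word]
    for prefix, count in PREFIXES.items():
        stems = [prefix]
        if prefix[-1] in "ao":
            # The final vowel is dropped before "oxide": mon-oxide, pent-oxide.
            stems.append(prefix[:-1])
        for stem in stems:
            rest = word[len(stem):]
            if word.startswith(stem) and rest in ELEMENTS:
                return count, ELEMENTS[rest]
    raise ValueError("cannot parse %r" % word)

def name_to_formula(name):
    """Translate e.g. 'dihydrogen monoxide' into 'H2O' (plain-ASCII digits)."""
    parts = []
    for word in name.lower().split():
        count, symbol = parse_word(word)
        parts.append(symbol + (str(count) if count > 1 else ""))
    return "".join(parts)

if __name__ == "__main__":
    for name in ("dihydrogen monoxide", "carbon dioxide",
                 "dinitrogen pentoxide", "carbon tetrachloride"):
        print(name, "->", name_to_formula(name))
```

Anyone who knew that such a procedure existed would get from "dihydrogen monoxide" to two hydrogens and one oxygen in a moment, and from there to water.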

Posted by Bill Poser at 11:46 PM

Clichés, stereotypes and other obsolete metaphors

Having found Tom Mangan's excellent "collection of reviled news media cliches", it occurred to me to wonder about the word cliché itself. It turns out to be a metaphorical extension of a technical term from an obsolete printing method known as stereotyping.

The first sense of cliché in the OED is:

1. The French name for a stereotype block; a cast or ‘dab’; applied esp. to a metal stereotype of a wood-engraving used to print from.
Originally, a cast obtained by letting a matrix fall face downward upon a surface of molten metal on the point of cooling, called in English type-foundries ‘dabbing’.

And stereotype in turn, in its literal sense, is both a relatively recent invention and an obsolete one:

1. The method or process of printing in which a solid plate or type-metal, cast from a papier-mâché or plaster mould taken from the surface of a forme of type, is used for printing from instead of the forme itself.

1798 Ann. Reg. Chron. 22 The celebrated Didot, the French printer, with a German, named Herman, have announced a new discovery in printing, which they term stereotype.

The OED's second sense of cliché -- equally obsolete -- is a photographic negative. It's not until the third sense that we get to the only use that's current:

3. a. fig. A stereotyped expression, a commonplace phrase; also, a stereotyped character, style, etc.

The earliest uses of this sense are from the 1890s:

1892 A. LANG in Longman's Mag. Dec. 217 They have the hatred of clichés and commonplace, of the outworn phrase, of clashing consonants. 1895 Westm. Gaz. 19 Apr. 3/2 The farcical American woman who ‘wakes everybody up’ with her bounding vulgarities..is rapidly becoming a cliché, both on the stage and in fiction.

Before 1890, people must have used a range of English words and phrases for this concept: commonplace, hackneyed expression, etc.

I wonder about the history of cliché in French -- when Flaubert wrote his Dictionnaire des Idées Reçues (unfinished at his death in 1880), did he avoid using the term cliché because he thought that idée reçue and locution reçue ("received idea" and "received phrase") were more apt, or because the term cliché wasn't used then in its modern, metaphorical meaning in French either?

Tom Mangan offers a quote from George Orwell as the epigraph for his "collection of reviled news media cliches": "Never use a metaphor, simile, or other figure of speech which you are used to seeing in print." If you really tried to put that advice into effect, you'd find it difficult to write anything at all. Most of the common meanings of most of the words and phrases we use are metaphorical in origin -- including cliché and stereotype. I prefer the way that Geoff Pullum put it, in a Language Log post that Mangan also quotes:

Dennis, I want to make a suggestion to you about your use of hackneyed phrases in kit form to launch articles, and it's this: get a life. Think up some novel stuff. Don't be an indolent hack, use your left brain. Don't just make trips up the well-worn staircase to the attic full of dusty phrasal bric-a-brac that journalists keep returning to time after time after time.

There's no special value in avoiding metaphors and collocations that have been used in print before -- that's very difficult to do, and not especially worthwhile, like writing a novel without using the letter E. The thing to avoid is writing without thinking.

[Update 3/16/2004: John Kozak emailed:

I had a discussion with a typographer once in the 80s; I used the term "hardwired" in its (now) standard metaphorical sense, and got a bit of a ticking-off for polluting the language with these ghastly tech metaphors. Esprit d'escalier later handed me the insight that an awful lot of our contemporary mental toolkit is derived from the technology of printing. Like "matrix" of course - how else to get from a brood cow to a rectangular grid of numbers?
I got the same feeling browsing through Buck's IE dictionary once: a ridiculous fraction of vocabulary seemed to derive from metaphorical extensions of droving.

John is absolutely right that "technology" -- in a sense broad enough to include agriculture and animal husbandry -- has always been a major source of conceptual and lexical metaphors.

With respect to his particular example of matrix, I'm not sure whether the mathematical use comes from the typographer's term or not. The OED gives examples of the meaning "womb or uterus" from 1425; the general extended sense glossed as "A place or medium in which something is originated, produced, or developed; the environment in which a particular activity or process begins; a point of origin and growth" has citations from 1586; the typographer's sense glossed as "a metal block in which a character is stamped or engraved so as to form a mould for casting a type" dates from 1626. The mathematical usage dates only from 1850, and is quoted below. It seems as likely to be a specialization of the general extended sense, as a metaphorical extension of the typographer's use, which seems to have dealt with moulds for individual letters rather than arrays of such moulds.

1850 J. J. SYLVESTER in Philos. Mag. 37 369 We..commence..with an oblong arrangement of terms consisting, suppose, of m lines and n columns. This will not in itself represent a determinant, but is, as it were, a Matrix out of which we may form various systems of determinants by fixing upon a number p, and selecting at will p lines and p columns, the squares corresponding to which may be termed determinants of the pth order.

]

[Update 3/17/2004. John Kozak emailed in response:

I'd mis-remembered the printers' usage of "matrix" - what I was thinking of was (my source here is Hugh Williamson's "Methods of Book Design") a "matrix-case", which is a grid of different "slugs". Interestingly, Williamson seems to say that an individual slug cast from a matrix was also called a matrix, so there's clearly a fair bit of semantic slippage going on here.

I don't know how old matrix-cases are, but the illustration in Williamson looks C19 to me.

I'm way out of my history-of-technology depth here. But the trusty OED does have as one of the meanings of Monotype

3. (With capital initial.) The proprietary name of a composing machine consisting of two units, a keyboard which produces the perforated paper tape used to control the caster, which produces type in individual characters.

with citations suggesting invention about 1893, and use of the term matrix-case in exactly the way that John suggests:

1893 Official Directory World's Columbian Exposition (Chicago) 459/1 Lanston Monotype Machine Co., Washington, D.C. Monotype Machine. 1895 Current Hist. (Buffalo) V. 961 The Lanston Monotype..invented by Tolbert Lanston, of Washington, D.C. marks an important advance in the development of typographical art..both a type-setting and a type-casting machine. [...] 1965 J. MORAN Composition of Reading Matter vi. 65 The Monotype machine consists of two units -- the keyboard and the caster. By operation of the keyboard a paper ribbon is perforated by means of compressed air. The ribbon is fed into a caster which carries a matrix-case. This moves to different positions in accordance with the perforated ribbon and molten metal is pumped into the appropriate matrix. Cast types are ejected singly and assembled in a channel until a line is completed. [...]

And here is a very clear explanation, with pictures. The matrix-case seems to be 16x16. Here is something called The Monotype Chronicles, by Lawrence W. Wallis, which indicates that Lanston was born in 1844, started applying for his patent(s) in 1885, and got them in 1887. So the OED's first citation is about six years late -- though Wallis also indicates that the first press notice wasn't until 1891 -- but Lanston's invention is in any case too late to have played a role in Sylvester's 1850 innovation of the term matrix for a two-dimensional array of numbers.

So unless there was an earlier piece of printing technology involving arrays of elements called "matrix-cases" (or some other name involving matrices), it looks like Sylvester's "as it were, a Matrix" is a reference to the more general meaning, inspired by his view of a matrix as a source of "systems of determinants".]

Posted by Mark Liberman at 09:40 AM

March 14, 2004

It's Usually Nice to be Wanted But...

According to this report, the US Selective Service is drawing up contingency plans for a "targetted draft" of linguists and computer experts. In military usage, "linguist" frequently means someone who knows a foreign language, not someone with expertise on clausal coordination or fear of toes, but in spite of linguists' reputation in some circles for not having a practical knowledge of languages, too many of us would be vulnerable even under this definition.

Here's a helpful hint for the armed forces: they wouldn't have such a shortage if they stopped firing the ones they have who happen to be gay. There's something incongruous about the military complaining of shortages when the headlines read: Military Gay Linguist Firings Escalate. Pentagon whining that allowing openly gay people to serve won't work is plainly nonsense. The armed forces of countries such as Australia and Canada don't discriminate and do not report any problems. The Israel Defense Force, probably the world's best on a per capita basis, has officially allowed openly gay soldiers since 1983; all restrictions (such as exclusion from intelligence positions) were removed in 1993. If the military wants to make a case for drafting people, they'd better first show that they're making full use of volunteers.

Posted by Bill Poser at 06:07 PM

Why worry of it?

In response to my "Don't worry of it" post, several people have written to suggest that -- never mind the New York Times and Dave Farber -- such phrases are just not English. Well, "suggest" is too weak a term. My correspondents have asserted, insisted or maybe even proclaimed that phrases like "I'm worried of what my friends will say" are completely beyond the linguistic pale, and that I was being excessively permissive, laissez-faire and generally pusillanimous when I wrote about "learning" a usage that is so far from being a genuine part of the English language.

Now, as a linguistic libertarian, I believe that we should let people say what they want, even if it's wrong. Of course, I shouldn't put it that way, because I also believe that the vocabulary of morality has no place in a discussion of usage. However, though that's what I believe, what I feel is that the sentence "Don't worry of it" is wrong, wrong, wrong.

To keep my inner prescriptivist at bay, I spent a few minutes thinking about where "to worry of it" or "be worried of it" might come from. I came up with two possibilities: generalization from uses of the noun worry; and analogy to certain other verbs and adjectives expressing propositional attitudes, such as think and be afraid.

In expressing a complement of the noun worry, I have no trouble accepting "worry of X" (X the source of worry), as in these random web examples:

I developed alopecia and my hair started falling out with the worry of it all.

Christopher hit the roof when he discovered that Adriana hadn't told him about her worries of fertility trouble.

In 1801 Beethoven confessed to his friends at Bonn his worry of becoming deaf.

Your pet counsellor will help you fit your ferret with the proper size of ferret harness, so your little guy can get some fresh air without the worry of him escaping.

The LED lights keep the bike happy when at idle and eliminate my worry of draining the battery.

"The worry of it all" is a more-or-less fixed expression with special properties: it's at least awkward to substitute about for of, and other variations on the phrase are also odd, such as "my worry of it all" or "some worries of it all". However, there are plenty of regular cases of nominal worry of.

Another possibility is analogy to other verbs such as think, which can take either of or about. There is a slight difference in meaning, or at least in characteristic usage, between thinking of X and thinking about X -- the second one seems to focus more on the process, or something like that. As evidence, consider the following Google counts:

                 of                 about
been thinking    109,000   21%      404,000   79%
just thought      61,500   80%       15,800   20%

It seems that both constructions are compatible with the different aspectual meanings, but to different degrees. Anyhow, "think of" is very common, and offers another model for "worry of".
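For anyone who wants to check the percentages, here is a minimal sketch of the arithmetic. The counts are the ones in the table above; the names `counts` and `shares` are just mine for illustration.

```python
# Google counts from the table above (hits for "<phrase> of" vs. "<phrase> about").
counts = {
    "been thinking": {"of": 109_000, "about": 404_000},
    "just thought": {"of": 61_500, "about": 15_800},
}


def shares(row):
    """Return each preposition's share of the row total, as a whole percentage."""
    total = sum(row.values())
    return {prep: round(100 * n / total) for prep, n in row.items()}


for phrase, row in counts.items():
    print(phrase, shares(row))
```

Each share is simply the count for one preposition divided by the sum of the two counts for that phrase.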

Verbal and adjectival synonyms for intransitive worry -- such as be anxious, brood, be concerned, dwell, fret, etc. -- usually take complements with about, over or on, in a mixture depending (as usual) on the particular verb or adjective. Expressing a complement with of doesn't appeal to me as an option for any of these words, but the fact is that most of them also seem to be used that way, at least sometimes by some native speakers. The following are all from what seem to be web pages written by native speakers of English:

Mark McLemore admitted being anxious of going to Japan for the season opener.
Leaning back in his sagging wooden chair, Cyrus brooded of better days, better nights, better love.
...now that he owns a digital camera he can photograph the kids all day without being concerned of running out of film.
Without dwelling of failures, mistakes, or past ill feelings, quickly list the most important accomplishments of your life.
Many have fretted of our generation ever accomplishing anything.

All of these seem ungrammatical to me personally.

Some other verbs and adjectives expressing attitudinally-tinged propositional attitudes are like think -- they regularly express their complements with of as well as with other prepositions, though typically with a difference in meaning.

I'm afraid of you / about you.
He's ashamed of it / about it.

Some others don't, for me -- though again, others differ:

I wonder about that / *of that.
She's annoyed about that / *of that.

For example, "I wonder of" has 4,350 ghits, to 82,200 for "I wonder about". Some of these are typos: "I wonder of it would be easier to just go to 17 digits." Others are different constructions entirely, folded in because Google ignores punctuation: "The first thing I wonder, of course, is if he's using."

Others are antique examples of the meaning "marvel at" rather than the modern "ponder over":

I wonder of their being here together. (Shakespeare, Midsummer Night's Dream)
But I wonder of the scope that Xenophon allowes them. (Montaigne's essays, Florio's 1603 translation.)

However, some are modern and apparently genuine uses (though if you look them over, you'll see a remarkable percentage of uses in bad poems ("As I wonder of his everlasting devotion...") or in self-consciously poetic prose):

I wonder of the danger of glorifying or privileging *physis* over *techne* in music.

A quick sample of 50 ghits (from the 10th, 20th, 30th, etc. screenfuls) found 8 genuine "I wonder of" cases; based on this estimated proportion, there were really .16*4350 = about 696, or about 1% of the number of "I wonder about" cases.
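The back-of-the-envelope estimate can be written out as a short sketch. The figures are the ones quoted above; the function name `estimate_genuine` is hypothetical, and a sample of 50 obviously gives only a rough proportion.

```python
def estimate_genuine(sample_size, genuine_in_sample, total_hits):
    """Scale the proportion of genuine cases in a sample up to the full hit count."""
    proportion = genuine_in_sample / sample_size
    return proportion, proportion * total_hits


# 8 genuine cases in a sample of 50, out of 4,350 hits for "I wonder of".
proportion, estimated = estimate_genuine(50, 8, 4350)

# Compare with the 82,200 hits for "I wonder about".
ratio = estimated / 82_200

print(f"proportion genuine: {proportion:.2f}")
print(f"estimated genuine 'I wonder of' hits: {estimated:.0f}")
print(f"as a share of 'I wonder about': {100 * ratio:.1f}%")
```

The last figure comes out a little under one percent, which is where the "about 1%" above comes from.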

I wonder about the dynamics of this situation, both in individual learning and for the language as a whole over time. There is a very small, but definitely existing, proportion of "worry of" in the linguistic meme pool. This is probably due to the influence of nominal worry and other verbs like think, which are apparently enough to give some people a different view of specific verb-preposition pairings than most of us have, not just for worry, but also for similar words, such as be anxious, brood, be concerned, fret. Are these oddball complementations correlated? That is, do people who say "don't worry of it" also say "I was concerned of running out of film"?

On a larger scale, is this the leading edge of a change, or just one of those variable things? I suspect the latter, since it seems that "wonder of" is a change in (long, long) regress rather than one in progress.

In any case, I feel that we should be tolerant of the carriers of this unusual meme. It makes them different, sure enough, but not wicked. Or at least, not any wickeder than the rest of us.

Posted by Mark Liberman at 04:44 PM

March 13, 2004

Pot pourri

I wish I had time to write about these (recent language-related weblog links). Or even time to think about them. But I'm grateful to have had the opportunity to read them, and you may be, too.

At Laputan Logic, The History of Naming. Also Marsh Arabs (one of the most interesting articles I've ever seen on the web).

At Blogalization, The Cult of Arab Spain.

At the discouraging word, Participated by others (not a permalink -- posted March 13, 2004). Exploration of a quote from Disraeli: "The same passion for the works of Cicero has been participated by others."

At phluzein, the Vedic Hymn of Creation.

At Prentiss Riddle: Language, East is South and South is East -- "the French won." And "Learn the original vampire language" -- check out the comments for the best part.

At A Roguish Chrestomathy, Online Shorthand for the Linguistically Conservative (see esp. fwoabw) and Adam Smith among the savages ("in which Mr. Wealth of Nations speculates on the origins of the major lexical and functional categories").

Renee Perlmutter at glosses.net, smouse. An unexpected cousin of schmooze.

Rachel at a tear in the fabric of spacetime, Toda su base es pertenece a nosotros (and here in Japanese). Also, an idea developed with entangledbank and Ryan Gabbard, a linguistic theory based on the video game Power Pete: "instead of subjects attracting to Spec,IP to check features they have to jump there over crocodile-infested waters to pick up golden health globules."

At Tenser said the Tensor, I prefer and I'd rather and "I don't think it means what you think it means".

At Impearls, In praise of the C-word and In praise of the C-word II.

At Uncle Jazzbeau's Gallimaufrey, unable wholly to reject (Augustus de Morgan on flies and elephants).

At cannylinguist, Shodding. Sort of like trodding.

Rosanne at the X-bar, various posts on booger anaphora.

At Language Hat, Copyrighting a Language. (More on this one later...)

At Semantic Compositions, Why SC likes fictional languages now.

Posted by Mark Liberman at 09:52 PM

Beijing, Peking, Peiping and all that

Our recent discussion of the pronunciation of Beijing has raised the question of why there are so many versions of the name for the capital of China. Here's the answer.

The current capital of China has three names that one is likely to encounter: 北京, 北平, and 燕京.

北京: means "northern capital". It is the name by which the city has been known to its inhabitants and to most Chinese speakers for the past six hundred years.
北平: means "northern peace". This name was used by Chiang Kai-Shek's Nationalist Party and its supporters because they denied the legitimacy of the Communist government and therefore did not recognize the city as the capital of China. It goes back to 1368 when native Chinese rule was re-established with the defeat of the Mongols by the Ming dynasty. Initially, the Ming capital was at 南京 Nanjing "southern capital". In 1421 the Ming capital was transferred to 北平, which was then renamed 北京. The Nationalists also had their capital at 南京 until they lost the civil war to the Communists and moved their capital to 台北 Tai Bei (usually spelled Taipei) in Taiwan.
燕京: means "Yan capital". It is an old name, reflecting the time when there was a Marquis of Yan. It is no longer used as an ordinary name for the city, but, outside of China, has a literary, antiquarian flavor. Within China it isn't normally used as the name of the city but it is frequently heard since it is the name of the country's most popular beer. This is the name spelled Yenching and Yenjing in the Harvard-Yenching Institute and in the names of restaurants, such as the Yenching restaurant in Harvard Square.

北京 has had other names at other times. The earliest name known to us is 薊 Ji which it had as early as the Spring and Autumn Period (722-481 B.C.E.). After the Marquis of Yan overthrew the Marquis of Ji and merged the two states, the city was variously known as 燕京 and as 薊. The Jin dynasty named the city 中都 Zhong Du "central capital" when they made it their capital in 1153. When the Mongols captured the city from the Jin in 1215 it ceased to be a capital until 1271 when Kubilai Khan made it his capital. He named it 大都 Da Du "great capital" in Chinese, Khanbalig in Mongol. Marco Polo's Cambaluc is a rendering of the Mongol name. Regrettably, the name used by Peking Man, if any, has not been preserved.

Any of these names can be pronounced in any variety of Chinese. The current official pronunciations are what we usually call Mandarin, the somewhat artificial standard variety of Chinese based on the dialect of the Beijing area. Strictly speaking, the official language is called 普通話 putonghua "common language" in the People's Republic of China and 國語 guoyu "national language" in Taiwan. Overseas Chinese, especially those in Southeast Asia, often refer to the language as 華語 huayu. The term Mandarin originally referred to 官話 guanhua, the language used by officials at the Imperial court. Mandarin is now generally used as the English equivalent of 普通話 and 國語, though a few people would like to reserve "Mandarin" for the larger dialect group to which the standard language belongs so as to avoid ambiguity. The versions of Chinese words that we generally now see, and on which romanizations are based, are Mandarin.

However, Chinese is a very diverse language. In fact, the so-called "dialects" are by other criteria several distinct languages. Someone from Beijing, for example, cannot understand someone from Hong Kong or Shanghai speaking his or her local variety of Chinese. Since European contact with China was concentrated in the great ports of southeastern China, many European versions of Chinese words are based on the Chinese dialects of the southeastern coast, especially Cantonese. Place names, in particular, are often based on the usage of the old British-run Chinese postal system, which was based in Hong Kong. This is the probable source of the "king" of Peking. 北京 is pronounced [pejʧiŋ] in Mandarin but [pakkiŋ] in Cantonese. An alternative theory is that Peking was borrowed into European languages from Mandarin prior to the palatalization of [k] to [ʧ]. Actually, both could be true, in the sense that the unpalatalized form familiar from Southern Chinese dialects like Cantonese may have discouraged Europeans from adopting the later Mandarin pronunciation.

A third factor is how Chinese words are romanized. The main issue here is how the Chinese aspiration distinction is handled. Many varieties of Chinese, including both Mandarin and Cantonese, do not distinguish voiced and voiceless stops and affricates. Instead, they have a distinction in aspiration. If a sound is truly voiced, that means that the vocal folds are vibrating during the sound itself. If there is a substantial lag between the release of the closure of a stop or the end of the frication of an affricate, and the onset of voicing in the vowel, it is said to be aspirated. A sound in which there is no voicing during the sound itself but in which the lag before the onset of voicing in the vowel is short is voiceless but unaspirated.
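The three-way distinction just described can be sketched as a classification by voice onset time: the lag, in milliseconds, between the release of the stop and the start of voicing, with a negative value meaning that voicing begins during the closure itself. The function name and the 30 ms cutoff are illustrative assumptions of mine, not measured boundaries for any particular language.

```python
def classify_stop(vot_ms, aspiration_threshold_ms=30.0):
    """Classify a stop by voice onset time (VOT), in milliseconds.

    vot_ms < 0      : voicing starts before the release, so the stop is voiced.
    short lag       : no voicing during the stop, voicing starts soon after.
    long lag        : a substantial delay before voicing, i.e. aspiration.

    The default threshold is a hypothetical cutoff chosen for illustration.
    """
    if vot_ms < 0:
        return "voiced"
    if vot_ms >= aspiration_threshold_ms:
        return "voiceless aspirated"
    return "voiceless unaspirated"


# Hypothetical sample values: a long lag, a short lag, and voicing lead.
for vot in (70, 5, -80):
    print(vot, classify_stop(vot))
```

Mandarin and Cantonese use only the two voiceless categories; English speakers, whose "voiced" stops often have a short lag word-initially, tend to map those two categories onto their own voiced and voiceless stops.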

You can see the distinction in the following three images, which show the waveforms and spectrograms of the Thai syllables [tʰa], [ta], and [da]. (Thai is a convenient language to illustrate this since unlike English and Chinese it has a three-way contrast. You can find the audio files here.) In the first image I've highlighted the aspiration region. You can see that there is no voicing (which shows up as energy near the bottom of the frequency range) until the onset of the vowel, but there is a long (70 millisecond) noise segment between the release of the stop closure and the onset of voicing. In the second image there is very little aspiration but no voicing during the stop closure. In the third image you can see voicing during the stop closure as well as some higher frequency noise.

Acoustic analysis of the Thai syllable [tʰa]

Acoustic analysis of the Thai syllable [ta]

Acoustic analysis of the Thai syllable [da]

In English, the voiced stops and affricates in fact have little voicing in word-initial position, whereas the voiceless stops are aspirated in syllable-initial position in stressed syllables, so English speakers tend to hear the Chinese aspiration distinction as corresponding to the English voicing distinction. Unaspirated sounds seem to be voiced, aspirated ones voiceless.

One system for romanization of Chinese, the Wade-Giles system, is more accurate from a technical phonetic point of view: it treats all Chinese stops and affricates as voiceless and uses an apostrophe to indicate aspiration. In this system, 北京 is written Peiching and 北平 is written Peip'ing. Peiping is what you get if you use the Wade-Giles system but drop the apostrophe. The other system for romanization of Chinese, which is now official in China, is the Pinyin system. This system writes the stops and affricates in a way that is less technically accurate but more meaningful to English speakers. In Pinyin, 北京 is written Beijing and 北平 is written Beiping. Folk transcriptions of Chinese by English speakers tend to be like Pinyin in this respect.

To summarize, the variation in the form of the name of the capital of China arises from three different sources:

different underlying names
different pronunciations in different dialects of Chinese
different romanizations of the same pronunciation of the same name

The combinations that you are likely to run into are the following:

Beijing: is what you get if you use the Pinyin romanization for the Mandarin pronunciation of the current official name.
Peking: is what you get if you use the old postal system romanization, which was based either on the pronunciation in a Southern dialect or an archaic pronunciation in Mandarin of the current official name.
Peip'ing: is what you get if you use the Wade-Giles romanization of the Mandarin pronunciation of the Nationalist name.
Peiping: is what you get if you use the Wade-Giles romanization of the Mandarin pronunciation of the Nationalist name but drop the apostrophe.
Beiping: is what you get if you use the Pinyin romanization of the Mandarin pronunciation of the Nationalist name.
Yenching: is what you get if you use the Wade-Giles romanization of the Mandarin pronunciation of the old literary name.
Yenjing: is what you get if you use a Pinyin-style folk romanization of the Mandarin pronunciation of the old literary name. This isn't the true Pinyin romanization, which would be Yanjing. I've never seen that outside of China except in scholarly contexts.
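The list above amounts to a small lookup table: pick an underlying name, pick a romanization, and optionally lose the apostrophe. Here is a sketch of it; the forms are the ones given in this post, while the table layout and the helper `drop_apostrophes` are just my own illustration.

```python
# Underlying name -> romanization scheme -> spelling, as listed above.
NAMES = {
    "北京": {"Pinyin": "Beijing", "Wade-Giles": "Peiching", "postal": "Peking"},
    "北平": {"Pinyin": "Beiping", "Wade-Giles": "Peip'ing"},
    "燕京": {"Pinyin": "Yanjing", "Wade-Giles": "Yenching", "folk": "Yenjing"},
}


def drop_apostrophes(form):
    """Remove the Wade-Giles aspiration mark, as careless usage often does."""
    return form.replace("'", "")


for name, spellings in NAMES.items():
    for scheme, form in spellings.items():
        line = f"{name}  {scheme:<10}  {form}"
        if "'" in form:
            line += f"  (often written {drop_apostrophes(form)})"
        print(line)
```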

It's all very simple, really.

Posted by Bill Poser at 04:21 PM

Science Fiction citations for the OED

The most recent OED Newsletter has an article by Jesse Sheidlower on a volunteer, web-based effort to track down the origins of science-fiction vocabulary. There is a web site for the project, including instructions for how to participate, and an interesting graph of the resulting word origins by decade (showing a lexicographic baby boom centered in the 1940s, and a mini-boom, half based on critical vocabulary, centered in the 1980s).

Some examples: empath originated in a 1956 story of that name by J.T. McIntosh; force field has been dated to E.E. Smith's 1931 story "Spacehounds of IPC"; web cast has been dated to 1987 in "The Armageddon Blues" by Daniel Keys Moran. A word whose recent date surprised me is morph (as a verb meaning to change shape), which has been traced only as far back as a 1993 story "Being Human" by Mark Bourne.

This may turn out to be a pilot project for a net-based version of the distributed volunteer lexicography that has always been one of the OED's working methods. However, as Sheidlower writes:

Science fiction has several advantages as a subject for this kind of investigation. The vocabulary is largely self-contained; SF terms tend to occur in SF and nowhere else, while, say, political language can be found anywhere and everywhere. The fans are particularly committed, often have linguistic interests, and are computer literate. They may also be more likely to be able to volunteer time than specialists in more academically oriented fields.

Posted by Mark Liberman at 06:11 AM

Windows, Lindows, and Transitivity

One of the many companies selling a version of Linux, the free/libre open source version of Unix, is Lindows, whose product is aimed at non-technical people who currently use Microsoft Windows. Lindows sells a version that is particularly easy to install and update and that provides an interface that is very similar to that of Microsoft Windows. The name Lindows is a blend of Linux and Windows. Microsoft does not like the competition and has been claiming that Lindows is so similar to Windows that the two are likely to be confused. In December 2001, Microsoft filed suit in the United States for trademark infringement.

Thus far, Microsoft has not been successful in the United States. (A detailed chronology and copies of legal documents are available here.) The biggest sticking point is that Lindows is similar to Windows but unquestionably distinct from Microsoft Windows. In order for Microsoft to prevail, the court must rule that Windows by itself is a valid trademark. However, there is strong evidence that windows is a generic term for windows on a computer. For instance, here is what one of my desktops looked like a short time ago.

As you can see, it contains a bunch of windows, none of which has anything to do with Microsoft since my computers all run GNU/Linux. [The background is one of the new images of the surface of Mars, in case you're wondering. According to this report, the servers that distribute these images run GNU/Linux.] The term window was applied to such windows, which Microsoft did not invent, before Microsoft added them to its operating system.

Microsoft has been more successful outside of the United States. A Dutch court has ruled that Lindows infringes Microsoft's trademark and has ordered Lindows not to use the name Lindows in the Benelux countries. I think that this is a flawed decision, for two reasons. First, no one with half a brain could possibly confuse the two products. If you go to the Lindows web site and the Microsoft Windows home page you'll find very different looking web pages, with different logos. Lindows goes to some length to differentiate itself from Microsoft. Many pages on its site, including the one entitled What is LindowsOS?, end in this statement:

Lindows.com is not endorsed by or affiliated with Microsoft Corporation in any way - in fact, we don't even really like them because they are suing us.

Second, in order to reach the conclusion that windows by itself is a valid trademark, the court claimed that although window is a generic term in English, it is not a generic term in the languages of the Benelux countries. Given the widespread use of English terminology in the computer domain all over the world, this is a dubious claim. Following standard Language Log practice, I used Google to get some data. I used the search terms:

window Nederland Linux maar
I included Linux so as to avoid getting oodles of hits on sites dealing only with Microsoft Windows, and the Dutch words Nederland "Holland" and maar "but" to find sites in Dutch. Sure enough, I had no trouble finding instances of the word window in its generic computer sense in Dutch text. Here's one example, from the web site of RES Multimedia. I've highlighted the generic uses of the word window:

Workstation features:
• Onmiddellijk herstel van onderbroken VMware sessies.
• Elke VMware wereld is gelijk aan een volledige PC in een window.
• Volledige netwerkondersteuning, dial-up toegang en file-sharing ondersteuning.
• Ondersteuning van SCSI en IDE schijven.
• Beeld in een Window of full-screen.
• Draai Dos, Windows 3.x, Windows '95, Windows '98, Windows NT, Windows 2000, BreeBSD, Linux en andere Intel operating systems onder VMware workstation.
• Operating systems draaien tegelijkertijd, zonder te rebooten.
• Voeg nieuwe operating systems toe zonder uw schijven te herpartitioneren.
• VMware workstation installeert als elke andere applicatie.

This section of the site is an advertisement for a product called VMWare, which allows you to run multiple operating systems on a single computer simultaneously. The second item says "Each VMWare world is like a full PC in one window". The fifth item says "Display in a window or full-screen."

Here's another example, which I found at this Dutch Linux site in an explanation of the features of OpenOffice.org, the FLOSS office suite:

Verschillende extra windowtjes maken bepaalde opties en view modes makkelijk bereikbaar.

It says:

Various extra little windows make specific options and view modes readily available.

Notice that in this case the word window bears the diminutive suffix tje, which is good evidence that the word has truly been incorporated into Dutch.

So, I think that the Dutch court erred and hope that the decision will be reversed on appeal. But the situation gets worse. In order to comply with the court's decision, in the Benelux countries Lindows has shifted to using the brand name Lin---s, which it suggests that people pronounce as if it were written LinDash. According to this article in the Register, Microsoft has filed another complaint with the court. Now they claim that LinDash sounds too much like Windows. That's pretty shaky if you ask me. They also claim that Lin---s infringes their trademark because people, "when confronted with 'Lin---s', will be reminded of 'Lindows'", and Lindows in turn is too similar to Windows.

This is nonsense. If you follow this logic, any two terms can be argued to be excessively similar because they are connected by a chain, possibly with many links, where each link is short. Here's my demonstration that Microsoft and Apple are easily confusable:

Microsoft Micresoft Mikesoft MySoft MyLoft Syloft Signoff Sinnof Sanoff Sanol Sampol Ampol Ample Apple
Each pair of words is similar enough that if one were a trademark a reasonable case could be made that the other was so similar as to be infringing, but no reasonable person would consider Microsoft and Apple to be confusable. I hope that the court is smarter than Microsoft's lawyers and understands that similarity is not a transitive relation.
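To put a rough number on this, here is a sketch that uses Levenshtein edit distance (ignoring case) as a crude stand-in for "similarity". That choice of measure is my assumption for illustration; trademark confusability obviously involves sound, appearance, and context, not just letter edits.

```python
def edit_distance(a, b):
    """Levenshtein distance: fewest single-letter insertions, deletions,
    or substitutions needed to turn a into b (case-insensitive)."""
    a, b = a.lower(), b.lower()
    prev = list(range(len(b) + 1))          # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                          # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,            # delete ca
                            curr[j - 1] + 1,        # insert cb
                            prev[j - 1] + cost))    # substitute or match
        prev = curr
    return prev[-1]


chain = ("Microsoft Micresoft Mikesoft MySoft MyLoft Syloft Signoff "
         "Sinnof Sanoff Sanol Sampol Ampol Ample Apple").split()

# Distance across each link of the chain, and across the whole chain.
steps = [edit_distance(x, y) for x, y in zip(chain, chain[1:])]
end_to_end = edit_distance(chain[0], chain[-1])

print("largest single step:", max(steps))
print("Microsoft to Apple: ", end_to_end)
```

Every link is a small step, but Microsoft and Apple share no letters at all, so the end-to-end distance is as large as it can be for words of those lengths. Small steps do not add up to a small total.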

Incidentally, this is why it doesn't work to define dialects as linguistic varieties that are mutually comprehensible, as opposed to languages, which are not mutually comprehensible. Comprehensibility is like similarity; it isn't transitive. You can't have a classification without an equivalence relation, and one of the three defining properties of an equivalence relation is transitivity. (The other two are symmetry and reflexivity.) It is easy to find chains of linguistic varieties where A and B are mutually comprehensible and B and C are mutually comprehensible and C and D are mutually comprehensible and so forth, but once you get a few links apart, the varieties are not mutually comprehensible. This results in a contradiction. If A and B are dialects of the same language, and B and C are dialects of the same language, then A and C must be dialects of the same language. But if A and C are not mutually comprehensible, by this criterion they aren't dialects of the same language.
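Here is a toy version of the argument, for those who like to see it run. The four varieties and the rule that only neighbors in the chain understand each other are made up for illustration; the three checks are just the textbook definitions of reflexivity, symmetry, and transitivity.

```python
from itertools import product

# A hypothetical dialect chain: each variety understands only its neighbors.
varieties = ["A", "B", "C", "D"]
position = {v: i for i, v in enumerate(varieties)}


def comprehensible(x, y):
    """Toy model: mutually comprehensible if at most one step apart."""
    return abs(position[x] - position[y]) <= 1


def is_reflexive(rel, items):
    return all(rel(x, x) for x in items)


def is_symmetric(rel, items):
    return all(rel(x, y) == rel(y, x) for x, y in product(items, repeat=2))


def is_transitive(rel, items):
    return all(rel(x, z)
               for x, y, z in product(items, repeat=3)
               if rel(x, y) and rel(y, z))


print("reflexive: ", is_reflexive(comprehensible, varieties))
print("symmetric: ", is_symmetric(comprehensible, varieties))
print("transitive:", is_transitive(comprehensible, varieties))
```

The relation passes the first two tests and fails the third (A understands B, B understands C, but A does not understand C), so it cannot carve the varieties into non-overlapping "languages".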

Posted by Bill Poser at 01:19 AM

March 12, 2004

Clausal coordination of nonidentical illocutions (parental advisory: nerdy!)

Can I get a little bit more nerdy than is normal on Language Log? I don't mean as nerdy as Poser when he is hot on the trail of some neat fact about the native Japanese origin of the kanji character for a non-rice paddy, or Liberman sitting on a Japanese subway train watching teenagers texting and actually working out their transmission rates in bits per second, or whatever the two of them might discuss when they get together for power-nerd talks, like, I don't know, the influence of orthographic errors in the log books of syphilitic Portuguese sea captains on the evolution of the prosody of the Middle Korean word for camel spit. But I need to get just a little bit nerdy. You'll forgive me?

Good. Because I want to briefly discuss a technical issue in syntax about whether clauses of dissimilar illocutionary force can be joined in a coordination. John Robert Ross, in his famous 1967 doctoral dissertation, presumed the answer is no, and used this fact in an argument. The supposed constraint was also discussed briefly in my own 1976 Ph.D. thesis. I think I'm now inclined to say that there is no such syntactic constraint at all.

You don't really need to know about Ross's argument. All right, since you ask: he was arguing against deriving supplementary relative clauses transformationally from coordinations. That is, he thought he had a case against saying that the deep structure of Even Clarence, who is wearing mauve socks, is a swinger (people just loved goofy examples in those days) was the deep structure that also yields this:

Even Clarence is a swinger and Clarence is wearing mauve socks.

The argument was that clauses of different illocutionary force can't be joined with a coordinator, so there would be trouble deriving Is even Clarence, who is wearing mauve socks, a swinger? -- it would have to have a deep structure that was not well formed, namely the deep structure counterpart of this:

*Is even Clarence a swinger? and Clarence is wearing mauve socks.

I was reminded of this argument this morning when I saw the following sentence in a letter to the editor in the Santa Cruz Sentinel:

They do not know what they are losing and don't give them a dime!

That is a declarative + imperative clausal coordination, and it seems natural enough to take the wind out of the sails of anyone who wants to claim that you can't have such a thing, and to refute the notion of a syntactic constraint that bans clause coordinations of differing illocutionary force.

The reverse case, with an imperative followed by a declarative, is also easy to illustrate, because of this construction:

Make one little remark and they jump all over you.

Sure, the interpretation of the first clause is not that of a directive or command; but don't confuse semantics with syntax. Syntactically that first coordinate clause is an imperative, I think.

Now, the trouble with main-clause interrogatives as the first part of a coordinate structure, as in the starred example about Clarence above, is that when you try to write them down you don't know where to put the question mark, and if you try to speak them you don't know what to do with your intonation, and when the hearer tries to interpret what you've said they get the odd feeling that you asked a question but instead of waiting for the answer you plunged into a statement. So there are orthographic, phonological, and pragmatic difficulties. If we make sure the interrogative is the second of the two coordinates, it is much easier to envisage examples, especially with a rhetorical question as the interrogative coordinate:

I don't even have shoes to wear and do you hear me complain?

In case you feel the need for some real, attested examples, I found some easily and rapidly, using a small collection of classic novels:

You have read this strange and terrific story, Margaret; and do you not feel your blood congeal with horror like that which even now curdles mine? (Mary Shelley, Frankenstein)

I cannot imagine you sitting in an office over a ledger, and do you wear a tall hat and an umbrella and a little black bag? (Somerset Maugham, Of Human Bondage)

In thirty seconds, as it seemed certain then, I would have been overboard; and do you think I would not have laid hold of the first thing that came in my way -- oar, life-buoy, grating -- anything?

I also found, in the same search, an attested case of an imperative + interrogative coordination:

Consider all this; and then turn to this green, gentle, and most docile earth; consider them both, the sea and the land; and do you not find a strange analogy to something in yourself? (Herman Melville, Moby Dick)

It has not escaped my notice that the older authors like to punctuate with a semicolon where the illocutionary force changes; but that is hardly enough to indicate that we are not dealing with coordination. I think enough evidence is piling up that if Ross had been faced with it all in 1967 he wouldn't have proceeded further with an argument based on a putative general condition banning coordination of clauses with dissimilar illocutionary force. And in that case section 2.3.10 of my Ph.D. thesis (Rule Interaction and the Organization of a Grammar, published by Garland, New York, 1979, pp. 184-186), where I argue that the constraint works better on cycle-final structures rather than deep structures, would not have been needed.

Nothing in particular follows from this. I just thought I would use Language Log as a place to note the facts. All right, you think I'm a syntax nerd. Well I'm not. I'm a sexy super fun wild and crazy guy, O.K.?

Posted by Geoffrey K. Pullum at 04:45 PM

Harm's way: in and out since 1661

Claire over at Anggarrgoon cites Science Friday for a couple of "mixed metaphors", one of them being "... all our soldiers out in harm's way ...", which she characterizes as "a cute variation of the 'out of harm's way' idiom".

But "in harm's way" is an idiom in its own right, with 100,000 ghits to 54,200 for "out of harm's way." It's the title of at least six books and many articles, as well as a power metal band.

It's been around for a while: the OED has (in the entry for nicely) 1677 MANTON Serm. Ps. cxix, civ. Wks. 1872 VIII. 5 To stand nicely upon terms of duty is to run in harm's way.

The OED has eleven quotes including "out of harm's way" to just this one instance of "in harm's way", and gives "out of harm's way" its own subsense citation: harm, n. 1.c., first quotation 1661 FULLER Worthies (1840) I. xviii. 61 Some great persons..have been made sheriffs, to keep them out of harm's way.

Still, "in harm's way" has made it into the English lexicon, if not into the OED. Maybe it was a cute variation on "out of harm's way" in 1677, but by now it's just another cliché.

Of course, what Claire doubtless meant to point out was that "out in harm's way" combines the two normally opposite forms of the cliché in one doubly-resonant sequence.

Posted by Mark Liberman at 03:38 PM

Fear not your toes, though they are strong

On my flight to Japan a week ago, I thought about beautiful feet in the Messiah. On my flight back yesterday, I beguiled a couple of hours by trying to remember other poetical feet. Not metrical feet, mind you. I mean things like "How beautiful are thy feet with shoes" (Song of Solomon 7), or "And did those feet in ancient time" (Blake, Jerusalem), or "The noble son on sinewy feet advancing" (Whitman, Leaves of Grass), or "Cuando no puedo mirar tu cara/ miro tus pies" (Neruda, "Tus pies"), or "He got feet down below his knees" (Lennon, "Come together"). Since I soon ran out of remembered feet, and it was a long flight, I allowed myself to count things related to toes: "And she said 'It's a fact the whole world knows/ That Pobbles are happier without their toes'" (Lear, "The Pobble who has no toes"); or "Mi fea, el mar no tiene tus uñas en su tienda" (Neruda, Cien sonetos de amor XX).

I had three hours between flights at SFO, so I paid $9.95 for tmobile wifi access to catch up on email, and when I was done, I checked on some lines that I hadn't been able to remember fully. In the process, I stumbled across some uses of "toes" that puzzled me, until I realized that they were typographical errors for "foes." I enjoyed them so much that I'll share them with you here.

For the Canadian poet Henry Alline (1748-1784), the Lion corpus has 56 hits for the word foes, in catchy little couplets like

Ten thousand foes with all their rage
Against my naked soul engage

Come, mighty God, these foes subdue,
Form my benighted soul anew

and resounding stanzas like

O God for my poor soul appear,
And make my foes submit;
Unlock, unlock this prison door,
And bring me from the pit.

According to the Lion, the word toes occurs three times in Alline's poetry. With their context, these are:

Although ten thousand toes beset
Their souls on ev'ry side
Jesus securely guides their feet
On him they may confide.

and

Fear not your toes, though they are strong,
The conquest doth to you belong;

and

When I am try'd he bears my grief,
And doth my toes destroy;
When in distress he brings relief
With his immortal joy.

I surmise that all three of these are the result of OCR errors happily not caught in post-editing. T for F is not a very likely lapsus calami, despite the touch-typing affinity of T and F, but it's a very common OCR error.

The Lion's OCR system and proofreader have unfortunately missed several other opportunities to improve Alline's verse, notably:

GLAD news to men, the Prince of Peace
Has in his triumphs rose!
From death and hell he takes release,
And tramples on his foes.

Most of the Lion's 2023 poetical hits for toes are not typos. For example, Ambrose Bierce wrote

Through summer suns and winter snows
I sets observin' of my toes

and Shakespeare (in Romeo and Juliet) has

Ladies that haue their toes
Vnplagu'd with Cornes, will walke about with you:
Ah my Mistresses, which of you all
Will now deny to dance? She that makes dainty,
She Ile sweare hath Cornes

The toes typos in Alline's verse are a good example of the sort of thing that spelling-correction algorithms are still not clever enough to catch. I'm not sure whether we yet know how to build a statistical language model that would register "fear not your toes" as a sufficiently improbable 18th-century hymn sequence to set off an alarm, as it did for me when I read it. In the case of Alline's oeuvre, it might be better to set the algorithms to looking for places to introduce additional errors that could plausibly be blamed on the OCR process -- this is a technically easier task, as well as a more aesthetically defensible one. Unfortunately the Lion corpus is not (as far as I know) available for either sort of research.
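As a rough illustration of the kind of check discussed above, here is a minimal sketch in Python. It is not a real language model: it only compares bigram counts for a confusable pair against a tiny reference corpus built from the "foes" couplets quoted earlier. The names REFERENCE, CONFUSABLE, bigram_counts and flag_suspects are invented for this example, and the confusion table is an assumption.

```python
from collections import Counter

# Toy reference corpus: "foes" lines quoted earlier in this post, lowercased.
REFERENCE = [
    "ten thousand foes with all their rage",
    "come mighty god these foes subdue",
    "and make my foes submit",
    "and tramples on his foes",
]

# Hypothetical OCR confusion table: suspicious word -> likely intended word.
CONFUSABLE = {"toes": "foes"}


def bigram_counts(lines):
    """Count adjacent word pairs across all reference lines."""
    counts = Counter()
    for line in lines:
        words = line.split()
        counts.update(zip(words, words[1:]))
    return counts


def flag_suspects(line, counts):
    """Return (previous word, word, alternative) wherever the alternative
    has been seen after the previous word more often than the word itself."""
    words = line.lower().replace(",", "").split()
    suspects = []
    for prev, word in zip(words, words[1:]):
        alt = CONFUSABLE.get(word)
        if alt and counts[(prev, alt)] > counts[(prev, word)]:
            suspects.append((prev, word, alt))
    return suspects


COUNTS = bigram_counts(REFERENCE)
```

With this toy corpus, "ten thousand toes" and "doth my toes destroy" are flagged, but "Fear not your toes" is not, because "your foes" never occurs in the reference lines. That gap is a small-scale version of the problem: a useful model would need far more data and much better smoothing than a sketch like this.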

Posted by Mark Liberman at 01:50 PM

Overnegation supererogation

Following up on this followup to our long list of overnegation posts, Steve (Language Hat) emailed a link leading to the sentence

"As for the site, I'm going to try to get back on track with updating soon, but don't be surprised if the new story doesn't debut as late as April."

Steve wrote: "I'm pretty sure he (Derek Kirk Kim) meant '...if the new story debuts...'"

I agree.

However, this example may be different from "fail to miss" and "impossible to underestimate" and the other overnegation examples we've discussed. In those cases, the source seems to be natural processes like construction generalization and negative concord, revealed because the processing difficulty of multiple negations prevents the results from being noticed and edited out. In this case, the source itself was probably editorial second thoughts. To make a long story short, I think the writer first thought (or even wrote) something like "... don't be surprised if the new story doesn't debut until as late as April", and then took out "until" without taking out the associated negation. The difficulty of dealing with multiple-negation sentences helps explain the error, but (if this analysis is correct) the process is a different one.

Here's a longer "pop semantics" analysis of the case.

If we say that something changes state at time X, we often mean that it doesn't change state until time X, and vice versa. Whether to say "CHANGE-OF-STATE-VERB TIME-ADVERBIAL" or "not CHANGE-OF-STATE-VERB until TIME-ADVERBIAL" is then a matter of nuance. For example, a description of Delayed-Onset Muscle Soreness says that "It typically begins a day or two after the activity that caused it"; but it could as well have said "It typically doesn't begin until a day or two after the activity that caused it." A page on fetal development says that "The expression of the sex characteristics doesn't begin until about 6 weeks after gestation", but it could also have said "The expression of the sex characteristics begins about 6 weeks after gestation." It's easy to find similar examples with stopping instead of starting, or with changes like opening or closing.

In such cases, the issue seems to be whether you want to focus on the period before the change of state, or the period after it. For example, the DOMS explanation was followed by another clause focusing on the following period: "... and lasts several days to a week".

When you put the whole thing in the complement of a propositional attitude verb like "surprise", this time-period focus can become part of what the attitude applies to. Here's a sentence from a page of poison-ivy information: "[D]on't be surprised if the rash doesn't begin until a day or two after the child has touched the plant." Here the rashless period is precisely what we're not supposed to be surprised about. Rewriting the sentence as "Don't be surprised if the rash begins a day or two after the child has touched the plant" may cause confusion (though in speaking, intonational emphasis on "a day or two after" would clarify things).

You can get the same effect of focusing on the rash-onset delay by adding a qualifier like "as late as" to the time adverbial: "Don't be surprised if the rash begins as late as a day or two after the child has touched the plant." It also works to use both the belt and the suspenders -- "Don't be surprised if the rash doesn't begin until as late as a day or two after the child has touched the plant" -- at the cost of adding complexity to an already intricate little syntactico-semantic gizmo. However, until and the negation are a package deal -- they have to come and go together.

In the example that Steve cited, the writer may have gone down this road, and then decided to simplify things by letting "as late as" help "don't be surprised" find its focus, eliminating "until" as an unnecessary bit of mechanism. The extra negation was left behind by mistake.

Posted by Mark Liberman at 12:33 PM

Terrorism and the Language of the Devil

Geoff Pullum has already debunked Professor Fred Halliday's ridiculous claim that Basque nationalists are particularly wild and intransigent because they "speak a language that no one can learn". But there's more to this story. We may not expect a professor of international relations to know much about languages, but you'd think that an expert on terrorism ought to know something about the politics of the places that breed terrorism, including sociological information about language use and attitudes where that is part of the political situation.

In the same BBC World Service discussion, as an illustration of the intransigence of the Basque and the role of language in Basque separatism, he claimed that one can't even get service in a post office in the Basque country without speaking Basque. In point of fact, only about 30% of the population of the Basque region can speak Basque. Are we expected to believe that post offices, which are operated by the national government, hire only Basque speakers, and that they refuse to serve their fellow non-Basque-speaking Basques? I have never been to the post office in the Basque country, but I have been to the post office in Catalonia. The problem that I encountered on several occasions was that post office employees sometimes insisted on speaking Spanish when I addressed them in Catalan.

Even in the institutions of the Basque Autonomous Community the use of Basque is not required. The government of the Basque Autonomous Community has a very informative web site, which includes full details of the national and regional laws and regulations governing language use and of the government's efforts to promote the Basque language. And don't worry if you can't read Basque: the web site is also available in Spanish, English, French, and German. A few documents are available only in Basque, but almost all are available in both Basque and Spanish, most in Basque, Spanish, and French.

As to the politics of Euskadi Ta Askatasuna, the organization on which the Spanish government is trying to pin the Madrid bombing, from Professor Halliday's comments you'd think that it was a politically and intellectually isolated organization whose positions and actions were comprehensible only in the special Basque linguistic and cultural context, like the Oriental cults in Sax Rohmer's Fu Manchu stories. This is hardly the case. If you want to get a good idea of the position of ETA, a lengthy interview was published on February 22nd in the newspaper Gara. You can read it here in Basque and here in Spanish translation. It reveals that ETA is a Marxist organization whose leadership is well aware of the political situation around the world and in other parts of Spain and which makes fine distinctions among the policies and behavior of the various Spanish political parties, the national government, and the regional governments. You may or may not agree with ETA's positions, but they are hardly those of some strange and isolated cult.

An understanding of ETA's position also brings out the political divisions among the Basque and the fact that ETA represents a small minority of the population. Although the Spanish Inquisition used to call Basque "the language of the devil", the Basque are traditionally devout Catholics and have their own story about how the Devil was unable to tempt them because he was unable to learn Basque. Modern Basque nationalism as propounded by Sabino Arana (1865-1903), the founder of the Basque Nationalist Party (PNV), which is now the ruling party in the Basque Autonomous Community, was a racist, extreme Catholic movement. Arana wanted to protect the Basque country against what he considered to be the depraved Spanish. The PNV is now a Christian Democratic organization. The racism and extreme Catholicism have been toned down, but even so, mainstream Basque nationalism is far from the revolutionary Marxism of ETA.

It's bad enough that the BBC seems to be doing so poorly, but if this is the level of expertise of experts on terrorism, God help us.

Posted by Bill Poser at 02:33 AM

March 11, 2004

A language that no one can learn

In a discussion on the BBC World Service today after the appalling tragedy of the train bombings in Madrid, the Basque separatist organization ETA was being blamed (perhaps too soon: Al Qaeda has now reportedly claimed credit for the slaughter). Professor Fred Halliday, Professor of International Relations at LSE and an expert on terrorism, spoke angrily and contemptuously about young-generation Basque separatists "whingeing about nothing" (since all reasonable demands have basically been met by moves toward granting of local government autonomy in the Basque country), and spoke of their intransigence. As part of what he said about their wild nationalism and unreachability by moral or political argument, he said, "For a start they speak a language that no one can learn."

It's a pity this myth about the Basque language still drifts around out there as part of the folk nonsense about language that most people have heard somewhere or other.

Basque is not Indo-European, so learning it should be compared with learning Japanese rather than with learning Spanish; but it's perfectly learnable. My friend Larry Trask (an American) at the University of Sussex is thoroughly conversant with it, and has a Basque language page devoted to providing information that is "free of the errors, misconceptions, and just plain lunacies that so often turn up in published sources of information on the language." Rudolph de Rijk (a Dutchman) wrote a dissertation on its syntax at MIT decades ago. One of my teaching assistants this quarter (an American) has spent some enjoyable summers learning it. Lots of non-Basques have successfully learned it.

Whatever the reasons might be for the political isolation of the Basque nationalists, let's not add completely mythical difficulties. If the members of the Basque nationalist movement are politically unreachable at the moment, it's not because their language is unreachable by linguistic analysis, or unlearnable by people who want to learn it.

Posted by Geoffrey K. Pullum at 02:24 PM

Trodding: winning or fading?

Sally Thomason posted recently about "trodding" for "treading" in the NYT, and speculated that "trod might replace [tread] completely in the not too distant future." I thought I'd check if Altavista's "date range" feature might help confirm this hypothesis.

Time period	treading	trodding	Ratio	percent trodding
03/02/96-03/01/98	975	29	33.6	3.0%
03/02/98-03/01/00	4432	113	39.2	2.6%
03/02/00-03/01/02	10108	251	40.3	2.5%
03/02/02-03/01/04	83297	2027	41.1	2.4%

For what they're worth, these numbers actually suggest that trodding's market share is decreasing slightly. I'd be hesitant to draw strong conclusions about sociolinguistic trends from this kind of data, though, since the distribution of sources, genres etc. on the web has not been constant over time. Since the observed "trend" in this case is weak at best, it's probably not worth trying to sharpen up the experiment.
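For anyone who wants to check the arithmetic, here is a small Python sketch that recomputes the last two columns from the raw counts in the table. The names ROWS and summarize are invented for this example; the ratio is treading hits divided by trodding hits, and the percentage is taken here as trodding hits relative to treading hits, which is an assumption about how the table was built.

```python
# (time period, "treading" hits, "trodding" hits), copied from the table above.
ROWS = [
    ("03/02/96-03/01/98", 975, 29),
    ("03/02/98-03/01/00", 4432, 113),
    ("03/02/00-03/01/02", 10108, 251),
    ("03/02/02-03/01/04", 83297, 2027),
]


def summarize(rows):
    """Return (period, treading/trodding ratio, trodding as a percent of treading)."""
    result = []
    for period, treading, trodding in rows:
        ratio = treading / trodding
        percent = 100 * trodding / treading
        result.append((period, round(ratio, 1), round(percent, 2)))
    return result


for period, ratio, percent in summarize(ROWS):
    print(f"{period}\t{ratio}\t{percent}%")
```

The recomputed ratios match the table, and the percentages agree to within rounding and fall from one period to the next, which is all the "slight decrease" amounts to.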

Posted by Mark Liberman at 01:48 PM

Trodding, as in Plodding?

In yesterday's New York Times an article on "Now, the search for the next diva of domesticity" explored the exciting possibilities for replacements for Martha Stewart, on the assumption that her legal troubles will cause her to lose her throne. One sentence in the article began this way: "Others trodding the Stewart path include...."

Trodding? This is an interesting mistake, in part because it obviously seemed natural enough to slip past the usually stern and vigilant NYT proofreaders.

The fate of English irregular verbs varies, of course. Lots of them retain their irregularity -- swear/swore, hide/hid, sleep/slept, go/went, freeze/froze, and so on. Some have slipped over into regularity; dream is heading in that direction, with its alternate past tenses dreamt and dreamed, both of them acceptable in Standard English. Tread itself falls into this category, according to my Webster's Collegiate dictionary, which gives a regular past tense form treaded beside the irregular past tense trod. Occasionally the slippage goes in the opposite direction: the etymologically expected past tense of wear would be weared, which existed until Middle English times, but instead we have Modern Standard English wore -- because of rhyming verbs like swear/swore, bear/bore, and tear/tore. People make up irregular past tense forms for fun, like snuck instead of sneaked, and children and second-language learners often produce irregular past tense forms.

But replacing present-tense tread with trod, as in the NYT article, is the only example I can think of where a new non-past form has been based on an irregular past-tense form (though there probably are others that aren't occurring to me right now). The analogic process that produced trodding is ordinary enough, and I bet the regular verb plod played a role: both verbs have to do with walking, and both occur relatively infrequently, making them likely targets for analogical remodeling. And they rhyme. At the moment, tread is still the only Standard English non-past form of this verb; but that could change -- trod might replace it completely in the not too distant future.

Posted by Sally Thomason at 10:11 AM

Beijing, Bolshoi

Bill Poser's hypothesis about the source of the "zh" pronunciation of Beijing is appealing, and very likely correct. "Zh" works better as a French pronunciation than "j" does, and Americans seem to have a fondness for Frenchifying all foreign words. There's the story -- possibly apocryphal, but I hope not -- of the refined lady who called the San Francisco opera house wanting to know when the "Bolshwa" ballet would be performing -- using the "wa" pronunciation of the final oi, as in French, instead of the Russian (and ordinary English) "oy" pronunciation. And there are other examples out there of fake-French pronunciations of words assumed to be foreign.

Posted by Sally Thomason at 10:08 AM

Wild parasitic gap construction escapes

In my enthusiasm to report a naturally occurring instance of a rare but highly studied syntactic phenomenon, I made a mistake: it turns out that the example in my recent posting on parasitic gaps is not a parasitic gap construction at all. Fortunately, the posting generated some interesting further discussion on naturally occurring parasitic gaps and double right node raising, so some good has come of it.

Jon Nissenbaum was kind enough to point out my error, and even kinder to have done it by private e-mail. :-)

He explains that the example I gave

...is an instance of across-the-board movement, coupled with right node raising out of a coordinate structure, rather than a parasitic gap:

... the woman's sweater [which(1) I [wore _(1) to _(2)] AND [had to return _(1) to its rightful owner at _(2)] a big meeting

The coordinate structure is the smoking gun, since

...parasitic gaps are gaps that appear, not in a coordinate structure as above, but in an island (as you cite Peggy Speas as mentioning). The "main" gaps (the ones upon which the PGs are parasitic) are perfectly acceptable as stand-alone extractions, unlike ATB movement out of a coordination. So, while both versions of (1) are fine, the same is not true of (2):

(1)  a sweater that I wore _ [without returning _]
        a sweater that I wore _ [without returning your phone call]

but,
(2)  a sweater that I wore _ [and had to return _ ]
       *a sweater that I wore _ [and had to return your phone call]

Many thanks to Jon for setting me straight! Again, see the followup posting by Chris Potts for some true parasitic gap examples that have been properly caged.

Posted by Philip Resnik at 08:12 AM

Beizhing

The capital of China is 北京, which in Mandarin Chinese is pronounced [bejʤiŋ] (using the voiced symbols for what are, strictly speaking, voiceless unaspirated consonants). bayjing would be a pretty good English folk spelling. For mysterious reasons, this word is routinely mispronounced by newscasters and other people who are supposed to know better.

I just heard yet again on TV the pronunciation [bejʒɪŋ], where the sound at the beginning of the second syllable is the one we might write as zh in folk spelling, that is, the sound of the s in words like measure and pleasure. Ever since newscasters and the like switched from Peking to Beijing I've been hearing this mispronunciation, and it drives me crazy. I'd understand if they were just adapting the Chinese word to the sound system of English - I don't expect them to learn Mandarin - but that's not what is going on. English speakers are perfectly capable of pronouncing [bejʤɪŋ]. [ʤ] is a common sound in English. It's the j of jaw and the dg of nudge. If you can say budging you can say [bejʤɪŋ].

So, why is it that we so often hear [bejʒɪŋ]? The only hypothesis I've come up with is that it is because the sound [ʒ] is somewhat exotic in English. It isn't very common, and as far as I know, all of the words that contain it are loans from French. So perhaps people think that [bejʒɪŋ] sounds more exotic than [bejʤɪŋ] and therefore that it is more accurate. If anybody has a better idea, I'd like to hear it. And if any newscasters are reading this, stop saying [bejʒɪŋ]!

[Update: Lameen Souag has pointed out a parallel example: Azerbaijan, in which the j represents a [ʤ] that should pose no problems for English speakers, is sometimes pronounced with a [ʒ].]

Posted by Bill Poser at 12:59 AM

Kanji for European Words

Mark's reports from Japan about texting made me think about how European words are written in Japanese. Usually such loan words are written in katakana, one of the two moraic writing systems (usually mischaracterized as syllabaries). For instance, [meeru] "mail" is written メール [me-lengthmarker-ru]. Sometimes words borrowed from European languages are written in Chinese characters. This used to be more common, when the prevailing attitude was that it was desirable to use Chinese characters as much as possible, but some such spellings are still found today.

Sometimes the Chinese character spelling is derived by choosing characters whose meaning adds up to something appropriate, without regard to their pronunciation. [tabako] "tobacco" can be written 煙草. The first character means "smoke". It has the native readings [kemu] and [kebu] and the Sino-Japanese reading [en]. The second character is "grass". It has the native reading [kusa] and the Sino-Japanese reading [soo].

In other cases, the Chinese characters are used primarily for their sound. [kurabu] "club" can be written 倶楽部. The first character is [ku] "together". The second is [raku] "pleasure", also read [gaku] "music". The third is [bu] "department, section, category". This is essentially a phonological spelling, in which the characters are used for their sound, but they have been cleverly chosen so that their meanings are not far off. Another example of this type is 珈琲 for [koohii] "coffee". The first character is [ka] "ornamental hairpin". The second is [hai] "string of pearls".

The first approach to spelling European words represents a continuation of the technique by which native Japanese words were given Chinese character spellings centuries before. For the most part, native Japanese words and morphemes were associated with single Chinese characters, but not always. Sometimes native Japanese words were given Kanji spellings by choosing a sequence of Chinese characters whose meaning was appropriate, without regard to their pronunciation. For example, [mukade] "centipede" can be written 百足. The characters mean "100 feet", just as the English word does, but the pronunciation of the word is not derived compositionally from the pronunciation of the characters. If the characters are given their native Japanese readings, the pronunciation should be [momoashi]. If they are given their Sino-Japanese readings, the pronunciation should be [hyakusoku].

The second approach, using Chinese characters for their sound, goes all the way back to China, where foreign words, such as Sanskrit Buddhist terms, were written according to their sound. Early Japanese was written entirely in Chinese characters, some used for their meaning, others for their sound. The characters used for their sound were gradually systematized and simplified until they became the kana in use today.

Posted by Bill Poser at 12:14 AM

March 10, 2004

The Decline of the BBC

The BBC seems not to be doing very well in the areas of parrot language and frog biology, but I'm afraid that this is not a recent phenomenon. I used to think that the BBC was the Voice of God, the one truly authoritative and unbiased source of news. I still watch their regular newscasts from time to time - the British accents are fun to listen to, and Mishal Husain is rather cute - but I learned over a decade ago that they can't be trusted when it comes to scientific matters.

Back in 1992 the BBC broadcast a "documentary" entitled Before Babel about how all the world's languages are related to each other and how path-breaking linguists have succeeded in reconstructing the Mother Tongue, Proto-World. It included melodramatic scenes with eerie background music in which people uttered bits of what was alleged to be Proto-World. The people interviewed were at the extreme fringe of the field - to be blunt, they were people whom most historical linguists regard as cranks - and the claims made were completely unjustified by evidence. The presentation was totally one-sided, with only a brief comment by a single mainstream historical linguist. The biological analog would be a program promoting the view that the world was created only 6,000 years ago with only a single, brief comment by an evolutionary biologist. It was so bad that WGBH in Boston, which is not run by linguists, recognized it and remade it, in a much more balanced version entitled In Search of the First Language that was broadcast on its NOVA program. You can read the transcript here. In short, it was trash, the sort of thing one expects from the National Enquirer, not from the BBC. That was when I learned that the BBC had fallen from grace. It's a shame.

Posted by Bill Poser at 02:54 AM

More junk science from the BBC

Last time it was a telepathic talking parrot. Now it's a "three-headed frog", which has "stunned a BBC wildlife expert" who is totally ignorant about frogs, as Ray Girvan explains. We surmise that BBC experts (whose bodies have been taken over by changelings from the supermarket tabloids) are measuring the frogs' English vocabulary as we go to press.

The "expert" named in the BBC frog story does seem to exist, and is identified in other BBC stories as "biologist, Mike Dilger". He is also one of the 331 "BBC employees, presenters, reporters and contributors" who signed this statement in support of Greg Dyke. The political allusion is not completely gratuitous here: confronted with three frogs mating, found by "children in a nursery", the Beeb's expert biologist Dilger said "I have never seen anything like this", and "it could be an early warning of environmental problems." The first statement was no doubt true, but he might have continued "but of course I don't know anything about frogs", instead of taking an allegedly expert poke at environmental problems. Though there are no doubt many environmental problems afflicting amphibians, this seems to be a pretty clear example of the tendentious, politically-driven "reporting" that got Andrew Gilligan and his editors in trouble.

More than five days after the frog story appeared, no follow-up or retraction by the BBC can be found (at least by searching for "frog" on their web site), despite what seems to be moderately widespread merriment among those who understand amphibians. The same can be said for the N'kisi story, except that that one was about six weeks ago. Isn't it time for the BBC to have an ombudsman, as the NYT now does?

Posted by Mark Liberman at 02:06 AM

March 09, 2004

More on meiru

Discussions over lunch today in the TITech cafeteria clarified some things about Japanese cell phone text messages. If there are mistakes in the summary that follows (and there probably are), it's because I misunderstood.

Japanese cell-phone text messages are always sent as email, and are fully integrated into the regular email system. That's why the same word meiru is used for both. Cell phones are not used for instant messaging, and in fact (at least among my consultants) instant messaging is not much used at all.

Cell-phone text messages are sent and displayed in standard Japanese orthography -- that is, kanji (Chinese logographic characters), two kinds of kana (moraic characters often referred to as a "syllabary"), and romaji (the familiar latin alphabet). Text entry is via kana, using the simple principle that the kana can be arranged in a 5x10 table, so that pressing the "1" key four times means the 4th kana in the first column, namely "e"; or pressing the "4" key once means the 1st kana in the fourth column, namely "ta" (or something like that...). There is an "enter" key that sometimes needs to be pressed between characters and sometimes doesn't, and there is some sort of auto-complete functionality, which (I think) varies among entry software variants.
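To make the multi-tap idea concrete, here is a minimal Python sketch of the lookup as I understood it. Everything about it is an assumption rather than a description of any actual phone: the key-to-column assignment (keys 1 through 9, then 0), the romaji spellings, the short ya and wa columns, and the wrap-around behaviour are all guesses, and the names KANA_COLUMNS and decode are invented for the example. It ignores the "enter" key and auto-complete entirely.

```python
# Kana columns in the usual a-ka-sa-ta-na order, written in romaji.
# Which key maps to which column is an assumption made for this sketch.
KANA_COLUMNS = {
    "1": ["a", "i", "u", "e", "o"],
    "2": ["ka", "ki", "ku", "ke", "ko"],
    "3": ["sa", "shi", "su", "se", "so"],
    "4": ["ta", "chi", "tsu", "te", "to"],
    "5": ["na", "ni", "nu", "ne", "no"],
    "6": ["ha", "hi", "fu", "he", "ho"],
    "7": ["ma", "mi", "mu", "me", "mo"],
    "8": ["ya", "yu", "yo"],
    "9": ["ra", "ri", "ru", "re", "ro"],
    "0": ["wa", "wo", "n"],
}


def decode(presses):
    """Turn a list of (key, number of presses) pairs into romaji kana.

    Pressing a key n times selects the n-th kana in that key's column,
    wrapping around if the key is pressed more times than the column is long.
    """
    kana = []
    for key, count in presses:
        column = KANA_COLUMNS[key]
        kana.append(column[(count - 1) % len(column)])
    return kana
```

Under these assumptions, decode([("1", 4)]) gives ["e"] and decode([("4", 1)]) gives ["ta"], matching the two examples above; the me and ru of meiru would take four presses of "7" and three of "9".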

Though this hardly seems like a very efficient use of keystrokes, people get to be pretty good at it. The folks at the lunch table, after some discussion and experimentation, estimated that practiced users achieve about 15 wpm, and some meiru-athletes may do substantially better. I was told that a well-known violin soloist, Senju Mariko, is also a novelist, and does her writing by cell phone, because this enables her to integrate her writing into her busy daily life.

The opinion of all the Japanese at the lunch table (three male, one female) was that women are faster at cell-phone meiru than men, and also use it more.

As for why Japanese people in general use cell phone meiru so much, there was agreement that it is considered rude to talk on the phone (cell or otherwise) in the hearing of others, and that talking on a cell phone in a public place would be especially impolite. It was also agreed that cell phone messaging is very cheap, almost free, whereas cell phone talk minutes are relatively expensive. Finally, cell phone messaging is done with one hand, and so can be done while standing on a train or on a platform or bus stop, where a laptop computer could not be used. Given that long commutes on crowded vehicles are the norm -- one of my friends said that he has a short commute, only one hour each way -- this certainly motivates a one-handed solution, whether for communication or for web information access or for gaming.

My understanding of the situation in Europe is that the economics are different (SMS messages are far from free), and also that the cell phone messages are not normally integrated into the regular email system, but just go back and forth between cell phones, and that talking on cell phones in public is not any ruder than it is in the U.S. So I'm somewhat puzzled about why text messaging by cell phone is so popular there.

Here is a weblog entry from textually.org that discusses differences between Japan and Europe, quoting other articles and blog entries -- some (quoted) highlights:

-- Pricing of SMS vs. mobile email is one major differentiator between Europe and Japan

-- The Japanese message lengths are longer (in some cases, 1,000 characters),

-- Some of Japanese are not familiar to PC. Cell phone is major way for mail and web.

And so far, this whole thing is not happening in the U.S. -- because of pricing, availability and interoperability issues, and maybe cultural differences. Will the U.S. just suddenly catch up at some point? or go off in a different direction?

Here, by the way, is a weblog consisting of "pictures or thoughts sent live from my cellphone in japan". (That's "my" as in belonging to the author of the cited weblog, not "my" as in belonging to me). There don't seem to be many Japanese-language weblogs, though.

[Update 3/31/2004: Matt X. emailed to say:

I live in Japan and most of your article matches what my friends tell me -- but the final line, about there not being many blogs in Japanese, is far from the mark. The "census" you link to is also way, way wrong -- looking at their methodology, it seems that they simply aren't looking where the Japanese blogs are.
It would be accurate to say that "there aren't many Japanese blogs which are integrated closely with the America-based English-language blogging community", I guess, but that's only to be expected. If a more accurate census was taken, covering Japanese sites equivalent to livejournal (many of these sites allow or even encourage you to update and view blogs by cellphone, incidentally..), the numbers would look very different.

Someone should tell the people at the NITLE blog census where to look.]

Posted by Mark Liberman at 11:35 PM

Beseeched by questions

Is it an eggcorn, or just a garden-variety malapropism? This one is on the boundary, since there is a one-feature difference in pronunciation. I also have to admit, it's sensible enough to be literally "beseeched by questions" -- if you allow that the questions rather than the questioners are doing the beseeching -- so except for the fact that "beseeched by questions" is obviously a misconstrual of "besieged by questions", it would be fine.

Perhaps surprisingly, "beseeched by questions" has only three ghits (and two of them are reprints of the same article, and the third , though "besieged by questions" has 772.

Posted by Mark Liberman at 10:43 PM

Double Right Node Raising in Nature

Some of the sentences that turn up in discussions of syntax are rather exotic. On the one hand, that gives rise to skepticism on the part of some, who take this to mean that syntactic theory is based on artificial data that doesn't reflect real language use. On the other hand, the interest of some of these examples has been taken to be an argument against total reliance on corpora of naturally occurring utterances, since interesting and important data may never turn up there. So, as Phil Resnik mentioned, it is reassuring when, from time to time, one comes across a natural example of one of these exotic sentences.

Here's another example, one that I encountered years ago when I translated the manual describing the implementation of the mathematical function library for a Japanese computer. (In the glosses A stands for accusative case, which roughly speaking indicates that the preceding Noun Phrase is the direct object.)

平法関数	を	ニュートン・ラフソン法	を	正弦関数	を	羃級数法	を	使用して	計算する。
heihookansuu	o	Nyuutonrahusonhoo	o	seigenkansuu	o	bekikyuusuuhoo	o	siyoosite	keisan suru
square root function	A	Newton-Raphson method	A	sine function	A	power series method	A	using	computes

The square root function is computed using the Newton-Raphson method; the sine function is computed using the power series method.

The Japanese does not actually contain any passives but I've translated it using the passive because the Japanese clauses do not have overt subjects and it is not clear what the subject really ought to be. A more literal translation is:

The mathematical function library computes the square root function using the Newton-Raphson method and the sine function using the power series method.

This is a remarkable if not bizarre-looking sentence. It has four accusative Noun Phrases followed by two verbs. NP NP NP NP V V is not something we expect to encounter. What it is is a derivative of this:

ニュートン・ラフソン法	を	使用して	平法関数	を	計算して	羃級数法	を	使用して	正弦関数	を	計算する。
Newton-Raphson method	A	using	sqrt function	A	computing	power series method	A	using	sine function	A	computes

Here the two conjuncts have all their verbs and the "using" clauses come before the "computing" clauses. An intermediate step is to put the "using" clauses between the object and the main verb, like this:

平法関数	を	ニュートン・ラフソン法	を	使用して	計算して	正弦関数	を	羃級数法	を	使用して	計算する。
sqrt function	A	Newton-Raphson method	A	using	computing	sine function	A	power series method	A	using	computes

This yields a structure in which each of the two conjuncts ends in the sequence "using computes". When you have a sequence of conjoined clauses of this type, Japanese allows all but the verb of the last clause to be deleted. This is known as Right Node Raising. As this sentence shows, Right Node Raising applies not only to a single verb but to a clause-final sequence. In our example, both of the verbs at the end of the first conjunct have been deleted.
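
The last step of the derivation can be mimicked mechanically. The Python sketch below is only a toy illustration, not a claim about how any parser or grammar formalism implements Right Node Raising: it works on English glosses, it recognizes verbs by looking them up in a small hand-made lemma table (an assumption made purely for this example), and it deletes the clause-final verb run from every conjunct but the last when the lemmas match.

```python
# Toy lemma table: any gloss listed here is treated as a verb (illustrative assumption).
LEMMA = {"using": "use", "computing": "compute", "computes": "compute"}


def split_final_verbs(clause):
    """Split a clause (list of glosses) into (everything before, clause-final verb run)."""
    i = len(clause)
    while i > 0 and clause[i - 1] in LEMMA:
        i -= 1
    return clause[:i], clause[i:]


def right_node_raise(conjuncts):
    """Delete the shared clause-final verb sequence from all conjuncts except the last."""
    *earlier, last = conjuncts
    _, last_verbs = split_final_verbs(last)
    target = [LEMMA[v] for v in last_verbs]
    output = []
    for clause in earlier:
        prefix, verbs = split_final_verbs(clause)
        if [LEMMA[v] for v in verbs] != target:
            raise ValueError("conjuncts do not share a clause-final verb sequence")
        output.extend(prefix)
    output.extend(last)
    return output


first = ["sqrt function", "A", "Newton-Raphson method", "A", "using", "computing"]
second = ["sine function", "A", "power series method", "A", "using", "computes"]
raised = right_node_raise([first, second])
```

Here `raised` comes out as four accusative Noun Phrases followed by "using computes", which is the NP NP NP NP V V pattern of the sentence from the manual.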

This sentence is quite natural and easy to read. It didn't even occur to me that it was remarkable when I first read it; it was only when I started to translate it that I realized that it was interesting. I haven't tried any Japanese parsers in quite a while, but the last time I did, none of them could handle such sentences.

Posted by Bill Poser at 10:07 PM

Don't worry of it

The headline on this NYT version of an AP story reads "Roe V. Wade Author Was Worried of Politics". The headline writer means "worried about politics", as the lede makes clear:

As the 1992 presidential election approached, the author of the Supreme Court's landmark Roe v. Wade ruling worried that there were no longer enough votes on the court to uphold the right to abortion -- and that his ideological opposites on the court would play politics with the issue.

The usage "be worried of X" for "be worried about X" strikes me as the sort of thing that a non-native speaker would write. However, a web search shows that it's a staple of headline writers:

Diplomat Worried of U.S. Plans for Cuba (Miami Herald/AP)
Officer worried of personal liability (Topeka Capital-Journal)
Brazil Worried of Aid to Colombia (AP)

Not all the examples are from headlines. This one is from Dave Farber's "interesting people" list -- and I know that Dave is a native English speaker:

Customers' Coffee Talk Heard [ and you worried of the FBI listening .. djf]

Some other examples, from apparently native writing, including the (to me completely impossible) "Don't worry of it":

I'm worried of what my friends will say.
Mary, please, don't worry of it. Take care of yourself.
Well, no sense worrying of it more.

It's amazing to me that at the age of 56, I can still learn a new subcategorization frame for a common verb of my native language -- though I've documented other examples in this space, for instance here. It's possible that "worry of" is a regional thing, like positive anymore or "close the lights". It does seem to be new -- at least all the OED's examples of worry (in the new-fangled meaning "give way to anxiety" rather than the traditional meaning "strangle" or "suffocate") have "worry about", not "worry of". But anyhow, I've somehow missed out on this simple and useful usage.

Posted by Mark Liberman at 09:23 PM

The feet of them

Before leaving for Japan last Saturday, I loaded Handel's Messiah (Ton Koopman's version, apparently now out of print) into my cheap little mp3 player. It was a long journey, so I got to listen to it several times through, and I listened to the words more carefully than I have in the past. One line in particular caught my attention:

"How beautiful are the feet of them that preach the gospel of peace."

This is just how I'd always heard the passage, but I'd never thought about it before. In today's America or even in Handel's England, it's odd to single out the feet for evaluation in this way. Why not "hands" or "lips"? Why any particular body parts at all? So I listened to it several times, wondering if my perception of it might be a mondegreen.

There are more than a few (sources of) mondegreens in the Messiah, including a remarkable case where the string of words is heard correctly but the sense of one of the words is misconstrued: "(all) we like sheep." The line continues "have gone astray" rather than "and dislike goats", which should provide a clue, but several people have told me that as children, singing in a seasonal Messiah choir somewhere, they understood the words as expressing a positive emotional response to the ovidae.

Anyhow, after careful attention, I concluded that it really is the feet, and they really are beautiful.

Pondering this on the plane, I decided it could have to do with the special (negative) role of (dirty) feet in Middle Eastern culture. "Shake off the dust of your feet", Jesus washing his disciples' feet, Iraqis beating Saddam's statue with the soles of their sandals. The messengers are so welcome that even their travel-stained feet are beautiful. And I had a dim memory of some scriptural passage about beautiful feet, though I thought there were mountains in it somewhere.

Now that I have internet access for a while here in Tokyo, I've learned that the text of Handel's Messiah was a compilation of scriptural passages created by his friend Charles Jennens (though this page claims that it was really Jennens' secretary, Dr. Pooley). And at least according to legend, Handel wrote the oratorio's music in 21 (or 23, or 24) days, which is extraordinary if true, given how much of it there is, and how good it all is.

The "feet" passage is from the King James version of Romans 10.15:

and how shall they preach, except they be sent? as it is written,

How beautiful are the feet of them
that preach the gospel of peace,
and bring glad tidings of good things!

which in turn refers to Isaiah 52.7

How beautiful upon the mountains are the feet of him that bringeth good tidings, that publisheth peace; that bringeth good tidings of good, that publisheth salvation; that saith unto Zion, Thy God reigneth!

and/or to Nahum 1.15

The Tidings of Nineveh's Fall
Behold upon the mountains the feet of him that bringeth good tidings, that publisheth peace!

Like the feet, Handel's music is beautiful, though it is now stuck in my mind like a jingle or a pop song hook.

Posted by Mark Liberman at 08:56 PM

Parasitic gaps in the wild

Philip Resnik offers a naturally occurring parasitic gap example, noting that it is "one of the few naturally occurring parasitic gap constructions I've ever come across".

I have encountered my share of parasitic gaps in the wild. I offer my full collection below (ten examples). I am proud to say that it is quite diverse. I hope these examples are useful to linguists who are studying this topic.

To whet your appetite, I begin with the most exotic of the pack:

(1) "Then there are some scenes that fill in the new blond Patricia Arquette incarnation's seedy history, and some scenes showing how deeply and ferociously attached to the blond Patricia Arquette Robert Loggia is, and some scenes that make it abundantly clear that Robert Loggia is a total psychopath who is definitely not to be fucked around with, or snuck around behind the back of with the girlfriend of."
David Foster Wallace. David Lynch keeps his head. A Supposedly Fun Thing I'll Never Do Again, p. 159.

(2) "Napoleon is one of those figures one can admire without particularly liking. Sigmund Freud is another."
Joseph Epstein. With My Trousers Rolled, p. 85.

(3) "Something you can desire without ever being expected to strive for."
Richard Russo. Empire Falls, p. 224.

(4) "Please Inspect Before Using! [...] Please inspect your documents before using."
(Fidelity Investments instructions for using new checks)

(5) "Yet the peculiar thing (which Justine had seen too often before to wonder at) was that he seldom took her advice."
Anne Tyler. Searching for Caleb, p. 45.

(6) "Or else the scandal is alluded to without being named [...]"
John Thorne. Simple Cooking, p. 198.

(7) "And the letter had that awkward, semibureaucratic, semi-messianic style she had grown accustomed to without ever liking."
John le Carré. The Spy Who Came in from the Cold, p. 164.

(8) "Brand-name foods contain things we've never heard of and should think about twice before allowing into our house."
John Thorne. Serious Pig, p. 318.

(9) "Homicide has such daredevil energy and intensity that it almost, but not quite, carries us past the many loose ends and red herrings that Mamet unleashes without knowing what to do with."
Phillip Lopate. When writers direct. In Totally, Tenderly, Tragically p. 319.

(10) "The roasted duck was first brought to the table in a copper sautoire for the diners to view before being carved."
Michael Ruhlman. The Soul of a Chef, p. 250.

A lot of these are from food writing. It could be a coincidence, but, then again, foodies are renowned for their excessively gappy sentences (Remove from bowl; place on counter; knead vigorously).

Posted by Christopher Potts at 01:06 PM

Clinton and Me

No, not Clinton and me. I'm referring to Clinton and Me, a recent book by Mark Katz, the speechwriter and humorist who wrote much of Clinton and Gore's best material.

The book is wonderful, in and of itself, but what inspires this posting is the fact that it contains one of the few naturally occurring parasitic gap constructions I've ever come across -- in the index, no less.

A brief Linguist List discussion does a nice job of laying out what parasitic gaps are and why they are interesting. Peggy Speas explains it quite clearly:

A parasitic gap is a gap that is in a position where you'd think it should violate an island condition, but it doesn't, because there is another gap in the sentence.

For example, in a, the gap is inside an "adjunct island", and in b, it's inside a complex NP island. As expected, they are ungrammatical:

a. *What did you wash an apple before you ate ____?
b. *Clinton is a guy who people who meet ____ usually vote for Democrats.

However, the grammaticality significantly improves if there is another (well-formed) gap that the offending one can "be parasitic on":

a. What did you wash ___ before you ate ___
b. Clinton is a guy who people who meet___ usually like ___.

The parasitic gap is still in an island, so it is important in GB (or any other theory of islands) to figure out what it is about the additional gap that makes it so much more grammatical.

Like so-called donkey sentences, parasitic gaps come up vastly more often in linguistics papers than in real life, so I always find it interesting when I spot one. In Clinton and Me, the index is rather nontraditional, with entries like

Law School

Parents Wanted Me to Go To, 57
Girlfriend's Parents Wanted Me To Go To, 257
Both My Brothers Go To, 57
I Might As Well Have Gone To, 330

Name Dropping, Stray Historical

Francis of Assissi, 269
Carl Linnaeus, 127
...

Name Dropping, Gratuitous Beltway

You get the idea. Can you spot the parasitic gap in this index entry?

Hair, My
- Mom cuts my, 21
- Mussed by Woman's Sweater I Wore To and Had to Return to Rightful Owner at Big Meeting, 22
- Barbara Streisand affectionately strokes my, 197
- Adolescent obsession with, 47
- Insights into the difference between Republican vs,
  - Democrats' hair, 31

Yup. It's:

My hair[, which was] mussed by [the] woman's sweater I wore ___ to and had to return ___ to [its] rightful owner at [a] big meeting.

Postscript: In the interest of full disclosure, I should mention that I'm an old friend of the Katz family. Go buy the book! (That's the sort of shameless promotion I hope you will read without criticizing.)

[Update and warning: Please see this posting for an explanation of why this example is actually not a parasitic gap construction after all.]

Posted by Philip Resnik at 10:30 AM

Dialog trajectories

One of the most interesting things here at LKR2004 was a talk by Jerry Wright, Alicia Abella and Al Gorin, from AT&T Labs, on the topic of "Speech and Dialog Mining." Jerry (who gave the talk) surveyed a range of ways to analyze the logs of customer interactions in order to find and diagnose problems, with the goal of making future interactions work better. The thing that interested me most was what they called "dialog trajectory analysis." This is interesting because of the techniques they've developed and the enormous scale of the data they've used them to explore, but it's also interesting because it highlights the apparent complexity of the problems that we humans solve every time we communicate successfully with one another.

Wright et al. build on an old idea, now commonplace -- representing a dialog schema in terms of a finite automaton. By a "dialog schema" I mean an abstract characterization of a set of possible (human-machine) dialogs. It's conventional to represent this as a network of interconnected nodes (for machine actions such as reading a prompt or querying a database) and directed arcs (for responses from users or from database queries). Any particular interaction is a path through this network. Here's a simplified form of a schema for a sub-dialog to get a phone number, from their paper:

Most dialog schemata are much more complicated than this. Here's a picture of part (less than 20%) of the "what is the problem" sub-dialog from an AT&T "trouble ticket" application (again from their paper):

The arcs highlighted in red and blue have been picked out by the "dialog trajectory analysis" I mentioned. The algorithm analyzes data from millions of passages through the network, selects particular classes of undesired outcomes, and looks (statistically) for arcs that seem to have a "causally significant" relationship to those outcomes. Those arcs may be quite far away in the graph. This kind of analysis sounds interesting -- it's too bad that no significant amount of data of this kind is generally available, outside of the companies whose systems generate it.
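
To make the idea concrete, here is a small Python sketch. It is emphatically not the algorithm from the Wright, Abella and Gorin paper: the schema, node names and responses are invented for illustration, and the "analysis" is just a per-arc failure rate rather than a real test of causal significance. It only shows the shape of the data: a dialog schema as a finite automaton, each logged dialog as a path through it, and a tally of which arcs tend to precede an undesired outcome.

```python
from collections import Counter

# Hypothetical schema: node -> {user or database response: next node}.
SCHEMA = {
    "ask_number": {"digits": "confirm", "silence": "reprompt"},
    "reprompt": {"digits": "confirm", "silence": "transfer_to_agent"},
    "confirm": {"yes": "done", "no": "ask_number"},
}


def trajectory(start, responses):
    """Follow a sequence of responses through the schema; return the arcs traversed."""
    node, arcs = start, []
    for response in responses:
        nxt = SCHEMA[node][response]
        arcs.append((node, response, nxt))
        node = nxt
    return arcs


def arc_failure_rates(dialogs, bad_outcome, start="ask_number"):
    """For each arc, the fraction of dialogs traversing it that end in bad_outcome."""
    seen, bad = Counter(), Counter()
    for responses in dialogs:
        arcs = trajectory(start, responses)
        ended_badly = bool(arcs) and arcs[-1][2] == bad_outcome
        for arc in set(arcs):  # count each arc once per dialog
            seen[arc] += 1
            bad[arc] += int(ended_badly)
    return {arc: bad[arc] / seen[arc] for arc in seen}
```

Given a log of response sequences, `arc_failure_rates(log, "transfer_to_agent")` returns one rate per arc; an arc with a high rate that lies far upstream of the bad outcome is the kind of thing the real analysis is meant to surface, with proper statistics in place of this raw ratio.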

The more general point that I wanted to make is just how complicated even very simple, stereotyped dialog schemata quickly become. It's reasonable to suppose that human mechanisms for planning and managing communicative interaction are different -- but what are they, then, and how do they work? One natural idea, central to the "classical AI" explorations of this question, is that a kind of logical reasoning is involved, where the initial premises include one's own model of the world and communicative goals, and relevant aspects of a model of the mental state of one's interlocutor, all updated dynamically as the interaction goes forward. This has the advantage that the number of states in the corresponding transition graph (if one continues to look at things that way) can become astronomically large, or even infinite, depending on how things are parameterized, while the relevant knowledge can be succinctly represented, and seems to be independently needed anyhow. And as new things are learned, or the situation changes, the implicit "dialog schema" graph should shift in global ways to accommodate the new information. However, as I've mentioned here before, attempts to model dialog in these terms have failed to scale to handle non-trivial cases.

It feels to me as if something basic is missing from this discussion. Perhaps it's analogous to Descartes' discussions of consciousness: he wrote 350 years ago about what "clockwork" couldn't do, without any real understanding of what abstract information-processing mechanisms might really be like -- just as Chomsky wrote in 1957 about what statistical models couldn't do, without understanding such models at all. It seems silly to us today -- well, at least it seems silly to me -- to worry whether conscious intelligence can be modeled with the kind of clockwork mechanisms that 17th-century inventors built. (And yes, I know that you can build a clockwork computer, and Babbage designed one, but I think the point still stands). Will it seem just as silly to cognitive scientists of the future that in the early 21st century, we worried about whether human communication can be modeled with finite automata, or with dynamic logic, or with various other frameworks that don't really work, without having a clue about... what?

Posted by Mark Liberman at 01:32 AM

March 08, 2004

Texting

I've visited Japan a couple of times before, most recently about a decade ago. One thing that's changed since my last visit is texting. Most younger people in Japan now seem to spend a lot of time sending and reading text messages on their cell phones.

This morning, I took the train from Meguro (near my hotel) to Ookayama (near Tokyo Institute of Technology). My car had about 60 people in it. Of these, 12 were busy texting. Among the other younger-looking people in the group, five were sleeping, and one was reading an English workbook. All of the other riders seemed to be older.

A striking contrast to the American pattern is that no one was actually talking on a cell phone. There may be some kind of rule about cell phone usage on the trains, I don't know -- on the bus in from Narita airport, there was a sign in Japanese and English requesting riders not to use cell phones because it "annoys the neighbors". But I don't see or hear a lot of people talking on cell phones here, compared to the U.S. In fact, I don't think that I've actually overheard any cell phone calls during the couple of days that I've been in Tokyo, although I've spent about five or six hours in various public spaces where I'd expect to hear such conversations in the U.S. I've seen people engaged in cell phone conversations, but they have always been doing it so quietly or so much off by themselves that I couldn't hear.

I don't think that I've even seen anyone texting in the U.S. Now that I think about it, this is a bit surprising, since there are plenty of foreign students at Penn who come from places (like China, Korea and much of Europe) where texting is common. Does this mean that texting is only attractive if the telecom price structure discourages talking?

The Japanese texters ("textingers"? I wonder what the agentive form really is...) are highly practiced, holding the phone in the fingers of one hand while pumping away with the thumb at about 5-6 Hz. If all 12 keys were equally likely, this would be about 20 bits/second; more reasonably, I suppose, it's about 10 bits/second.
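
The back-of-the-envelope figure is easy to check. A minimal Python sketch, assuming (as the estimate does) that each press is an independent choice among 12 equally likely keys, so that each press carries log2(12) bits; the function name is just for illustration.

```python
import math


def keying_rate_bits_per_second(presses_per_second, n_keys=12):
    """Upper bound on information rate when every key press is equally likely."""
    bits_per_press = math.log2(n_keys)  # about 3.58 bits for 12 keys
    return presses_per_second * bits_per_press


low = keying_rate_bits_per_second(5)   # thumb rate of 5 Hz
high = keying_rate_bits_per_second(6)  # thumb rate of 6 Hz
```

The two values bracket 20 bits/second (roughly 18 and 21.5). Since real key sequences are far from uniform, the true rate is lower; the 10 bits/second guess corresponds to a bit under 2 bits per press.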

Anyhow, I'd like to see a scan of a typical kid's motor homunculus after a decade or so of texting.

[Update 3/10/2004 8:28 am (in Tokyo -- 18:28 3/9/2004 in Philly). A Japanese friend gave me some additional information. First, the Japanese refer to communicating by means of cell phone text messages as "mailing", using the borrowed English verb meiru (I think that's what the right romanization would be), which is the same word used for emailing. Second, many older people do it too, though perhaps not so obsessively; I infer that it has roughly the same distribution as email usage does. Third, some of the kids that I saw clicking away on the train may have been playing games rather than sending text messages -- I checked this morning on the train, and sure enough, some were.

I realize that I don't really know much about how this system works now in Japan -- for instance, are people both reading email messages and doing a form of instant messaging? If so, are the email systems and IM systems integrated with those that work on computers? I'll ask around and report back.]

Posted by Mark Liberman at 11:47 PM

Dr. Tufts and the Marthambles

I wrote earlier about the use of the word Marthambles in novels by Dorothy Dunnett and Patrick O'Brian. Lisa Grossman, co-author of Lobscouse and Spotted Dog, sent an informative note in response, which she has kindly given me permission to post here.

Sir:

In reference to your suggestions as to the origins of the Marthambles as mentioned by Patrick O'Brian, I believe I can shed a little further light. I am the surviving author of the gastronomic companion to the Aubrey/Maturin series, *Lobscouse & Spotted Dog*, and I clearly remember researching this very question for a chapter entitled "The Sick-Bay." While we were not able to locate the actual pamphlet in question, we were fortunate enough to discover - with the kind assistance of a researcher from the National Library of Medicine - a scholarly work which describes it in some detail. He paraphrased as follows:

According to C.J.S. Thompson's _The Quacks of Old London_ (page 100), the marthambles is one of several nonexistent diseases invented by a Dr. Tufts in a pamphlet in order to sell his tonics and medicines. The other diseases mentioned in Tufts's pamphlet are the "Strong Fives" (apparently not "fires" as Patrick O'Brian quotes it), the "Moon Pall," and the "Hockogrockle." Tufts claims to have encountered these diseases on his travels over a period of forty years, and that he can cure 'em all.

I do not now have ready access to the volume itself, but I do have my notes from that time, which substantially corroborate the above explanation. The pamphlet was not clearly dated, but Thompson placed it circa 1675.

As regards the "strong fires"/"strong fives" question, I think we can be fairly sure that the former is the typo, and that it is the result of an error in the transcription of the interview. I have a tape of the interview, and if you wish can dig it out and listen to it for further confirmation. FWIW, however, I can tell you now that throughout the novels (the text of which I have on disk, which enabled me to check this quickly at the time) the disease appears *only* as the "strong fives."

There is, of course, no way to be absolutely sure that O'Brian had not read Dunnett before he first mentioned the Marthambles. But unless Dunnett also used the Strong Fives and the Moon Pall, both of which appear in the Aubrey/Maturin novels, I think it's safe to say that she was not his only source for the word. (I have long wondered, though, how he managed to resist using the Hockogrockle, which in my view is the best of them all!)

This is not to say that O'Brian was perfectly reliable in his attributions or his definitions. His tongue-in-cheek letters sent us on a merry chase after Balmagowry, Bidpai Chhatta and Pondoo; the evidence, such as it is, strongly suggests that he simply invented all three. At any rate I have yet to prove otherwise - and not for lack of trying. He and his wife were especially mischievous about Balmagowry: his hints on that subject compelled us to study southern (yes, as in the American South) cookery in depth and to read and re-read every word Robbie Burns ever writ. Not until we had done this and a great deal more did we reluctantly conclude he had simply made the thing up out of whole cloth; whereupon we took his meagre description between our teeth and boldly did the same. (If the O'Brians picked up on the delicate dig in the headnote to the recipe, they were too discreet to say so, but I can't imagine they missed it; I do hope they enjoyed the joke.) I love the word almost as much as I do Hockogrockle, and I still occasionally use it as a greeting, an expletive, even an on-line user-name.

I hope you find some of this useful.

Balmagowry to you -
Lisa Grossman

And Balmagowry right back atcha.

If marthambles was there in the original edition of Dorothy Dunnett's The Ringed Castle, then the most sensible explanation seems to be that she and O'Brian read The Quacks of Old London independently -- unless they traded obscure vocabulary by some back-channel connection.

Note that Dr. Tufts, publishing in 1675, is equally anachronistic with respect to The Ringed Castle (set in 1555, 120 years earlier) and Desolation Island (set in 1811, 136 years later).

P.S. I'm still in Japan, and will be posting a bit about LKR2004 later on, probably tomorrow.

[Update 5/20/2004: additional evidence indicates that "the Marthambles" was a term used among medical mountebanks in Tudor times. Read about it here.]

Posted by Mark Liberman at 11:40 PM

Suppose generative syntax was born in Nigeria?

English: JOHN OFTEN KISSES MARY.

French: JEAN EMBRASSE SOUVENT MARIE. ("John kisses often Mary.")

Generative syntacticians have told us that the reason for the difference in the order between verb and adverb here is that in languages with what the layman knows as conjugational suffixes, the verb moves to the left of the adverb in order to be marked by "inflection" writ large. Presumably, this is part of Universal Grammar, encoded by neurons in the brains of humans on Baffin Island, in car washes in Cleveland, hoeing gardens in Sumatra, getting wasabi-highs in Osaka, and everywhere in between. But does this really hold up?

It's an interesting idea -- but over the years contradictions have poured in. Scandinavian languages like Danish and Swedish are almost as poor in conjugational suffixes as English, and yet in some dialects the verb moves. Then little Faroese, close sister to Icelandic and bristling enough with inflection to give Latin a run for its money, often leaves the verb right where it is. Portuguese creoles like Cape Verdean and the one in Guinea-Bissau barely know a suffix from their elbow -- and yet in them, verb movement is not unknown.

Adherents of the verb movement analysis have come up with some intriguing "fixes" here, but one wonders how valid a theory can be based largely on a few languages spoken in Western Europe.

In this, a thought experiment has always haunted me. Imagine that modern theoretical syntax was founded by southern Nigerians who spoke Edoid languages like Edo, Urhobo, and Degema, and who were familiar only with these and with Mande languages (like Mandinka, Mende, Susu) spoken further up the west African coast.

In Edoid languages, various tenses are encoded solely by a tone change on the verb rather than prefixes or suffixes. In Edo, when the A in IMA has a low tone, the word means I SHOW, but when the A has a high tone, it means I SHOWED. But Mande languages are tone-shy as subsaharan African languages go. They do not use tone to express tense, nor even do they use it much in the "Chinese" way, that is, distinguishing words from one another.

Now, as it happens, Edoid languages have subject-verb-object order, while Mande languages put their verbs at the end of a sentence.

Let's imagine that our Edoid-speaking linguists hypothesized that the difference in word order was because the verb in Edoid languages had to move to the left of the object to be marked by tone, as verbs in French are assumed to move leftward to be marked by a suffix. For them, Mande verbs stay where they are because there is no tone to be marked by.

For us, this looks kind of wacky, because marking tense is a side-dish function of tone in the typical subsaharan African language. In most of them other than the Bantu ones southside, differences in tone make different words from the same syllable: in Yoruba, FO on a high tone is "float", while on a low tone it is "fly". This seems the "main course" of tone in these languages; tone that marks tense seems to be "other," just as we assume that mammals have hair mainly to keep them warm rather than to keep them dry in the rain.

But Edoid languages are special. Tone does not distinguish verbs like in Yoruba. In Edoid, tense marking is a prime function of tone (which is why linguistics textbooks often use Edo to show how tone can mark tense). This is what our pioneering southern Nigerian linguists would be working from, and hence their hypothesis.

Nevertheless, for linguists in the real world, a theory that we are innately specified for verbs moving to the left to be marked by tone looks quaint. Theories of language change show that tones arise by accident, such as when consonants at the beginning or end of a word wear away and leave a difference in pitch as a remnant, much as the Cheshire Cat disappears and leaves his smile, to quote the masterful analogy of Jim Matisoff. We think of tone as merely one of many developments that a natural language may drift into over time, and its marginality in most European languages only reinforces that assumption.

Yet theories of language change also show that prefixes and suffixes, too, are almost always the result of whole words glomming onto other ones over time by accident. It is generally assumed, for example, that the -ED past ending in English (and its equivalents in other Germanic languages) arose as a form of DID -- "I WALK-DID" -- that got stuck onto the end of verbs and devolved into a homely suffix. And after all, legions of languages in the world have no conjugational endings or gender prefixes or suffixes at all. Mandarin is not strange -- it's just another way of speaking human.

As such, one imagines that the southern Nigerian Ur-linguist, confronted with Indo-European languages, would see prefixes and suffixes as beside-the-point accidents just as we see tone.

And I suspect that both they, over in their alternate universe, and we, in our real one, are correct. Just as only a narrow data set could lead us to suppose that tone drives verb movement, we must entertain the possibility that the idea that verbal conjugation (to the linguist, INFL) drives verb movement is an accidental notion that will not stand up to a broader examination.

Posted by John McWhorter at 09:59 PM

Linguist jokes (1)

Q: Two linguists were walking down the street. Which one was the specialist in contextually indicated deixis and anaphoric reference resolution strategies?

A: The other one.

[Please note that Language Log is experiencing temporary staffing difficulties and may from time to time have to fill up space with stupid linguist jokes. Please bear with us. Our normal level of service will be resumed as soon as possible. By the way, in answer to the several hundred queries that seem to have flooded in asking how I could be permitted to post something as stupid as this on Language Log, the answer is that it's my birthday, so I can do any damn thing I want.]

Posted by Geoffrey K. Pullum at 07:12 PM

Multilingual Instant Messaging in Iraq

According to this press release from the Office of Naval Research, a recently deployed communications system combining Instant Messaging with machine translation is receiving rave reviews from soldiers in Iraq. It is said to allow speakers of English, Arabic, Polish, Ukrainian, and Spanish to communicate with each other. If it really is improving communication, that's good, but I hope that the machine translation is better than some of the systems out there. This is a situation in which a bad translation could literally get somebody killed.

Here's an example of what can go wrong.

Posted by Bill Poser at 02:38 PM

March 07, 2004

Chinese Characters Don't All Come From China

Since Mark is off fishing in Japan, I thought I'd mention a bit more about Japanese writing. A little known fact is that not all Chinese characters come from China. The Japanese created a few characters themselves. An example is 畑 [hata(ke)], which denotes a dry field as opposed to 田 [ta] which denotes a paddy. It was formed by combining 火 "fire" with 田. Since 田 is a "real" Chinese character, it is read both [ta], its native Japanese reading, and [den], its Sino-Japanese reading. In contrast, 畑 has only the readings [hatake] and, in some compounds, [hata], because it represents only a native Japanese morpheme; there is no corresponding Chinese morpheme, so there is no Sino-Japanese reading. Such characters are known as 国字 [kokuji] "national characters".

Posted by Bill Poser at 08:57 PM

Careless talk spreads viruses

One morning this week I heard an NPR newsreader say (trying to be helpful to us all) that there is a new computer virus on the loose and you are warned "not to open an attachment unless it is from someone you know." Sheesh! Are they insane? How could anyone be as careless with the language as this? Surely NPR's news team has people who know enough to be aware that the Bagle virus (which accounts for MOST of my mail at the moment) forges the names in its From-lines, picking people extremely well known to the sender (by borrowing names from other mail or from address book files belonging to the idiots who still use Outlook Express). As with the transmission of certain other viruses, you're actually more likely to get it from people you know well than to get it from a stranger. I am getting virus packages by email daily from trusted friends, from colleagues down the hall, from family members... Yet I have never had a virus infect my computer, because I never download mail attachments, regardless of whether from a stranger or from a lover or from the administration of my university, unless certain very stringent restrictions are satisfied:

the file opening is not being done under Windows (generally, that is enough to guarantee safety in itself); or
I know exactly what the attachment is and have confirmed by exchange of messages with a thoroughly trusted and computer-savvy user that the apparent sender is the real sender and the attachment is a safe file from an uninfected machine.

And even then, in the latter case, I still back up all crucial data files and say a short prayer. God bless you and keep you.

Posted by Geoffrey K. Pullum at 06:36 PM

Say No More - Once More

Jack Hitt's New York Times Magazine article Say No More about the impending death of the Qawasqar language in Chile has evoked much criticism. Mark Liberman has criticized Hitt's apparent lack of interest in the language itself as well as Hitt's muddled statements about historical linguistics. David Beaver and I have criticized Hitt's claims about the relationship between language and culture. Claire Bowern has provided a recipe for creating articles like Hitt's, and Kerim Friedman has a discussion of what a good article on the topic might be like, along with some insightful comments on the whole issue of language endangerment and maintenance. The only sympathetic comment I've seen is by Language Hat, who acknowledges the numerous linguistic errors and the fact that the piece says little about language, but considers it to give a moving account of the human meaning of language loss.

I'd like to offer a qualified defense of Hitt and his editors. The fact that articles of this type are so similar, as Claire Bowern points out, isn't really cause for criticism. It reflects the way the journalistic world operates. Most mass market publications aren't going to be interested in a new and deeper perspective because their readership won't understand it or be interested in it. Furthermore, most of their readership won't be bothered at all by the fact that a similar article has appeared elsewhere because they won't have read it. Those of us who have a particular interest in language endangerment actively seek out material on it and read everything that comes to our attention, but we're a tiny minority.

It is also true that Hitt's article does not reflect much knowledge of language, but as John McWhorter has pointed out and Geoff Pullum has seconded, Hitt's linguistic gaffes merely reflect the generally low level of knowledge about language. Except for linguists and a small number of others with a particular interest in language, very few people know much of anything about language. Journalists typically major in English or Communication or perhaps something like Political Science, with the result that they know nothing about linguistics or natural science or most other subjects beyond the sort of basic general knowledge most university graduates have. Since linguistics isn't taught at all in secondary school, this means that they know even less about language than they generally do about math and science.

Journalists arguably don't need to know much about language unless they are going to write about it, but ignorance about language extends to people who really ought to know something about it. A prime example is school teachers. Some knowledge of linguistics would be helpful in teaching reading and writing, language arts, and foreign languages. It would be helpful in identifying developmental problems and in dealing with children whose first language is not English. It would be helpful in teaching social studies, history, and geography. Lily Wong Fillmore and Catherine Snow have written a paper entitled What Teachers Need to Know about Language [PDF file] that discusses what teachers ought to know and why. In fact, the great majority of teachers know next to nothing about language.

Lest it seem that I think that all journalists are ignorant boobs, I should note that there are exceptions. Indeed, one of my best friends is a reporter. One journalist by whom I have been positively impressed is Jonathan Manthorpe, the foreign affairs columnist for the Vancouver Sun. He's one of the very few journalists writing on international affairs who really knows what he is talking about. As someone with a special interest in Asia, a part of the world of whose history most Westerners are quite ignorant, I've been especially impressed by his knowledge of Asian affairs.

Another is Gina Kolata, whose work I have admired, in Science, The New York Times, and several books, since the 1970s. The largest part of her writing seems to be about biology and medicine, but what first impressed me was that she wrote comprehensible pieces about mathematics. A large part of the time I find news items on developments in mathematics incomprehensible, which I believe to be attributable to the fact that the reporter does not understand what he or she is writing about. Her articles stood out in the understanding of the subject that they reflected. It turns out that she has a B.S. in microbiology and an M.S. in applied mathematics, plus two years of doctoral work in molecular biology. She's a terrific science writer. Knowing what you are talking about helps.

Posted by Bill Poser at 04:53 PM

Reverse Domain Name Hijacking

There's an interesting article by Stacey Knapp on Internet domain names in the Oklahoma Journal of Law and Technology. In general, domain names belong to the first person or organization to register them. However, there is a provision by which ownership can be disputed if the domain name is a trademark or is likely to be confused with a trademark. This is intended to prevent hustlers from registering lots of domain names in which they have no interest, then holding them hostage when someone with a real interest in them comes along. If you've been clever enough to register MechanicsvilleToolAndDie.com, and the Mechanicsville Tool and Die Company then decides to develop a web presence, you can't refuse to give up the domain name unless they pay you a lot of money. The Uniform Dispute Resolution Policy (UDRP) provides that in such a case the domain name should be transferred to the trademark holder.

A problem has arisen, however, with critique sites. These are sites devoted to criticism of a company with which people have had bad experiences. They often have names like MicrosoftSucks or AntiPhillips. Some of the targeted companies have attempted to obtain ownership of these domain names by claiming that they are readily confused with the company's own site. This practice has become common enough to have a name: Reverse Domain Name Hijacking. Many such claims have failed, as they should, since both the name and the site content make clear the difference between a company's own site and a critique site. But according to Knapp, some such complaints have been successful. In some cases the arbitration panel has ignored the fact that the site content will quickly clear up any confusion and ruled that non-native speakers of English might not understand American slang like sucks. They apparently don't realize that American slang probably diffuses faster than any other aspect of the language.

By the way, another tactic adopted by some large corporations is to pre-empt critics by themselves registering likely domain names. A salient example is the Chase Manhattan Bank, which has registered a plethora of domain names, such as chasesucks.com. I've never had any dealings with them, but what this strategy tells me is that they are afraid of criticism, and that tells me that they probably have poor service and are unresponsive to complaints. This doesn't sound like a good way to promote your business.

Posted by Bill Poser at 04:08 PM

March 06, 2004

What gets taught; what gets learned

John McWhorter remarks at the end of this post, about silly statements about language by Jack Hitt and sundry others: "We have to learn to expect articles like Hitt's until basic linguistics is taught in middle or high school." He's quite right. And John's remark brought back to me something more general that I have been thinking about for some time concerning middle and high school. Below is a short list of some things that I feel I really needed to know (still need to know, in some cases) in order to be a functioning human being in a modern society. And the surprising thing about this list (to telegraph my bottom line) is that the sum total of what I learned about them in (the equivalent of) middle school and high school -- in Britain, where high schools are generally regarded as much better than those in the USA -- is zero.

Basic modern abstract mathematics: what sets, relations, and functions are.
Basic logic: what an argument is, and what it means for arguments to be valid and sound.
Basic macroeconomics: what inflation is, and why national-level budget deficits might cause it.
Basic investment: what stocks are, what bonds are, and when you should hold which.
Basic meteorology: what cold fronts and low pressure troughs are and what that means for the weather tomorrow.
Basic microbiology: what bacteria are, what viruses are, and why antibiotics only kill the former.
Basic nutrition: what carbohydrates are, what proteins are, what hydrogenated fats are, and what that all means for how you should eat.
Basic law: What's a tort? What can you sue people for? What's the difference between civil law and criminal law? What's the difference between felonies, misdemeanors, and infractions?
Basic racism: who the Jews are and why Hitler murdered six million of them in an attempted extermination; who the Africans are and how the country in which I was raised grew rich on shipping them to the New World in conditions that would be illegal for cattle, destroying their culture and their humanity, and working them to death.
Basic politics: what the right wing is, what the left wing is, and what it means for how you should vote.
Basic phonetics: what vowels are, what consonants are, and how letters differ from sounds.
Basic general linguistics: roughly how many languages and language families there are, what sorts of differences there are between languages, how all languages have grammar, how we find out about such things.

Not a single one of these (from which I have omitted things like sex and driving, which were even further away from the high school curriculum back then) figured at all in anything I was taught in England between the age of 11 and the age of 16. That was the age at which I dropped out of high school, having been too often disappointed with the way subjects in which I already had a real intellectual interest had been turned by teachers, through some weird reverse alchemy, into unendurably boring crap.

Some of the things on my list I did learn later. Others I'm still shaky on. But unlike me (I ultimately returned to education and became the first in my family to get a degree), many kids never had the benefit of any formal education beyond the age of sixteen. I often wonder how they are supposed to invest for their retirement or understand the weather forecasts or know how to react to an anti-Semitic remark. It is scary how thin and trivial and alien and pointless they managed to make high school education in southern England in the second half of the twentieth century.

Perhaps in the first half of the twenty-first century, in the USA, things are different. I hope so. But what I encounter as I teach freshmen in a public university in California suggests we should not be too complacent.

Of course, there is a difference between presenting important topics to students and getting them to listen (it is the difference between leading the proverbial horse to water and getting it to drink). What I think is the most important teaching I do is a course on the Unix operating system. It changes people's lives. I taught a student who went on to use his Unix on Silicon Graphics workstations at Industrial Light and Magic, and his team won an Oscar for the special effects in Forrest Gump. But this week I found that one young woman had turned in some proposed code using the Unix stream editor sed full of elementary mistakes (it clearly hadn't been tested, though it was close to being right; just a little more careful thought would have gotten it to work), and underneath it, commented out, but intendedly visible to me and the TAs, were these words:

# Note from Linda: I hate sed! I know
# none of this is right. Blah. Whatever.

Blah? Whatever? And she wanted me to know that she was aware that the code didn't work and that she didn't care? If that's the prevailing attitude, it won't make any difference what we teach in high schools or in colleges, and we can expect rank ignorance not only about language but about everything else. On a good day I persuade myself that for most students it is not so: that they are not just saying "Blah; whatever" to themselves as I try to teach them about things I think are important.

Thanks to Bill Poser for reminding me that law should be included.

Posted by Geoffrey K. Pullum at 07:06 PM

Plum Poison

With Mark Liberman gone fishin' to Japan, and knowing the trouble LanguageLoggers are likely to get into in exotic foreign places (see Geoff Pullum's postcard from Vegas), my thoughts naturally turned to venereal disease. The Japanese word for "syphilis" [baidoku] is written 梅毒. 梅 is "plum", as an independent word given its native Japanese reading [ume], and 毒 is "poison". You might wonder what syphilis has to do with plums. The answer is: nothing. The original way of writing [baidoku] is 黴毒, where 黴 means "must, mildew". As a separate word it is given the native Japanese reading [kabi]. "Mildew poison" makes a lot more sense. 黴 is a complicated character, requiring 23 brush strokes, and isn't all that commonly used. In fact, it isn't on the Ministry of Education list of characters approved for official use. So it was replaced in common usage by the homophonous 梅, a common character requiring only 10 or 11 strokes depending on which variant you use. Such replacements of a character by a different character having the same sound are called 当て字 [ateji].

Posted by Bill Poser at 12:14 PM

March 05, 2004

Gone fishin'

I'm on my way to Tokyo for LKR2004, an "International Symposium on Large-scale Knowledge Resources." The background is the "21st Century COE (Center of Excellence) Program" of the Japanese Ministry of Education, Culture, Sports, Science and Technology (MEXT), which provides about ¥18B in special funding to 113 research programs at 50 Japanese universities. One of these programs is centered at Tokyo Institute of Technology, and is entitled "Framework for Systematization and Application of Large-scale Knowledge Resources".

According to a paper by Prof. Sadaoki Furui, the program leader,

This project will conduct a wide range of interdisciplinary research combining humanities and technology to build the framework for systematization and application of large-scale knowledge resources in electronic forms. Spontaneous speech, written language, materials for e-learning and multimedia teaching, classical literature, historical documents, and information on cultural properties will be targeted as examples of actual knowledge resources. They will be systematized based on a structure of their meanings.

I'll try to post some notes from the symposium if I can get internet access, but my blogging is likely to be light until I get back, a week from now. I expect that my fellow language loggers will pick up the slack, or you can read some of our fine vintage posts.

Posted by Mark Liberman at 06:41 PM

Fine writing at 40% adjective rate

Claudia Roth Pierpont has a fascinating article (under the generic header "Annals of Culture") in The New Yorker (March 4th, 2004) about the great anti-racist anthropologist and linguist Franz Boas. At one point (p. 63) she sums up in a single ten-word phrase what Stephen Jay Gould managed to do to advance Boas's legacy while he was Honorary Curator of Paleontology at the American Museum of Natural History: he showed us "that punctilious Darwinian science was fully compatible with Boasian ethics." Exactly so.

The ratio of adjectives to total word tokens in that effective snippet of prose, by the way, is an unusually high 40 percent. One more indication that the people who decry adjectives as indicative of bad writing are totally nuts.
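For anyone who wants to check the arithmetic, here is a minimal sketch. The adjective list is hand-labelled for this one phrase (an assumption of the sketch, not the output of any tagger), and `adjective_rate` is a hypothetical helper name.

```python
# The quoted ten-word phrase from the article.
phrase = "that punctilious Darwinian science was fully compatible with Boasian ethics"

# Hand-labelled adjectives (assumption: labelled by eye, not by a POS tagger).
adjectives = {"punctilious", "Darwinian", "compatible", "Boasian"}


def adjective_rate(text, adjective_set):
    """Return (adjective count, token count, ratio) for whitespace-split text."""
    tokens = text.split()
    hits = sum(1 for token in tokens if token in adjective_set)
    return hits, len(tokens), hits / len(tokens)


hits, total, rate = adjective_rate(phrase, adjectives)
print(f"{hits} adjectives / {total} tokens = {rate:.0%}")
# prints "4 adjectives / 10 tokens = 40%"
```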

Posted by Geoffrey K. Pullum at 03:58 PM

Page rank puzzles

Prompted by Semantic Compositions' recent self-evaluation, I finally decided to explore the page rank numbers available on the Google toolbar for Internet Explorer. I don't usually use IE, but for the occasion I cranked it up and gave the thing a try.

And I'm puzzled. In the large, things sort of make sense. But the details are puzzling.

As you doubtless know, page rank is a method for using the eigenstructure of the web's link graph as a source of information about the relative importance or value of pages. Though there are various attempts to go on beyond google, this is still the method of choice for sorting web pages whose content looks relevant to a query (where this basic relevance is calculated as some sort of weighted sum of the words a given page shares with the query). As Google's site explains:

PageRank relies on the uniquely democratic nature of the web by using its vast link structure as an indicator of an individual page's value. In essence, Google interprets a link from page A to page B as a vote, by page A, for page B. But, Google looks at more than the sheer volume of votes, or links a page receives; it also analyzes the page that casts the vote. Votes cast by pages that are themselves "important" weigh more heavily and help to make other pages "important."
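The "votes weighted by the voter's importance" idea can be sketched as a simple power iteration over a link graph. This is the textbook recurrence under stated assumptions, not Google's production system: the damping factor of 0.85 is the commonly cited value, the four-page `toy_web` is made up for illustration, and `pagerank` is a hypothetical helper name.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Textbook PageRank by power iteration.

    links maps each page to the list of pages it links to.
    Returns a dict of page -> rank, with ranks summing to 1.
    """
    pages = sorted(set(links) | {q for targets in links.values() for q in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page gets a small baseline share regardless of links.
        new = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            targets = links.get(p, [])
            if targets:
                # A page splits its damped rank evenly among the pages it votes for.
                share = damping * rank[p] / len(targets)
                for q in targets:
                    new[q] += share
            else:
                # A page with no outgoing links spreads its rank over all pages.
                share = damping * rank[p] / n
                for q in pages:
                    new[q] += share
        rank = new
    return rank


# Hypothetical four-page web: C is linked to by A, B and D.
toy_web = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(toy_web)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this toy graph C comes out on top because it collects the most votes, and A comes second because its single vote is cast by the important page C.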

It's talk like this that has Semantic Compositions calling his site "totally unimportant" just because Google (or rather the Google toolbar) assigned it a PageRank of 0 (on a crudely quantized and probably-logarithmic scale of 10). As SC is well aware, his negligence is temporary, and will disappear with time. In fact, as of this writing, his Google toolbar page rank is already 2! But many puzzles of page rank remain unexplained by the lag of linking to new sites and/or sampling new links.

MIT is the only university that I've found at page rank 10 -- Penn, Stanford, Harvard, Princeton, Yale, Berkeley, UCLA, UCSC, UCSD, Michigan, Rutgers, Colorado, Florida and CMU are all 9; Swarthmore, UConn, Vermont and Georgia are 8, but OSU is only 7, down with Vassar, Haverford and Maine. What's up with that?

Apple and Intel are 10, but Microsoft, Sun, AOL and Oracle are 9; AMD, Hitachi and Sony are merely 8.

Science magazine is 10; The New York Times, CNN and Scientific American are 9; Atlantic magazine, Salon and The New Yorker are 8; Arts and Letters Daily, the Philadelphia Inquirer, Slate and Harper's are 7.

LinguistList is 8; the LSA is 7, as is the LDC. The linguistics departments at Penn, Stanford, UCSC and Berkeley are at 7; the linguistics departments at UMass, OSU and UCLA are at 6; but the MIT linguistics department is at 8. As far as I can tell, the differential ranking of the departments' pages doesn't match the size of the departments, the ranking of the departments, or the amount of interesting stuff directly reachable from their front pages.

As a final indication of how semi-random this can be, my home page is 7 while Geoff Nunberg's is 6. He's much more famous than I am, and also has much more interesting stuff on his home page. And it's hard to believe that my feeble little home page gets as many page rank votes as Slate, Harper's, Vassar College and OSU, and more than the whole UMass and UCLA linguistics departments.

This mathematical and practical discussion of page rank helps explain why results can sometimes be unintuitive. For example, if you get quite a few links from others but don't send any outside your site, that boosts your page rank -- that might explain Science magazine at 10 -- the links come in, but they don't go out! This might also explain the oddly high rank of my graduate alma mater MIT, which is not only famous but also famously self-involved :-). But I'm still kind of puzzled about me and Geoff and OSU.
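The "links come in, but they don't go out" effect can be seen in a toy model. The sketch below assumes the textbook PageRank recurrence with a damping factor of 0.85 (not whatever Google actually runs), and the page names, the two tiny graphs, and the helpers `toy_pagerank` and `site_rank` are all hypothetical. The only difference between the two graphs is whether the site's home page links out to an external page.

```python
def toy_pagerank(links, damping=0.85, iterations=200):
    """Textbook PageRank; assumes every page has at least one outgoing link."""
    n = len(links)
    rank = {p: 1.0 / n for p in links}
    for _ in range(iterations):
        new = {p: (1.0 - damping) / n for p in links}
        for page, targets in links.items():
            for target in targets:
                new[target] += damping * rank[page] / len(targets)
        rank = new
    return rank


def site_rank(links):
    """Combined rank of the two pages that make up the hypothetical site."""
    rank = toy_pagerank(links)
    return rank["home"] + rank["inner"]


# Two outside pages link to each other and to the site's home page.
outside = {"ext1": ["home", "ext2"], "ext2": ["ext1", "home"]}

# Hoarding: the site's pages link only to each other.
hoarding = dict(outside, home=["inner"], inner=["home"])

# Leaking: the home page also links out to ext1.
leaking = dict(outside, home=["inner", "ext1"], inner=["home"])

hoard_site = site_rank(hoarding)
leak_site = site_rank(leaking)
print("site rank, no outbound links:", round(hoard_site, 3))
print("site rank, one outbound link:", round(leak_site, 3))
```

In this model the site keeps a noticeably larger share of the total rank when nothing links out, which is at least consistent with the explanation offered for Science and MIT. It says nothing about the real numbers behind the toolbar.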

Another alternative is to try the Alexa toolbar, which uses a very different method of calculating importance. In Alexa rankings, the best possible number is 1, and higher numbers indicate less importance (like rank in a competition). For Alexa, the NYT is 66 while Science is 8,506; Harvard is 1,110 to MIT's 1,199; the LDC is 2,231 while LinguistList is 47,790 and the LSA is 577,603; and Geoff Nunberg's home page ranks 1,704,538, while mine has "no data", but OSU is 3,123! Go figure.

Posted by Mark Liberman at 05:40 AM

Dennis Miller and Giving Hitt a Break

Geoff's post on chimp talk reminds me of my appearance on Dennis Miller's new show a few weeks ago. Miller is using a chimpanzee as a mascot in homage to Dave Garroway on the Today Show back in the day. While I was awaiting my participation in the sequence of sound bites that is the panel part of the show, I found myself fascinated by the chimp Miller had sitting on the set while he was taping an interview with a UN official.

Monkeys make me distinctly uncomfortable -- they strike me as, basically, shitty little people. But watching that quizzical little primate casually scratching an itch on its face at one point, with the same poised deliberation that we humans use doing the same thing, I was somehow intrigued.

So because there was a delay in the taping and I had run out of things to talk about with fellow guests Mickey Kaus and a chain-smoking LA talk show host, I decided to go check this creature out when his handlers (five people!) took him behind the set during a break.

Truly weird, as it kept reaching out with its five-fingered hand as a quest for attention of some kind. I love cats and dogs but didn't quite know what to do with this little humanish thing, so I asked "How much language does it understand? Dogs can manage about twenty words; my cat has about three. What about him?"

One handler soberly said "Oh, they understand about as much as an eight-year-old." And this meant spoken language, not signs.

Well, obviously this would have meant that the critter was processing everything we were saying, when it clearly was not. The handler was displaying the usual slippage between folk conceptions of language and we linguists' conceptions of same.

And in this vein I cannot help thinking that we need to give the Hitt article in the Times a break. The gulf in perception between us and laymen in terms of language is almost as vast as the one between ours and astrophysicists' understanding of time and space. Naturally Hitt thinks there is no future marking in Kawesqar because they use canoes -- universally people suppose that languages are just bags of words, and then Level Two of understanding, generally picked up one morning in Anthro 101, is that grammar and culture march in lockstep.

In this vein I present assorted comments on language that I have encountered over the years from thoroughly smart people.

Party, 1995: "I spent a year in Ghana and I learned Twi. It was easy because there isn't any grammar." (That is, Twi has little inflection -- despite its tones which make it as hard to learn for us as Chinese...)

White Plaza, Stanford University, 1992, in conversation with Czech-American. John: "Well, you know, Russian doesn't even have a verb TO BE in a sentence like I AM YOUR FATHER." Czech person: "That's stupid!"

Sproul Plaza, Berkeley, 1995: "I learned the creole in Guinea-Bissau when I was in the Peace Corps, and it's all metaphor."

San Francisco, 2000, backstage at an opera performance, in conversation with a native Tagalog speaker: "Oh, I always thought Tagalog was a pretty easy language." (Te-tell that to someone ang who tr-um-ies to maga-learn it after the age of three!)

Assorted conferences and linguistics publications, 1998 to present: "The only reason creoles look less complex than Navajo is because verbs take inflectional affixation when they move to INFL and in creoles like Haitian the verb doesn't move."

Man on the street: "English is easy at first, but then it's harder to get really good at." (As if once one has mastered the declensional and conjugational paradigms of German or Russian one is ready to write like Goethe or Tolstoy.)

We have to learn to expect articles like Hitt's until basic linguistics is taught in middle or high school.

Posted by John McWhorter at 03:08 AM

March 04, 2004

Jail copy editors for the right reasons

The news that copy-editing a paper before it appears in a journal may be a criminal offense if it comes from one of the Bad Guy countries (further details here) is perhaps the most astonishing I have encountered in months (despite a ready flow of often astounding news, both on Language Log and elsewhere).

I'm all in favor of sending copy editors to jail; but I think it should be for their actual practices: changing which to that in a bid to impose the (completely mythical) generalization that which is not used in what The Cambridge Grammar calls integrated relatives (the kind without the commas); altering the position of adjuncts in phrases like willing to at least consider it because of a belief in the (again, completely mythical) view that there is something called an "infinitive" in English and it should not be "split"; and so on.

I've spent too much time struggling (after editorial acceptance) for the right to use grammatical sentences in my own native language, battling against the enforcement of arbitrary cryptogrammatical dogmata. Send these which-hunters and adjunct-shifters to jail for a few months, by all means. Put them in solitary on a bread and water diet. Take away their red pencils. But not because the work they are fiddling with is by an author who happens to have the misfortune to live in Iran.

In fact, if I may extend my scrupulous sense of fairness even to copy editors, some of the changes they make to foreign-originated work will doubtless be to fix genuine errors in English crafted by non-native speakers, and that should count as a service to us, the Anglophone readers, not as a service illegally rendered to agents of a hostile foreign power.

Posted by Geoffrey K. Pullum at 03:07 PM

More flowers from the thicket of discourse studies

Back in November, I discussed a paper by Florian Wolf and Ted Gibson that criticized the RST ("Rhetorical Structure Theory") Discourse Treebank and the ideas about how to represent discourse coherence that it embodies. Wolf and Gibson documented an alternative approach to annotating discourse structure and announced an alternative annotated corpus. In December, I discussed a response by Daniel Marcu. Florian Wolf has just emailed to let me know that he and Gibson have written and posted a response to Marcu's response.

I've learned a lot by reading this back-and-forth. As I wrote in connection with the first Wolf and Gibson paper,

I think that these things -- both the RST Treebank and the Wolf/Gibson corpus -- are wonderful steps forward. Two alternative approaches to the same (hard) problem offer not just examples and arguments, but also alternative corpora (of overlapping material!), annotation manuals, annotation tools and so on.

The core question -- whether trees or more general graph structures are more appropriate for representing relations among units in discourse -- is especially difficult to resolve, in my opinion, because it's not easy to agree, in particular cases, about exactly what the units and relations should be. I share with just about everyone else the impression that such units and relations exist, and that it's possible to say things about them that are true (e.g. that segment 2 is an attribution for segments 0 and 1 in the passage below) or false (e.g. that segment 3 is an attribution for segments 1 and 2).

0. Farm prices in October edged up 0.7% from September
1. as raw milk prices continued their rise,
2. the Agriculture Department said.
3. Milk sold to the nation's dairy plants and dealers averaged $14.50 for each hundred pounds,
4. up 50 cents from September and up $1.50 from October 1988,
5. the department said.

That much is something that all parties in this debate agree about.

However, I find it harder to decide about some of the other specific issues that are in play here. For example, Marcu disagrees with W&G about whether there is a "similarity" relation between segments 2 and 5. I'm not sure what I think -- certainly the segments are similar, but I'm not clear whether that "similarity" is part of the cognitive structure of the discourse in the same sense that the "attribution" relation is. But this may be only because my first analytic instinct is to think in terms of some sort of generative model, for which tree structures are a natural choice. Such an approach may lack traction in the case of discourse, where one might imagine that the underlying process is often not tightly constrained by the (perhaps partial and emergent) structures that are undeniably present in the result.

Another case where I lack clear intuitions is the question of whether segments 3-5 are an "elaboration" just of segment 1, as W&G prefer, or perhaps of segments 0-2, as Marcu suggests.
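To make the tree-versus-graph question concrete, here is a minimal sketch in Python. It is a toy criterion, not the formal definition used by either Marcu or W&G: each relation is written as (label, source span, target span), and a set of relations is called tree-compatible if all the spans involved, including the smallest contiguous span covering each relation, are either nested or disjoint. The attribution of segment 5 to segments 3-4 and the choice of which span counts as the source of each relation are my assumptions for illustration; the passage above only states the attribution of segment 2 to segments 0 and 1.

```python
def hull(a, b):
    """Smallest contiguous span (start, end) covering spans a and b."""
    return (min(a[0], b[0]), max(a[1], b[1]))


def crossing_pairs(relations):
    """Return pairs of spans that overlap without one containing the other."""
    spans = set()
    for _label, src, dst in relations:
        spans.update([src, dst, hull(src, dst)])
    spans = sorted(spans)
    bad = []
    for i, a in enumerate(spans):
        for b in spans[i + 1:]:
            overlap = a[0] <= b[1] and b[0] <= a[1]
            nested = (a[0] <= b[0] and b[1] <= a[1]) or (b[0] <= a[0] and a[1] <= b[1])
            if overlap and not nested:
                bad.append((a, b))
    return bad


# Spans are (first segment, last segment), inclusive.
shared = [
    ("attribution", (2, 2), (0, 1)),  # stated in the text
    ("attribution", (5, 5), (3, 4)),  # assumed by analogy
]
# Elaboration of segments 0-2 by segments 3-5 (the Marcu-style reading).
marcu_like = shared + [("elaboration", (3, 5), (0, 2))]
# Elaboration of segment 1 alone, plus similarity between 2 and 5 (the W&G-style reading).
wg_like = shared + [
    ("elaboration", (3, 5), (1, 1)),
    ("similarity", (2, 2), (5, 5)),
]

marcu_crossings = crossing_pairs(marcu_like)
wg_crossings = crossing_pairs(wg_like)
print("Marcu-like crossings:", marcu_crossings)
print("W&G-like crossings:", wg_crossings)
```

Under this toy criterion the first reading nests cleanly, while the second produces spans that cross, which is the kind of configuration a general graph can represent and a single tree cannot. It says nothing about which reading is right.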

For that matter, we might ask about other possible sub-units in this little passage -- after all, there are three chunks describable as "up Q from DATE" -- perhaps these should be identified and linked by a sort of similarity relation?

This is not an attempt at a reductio argument, but rather a serious observation that a stretch of text, like a picture or a piece of music, generally embodies many simultaneous, often-overlapping and variably-salient relationships. In some cases it makes sense to strip out a particular layer of description and observe its adherence to some quite regular (often hierarchical) structure. This works for many artistic forms (sonnets, blues, movie plots) and some aspects of linguistic structure. In other cases, the process doesn't seem to converge, but instead leads us into an apparently bottomless process of interpretation and commentary. The case of "rhetorical structure" under its various names is one that has been stuck in the middle, so to speak, for millennia. The way forward is through exactly the kind of careful analysis and discussion that Marcu, Wolf, & Gibson and others are carrying on. So I agree with both of them.

Posted by Mark Liberman at 12:53 PM

The Eskimos, Arabs, Somalis, Carrier .. and English

The Eskimos may have an uncountable number of words for snow, it may have been falsely alleged in the 18th century that the Arabs have 500 words for lion, the Somalis may have 46 words for camel, the Carrier may have a special word for yellow pond lily roots, but only the English have 997 words for penis. At least, according to a BBC reviewer, the publisher of Cassell's Dictionary of Slang by Jonathon Green has advertised the book by reference to this count. Along with the 1232 words for sexual intercourse, 856 words for vagina, 449 words for beer, 994 words for prostitutes, and 707 words for marijuana.

I predict that Geoff Pullum won't care, despite his U.K.-ish origins. However, the "group X has Y words for Z" meme touches some deeply-resonant chord in most members of our species. So far, interest in this topic has not been observed in apes or parrots. I have not seen an explanation in terms of evolutionary psychology for the species-specificity of this obsession, so there may be an opportunity here for a new research project. The lack of animal models will make neurophysiological investigation more difficult, but perhaps someone will track the trait among Icelanders so that the "snowclone gene" (first hypothesized right here!) can be identified.

(Tip of the hat to Ray Girvan who pointed the way to the BBC review)

Posted by Mark Liberman at 10:25 AM

red gorilla bad me unattention

In response to this post, Mark Seidenberg sent a link to a page that he has set up contrasting the sentences of William Shatner, reflecting on his 1988 meeting with Koko the gorilla, and the sentences of Koko the gorilla, conversing in 1988 with William Shatner. His point, I think, is that we can be impressed by the emotional bond across species boundaries, and by Koko's first steps towards the symbolic expression of meaning -- or we can be impressed by the qualitative difference between what Shatner has to say and what Koko does.

The quotes are not (as I understand the page) taken from an actual dialogue, but are selected at random from Shatner's report and Koko's interactions with him (and/or with others?). So the point is not that the dialogue is incoherent (for which see here), but that Shatner's sentences (and the ideas they express) show a complexity of structure and a lack of connection to current stimulation that are very different from Koko's sentences and ideas.

One example:

"Koko and I talked. We touched hands and we touched minds. Feeling her powerful hand on the back of my neck was unlike any other experience I've known."
--William Shatner (aka Captain James T. Kirk), 1988

"candy you"
--Koko, the signing gorilla, 1988

And another:

"That hand across the border that Koko extended taught me in that one moment that we are linked inexorably with everything else in nature and that for us to be destroying species after species is criminal."
--William Shatner (aka Captain James T. Kirk), 1988

"red gorilla bad me unattention"
--Koko, the signing gorilla, 1988

Posted by Mark Liberman at 09:48 AM

Slut

Following the final episode of "Sex and the City", the word slut is in the air. Matt Yglesias (quoting "Sara Butler and Nick Troester and Sara Butler again and Amy Lamboley"), Language Hat, Semantic Compositions, Will Baude, Paul Goyette, and many others have been discussing its denotations, connotations and range of appropriate usage if any. Technorati gives 8088 hits ("thits"?) for slut, though a majority of these seem to be use rather than mention of the term. (It would be nice to have a tool that would show the graph of blogospheric commentary for a discussion like this -- or does one exist already? Technorati will show who links to a specific post, but that's only part of the picture.)

Much of the discussion deals with recent intellectual history -- for instance, Sara Butler's original post presupposes familiarity with three waves of feminism. Looking at the same ideas in a longer historical span may also help to frame the discussion, and a dictionary constructed on historical principles can help sketch how the ideas behind the words it tracks have changed (or haven't changed) over time. The OED's entry for slut is a little exercise in lexicographic sociology, with a surprising amount of conceptual continuity across the centuries: bad housekeeping, loose sexuality, general uppitiness and terms of endearment have been all mixed together since the middle of the 17th century. I was struck by how difficult it often is to assign the citations clearly to one sense or another, even more than in most cases of word-sense ambiguity.

In current usage, the sense of promiscuity predominates, along with what the OED calls "playful use, or without serious imputation of bad qualities". For some people, I guess the discussion now is about whether the "playful use" can become the main use, and for others, it's about whether the traditional "loose character" sense has been or can be purged of its negative connotations.

The word slut itself clearly retains strong negative connotations, quite apart from one's opinions about sexual morality, but such things can change if enough people want them to. I wouldn't use the word myself, not so much because it's offensive as because it projects bad associations based on a framework of ideas that I don't endorse. Embracing the word is one way to confront the framework -- as has been done with some success in the cases of queer and geek -- but slut is a case where attitudes are less polarized and perhaps the underlying issues are also more nuanced.

Here's the whole OED entry:

    1. a. A woman of dirty, slovenly, or untidy habits or appearance; a foul slattern.

  1402 HOCCLEVE Letter of Cupid 237 The foulest slutte of al a tovne. c1440 Pallad. on Husb. IV. 273 Ful ferd is hit for touching of vnclene Wymmenand slottes y suppose hit mene. 1483 Cath. Angl. 345/2 A Slute, vbi foule. 1530 PALSGR. 271/2 Slutte, souilliart, uilotiere. 1581 G. PETTIE Guazzo's Civ. Conv. III. (1586) 137b, I haue noted often those dames which are so curious in their attire, to be verie sluttes in their houses. 1621 BURTON Anat. Mel. To Rdr. 24 Women are all day a dressing, to pleasure other men abroad, and go like sluts at home. 1715 HEARNE Collect. (O.H.S.) V. 98 Nor was she a Woman of any Beauty, but was a nasty Slut. a1763 SHENSTONE Odes Wks. (1765) 190 She's ugly, she's old,..And a slut, and a scold. 1848 KINGSLEY Saint's Trag. II. viii, Almshouses For sluts whose husbands died. 1883 S. C. HALL Retrospect II. 249 She looked the part of a ragged, slatternly, dirty slut.

  fig. 1602 MARSTON Ant. & Mel. II. Wks. 1856 I. 26 Would'st thou have us sluts and never Shift the vestur of our thoughts? 1642 FULLER Holy & Prof. St. II. xii, Did Rome herein look upon the dust behind her own doores, she would have but little cause to call her neighbour slut.

b. A kitchen-maid; a drudge. rare.

c1450 St. Cuthbert (Surtees) 133 The quene her toke to make a slutte, And to vile services her putt. 1855 J. D. BURN Autobiogr. Beggar Boy (1859) 68, I lived with him..for nearly six months, and acted the part of cook, slut, butler, page, footman, and valet de chambre.

c. A troublesome or awkward creature. Obs.

c1460 J. RUSSELL Bk. Nurture in Babees Bk. (1868) 158 Crabbe is a slutt to kerve & a wrawd wight.

    2. a. A woman of a low or loose character; a bold or impudent girl; a hussy, jade.

  c1450 Cov. Myst. (Shaks. Soc.) 218 Com forth, thou sloveyn! com forthe, thou slutte! c1515 Cocke Lorell's B. 11 Sluttes, drabbes, and counseyll whystelers. 1577-82 BRETON Flourish upon Fancie Wks. (Grosart) I. 6/2 To haunt the Tauernes late,..And swap ech slut vpon the lippes, that in the darke he meetes. 1621 BURTON Anat. Mel. I. ii. IV. i. (1651) 143 A peevish drunken flurt, a waspish cholerick slut. 1698 FRYER Acc. E. India & P. 375 Disputes of their Religion, in which he found the crafty Slut would involve him. 1742 FIELDING J. Andrews II. iv, I never knew any of these forward sluts come to good. 1777 SHERIDAN Trip to Scarborough IV. i, These lords have a power of wealth indeed, yet, as I've heard say, they give it all to their sluts and their trulls. 1839 DICKENS Nich. Nick. xviii, Never let anybody who is a friend of mine speak to her; a slut, a hussy. 1848 Dombey xliv, Does that bold-faced slut intend to take her warning, or does she not? 1881 BESANT & RICE Chapl. of Fl. I. xii, My lord shall marry this extravagant slut.

  fig. 1602 KYD Sp. Trag. III. xiia, Night is a murderous slut, That would not haue her treasons to be seene.

    b. In playful use, or without serious imputation of bad qualities.

  1664 PEPYS Diary 21 Feb., Our little girl Susan is a most admirable slut, and pleases us mightily. 1678 BUNYAN Pilgr. I. 112 As the Mother cries out against her Child in her lap, when she calleth it Slut and naughty Girl, and then falls to hugging and kissing it. 1710-1 SWIFT Lett. (1767) III. 79 Ah! you're a wheedling slut, you be so. 1740-2 RICHARDSON Pamela III. 207 Well did the dear Slut describe the Passion I struggled with. 1846 LANDOR Imag. Conv. I. 233 Nanny, thou art a sweet slut. 1884 GORDON Jrnls. (1885) 115 Why the black sluts would stone me if they thought I meditated such action.

  transf. 1862 THACKERAY Philip xiii, You see I gave my cousin this dog,..and the little slut remembers me.

    3. A female dog; a bitch. Also attrib., as slut-pup. ?orig. U.S.

  1821 J. FOWLER Jrnl. 13 Nov. (1898) 42 A large Slut Which belongs to the Party atacted the Bare. 1845 G. LAW in Youatt's Dog (ed. Lewis, 1858) iii. 88 The dog-pup..and the slut-pup. Ibid. 89 The dog was of a dingy red colour, and the slut black. 1853 W. IRVING in Reader No. 57. 131/3 My little terrier slut Ginger..having five little Gingers toddling at her heels. 1893 J. INGLIS Oor Ain Folk (1894) 10, Sluts were not so frequently used for shepherding purposes as dogs, being less tractable.

4. a. A piece of rag dipped in lard or fat and used as a light.

1609 C. BUTLER Fem. Mon. (1634) 151 Matches are made of linen rags and Brimstone, after the manner that maids make Sluts. 1852 Blackw. Mag. Mar. 363 Writing by the light of what Irish Jenny called ‘sluts’ -- twisted rags, dipped in lard, and stuck in a bottle. 1886 L. M'LOUTH in Library Mag. Aug. (1887) 64 Sometimes..there were for additional light, lard ‘sluts’, or tallow ‘dips’.

b. The guttering of a candle.

a1864 GESNER Coal, Petrol., etc. (1865) 92 The melted material overflows, and bears with it the name of ‘slut’.

5. Special collocations, as slut's corner, a corner left uncleaned by a sluttish person; also fig.; slut-, slut's-hole, a place or receptacle for rubbish; also fig.; slut's-pennies, hard pieces in a loaf due to imperfect kneading of the dough; slut's wool, the fluff or dust left on the floor, etc., by a sluttish servant or person.

1573 TUSSER Husb. (1878) 167 Sluts corners auoided shall further thy health. 1583 GOLDING Calvin on Deut. cxxxiii. 814 Our house shalbe swept, & we will good heed y^t no sluts corner be left. 1608 TOPSELL Serpents (1658) 779 Rubbing, brushing, spunging, making clean sluts-corners. 1710 SWIFT On a Broomstick Wks. 1755 II. I. 181 He sets up to be..a remover of grievances, rakes into every slut's corner of nature [etc.] 1750 W. ELLIS Country Housew. Comp. 21 There is often what we call slutts-pennies among the bread, that will appear and eat like kernels. 1862 Sat. Rev. 15 Mar. 298 There are a good many slut-holes in London to rake out. 1862 Edin. Rev. Apr. 410 Upstairs there is ‘slut's wool’ under the beds. 1893 Westm. Rev. Jan. 17 She would also..see that floors were scrubbed, and corners clear of ‘slut's-wool’, and spiders well kept down.

Posted by Mark Liberman at 08:20 AM

A Briefe and a Compendious Table

A question about the origin of concordances recently came up in conversation, and I'm preparing a talk on "Large-scale knowledge resources in speech and language research," so I thought I could justify a bit of historical investigation. The results will wind up as a single line of PowerPoint in my presentation, but I'll share the longer form below, thus saving you ten minutes of poking around on the web, if you're interested in the topic at all.

In 1230, 500 Dominican monks, working in Paris under Hugo de Sancto Caro (or Hugh of St. Cher), produced a concordance of the bible (Latin Vulgate). According to the Catholic Encyclopedia,

It contained no quotations, and was purely an index to passages where a word was found. These were indicated by book and chapter (the division into chapters had recently been invented by Stephen Langton, Archbishop of Canterbury) but not by verses, which were only introduced by Robert Estienne in 1545. ... This beginning of concordances was very imperfect, as it gave merely a list of passages, and no idea of what the passages contained. It was of little service to preachers, therefore; accordingly, in order to make it valuable for them, three English Dominicans added (1250-1252) the complete quotations of the passages indicated.

The 500 monks in Paris must have worked in a very inefficient way: perhaps each monk took one word and read the whole bible noting any instances, and then went on to another word. This would make the computational complexity O(N*M), where N is the number of lemmas (i.e. word stems to be indexed) and M is the number of lexical tokens in the Vulgate. If file-cards had been invented -- or monastic labor were not cheaper than parchment -- it should not have needed so many monks. Presumably the (unnamed) three English Dominicans could then use a random-access rather than serial-search method, reducing the complexity of their task to O(M). Even this improved work indexed only what we would now call "content words", and so

Another Dominican, John Stoicowic, or John of Ragusa, finding it necessary in his controversies to show the Biblical usage of nisi, ex, and per, which were omitted from the previous concordances, began (c. 1435) the compilation of nearly all the indeclinable words of Scripture; the task was completed and perfected by others and finally added as an appendix to the concordance of Conrad of Halberstadt in the work of Sebastian Brant published at Basle in 1496.
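The complexity contrast drawn above, between one monk reading the whole text per word and a single pass that files each token as it is read, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the sample is a lowercased, unpunctuated snippet of the opening of Genesis in the Vulgate, and the matching is on exact word forms rather than lemmas (so "terram" is not filed under "terra"). The single-pass version relies on hash-based set and dictionary lookups, which take roughly constant time on average.

```python
from collections import defaultdict


def serial_search_concordance(headwords, tokens):
    """One full pass over the text per headword: N * M comparisons."""
    index = {}
    comparisons = 0
    for word in headwords:
        hits = []
        for position, token in enumerate(tokens):
            comparisons += 1
            if token == word:
                hits.append(position)
        index[word] = hits
    return index, comparisons


def single_pass_concordance(headwords, tokens):
    """One pass over the text, filing each token under its headword: M lookups."""
    wanted = set(headwords)
    index = defaultdict(list)
    lookups = 0
    for position, token in enumerate(tokens):
        lookups += 1
        if token in wanted:
            index[token].append(position)
    return {word: index.get(word, []) for word in headwords}, lookups


tokens = ("in principio creavit deus caelum et terram "
          "terra autem erat inanis et vacua").split()
headwords = ["deus", "terra", "caelum"]

serial_index, comparisons = serial_search_concordance(headwords, tokens)
single_index, lookups = single_pass_concordance(headwords, tokens)
print(serial_index, comparisons)
print(single_index, lookups)
```

Both functions produce the same index of token positions; the difference is only in how many times the text is read, which is the point of the file-card remark.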

According to the Jewish Encyclopedia,

The revised edition of this [Hugo de St. Caro's] work, made by the Franciscan Arlotto di Prato (Arlottus), about 1290, served as a model for the concordance to the Hebrew Bible which Isaac Nathan b. Kalonymus, of Arles in Provence, compiled 1437-45. Isaac Nathan ... was led to undertake this task by discovering, during the polemic discussions forced upon him by Christian scholars, that, in order to refute the arguments drawn by his opponents from the Bible, it was necessary to have an aid that furnished a ready reference to every Biblical passage and a quick survey of all related passages. He called his concordance "Meïr Natib" (Enlightener of the Path); on the title-page of the first edition, however, it is also called "Yaïr Natib" (It Will Light the Path, after Job xli. 24 [A. V. 32]).

The first English concordance was of the New Testament in 1535 by Thomas Gybson, and of the whole bible by John Marbeck in 1550. Another mid-16th-century English concordance bore the charming title "A Briefe and a Compendious Table, in manor of a Concordance, openying the waye to the principall Histories of the whole Bible, etc."

According to the OED, the English word concordance dates from 1387:

6.b. An alphabetical arrangement of the principal words contained in a book, with citations of the passages in which they occur. These were first made for the Bible; hence Johnson's explanation ‘A book which shows in how many texts of scripture any word occurs’. Orig. in pl. (med.L. concordantiæ), each group of parallel passages being properly a concordantia.
This is sometimes denominated a verbal concordance as distinguished from a real concordance which is an index of subjects or topics.

1387 TREVISA Higden (Rolls) VIII. 235 Frere Hewe [ob. 1262]..þat expownede al þe bible, and made a greet concordaunce [Harl. MS. concordances] uppon þe bible. 1460 J. CAPGRAVE Chron. 154 Hewe [of S. Victor]..was eke the first begynner of the Concordauns, whech is a tabil onto the Bibil. 1550 MARBECK (title) A Concordance, that is to saie, a Worke wherein by the Ordre of the Letters of the A.B.C. ye maie redely finde any Worde conteigned in the whole Bible. 1561 T. NORTON Calvin's Inst. Pref. to Contents, They followed the Concordances of the Bible, called the great Concordances, which is collected according to the common translation. a1631 DONNE in Select. (1840) 192 To search the Scriptures, not as though thou wouldst make a concordance, but an application. 1665 BOYLE Occas. Refl. Pref. (1675) 27, I had not a Bible or Concordance at hand. 1737 CRUDEN (title) Complete Concordance to the Old and New Testament. 1828 E. IRVING Last Days 37 A simple reference to the concordance..will serve to clear up these prophetic matters. 1837 Penny Cycl. VII. 434/2 The compiler of the first concordance in any language was Hugo de St. Caro, or Cardinal Hugo, who died in 1262. 1845 MRS. C. CLARKE (title) Concordance to Shakespeare. 1869 D. B. BRIGHTWELL (title) A Concordance to the entire Works of Alfred Tennyson.

fig. 1741 WATTS Improv. Mind I. i. §5 Memorino has learnt half the Bible by heart, and is become a living concordance.

attrib. and comb.

1856 S. R. MAITLAND False Worship 163 All that the concordance-maker can tell us about it. Ibid. 196 Finding so much discordance in the concordance part of his work.

The earliest cited verbal use is

1888 Athenæum 6 Oct. 450/1 The difficult ‘Astrolabe’, which they concordanced some years ago.

Posted by Mark Liberman at 07:16 AM

Roots

There doesn't seem to be much of a correlation between grammatical features, such as whether or not a language has a future tense, and culture, such as whether people stay put or travel around a lot, but there is a relationship between culture and the fine grain of the lexicon. People who live in hot climates don't have refined terminology for snow and ice; people who live in the northern forest don't have refined terminology for camels. Sometimes small details are meaningful.

In Carrier, there is a word for the roots of the Yellow Pond Lily Nuphar lutea. In the Stuart/Trembleur Lake dialect it is [xuɬ]; in the Southern dialects it is [hʌwuɬ]. There is no relationship whatever between these terms for the roots and the name for the plant as a whole, which is [xeɬt´az̻]. There is no other plant for whose roots Carrier has a distinct term. In every other case you just say "the roots of such-and-such a plant" or "X roots", just as we would in English.

Why is there a term for waterlily roots? When you know what Carrier life was like until recently it makes sense. As I've described before, most of the food consumed during the long, cold winter was stored during the summer. By the end of the winter, the food supply was often running very low. Some years, people would be starving. Waterlily roots (photo here) are edible and easy to obtain once the ice is gone, so they were a valuable source of food at the time of year when supplies were likely to be low. They also had a medicinal use. So the reason that waterlily roots have a distinct name is that they played a particularly important role in traditional Carrier life.

Waterlily roots are the only roots that have a distinct name in Carrier, but there is another kind of root that is referred to in a special way. This is the root of the Black Spruce Picea mariana. Spruce roots are called [xi] or [xʌi] depending on dialect. The tree itself is called [t̻s̻´u]. This isn't a distinct term in quite the same way that [xuɬ]/[hʌwuɬ] is though. [xi]/[xʌi] is the bare stem of the word for "root".

In Carrier, there is a class of dependent nouns. These are nouns that cannot occur in their bare form. They can be used as components of compounds, but if they are used on their own they must be used with possessive prefixes. For the most part, the nouns in this class are kinship terms and body parts, that is, the things that are thought of as inalienably possessed. The parts of plants are treated like body parts. In Stony Creek Carrier, for example, "flower" is [indak], but [indak] never occurs as such. If you want to say "the flowers of this rose bush", you would say [ndi xwʌs bindak] "this rose its-flowers". If you want to talk about "flowers" without saying whose they are, you have to say [ʔindak] where /ʔ/ is the indefinite possessive prefix, meaning "someone's" or "something's". If you want to talk about the flowers as separated from the plant you have to use a special form for the alienable possession of inalienably possessed things; "my elder sister's flowers" is [sjat beʔindak], where [sjat] means "my elder sister". If you were to say [sjat bindak] it would mean that your sister was a plant and that the flowers were part of her body.

So, taking our examples from the Stony Creek dialect, the stem of "root" is [xʌi]. When it is possessed it becomes [ɣʌih]. Since "root" is a dependent noun, you can't use [ɣʌih] by itself. If you want to talk about some unspecified roots you have to refer to them as [ʔʌɣʌih]. So, the word for Spruce roots is not exactly a distinct word. Rather, it is the generic term for "root", treated in a special way. Unlike the generic term, it is not dependent. In effect, the Carrier word for "root" can be used both as a generic term, in which case it is dependent, and as the specific term for Spruce roots, in which case it is not dependent.

So, why are Spruce roots special? Well, Spruce roots play an important role in traditional culture. They are split to make a kind of fine cord used for such purposes as sewing up baskets.

A Basket by Madeline Johnnie
By and large, roots did not play a large part in traditional Carrier life, but Spruce roots had a special role and so were treated as the prototypical root.

Posted by Bill Poser at 12:53 AM

March 03, 2004

The Algonquian morpheme auction

Well-intentioned people who want to extend the legal concept of property rights to traditional culture, whether linguistic or otherwise, should consider the possibility of unintended consequences along the lines that Bill Poser and the Watley Review suggest.

I wrote about this in a presentation that I gave at Exploration 2000. I've reproduced the relevant section below -- not a lot (that I know of) has changed in the intervening three years. [Sorry about the stale links -- that's the web for you...]

UNESCO/WIPO proposal for sui generis folklore rights

The World Intellectual Property Organization (WIPO) has proposed a number of new forms of intellectual property, to cover cases that are omitted or given what is felt to be inadequate coverage under existing laws.

Two of these are especially relevant to language documentation. One proposal suggests special protection for databases, and another proposal suggests special protection for expressions of folklore.

The database proposal has been very severely criticized in the U.S., by individuals and organizations from many political and cultural viewpoints. The folklore proposal has been largely ignored, though many of the same objections apply.

Speaking for myself, I am sympathetic to the criticisms of the proposed sui generis database rights, and feel the same way about the proposed folklore rights. It is certainly true that standard copyright does not protect folklore, because it is not an individual "work of authorship", is often not "fixed in a tangible medium of expression", and so on. However, it is quite possible that the proposed cure would be worse than the condition it aims to help.

In evaluating things like the WIPO database and folklore protection proposals, one can see them in two ways: as attempts to protect people's work and people's rights -- a sort of human rights initiative -- or alternatively, as an attempt to convert common ground into commercially exploitable property -- a sort of modern version of the enclosure movement. To the extent that the second view is correct -- and there will be many capable lawyers and deal-makers working hard to use any new laws in that way -- the results may be the opposite of what some supporters of these initiatives have in mind.

It is worth reading the proposals carefully, and thinking about what consequences they might have in actual practice.

Could Disney or Sony buy the exclusive rights to a body of folklore, in perpetuity? Yes, if sui generis folklore protection is a form of property, then it can be bought and sold; and in any case, licensing is to be at the sole discretion of "the competent authorities", who are free to negotiate exclusive arrangements. Could dissident works be suppressed or destroyed on the grounds that they are an "illicit exploitation" because they are "outside the traditional or customary context of folklore" and "without authorization by a competent authority"? Absolutely. Note that the WIPO model provisions specify that "an utilization, even by members of the community where the expression has been developed and maintained, requires authorization if it is made outside such a context and with gainful intent." (say, at a political fund-raiser...) In fact, a community member would be subject to "penal sanctions" if the relevant governmental minister determines that his or her "expressions of folklore" are "distorted in any direct or indirect manner prejudicial to the cultural interests of the community concerned." In other words, the minister of culture could put someone in jail for composing an irreverent folksong.

Reading the WIPO model provisions, my personal reaction is to see helpful-sounding principles with a staggering potential for tyranny in practice.

There are also some difficult conceptual issues. The fact that ethnic groups do not exactly coincide with national boundaries will make it hard to figure out which government would get to authorize activities and collect the tariffs for which body of folklore. For instance, would a Chicago polka band need to get clearance from and pay royalties to the Polish government? And there are also questions about how far back in history the ownership of such cultural property should go. According to this article, three Maori tribes are threatening suit against Lego for producing a game that includes characters with Polynesian names and story lines allegedly similar to traditional stories from Easter Island. Since the Easter Island culture is related to that of the New Zealand Maori roughly as Polish culture is to Russian, this case is roughly comparable to one in which a Russian nationalist organization sued the estate of Lawrence Welk over polka royalties. To sort all this out -- if it really is to be sorted out -- will involve a massive transfer of resources to the world's lawyers.

See Report on Australian Indigenous Cultural and Intellectual Property Rights for a more sympathetic perspective on the use of the law of property in this area.

The Hague Conference on Private International Law's Proposed Convention

It's worth noting in this connection that The Hague Conference on Private International Law's proposed Convention on Jurisdiction and Foreign Judgments in Civil and Commercial Matters would interact with local sui generis intellectual property rights in potentially pernicious ways. The cited Convention (see this link for more details) provides a set of rules about jurisdiction for cross-border litigation, covering nearly all civil and commercial litigation. Within this framework, each member country agrees to enforce the judgments and injunctive orders of courts in other member countries, without any requirements to harmonize the laws involved.

49 countries have signed, including the U.S., Canada, France, Germany, China, Croatia, and Egypt. The fact that the Hague Convention covers sui generis intellectual property regimes creates many opportunities for legal mischief. As James Love writes in a recent article on the topic:

For example, if Cuba enacted a sui generis regime and declared that the Cuban "beat" was intellectual property, it could get a judgment in Cuba against US record companies that were engaged in cultural "piracy," and demand for example, 5 percent of the revenues from global sales of music that use the Cuban beat. Other countries could do the same thing. These judgments would be enforceable globally, under the Convention. So too would bio-piracy judgments against US and European biotechnology and pharmaceutical companies, for "stealing" traditional knowledge, or exploiting without benefit sharing a variety of biological and genetic resources. The motion picture industry could be hit with new sui generis IPR liabilities by countries that give rights in history. Countries like China, which is a member of the Hague Conference, could use this to limit who could actually make films about China. The Hague convention would instantly create a legal framework to legitimatize all of these new IPR claims, and it would not even matter if the "infringing" party did business in the country at all, since the judgments would be enforceable globally, in any Hague member country, and the claims could be based upon shares to global (rather than local) revenues of products.

Love points out that the direction of legal action will not by any means only be from the less developed world against the U.S., Europe and Japan. In fact, developed countries (and the multinational companies based there) have more money and lawyers to devote to the process, and also better access to the courts where the outcomes will be decided, so that their sui generis extensions of intellectual property are likely to turn out to be more valuable:

Some would consider this [international enforcement of sui generis IPR] a positive feature of the Convention, because it would give the developing countries opportunities to "tax" the rich countries, under new and controversial IPR regimes. But of course, the rich countries could and will also enforce their own regimes, including, for example, the European Union sui generis regime on database protection. The US and EU would probably modify their sui generis regimes on pharmaceutical registration data to make it illegal for developing countries to rely upon those data for registration of generic products in poor countries, an approach already included in the new US-Jordan "Free Trade" agreement. And in general, would one would observe is a new dynamic of everyone trying to create their own "rights" in everything, until the public domain shrinks if not disappears altogether.

The ultimate outcome of all of this is uncertain, and depends on larger and more important issues than the IPR status of language documentation materials. The uncertainties should not prevent us from going forward with language documentation projects. It seems unlikely that sui generis property rights will be successfully attached to words, inflections, syntactic structures, or the forms of everyday discourse. Whatever the outcome, linguists' best protection against such problems is to be solidly based in the speech communities in question, which is a good idea in any event.

Posted by Mark Liberman at 02:06 PM

General Motors Purchases Algonquian Languages

Claire Bowern's joking suggestion that corporate sponsorship for linguistic fieldwork could be obtained by selling the right to name languages and use words from them as product names recalls this item, which appeared in the Watley Review last June. It begins:

General Motors (GM) has announced the purchase of exclusive rights to the entire Algonquian language family, including such well-known tongues as Cheyenne, Cree, and Mohican, in a $1.6 billion dollar deal.

and goes on to explain that:

GM acquired the languages in an apparent effort to secure the rights to potentially thousands of cool-sounding names for automobiles. With one of the least creative management structures in the automotive industry, GM has for years produced cars with increasingly lame names that have hurt sales.

As Mark Liberman comments, corporate sponsorship, of a somewhat different form, may actually be worth pursuing. But there is more to this story. Last August or September I sent links to this item to several native friends in British Columbia, who in turn passed it on to others. I soon received queries, one of them from the Maritimes, from people concerned and outraged because they thought that it was a real news story! [Just in case anybody has missed this, the Watley Review is satire. This is NOT a real news story.] The history of exploitation of native people is such that even very sophisticated native people consider the purchase of rights to a native language family within the realm of possibility. (One of the people who wrote me concerned that the story was true is a lawyer with many years' experience in politics.) This just goes to show how tricky the politics of work on endangered languages can be.

Posted by Bill Poser at 01:31 PM

The metaphysical ashes of conscious awareness

Not far from Elsinore, adolescents are pondering the messages of ghosts again. Here is a Flash site for the project (somewhat unhelpfully laid out, in my opinion), and here is an overview of the project in .pdf form. These "ghosts" are interactive agents that learn and compete for users' ITUs ("interest token units") in order to increase their fitness, measured in terms of "vialence" ("viability valence").

A key quote from the paper:

The notion of ghosts has been chosen because of traditional characteristics of ghosts found in the popular literature and in folk tales:
1. Ghosts are mostly invisible or only vaguely visually manifested
2. Ghosts are often bound to a specific location which often has a very special relation to the ghost
3. Ghost owe their twilight status to some unfinished business and they are therefore active and striving
4. Ghosts only appear when called upon or if they feel an urge to manifest themselves
These features are heuristically very interesting for developing functionally satisfactory agent based assistance while keeping the technical requirements at a minimum.

After a brief scan of their site, I can add some other characteristics of ghosts that are useful from the point of view of system designers:

1. Ghosts are not always helpful or well intentioned.
2. Ghosts don't always tell the truth (or even know what it is).
3. Ghosts are often annoying even when they are entertaining.

If you're not familiar with this literature, you may enjoy reading about Cobot, an adaptive conversational agent who was purportedly artificial rather than metaphysical.

It's not really a surprise that Danish AI researchers didn't think to include Hamlet's ghost among their cast of meta-characters, but we can hope that some clever playwright will re-present Hamlet in a setting modeled on this new campus environment. In particular, I feel that Thin Lizzy would be able to provide helpful commentary on Shakespeare's original.

[via slashdot]

Posted by Mark Liberman at 01:22 PM

Pillow talk

Like Samuel Pepys, Sei Shonagon now has a blog.

But Plato was there first.

Posted by Mark Liberman at 11:53 AM

More on Hitt on language loss

A few final eddies in the dust kicked up by Jack Hitt's NYT magazine story on the death of a Chilean language: Kerim Friedman has a thoughtful exploration of the issues that sketches what a good magazine article on the topic might have been like, and Claire Bowern provides a wicked little recipe for cooking up more articles like Hitt's, should anyone care to do so.

Posted by Mark Liberman at 11:07 AM

Expression's vast varieties

Here's a late 18th-century snowclone, exactly of the original "words for snow" genre. Samuel Bishop, Epigram CCLXXVI (published posthumously in 1796).

ALIUSQUE ET IDEM.

In Araby, learned linguists say,
So copious is the vulgar phrase,
That speech at pleasure can display
The lion's name five hundred ways.

But while thus, column after column,
Expression's vast varieties fall,
These, though enough to fill a volume,
Mean but one lion after all.

Or else perhaps, with evident cause
A doubt might rise, which most would scare ye?
The lion's titles?---or his claws?
The desart?---or the Dictionary?

As far as I can tell from the Arabic dictionaries available to me, the basic claim about the number of words for "lion" is false.

I'm reluctant to suspect Bishop of originality here. Can anyone supply an earlier example of this trope?

Posted by Mark Liberman at 10:36 AM

Corporate sponsorship for language documentation

In a comment on Language Hat's site, Claire Bowern writes

When I was an undergrad we came up with a sponsorship scheme to raise money for our fieldwork. For the cost of enough fieldwork to produce a reference grammar, dictionary and texts, your company would get the right to name the language (many of the languages had no name the speakers used), unlimited rights to vocab items for product names, your company logo on the front of the dictionary, etc..

(Before anyone takes me seriously, we came up with this idea at the pub one Friday evening)

I think this should be taken very seriously indeed.

Not the parts about naming the language, or vocabulary ownership. And I think it's better to see it in terms of language documentation rather than field work, focusing as far as possible on local institutions as active partners. But the company logo on the dictionary is a fine idea, and could be a Good Thing for all concerned.

Every large company spends a lot of money every year on "good works", including educational and cultural projects. I once tried to raise money for making dictionaries of African languages from large companies doing business in the areas where the languages are spoken. I failed, and I concluded that it would not be easy to succeed. The idea takes some getting used to, and it's a lot of work even to get to talk with the people who make decisions about how this sort of money is spent, and these people sit in the middle of a web of obligations and constraints that has to be understood and dealt with. In the end, I couldn't put in the time that it would clearly take.

However, it's basically an excellent idea. There are many languages (including many in no imminent danger of extinction) that lack good dictionaries, or even any dictionary at all. A properly-done lexical database can be tapped to produce dictionaries for native-speaker school use as well as versions suitable for second-language learners and scholars.

The tangible results are a source of pride (and therefore gratitude) on the part of native speakers. These results last a long time -- such dictionaries continue to be used for decades after they are first published -- while many other worthwhile cultural programs such as the standard dance troupe tour are relatively ephemeral.

In addition, the process of creation can be valuable in itself. Dictionary projects can be an excellent training ground for practical linguistics, including especially native-speaker students, and can connect well to literacy programs, compilations of traditional and popular culture, and so on.

So it's a win/win/win solution -- for the sponsors, for the speakers, and for linguists. All we have to do is to explain the idea persuasively to the vice presidents for charitable and cultural activities at the world's multinational corporations.

Well, there's also the little matter of making sure that the process, once funded, actually works. For example, there isn't any good open-source lexical database software suitable for this application. And dictionary projects notoriously divide into two classes, roughly those that converge on a result and those that don't, and we'd have to find a way to guarantee that nearly all the funded projects fall into the first class, or the whole idea would fall apart. But still, I think that Claire's idea is a good one.

Posted by Mark Liberman at 10:27 AM

Foreign Asset Control and Censorship

In a previous post, Chris Potts mentioned the Treasury Department's claim that the US law restricting trade with certain foreign countries prohibits scientific journals from editing papers originating in those countries. Our colleague Language Hat has also raised this issue. Here is some additional information.

The Treasury Department's Office of Foreign Assets Control has taken the position that publication of papers originating in embargoed countries is legal but that editing such papers or translating them into English is illegal because it constitutes the provision of a service to residents of the embargoed countries. The law imposes penalties consisting of fines of up to $500,000 and 10 years in jail. The Institute of Electrical and Electronics Engineers has reluctantly accepted OFAC's interpretation and suspended publication of papers from the embargoed countries. The American Association for the Advancement of Science does not accept the OFAC interpretation. According to this report, the American Chemical Society initially suspended editing of papers from the embargoed countries but has now rejected the OFAC interpretation and resumed publication of such papers. The Linguistic Society of America is looking into the matter but has not yet taken a position.

Declining papers on the basis of their country of origin clearly runs counter to the goal of most scientific organizations, namely that of promoting knowledge of science. It is also bad public policy. All too often trade embargoes have little effect on the elite that control the government and merely make life more difficult for ordinary people. I'd say it's pretty clear that the US embargo on trade with Cuba has been an abject failure of exactly this type, and we learn almost daily of how Saddam Hussein and his cronies lived in luxury and diverted money intended for food and medicine while most Iraqis suffered. But even if we concede that a trade embargo may in some circumstances be effective in crippling a rogue government or deterring terrorism, there is not the slightest reason to believe that preventing the publication of scientific papers will have such an effect. Indeed, such unofficial communication between hostile nations tends to humanize the enemy and improve the prospects for peace and cooperation.

There is good reason to consider the Treasury Department's interpretation of the law to be wrong. This interpretation depends on the notion that editing a paper constitutes a service to the author. (For details, see Executive Order 13224, which you can download as a PDF file here.) Although the publication of the paper may be of some long-term benefit to the author's career, the principal purpose of editing is to bring the paper into conformity with the style of the journal and to improve the experience of the journal's readership. Editing provides a service to the journal's readers. Moreover, to the extent that editing provides a service to the author, the service is not of such a nature as to be of any economic, political, or military benefit to the government of the country in which the author resides. Prohibiting editing of journal articles is not going to deter terrorism.

The Treasury Department's position is contrary to the intent of Congress. In fact, it seems to be in clear violation of the statute as amended. The Berman Amendment, named after its sponsor, Congressman Howard Berman, prohibits the executive branch from interfering "directly or indirectly" with trade in "information or informational materials". That certainly seems to tell it to keep its hands off journal articles.

[Update 2004/03/07: Congressman Berman has written a letter to Richard Newcomb, the Director of OFAC, taking exception to OFAC's interpretation of the law.]

In any case, I believe that the law as interpreted by the Treasury Department is invalid. It contravenes the First Amendment guarantees of freedom of speech and of the press:

Congress shall make no law respecting an establishment of religion, or prohibiting the free exercise thereof; or abridging the freedom of speech, or of the press; or the right of the people peaceably to assemble, and to petition the government for a redress of grievances.

as well as Article 19 of the Universal Declaration of Human Rights, for which the United States voted:

Everyone has the right to freedom of opinion and expression; this right includes freedom to hold opinions without interference and to seek, receive and impart information and ideas through any media and regardless of frontiers.

[The Universal Declaration of Human Rights is available here in over 300 languages. The United Nations Association in Canada has information about the history and status of the Declaration here.]

Posted by Bill Poser at 12:18 AM

March 02, 2004

Anggarrgoon

A new language blog is born! Check out Claire Bowern's new blog Anggarrgoon. Claire is a graduate student in linguistics at Harvard, soon to be a doctor, whose expertise includes, among other things, Australian aboriginal languages.

Posted by Bill Poser at 11:04 PM

Koko in the chat room

I'm less skeptical about the communicative and even "propositional" capacities of non-human animals than my colleague Geoff Pullum is, but for some other examples of the kind of anthropomorphizing that he's talking about, read this interview with Koko the gorilla.

Questions were relayed in sign language by Penny Patterson, and Koko's signed responses were transcribed in real time by another friendly observer.

"HaloMyBaby is the moderator of the chat on AOL, DrPPatrsn is Koko's friend and trainer, and LiveKOKO is Koko the gorilla." Here are a few passages where the thread of the discussion is a bit divergent. In fairness, I should say that I've had somewhat similar conversations with certain humans.

HaloMyBaby:	MInyKitty asks, Koko are you going to have a baby in the future?
LiveKOKO:	Pink
DrPPatrsn:	We've had earlier discussion about colors today

Or again:

HaloMyBaby:	SBM87 asks, What are the names of your kittens? (and dogs?)
LiveKOKO:	foot
DrPPatrsn:	Foot isn't the name of your kitty
HaloMyBaby:	Koko, what's the name of your cat?
LiveKOKO:	no
DrPPatrsn:	She just gave some vocalizations there... some soft puffing
HaloMyBaby:	I heard that soft puffing!
DrPPatrsn:	Now shaking her head no.
Question:	Do you like to chat with other people?
HaloMyBaby:	That was from Rulucky!
LiveKOKO:	fine nipple
DrPPatrsn:	Nipple rhymes with people, she doesn't sign people per se, she was trying to do a "sounds like..."

Posted by Mark Liberman at 07:28 PM

Monkeys saying things again -- NOT

Andrew Carstairs-McCarthy in the latest Science mentions the view that although animals are capable of "knowing how..., e.g., how to get food", they are not capable of attaining propositional knowledge, or "knowing that", since "only humans have language, by means of which propositions can be entertained or expressed". However, he then argues against this view, citing the work of Herb Terrace in defense of the propositional competence of at least some non-human primates. He says:

Terrace's research with macaques (humble monkeys, not apes) casts doubt on the claim that only humans have declarative knowledge. To obtain a food reward, macaques can quickly learn to punch a sequence of five or more symbols on a keyboard--and do it consistently right, even when the keyboard configuration is randomly shuffled. Thus, the macaques learn not just a sequence of manual movements (analogous to learning a passage on the piano) but an abstract sequence of symbols, whose application involves different motor commands on each occasion.

I say this is not propositional knowledge. This is a complex form of knowing how to get food, as Carstairs-McCarthy almost admits ("To obtain a food reward", he notes). I've said this before and I'll now say it again: contra all the stupid stories people tell and are prepared to believe about communication with apes and parrots and dolphins and other worthy but stubbornly uncommunicative beasts, I see not a flicker of evidence that any animal has ever expressed a proposition (and they may well, therefore, never have understood one, either).

I found something on the web that's relevant to this, something that you may not have seen. It concerns an inside report about Kanzi, the allegedly language-competent bonobo.

Steve Jones, of Perth (Western Australia), posted this message in an archive of discussion about evolution hosted by the American Scientific Affiliation, an organization devoted to "science in a Christian perspective". In it he says that a friend of his on another list (a closed list, apparently, so he felt he should not give the friend's name) had posted the following about Kanzi. Keep in mind while you read the quote below that the experiments with Kanzi are widely regarded as perhaps the most successful experiments on communication ever done with any animal.

I am amused by the Ape Story, mostly because I have met Kanzi! My Philosophy of Mind professor ... was a thorough naturalist, and thought it his responsibility to let us all know about the mental capabilities of our nearest relatives. So, we took a field trip to Rumbaugh's laboratory to see Kanzi, the famed bonobo, and his sister Panbonisha.

I was distinctly unimpressed. My class had been told about Kanzi's ability to understand complex commands, but he refused to perform or obey when we were present. The Rumbaughs had a huge electronic board with hundreds of symbols on it; whenever a symbol was pushed, the board would electronically pronounce the word associated with the symbol. This is how the bonobos are supposedly able to communicate as well as a three-year-old human. Again, Kanzi refused to push any of the symbols; his sister Panbonisha did push some of the symbols repeatedly, but it was difficult to tell if she was really communicating or just having fun making noise. For example, Panbonisha pushed a button repeatedly that said, "Chase." Of course, the trainers were happy to offer extensive commentary and interpretation: "See, she's trying to say that you [one of the humans] should chase him [another human]. She loves the game of chase." All of the alleged communication consisted of the ape pushing a button, and the trainers giving elaborate exegesis thereupon.

My personal opinion is that the Rumbaughs are possibly guilty of a little wishful thinking. And as for the assertion that Kanzi has the language abilities of a 3-year-old, I could read the newspaper at 3. Kanzi's nowhere close.

Comment from me would be almost superfluous. Except to say that there is a similar anecdote about a different chimpanzee reported in Joel Wallman's book Aping Language about a native ASL signer who was once employed, along with some other graduate students, to hang out with a famous allegedly signing chimp, and keep notes on what she supposedly said. It turned out he got the reputation of being the slacker on the project, because he never wrote down very much in his notebook. He said there wasn't much to write. The non-native signers were saying "Ooh, look, isn't that the sign for water, she must be thirsty" and wrote down that the chimp had asked for water, and this native signer just wasn't seeing that anything in his native language had been uttered at all. (I have since learned that the native signer in question, whom Wallman does not name, was Ted Supalla, now a Professor of Brain & Cognitive Sciences, Linguistics, and American Sign Language at the University of Rochester.) Much the same story as reported above for Kanzi, in other words. As for an actual declarative sentence, a statement that something was true or false? Fuhgeddaboutit.

I've said before and I'll say it again (what I tell you three times is true): I do not believe that there has ever been an example anywhere of a non-human expressing an opinion, or asking a question. Not ever. It would be wonderful if animals communicated propositionally -- i.e., could say things about the world, as opposed to just signalling a direct emotional state or need. But they just don't.

Posted by Geoffrey K. Pullum at 06:42 PM

Clairvoyance? No, just utterance processing

I hopped into the car for a ride to the UCSC campus just after 9 a.m. this morning and as Barbara started the engine the radio came up. She didn't want the radio on for her commute to San Jose, so (driver's privilege) she hit the button and popped it off immediately. There was only time to hear a familiar voice saying "vanities".

"It must be Tom Wolfe's birthday," I said. And indeed it is. And suddenly I thought, how on earth did I know that?

Well, the voice was Garrison Keillor's. At 9 a.m. every day he has a little 5-minute piece called ‘The Writer's Almanac’ on our local public radio station, KAZU, to which the car radio is usually tuned. The program begins by listing some famous authors whose birthday is that day. Just before the one word I caught, there was a hint of something like vth. It sounded like I had just heard the end of Garrison Keillor saying of the vanities.

Now, vanity is a non-count noun, only very rarely used in the plural. The only salient place anyone is likely to have heard it is in the title of the novel The Bonfire of the Vanities. The author of that novel was Tom Wolfe, who is amply famous enough to get a mention on ‘The Writer's Almanac’. What I had realized instantly, without any real processing time at all, was that Garrison must have included Wolfe in the list of birthday identifications for today.

Now that is how natural language works. Naive accounts talk about using sentences to express messages; naive teachers tell children to answer questions with full sentences; naive models of sentence processing assume we listen until the last word, use the grammar to verify grammaticality, and select one of the possible meanings as the most likely one to convey appropriate information in the context. But all I heard was Garrison Keillor's voice saying perhaps most of a preposition phrase, of the vanities, and really only the last word of it. For me there was no sentence. From a single second's access to just part of a part of a sentence, I was able to identify the speaker from the voice quality, spot the word, reconstruct the phrase, and make a comment which relied on my having guessed the truth conditions of the rest of the sentence (probably "Today is the birthday of Tom Wolfe, author of the novel The Bonfire of the Vanities, which..."). Speaker recognition, phonetic identification, phonological analysis, lexical lookup, morphological analysis (spotting that vanities was in the plural), syntactic parsing, semantic interpretation, and pragmatic implications, all happening simultaneously and virtually instantly. Just a few seconds in the life of a speaker. Everyone is doing this sort of thing all the time.

If we could write a computer program to reliably model the syntactic analysis of complete sentences with no errors as presented in written form, perhaps with the literal meaning attached, but without any sensitivity to context, that would be a great achievement; but it would be nowhere near a computational account of what human beings are actually doing with their linguistic knowledge all the time, every day. As much psycholinguistic work by Ray Gibbs has shown, we bounce from partially-heard fragments in complete inferential leaps to brand new information on the basis of pragmatically conveyed propositions, as if extra reasoning over and above the grammar and the literal meaning took no time whatsoever. How we do that is something that — after fifty years of fairly intensive linguistic and psycholinguistic work and two or three decades of increasing attention to computational linguistics — we barely even have glimmerings of.

Posted by Geoffrey K. Pullum at 02:51 PM

No Future in Canoes?

David Beaver points out that the claim in Jack Hitt's article Say No More in the New York Times Magazine that Kawesqar has no future tense because people who spent their time travelling around in canoes had no use for one is a non-sequitur. I can confirm this because I know a language of people who also traditionally spent much of their time travelling around in canoes that does have a future tense.

Carrier, the native language of a large portion of the central interior of British Columbia, is the language of people who were traditionally "nomadic". Now, this doesn't mean that they wandered around randomly. In fact, they had what is called a seasonal round. That means that they went different places at different times of the year. In the summer the most important thing was catching fish, which were smoked and dried for the winter. The most important fish are anadromous fish, fish like salmon that are caught when they run upstream in large numbers to spawn. So Carrier people would go to different fishing sites at different times. Picking berries was another important summer activity, and again, you had to go where the berries were, at the right time. In the winter people would go to a winter camp. The result is a lot of travelling around, but there is a pattern to it, and it is over familiar ground. I don't know anything about the Kawesqar, but I bet that they too moved around as necessary to obtain resources within a familiar territory.

Traditional Carrier society was not only "nomadic", but very much oriented to the water. Fish was by far the most important food. Water was the principal means of travel. Even today, directions are usually given with respect to the flow of water. If you're in the village of Tache, 60 km northwest of the town of Fort Saint James, on the shore of Stuart Lake, where the Tache River enters the lake, and you plan to go into Fort Saint James, what you will say is [ndaʔ tisgu], literally "I am going to drive downstream". If you study the distribution of Carrier dialects, you will find that the closely related dialects are those adjacent to each other along waterways. Indeed, the Carrier name for themselves is [dakeɬ] "people who travel by boat". [The symbol ɬ represents a voiceless lateral fricative, the sound written ll in Welsh.]

So, does Carrier lack a future tense? Not at all. It has a perfectly fine future tense. Here is "to go around by boat" in the future tense. The rows give the person, the columns the number of the subject. For example, the form [nʌtiskeɬ] in row one, column one means "I am going to go around in a boat".

	singular	dual	plural
1	nʌtiskeɬ	nʌtakeɬ	nʌztikeɬ
2	nʌtankeɬ	nʌtihkeɬ	nʌtihkeɬ
3	nʌtikeɬ	notikeɬ	notikeɬ
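
For readers who like to see a paradigm as data, the table above can be read as a lookup keyed by person and number. What follows is only a minimal sketch in Python: the names FUTURE_GO_AROUND_BY_BOAT, future_form and syncretic_cells are hypothetical, introduced purely for illustration, and the forms are copied from the table.

```python
# Future-tense forms of Carrier "to go around by boat", copied from the table above.
# Keys are (person, number); the dictionary and function names are illustrative only.
FUTURE_GO_AROUND_BY_BOAT = {
    (1, "singular"): "nʌtiskeɬ",
    (1, "dual"): "nʌtakeɬ",
    (1, "plural"): "nʌztikeɬ",
    (2, "singular"): "nʌtankeɬ",
    (2, "dual"): "nʌtihkeɬ",
    (2, "plural"): "nʌtihkeɬ",
    (3, "singular"): "nʌtikeɬ",
    (3, "dual"): "notikeɬ",
    (3, "plural"): "notikeɬ",
}


def future_form(person, number):
    """Return the future form for a given person (1-3) and subject number."""
    return FUTURE_GO_AROUND_BY_BOAT[(person, number)]


def syncretic_cells():
    """Group the cells of the paradigm that share a single form."""
    groups = {}
    for cell, form in FUTURE_GO_AROUND_BY_BOAT.items():
        groups.setdefault(form, []).append(cell)
    # Keep only forms that fill more than one cell of the table.
    return {form: cells for form, cells in groups.items() if len(cells) > 1}


if __name__ == "__main__":
    print(future_form(1, "singular"))
    for form, cells in syncretic_cells().items():
        print(form, cells)
```

Laid out this way, it is easy to see that in this paradigm the dual and plural share a form in the second and third persons but not in the first.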

Although Carrier has a future tense, strictly speaking it has no distinction between present and past tense. It has something that is similar, but it is really a distinction between imperfective and perfective aspect. To a first approximation, you could translate [nʌske] as "I am going around in a boat" and [nʌsʌski] as "I went around in a boat", but strictly speaking the distinction is between an ongoing activity and a completed activity, and the first, imperfective form must be used when we are talking about an ongoing activity in the past.

Now, if you follow Jack Hitt's line of thought, this shows a lack of correlation between language and way of life. Actually, I don't think that it does. A way of life like that of the Carrier involves moving around, but it certainly does involve planning ahead. You have to know when each species of fish will run and plan on being there at the right time. You have to know when different kinds of berries ripen. The traditional names for times of the year, now used as month names, reflect this. Here they are in the Stuart Lake dialect:

January	satɕo uzaʔ	time of the big moon
February	tɕʌzsʌl uzaʔ	moon of the middling snowflakes
March	tɕʌztɕo uzaʔ	moon of the large snowflakes
April	ʃin uzaʔ	moon of ground bare of snow
May	dʌgus uzaʔ	moon of sucker fish
June	daŋ uzaʔ	moon of full summer
July	talo uzaʔ	moon of the salmon
August	gesʌl uzaʔ	moon of the kokanee
September	bit uzaʔ	moon of the char
October	ɬoh uzaʔ	moon of the whitefish
November	banɣan nʌts´ʌkih	we go around by boat half the time
December	satɕo dinʔai	eve of the big moon

Most of all, you have to stock up for the winter. Traditionally, you could do some hunting in the winter, but travel in the winter is difficult and game not all that plentiful. You could fish under the ice, but you can't do that during break-up, when your supplies are stretched thin. If you didn't put away enough food in the summer, there was a good chance that you would starve to death. In such a world, you'd better be able to plan ahead and discuss your plans with other people.

Although the example of Carrier therefore doesn't demonstrate a lack of correlation between culture and grammar, this particular correlation in fact doesn't hold up. No one would claim that modern Japanese culture is one in which it is unnecessary to talk about the future, but Japanese has no future tense, not even a periphrastic one like English. That doesn't mean that you can't talk about the future. Of course you can. You can always make clear what time you are talking about by using an expression like "tomorrow" or "next year". But there is no future tense.

Posted by Bill Poser at 12:31 PM

Science Magazine on Evolution of Language

Science magazine's current issue (2/27/2004) is focused on "Evolution of Language." The cover shows two facing talking-head silhouettes with the word "science" printed in "Japanese, German, Bengali (Roman script), Bahasa Malaysia, Bengali (Bengali script), Tamil, Cherokee, Swahili, Asante Twi, Hindi, Finnish, Slovak, Albanian, Arabic, Italian, Spanish, Tibetan, Russian, Dutch, Malayalam, Chinese, Hebrew, Hawaiian, and Swedish."

As usual for Science, this focus means that there are some "News", "Viewpoint" and "Books" pieces on the topic. The "Research" section of the magazine is not affected, and covers whatever topics are normally queued up for publication.

The language-related table of contents is given below. If you're reading this, I'll infer that you're interested in language, and will want to read these articles. I'm not clear whether they are among the "other selected materials" that free registration for "partial access" makes available, but this will get you at least the table of contents and summaries. Most academic libraries and many public libraries should have subscriptions (both on line and in paper form). Or you could join the AAAS...

ARCHAEOLOGY:
Continuing the Debate on Words and Seeds
Steven Mithen
Science 2004 303: 1298-1299. (in Books)

LANGUAGE:
Many Perspectives, No Consensus
Andrew Carstairs-McCarthy
Science 2004 303: 1299-1300. (in Books)

ARCHAEOLOGY:
As Our World Warmed
Lawrence Guy Straus
Science 2004 303: 1300-1302. (in Books)

The First Language?
Elizabeth Pennisi
Science 2004 303: 1319-1320. (in News)

Speaking in Tongues
Elizabeth Pennisi
Science 2004 303: 1321-1323. (in News)

Search for the Indo-Europeans
Michael Balter
Science 2004 303: 1323. (in News)

From Heofonum to Heavens
Yudhijit Bhattacharjee
Science 2004 303: 1326-1328. (in News)

The Future of Language
David Graddol
Science 2004 303: 1329-1331. (in Viewpoint)

Software and the Future of Programming Languages
Alfred V. Aho
Science 2004 303: 1331-1333. (in Viewpoint)

Of Towers, Walls, and Fields: Perspectives on Language in Science
Scott Montgomery
Science 2004 303: 1333-1335. (in Viewpoint)

Posted by Mark Liberman at 07:10 AM

There's no future in canoeing

My personal favorite from the NYT magazine article that Mark Liberman has already puzzled over (first here, and then here) was the following simply wonderful non sequitur:

"For example, because of the Kawesqar's nomadic past, they rarely use the future tense; given the contingency of moving constantly by canoe, it was all but unnecessary."

It has often puzzled me why modern Germanic languages lack future tenses, and instead make do with an impoverished selection of auxiliaries of indeterminate meaning. The Indo-Europeans (famous, coincidentally, for their smug stay-at-home complacency) had a great tense system, which they obviously used to great effect in order to speculate on the outcome of tomorrow's game, whether it was really worth going out when it was almost certain to rain, and whether tomorrow would be the same as today and the day before and the day before that.

The loss of the future tense presumably occurred in Gothic times. "Goth" is thought to relate to the Gothic gutans = "pour"+ppl, as in "keep pouring", and I had taken this to show that the love of liquor was a matter of identity for the Goths. This naturally led me to suppose that it was Gothic substance abuse that eventually eroded the future tense. But now I stand corrected. It was not in fact their heavy drinking, but the Goths' peripatetic lifestyle that produced the famous IE tense leveling: every time those Gothic warriors jumped in a boat, they threw out all but the strictly necessary morphemes in favor of an extra round of sandwiches and a couple of clubs.

Posted by David Beaver at 04:17 AM

sesquidecennium

No doubt the net is often a fruitful source of data, as we are frequently reminded here on Language Log, but not only does it turn up misspellings, errors, and things that aren't what you were looking for, sometimes things that really ought to be there for some reason aren't. As a lark, I looked for the word sesquidecennium. This is a perfectly fine English word, of transparent meaning, which I have actually used in print. Curiously, it isn't listed in the Merriam-Webster Online nor in One Look. Google produces only two hits. Either Google isn't searching the right pages, or people aren't getting their money's worth out of this fine word. It has been a sesquidecennium since I used this word. I think it's about time I did it again.

Posted by Bill Poser at 02:00 AM

March 01, 2004

Cohering into families on the run

I guess this deserves a separate post. In Jack Hitt's Sunday NYT Magazine piece "Say no more", along with the things I've already discussed at excessive length, there are many small puzzles, one of which Nicholas Widdows emailed about:

Another bit of astonishing nonsense in that perplexing article was this, on the bottom of page 2:

'Then the multitude of idioms developed on the run cohered into language families, like Indo-European, Sino-Tibetan and Elamo-Dravidian'

Now I'm sure that's not what it said in the book he got it from. (Well, I hope it's not.)

Right. Darwin, who modeled his ideas about "descent with modification" on philological accounts of language evolution, would have gone off in a very different intellectual direction if he had construed historical linguistics as Hitt does. Like rivers joining as they flow to the sea, or political tendencies joining in the social fabric of the republic, the multitude of birds and mammals, in all their diversity, eventually cohered into the great family of reptiles. Not.

[Update 3/3/2004:

Bill Poser emails:

It occurs to me that, given the sloppiness with which Jack Hitt writes about the linguistic side of things, it is possible that what he meant when he wrote about the "multitude of idioms cohering into language families..." is not what we interpret him to mean, namely lots of unrelated languages assimilating to each other to the point that they appear to be genetically related. He may simply have meant that lots of little languages died out to be replaced by a relatively small number of widespread languages that gave rise in turn to a fair number of daughters. Not that I wish to defend either silly linguistics or sloppy writing, but just as one should not attribute to malice what is explained by stupidity, perhaps one should not attribute to foolishness what is explained by sloppy expression.

I'm reminded of a Gamble Rogers story about Still Bill trying to trade his dog. The other guy tries to lead the dog out into the yard to take a look at her, and she bumps head-first into the door jamb, backs up and makes it through the door, stumbles over the sill and tumbles down the back steps, caroms off the shed and fetches up against the fence upside down. The prospect complains that Bill is trying to trade him a dog that's stone blind. Bill's response?

She ain't blind -- she just don't care!

That's my diagnosis in this case as well.]

Posted by Mark Liberman at 02:32 PM

More on animal communication

Diaries translated from Dog and Cat.

Posted by Mark Liberman at 02:10 PM

Couch biking

Neither of the two words is likely to evoke the other in a word-association experiment. At least in my opinion, it's not a good band name. It doesn't even seem appropriate as a spammer's pseudo-identity. However, it's a good story, with pictures. [via Dave Barry's blog].

Posted by Mark Liberman at 01:42 PM

Mr. Spam Man

I've noticed that the pseudonyms in the From: field of the spam I receive are often rather peculiar, so over the past few days I've made a little collection. A few of them look like they are generated randomly without regard to phonotactic constraints. Either that, or Rqhegqc Uohadoj hails from a place with whose language I am unfamiliar, possibly another planet. Others, such as Cocix Kubofyh, look like they are randomly generated by a system that pays attention to phonotactic constraints, though these could be the chance result of an unconstrained system producing some results that happen to conform.

The majority consist of real words or words very similar to real words. My guess is that the spammers generate the names by random selection from lists of real words, possibly with some random changes too. Here are some examples:

Corpuscles S. Alter
Decalogue J. Renowning
Quartet P. Depress
Wilburn Houser
Agglomerations B. Malls
Service Kwazi
Shauna December
Sarah WacArnolds
Whirling I. Screenings
Denmark R. Willful

I don't know whether these people work together with the ones who generate band names.
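
The guess above, that these names come from random selection out of lists of real words, can be sketched in a few lines of Python. This is only an illustration of the hypothesis, not a reconstruction of any actual spammer's software: the tiny word list, the function name fake_sender and the "First I. Last" template are all assumptions made for the example.

```python
import random
import string

# Hypothetical word list for illustration only; a real generator would
# presumably draw on a much larger dictionary file.
WORDS = ["corpuscles", "decalogue", "quartet", "whirling", "denmark",
         "alter", "depress", "malls", "screenings", "willful"]

def fake_sender(rng, words=WORDS):
    """Build a 'First I. Last' pseudonym from two distinct random words and a random initial."""
    first, last = rng.sample(words, 2)          # two different words from the list
    initial = rng.choice(string.ascii_uppercase)  # middle initial chosen uniformly
    return f"{first.capitalize()} {initial}. {last.capitalize()}"

if __name__ == "__main__":
    rng = random.Random(2004)  # fixed seed so that runs are repeatable
    for _ in range(3):
        print(fake_sender(rng))
```

Names that are "very similar to real words" but not quite words would need an extra step, such as swapping a letter or two after selection, which this sketch leaves out.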

[Update 2004/03/01: Glen at Agoraphilia also noticed these curious names. And I just received a piece of spam with a particularly humorous pseudonym: Lassoing F. Tiresomely. Another particularly good one just turned up: Reevaluating I. Dogmatist. And another: Evacuees M. Colorblind.]

Posted by Bill Poser at 01:27 PM

Ten thousand statistically-average fake band names

Two great links: The Quest for Ground Truth in Musical Artist Similarity, by Dan Ellis et al., and Ten Thousand Statistically Grammar-Average Fake Band Names. Enjoy.

Posted by Mark Liberman at 09:52 AM

A shibboleth of gentility: [h] from William Shakespeare to Henry Higgins

I wrote earlier that "it seems that there was a period centered around 1800 when 'an hero' was common, as suggested by this histogram of the death dates of the 60-odd authors that lion.chadwyck.com finds for the search string 'an hero'. By 1900, 'a hero' is all that is found; and the pre-1700 citations also seem to be mostly of that form, though there are not many of them."

Thanks to Bill Labov, I can now give the story behind this little graph, based on information from Henry Sweet, Otto Jespersen and others. The summary: h-loss was a feature of some London speakers since at least the mid-16th century, and probably affected Shakespeare; during the 18th century h-loss spread rapidly to all classes and regions of England except the far north, without being stigmatized at first; in the 19th century, it was beaten back, in part by explicit prescriptivist pressure, so that [h] became a strong "shibboleth of gentility"; this pressure leaked out to cause the introduction of [h] into words where it had always been spelled but never pronounced, like hospital and humour. Meanwhile, Americans missed the action almost entirely, though the English fad for h-loss did apparently affect a few colonials such as John Adams.

Henry Sweet wrote in his New English Grammar (published in two parts, 1892 and 1895):

894. Initial (h), which was preserved through First and Second MnE, began to be dropped at the end of the last century, but has now been restored in Standard E. by the combined influence of the spelling and of the speakers of Scotch and Irish E., where it has always been preserved. It is also preserved in American E., while it has been almost completely lost in the dialects of England -- including Cockney E. -- as also in vulgar Australian.

Otto Jespersen, in his Modern English Grammar (1949), gives many details (section 2.943) about the history of the "mute h" (originally in words taken from French), observing that "in many ... words where h is marked as mute by early orthoepists, the tendency to pronounce according to the spelling has become increasingly powerful." Among his examples are humble, inherit, heretic, homely, hypocrisy, hospital, heritage, humour, for which he documents considerable variation in the opinions of various authorities. He says that "humour and hotel are now pronounced with [h] by some educated speakers, without [h] by others." Finally, he says that "[i]n such words as are taken directly from Latin or Greek or as suggest a learned origin, though they may originally have come from French, h is pronounced: heredity, hero, heroism, hemisphere." [emphasis added -- it seems that hero was a good probe].

In section 13.6, Jespersen discusses the history of "loss of /h/", breaking the topic down into "several different phenomena, some of which are universal, while others belong to vulgar or dialectal speech." He distinguishes h-dropping "in rapid speech in the weak forms of pronouns and the auxiliary verb have"; h-dropping "in the second part of a great many compounds, especially those in which the separate elements are not felt as independent words" (such as Chatham, dunghill, coffeehouse, hedgehog and falsehood); h-dropping "between a strong and a weak vowel" (annihilate, vehement, rehabilitate); and dropping initial h before a weakly stressed vowel (e.g. historical, Hungarian).

Jespersen agrees with Sweet that "in all English dialects, except the very northernmost ... [h] is completely lost as a significant part of the sound system, and the same is true of the vulgar speech of the towns." He mentions the insertion of [h] for emphasis or as a hypercorrection, and wisely observes this is normally done "without any regard to whether the word 'ought to' have [h] or not", but that "[t]he observer... to whom [h] or no [h] is significant, fails to notice the words that agree with his own rule, but is struck with the instances of disagreement, deducing from them the impression of a systematic perversion ("Am an' heggs")."

Jespersen has some interesting things to say about geography and history:

13.684. Initial [h] is preserved in Scotland, Ireland, and America. "The Yankee never makes a mistake in his aspirates," says Lowell ... [note that this means that Adams was atypical or perhaps England-influenced in writing "an hero" -- myl]

13.685. It is not easy to find out how old this English disappearance of [h] is. From the great local extension of the phenomenon, one would be inclined to look upon it as very old, though why should recent sound-changes be unable to spread pretty fast over a large area? As a matter of fact, I have not come across any older mention of it than 1787. Elizabethan and even 18th century authors, who represent vulgarisms so frequently, do not seem to use omissions and misplacings of h's as a characteristic of low class speech. E 1787 (vol. 2.254 ff.) complains of exactly the same errors in this respect as are met with nowadays ... W 1791 speaks of the 'fault of the Londoners: not sounding h where it ought to be sounded, and inversely.' B 1809 p. 29 says: 'the aspirate h ... is often used improperly.' ...

H.C. Wyld, in A Short History of Modern Colloquial English (1936), asks "when did the tendency arise to pronounce 'ill for hill, or 'ome for home, &c., when these and other words occur as independent words in the sentence?" He observes that "Norman scribes are very erratic in their use of h- in copying English manuscripts, and we therefore cannot attach much importance to thirteenth- or even to early fourteenth-century omission of the letter which occur here and there." He says that "I have found comparatively few examples in the fifteenth century of spelling without h-", though he does cite a handful that "seem genuine". The first "fine crop of h-less forms" that he finds is in the middle of the 16th century, in the writings of "the Cockney Machyn".

Wyld also observes that h-dropping is not stigmatized in these earlier periods: "Cooper does not include the loss of initial h- among his traits of 'barbarous dialect'". (This is Christopher Cooper's English Teacher, 1687). Wyld also indicates that the restoration of mute h was continuing through the early 20th century, writing that "[t]he restoration of an aspirate in [humour, humoured] is a trick of yesterday, and I never observed it until a few years ago, and then only among speakers who thought of every word before they uttered it." He also observed that in his day the h was still dropped in the phrase at home "by excellent speakers".

He quotes Elphinston 1787 as writing that "many Ladies, Gentlemen and others have totally discarded" initial h-. He adds that "Walker, 1801, also draws attention to the habit, which he attributes chiefly to Londoners." His conclusion: "it would appear that the present-day vulgarism was not widespread much before the end of the eighteenth century... The practice, which apparently did exist in Machyn's day in London, must have been confined to a limited class."

H. Kökeritz (in Shakespeare's Pronunciation, 1953) writes (p. 307 ff.) that

From the 15th century on we can witness a general tendency for initial h to be dropped in fully stressed words of Germanic origin, and conversely, for an inorganic h to be added to such words beginning with a vowel...

The implication that Shakespeare perhaps used to drop his h's has nothing startling or derogatory in it. As a matter of fact, the correct use of h had not yet become a shibboleth of gentility. Its omission was simply a colloquialism comparable to the loss of d and t ..., one that Shakespeare would almost certainly have picked up anyhow on settling down in London, for that most conspicuous feature of Modern Cockney, the dropping of h's, was then merely the local offshoot of the general tendency just referred to. ... Here colloquialisms jostled for supremacy with conservative or artificial pronunciations inspired by the spelling and inculcated by zealous orthoepists...

One of the pieces of evidence brought forward is Shakespeare's punning, examples of which include "Arden-harden, art-heart, ear-here, eat-hate, heir-hair, heir apparent-here apparent, here-year, hour-whore, and perhaps Hiren-Irene-hiring". Another kind of evidence is provided by the distribution of an in an happy, an hayre, an hundred; there is also (orthographic) elision (examples like t'have, t'hold, th'harmony, th'hoorded); and "inverted spellings" howlet for owlet, histy for yeasty and shagge-ear'd for shag-haired.

[h]-loss should be a particularly interesting phenomenon to investigate further, as a case study in the dynamics of language variation and change. The change went essentially to completion, in England anyhow, before being beaten back in the standard language, but it never took hold in America or in Scotland. The crucial time period (roughly from 1600 to 1950, and especially from 1700 to 1900), is very well documented. Finally, it's unusual in being a sound change that can be reasonably well tracked in the orthography, quite apart from the representation of colloquialisms or the frequency of misspellings, by looking at the distribution of a/an.
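
As a rough sketch of what tracking the change through the distribution of a/an might look like in practice, here is a small Python example. It assumes a corpus supplied as (year, text) pairs; the function names count_articles and an_share_by_century and the grouping by century are choices made for this illustration, and the sample data at the bottom are invented, not drawn from lion.chadwyck.com or any other real source.

```python
import re
from collections import Counter

def count_articles(text, word="hero"):
    """Count 'a <word>' versus 'an <word>' in text, ignoring case."""
    pattern = re.compile(r"\b(an?)\s+" + re.escape(word) + r"\b", re.IGNORECASE)
    return Counter(m.group(1).lower() for m in pattern.finditer(text))

def an_share_by_century(dated_texts, word="hero"):
    """Return {century: share of 'an <word>' among all a/an hits} for (year, text) pairs."""
    totals = {}
    for year, text in dated_texts:
        century = year // 100 * 100  # e.g. 1795 -> 1700
        totals.setdefault(century, Counter()).update(count_articles(text, word))
    # Centuries with no hits at all are skipped to avoid dividing by zero.
    return {c: n["an"] / (n["a"] + n["an"])
            for c, n in sorted(totals.items()) if n["a"] + n["an"]}

if __name__ == "__main__":
    # Invented sample sentences, for demonstration only.
    sample = [(1790, "He died an hero."),
              (1795, "Not a hero, yet an hero to some."),
              (1890, "He was a hero.")]
    print(an_share_by_century(sample))
```

A count like this only sees the spelling, so it would need to be read alongside the other kinds of evidence discussed above, such as puns and inverted spellings.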

Posted by Mark Liberman at 09:16 AM

(2)	"Napoleon is one of those figures one can admire without particularly liking. Sigmund Freud is another." Joseph Epstein. With My Trousers Rolled, p. 85.
(3)	"Something you can desire without ever being expected to strive for." Richard Russo. Empire Falls, p. 224.
(4)	"Please Inspect Before Using! [...] Please inspect your documents before using." (Fidelity Investments instructions for using new checks)
(5)	"Yet the peculiar thing (which Justine had seen too often before to wonder at) was that he seldom took her advice." Anne Tyler. Searching for Caleb, p. 45.
(6)	"Or else the scandal is alluded to without being named [...]" John Thorne. Simple Cooking, p. 198.
(7)	"And the letter had that awkward, semibureaucratic, semi-messianic style she had grown accustomed to without ever liking." John LeCarre. The Spy Who Came in from the Cold, p. 164.
(8)	"Brand-name foods contain things we've never heard of and should think about twice before allowing into our house." John Thorne. Serious Pig, p. 318.
(9)	"Homicide has such daredevil energy and intensity that it almost, but not quite, carries us past the many loose ends and red herrings that Mamet unleashes without knowing what to do with." Phillip Lopate. When writers direct. In Totally, Tenderly, Tragically p. 319.
(10)	"The roasted duck was first brought to the table in a copper sautoire for the diners to view before being carved." Michael Ruhlman. The Soul of a Chef, p. 250.