November 30, 2003


In the light of Mark's posting about the remark attributed to Guy Bailey about the origin of Southern r-lessness, I should withdraw my captious comments. No one knows better than I how the Times' editors (like others) occasionally make cuts that can leave a misleading impression, particularly when they're working under deadline -- and unfortunately, as I can also testify, it's always the writer who gets the mail.
Posted by Geoff Nunberg at 09:58 PM

Twang scholar on "the constraints of journalism"

I figured it was something like that.

It was great to see Ralph Blumenthal's piece on "Scholars of Twang" featured so prominently in the New York Times yesterday. It's not often that we see an engaging story about an interesting linguistic project on the front page of a national newspaper! But there were a couple of puzzling things in the interview with Guy Bailey that formed the core of the article. One was an account of the origins of U.S. r-lessness in terms of plantation owners sending their sons to England for schooling. Another was Guy's response when asked where fixin' to came from: "who knows?"

Pretty much any linguistically well-informed person would be puzzled about these aspects of the story, as Geoff Nunberg and I were, because there is a well-known story about the American distribution of r-lessness that is more complicated but also more interesting, and there is an obvious sort of answer about fixin' to, in terms of the specific history of fix and the general tendency of verbs of intention or preparation to get semantically bleached into mere tense or aspect. And I'd have bet money that Professor Bailey knows all of this much better than I do.

So I wrote to Guy to ask him what happened. His response with respect to r-lessness (posted with permission):

It was good to hear from you. The article was nice, but the stuff on the origins of r-lessness reflects the constraints of journalism. When asked about the origins of r-lessness in the U.S., I offered two or three different theories (including colonial education in England, an old theory by the way) and indicated that in the South, r-lessness was probably heavily influenced by the speech of slaves. Ralph (I assume, although editors may have shortened the article) chose to write about only one of them. Unfortunately, it's not the one I favor, at least for the South. On the whole though, Ralph did a good job.

And with respect to fixin' to:

The comment on fixin to was also part of a much longer explanation. I began by saying "who knows?" and then outlining "one possibility" -- a long, involved step by step process that Jan worked out a decade or so ago (but which she hasn't published) using OED and other dictionary citations. I have to admit that her derivation probably wouldn't make good news copy, although it is a process that parallels the similar grammaticalization of gonna.

In other words, as I thought, a combination of journalistic focus and editorial compression led to Guy being quoted in a way that doesn't accurately reflect what he knows and what he thinks.

This happens all the time, and not just to linguists. I've hardly ever read a piece of popular journalism, on a topic where I have independent knowledge, that didn't have at least one instance of this sort of thing. Journalists do misunderstand sometimes, and they want a good story, and they need a short one.

Does it matter? Well, it can be personally annoying -- and sometimes professionally embarrassing -- to be made to seem to say things that one didn't say and didn't mean. Also, the content of the mistake is sometimes significant. In this case, as Geoff Nunberg observed, a social change is attributed to social influence from above (rich kids schooled in England), instead of social influence from below (the effect of the speech of slaves). On balance, though, I feel that the result (an entertaining story about linguistics on the front page of the New York Times) is well worth the cost (a couple of misrepresentations of Guy Bailey's views on linguistic history). Of course, that's easy for me to say, I'm not the one being (mis)quoted. But more of us should be willing to take the risk.

I'll let Professor Bailey (who is also provost of the University of Texas at San Antonio) have the last word:

One thing we as linguists probably need to do is to figure out how to make technical linguistic descriptions easily available to a public which has a more general education. Interestingly enough, as an administrator, I always try to give reporters sound bites that reflect the message UTSA wants communicated; as a linguist, I never do.


Posted by Mark Liberman at 12:02 PM

November 29, 2003

How's your copperosity sagaciating?

Geoff Nunberg objects to the New York Times' quotation of Guy Bailey to the effect that r-lessness spread in Texas from the children of plantation owners who went to England for schooling and picked up the fashion there. I don't know whether that's what Guy really said -- it wouldn't be the first time that the NYT got a quotation or attribution garbled. And certainly both Nunberg and Bailey know a lot more about this than I do.

But in the course of putting together a lecture for an undergraduate course, I happen to have stumbled over a fascinating bit of trivia about r-lessness in 19th-century America, involving Uncle Remus, James Joyce, and the British recognition of the Republic of Texas. So here goes.

Loss of syllable-final /r/ was a change in progress in England in colonial times, variably distributed by geography and social class. As a result, the complex geographical and social patterns of r-lessness in the U.S. could logically have three sources: settlement patterns, patterns of continued contact with England, and local sociolinguistic dynamics.

The traditional account (as e.g. in (Richard) Bailey 1996 and Lass 1992) was that loss of postvocalic /r/ in England was a 17th and 18th century phenomenon. Thus r-lessness would have been widespread (but not universal) during the period when English speakers emigrated to North America, and thus settlement patterns are a likely source of influence.

However, recent research suggests that "... most of England was still rhotic ... at the level of urban and lower-middle-class speech in the middle of the nineteenth century, and that extensive spreading of the loss of rhoticity is something that has occurred subsequently..." (Peter Trudgill, "A Window on the Past: "Colonial Lag" and New Zealand Evidence for the Phonology of Nineteenth-Century English". American Speech 74(3) 1999).

If this is true, then U.S. settlement patterns are less relevant, and patterns of contact with England are more relevant. Prof. Bailey may have some evidence about this, I don't know.

However, I do want to cite one interesting piece of evidence in favor of an earlier adoption of r-lessness in the American south in general and Texas in particular.

On this web page, one Mike Schwitzgebel cites his Ohio grandfather's use of the word "copperosity". He tracks this via the OED to corporosity, "Bulkiness of body. Also used in a humorous title or greeting", with a citation to James Joyce Ulysses 418 "Your corporosity sagaciating O K? ". This in turn is apparently a reference to Joel Chandler Harris' The Tar Baby and other Tales of Uncle Remus, where "copperosity" and "segashuate" represent the African-American vernacular pronunciations of these words.

Schwitzgebel tracks the Harris/Joyce greeting further to Nicholas Doran P. Maillard's 1842 History of the Republic of Texas. Maillard was a British lawyer who lived in Richmond, Texas, for about nine months during the year 1840. His book was a virulent anti-Texas screed, published in the hope of influencing British public opinion against diplomatic recognition of the Republic of Texas. Maillard describes the infant republic as "stained with the crime of Negro slavery and Indian massacre", and "filled with habitual liars, drunkards, blasphemers, and slanderers; sanguinary gamesters and cold-blooded assassins; with idleness and sluggish indolence (two vices for which the Texans are already proverbial); with pride, engendered by ignorance and supported by fraud." Maillard also cites "How does your copperosity sagaciate this morning?" as a typical Texas greeting.

Make of it what you will. Myself, I've got a bunch of people coming this afternoon for a traditional Thanksgiving dinner on a non-traditional day, and I need to go get the neo-turkey into the post-thanksgiving oven.

[Update: now that the turkey is stuffed and in the oven, and other preparations are well underway, I need to add that I don't subscribe to Maillard's description as an accurate characterization of Texans, whether in 1840 or 2003, and especially not of my wife.]

Posted by Mark Liberman at 08:07 AM

November 28, 2003

Deep in the Hawt of Texath

A piece in today's NY Times on the etiology of the Texas twang was sounding pretty reasonable for the genre, until I came on the following:

The opposite syndrome, known as r-lessness, which renders "four" as "foah" in Texas and elsewhere, is easier to trace, Dr. Bailey said. In the early days of the republic, plantation owners sent their children to England for schooling. "They came back without the `r,' " he said. "The parents were saying, listen to this, this is something we have to have, so we'll all become r-less," he said. The craze went down the East Coast from Boston to Virginia (skipping Philadelphia, for some reason) and migrating selectively around the country.

That would be Guy Bailey, a linguistics professor at the University of Texas at San Antonio, who sounds here as if he's ignorant of the foundational research by Kurath and McDavid and others that established the connection between American dialect features and patterns of early settlement -- or maybe he's just unwilling to let go of a good tale.

The idea that American r-lessness arose because rich families sent their children to England for schooling is pure nonsense, of course -- there's no evidence that this was a widespread practice in either New England or the South, or for that matter for any r-lessness craze in American history. And in any case, that sort of story isn't required to explain r-lessness, given both the history of settlement and the role of autochthonous changes, as elaborated by Labov and many others.

But people seem to enjoy these anecdotes about the origins of linguistic features, like the classic story of the lisping King of Spain. That testifies to the popular tendency to think of language as a superficial social practice that changes in response to the sway of fashion, the same assumption that makes it easy to believe that systematic syntactic and phonological changes arise out of mere carelessness or affectation. The Times reporter couldn't have known any better, of course. But what was Bailey thinking of?

Posted by Geoff Nunberg at 09:46 PM

Talking seals and singing dogs

I don't normally read the Guardian, so I missed this Nov. 4 story about how Tecumseh Fitch is spending a sabbatical at St. Andrew's University trying to teach seals to talk. I'm 100% in favor of this effort -- more talking seals would be a step in the right direction, in my unsolicited opinion.

Despite my positive emotional response to talking animals, I've never taught an animal to speak, not even a parrot or a mynah. It's definitely one of those things about my life I would regret, if the question came up. I once taught a dog to sing, though.

Well, I'm exaggerating. A disinterested observer might conclude that it was the dog who taught me to sing. Here's the true story.

In the summer of 2000, I was dog-sitting for Rich and Sally Thomason at their cabin in rural Montana. Once a week, I had to drive an hour to the nearest supermarket to do the shopping, and of course Kwala would come with me.

When I played music in the car, I discovered that at certain points, Kwala would howl along. Her favorites were the soulful climaxes of country-western ballads and the tutti passages of Mozart orchestral works. She seemed to me to be entraining the timing of her howls to the rhythm of the music, and even sometimes matching pitches, but we humans tend to hear parallel sequences of complex sounds as being more correlated than they are, so I wasn't sure.

I found that Kwala would sing much more reliably if I sang too, even in works that did not interest her in themselves. Scientific motives aside, I enjoyed the experience. I have to confess that I am not much of a singer, and so I was happy to find that Kwala appreciated my efforts. She liked to sit behind me and rest her head on my shoulder, next to the open window, while we sang together along with the radio or a CD.

We got some very strange looks from a Suburban full of elderly fishermen, one hot July afternoon, when we pulled into the parking lot of the Bigfork IGA belting out "non più andrai".

I convinced myself that Kwala was definitely coordinating her vocalizations with mine, though I made no attempt to document this scientifically. Think of how many other mute inglorious canine Pavarottis may be out there!

For those who are more visually oriented, here are some pictures of Kwala that I put up on the web from Montana to reassure her distant owners that all was well. And courtesy of Prof. Hendler at Mindswap, here is today's application of the Universal Marketing Graphic (UMG), in this case illustrating the prospects for the development of talking seals:

While there has only been one talking seal so far, and he's dead, Tecumseh Fitch is on the case, and there are millions of seals out there to teach...

[Link to the Guardian story via].

Posted by Mark Liberman at 08:36 AM

November 27, 2003

Like, I care whether semantics are or is?

Mark Liberman quotes some remarks from a costume designer named William Ivey Long that include the clause "the semantics are confusing" and suggests in connection therewith that "Geoff Pullum will not be pleased to see that Mr. Long interpreted semantics as a plural count noun."

I think I can speak to this, what with me being Geoff Pullum and all, and I am here to tell you that you would be absolutely astonished to know just how little I care about whether some costume designer treated the morphologically pluralized lexeme semantics as a morphosyntactically plural count noun. I mean, it's not just a question of neither being pleased about Mr. Long's choice of verb form nor not pleased; we are talking about a deep and unbounded apathy here, a cosmically profound level of apathy down to which few people's refusal to give a monkey's fart ever descends. The depth of how much I deeply do not care about this would be impossible to overstate, though I will try. Why, just the other day I retired for a while to a fairly small room of my house and sat quietly reading there for, oh, a long time, without any thought of what suffixes dress choosers in New York were putting on verbs that had nouns like semantics as subject in a finite clause. I spend whole days sometimes not thinking about the verb agreement selection Mr. Long made -- indeed, not thinking about the inflectional decisions made by any parade costume designer anywhere. Let me try to explain further just how slender are the chances of my coming to care about this...

Oh, what's the point. People will always think I care about crap like subject-verb agreement. Let's face it, I'm a grammarian. No one is ever going to think I am anything but a boring old pedant. Not ever. No one realizes that I am actually a super fun wild and crazy guy, great in bed, sexy, witty, lively at parties, popular with children and animals. Even if people were to be shown a picture with parrots in the wild peacefully sitting on me they still wouldn't believe it. Sniff.

Posted by Geoffrey K. Pullum at 10:55 PM

LA emancipates electronic components

Unless the Onion has captured CNN's website, this must be serious. I looked around for a similar initiative about male and female plug types -- talk about stereotyped interactions! -- but couldn't find one.

Posted by Mark Liberman at 08:47 PM

Allegation of "forced fermatic practices"

Verity Stob at The Register has a scoop about "US software and litigation giant Softwron Inc".

Defying a "blanket gagging injunction," The Register cites a rumor in the Usenet newsgroup sci.math.research to the effect that a patented Softwron number "and two other 'large' integers together ganged up on an unwilling smaller (but technically oversize) integer and forced it to indulge in Fermatic practices with them."

The article quotes Rock McDosh, founder and CEO of Softwron, as follows:

"We categorically state that no number protected by Softwron patent has been involved in any rumoured inappropriate behaviour; and in any case we do not accept that such behaviour is inappropriate, if it could be stated what it was. Nonetheless, if going forward it were generally known what it was, our number would still not be involved in whatever it is. Which it isn’t."

According to a quoted expert, "This kind of incident is highly embarrassing for Softwron right now, but I don’t think it will ever go to court. What you have to remember is that the US Government never ratified Fermat’s Law, which it views as being anti free trade."

At the end of the article, there are links to four other Stob stories on the patenting of numbers.

In this context I'd like to draw the reader's attention to Eben Moglen's article Anarchism Triumphant, which is a serious (though entertaining) meditation, from a lawyer's perspective, on the general problem of intellectual property rights in a world that "consists increasingly of nothing but large numbers (also known as bitstreams)".

Professor Moglen's article contains this memorable passage:

No one can tell, simply by looking at a number that is 100 million digits long, whether that number is subject to patent, copyright, or trade secret protection, or indeed whether it is "owned" by anyone at all. So the legal system we have ... is compelled to treat indistinguishable things in unlike ways.

Now, in my role as a legal historian concerned with the secular (that is, very long term) development of legal thought, I claim that legal regimes based on sharp but unpredictable distinctions among similar objects are radically unstable. They fall apart over time because every instance of the rules' application is an invitation to at least one side to claim that instead of fitting in ideal category A the particular object in dispute should be deemed to fit instead in category B, where the rules will be more favorable to the party making the claim. This game - about whether a typewriter should be deemed a musical instrument for purposes of railway rate regulation, or whether a steam shovel is a motor vehicle - is the frequent stuff of legal ingenuity. But when the conventionally-approved legal categories require judges to distinguish among the identical, the game is infinitely lengthy, infinitely costly, and almost infinitely offensive to the unbiased bystander.

I'm not sure that Prof. Moglen is right about this -- large numbers seem at least as distinguishable to me as large collections of elementary particles are -- but you should read the whole thing.

Posted by Mark Liberman at 08:27 PM

Six nouns deep or more

It did cross my mind (I confess it) that Mark Liberman and Bill Poser might be making up their exotic-looking noun-noun-noun-noun-noun-noun compounds (Volume Feeding Management Success Formula Award and East-ward Communist-Party Lifestyle Consultation Center and so on); but yesterday I found that I was being required to write a letter of formal response to the Narrative[1] Evaluation[2] Student[3] Grievance[4] Hearing[5] Committee[6] on my campus. This was because of a couple of students who objected to the F grades I gave them after a winter[1] quarter[2] undergraduate[3] computer[4] science[5] course[6] assignment[7] plagiarism[8] incident[9]. They really are all around us, these compounds that are six nouns deep or more.

Incidentally, if you want to know how to work out how many different bracketings there are for a string of N nouns, the answer is given by the function f such that f(1) = 1 and for each N > 1 you compute f(N) by taking the sum of all the products of all the f(i) values for all the non-singleton sequences of positive integers i that add up to N.

For 2 this comes to 1, because the only list of positive integers that has more than one item and adds up to 2 is <1, 1>, and f(1) times f(1) = 1. For 3, the value of f comes out to 3, because we have 3 different lists of positive integers that add up to 3: <1, 1, 1>, <1, 2>, and <2, 1>; and when we take the products of all the f(i) for the integers i in each list we get f(1) times f(1) times f(1) = 1, and f(1) times f(2) = 1, and f(2) times f(1) = 1, and when we sum the products we get 1 + 1 + 1 = 3. This corresponds to the fact there are three bracketings for lifestyle consultation center: [lifestyle consultation center], [[lifestyle consultation] center], and [lifestyle [consultation center]].

To work out f(N) for N = 6, and thus the bracketings for volume feeding management success formula award or Narrative Evaluation Student Grievance Hearing Committee, just make a table of all the values of f for numbers from 1 up to 5; then make a list of all the lists of numbers that sum to 6; then take the value of f for each number in each list and write down those lists; then take the product of the numbers in each list and record those; and then sum all the products. This may take a while. In fact, for Americans it will entirely solve the problem of what to do with the long dull afternoons of the current four-day Thanksgiving holiday. Have a good one.
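For readers who would rather spend the afternoon some other way, the recipe above can be sketched in a few lines of Python. This is only an illustration of the definition as stated in the text; the helper name compositions is my own, and the code enumerates lists the slow, literal way rather than using any cleverer formula.

```python
from functools import lru_cache


def compositions(n):
    """Yield every ordered tuple of positive integers that sums to n."""
    if n == 0:
        yield ()
        return
    for first in range(1, n + 1):
        for rest in compositions(n - first):
            yield (first,) + rest


@lru_cache(maxsize=None)
def f(n):
    """Number of bracketings of a string of n nouns, per the recipe above."""
    if n == 1:
        return 1
    total = 0
    for parts in compositions(n):
        if len(parts) < 2:
            continue  # skip the singleton list <n>
        product = 1
        for i in parts:
            product *= f(i)  # every part is smaller than n, so this terminates
        total += product
    return total


for n in range(1, 7):
    print(n, f(n))
```

Run as written, this prints 1, 1, 3, 11, 45 and 197 for N from 1 to 6, so a six-noun compound has 197 bracketings under this definition.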

Posted by Geoffrey K. Pullum at 12:07 PM

Same-sex Mrs. Santa: "the semantics are confusing"

Yesterday, the actor Harvey Fierstein announced in a New York Times Op-Ed piece that he would be riding in the Macy's Thanksgiving Parade dressed as Mrs. Santa Claus. The theme of the piece was same-sex marriage, and he wrote that "[i]f I really was Santa's life partner, you can believe that he would ask and I would tell about who has been naughty or nice on this issue." He closed by inviting readers to "remember to wave to me on my float. I'll be the man in the big red dress."

This apparently caused some controversy. After all, as Fierstein stressed in his opening, "Macy's Santa is the real deal." So I'm sure he expected to create some buzz by announcing that "tomorrow, to the delight of millions of little children (not to mention the Massachusetts Supreme Judicial Court), the Santa in New York's great parade will be half of a same-sex couple."

According to an article in this morning's paper, Macy's (the store that sponsors the parade) quickly intervened to announce that "Santa Claus would be on the final sleigh float, accompanied by Mrs. Claus, a woman. Mr. Fierstein would be on a separate float." Macy's statement also 'emphasized that Mr. Fierstein would be dressed not as Mrs. Claus but as "his beloved character Mrs. Edna Turnblad of the Broadway hit musical `Hairspray.' " '

But then, the NYT says, "the actor's costume designer said that Mrs. Edna Turnblad, as portrayed by Mr. Fierstein, would be dressed as Mrs. Claus."

The costume designer, William Ivey Long, did however specify that the interpretation should only go two layers deep, not three. In the words of the Times article "those viewing Mr. Fierstein's costume would be expected to suspend their disbelief and see only Mrs. Turnblad dressed as Mrs. Claus, not Mr. Fierstein dressed as Mrs. Turnblad dressed as Mrs. Claus."

Mr. Long achieved this remarkable precision of interpretation by means of "a Balenciaga swing coat worn over a floor-length pencil skirt with a stamped red velvet jacket with fake fur collar and cuffs topped with a white fake fur French beret," adding that "those are just words. The effect is, of course, insane."

Macy's then issued a second statement, agreeing that Fierstein would be appearing "in Edna's interpretation of Mrs. Claus ... As for Mrs. Claus herself, she will be appearing with Santa on Santa's sleigh ..."

As Mr. Long is quoted as saying, "the semantics are confusing."

Long is clearly using semantics in the ordinary language sense of "what things mean," and I've got no problem with that (not that it would matter if I did). I was taught that semantics is about meaning as something that sentences have, whereas pragmatics is about meaning as something that people do. However, the field seems to be increasingly divided about where to draw the line, and even whether there is a line worth drawing; and meanwhile the world at large has long since decided that the fancy word for "(analysis of) meaning" is "semantics". So be it.

But I did wonder about the metaphor underlying Mr. Long's comment. I guess that it's "clothes are words" or "outfits are sentences" or something like that. And in this case, everyone is pretty clearly focusing on "wearer meaning" rather than "outfit meaning" -- along with an interesting political mix-in, somehow cancelling the most basic level of interpretation.

Anyhow, the point that interests me is that such metaphors usually work in the direction of understanding something more abstract in terms of something more concrete, but this is the opposite. At least, it's the opposite if you think that signifiers are more abstract than clothes. I guess that means it's a theory, not a metaphor. Though maybe it's neither one, but just a piece of terminology that Mr. Long once learned in a class on the semiotics of culture ...

Another thing that seems upside down here is the partial explicit cancellation of an expected meaning. In the familiar cases, it's always the superimposed layers of interpretation that are explicitly cancelled: "I have some aces; in fact I have all of them." But here, what is explicitly cancelled is what seems most basic: we're told to see Edna as Mrs. Claus, not Harvey as Edna as Mrs. Claus. Clearly confusing, even if not clearly semantics.

There is probably a whole literature about the Gricean implicatures of clothing, cancelled or otherwise. No doubt I could find it via google, but I'll wait for some reader to tell me. I've read Anne Hollander's Sex and Suits, but its semiotic analysis is merely implicit, and Grice is not in the index.

[Note: Geoff Pullum will not be pleased to see that Mr. Long interpreted semantics as a plural count noun. At least I think Geoff won't be: maybe he'll charitably construe Mr. Long's comment as involving one of the usage patterns in which mass nouns can be pluralized: "the semantics of Harvey Fierstein's Mrs. Santa outfit" like "the wines of France". As Stephen Maturin would have put it, "let us not be pedantic, for all love."]

[Another note: contemplating this whole story, I have to ask "is this a great country or what?" And now, back to Thanksgiving preparations!]

Posted by Mark Liberman at 08:33 AM

November 26, 2003

The AI gnomes of Zurich

With respect to an earlier Language Log piece on the Great Ontology Debate, Yarden Katz has drawn my attention to an anti-Shirky posting by Drew McDermott on the www-rdf-rules mailing list. McDermott ends with a zinger:

It's annoying that Shirky indulges in the usual practice of blaming AI for every attempt by someone to tackle a very hard problem. The image, I suppose, is of AI gnomes huddled in Zurich plotting the next attempt to --- what? inflict hype on the world? AI tantalizes people all by itself; no gnomes are required. Researchers in the field try as hard as they can to work on narrow problems, with technical definitions. Reading papers by AI people can be a pretty boring experience. Nonetheless, journalists, military funding agencies, and recently the World-Wide Web Consortium, are routinely gripped by visions of what computers should be able to do with just a tiny advance beyond today's technology, and off we go again. Perhaps Mr. Shirky has a proposal for stopping such visions from sweeping through the population.

I like the image. It suggests an on-going feature, in which three low-level employees at GnomeNet GmbH gossip about their bosses' latest Fiendish Plot forthcoming product (this month: Mindswap!). A cartoon format would be best -- I can't draw, but then there's the Partially Clips approach...

As for the content, I don't think anyone (in this discussion) will admit to being opposed to vision. The thing is, some visions turn out to be the telephone or the automobile or the internet, while others turn out to be the Picturephone, the Personal Zeppelin™ or perhaps the Philosophical Language of John Wilkins. This is an argument for pluralism but against credulousness.

[Note: the Mindswap website led me to these slides for the keynote address On Beyond Ontology at last month's ISWC conference, which I commend to the reader. I was especially happy to see that the Universal Marketing Graphic (UMG) is still in use (slide #3, "Approaching a Knee in the Curve"). I first saw a version of this graph used for speech technology market projections back around 1977. In those days, my colleagues used to label the horizontal axis in calendar years and the vertical axis in billions of dollars. Then someone pointed out that it was annoying to have to re-do the graphic every year, and suggested that the horizontal axis should be relabeled something like "...", "last year", "this year", "next year", "...". Prof. Hendler (or perhaps the GnomeNet Marketing Department?) has generalized this further by removing the labels from both axes, so that anyone can now use the graph to illustrate an optimistic forecast about any aspect of the future of anything! ]

Posted by Mark Liberman at 08:09 AM

Lazy mouths vs. lazy minds

Captain John Dunn, of the Shreveport LA police department, is quoted by CNN as attributing the failure of the speech recognition technology in their new PBX to "Southern drawl and what I call lazy mouth".

I hope that I don't need to explain that on the face of it, this is nonsense. Message to Captain Dunn: the fault is in your system's technology, not in your citizens' mouths.

The general prejudice against southern varieties of English includes stereotypes about stupidity and backwardness that come out strongly when the context is technological. The most egregious example of this that I've seen was Michael Lewis's reporting for Slate from the Microsoft anti-trust trial.

Lewis seems not to have had much to say about the actual content of the trial. Instead, he devoted most of his dispatches to making extended fun of the participants' appearances and accents. Microsoft's lead attorney, John Warden, got the lead-off spot:

"Warden is a natural heavy, a great Hogarthian ball of pink flesh with jowls that ripple over his white, starched shirt. I don't think I could have placed his overripe drawl without the help of a potted biography (which says he grew up in Evansville, Ind.), except to say that it is Southern. It is also loud; Warden prefers to lean into the microphone and imitate the Voice of God. In any case, it didn't take him long to prove that technology doesn't sound nearly as impressive when it is discussed in a booming hick drawl. As he boomed on about "Web sahts" and "Netscayup" and "the Innernet" and "mode ums" he made the whole of the modern world sound a little bit ridiculous."

A bit later, David Colburn of AOL was given the treatment:

"He has stooped shoulders; short, dark hair; a runaway 5 o'clock shadow; and the economy of motion of a highly skilled hit man. His deadpan North Jersey dialect simply reinforces the general picture that if he is not dangerous himself, he knows people who are."

It turned out that Colburn is actually from Milwaukee, but factual accuracy about individuals is not precisely the point of this kind of stereotyping, is it? In fact, it's precisely not the point.

So this leads me to wonder what a police captain in Bayonne NJ would say about why the speech recognition technology in their new PBX doesn't work: "Hey, it's our deadpan North Jersey dialect -- the system just freezes up and connects everybody to some club in Lodi."

[CNN story via cannylinguist]

Posted by Mark Liberman at 12:13 AM

November 25, 2003

Ever heard a Chomsky sentence?

Which auxiliary in a declarative clause is the one that must precede the subject in the corresponding closed interrogative? In example (1) it's the first auxiliary, as can be seen from the grammatical interrogative in (2). (The auxiliaries are underlined and subscripted for reference, and "__" appears where the auxiliary would have been if it weren't before the subject.)

(1) This is1 the unit you will2 be delivering to me.

(2) Is1 this __ the unit you will2 be delivering to me?

If you choose the other auxiliary, the result is badly ungrammatical:

(3) *Will2 this is1 the unit you __ be delivering to me?

But "the first auxiliary" isn't the right answer. And a very important point about the learning of grammar hangs on this. Let me explain.

In the declarative clause (4), it is not the first auxiliary that is placed before the subject to make the interrogative.

(4) The unit you will1 be delivering to me is2 similar to this one.

Using the first auxiliary would get you the disastrously ungrammatical (5).

(5) *Will1 the unit you __ be delivering to me is2 similar to this one?

Instead, it's the second auxiliary that you should choose. Putting that before the subject gets you the right result, namely (6).

(6) Is2 the unit you will1 be delivering to me __ similar to this one?

So the "first-auxiliary" rule is definitely wrong.

But how do we know this? How did we learn it? The correct rule is, as it happens, that it is whichever auxiliary is the one belonging to the main clause that must go before the subject. But how does a young learner ever find that out? What could convince a child who hit on the first-auxiliary rule that it is a mistake, only working accidentally in cases like (2) where the first auxiliary is the main clause auxiliary?
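The two candidate rules can be contrasted in a small Python sketch. This is only an illustration under loud assumptions: the notation (an asterisk marking an auxiliary, square brackets marking an embedded clause) and all the function names are invented here, only the auxiliaries subscripted in the examples above are marked, and the bracketing is supplied by hand. The hand-supplied bracketing is, of course, exactly the structural knowledge whose source is in question, so the sketch says nothing about learning.

```python
from typing import List, NamedTuple


class Tok(NamedTuple):
    word: str
    is_aux: bool      # marked with a leading "*" in the spec string
    embedded: bool    # inside a [bracketed] subordinate clause


def parse(spec: str) -> List[Tok]:
    """Read a hand-annotated clause, e.g. 'this *is the unit [you *will ...]'."""
    tokens, depth = [], 0
    for raw in spec.split():
        if raw.startswith("["):
            depth += 1
            raw = raw[1:]
        closes = raw.endswith("]")
        if closes:
            raw = raw[:-1]
        tokens.append(Tok(raw.lstrip("*"), raw.startswith("*"), depth > 0))
        if closes:
            depth -= 1
    return tokens


def front(tokens: List[Tok], i: int) -> str:
    """Place token i before everything else and punctuate as a question."""
    words = [tokens[i].word] + [t.word for j, t in enumerate(tokens) if j != i]
    sentence = " ".join(words)
    return sentence[0].upper() + sentence[1:] + "?"


def first_aux_rule(tokens: List[Tok]) -> str:
    """Front the linearly first auxiliary, ignoring structure."""
    return front(tokens, next(i for i, t in enumerate(tokens) if t.is_aux))


def main_clause_aux_rule(tokens: List[Tok]) -> str:
    """Front the auxiliary that is not inside an embedded clause."""
    return front(tokens, next(i for i, t in enumerate(tokens)
                              if t.is_aux and not t.embedded))


ex1 = parse("this *is the unit [you *will be delivering to me]")
ex4 = parse("the unit [you *will be delivering to me] *is similar to this one")

for ex in (ex1, ex4):
    print(first_aux_rule(ex))
    print(main_clause_aux_rule(ex))
```

On the clause in (1) the two rules give the same string, which is why data like (2) cannot tell them apart. On the clause in (4) they diverge: the first-auxiliary rule yields the word string of (5), and the main-clause rule yields that of (6).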

Well, we could learn that the first-auxiliary rule was wrong if we heard an example like (6). So for anyone who thinks we learn from the example of our parents and peers, it becomes an important question whether we ever do hear such examples.

Noam Chomsky does not believe that we learn most of our language from examples that we hear. He thinks much of the structure of human language is built into us at birth in some way ("innate"). And Chomsky has asserted firmly in numerous publications that we couldn't learn that the first-auxiliary rule was wrong. In one statement of the case he asserted that "A person could go through much or all of his life without ever having been exposed to relevant evidence" of the sort that (6) represents (paper and discussions recorded in Language and Learning: The Debate Between Jean Piaget and Noam Chomsky, ed. by Massimo Piattelli-Palmarini, Harvard University Press, Cambridge MA, 1980, p.40). However, he gave not one whit of empirical evidence supporting this confident assertion.

For convenience, let me refer to sentences with the property that (6) exhibits as Chomsky-sentences. Barbara Scholz and I have pointed out that it is not hard to find Chomsky-sentences in any text that one searches with any care, from newspaper prose to Oscar Wilde plays to Mork and Mindy scripts (see our paper `Empirical assessment of stimulus poverty arguments', The Linguistic Review 19 [2002], 9-50). But do they occur in spontaneous speech? Geoffrey Sampson, who believes people do learn purely from the evidence of hearing other people speak, suggests (in "Exploring the richness of the stimulus", The Linguistic Review 19 [2002], 73-104) that they don't -- and he thinks it is also the case that people never learn to produce Chomsky-sentences in speech. That is, he proposes that the rule about making interrogatives by placing the auxiliary before the subject is to some extent a rule of written English rather than spoken. (He even encountered, just once, a woman who attempted a Chomsky-sentence in spontaneous conversation, and she got it completely wrong: she was attempting to say Is what I'm doing worthwhile?, but what came out of her mouth was *Am what I doing is worthwhile?, completely ungrammatical.)

Whether Chomsky-sentences occur in spoken English is a real bone of contention, therefore. That is why I jumped as if stung by a bee when I was listening to the BBC World Service on December 7, 2001, at about 4:30 p.m. GMT and I heard a business reporter doing an unscripted interview with a Swissair executive say:

(7) How radical are2 the changes you're1 having to make __?

That's a Chomsky-sentence. In declarative analogs like The changes you're1 having to make are2 so radical, the auxiliary of the main clause, are, is the one that has to be put up front before the subject (right after the fronted interrogative phrase how radical which begins the whole sentence). So much for the claim that you could live your whole life without hearing a Chomsky-sentence.

So are they common? Well, on February 2, 2002, I was listening to the BBC again, and I heard an interviewer doing an unscripted interview by satellite phone with a yacht race contestant, and the interviewer said:

(8) How sophisticated is2 the computer equipment you've1 got on board __?

That's another Chomsky-sentence in spontaneous speech. It raises the issue of whether perhaps both Chomsky and Sampson are wrong. Both my examples are from the BBC, and both are how questions; but how many more Chomsky-sentences are going past our ears all the time? And how many would it take to settle the question of whether it was possible for children to learn which auxiliary to front simply from examples of what they had heard? I have no idea. But the sentences in question don't have to be long and cumbersome like the ones above. The shortest Chomsky-sentence I've been able to construct is only four syllables:

(9) Is2 what's1 left __ mine?

Ever heard someone say that on seeing that there's just two slices of pizza left in the box? I have a feeling I may actually have said it myself on occasion. But I don't know.

It's actually scientifically important whether Chomsky sentences turn up in everyday speech, and if they do, how common they are. Keep your ears open, and make notes. I would love to see any accurately transcribed examples that you hear, written down with date and details of the speaker, and preferably witnessed independently by a third person who was there. You could send the examples to me by email. My login name is pullum and is the domain.

(Forgive me for not including a mail-to link, but it would immediately be seized upon by foraging spambots who would send me unwanted messages about Viagra and toner cartridges.) This post was edited on Wed 26 Nov 2003 at about 09:45 PST. Among other things, the word "not" was invisible for about twelve hours in the sentence "Noam Chomsky does not believe that we learn most of our language from examples that we hear" -- because of a formatting error, not a belief error. That's a rather serious alteration in sense. Apologies. --GKP

Posted by Geoffrey K. Pullum at 10:27 PM

At least he didn't answer

OK, I can't blame this on the paracingulate cortex.

No, maybe I can. This is a story about someone's cell phone going off inside his (closed) coffin at a remembrance service. It's said that "[s]ome of the relatives were so shocked they ran into the street."

Why? If the ringing phone had belonged to one of the live attendees, the others would have been annoyed, but not horrified to the point of running out of the building. But in this case, they found themselves starting to read the mind of a dead man, which is very creepy.

[Source: transblawg]

[Update 12/1/2003: this morning's BBC World Service news program, at the end of a discussion of the new British law against using mobile phones while driving, cited a similar story as a bit of colorful trivia -- the most inappropriate place ever heard of for a mobile phone call, or something like that. However, the reporter introduced the event as happening "in Israel, actually." I couldn't find any indication (via google news) of such an event being reported from Israel. Was this (a) a simple mistake?, (b) an independent story that hasn't made it into google's index for some reason?, (c) the leading edge of an urban legend, placed in Israel because for the BBC, that is the default location for anything unpleasant?]

Posted by Mark Liberman at 10:12 PM

Understanding Complex Nominals

When I lived in Osaka I used to walk by a place whose sign identified it as the:

     Higashi-ku Kyoosantoo      Seikatsu   Soodan        Sentaa
     East-ward  Communist-Party Lifestyle  Consultation  Center

I puzzled for months as to what this might be. I was pretty sure that it was a center run by the Communist party in the East Ward for consultation about seikatsu, which means something like "way of life, lifestyle, livelihood". It didn't seem likely, for instance, that it was a center for consultation about the seikatsu of the East Ward Communist Party. But what sort of consultation about seikatsu might this be? Did the Communist party propose to help me decide whether to take up tennis? I eventually asked a friend who told me that it dealt with problems such as unemployment, marital discord, and alcoholism.

Incidentally, of the ten morphemes in this phrase, only one, "east", is native to Japanese. /sentaa/ is borrowed from English. All of the rest are loans from Chinese.

Posted by Bill Poser at 01:37 PM

Querkopf Von Klubstick, Grammarian

Here is a little something that I found in the Complete Poetical Works of Samuel Taylor Coleridge.

I find it interesting that even heavy doses of laudanum and neo-Platonism couldn't reconcile Coleridge to (the stylistic extremes of) continental philosophy. It would be amusing to channel his appreciation of Jacques Derrida, for example.

In reading Coleridge's biography, I learned that he was apparently the inventor of the word selfless, though the OED's first citation is not until 1825, ten years after the date of this poem.

The following burlesque on the Fichtean Egoismus may, perhaps, be amusing to the few who have studied the system, and to those who are unacquainted with it, may convey as tolerable a likeness of Fichte's idealism as can be expected from an avowed caricature. [S. T. C.]

The Categorical Imperative, or the annunciation of the New Teutonic God, EGOENKAIPAN: a dithyrambic Ode, by Querkopf Von Klubstick, Grammarian, and Subrector in Gymnasio. ...

Eu! Dei vices gerens, ipse Divus,
(Speak English, Friend!) the God Imperativus,
Here on this market-cross aloud I cry:
'I, I, I! I itself I!
The form and the substance, the what and the why,
The when and the where, and the low and the high,
The inside and outside, the earth and the sky,
I, you, and he, and he, you and I,
All souls and all bodies are I itself I!
All I itself I!
(Fools! a truce with this starting!)
All my I! all my I!
He's a heretic dog who but adds Betty Martin!'
Thus cried the God with high imperial tone:
In robe of stiffest state, that scoff'd at beauty,
A pronoun-verb imperative he shone---
Then substantive and plural-singular grown,
He thus spake on:---'Behold in I alone
(For Ethics boast a syntax of their own)
Or if in ye, yet as I doth depute ye,
In O! I, you, the vocative of duty!
I of the world's whole Lexicon the root!
Of the whole universe of touch, sound, sight,
The genitive and ablative to boot:
The accusative of wrong, the nom'native of right,
And in all cases the case absolute!
Self-construed, I all other moods decline:
Imperative, from nothing we derive us;
Yet as a super-postulate of mine,
Unconstrued antecedence I assign,
To X Y Z, the God Infinitivus!'


Posted by Mark Liberman at 10:28 AM

Parsers that count

A month ago, I cited the difficulty of parsing complex nominals like the one found on a plaque in a New Jersey steakhouse: "Volume Feeding Management Success Formula Award". We're talking about sequences of nouns (with adjectives mixed in as well), and the problem is that these strings mostly lack the structural constraints that parsers traditionally rely on.

When you (as a person or a parser) see a sequence like "A lapse in surveillance led to the looting" (from this morning's New York Times, more or less), you don't necessarily need to figure out what it means or even pay much attention to what the words are: "A NOUN in NOUN VERBED to the NOUN" has a, like, predictable structure in English, however you fill in the details. But "NOUN NOUN NOUN NOUN NOUN NOUN" is like a smooth, seamless block -- you (or the parser) can carve anything you please out of that.

One traditional solution is to look at the meaning. Why is it "[stone [traffic barrier]]" rather than "[[stone traffic] barrier]"? Well, it's because traffic barriers made of stone make easy sense in contemporary life, while barriers for stone traffic evoke some kind of science-fiction scenario. The practitioners of classical AI figured out how to do this kind of analysis for what some called "limited domains", and others called "toy problems". But this whole approach has stalled, because it's hard.

There's another way, though.

Here's a set of simple illustrative examples, taken from work in a local project on information extraction from biomedical text. (These examples come from Medline). Each of the four possible 3-element complex nominal sequences (with two nouns or adjectives preceding a noun) is exemplified in each of the two possible structures (one with the two leftward words grouped, the other with the two rightward words grouped).

phrase                           words 1-2  words 2-3
sickle cell anemia                   10561       2422
rat bile duct                          203      22366
information theoretic criterion        112          5
monkey temporal lobe                    16      10154
giant cell tumour                     7272       1345
cellular drug transport                262        746
small intestinal activity             8723        120
inadequate topical cooling               4        195

And the numbers? The numbers are just counts of how often each adjacent pair of words occurs in (our local version of) the Medline corpus (which has about a billion words of text overall). Thus the sequence "sickle cell" occurs 10,561 times, while the sequence "cell anemia" occurs 2,422 times.

Most of the time, in a 3-element complex nominal "A B C", you can parse the phrase correctly just by answering the question "which is commoner in a billion words of text, 'A B' or 'B C'?"

In a crude test of 64 such sequences from Medline (8 of each type in the table above), this method worked about 88% of the time.
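The counting heuristic is simple enough to sketch in a few lines of Python. This is a minimal illustration, not the code used in the project: the function name `bracket` is made up, the counts are just the sixteen Medline figures quoted above, and ties (or missing counts) are assumed to default to right-branching, which is my choice rather than anything established here.

```python
# Adjacent-pair counts quoted in the post (Medline, about a billion words).
BIGRAM_COUNTS = {
    ("sickle", "cell"): 10561, ("cell", "anemia"): 2422,
    ("rat", "bile"): 203, ("bile", "duct"): 22366,
    ("information", "theoretic"): 112, ("theoretic", "criterion"): 5,
    ("monkey", "temporal"): 16, ("temporal", "lobe"): 10154,
    ("giant", "cell"): 7272, ("cell", "tumour"): 1345,
    ("cellular", "drug"): 262, ("drug", "transport"): 746,
    ("small", "intestinal"): 8723, ("intestinal", "activity"): 120,
    ("inadequate", "topical"): 4, ("topical", "cooling"): 195,
}


def bracket(a, b, c, counts=BIGRAM_COUNTS):
    """Group whichever adjacent pair is commoner in the corpus.

    Ties and unseen pairs fall through to right-branching (an assumption).
    """
    left = counts.get((a, b), 0)    # how often "A B" occurs
    right = counts.get((b, c), 0)   # how often "B C" occurs
    if left > right:
        return f"[[{a} {b}] {c}]"
    return f"[{a} [{b} {c}]]"


for phrase in ("sickle cell anemia", "rat bile duct",
               "cellular drug transport", "small intestinal activity"):
    print(bracket(*phrase.split()))
```

Nothing in this looks at what the words mean; the whole decision is one comparison between two integers.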

Actually, this is an underestimate of the performance of such approaches. In the first place, the different sequence types are not at all equally frequent, nor are the parsing outcomes equally likely for a given sequence type. Thus in the Penn Treebank WSJ corpus (a thousand times smaller than Medline, and much less infested with complex nominals, but still...) there are 10,049 3-element complex nominals, which are about 70% right-branching ([A [B C]]) vs. 30% left-branching ([[A B] C]). More information about the part-of-speech sequence or the particular words involved gives additional leverage. And other counts (such as the frequency of the individual words, of the pattern "A * C", etc.) also may help. There are also more sophisticated statistics besides raw bigram frequency (though in this case the standard ones, such as ChiSq, mutual information, etc., work slightly worse than raw counts do).

Yogi Berra said that "sometimes you can observe a lot just by watching". The point here is that sometimes you can analyze a lot just by counting. And while understanding is hard, counting is easy.

Posted by Mark Liberman at 06:25 AM

November 24, 2003

Conversational game theory: the cartoon version

Deadlock: a funny exploration of why the logic of communication is hard, illustrating the thought processes of an interpersonally-sensitive Asperger's sufferer adult male human.

Like other actions, communicative choices have consequences. As shown in this strip, it's really hard to work out what choice leads to the best outcome a few moves down the road, especially when the other participants may not even be playing the same game.

This reminds me of a visit to SRI in the mid-70's, where I saw a demo of a principled conversational system. As I recall, my host typed in his conversational opening, and then we went to lunch while their KL10 churned away, trying to prove a theorem about what the optimal response to "hello" might be. I think we got back well before the machine had calculated its next move. This experience left me with a completely illogical feeling that the machine, clueless and ungrounded as it was, still somehow really meant what it said, purely by virtue of the effort that it appeared to put into choosing its responses. But I also acquired another (more rational?) conviction: as a metaphor (or a system design) for on-line control of conversation, it would be better to pick a stochastic finite automaton instead of a theorem-prover. It might not do the right thing, but it would do something. A still more plausible conclusion, however, might be that no one had (has?) yet invented a formalism that does a good job of modeling human communication.

[via johnny logic].

Posted by Mark Liberman at 02:44 PM

blog wins 2003 ADS "word most likely to succeed" award

Margaret Marks at Transblawg points out a site where Swiss people can vote for "Wort des Jahres (word of the year) and Unwort des Jahres (antiword of the year)". The site archive has results back to 1977 (for Germany), when #2 (of six Wort des Jahres winners) was "Terrorismus, Terrorist". Starting in 2002, the program appears to have spread to Austria and Liechtenstein, and now to Switzerland. Dr. Marks explains Liechtenstein's intriguingly petty #3 winner for 2002, Senfverbot "mustard ban". Mark Twain would have appreciated the 1999 Deutschland tenth-place winner Rindfleischetikettierungsüberwachungs-aufgabenübertragungsgesetz.

The American Dialect Society has a "Words of the Year" contest. Unfortunately it seems only to go back to 1990, so we can't compare lexicographic terrorism awareness across the Atlantic in 1977.

But like the Academy Awards, the ADS contest has categories, of which the most interesting to me is "most likely to succeed". Winners in this category since 1990 have been notebook PC, rollerblade, snail mail, quotative "like", [not awarded?], world wide web, drive-by, DVD, e- , dot-com, muggle, 9-11, 9-11 [winner two years in a row?!], and [in 2003] blog. Take that, John Dvorak!

The ADS vote was tallied back in January, so it is not exactly a news flash, but I missed it at the time :-). A search for "American Dialect Society word of the year" at produces only an error page telling me that the search result "does not appear to have any style information associated with it." Indeed, alas...

[Update 12/1/2003: Grant Barrett, the webmaster for the American Dialect Society, has brought to my attention the fact that I misread their webpage: 9-11 won just once, in the January 2002 vote for "word most likely to succeed" from the year 2001. The list given above becomes correct, I think, if the second occurrence of 9-11 is deleted.]

Posted by Mark Liberman at 06:35 AM

November 23, 2003

Memo to self

Before activating time machine, memorize recipe for gunpowder and re-read A Connecticut Yankee in King Arthur's Court.

Posted by Mark Liberman at 09:15 PM

It has wrinkled feet

English speakers, or at least English-speaking linguists, are thoroughly used to the idea of loanwords. English has many thousands of words borrowed from French and Latin, and sizable numbers from other languages; and many or most other languages also do a lot of borrowing. So it's a surprise to find that some languages have few loanwords.

A `no-borrowing' strategy is shared by many Native American languages, at least as far as borrowings from the colonial languages English and French are concerned. Montana Salish is typical of languages of the US Northwest in this respect: it has virtually no English loanwords and only a handful from French (most of which it probably got from other Native languages, not directly from French).

So what do Montana Salish speakers do when they acquire something new from the dominant Anglo culture? What they do is invent words for new things, using materials that are already present in their own language. My favorite example is the word for `automobile', which is p'ip'uyshn -- literally, `it has wrinkled feet', a word that was obviously inspired by the appearance of tire tracks and/or of the tires themselves. And this word is not peculiar to Montana Salish. The same basic formation is found in two Salishan languages that are closely related to Montana Salish: in Coeur d'Alene the word literally means `thing with wrinkled paws' (according to Dale Sloat, via Julia Falk), and Moses-Columbia has k-p'ip'uyxn for `automobile', beside an English loanword, 7atmupil (Dictionary of the Moses-Columbia language, compiled by M. Dale Kinkade, 1981).

There's a puzzle here: did speakers of these three languages independently come up with the same metaphor to designate `automobile', or was the word invented in one language and then borrowed (with appropriate phonetic changes according to a borrowing routine) into one or both of the others? I wish I had an answer to this question, but I don't.

Posted by Sally Thomason at 07:48 PM

Trees spring eternal

Trees can be trouble. Over the past month, this blog has seen issues with hypothesized tree structures in semantics (ontology), pragmatics (discourse structure) and syntax. We haven't discussed questions about tree-asserting hypotheses in morphology, phonology and phonetics, but believe me, they're out there.

It seems to be natural for human analytic efforts to produce tree-structured ideas, typically as a result of recursive subdivision of phenomena, whether subdivision of a string of tokens or of a set of entity types. For some naturally-occurring time series (linguistic and otherwise) and for natural kinds of plants and animals, this really works -- tree theories can be an efficient and effective way to organize rational investigation, whether or not they are scientifically valid. This record of success, I think, has reinforced the "things are trees" idea over many millennia of hominid inquiry into nature. A believer in evolutionary psychology might even suppose that our brains have learned to think that things are trees, genetically as well as memetically.

Of course, scientists often find that things are not trees, or at least not exactly. However, non-tree-structured hypotheses are not intrinsically any more likely to be correct. There's a fascinating case in the history of biology, which I learned about some years ago from one of the best books that I ever found in a remainders bin.

Linnaean taxonomy, which classifies all living things into a hierarchy, was developed in the 18th century, but its explanation in terms of Darwin's "descent with modification" did not emerge until more than a century later. In fact, as early 19th-century biologists delved further into the structures and lifecycles of invertebrates from around the world, several of them thought that they saw empirical evidence for non-tree-like patterns of relationship among such creatures. One of these was Thomas Henry Huxley, later famous as a promoter of Darwin's theories.

Darwin went off "botanizing" on the Beagle from 1831-1836 and came back with the evolutionary tree -- descent with modification -- as a new semantics for the Linnaean syntax. But he didn't publish his ideas until 1859. Meanwhile, Huxley went off botanizing on the Rattlesnake from 1846-1850 and returned with a theory of circles of affinity inter-related by parallel cross-links of analogy, as exemplified in this diagram

reproduced from Mary Winsor's fascinating book Starfish, Jellyfish and the Order of Life.

I get the impression that what appealed to Huxley about this "circular theory" (which was inspired by the earlier "Quinary theory" of William Sharp MacLeay) was precisely that it was so different from the common-sense hierarchy of natural kinds, and therefore looked like a real discovery. As a linguist, I'm familiar with the perspective that values "tension between common sense and science."

But Huxley also wrote that

The Circular System appears to me to stand in the same relationship to the true theory of animal form as Keplers Laws to the fundamental doctrine of astronomy--The generalization of the Circular system are for the most part, true, but they are empirical, not ultimate laws---

That animal forms may be naturally arranged in circles is true -- & that the planets move in ellipses is true -- but the laws of centripetal and centrifugal forces give that explanation of the latter law which is wanting for the former. The laws of the similarity and variation of development of Animal form are yet required to explain the circular theory -- they are the true centripetal and centrifugal forces in Zoology.

(Newton's account of Kepler's Laws depends on the single force of gravity, not paired centripetal and centrifugal "fictitious forces", doesn't it? .. but anyhow,) Huxley has the idea that a hypothetical pattern in nature should not simply be accepted (on aesthetic grounds, or as a glimpse of the mind of God), but rather should be given a causal explanation in terms of the dynamics of some simple process. And when he saw that "descent with modification" (with a bunch of other assumptions!) could provide exactly such an explanation for a tree-structured taxonomy of biological species, Huxley immediately abandoned circles for trees.

In linguistics, we can find some similarly fundamental causal arguments for tree structures in terms of the dynamics of basic processes: recursive concatenation in composition; stack discipline in processing; descent with modification in history. But just as in the case of natural kinds in biology, the argument from these basic processes to the structure of real-world phenomena requires lots of extra assumptions. And may be wrong.

Posted by Mark Liberman at 07:42 AM

November 22, 2003

Language fu, the cartoon version

Well, just one more. I like the subtle code-switching.

Posted by Mark Liberman at 01:48 PM

Dynamic and epistemic logic: the cartoon version

Quite funny.

With serious but fun stuff here, for the brave and/or curious.

OK, I'll stop now.

Posted by Mark Liberman at 01:22 PM

"Whole Language", the cartoon version

Fair enough. Also funny.

More on Whole Language, if you care (and you should!). Warning: not funny.

Posted by Mark Liberman at 12:56 PM

Brain & Language, the cartoon version

Unfair, but funny.

Posted by Mark Liberman at 12:51 PM

like is, like, not really like if you will

Geoff Pullum argues that val-speak "like" is like old-fogey "if you will." His case is cogent as well as entertaining.

But based on the examples and analysis in Muffy Siegel's lovely paper "Like: The Discourse Particle and Semantics" (J. of Semantics 19(1), Feb. 2002), I want to suggest that Geoff is, like, not completely right.

Muffy supports and extends the definition of (this use of) like due to Schourup (1985): "like is used to express a possible unspecified minor nonequivalence of what is said and what is meant". And I agree with Geoff that there are several widely-used formal-register expressions with more or less the same function: "if you will", "as it were", "in some sense", etc.

So far so good. However, Muffy's article also supports two differences between "like" and "if you will".

First, some of her examples (taken from taped interviews with Philadelphia-area high school students) suggest a quantitative difference:

She isn't, like, really crazy or anything, but her and her, like, five buddies did, like, paint their hair a really fake-looking, like, purple color.

They're, like, representatives of their whole, like, clan, but they don't take it, like, really seriously, especially, like, during planting season.

In these two examples, 8 discourse-particle likes get stuck in among a mere 38 non-like words -- roughly one like every 5 words. It's hard to translate this into fogey-speak:

?They're representatives, if you will, of their whole clan, if you will, but they don't take it really seriously, if you will, especially during planting season, if you will.
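For anyone who wants to check the tally, here is a rough Python sketch. The tokenizing pattern and the function name `tally` are my own choices, and the count treats every occurrence of like as the discourse particle, which happens to be safe for these two sentences but would not be safe for text containing the verb or the preposition.

```python
import re

EXAMPLES = [
    "She isn't, like, really crazy or anything, but her and her, like, "
    "five buddies did, like, paint their hair a really fake-looking, "
    "like, purple color.",
    "They're, like, representatives of their whole, like, clan, but they "
    "don't take it, like, really seriously, especially, like, during "
    "planting season.",
]


def tally(sentences):
    """Return (number of likes, number of other words)."""
    # A word is a run of letters, apostrophes and hyphens,
    # so "isn't" and "fake-looking" each count once.
    words = [w for s in sentences for w in re.findall(r"[A-Za-z'-]+", s)]
    likes = sum(1 for w in words if w.lower() == "like")
    return likes, len(words) - likes


likes, others = tally(EXAMPLES)
print(likes, others, others / likes)
```

That comes to 8 likes against 38 other words, or 4.75 other words per like.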

Whatever the whining old fogeys may say, I think it's this tic-tock frequency that bothers them. I once had a colleague who used the word literally similarly often: "Now, literally, look at the first equation, where, literally, the odd terms of the expansion will, literally, cancel out..." It (was one of several things about this guy that) drove me nuts. If all middle-aged telecommunications engineers started talking that way, I'd get in line behind William Safire to slam them for it. (By contrast, the overuse of like by young Americans seems quaint and charming to me, probably because I like the speakers better.)

There's a second difference between like and if you will to be found in Muffy's paper. She documents a number of semantic effects of like, such as weakening strong determiners so as to make them compatible with existential there:

(38) a. *There's every book under the bed.
     b. There's, like, every book under the bed. (Observed: Speaker
        paraphrased this as 'There are a great many books under the
        bed, or the ratio of books under the bed to books in the rest
        of the house is relatively high.')
(39) a. *There's the school bully on the bus.
     b. There's, like, the school bully on the bus. (Observed: Speaker
        paraphrased this as 'There is someone so rough and domineering
        that she very likely could, with some accuracy, be called the
        school bully; that person is on the bus.')

Try this in fogey-speak: "there's every book under the bed, if you will". Like, I don't think so.

No, like is definitely a more powerful (and useful) expression than if you will. Perhaps that's why some people use it, like, too much?

[Note: Muffy Siegel's paper doesn't discuss these specific alleged differences (between "like" and other hedges), which were inspired by her analysis but are not her fault.]

[Update 11/23/2003: Maggie Balistreri's Evasion-English Dictionary provides some amusing and relevant entries for like, though lexicographers might quibble about the sense divisions as well as the assignment of examples to senses. Well, anyhow, if I were a lexicographer, I would :-)... And here she is being interviewed on NPR, expressing the perspective that Geoff Pullum complained (like, validly) about.]

Posted by Mark Liberman at 09:39 AM

Bad Writing and Lord Lytton

Mark Liberman's post on bad writing made me think of the Bulwer-Lytton Fiction Contest. Edward Bulwer-Lytton seems always to come to attention these days as the epitome of bad writers, the author of the infamous passage:

It was a dark and stormy night; the rain fell in torrents - except at occasional intervals, when it was checked by a violent gust of wind which swept up the streets (for it is in London that our scene lies), rattling along the housetops, and fiercely agitating the scanty flame of the lamps that struggled against the darkness. (Paul Clifford, 1830)

Personally, I don't think that this passage is so bad - I like it. It seems to me that people have lost the ability to appreciate complex style. But in any case, as a student of the native languages of British Columbia, I'd like to point out that Lord Lytton played a more important role in history. In 1858 and 1859 he served as Colonial Secretary in Lord Derby's government, in which capacity he gave instructions to Sir James Douglas, governor of the colonies of Vancouver Island and British Columbia, regarding the conduct of relations with the indigenous peoples of the two colonies. (The correspondence between Lord Lytton and Sir James may be found in British Columbia Papers Connected with the Indian Land Question 1850-1875 (Victoria: The Government Printer, 1875).)

These instructions have figured in several ways in recent litigation over aboriginal rights in British Columbia. On the one hand, the Supreme Court of Canada in Calder v. Attorney-General of British Columbia (7 C.N.L.C. 91 (SCC)) cited his letters as evidence of delegation of power by the Crown to the colonial government (a position criticized by Bruce Clark in his important but controversial book Native Liberty, Crown Sovereignty at p. 64). On the other hand, his instructions give rather clear evidence that the Crown recognized aboriginal title and took the position that it could only be extinguished with the formal consent of the Indians. This evidence is of some importance since there has been long-standing controversy as to whether the Royal Proclamation of 1763 applied to British Columbia.

The colonization of British Columbia led rapidly to the loss of the indigenous languages. Three languages are already extinct; almost all of the remaining 33 are dying.

Posted by Bill Poser at 02:15 AM

It's like, so unfair

Why are the old fogeys and usage whiners of the world so upset about the epistemic-hedging use of like, as in She's, like, so cool? The old fogeys use equivalent devices themselves, all the time. An extremely common one is "if you will". Semantically it does exactly what like does. Let me explain.

Look at these synonymous pairs:

  1. The evidence I think will show that of the total amount of money raised from private sources, and from profits or increases in markup, if you will, on the sale of U.S. weapons to Iran, that a relatively small percentage of that money went to the Contras.

  2. The evidence I think will show that of the total amount of money raised from private sources, and from profits or increases in, like, markup on the sale of U.S. weapons to Iran, that a relatively small percentage of that money went to the Contras.

  3. The baboon that's best at coping with stress is the one that seeks emotional backing from other baboons (support groups, if you will), the researchers found.

  4. The baboon that's best at coping with stress is the one that seeks emotional backing from other baboons (like, support groups), the researchers found.

  5. And the bland assumption that all cartoons are childish or trivial is itself, if you will, a cartoon version of "cartoon."

  6. And the bland assumption that all cartoons are childish or trivial is itself, like, a cartoon version of "cartoon."

  7. "We were willing to overlook it, if you will, being a growth company."

  8. "We were willing to, like, overlook it, being a growth company."

  9. I think it's a reason we've done well; part of our mystique, if you will.

  10. I think it's a reason we've done well; part of, like, our mystique.

  11. They are, if you will, this country's governing body.

  12. They are, like, this country's governing body.

  13. There is also a potential source of shenanigans, if you will.

  14. There is also a potential source of, like, shenanigans.

In each case, the first sentence is a quote from The Wall Street Journal. They mostly appear to be quotes from educated and prosperous middle-aged persons — CEOs and so on. The second sentence in each pair is my translation into the style of younger speakers.

When people who think the English language is going to hell in a handcart cite phenomena like this use of like as their evidence, things are going a bit too far. Like functions in younger speakers' English as something perfectly ordinary: a way to signal hedging about vocabulary choice -- a momentary uncertainty about whether the adjacent expression is exactly the right form of words or not. If the English language didn't implode when if you will took on this kind of role among the baby boomers, it will survive having like take on an extremely similar role for their kids. The people who grouse about like are myopic old whiners who haven't looked at their own, like, linguistic foibles, if you will.

Posted by Geoffrey K. Pullum at 02:14 AM

Stalinist Linguistics

Mark Liberman's mention of Stalinist linguistics might give rise to the inference that Stalin had a distinctive approach to linguistics associated with his "left-fascist" politics. Actually, the distinctive, indeed bizarre, tendency in Soviet linguistics was due to N. Ja. Marr, who was to Soviet linguistics what Lysenko was to Soviet biology. Among Marr's stranger claims is that all the words of all human languages are descended from the four proto-syllables sal, ber, yon, and rosh.

Stalin's paper Marksizm i Voprosy Jazykoznanija [Marxism and problems of linguistics] is a refutation of Marr. Although Stalin cannot be said to have made any new and profound contribution to linguistics, he actually did have some knowledge of linguistics and his views were quite mainstream.

For a detailed account of Marr and his role in Soviet linguistics, see Jan Ivar Bjornfløten's book Marr og Språkvitenskapen i Sovjetunionen (Oslo: Novus Forlag. 1982.)

Posted by Bill Poser at 01:09 AM

November 21, 2003

Phineas Gage gets an iron bar right through the PP

On September 14, 1848, the Free Soil Union in Ludlow, Vermont, carried a news item that began:

As Phineas P. Gage, a foreman on the railroad in Cavendish, was yesterday engaged in tamping for a blast, the powder exploded, carrying an iron instrument through his head an inch and a fourth in circumference, and three feet and eight inches in length, which he was using at the time. [from a scan on Malcolm Macmillan's Phineas Gage information page]

I happened to read this item a couple of days ago while preparing a lecture on emotion for Cognitive Science 001. It reminded me of something that I left out of my earlier post on crossing dependencies in discourse structures: within-sentence syntactic relationships also often tangle.

To understand the phrase "carrying an iron instrument through his head an inch and a fourth in circumference" as the writer intended us to, we have to recognize that the inch-and-a-quarter measurement modifies the iron, and not Phineas' head -- which is in the way in this sentence, just as it was on that September day in 1848.

It's fair to consider this an unhappy stylistic choice. On the other hand, folks sometimes write this way, and they talk this way even more often (and often the results are not so likely to be mentally red-penciled by the audience). In some languages, and some registers of English, syntactic tangling like this is normal. In fact, the only thing that's really troublesome in the Gage example is that 'which' struggling to swim upstream to 'instrument' ...

Tangling of surface syntactic relations is certainly not a new discovery. Among recent treebanks, the German TIGER corpus project's "syntax graphs" permit crossing edges, and so does the analytical level of the Prague Dependency Treebank (where crossing relations are called "non-projectivity").
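For readers who like to see such things spelled out, here is a minimal sketch of what "crossing edges" means operationally. It assumes a deliberately simplified analysis of the Gage sentence: the token list, the word indices and the arcs are my own illustrative choices, not taken from TIGER or the Prague treebank, and checking for pairwise crossing arcs is only the usual rough stand-in for the treebanks' formal definitions of non-projectivity.

```python
from typing import List, Tuple

Arc = Tuple[int, int]  # (head index, dependent index)

# Hypothetical, heavily abbreviated token list for the Gage sentence.
TOKENS = ["carrying", "instrument", "through", "head", "inch"]

# Hypothetical dependency arcs (head, dependent), chosen for illustration:
# "inch" (the measurement phrase) attaches to "instrument", not to "head".
ARCS: List[Arc] = [(0, 1), (0, 2), (2, 3), (1, 4)]


def crossing_pairs(arcs: List[Arc]) -> List[Tuple[Arc, Arc]]:
    """Return every pair of arcs whose spans partially overlap.

    Two arcs cross when exactly one endpoint of the second lies
    strictly inside the span of the first.
    """
    spans = [tuple(sorted(arc)) for arc in arcs]
    found = []
    for i, (a, b) in enumerate(spans):
        for (c, d) in spans[i + 1:]:
            if a < c < b < d or c < a < d < b:
                found.append(((a, b), (c, d)))
    return found


if __name__ == "__main__":
    for first, second in crossing_pairs(ARCS):
        words = [f"{TOKENS[x]}-{TOKENS[y]}" for x, y in (first, second)]
        print("crossing:", " x ".join(words))
```

On this toy analysis the only tangle is the one described above: the link from "instrument" to its measurement has to reach across the link that attaches "through his head" to the verb.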

Of course, different frameworks of syntactic description, and different theories about how to explain them, offer different stories about what such apparently crossing relations really are, how they arise, how to think about them. This is the source of many of the non-terminological differences among approaches to syntax. Are the issues in tangling discourse-level relations the same, or partly the same, or entirely different?

Posted by Mark Liberman at 07:48 AM

November 20, 2003

The emperor and the dialect speaker

Two tidbits that I came across in an old file of miscellaneous linguistic stuff today, while I was looking for something else:

1. Sigismund (1368-1437), Emperor of the Holy Roman Empire, gave this answer to a prelate who, at the Council of Constance in 1414, had objected to His Majesty's grammar:

"Ego sum rex romanus, et supra grammaticam."

2. A Croatian dialect saying (rendered in very rough English-based spelling, without the Croatian diacritics, some of which are hard to render on line):

Kuliko jezikou chlovig zna,
Taliko chlovig valja.

Which translates to:

However many languages a person knows,
That's how much that person is worth.

Posted by Sally Thomason at 05:15 PM

Right-justified fixed-width raw text, no padding

It is possible to construct raw English text that has
a justified right margin without employing any of the
space padding that is used by old formatting programs
like nroff that were designed to fake right justified
text using daisy-wheel printers and fixed-width fonts
reminiscent of typewriters.  To show that it is not a
problem to do this, I offer this example (and if your
browser doesn't show this as right-justified, you are
using an insane font setting -- your fixed-width font
default must be a non-fixed-width font or something).

As Mark Liberman has pointed out in connection with a
message I recently sent him that had this property, a
problem in recreational computer science is suggested
by the possibility of right-justified raw text: write
a program that takes raw text paragraphs as input and
produces right-justified versions of them, respecting
certain tolerances and line-length preferences, using
transformations that preserve rough synonymy, varying
optional punctuation and substituting synonymous word
sequences as necessary but never adding extra spaces.
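A minimal sketch of how such a program might start, offered as one possible approach rather than a solution: a depth-first search that tries each word and its alternatives and closes a line only when it is exactly the target width. The synonym table passed in is a hypothetical stand-in for a real paraphrase resource, only single-word substitutions are tried (no punctuation changes or multi-word sequences), and the search is exponential in the worst case.

```python
from typing import Dict, List, Optional, Sequence


def is_flush(lines: Sequence[str], width: int) -> bool:
    """True if every line but the last is exactly `width` characters
    and the last line is no longer than `width`."""
    if not lines:
        return True
    return (all(len(line) == width for line in lines[:-1])
            and len(lines[-1]) <= width)


def justify(words: Sequence[str], width: int,
            synonyms: Dict[str, List[str]]) -> Optional[List[str]]:
    """Search for a layout with single spaces only, swapping in synonyms
    where needed. Returns the lines, or None if no layout exists."""

    def search(i: int, current: str, done: List[str]) -> Optional[List[str]]:
        if i == len(words):
            # The final line may fall short of the margin.
            return done + [current] if current else done
        for candidate in [words[i]] + synonyms.get(words[i], []):
            extended = candidate if not current else current + " " + candidate
            if len(extended) < width:
                result = search(i + 1, extended, done)    # keep filling the line
            elif len(extended) == width:
                result = search(i + 1, "", done + [extended])  # line is flush
            else:
                result = None                             # overshoots the margin
            if result is not None:
                return result
        return None

    return search(0, "", [])


if __name__ == "__main__":
    table = {"big": ["large", "huge"]}  # hypothetical synonym table
    print(justify("a big dog ran home".split(), 11, table))
```

With a width of 11, "a big dog" is two characters short and adding "ran" overshoots, so the search backs up and tries "large" instead. The interesting part of the real problem, of course, is where the synonyms come from and how rough the synonymy is allowed to get.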

Posted by Geoffrey K. Pullum at 03:01 PM

A shitload more brevity

In Geoff Pullum's brief post about (one of) the Gricean maxims, he makes a good point. It's tough to blog briefly. Which makes me, as a new, wet-behind-the-ears apprentice underblogger, wonder just what the rules of Blog really are.

The beauty of Grice's maxims is that they seem a priori obvious. They tell us to be as informative as necessary but not more so, to convey true beliefs justified by adequate evidence, to be relevant, and to be (cutting a longer story short) brief. Just plain old common sense, right? But at least since Keenan (1974) linguists have wondered whether the maxims apply universally, and independently of culture, style and genre.

Think about bloggers, who as Geoff shows us by anti-example, are typically none too brief. In a blog, be relevant seems strangely not and standards of evidence are not in evidence. Besides, just how much of the blogosphere's great outpouring of cyber-information is truly necessary?

Grice justified his maxims as being special cases of one supermaxim - the Cooperative Principle. "Make your contribution such as is required, at the stage at which it occurs, by the accepted purpose or direction of the talk exchange in which you are engaged." Herein lies the root of the problem. In the case of blogs, uncertainty over audience make-up and mores reaches a new high. You could be anybody. In fact, many of you are not bodies at all, but automated web-crawlers. And there simply is no commonly accepted purpose or direction. Bloggers are free to make up purposes and directions as they go, to inform as much as they like about whatever they like in pretty much any way they like. A young underblogger's apprentice does not (intentionally) stray far from standard purposes and directions, and hence conventions. But deeper in blog-space, anything might go. Who is to say whether bloggers follow these rules, or these...

The Maxims of Blog

Maxim of Enlightenment:
1. Bring enlightenment.
2. Wear shades.

Maxim of Controversy:
1. Be controversial. (Occasionally say what you are certain is true. It adds credibility.)
2. Hint at that for which you have no evidence.

Maxim of Digression:
Digress. (Especially (auto)biographically. Note that Gorky was born Aleksey Maksimovich Peshkov, and "Maxim" derives from his father's name, the "-ovich" being a patronymic ending. Thus does one Maxim beget another. Hopefully, more on Russian and other naming conventions in a later log. And perhaps someone more literary or political than I will have something to say about "the father of Soviet literature and the founder of the doctrine of socialist realism," and the reason Nizhny Novgorod was for many years hard to find on a map. The Nizhny Novgorodites are still proud of Gorky as far as I can tell, but not enough to have their city bear his name. (Beaver, Utah is not named for me (or vice versa (note the embedded parenthetical - these are good)), but it is apparently the birthplace of Butch Cassidy, né Robert LeRoy Parker. So why "Butch"? Well, he once worked as a butcher. His most famous partner in crime (aka Harry Longabaugh) was nom de guerred in a reverse Gorky maneuver: as a young horse rustler the Kid spent two years in jail in Sundance, Wyoming. Not much going on in my name, except that Beaver is supposedly a case of very poor translation by English officials helping my ancestors anglicize their Polish family name, "Kaczka", which means "duck". David Duck.))

Maxim of Entropy:
1. Hyperlink obscure expressions.
2. Keep 'em guessing.
3. Use acronyms.
2. Maximize entropy. Stream consciousness. Order pizza.

Keenan, Elinor O. 1974. "The Universality of Conversational Postulates." Studies in Linguistic Variation, ed. Ralph W. Fasold and Roger W. Shuy (Washington, D.C.: Georgetown Univ. Press), pp. 255-68.
Posted by David Beaver at 04:18 AM

Edward Sapir and the "formal completeness of language"

Camille Paglia recently slammed the blogosphere for "dreary meta-commentary," a "blizzard of fussy, detached sections nattering on obscurely about other bloggers," lacking the relevance to "major issues and personalities" of Paglia's own writing. So I warn you that we're about to go meta for a few lines. But when we re-emerge into normal space, the nattering blizzard safely behind us, we'll be within sensor range of some "major issues and personalities." Major issue: the formal completeness of language. Major personality: Edward Sapir.

Language Hat pointed to John McWhorter's piece on Mohawk Philosophy Lessons and my follow-up on Sapir/Whorf. In a comment on Language Hat's post, Jonathan Mayhew wrote:

Doesn't the fact that we can discuss certain differences between languages in a single language indicate that the Whorf-Sapir hypothesis is flawed? That is, I can use English to describe Hopi thought patterns. Thus language would be more malleable, not determining thought, but elastically adapting to changes in thought.

In response, being lazy (I mean, pressed for time), I'll cut and paste question 2.2 from the final exam for my intro linguistics course in the fall term of 2000:

The American linguist Edward Sapir wrote in 1924:

The outstanding fact about any language is its formal completeness ... To put this ... in somewhat different words, we may say that a language is so constructed that no matter what any speaker of it may desire to communicate ... the language is prepared to do his work ... The world of linguistic forms, held within the framework of a given language, is a complete system of reference ...

What would it mean for this to be false? What does it mean if it is true? How can you square this quote with the fact that Sapir is also associated with the Sapir-Whorf hypothesis, crudely expressed as the slogan "language determines thought," or more precisely expressed by Sapir as:

We see and hear and otherwise experience very largely as we do because the language habits of our community predispose certain choices of interpretation ...

If you choose to answer this question, be sure that you can cite specific facts from at least two languages to exemplify your analysis.

As a mathematical analogy to illustrate how Sapir's two beliefs are not at all incompatible, consider the Fourier transform and similar information-preserving coordinate transformations. The time- and frequency-domain representations of a function (or a sequence, in the discrete case) express identical information and have the same representational potential, but very different aspects of the entity are salient in the two different representations.
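To make the analogy concrete, here is a small sketch using a naive discrete Fourier transform written out by hand (the 16-sample cosine and the helper names `dft` and `idft` are my own illustrative choices). It is meant to show two things: going to the frequency domain and back recovers the original sequence, so nothing is lost, and a pattern that is spread across every sample in one representation is concentrated in just two coefficients in the other.

```python
import cmath
import math
from typing import List, Sequence


def dft(x: Sequence[complex]) -> List[complex]:
    """Naive O(n^2) discrete Fourier transform."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum: Sequence[complex]) -> List[complex]:
    """Inverse of dft(): maps frequency coefficients back to samples."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n)
                for k in range(n)) / n
            for t in range(n)]


# A cosine completing three cycles over 16 samples: in the time domain
# its structure is smeared across the whole sequence.
signal = [math.cos(2 * math.pi * 3 * t / 16) for t in range(16)]

spectrum = dft(signal)      # frequency-domain representation
recovered = idft(spectrum)  # back to the time domain

if __name__ == "__main__":
    salient = [k for k, c in enumerate(spectrum) if abs(c) > 1e-6]
    error = max(abs(r - s) for r, s in zip(recovered, signal))
    print("non-negligible frequency bins:", salient)
    print("largest round-trip error:", error)
```

The two representations have the same expressive capacity, in Sapir's sense of formal completeness, but what is obvious at a glance in one has to be worked out in the other.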

Sapir might have been wrong -- maybe all languages aren't always expressively equivalent, and maybe language habits don't usually predispose our interpretive choices -- but he wasn't stupid.

Though I sometimes think that it takes a really smart person to hold a really stupid position. This has nothing to do with any of the participants in the present discussion, of course :-). I could say more, but I'll restrain myself, for now.

Posted by Mark Liberman at 12:05 AM

November 19, 2003

Fascist linguistics

I'm narrow-minded and naive. At least when I don't think twice. Shortly after flinging in a quick post on changing fashions in 20th-century linguistics, I realized that I surely wasn't talking about "most linguists" in either half of the century. I was talking about anglophone and especially American linguists.

In particular, Nazi linguists and their forebears surely were not on a Boasian trajectory. I haven't read Christopher Hutton's Linguistics and the Third Reich: Mother-tongue Fascism, Race and the Science of Language, but it looks interesting.

Stalinist linguistics was another category to be considered, I guess. And in the second half of the 20th century, some stuff happened in France that sometimes seems to have something to do with ideas about language :-)...

But from the point of view of the American linguistic tradition, I think what I said was right.

Posted by Mark Liberman at 02:45 PM

Sapir/Whorf: sex (pro) and space (anti)

John McWhorter's perspicuous post on false exoticism in lexical semantics made me think.

In the first half of the 20th century, most linguists were friendly to the idea that different languages divide the world up in fundamentally different ways. In the second half of the 20th century, most linguists became deeply hostile to that same notion. The primary motivation in both cases was the same: respect for "the other."

For anthropologically-minded linguists after Boas, who saw language as a cultural artifact, this respect meant examining other languages and cultures carefully, on their own terms, without European preconceptions. Being open to finding out that things might be very different, in content as well as in form. Even things that look the same may be deeply different, as Whorf argued about Hopi.

For generative linguists after Chomsky, who saw language as an instinct with a universal biological substrate, this same respect led to the view that all people and all languages are basically the same. Even things that look deeply different must turn out to be the same, if you analyze them the right way. At least, anything important about language (and language use) must be that way.

Linguists are passionate about ideas, but they tend to get *really* worked up about this one. I myself can swing either way on it, but as a dispassionate (bi-passionate?) observer, I have to say that I find most efforts of both kinds unsatisfying.

However, there are a few recent pieces of (pro- and anti-) Whorfian work that I can wholeheartedly recommend. On the pro side, Lera Boroditsky has been doing some neat stuff. Try her paper on Sex, Syntax and Semantics, for example. On the anti side, Peggy Li and Lila Gleitman have a great paper Turning the tables: language and spatial reasoning, debunking the theory (due to the MPI Language and Cognition group) that Mayan speakers have different customary spatial-coordinate systems from Dutch or English speakers. (If you don't have access to a 'Science Direct' subscription -- there's a rant for another time -- you can read a summary of the Li/Gleitman work in some commentary by Randy Gallistel here, as pointed out by Tim May in a comment over at Language Hat's place.)

Read the whole things :-)...

Posted by Mark Liberman at 10:58 AM

November 18, 2003

Mohawk philosophy lessons

I've just been reading Mark Abley's book on endangered languages, SPOKEN HERE. It is a heartfelt journalistic trip to various places where language revival efforts are going on, but the book is shot through with something that has made me itch in most of the language death books over the past several years, a rather reflexive acceptance of the Whorfian language-is-thought notion. What makes me especially uncomfortable about this tic is how easily it shades into fetishizing the very kinds of indigenous people whom Whorf's progenitors Sapir and Boas worked so hard to dignify.

Abley listens to a Mohawk speaker talking about the word KA'NIKONRIIO, "righteousness." The speaker says "You have different words. Something that is nice. Something coming very close to -- sometimes used as a word for -- law. The fact of KA'NIKONRIIO is also -- beautiful. Or good. So goodness and the law are the same." Abley muses "I had the impression that a three-hour philosophy seminar had just been compressed into a couple of minutes."

Abley's intentions are good, but I can't help wanting to ask him "OK -- explain precisely how the semantic range of that word will illuminate your life, and/or please delineate for me just how you would construct a seminar on KA'NIKONRIIO that would stand alongside one on Kant?"

I know we are not supposed to "go there." But then, let's take STAND. You STAND on a corner, you STAND rather than sit, you STAND up. One can STAND pat -- even in an argument. That is, one can STAND up for a thesis, the point can STAND to reason, one can STAND firm on it, STAND down dissenters, and when unrefuted, the point STANDs, although the debate may also end in a STAND-off. We extend that meaning to say that something cannot STAND, and when we are not inclined to let something STAND we cannot STAND it. One person STANDs in for another; a symbol can STAND for a concept. Something noticeable STANDs out. And then when we probe deeper, STAND even lurks sheltered in longer words and seasons their meanings -- we underSTAND an idea, we withSTAND a threat. Then STAND even crosses the line between verb and noun, becoming what we might call a guiding spirit pervading the language. A persistent person may finally make their last STAND. People sell food at STANDs. To performers especially in the old days, a stay in one town was a STAND. You watch a ball game in the STANDs. And then one even hears of STANDs of trees. And STAND tucks itself into other nouns as well -- one may have a unique bodily STANCE, or a STANCE upon an issue of the day.

Yet I find it hard to imagine Abley or others so fascinated by polysemies in indigenous languages readily identifying all of these uses of STAND and their relationships as evidence that English speakers have a different conception of standing than other people. Nor can I readily imagine anyone calling the uses of STAND a "mini-lesson in philosophy" -- I suspect Kant would have found little of use here, for example.

On the other hand, I find it quite easy to imagine that if the word for STAND -- say, HALUC'KIP -- were used in the same ways in an indigenous language, then some writer somewhere might be telling classrooms that this word signals some complex, dynamic relationship between bodily position, conviction, toleration, nutrition, performativity, and trees.

The subtext of Abley's approach here is a school of thought that proposes that indigenous people are "realer" than we are, more in touch with spiritual realities that "civilization" has long wrested us from to our detriment. I understand that a good while ago, this notion was a useful way to counter the myth that indigenous people are "savages." But I wonder how many people who read a book like Abley's need to learn that in 2003, and in the meantime the tradition too often smacks of clapping wildly when a child manages not to spill any of her food.

As far as I'm concerned, if it's meat and potatoes in English, then it ought not dazzle us in any other language as "special." This is the very soul of believing that all humans are equal.

Elsewhere in the book Abley marvels that the Boro language has words that mean specific things like "to love for the last time" and "to feel unknown and uneasy in a new place." Okay -- but English has a word for when two acquaintances, through sharing an experience or reminiscence, experience a sense of deeper connection for the first time: BONDING. How spiritual we English speakers must be ... then -- get this -- we have a word for the first time a couple has sexual intercourse: CONSUMMATE. And so on -- and note that these are the meanings the relatively ingenuous English speaker would give most immediately to an investigator, despite their actually being specific uses of larger terms (as many of the Boro terms likely are, for that matter).

In the same way, what is mere polysemy in English is not a philosophy seminar in Mohawk. It's just polysemy.

Certainly we should try to save as many languages as we can. But we should do this because they are fascinating in their own right -- not out of a staged wonder that indigenous languages have lots of words for things that matter to them, or that some of their words happen to have wended into highly specific connotations, or that they have words that -- wonder of wonders! -- cover several related meanings.

Posted by John McWhorter at 06:04 PM

"Then down he goes his daily Log to write."

What with all the nautical word stuff floating around, somebody needs to trace the nautical origins of "log" as in "weblog". Here's what the OED says: (You'd think it would be obvious, but still I learned something: why "quadrant"? The only log I ever threw was cylindrical, as I recall.)

6. An apparatus for ascertaining the rate of a ship's motion, consisting of a thin quadrant of wood, loaded so as to float upright in the water, and fastened to a line wound on a reel. Hence in phrases to heave, throw the log, (to sail or calculate one's way) by the log. Said also of other appliances having the same object.

7. a. Short for LOG-BOOK. A journal into which the contents of the log-board or log-slate are daily transcribed, together with any other circumstance deserving notice.

The example sentences for 7.a include

1825 H. B. GASCOIGNE Nav. Fame 79 Then down he goes his daily Log to write. 1850 SCORESBY Cheever's Whaleman's Adv. vi. (1859) 86 To fix the localities of whales' resorts by the comparison of the logs of a vast number of whalers.

I like the idea of "fixing the localities of whales' resorts." And this is the earliest explicit example of textual datamining known to me, though no doubt there is a Sumerian citation from 2100 B.C. on predicting trends in temple revenues from a vast number of cuneiform tablets.

The OED says that the etymology of log is "obscure", but (like blog) seems to involve phonetic symbolism:

[Late ME. logge; of obscure origin; cf. the nearly synonymous CLOG n., which appears about the same time.
Not from ON. lág felled tree (f. OTeut. *laeg-, ablaut-variant of *leg- LIE v.1), which could only have given *low in mod.Eng. The conjecture that the word is an adoption from a later stage of Scandinavian (mod.Norw. laag, Sw. dial. låga), due to the Norwegian timber-trade, is not without plausibility, but is open to strong objection on phonological grounds. It is most likely that clog and logge arose as attempts to express the notion of something massive by a word of appropriate sound. Cf. Du. log clumsy, heavy, dull; see also LUG n. and v. In sense 6 the word has passed from Eng. into many other langs.: F. loch, Ger., Da. log, Sw. logg.]


Posted by Mark Liberman at 05:59 PM

More words and foods from O'Brian

Following my earlier posts on linking 'which' and words, foods and characters in Patrick O'Brian's novels, here's an example that combines both themes:

'... is there anything to eat or drink in the boat?'
'Which Killick put up some milk-punch and pickled seal, sir, in case you wasn't dead,' said Bonden.

(from 'The Far Side of the World', p. 287 in the 1992 W.W. Norton paperback)

Posted by Mark Liberman at 12:04 PM

November 17, 2003


H. P. Grice included "Be brief" as one of his maxims of conversation. I find blogging with brevity quite hard, but I thought I'd attempt at least one brief post. This is it.

Posted by Geoffrey K. Pullum at 08:07 PM

Critic: writer::zoologist:elephant

A few days ago, I quoted the aphorism "asking a linguist how many languages (s)he speaks is like asking a doctor how many diseases (s)he has."

This awkward little quip made me uneasy, and made some others, ... well, let's say angry. It reminded me of another mean-spirited and doubtful analogy, attributed to Roman Jakobson as a comment on a proposal to give Vladimir Nabokov a faculty position at Harvard:

“I do respect very much the elephant, but would you give him the chair of Zoology?”

I always thought this was a piggish thing for Jakobson to say, if the stories are true. Aside from being a great writer, Nabokov was an insightful literary analyst, as competent a literary scholar as many academics before and since, and (ironically) a good enough lepidopterist to publish scientific papers that are still cited and to hold a position at Harvard's Museum of Comparative Zoology.

Another qualification for academic life was Nabokov's sense of humor, as displayed for example in this interview:

What do you want to accomplish or leave behind-- or should this be of no concern to the writer?

Well, in this matter of accomplishment, of course, I don't have a 35-year plan or program, but I have a fair inkling of my literary afterlife. I have sensed certain hints, I have felt the breeze of certain promises. No doubt there will be ups and downs, long periods of slump. With the Devil's connivance, I open a newspaper of 2063 and in some article on the books page I find: "Nobody reads Nabokov or Fulmerford today." Awful question: Who is this unfortunate Fulmerford?

While we're on the subject of self-appraisal, what do you regard as your principal failing as a writer-- apart from forgetability?

Lack of spontaneity; the nuisance of parallel thoughts, second thoughts, third thoughts; inability to express myself properly in any language unless I compose every damned sentence in my bath, in my mind, at my desk.

You're doing rather well at the moment, if we may say so.

It's an illusion.

Just as I agree (however uneasily) that an excellent linguist can perfectly well be a monoglot, I agree with the sense of Jakobson's analogy, which is that being a great writer is not in itself a sufficient qualification for a position in an academic language or literature department. But let's also agree that being a bad writer is not a necessary qualification for such a position, whatever some may think :-).

Posted by Mark Liberman at 04:56 PM

Hic merus est Thyonianus

This is another in a series of posts on linguistic aspects of Patrick O'Brian's Aubrey/Maturin novels. (It also resonates with a different series of posts on scalar predicates.)

Everyone notices all the specialized, archaic and dialect words in these books -- catharpings, syllabub, marthambles and the like. I'm struck just as forcefully by the many words still in common use whose meaning has changed, more or less, over the time and space that separates us from the British Navy in the period of the Napoleonic Wars. Sometimes the change is simple and easy to characterize, as in the case of reptile, which used to mean "crawling thing" and thus was applied to weevils and other insects. In other cases, the change in word sense is less clear, but one still feels that something is different. A good example is the use of mere.

When Jack Aubrey's dinner is delayed, he says that he may "perish of mere want." An admiral complains that many of his captains are "very mere rakes." I have the impression that O'Brian's use of mere is not only divergent from contemporary patterns, but also unusually common. (Without on-line copies of the novels, I can't conveniently test this idea, and it may only be that the word seems common because its uses are salient.)

The story about mere seems to be a combination of two senses that have passed out of modern use, along with a modern associative accretion.

The first adjectival lemma in the OED for mere is:

(obs) Renowned, famous, illustrious; beautiful, splendid, noble, excellent. In Old English also in negative contexts: notorious, infamous. (Applied to persons and things.)

In this sense, mere could sensibly be intensified -- "very notorious rakes".

The second adjectival lemma is described as

I. In more or less simple descriptive use.

1. a. Pure, unmixed, unalloyed; undiluted, unadulterated.

In particular cases, this sense will overlap with senses 4 and 5, described as representing "intensive or reductive use":

4. That is what it is in the full sense of the term qualified; nothing short of (what is expressed by the following noun); absolute, sheer, perfect, downright, veritable. Obs.
Although collocations such as ‘mere lying’ and ‘mere folly’ are still possible, these are now taken to belong to sense 5, mere being taken to mean ‘nothing more than’ rather than ‘nothing less than’.

5. a. Having no greater extent, range, value, power, or importance than the designation implies; that is barely or only what it is said to be. [...]

The OED's mere quotes give me the same out-of-kilter feeling that O'Brian's mere uses do:
1625 BACON Ess. (new ed.) 150 That it is a meere, and miserable Solitude, to want true Friends.
1719 T. D'URFEY Wit & Mirth III. 306 It blows a meer Storm.
1746 LD. CHESTERFIELD Lett. (1792) I. cviii. 295 You are a mere Oedipus, and I do not believe a Sphynx could puzzle you.
1892 Law Rep.: Weekly Notes 24 Dec. 188/1 The defendant had been maliciously making noises for the mere purpose of..annoying the plaintiffs.

The modern accretion on mere, which typically seems to be missing in the earlier usage, is the implication that the referent of the modified noun is somehow trivial or paltry: a mere trickle, a mere drop in the ocean, a mere gesture. In the last OED quote cited above, mere has the modern sense of "nothing more than" (as opposed to "nothing less than"); but "annoying the plaintiffs" may be a nontrivial accomplishment, even if it is true that the defendant had no more legally substantive purpose in mind.

I was surprised to learn that mere probably comes from Latin merus, though perhaps with some reinforcement from Germanic and Romance sources. The OED's etymology is

[Prob. partly (esp. in early use) < a post-classical Latin form (with characteristic vulgar Latin lengthening of vowels in open syllables) of classical Latin merus undiluted, unmixed, pure < the same Indo-European base as MERE v.1, and partly (in Middle English) a reborrowing of its reflex Anglo-Norman mer, meer, mier, Middle French mer (c1100 in Old French as mier).

Lewis & Short says about merus:

merus , a, um, adj. [root mar-, to gleam; cf.: marmaros, marmor, mare; hence, bright, pure] , pure, unmixed, unadulterated, esp. of wine not mixed with water:

For those who like their etymological pedantry straight up, hic merus est Thyonianus.

Posted by Mark Liberman at 09:49 AM

November 16, 2003

Corpus fetishism

A depressing tendency is apparent in a couple of the published reviews of The Cambridge Grammar of the English Language. (Don't ask me to name the reviewers. It would be unkind. A couple of the reviews published in Britain have been so stupid that the only thing a fair-minded man like me can wish upon the reviewers is that they should die in obscurity.) The tendency is to grumble that the grammar does not cite corpus sources for its examples, and to imply that this means Huddleston and I are bad people.

The charge that we did not use exclusively corpus data to illustrate points of grammar in the book is certainly true. We sometimes used examples taken from texts, even well known ones, but never with a source citation (the source was not the point). We sometimes used edited versions of sentences from texts (omitting irrelevant clutter, shortening clumsy noun phrases where they didn't matter, replacing unusual names, etc.), or sentences we heard on the radio and jotted down. And sometimes we used natural-sounding made-up examples. It depended on what would do the job best. The subject matter of chapters 16 and 17 (information packaging and anaphora) makes style and context highly relevant, so there the frequency of attested examples is very high. But in Chapter 4, basic clause structure is under discussion, and the chief need is for very short and simple examples, not rich and ornate ones.

The reviewers whine on about our policies as if there were something improper and disappointing and unrigorous about a grammarian ever making up an exemplificatory sentence. I disagree. I think we have to draw a line between sensible use of corpora and a perversion that I call corpus fetishism.

You see, if you look at what someone like Mark Liberman does with corpora (often the gigantic corpus constituted by Google and the complete copy of the entire web that it keeps in a barn in Sunnyvale), you will note (e.g. here and here, and especially here) that he uses the corpus for investigation. He probes the text that is out there to see what sentences can be found, and he changes his mind about what the facts are according to what he finds in natural use of the language that appears to emanate from native speakers and seems not to have unintentional slips in it. This is because (and here I reveal a fact about Mark's private life, but only because it is highly relevant)... Mark is not a moron. Mark knows how linguistic investigation is done, not because he once read about it in a book he got out of the library, but because he actually does it. He is not attached to the corpus as if it were the object of study, like a twisted lover obsessed with the shoe of his beloved instead of the woman who wears it.

More than one of the reviewers of The Cambridge Grammar on the Old Europe side of the Atlantic -- reviewers who were clearly not grammarians themselves -- have hinted that no facts can be trusted if they are presented in terms of examples written by the grammarian. They claim that The Cambridge Grammar should have used corpus data throughout for illustration. But this is madness.

Take the beginning of Chapter 10, "Clause type and illocutionary force" (see page 853). There we list the five basic clause types, and give an example of each. We exemplify imperative clauses by giving the example Be generous. Rodney Huddleston chose it, and I have no doubt that he thought it up. Now, using "real" data (as the corpus fetishists always say) would have been trivially easy. We could have used "Call me Ishmael." (We wouldn't even have needed to take the book down from the shelf to cite the source, would we? Moby Dick, by Herman Melville, page 1.) But the question is, why would we or should we do this?

Would it have improved our exposition of clause type? No, it would have worsened it. It would have ruined the symmetry of the set of near-minimal contrasts we give between the five clause types: You are generous for the declarative, Are you generous? for the closed interrogative, etc. Using random attested examples from wherever we could find them attested would have lessened the clarity of the illustration.

Would it ensure a convincing answer to some contested question? No. Nothing is at issue here. There is no possibility that Be generous might be ungrammatical. No point is being missed if we use that rather than a different example that came from a corpus. We just need a clear and simple illustrative example so that you can see what we mean when we say "imperative clause".

In any case, there isn't really a line here between attested and non-attested data. Check out Be generous on Google and you find it gets roughly 120,000 hits, and thousands of them are imperatives. So it is attested, though choosing a source from the thousands available would have been arbitrary. If you want a literary citation, a few seconds of experimentation with the little corpus of uncopyrighted Victorian materials I keep on my Linux box plucks out this:

Don't mind Mrs. Dean's cruel cautions, but be generous, and contrive to see him.

We could have used that, though it has an extra twelve words of clutter, bloating the example up from 12 characters to 80, a factor of 6.67, as well as messing up the symmetry with the other clause types. We could have given the citation too: "Wuthering Heights by Emily Brontë (1847)", plus a specific edition, and a page reference. The whole thing would take more than an order of magnitude more space on the page. Why didn't we do this? Because (you know what I'm going to say, don't you?)... Huddleston and I are not morons.
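For anyone who wants to check those counts, here is a minimal Python sketch. The variable names are purely illustrative, and it assumes both examples are typed with plain ASCII apostrophes and with the sentence-final period included.

```python
# Illustrative check of the character and word counts quoted above.
# Variable names are hypothetical; plain ASCII apostrophes are assumed.
short_example = "Be generous."
attested_example = (
    "Don't mind Mrs. Dean's cruel cautions, "
    "but be generous, and contrive to see him."
)

short_chars = len(short_example)          # characters in the made-up example
attested_chars = len(attested_example)    # characters in the attested example

# Words beyond the two that the short example already contains.
extra_words = len(attested_example.split()) - len(short_example.split())

# How many times longer the attested example is, by character count.
bloat_factor = attested_chars / short_chars

print(short_chars, attested_chars, extra_words, round(bloat_factor, 2))
# prints: 12 80 12 6.67
```

The citation, edition and page reference would come on top of this, which is where the rest of the extra space on the page goes.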

There are way over 10,000 numbered examples in The Cambridge Grammar, and thousands more given in passing in the text. To use only corpus examples, and to give full source citations of all examples used, would have added scores of pages (possibly a hundred pages or more) to a book that is already 1,842 pages long. You really would have to be a moron to do it. But because we didn't, we are getting accused of not being adequately responsive to the corpus revolution in modern syntax. Only two or three so far, but already I am getting tired of them. The charge is nonsense. Huddleston and I used corpora constantly. The British National Corpus was not available to us back in the 1990s, but we slaved over printouts from three well-matched and well-balanced small corpora (the Brown, LOB, and ACE corpora, representing American, British, and Australian English respectively); in addition I ran thousands of searches on the Linguistic Data Consortium's famous Wall Street Journal corpus of 1987-1989 journalism to check points of American English; we paid attention to both spoken and written English (notice, any spoken English caught by reporters turns up inside quotation marks); in every way we could think of we sought out evidence from attested linguistic material -- not just one fixed corpus serving as the only source for everything (that turns the language into a dead language -- corpus necrophilia), but a dynamically evolving collection embracing any kind of material that might be of use.

But what it was of use for was the investigation phase, when we were finding out what was true of English and what was not. To suggest that we then should have set out our illustrations only (or even largely) with unedited examples together with full text locations is just nuts.

I defend the rights of consenting adults to engage in corpus fetishism if they wish, in the privacy of their own homes. But it is a perversion, and I don't want its perverted adherents trying to tell me that The Cambridge Grammar would be a better book if its exemplifications were exclusively long and ungainly attested utterances taken unedited from corpora of text with location information attached, because it wouldn't.

Posted by Geoffrey K. Pullum at 02:14 AM

November 15, 2003

Words, foods, characters

Here's another post on linguistic aspects of Patrick O'Brian's novels. This time I'll retire quietly to the background and let O'Brian speak for himself.

The core of the books is the friendship between Jack Aubrey, an English naval officer, and Stephen Maturin, a half-Irish half-Catalonian physician. O'Brian is good at letting characters reveal themselves through their choice of words and foods, and his methods are nowhere more clearly displayed than in this scene early in the first book, set in Minorca, where Aubrey and Maturin, having narrowly avoided fighting a duel, share their first meal (From 'Master and Commander', pp. 34-35 in the 1990 W.W. Norton paperback):

They sat at a round table in a bow window that protruded from the back of the inn high above the water, yet so close to it that they had tossed the oyster-shells back into their native element with no more than a flick of the wrist: and from the unloading tartan a hundred and fifty feet below them there arose the mingled scents of Stockholm tar, cordage, sail-cloth and Chian turpentine.

'Allow me to press you to a trifle of this ragoo'd mutton, sir,' said Jack.

'Well, if you insist,' said Stephen Maturin. 'It is so very good.'

'It is one of the things the Crown does well,' said Jack. 'Though it is hardly decent in me to say so. Yet I had ordered duck pie, alamode beef and soused hog's face as well, apart from the kickshaws. No doubt the fellow misunderstood. Heaven knows what is in that dish by you, but it is certainly not hog's face. I said, visage de porco, many times over; and he nodded like a China mandarin. It is provoking, you know, when one desires them to prepare five dishes, cinco platos, explaining carefully in Spanish, only to find there are but three, and two of those the wrong ones. I am ashamed of having nothing better to offer you, but it was not from want of good will, I do assure you.'

'I have not eaten so well for many a day, nor' -- with a bow -- 'in such pleasant company, upon my word,' said Stephen Maturin. 'Might it not be that the difficulty arose from your own particular care -- from your explaining in Spanish, in Castilian Spanish?'

'Why,' said Jack, filling their glasses and smiling through his wine at the sun, 'it seemed to me that in speaking to Spaniards, it was reasonable to use what Spanish I could muster.'

'You were forgetting, of course, that Catalan is the language they speak in these islands.'

'What is Catalan?'

'Why, the language of Catalonia -- of the islands, of the whole of the Mediterranean coast down to Alicante and beyond. Of Barcelona. Of Lerida. All the richest part of the peninsula.'

'You astonish me. I had no notion of it. Another language, sir? But I dare say it is much the same thing -- a putain, as they say in France?'

'Oh no, nothing of the kind -- not like at all. A far finer language. More learned, more literary. Much nearer the Latin. And by the by, I believe the word is patois, sir, if you will allow me.'

'Patois -- just so. Yet I swear the other is a word: I learnt it somewhere,' said Jack. 'But I must not play the scholar with you, sir, I find. Pray, is it very different to the ear, the unlearned ear?'

'As different as Italian and Portuguese. Mutually incomprehensible -- they sound entirely unlike. The intonation of each is in an utterly different key. As unlike as Gluck and Mozart. This excellent dish by me, for instance (and I see that they did their best to follow your orders), is jabalí in Spanish, whereas in Catalan it is senglar.'

'Is it swine's flesh?'

'Wild boar. Allow me . . .'

'You are very good. May I trouble you for the salt? It is capital eating, to be sure; but I should never have guessed it was swine's flesh. What are these well-tasting soft dark things?'

'There you pose me. They are bolets in Catalan: but what they are called in English I cannot tell. They probably have no name -- no country name, I mean, though the naturalist will always recognize them in the boletus edulis of Linnaeus.'

[Note 11/16/2003: I've changed the title of this post to correspond better to its content...]

Posted by Mark Liberman at 09:47 AM

Any Gate

Mark's item on which in Patrick O'Brian's series highlights the most striking (to me) syntactic feature of the books. But I also like O'Brian's seamen's use of any gate instead of anyway. Too bad we can't hear the pronunciation: does any gate also have compound pronunciation, like anyway? As opposed, that is, to a noun phrase any way in a sentence like I couldn't find any way to get the tree out of my car?
Of course, any gate might occur in the movie `Master and Commander', but my confidence in Hollywood's linguistic sophistication is not immense.

Posted by Sally Thomason at 06:48 AM

Zipf and the general theory of wrinkling

This enjoyable and informative article has no obvious connections to language, but the link to it at A.L.D. is entitled "Don't botox the universe", which at least provides a good example of a clearly denominal verb, as well as a clue about the world that the A.L.D. headline writer lives in :-).

A few years ago, when I was researching Zipf's Law and similar power-law phenomena for a lecture, I stumbled over the wonderful Chicago crumpling group's web site, where you can find things like the Universal Power Law in the Noise from a Crumpled Elastic Sheet. Music of the (crumpled mylar) spheres...

Posted by Mark Liberman at 06:02 AM

November 14, 2003

Linking "which" in Patrick O'Brian

Because the movie "Master and Commander: The Far Side of the World" opens today, at least in this part of the world, I'm starting a small series of posts about linguistic aspects of Patrick O'Brian's Aubrey-Maturin novels. (The movie's name combines the titles of the first and tenth books in the series, on the two sides of the colon -- I'm not sure what this means about the plot).

Obscure words -- naval, historical, scientific, dialectal -- are the raisins in the spotted dog of O'Brian's prose. Dean King's A Sea of Words is a good present for an O'Brian fan. Certainly I was happy to get it from John Fought for my birthday a few years ago. However, there are some things that it doesn't help with. Here's a passage that illustrates the point (from "The Far Side of the World", p. 78 in the 1992 W.W. Norton paperback; the speakers are Jack Aubrey and his steward Preserved Killick):

   'What luck?' asked Jack.
   'Well, sir,' said Killick, 'Joe Plaice says he would venture upon a lobscouse, and Jemmy Ducks believes he could manage a goose-pie.'
   'What about pudding? Did you ask Mrs Lamb about pudding? About her frumenty?'
   'Which she is belching so and throwing up you can hardly hear yourself speak,' said Killick, laughing merrily. 'And has been ever since we left Gib. Shall I ask the gunner's wife?'
   'No, no,' said Jack. No one the shape of the gunner's wife could make frumenty, or spotted dog, or syllabub, and he did not wish to have anything to do with her.

King's lexicon informs us that lobscouse is "A common sailor's dish consisting of salted meat stewed with vegetables, spices and crumbled ship's biscuit"; that frumenty is "A porridgelike dish made of wheat boiled in milk and seasoned with cinnamon, sugar and sometimes dried fruits"; that spotted dog or spotted dick is "A suet pudding containing currants or raisins (the spots)"; and that syllabub is "A drink, or dessert if gelatin is added, made of sweetened milk or cream mixed with wine or liquor."

So far so good. But what about which? Killick's use in this passage, typical of him and of other sailors of his class in the books, seems distinctly non-standard. It connects a descriptive clause ("she is belching ...") to the noun phrase that it describes ("Mrs Lamb"), across two prepositional phrases and a conversational break. The function is roughly like a linking phrase such as "with respect to her".

Alas, there is no entry for "which" in "A Sea of Words".

The OED comes through, more or less, in section 14.a. of its entry for which:

14. a. (as pron. or adj.) With pleonastic personal pronoun or equivalent in the latter part of the relative clause, referring to the antecedent, which thus serving merely to link the clauses together: (a) with the pers. pron. (or the antecedent noun repeated) as subj. or obj. to a verb (principal or subordinate) in the relative clause, which is usually complex; [...]

Among the quotes the OED gives for this usage: "1690 LOCKE Govt. II. v. §42 (1694) 196 Provisions..which how much they exceed the other in value,..he will then see. 1726 G. SHELVOCKE Voy. round World Pref. p. vii, Scandalous and unjust Aspersions..which, how far I deserve them, I shall leave to the candid opinion of every unprejudiced Reader. 1768 STERNE Sent. Journ. II. Fragment, The history of myself, which, I could not die in peace unless I left it as a legacy to the world."

It's nice to know that Locke and Sterne used a version of this construction. But why "linking which" works as a marker of lower-class speech in O'Brian's novels is a question that the OED doesn't answer.

Which it does contain the interesting truth about some strange uses of mere in O'Brian's books, however, as I'll explain tomorrow :-)...

Posted by Mark Liberman at 06:28 PM


In yesterday's New York Times, an article about anti-US protests (or anti-Iraq war protests) in France was accompanied by a picture of protesters in Marseille(s), in front of a McDonald's restaurant. One of them was holding a sign that said [sic]:


This suggests that the French government's campaign against English loanwords is not making much headway. `Boycott' comes from the ostracism of an unpopular English land agent in Ireland, a Mr. Charles C. Boycott (d. 1897). Oh, well, at least it's not a loanword from AMERICAN English.

Posted by Sally Thomason at 05:17 PM

linguist:language::doctor:disease ?

Mark Mandel's home page includes a version of this aphorism, which he attributes to Lynne Murphy:

"Asking a linguist how many languages (s)he speaks is like asking a doctor how many diseases (s)he has."

I think I don't agree, but I'm not sure why not. Maybe I don't like to think of a language as being analogous to a disease, pace William S. Burroughs and Laurie Anderson :-).

[Update: I've gotten some complaints about being too terse and even careless in this post, notably from the redoubtable Language Hat. So I'll amplify a bit -- and of course our standing offer to refund your subscription fees continues to apply!

As I wrote in LH's comments section, I have mixed feelings about this aphorism because I feel that everyone (and especially professional linguists) should use as well as study multiple languages, just as a matter of principle; but I also recognize that it's possible for a monoglot to be a first-rate linguistics professional, and that command of several languages is often in any case irrelevant to the contributions that polyglot linguists make.

Thus Bill Labov is not a monoglot, as it happens, but I don't believe that any of his major contributions depend on his speaking or reading any languages other than English.

So in some sense I do agree that asking a linguist "how many languages do you speak" is making an essential mistake about what linguistics is. Even though I also think that the answer should not be "one."

This screed by "Spengler" in the Asia Times claims that inadequate attention to multilingualism is "why America is losing the intelligence war", and asserts that "[t]he average Hungarian headwaiter had a greater command of languages than today's doctoral students in comparative literature at American universities." I wonder if this is true, or just an uninformed assertion made for effect. I don't know any Hungarian headwaiters at the moment, but the comp lit grad students that I've met recently are reasonably polyglot. (link from A.L.D.)

I do accept the argument that the U.S. would be better off if more of its young people knew more languages, though!]

Posted by Mark Liberman at 11:42 AM

StoryCorps copyright?

StoryCorps looks really interesting. Their "storybooth" in Grand Central Station went live on Oct. 23, according to their website. And the five (streaming MP3) samples that they've put up are charming.

But from their website, I can't tell what the IPR (intellectual property rights) for the recorded stories will be. They say that the recordings will be deposited in the Library of Congress American Folklife Center, but that doesn't tell us anything about who owns them and what sort of distribution licenses will be offered. Given the general approach of the project, I should think that some sort of Creative Commons license would be appropriate. If so, why not say so?

StoryCorps' founder, Dave Isay, is best known for his work for NPR, but this is not necessarily a good sign from the point of view of IPR openness. At the LDC, we've been engaged for some years in trying to mediate access for the research community to archives of audio material, including broadcast stuff. And NPR has been one of the very tightest and least cooperative of the organizations that we've dealt with, to the point that we've basically given up trying to get permissions from them for anything. By comparison, we've done fine with broadcasters such as ABC and CNN (and among non-profits PRI).

My own current guess is that StoryCorps plans to hold the copyrights and to sell various products from their archives to generate revenue, as NPR does. That's their right, but if it's their intent, they should let people know. I feel that these questions are particularly relevant because StoryCorps says that:

We've modeled StoryCorps—in spirit and in scope—after the Works Progress Administration (WPA) of the 1930s, through which oral-history interviews with everyday Americans across the country were recorded. These recordings remain the single most important collection of American voices gathered to date. We hope that StoryCorps will build and expand on that work, becoming a WPA for the 21st Century.

FYI, here is the website for the WPA's Folklore Project archives. What the Library of Congress says about copyright on these materials is:

The Library of Congress is not aware of any copyright in the documents in this collection. As far as is known, the documents were written by U.S. Government employees. Generally speaking, works created by U.S. Government employees are not eligible for copyright protection in the United States, although they may be under copyright in some foreign countries. The persons interviewed or whose words were transcribed were generally not employees of the U.S. Government. Privacy and publicity rights may apply.

But StoryCorps is NOT a government project. Even less than NPR is.

[Note: the American Folklife Center's press release says that "The Library’s folklife specialists will be responsible for ensuring that the collection is preserved in digital form, appropriately indexed and cataloged, and then made accessible to the public at the American Folklife Center and on the Library’s Web site at" But this doesn't tell us things like whether the transcripts will be available and indexed, whether the audio will be downloadable or just available in streaming form, whether others will be able to get their own copy of (all or part of) the collection, for research or for novel interactive applications (such as a concordance), whether DRM of some future kind will or won't be involved, etc. etc. An appropriate Creative Commons license would implicitly settle these questions in a general way.]

Posted by Mark Liberman at 06:58 AM

November 13, 2003

Language AND syntax??

In the New York Times the other day (Science Times, 11/11/03), James Gorman wrote this in an article entitled `Are animals smarter than we think?':

`Over the past few decades it has become clear that the great apes can learn some aspects of language and syntax. Parrots and dolphins can do the same.'

Curious linguists will want to know why Gorman's implicit definition of `language' excludes syntax; they will also wonder how he determined that the great apes, parrots, and dolphins can clearly learn some syntax. The notion that syntax isn't part of language will certainly surprise all linguists, and the various claims that the great apes (much less parrots and dolphins) can learn syntax are at best highly controversial. A sizable vocabulary, yes, at least as far as a few superstar great apes are concerned; syntax, though -- that's another matter entirely.

Posted by Sally Thomason at 12:37 PM

Borges on metadata

A couple of days ago, I sketched the reasons why I don't think that the semantic web and similar efforts will get rid of the need for automatic information extraction from text. In thinking through these questions, we should ponder (or at least enjoy) what Jorge Luis Borges had to say on the subject, some time around 1929. Even if you could care less about metadata and information extraction, you should treat yourself to Borges' essay. (Note: there is an English translation down below the Spanish version).

The end of the essay:

Leaving hopes and utopias apart, probably the most lucid ever written about language are the following words by Chesterton: "He knows that there are in the soul tints more bewildering, more numberless, and more nameless than the colours of an autumn forest... Yet he seriously believes that these things can every one of them, in all their tones and semitones, in all their blends and unions, be accurately represented by an arbitrary system of grunts and squeals. He believes that an ordinary civilized stockbroker can really produce out of his own inside noises which denote all the mysteries of memory and all the agonies of desire"

Borges' piece is entitled El Idioma Analítico de John Wilkins. When Borges wrote it, Wilkins and his "Essay towards a real character and a philosophical language" had largely been forgotten. Indeed Borges starts by observing that the 14th edition of the Encyclopaedia Britannica (published in 1929) had abandoned the entry on Wilkins, which was only 20 lines long in the previous edition.

Wilkins is back in the limelight today, as he is an important character in Neal Stephenson's massive new historical novel Quicksilver, which has sparked a lot of interest in the intellectual history of the 17th and 18th centuries among digerati who might otherwise not have realized that they cared about anything before 1995.

I put up the Idioma Analítico page in the winter of 2000, when a group of people from around the world (mainly the U.S. and Europe) were working through the ideas that turned into OLAC, the Open Language Archives Community. The OLAC Metadata set is a modest set of extensions to the Dublin Core, useful for cataloguing language-related archives of various types. Early OLAC suffered the usual stresses caused by enthusiasts inspired by the vision of a Philosophical Language. I thought that a small dose of Borges might help us avoid biting off more ontology than the project implementation could plausibly chew and digest.

Posted by Mark Liberman at 12:18 PM

November 12, 2003

Under the radar screen?

Geoff Pullum recently reported on the efforts of the Rockridge Institute, of which linguist George Lakoff is a founding member. The following item, sent to me by my father, Art Potts, is a fine example of what Rockridge analysts mean when they talk about how deftly the Republican party manipulates language to frame the issues.

In the November 11, 2003, Washington Post, and on the Internet here, there is an article about multibillionaire George Soros committing millions of his own money to oust the Bush administration. One of Soros's major pledges is to an organization which, according to the article, is described by a representative of the Republican National Committee as "an unregulated, under-the-radar-screen, shadowy, soft-money group." And what al Qaeda-like paramilitary terrorist organization is the RNC describing? None other than

It would be a sign that Rockridge's efforts are working if the phrase "unregulated, under-the-radar-screen, shadowy, soft-money group" started rolling off the tongues of Democratic hopefuls whenever the Republican National Committee was mentioned in the news.

The actual statement is worth pondering for all its unstated content: "It's incredibly ironic that George Soros is trying to create a more open society by using an unregulated, under-the-radar-screen, shadowy, soft-money group to do it," Republican National Committee spokeswoman Christine Iverson said. "George Soros has purchased the Democratic Party."
Posted by Christopher Potts at 01:40 PM

More on the 5 exabyte mistake

The canard that "Five exabytes... is equivalent to all words ever spoken by humans since the dawn of time" was repeated in this 11/12/2003 NYT article by Verlyn Klinkenborg. It's amazing how people pass this stuff around without checking it or thinking it through: Eskimo snow words all over again, though on a much smaller scale (so far).

[The Dutch periodical OnzeTaal linked to the NYT article and also to my earlier post on the topic -- maybe the internet culture can start to keep these small thoughtless quantitative "idées reçues" in check.]

Klinkenborg is struck by the claim that telephone traffic in the year 2002 "added up to about 17 exabytes, more than three times all the words ever spoken by humans until that point". If the sum given for 2002 telephone traffic is correct (and since I haven't checked that, I'm not sure it's true), then a plausible estimate of "all the words ever spoken" would be more than 2,000 times greater. I'm not sure whether Klinkenborg would find that comforting or upsetting, in the abstract, but concretely it would cancel the premise of today's NYT piece.
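The ratios in the paragraph above are easy to check by hand, but here is a minimal Python sketch of the arithmetic. It uses only the figures quoted in this post, and it assumes that "2,000 times greater" is measured against the 17-exabyte telephone figure, which is how the sentence reads. The variable names are made up for illustration.

```python
# Figures quoted in the post, in exabytes (EB).
TELEPHONE_2002_EB = 17        # reported 2002 telephone traffic
CLAIMED_ALL_SPEECH_EB = 5     # the repeated "all words ever spoken" figure

# Assumption: "more than 2,000 times greater" is relative to the
# telephone figure, so this is a lower bound, not an estimate.
PLAUSIBLE_MULTIPLIER = 2000

# The claim Klinkenborg repeats: telephone traffic vs. the 5 EB figure.
claimed_ratio = TELEPHONE_2002_EB / CLAIMED_ALL_SPEECH_EB

# Lower bound on a plausible size for all speech, in exabytes.
plausible_all_speech_eb = TELEPHONE_2002_EB * PLAUSIBLE_MULTIPLIER

# How far the 5 EB figure falls short of that lower bound.
shortfall_factor = plausible_all_speech_eb / CLAIMED_ALL_SPEECH_EB

print(f"telephone / claimed speech: {claimed_ratio:.1f}x")
print(f"plausible lower bound: {plausible_all_speech_eb} EB")
print(f"claimed figure too small by at least: {shortfall_factor:.0f}x")
```

On these numbers, the "more than three times" comparison is just 17 divided by 5, while the plausible lower bound comes out in the tens of thousands of exabytes, so the 5-exabyte figure would be too small by a factor in the thousands.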

[Note: it's curious how easily we linguists tend to fall into the role of "insect dry discoursing gammer / [who] tells what's not rhyme and what's not grammar." Well, someone's got to do it, I guess. But I prefer inspiration to deflation, I really do!]

[Update: Klinkenborg is relying on this report from Berkeley, about which I might say more when I've had a chance to go through it. The Berkeley report supports the assertion that 17 exabytes of telephone traffic flowed in 2002, but on a quick read, I did not find anything connected to Klinkenborg's belief that all prior human talk would amount to 5 exabytes.

Though this (false) idea apparently was not mentioned in the Berkeley report (?), it is not something that Klinkenborg just made up. Hit google with the search string
"5 exabytes" spoken
and you'll get 215 repetitions of this idea, which is what might be called an "urban legend statistic". It gets started somewhere and then spreads by memesis, almost completely independent of whether or not it has any factual basis. As it pretty clearly does not, in this case -- I looked at the first few pages of google hits above, and a sample of the later ones, and I couldn't find any explanation or justification of the figure, correct or incorrect. The bare numerical assertion is just cited as if it were common knowledge among the well-informed.]

Posted by Mark Liberman at 10:10 AM

A Veterans Day story

I planned to post this last night for Veterans Day, but by the time I got finished cleaning up after study break, I was too tired. There's no real linguistic relevance -- though I did manage to insert a linguistic link! -- but I'm going to indulge myself with a bit of personal blogging in this professional space. I promise not to do it very often.

In 1969, I was drafted and sent to Vietnam. I wasn't a big fan of the war. In fact, truth be told, I lost my student deferment because I was kicked out of college for antiwar activities on campus. And while I was in the army, I generally said what I thought about the war. Most of what I said was just assimilated into the general stream of army complaining, I think, so that some people agreed with me, and some disagreed, but I didn't get into as much trouble over this kind of discussion as you might expect. Except once.

I was stationed at a little camp near Pleiku, in the central highlands near where Vietnam, Laos and Cambodia come together. One afternoon, after I'd been there a couple of months, my friend Maddog asked me to take some paperwork over to somebody on the other side of the camp, in a living area where I'd never been before. It seems like every Army unit in those days had to have exactly one guy nicknamed "Maddog", usually because he was especially mild-mannered. You could call this reverse sarcasm, though of course in a military culture, "mad dog" is a kind of a compliment, so I guess David Beaver's theory works.

Anyhow, when I got to where Maddog sent me that afternoon, I saw that another guy I'd met a few weeks before also lived there. I'll call him "Ray". Ray was from rural Idaho, and his political views were far right. He read me passages from John Birch Society pamphlets; he saw fluoridation of water supplies as an obviously unacceptable intrusion of the government into individuals' lives; he thought it was plausible that WWII had been caused by Jewish bankers and that Martin Luther King Jr. was a communist agent. We had argued for a couple of hours one evening, and we didn't agree about anything.

Off in a corner of the hootch, a half a dozen NCOs were drinking. One of them, a sergeant from one of the other platoons, came over and started giving me a hard time. "Hey, college boy, I hear you're one of those hippie pinko protestors." He was pretty drunk, and he clearly wanted to start a fight. He kept pushing me in the chest, and taunting me. "You some kind of pacifist, you pussy? You just gonna take this from me? Well, faggot?" and so on. Meanwhile, his drinking buddies gathered around us in a circle. I didn't know any of them, and some of them were starting to echo his taunts. Even though this sergeant wasn't in my chain of command, and was probably too drunk to be much of an opponent, I was pretty sure that fighting with him would be a really bad choice. But the way out was blocked, and not fighting was starting to look like a recipe for getting the crap kicked out of me by the whole group.

Out of the corner of my eye, I saw Ray go over to his locker. He reached in and pulled out the biggest revolver I've ever seen in my life. I'm not any kind of gun expert, so I'll just say that it seemed like it was about a foot and a half long, with a bore the size of my thumb. Rather than a choice between fighting or just taking a beating, it looked like my options had narrowed to begging for my life or just dying with dignity. Ray -- who was a PFC like I was -- pushed his way through the circle of drunk NCOs and faced the sergeant and me. He raised the pistol to eye level, muzzle up, and cocked it. Then he looked at the sergeant and said:

"This man is an American. He has a right to believe what he wants, and say what he believes. Now back off!"

I thought, "Ray, wait a minute, what does my being an American have to do with it? Shouldn't everybody have those rights?"

But what I said was "thanks, Ray!"

The crowd of drunk NCOs just kind of melted away, like the wicked witch of the west. I don't think it was the gun -- though that helped emphasize the point -- I think it was what Ray said. It was strictly against regulations for him to have that pistol, and the NCOs could have taken the whole thing as some kind of mutiny and escalated it to another level. But they were ashamed of themselves for acting in such an un-American way, once somebody pointed it out from their side of the political fence.

Being in the army left me with a kind of emotional commitment to political pluralism, and this episode was a big part of it. So that's my story for Veterans Day.

[Update 11/15/2003: this book review gives a pretty good picture of what I mean by "pluralism".]

Posted by Mark Liberman at 08:29 AM

Adultery notes from all over

Margaret Marks at Transblawg explores the lexical semantics and legal status of adultery across cultures. Well, at least in England and Wales by comparison to New Hampshire. She quotes the Oxford Dictionary of Law:

adultery. An act of sexual intercourse between a male and female not married to each other, when at least one of them is married to someone else. Intercourse for this purpose means penetration of the vagina by the penis; any degree of penetration will suffice (full penetration is not necessary). … in addition to the adultery, the petitioner must show that she or he finds it intolerable to live with the respondent.

and observes that "adultery is not usually investigated to that degree - the respondent usually signs a confession, and evidence by a private detective certainly doesn’t substantiate the degree of penetration. However, in English law, there is the alternative ground of divorce of ‘unreasonable behaviour’ with no worse consequences."

Marks also explains that "unreasonable behavior" is shorthand for "behaviour by the respondent that the petitioner cannot reasonably be expected to put up with." She tells us that "[h]omosexual relationships are not adultery, but they can be unreasonable behaviour (not actually unreasonable, but such that the other spouse has grounds for divorce)."

This is a curious bit of adjectival semantics: a homosexual relationship is "not actually unreasonable [behavior]", but would be construed as "unreasonable behavior" for the purposes of a divorce case. Is this just because the legal definition is not the ordinary language definition? Or is it because reasonableness is always relative to an evaluator and a situation? Some behavior can seem reasonable to me and not to you, or reasonable in the shower and distinctly eccentric in the grocery store; and the two dimensions interact, so that you and I might have quite different ideas of what is reasonable in a grocery store. Or in a marriage.

Alternatively, is the point that behavior always carries its context with it, so to speak? It's certainly hard to tell where behavior stops and context takes over -- it's plausible that (say) belting out "Wild Thing" in the shower and in the cereal aisle are actually different behaviors, one reasonable and the other not. And then there's the "be expected" part: expected by whom? by the judge? by the petitioner's friends and relations? by the community at large?

Lexical semantics is confusing, and in a legal context it's both more complicated and more consequential. I suppose that lawyers and semanticists learn how to come to clear conclusions about these things -- I'm glad I'm just a simple phonetician.

[A quibble: the title of Margaret Marks' post ("Definition of adultery in U.S. divorce law") may have a failing presupposition. Since divorce, like marriage, is a matter of state rather than federal law, in some sense there is no such thing as "U.S. divorce law," unless the phrase is construed to mean "the divorce laws of the various states of the U.S." Marks is so careful in her investigation of the facts, and so precise in her use of language, that I'm sure that's what she meant :-).]

Posted by Mark Liberman at 05:14 AM

November 11, 2003

Ontologies and arguments

Back in the fall of 2001, some of us at Penn put together a proposal to the National Science Foundation for research on automatic information extraction from biomedical text. Most of the proposal was about what we planned to do and how we planned to do it. But in the atmosphere of two years ago, we felt that we also had to say a few words to validate the problem itself, the problem of creating software to "understand" ordinary scientific journal articles. This was not because the task is too hard (though that is a reasonable fear!), but because some NSF reviewers might have thought that it was about to become too easy. After all, the inventor of the World Wide Web was evangelizing for another transformative vision, the Semantic Web, which promised to make our problem a trivial one.

As we wrote in the proposal narrative:

Some believe that IE technology promises a solution to a problem that is only of temporary concern, caused by the unfortunate fact that traditional text is designed to convey information to humans rather than to machines. On this view, the text of the future will wear its meanings on its sleeve, so to speak, and will therefore be directly accessible to computer understanding. This is the perspective behind the proposed "Semantic Web" [BLHL01], an extension of the current hypertext web "in which information is given well-defined meaning," thereby "creat[ing] an environment where software agents . . . can readily carry out sophisticated tasks for users." If this can be done for job descriptions and calendars, why not for enzymes and phenotypes?

In the first place, one may doubt that the Semantic Web will soon solve the IE problem for things like job descriptions. The Semantic Web is the current name for an effort that began defining the W3C's Resource Description Framework (RDF) more than five years ago, and this effort has yet to have a significant general impact in mediating access to information on the web. Whatever happens with the Semantic Web, no trend in the direction of imposing a complete and explicit knowledge representation system in biomedical publishing is now discernable. In contrast, we will argue that high-accuracy text analysis for the biomedical literature is a plausible goal for the near future. Partial knowledge-representation efforts such as PubGene's gene ontology (GO)[Con00] will help this process, not replace it. The technology needed for such text analysis does not require HAL-like artificial intelligence, but it will suffice to extract well-defined patterns of information accurately from thousands or even millions of documents in ordinary scientific English.

The past two years have confirmed this perspective. Even in bioinformatics, where some might think that everything should be clear and well defined, the attempt to provide a universal ontology (and a universal description language based on it) is not even close to providing a basis for expressing the content of a typical scientific article in the biomedical field. Don't get me wrong -- the kind of information extraction that we (and many others) are working on is certainly possible and valuable. But it's all interpretive and local, in the sense that it creates a simple structure, corresponding to a particular way of looking at some aspect of a problem (like the relationships among genomic variation events and human malignancies), and then interprets each relevant chunk of text to fill in pieces of that structure. It doesn't aim to provide a complete representation of the meaning of the text in a consistent and universal framework.
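To make "interpretive and local" concrete, here is a toy Python sketch of that kind of extraction: one small structure for one way of looking at the problem (a genomic variation event linked to a malignancy), plus a pattern that fills it in from matching chunks of text. Everything in it is an assumption for illustration. The record type, the regular expression and the example sentences are invented, and real systems rely on far more than a single regex.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class VariationMalignancy:
    """One filled-in record: a local, partial view of a sentence."""
    gene: str
    variation: str
    malignancy: str


# Hypothetical pattern: "<variation>(s) of/in <GENE> is/are ... <malignancy>".
# It knows nothing about the rest of the text and does not try to.
PATTERN = re.compile(
    r"(?P<variation>(?i:mutation|deletion|amplification))s? (?:of|in) "
    r"(?P<gene>[A-Z][A-Z0-9]+) (?:is|are|was|were) "
    r"(?:associated with|found in|observed in) "
    r"(?P<malignancy>[a-z ]+?(?:cancer|carcinoma|leukemia|lymphoma))"
)


def extract(text: str) -> List[VariationMalignancy]:
    """Fill one record per matching chunk; ignore everything else."""
    return [
        VariationMalignancy(
            gene=m["gene"],
            variation=m["variation"].lower(),
            malignancy=m["malignancy"],
        )
        for m in PATTERN.finditer(text)
    ]


if __name__ == "__main__":
    sample = (
        "Mutations of TP53 are associated with breast cancer. "
        "Biologists are still arguing about what a gene actually is. "
        "Amplification of MYC was observed in lung carcinoma."
    )
    for record in extract(sample):
        print(record)
```

The middle sentence of the sample produces no record at all, which is the point: the extractor fills in pieces of one simple structure and makes no attempt to represent the meaning of the whole text.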

Recently, Clay Shirky has written an interesting general critique of the Semantic Web concept that is much more radical than what we dared to put into the staid columns of an NSF proposal. He starts with a bunch of stuff about syllogisms, which rather confused me, since syllogisms have been obsolete at least since Frege published his Begriffsschrift in 1879, and I haven't heard that the Semantic Webbers are trying to resurrect them. But Shirky ends with some ideas that I think are clear and true:

Any attempt at a global ontology is doomed to fail, because meta-data describes a worldview. The designers of the Soviet library's cataloging system were making an assertion about the world when they made the first category of books "Works of the classical authors of Marxism-Leninism." Charles Dewey was making an assertion about the world when he lumped all books about non-Christian religions into a single category, listed last among books about religion. It is not possible to neatly map these two systems onto one another, or onto other classification schemes -- they describe different kinds of worlds.

Because meta-data describes a worldview, incompatibility is an inevitable by-product of vigorous argument. It would be relatively easy, for example, to encode a description of genes in XML, but it would be impossible to get a universal standard for such a description, because biologists are still arguing about what a gene actually is. There are several competing standards for describing genetic information, and the semantic divergence is an artifact of a real conversation among biologists. You can't get a standard til you have an agreement, and you can't force an agreement to exist where none actually does.

Shirky points out the connection between the semantic web and classical AI, which seemed to be dead but is to some extent reincarnated in the semantic web and the many things like it that are out there.

There's an interesting question to be asked about why people persist in assuming that the world is generally linnaean -- why mostly-hierarchical ontologies are so stubbornly popular -- in the face of several thousand years of small successes and large failures. I have a theory about this, which this post is too short to contain :-) ... It has to do with evolutionary psychology and the advantage of linnaean ontologies for natural kinds -- that's for another post.

[Thanks to Uncle Jazzbeau for the reference to Shirky's article.]

[Unnecessary pedantic aside: it seems that the inventor of the Dewey Decimal system was Melvil Dewey, not "Charles Dewey" as Clay Shirky has it. Google doesn't seem to know any Charles Deweys in the ontology trade. I have to confess that I always thought it was John Dewey who designed the Dewey Decimal System, and I'm disappointed to find out that it was Melvil after all.]

[Update: Charles Stewart pointed me to a reasoned defense of the Semantic Web by Paul Ford. In effect, Ford argues that there is a less grandiose vision of the semantic web, according to which it just provides a convenient vehicle for encoding exactly the kind of local, shallow, partial semantics that IE ("information extraction") aims at.

Ford closes by saying that "on December 1, on this site, I'll describe a site I've built for a major national magazine of literature, politics, and culture. The site is built entirely on a primitive, but useful, Semantic Web framework, and I'll explain why using this framework was in the best interests of both the magazine and the readers, and how its code base allows it to re-use content in hundreds of interesting ways." I'll be interested in seeing that, because it's exactly what I haven't seen from semantic webbers up to now: any real applications that make all the semantic web infrastructure look like it works and is worth the trouble.]

[Update 11/16/2003: Peter van Dijck has posted an illustrated guide to "Themes and metaphors in the semantic web discussion".]

Posted by Mark Liberman at 09:32 AM

Broken English from Ahnold?

The San Francisco Chronicle published a piece in early October 2003 about a group of Democrats who had been watching Arnold Schwarzenegger's early film "Pumping Iron" during the run-up to the gubernatorial election, just for laughs (before they realized they were laughing at the next Governor of the great State in which I live). They quoted this utterance as an example of what they called the "broken-English classics" of Schwarzenegger's speech:

I threw up many times while I'm working out. But it doesn't matter. It was worth it.

Not exactly `broken English', I would have said. But it does diverge in one way from a feature of normal Standard English. I thought of using it as an exercise in a grammar textbook I'm writing with Rodney Huddleston, but Rodney vetoed it as disrespectful, so we won't be using it. (It may be just as well; Schwarzenegger, I realized after the election like one waking from a dream, is now going to be an ex officio member of the Board of Regents of my university.)

For the answer to the exercise you have to continue reading.

The answer to the exercise is that given the main clause I threw up many times, which is inflected in the preterite (here expressing past time reference), the subordinate clause, which has the same time reference, would normally also be in the preterite: it would be when I was working out.

English actually has backshifting of tenses in subordinate clauses governed by a verb in the preterite, even when there is no past time reference in the subordinate clause. Consider this example:

I told Stacy that Kim had blue eyes.

Kim's blue eyes are not in the past; the eyes are still blue now. But what we do in English is shift the subordinate clause verb into preterite inflection (had blue eyes instead of has blue eyes) as if to respect the choice of tense in the main clause. It's optional here: it's also grammatical to say I told Stacy that Kim has blue eyes. (The full story about backshifted preterites, which is mouthwateringly subtle and rich, is told in Rodney Huddleston's spectacular chapter on the verb in The Cambridge Grammar of the English Language, pp. 151-158.)

The thing about the Schwarzenegger quote is that the preterite in the subordinate clause is obligatory: the subordinate clause actually does refer to past time, so the present in while I'm working out really does sound a bit weird.

Not broken English, though. A small departure from idiomatic standard English, and a use of tense that would be grammatical in some languages. (There are languages in which a main clause tense is sufficient, and a subordinate clause can get away with no tense at all.) What Schwarzenegger said was not all that far from grammaticality. Definitely not a linguistic infraction that would get a man thrown off the Board of Regents of a university or anything. But then journalists always tend to exaggerate rather wildly when they say anything about grammar.

Posted by Geoffrey K. Pullum at 01:17 AM

November 10, 2003

An / Anne / Ian

I once lived in Somerville, MA, next to a woman who introduced herself to me as "Ian." I thought, how interesting, what was once a man's name has been generalized across gender boundaries. Then she introduced me to her husband Danny, rhyming with peony. Anyhow, I thought about "Ian" when I read this Monty Python skit, which I don't recall having seen on TV:

Chris: Good evening. Tonight: "dinosaurs". I have here, sitting in the studio next to me, an elk.
Oh, I'm sorry! Anne Elk - Mrs Anne Elk
Anne: Miss!
C: Miss Anne Elk, who is an expert on di...
A: N' n' n' n' no! Anne Elk!
C: What?
A: Anne Elk, not Anne Expert!
C: No! No, I was saying that you, Miss Anne Elk, were an, A-N not A-N-N-E, expert...
A: Oh!
C: ...on elks - I'm sorry, on dinosaurs. I'm ...
A: Yes, I certainly am, Chris. How very true. My word yes.

Just for fun, here's another linguistically clever Python fragment:

(Mr. Bertenshaw and his sick wife arrive at a hospital.)

Doctor: Mr. Bertenshaw?
Mr. B: Me, Doctor.
Doctor: No, me doctor, you Mr. Bertenshaw.
Mr. B: My wife, doctor...
Doctor: No, your wife patient.
Sister: Come with me, please.
Mr. B: Me, Sister?
Doctor: No, she Sister, me doctor, you Mr. Bertenshaw.

Posted by Mark Liberman at 10:44 PM

Research has been made

While reading an interesting 11/9/2003 NYT article (by Lawrence K. Altman) on progress towards a SARS vaccine, my inner prescriptivist was taken aback by this sentence:

Among the reasons for his optimism, Dr. Fauci said, is the successful research that Dr. Brian Murphy and other scientists have made at his institute, which is a unit of the National Institutes of Health.

It's hard to keep English light verbs straight: we normally have a discussion as opposed to making a discussion or doing a discussion; we normally make a comment as opposed to doing a comment, and having a comment is different. To make it harder, there are differences in usage: some have a bath when others take a bath; some have lunch while others do lunch; and so on, through thousands of bilexical minutiae.

But I thought it was agreed that in English, research is something we do, not something we make.

A bit of google corpus linguistics confirms this idea: there are some examples of make research, but all the ones I found were from non-native speakers, for instance:

A visiting scholar from Japan: "...[m]ake research on the constitution of all the computer systems at University of Illinois"

A query from a Malaysian student: "why Mandel only make a research about the gene just for pea?"

However, when I look at the passive voice -- cases of research being made -- the story is different. Some examples are non-native, like the Finn who writes that "[t]here has been a number of medical research made on electromechanical vibration and its effect on the human body." But there are quite a number of examples whose authors are clearly native speakers.

Brian Gaines and Mildred Shaw at the University of Calgary have a piece on Collaboration through Concept Maps that starts "This article focuses on research made on collaborative systems to support individuals and groups in creative visualization".

The Cooper County Historical Society (of Pilot Grove MO) announces that "Effective September 1, 2000, a $10.00 charge will be made for research made on the premises."

A flying-saucer researcher posts the transcript of a discussion in which Denise T asks him "Any idea when it will be aired, and will it cover additional research made on the film since this special was filmed?"

A bit of oral history from the Pittsburgh area mentions that "There was supposed to have been research made on the Stillion name because there were so few by that name."

A report on UK Marine Special Areas of Conservation says that "There has been little or no research made on the amounts of sewage discharged into port and harbour areas during operational shipping or recreational activities."

I don't think that I would ever write about "research made on light verbs" (assuming counterfactually that I ever did some), or "research made on the phonetics of lexical tone" (which in fact I've done). But I have to admit that these examples -- with passive forms of "make", mostly in reduced relatives immediately following "research" -- seem much less wrong than their active counterparts, to the limited extent that I have any intuitions about such things after reading a bunch of examples :-).

In any case, there does seem to be a minority tendency out there -- at least at the Cooper County Historical Society and in a few other places -- for research to be made. At least to this extent, my prescriptive impulse (that such examples are wrong) can't be cashed out in terms of the way all native speakers write (or speak).

The NYT example talks about "the successful research that [some people] have made", with an active form of "make" in a full relative clause. If such examples are also in common use, then my prescriptive impulse is even less accurate as a reflection of actual norms. It's beyond my skill in google corpus linguistics to check this, and none of the available parsed corpora are big enough to answer the question. But the truth is out there...

Posted by Mark Liberman at 07:27 AM

November 09, 2003

The weblog of Samuel Pepys

I can't believe I've missed this. Or this!
[via AIMs weblog]

Posted by Mark Liberman at 07:49 PM

An abortion by any other name...

There's a sharp terminological contrast in recent news articles about an abortion bill signed by President Bush. Proponents of the bill call the procedure "partial-birth abortion"; opponents tend to call it by its medical name, "intact dilation and extraction", or else refer to it by some other non-gruesome-sounding label.

This terminological disagreement is yet another battle in the linguistic war surrounding the abortion issue. An early delineation of the battleground was Brenda Danet's 1980 article `"Baby" or "fetus"? Language and the construction of reality in a manslaughter trial' (Semiotica 32:187-219); on trial was a doctor who had performed a late abortion. And the war is also reflected in the now-standard terms for the two sides of the abortion debate: pro-life (implying that the other side is pro-death) vs. pro-choice (avoiding the emotionally-charged and, this side would argue, misleading "pro-abortion" label -- and also implying that the other side is anti-freedom of choice). This isn't a matter of whose terms are correct and whose are incorrect. Rather, it's yet another example that supports the late Dwight Bolinger's argument about the power of language, articulated eloquently in his 1980 book Language -- The Loaded Weapon. Bolinger's title says it all (and, in case a dim reader might miss the point, the book's dust-jacket features a picture of a revolver).

Posted by Sally Thomason at 06:40 PM

It depends on what the word is copulation means...

The Curmudgeonly Clerk dissects a case in which the result hinges on a set of interlocking definitions in Webster's Third New International Dictionary, cited as such in the court's opinion. It's obvious that legal decisions always depend on the meaning of words, but (knowing nothing about the law) I wonder how often the outcome hinges on definitions quoted from specific dictionaries. And I wonder if this case would have come out differently if some other dictionary had been used?

Just in case you're not inspired to read the whole thing by the abstract socio-semantic point at issue, let me add that the crux is the definition of the word adultery, and that the case was mentioned earlier by Instapundit under the title Eatin' ain't cheatin' :-).

Posted by Mark Liberman at 01:07 PM

Whitman: the first warblogger

Plato may have invented the weblog, but I think that the warblog should be traced back to Walt Whitman.

I've just read Whitman's Democratic Vistas. It's not much like other political essays. It reads more like a semi-random collection of the best pieces from a couple of years of passionate, long-winded blogging, strung together by associative links. And apparently that's just what it was:

"[T]hough the passages of it have been written at widely different times, (it is, in fact, a collection of memoranda, perhaps for future designers, comprehenders,) and though it may be open to the charge of one part contradicting another -- for there are opposite sides to the great question of democracy, as to every great question -- I feel the parts harmoniously blended in my own realization and convictions, and present them to be read only in such oneness, each page and each claim and assertion modified and temper'd by the others."

I reckon that the blogosphere is part of the "copious, sane, gigantic offspring" that Whitman hoped for. Like Whitman's essay, most warblog postings "are not the result of studying up in political economy, but of the ordinary sense, observing, wandering among men, these States, these stirring years of war and peace."

As the obligatory language hook, let me point out that Whitman has something to say about the connection between linguistic prescriptivism and class prejudice:

"The People! Like our huge earth itself, which, to ordinary scansion, is full of vulgar contradictions and offence, man, viewed in the lump, displeases, and is a constant puzzle and affront to the merely educated classes. The rare, cosmical, artist-mind, lit with the Infinite, alone confronts his manifold and oceanic qualities -- but taste, intelligence and culture, (so-called,) have been against the masses, and remain so. There is plenty of glamour about the most damnable crimes and hoggish meannesses, special and general, of the feudal and dynastic world over there, with its personnel of lords and queens and courts, so well-dress'd and so handsome. But the People are ungrammatical, untidy, and their sins gaunt and ill-bred."

These days, the rich are thin, and the People's sins are more likely to be plump and ill-bred, but Whitman's observation about the intrinsically snobbish tendency of "taste, intelligence and culture" is still valid. This is why linguists, though overflowing with normative impulses about grammar and usage, normally restrain themselves. Or rather, try to express themselves very carefully.

Posted by Mark Liberman at 09:52 AM

John righteously frums "Dead Right"

John (of John and Belle) has posted an enlightening and hilarious review of David Frum's "Dead Right". It's like one of those long, eloquent rants that a dinner guest sometimes lets fly, halfway through the third bottle of wine. One of the things that I like about the weblog form is that you get a lot of these vivid and heartfelt effusions -- which James Lileks calls bleats -- along with the news tips, the quick quips and the spoon bread recipes. And the beauty part is that if the rant gets tiresome, you can just slip out of the room without offending anybody.

John's bleat about "Dead Right" is a special type of bleat, an extended attack on a piece of someone else's writing. When the critique is presented as a copy of the original document with interlinear commentary, it's called a fisking. But John's review is not a fisking -- it couldn't be, since he's reviewing a whole book, and it shouldn't be, since his piece is organized around the flow of his thoughts rather than the text of his target.

I think we need a new word to describe a piece of this kind. I suggest the verb to frum [an author or text], with the associated noun frumming. Besides providing a useful bit of terminology, this improves the political symmetry of the anti-idiotarian lexicon.

A sample of John's piece, starting with a quote from Frum:

"Conservative rhetoric can sound a little overbroad, if not positively bats, to nonconservative ears. Conservatives, however, see the things they dislike in the contemporary world – abortion, the slippage of educational standards, foreign policy weakness, federal aid to handicapped schoolchildren – as all connected, as expressions of a single creed, a creed of which liberalism is just one manifestation."

This passage cracked me up. (Belle was moved to inquire solicitously: “Are you OK, honey?”) It is, of course, precisely because people know some conservatives see all these things as connected that some people think some conservatives are bats. (If it thinks like a moonbat, and it talks like a moonbat, and if it comes right out and says it’s a moonbat, it’s a moonbat.)

Seriously, here’s a cautionary lesson taught by the 1960’s (you’d think conservatives could learn such things): just because you feel that everything is, like, so connected in a mysterious way, doesn’t make it so. And for damn sure you don’t have the right to bother other people with constant reports of your weird but strong intuitions of, like, total interconnectedness.

Another tasty passage:

What Frum has got, to repeat, is just a feeling that the kids these days are getting a bit soft. Everyone feels this way sometimes, of course – since it’s true. But some people have thoughts as well as feelings about this attendant effect of civilization. And so it turns out Lionel Trilling was maybe not such a poor prophet after all, when he wrote way back in 1953: “in the United States at this time liberalism is not only the dominant but even the sole intellectual tradition;” for the anti-liberals do not, by and large, “express themselves in ideas but only in action or in irritable mental gestures which seek to resemble ideas.”

Read the whole thing.

[Note: although the jargon file says that fisking is "very common", Google still asks helpfully "did you mean fishing?"
And yes, I do know about John Frum, cargo cults and all...]

Posted by Mark Liberman at 07:19 AM

Helpful Google

Cinderella Bloggerfeller shows how to have a Pythonesque conversation with Google [via Languagehat]. His elegant example

Search entry:hhhhhhhhhhhhhhhhhhhhhhhhhh

Helpful Google: Did you mean: hhhhhhhhhhhhhhhhhhhhhhhhh ?

is easy to generalize:

Search entry: aaaaaaaaaaaaaaaaaaaaaaaaaa

Helpful Google: Did you mean: aaaaaaaaaaaaaaaaaaaaaaaaaaa ?


Search entry: ooooooooooooooooooooooo

Helpful Google: Did you mean: oooooooooooooooooooooooo ?

Slight variations on the theme also work:

Search entry: gaaaaaaaaaa

Helpful Google: Did you mean: gaaaaaaaaa ?

Sometimes Google is oddly unsympathetic. For example, strings of repeated "aiaiai..." return actual pages right up to 24 ai's, which gets three hits including


Chemistry exam in two hours!!!!


but when asked about 25 ai's, Google just responds coldly

Your search - aiaiaiaiaiaiaiaiaiaiaiaiaiaiaiaiaiaiaiaiaiaiaiaiai - did not match any documents.
No pages were found containing "aiaiaiaiaiaiaiaiaiaiaiaiaiaiaiaiaiaiaiaiaiaiaiaiai".


- Make sure all words are spelled correctly.
- Try different keywords.
- Try more general keywords.

Also, you can try Google Answers for expert help with your search.

Posted by Mark Liberman at 05:47 AM

November 08, 2003

Reuters frenotomy story: Not ignorance but anti-Americanism?

A few days ago, I discussed a weird Reuters story about Korean parents allegedly pushing inappropriate surgery on their children in order to help them learn English. The strangest part was an interview with a psychiatrist who was quoted to the effect that teaching children foreign languages too early could cause speech impediments and autism.

At the time, I attributed this very odd story to lack of linguistic education and common sense on the part of the Reuters reporter and editor responsible. But that's naive. On reflection, it seems to me that there's a much more obvious explanation: anti-Americanism and anti-globalization. These biases may have created a story where there isn't (much of) one, really, but more certainly they have led Reuters to spin the story in a way that is, in a word, nuts.

Read the whole thing, especially the update at the bottom.

Posted by Mark Liberman at 09:53 AM

The Meatrix

As Kieran Healy puts it,

No one can be told what the Meatrix is

You have to see it for yourself.

I'm neither a supporter of PETA nor a vegetarian, but this is a heck of a way to frame political discussion.

Posted by Mark Liberman at 07:33 AM

Mind-reading fatigue

Why does Amtrak now need to have quiet cars? Why do some restaurants offer cell-phone-free seating options? Why does Google index 51,500 cell phone rants?

It's not just because cell-phone ringers are obnoxious, though they are. The conversations themselves are annoying!

People often say that it's because cell-phone users talk too loudly. But I don't think this is true. I've been monitoring conversations around me in public places for the past couple of weeks -- regular live conversations as well as cell phone users -- and I don't hear much difference in amplitude. Some live conversations are softer, some are louder, and the same is true for cell phone users. The louder a conversation is, the more intrusive and annoying it is if you don't care to listen in. The thing is, though, a given cell phone conversation seems much more intrusive and annoying than an equally loud live conversation. We tend to interpret greater salience as greater amplitude, but it ain't necessarily so.

The greater salience of cell-phone conversations -- if it's true! -- could be because we're used to making allowances for others' live conversations, but cell phones are new and we aren't used to them. However, I don't think this is it. I think public cell phone users are annoying because mind-reading is hard work.

Let me explain.

Theory of mind is a term introduced by Premack and Woodruff (1978) to refer to a set of abilities that may be uniquely human: to attribute mental states such as beliefs, knowledge and emotions to self and others; to recognize that the mental states of others may differ from one's own; to use these attributed states to explain and predict behavior; and to predict how such mental states would be affected by hypothetical actions.

This is "mind reading", and it's hard to do, because there are no psionic wave transmissions involved -- it's all inference from what people say and do, how they say and do it, and prior information about them and others. It's also pretty much automatic -- if you're not autistic, you can't stop yourself from reading your companions' minds any more than you can stop yourself from noticing the color of their clothes.

But when you're only getting half the cues -- from one side of a cell phone conversation between two strangers -- you have to work a lot harder.

Recent theorizing in cognitive neuroscience suggests that humans have an evolved theory of mind module. An fMRI study by Gallagher et al. even suggests where it is in the brain:

Brain activation during the theory of mind condition of a story
task and a cartoon task showed considerable overlap, specifically in the
medial prefrontal cortex (paracingulate cortex).

So here's my hypothesis. When you're sitting in a restaurant or a railroad car, hearing one side of a cell phone conversation, you can't help yourself from trying to fill in the blanks. And after a few seconds of this, your paracingulate medial prefrontal cortex is throbbing like a stubbed toe. Or at least, it's interfering with your ability to think about other things.

[Update: a friend has observed that I myself rarely give any indication of noticing the color of anyone's clothes. Well, um, I do. Notice, that is.]

Posted by Mark Liberman at 06:27 AM

November 07, 2003

Gall in the family

It's depressing that Greg Ross, the managing editor of the generally excellent American Scientist Online, has written such a badly-informed and credulous review of Peter Forster and Alfred Toth, 'Toward a phylogenetic chronology of ancient Gaulish, Celtic, and Indo-European'. PNAS (2003).

For a better appraisal, see Larry Trask's Linguist List review, Peter Forster's reply, and Trask's re-reply.

The American Scientist review starts out badly:

Ever since Darwin proposed an evolutionary tree to describe the descent of species, linguists have sought to apply the concept in their own field. ... Now historical linguists may stand to benefit by borrowing a second idea from evolutionary biology.

This gets the direction of intellectual influence exactly backwards. The well-known fact of the matter is that Darwin modeled his idea of "descent with modification" in biological evolution explicitly on what he took to be the obvious prior success of philologists in establishing "descent with modification" as the basis of the history of languages.

"It may be worth while to illustrate this view of classification, by taking the case of languages. If we possessed a perfect pedigree of mankind, a genealogical arrangement of the races of man would afford the best classification of the various languages now spoken throughout the world; and if all extinct languages, and all intermediate and slowly changing dialects, had to be included, such an arrangement would, I think, be the only possible one. Yet it might be that some very ancient language had altered little, and had given rise to few new languages, whilst others (owing to the spreading and subsequent isolation and states of civilisation of the several races, descended from a common race) had altered much, and had given rise to many new languages and dialects. The various degrees of difference in the languages from the same stock, would have to be expressed by groups subordinate to groups; but the proper or even only possible arrangement would still be genealogical; and this would be strictly natural, as it would connect together all languages, extinct and modern, by the closest affinities, and would give the filiation and origin of each tongue."
("Origin of Species", 1st Edition, Chap. 13 Mutual Affinities of Organic Beings).

Indeed, this idea was already well understood by Thomas Jefferson almost a century earlier. See this chapter by Lyle Campbell for other, even earlier roots of the idea that languages form a "family tree" (scroll down to section 4, "The Rise of the Comparative Method").

I'm also fairly certain that the lexicostatisticians used algorithmic phylogenetic-tree-inducing techniques (Ross' "second idea") for language history before any such techniques were ever employed in biology. They certainly did so many decades before Forster and Toth came on the scene.

It's not fair to blame Ross for being ignorant of the past and present state of historical linguistics, but he could have asked someone with some linguistic credentials. If a couple of computational linguists wrote an article about applying language-modeling techniques to determining the structure of macromolecules, I'd expect Ross to consult with specialists in that area before deciding whether or not to take the authors at their word (in this case I believe that he'd discover that their word is good). When a couple of geneticists take a flying leap at Indo-European, I'd expect Ross to consult with a historical linguist or two rather than writing a puff piece based entirely on the article and an interview with its authors.

I'm not going to criticize Ross any further, or rehearse the problems with the Forster & Toth article in detail here -- but read Trask, read Ross, read Darwin, read Jefferson, and weep.

Our field needs to fire its public relations consultants and... What? We don't have any?

Posted by Mark Liberman at 06:37 PM

Cell phone poems

Rosanne over at the X-bar comments on an informative tautology in a bit of cell phone conversation heard on the street:

"I miss them because I miss them, but, you know, I'm happier."

Rosanne says that "public phone chats should be part of the linguistic public domain". This particular chatlet reminds her of a song lyric, and starts her thinking about tautologies in everyday life.

Since public cell phone talk is acoustic littering that degrades common spaces, I'm in favor of this idea of using it as a source of linguistic examples and as a form of found poetry. When life hands you one loud side of an unwanted conversation, make an example sentence -- or a poem.

It's interesting how often an everyday conversation feels like a poem, if you just arrange its typography according to the conventions of the form. A striking recent example was Hart Seely's discovery of the poetry of D.H. Rumsfeld. Without detracting from Rumsfeld's accomplishments, I think it's fair to say that most genuine conversations contain similarly effective and affecting passages.

For example, part of one side of the sample of conversational audio shipped with the Transcriber program goes like this:

It's like I mean
I just didn't know

You know everyone tells you
    you don't know
        you don't know
            you don't know
And the thing is you don't know
so you don't even know that you don't know
you know what I mean?

It's like-

I don't know.

So the next time a peaceful lunch is invaded by half of a cell phone conversation, I'll try to think of it as an impromptu poetry reading.

[Update: a couple of people have pointed out to me that it's not obvious that a sentence of the form A because A is a tautology, i.e. is necessarily true. In fact, some people might think that it's necessarily false, if taken literally, or perhaps has a necessarily failing presupposition. Whatever.]

Posted by Mark Liberman at 07:01 AM

November 06, 2003

Discourse: branch or tangle?

Coherent texts seem to have a clear, more-or-less hierarchical structure that crosses sentence boundaries, and may extend over arbitrarily long passages. However, several millennia of attempts to provide an analytic foundation for this kind of discourse structure have been disappointing. At least, discourse has never achieved the kind of widely-accepted informal analytic lingo that we take for granted as a foundation for talking about syntax: "in the sentence It is a vast and intricate bureaucracy, there is a noun phrase a vast and intricate bureaucracy, in which vast and intricate is a conjunction of adjectives modifying the head noun bureaucracy; etc."

Why? Is the apparent structure of coherent text just an incoherent illusion, a rationalization of ephemeral affinities that emerge as a by-product of the process of understanding? Is it too hard to figure things out when there is little or no morphological marking? Have linguists just not paid enough attention?

Recently, several of the many small groups developing various theories of discourse analysis have started creating and publishing corpora of texts annotated with structures consistent with their theories. The RST Discourse Treebank led the way, with the 2002 publication of Rhetorical Structure Theory annotations of 385 Wall Street Journal articles from the Penn Treebank. The corpus has enabled this approach to be widely used in engineering experiments and even some working systems.

Now Florian Wolf and Ted Gibson have put forward an alternative approach. In a paper entitled The descriptive inadequacy of trees for representing discourse coherence, they argue that "trees do not seem adequate to represent discourse structures." They've also provided an annotation guide for an approach that does not assume strictly hierarchical relationships in discourse, and annotations of 135 WSJ texts, which have been submitted for publication to the LDC.

As a non-expert in such things, I find their arguments convincing. Even leaving aside the structure of everyday speech, where we all too often surge enthusiastically "all through sentences six at a time", there are often cases where the commonsense relationships between bits of discourse seem to cross, tangle and join in a way that a strictly hierarchical structure does not allow.

Here's an example taken from the Wolf/Gibson paper (source wsj_0306; LDC93T3A), divided into discourse segments:

0. Farm prices in October edged up 0.7% from September
1. as raw milk prices continued their rise,
2. the Agriculture Department said.
3. Milk sold to the nation's dairy plants and dealers averaged $14.50 for each hundred pounds,
4. up 50 cents from September and up $1.50 from October 1988,
5. the department said.

Here's their annotation of coherence relations for this segmentation:


(ce=Cause-Effect; attr=Attribution; elab=Elaboration; sim=Similarity.)

Note how the "Elaboration" relation between segments [3 4] and segment 1 crosses the "Attribution" relation between segment 2 and segments [0 1], and also applies only to the second segment of the [0 1] group. This seems to me like a plausible picture of what's happening in this (simple) passage -- I wonder if someone who believes in tree-structured theories of discourse relations can offer an argument against cases like this.

Overall, Wolf and Gibson report that in their corpus of 135 texts, 12.5% of the (roughly 16,000) arcs would have to be deleted in order to eliminate crossing dependencies, and 41% of the nodes had in-degree greater than one (i.e. would have multiple "parents" in a tree-structured interpretation).
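For readers who want to see what "crossing" means operationally, here is a minimal sketch in Python. It is not Wolf and Gibson's procedure, just one simple way to test for interleaved arcs. The arc list is my own reading of the prose description of the example above: the first two arcs are the ones discussed in the text, and the third (segment 5 attributing segments [3 4]) is an assumption based on "the department said". Placing a group of segments at the midpoint of its members is also a simplification of mine.

```python
from itertools import combinations

# Each arc is (label, endpoint_a, endpoint_b); an endpoint is a tuple of
# segment indices, so (0, 1) stands for the group [0 1].
ARCS = [
    ("attr", (2,), (0, 1)),    # described in the text
    ("elab", (3, 4), (1,)),    # described in the text
    ("attr", (5,), (3, 4)),    # assumed from "the department said"
]

def position(endpoint):
    """Place a segment or group on the line at the mean of its indices."""
    return sum(endpoint) / len(endpoint)

def span(arc):
    """Return the (left, right) positions covered by an arc."""
    _, a, b = arc
    return tuple(sorted((position(a), position(b))))

def crosses(arc1, arc2):
    """Two arcs cross when exactly one endpoint of each lies strictly inside the other."""
    lo1, hi1 = span(arc1)
    lo2, hi2 = span(arc2)
    return lo1 < lo2 < hi1 < hi2 or lo2 < lo1 < hi2 < hi1

crossing_pairs = [(a, b) for a, b in combinations(ARCS, 2) if crosses(a, b)]
for a, b in crossing_pairs:
    print(a, "crosses", b)
```

A tree-structured analysis would need every such pair to come out non-crossing; in this toy version the Attribution arc over [0 1] and the Elaboration arc into segment 1 are the pair that fails.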

I think that these things -- both the RST Treebank and the Wolf/Gibson corpus -- are wonderful steps forward. Two alternative approaches to the same (hard) problem offer not just examples and arguments, but also alternative corpora (of overlapping material!), annotation manuals, annotation tools and so on.

The RST authors have applied their ideas to engineering problems of summarization, MT, essay grading and so on, as well as basic linguistic description. Wolf and Gibson are using their analysis as a foundation for psycholinguistic research as well as information extraction and other engineering applications.

What a great time to be in this field!

[Update: a response by Daniel Marcu is discussed here, and a response by Florian Wolf to this response is discussed here.]

Posted by Mark Liberman at 08:13 PM

Correct somebody. Hard and with passion.

Uncle Jazzbeau suggests that "Within every soi disant descriptivist is a prescriptivist dying to drop all the pretense and nonsense and correct somebody. Hard and with passion."

I'm sure that Uncle J. doesn't need to be warned that this can be a difficult subject, as we learned in our earlier discussion of the (non-)plurality of italics, in the course of which Geoff Pullum pointed out that

[a] libertarian who calls me a prescriptivist is a libertarian who is going to be asked to step outside in the back alley for a few minutes of profound unpleasantness, most of which he will spend lying on the ground by the dumpster.

Not that anyone has called Geoff a prescriptivist, exactly :-).

Posted by Mark Liberman at 09:58 AM

"The enemy is language"

Is the on-line talk abstract emerging as a new literary genre? Cindie McLemore has pointed out to me that this example, due to Chuck Fillmore, is a small masterpiece of the form.

It's easy to read, it gives you a clear idea of what the talk will be about, and it makes you want to go hear the rest of the story. It manages to give both the big-picture thematic context of the research and the individual experience of the researcher. The fact that it's talking about excellent work on interesting problems is essential, but the presentation is artful.

Posted by Mark Liberman at 08:57 AM

November 05, 2003

For the metrically challenged

I was refereeing a phonology paper last week and being frustrated, yet again, by my non-grasp of the basic metrical terms. My linguist daughter came to my rescue with Samuel Taylor Coleridge's mnemonic verse (Metrical Feet, 1806), which I reproduce here in case someone else reading this might suffer from my metrical disability:

Trochee trips from long to short;
From long to long in solemn sort
Slow spondee stalks
-- Strong foot, yet ill able
Ever to keep up with dactyl's trisyllable.
Iambics march from short to long;
With a leap and a bound the swift anapests throng.

This verse doesn't move me quite as much as the mnemonic I learned eons ago from my college German teacher; that one was designed to help us remember which cases went with which German prepositions. But once I commit it to memory, Coleridge's poem should help me avoid embarrassment when I find myself stumbling among phonologists' feet.

Posted by Sally Thomason at 07:25 PM

Once King, always King

A bitter campaign to name a road after Dr Martin Luther King has been abandoned in San Jose, California. But curiously, the road to be renamed is called... King Road. A former police officer, Ken Stewart, fought for three years to get it named for Dr King (etymologically, at the moment it commemorates early San Jose settler Andrew King), collecting 600 signatures of people on King Road who agreed with him, and turned the whole affair into a nasty blacks-versus-chicanos battle. Internalist semantics could really have saved a lot of trouble here. Change the conceptual structure without changing the overt linguistic form: just think about Martin Luther King when you drive down King Road.

Posted by Geoffrey K. Pullum at 05:42 PM

Zettascale Linguistics

In a presentation on cluster computing, I found the phrase:

"5 Exabytes: All words ever spoken by human beings"

The authors are Philip Papadopoulos, Greg Bruno and Mason Katz, of the San Diego Supercomputer Center, and the presentation seems to be one of a series that was given in Singapore in April of 2002.

The phrase means that digital storage amounting to 5 * 10^18 bytes would suffice to store everything that every human being has ever said. This is compared with the expected storage capacity of a modest ($300K-cost) computer cluster in 2007, which is listed at 1.2 exabytes, only about 4 times smaller. In fact this calculation seems to be wrong, by a factor of 8 million or so -- but never mind, the correction just puts things off for another couple of decades :-)... Despite the mistake, I have to exclaim "oh brave new world, that has such calculations in it!"

The context is an extrapolation of current trends forward to 2007. The authors discuss the likely future of commodity disk technology, and conclude (on slide 29) that in 2007, a "conservative" serial ATA disk will offer 1680 GB for a price of $46 (US), while an "aggressive" disk will provide 5120 GB for $142 (US).

After discussing trends in other components as well, they give a picture of a "2007 cluster" (slide 37 translated from ppt into html, emphasis mine):

  • 4 TFLOPS
    • 128 dual processor compute nodes
    • 3rd on current TOP500 list
      • 2nd place is PSC Terascale cluster
  • 2.3 TB main memory
  • 1.2 EB storage
    • 2 disks per node
    • 5 Exabytes: All words ever spoken by human beings
  • 12.8 Tb/s aggregate network I/O
  • System cost: USD$300,000
    • PSC Terascale cluster = USD$35 million

The idea seems to be that each of 128 cluster nodes will have two "aggressive" 5.12 terabyte disks, which will collectively provide 1.2 exabytes. In order to impress us with how much this is, the authors tell us in an aside that 5 exabytes would suffice to store "all words ever spoken by human beings."

Truly an impressive (if horrifying) thought.

And I'm impressed enough, in advance, by being able to get 5-terabyte disks for $142 each.

However, I believe that this slide contains two numerical errors. First, the proposed configuration would amount to 1.2 petabytes, which is a thousand times smaller than 1.2 exabytes. Second, a 5 exabyte store would roughly be eight thousand times too small to store "all words ever spoken by human beings", at least in audio form. Therefore the 2007 cluster's storage would be too small by a factor of about 32 million rather than a factor of 4. I freely confess that maybe the authors were thinking about text -- but in the first place I'm a phonetician, and in the second place most human languages have not had a written form. So bear with me here for a while.

First, the cluster storage sum.
128 * 5120 * 10^9 * 2 = 1.31072 * 10^15
(128 cluster nodes, 5120 GB per disk, 2 disks per node). This is ~ 1.3 petabytes -- a petabyte is 10^15 bytes -- not 1.3 exabytes -- an exabyte is 10^18 bytes. (The change from 1.3 to 1.2 presumably has to do with disk format issues).

Second, the storage requirements for all human speech. There are said to have been 1 billion people in 1800, 1.6 billion people in 1900, and 6.1 billion people in 2000. So let's assume that 10 billion people have lived an average of 50 years, speaking for 2 hours a day on average throughout their lives. This is
10 * 10^9 * 50 * 365 * 2 * 60 * 60 = 1.314 * 10^18 seconds.
If we assume 16 KHz 16-bit linear single-channel audio, at 32KB per second, we've got
1.314 * 10^18 * 3.2 * 10^4 = 4.205 * 10^22 bytes.

This is 42 zettabytes (a zettabyte is 10^21 bytes), and is more than 8 thousand times more than 5 exabytes, and thus more than 32 million times larger than the projected storage of the 2007 computer cluster.
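For anyone who wants to check the arithmetic or try different assumptions, the two sums above can be redone in a few lines of Python. This is only a sketch; the variable names are mine, and the inputs are the same round numbers used above (10 billion people, 50 years, 2 hours a day, 32,000 bytes per second, and decimal gigabytes for the disks).

```python
# Inputs: the round numbers from the text above.
PEOPLE = 10 * 10**9            # everyone who has ever lived (rough guess)
YEARS = 50                     # average life span in years
HOURS_PER_DAY = 2              # average time spent talking each day
BYTES_PER_SECOND = 16_000 * 2  # 16 kHz sampling, 2 bytes per sample, mono

EXABYTE = 10**18
ZETTABYTE = 10**21

# Total seconds of speech, and the bytes needed to store it as audio.
seconds = PEOPLE * YEARS * 365 * HOURS_PER_DAY * 60 * 60
speech_bytes = seconds * BYTES_PER_SECOND

# The "2007 cluster": 128 nodes, 2 disks per node, 5120 GB per disk.
cluster_bytes = 128 * 2 * 5120 * 10**9

ratio_to_claim = speech_bytes / (5 * EXABYTE)
ratio_to_cluster = speech_bytes / cluster_bytes

print(f"seconds of speech:   {seconds:.3e}")
print(f"bytes of audio:      {speech_bytes:.3e}")
print(f"zettabytes:          {speech_bytes / ZETTABYTE:.1f}")
print(f"times 5 exabytes:    {ratio_to_claim:,.0f}")
print(f"times the cluster:   {ratio_to_cluster:,.0f}")
```

Changing any one input by a factor of two or so moves the answer by the same factor, which is why the conclusion stays in zettabyte territory.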

All these numbers -- number of people, amount of talking, audio encoding, etc. -- could be adjusted up or down by modest factors, but I believe that any way you slice it, "all words ever spoken by human beings" is a zettascale project. Unless I've screwed up the arithmetic, which is entirely possible, since Papadopoulos et al. did, and I'm sure they're less likely to drop a few orders of magnitude early in the morning than I am :-).

[Note: the 5-exabytes-for-all-human-speech meme seems to be proverbial -- scroll down the hyperlink to the definition for exabyte, where you'll find that "It has been said that 5 Exabytes would be equal to all of the words ever spoken by mankind".]

[Also: given that disk price/performance continues to improve by a factor of two every year, it will take an additional 25 years to take care of the needed factor of 32 million (2^25 = 33,554,432). So we're talking about the typical cluster of the year 2032 -- except that some form of Stein's theorem is likely to intervene -- unless Davies' Corollaries apply...]
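The doubling estimate in the note above can be checked the same way. This sketch assumes, as the note does, one doubling of disk capacity per dollar each year, starting from the 2007 cluster; the names are mine.

```python
import math

SHORTFALL = 32_000_000   # how many times too small the 2007 cluster is
START_YEAR = 2007

# Each year doubles capacity, so we need the smallest n with 2**n >= SHORTFALL.
years_needed = math.ceil(math.log2(SHORTFALL))

print(years_needed)               # number of annual doublings required
print(2 ** years_needed)          # capacity multiple reached after that many years
print(START_YEAR + years_needed)  # the year the typical cluster catches up
```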

[Update 11/12/2003: the canard that "Five exabytes... is equivalent to all words ever spoken by humans since the dawn of time" was repeated in this 11/11/2003 NYT article. It's amazing how people pass this stuff around without checking it or thinking it through: Eskimo snow words all over again, though on a much smaller scale.

The Dutch periodical Onzetaal linked to the NYT article and also to this post -- maybe the internet culture can start to keep these small thoughtless "idées reçues" in check.]

[Update 1/3/2003: Adam Morris wrote to explain:

Gigabyte is a confusing unit, similar to billion (one thousand million or one million million? I'm used to both now and assume that unless explicitly mentioned Brits mean the larger while Americans mean the smaller...) A gigabyte should be 10^9 bytes, but as computer people frequently deal in binary, it is also used to mean 2^30. As 2^10 is 1024 this is frequently used as a multiplier in disk sizes and memory. This would make a terabyte, not 10^12 but 2^40 bytes. A 5120 GB disk would thus be five terabytes, and two of them would be ten terabytes. This gives us 1,280 terabytes, or 1.25 petabytes (2^50 not 10^15). thus the change from 1.3 to 1.2 is to do with the actual size of the units involved. Disk drive manufacturers usually use 10^X as it makes the disks seem bigger than the 2^Y maths used elsewhere.

I guess I sort of knew that, but neglected to bring it to bear on the calculations above. I'm grateful for the clarification.
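To make Adam's point concrete, here is a small Python sketch that sizes the same 256 disks both ways: with decimal units (a gigabyte as 10^9 bytes, a petabyte as 10^15) and with binary units (2^30 and 2^50). The variable names are mine.

```python
DISKS = 128 * 2      # 128 nodes, 2 disks per node
DISK_GB = 5120       # nominal size of each disk in "gigabytes"

# Decimal reading: a gigabyte is 10**9 bytes, a petabyte is 10**15 bytes.
decimal_bytes = DISKS * DISK_GB * 10**9
decimal_pb = decimal_bytes / 10**15

# Binary reading: a gigabyte is 2**30 bytes, a petabyte is 2**50 bytes.
binary_bytes = DISKS * DISK_GB * 2**30
binary_pb = binary_bytes / 2**50

print(decimal_pb)   # the "1.3" figure from my sum
print(binary_pb)    # the figure behind the slide's "1.2"
```

Either way the answer is in petabytes, so the factor-of-a-thousand problem with "1.2 EB" remains.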

I've heard from various other people with observations about better ways to estimate the total number of person-years in human history to the present; about alternative notions of how much talking people do; about audio encoding and audio compression methods; and so on. None of these seems to make more than an order of magnitude difference at most (mostly a factor of two or thereabouts), and the effects are sometimes to increase the estimate, and sometimes to decrease it. So I'll stand pat for now.

With respect to the number of people who have ever lived, Brian Carnell argues (with a reference) that it's closer to 100 billion than 10 billion. I haven't studied the source, but I'll accept the correction -- except that as Brian also observes, the figures deal with the number of humans who have ever been born, and during much of human history, most folks died pretty young, making my 50-year-life-span estimate far too high. The cited reference (a paper by Carl Haub) says that "[l]ife expectancy at birth probably averaged only about 10 years for most of human history". So rather than a ten-fold increase, there might be as little as a two-fold increase.

For those who care, here's a table of representative audio encoding rates. I chose 32 KB/sec -- roughly the quality of FM broadcasts -- as the data rate. One could use lossless encoding to lower this by a factor of two or so; one could use lossy coding (like MP3) to get higher perceptual quality in the 16-32KB/sec range; but it'd be a crime against humanity to go to cell phone or LPC-10 data rates.

Encoding                                          Rate in bits/sec   Rate in bytes/sec   Rate in bytes/hour
1. CD standard (stereo), 44.1KHz 16b/sample       1411.2K            176.4K              635.04M
2. FM-quality wideband (mono), 16KHz 16b/sample   256K               32K                 115.2M
3. Same as above, with lossless coding            ~128K              ~16K                ~57.6M
4. Typical MP3, AAC etc.                          128K               16K                 57.6M
5. Basic digital telephony (one channel)          64K                8K                  28.8M
6. ADPCM (one channel)                            32K                4K                  14.4M
7. Typical digital cellular (one channel)         8K                 1K                  3.6M
8. LPC-10 (one channel)                           2.4K               300                 1.08M
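The byte columns in the table follow mechanically from the bit rates: divide by 8 for bytes per second, then multiply by 3600 for bytes per hour. A short Python sketch, with names of my own choosing and the bit rates taken from the table:

```python
# Bit rates in bits per second, from the table above.
BIT_RATES = {
    "CD standard (stereo)": 1_411_200,
    "FM-quality wideband (mono)": 256_000,
    "Typical MP3, AAC etc.": 128_000,
    "Basic digital telephony": 64_000,
    "ADPCM": 32_000,
    "Typical digital cellular": 8_000,
    "LPC-10": 2_400,
}

def rates(bits_per_sec):
    """Return (bytes per second, bytes per hour) for a given bit rate."""
    bytes_per_sec = bits_per_sec // 8      # every rate above is a multiple of 8
    bytes_per_hour = bytes_per_sec * 3600
    return bytes_per_sec, bytes_per_hour

for name, bps in BIT_RATES.items():
    per_sec, per_hour = rates(bps)
    print(f"{name:30s} {bps:>10,} {per_sec:>10,} {per_hour:>14,}")
```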

So maybe it's two times more people and two times fewer bits per second. Any way you slice it, I think it's still a zettascale problem... ]

Posted by Mark Liberman at 07:40 AM

November 04, 2003

What Do You Call Your Boss? How About Your Jailer?

Address forms in many languages, including American English, reflect and reveal certain aspects of the culture's social norms. But usually they're implicit. Your greeting to your drinking buddy (``How ya doin', you old S.O.B.?'') probably shouldn't trip lightly off your tongue when you're greeting your corporate-honcho boss (unless he -- or she -- is also your drinking buddy, and even then...). If you're a 9th-grader, addressing your young classmate as ``Ms. Clark'' will sound odd; and if you've just met her stuffy father, calling him ``Johnny'' will sound even odder. You absorb the mostly unwritten and untaught rules for what to call people as you exit toddlerhood and proceed toward adulthood.

But if you happened to learn the wrong rules for American English, you can get help from the Inmate Handbook that is issued to all residents of the institutions administered by the Commonwealth of Pennsylvania Department of Corrections.

-- Or, at least, you can get this help once you're incarcerated in a state penitentiary in Pennsylvania. (I won't say where I got my copy of this Handbook because I don't want to jeopardize the happiness of a student of mine, an inmate who broke another prison rule by slipping me his copy one evening after class in the prison school.) Here's the Handbook's only language regulation (which I'm quoting accurately, including the grammar and the sexist assumptions about who's likely to hold which titles):

``Addressing Staff Personnel: Inmates should approach all staff personnel with respect and courtesy. Staff personnel should be addressed by their title (Superintendent, Captain, Doctor, etc.) or by `Mister' and if their last name is known (`Mister Smith, etc.) or by `Sir' if their correct title or last name is not known. For women, the appropriate Mrs., Ms., Miss, Ma'm, etc. should be used.''

Like the other directives in the Handbook, this one has teeth: failure to comply can land an inmate in the hole, a.k.a. solitary confinement. Enforced courtesy! Would Miss Manners approve?

Posted by Sally Thomason at 09:02 PM

Reuters: early bilingualism causes autism

Via Language Geek, a Reuters story about "a surprising number" of Korean parents who subject their kids to a frenotomy, cutting 1 to 1.5 cm from the strap of tissue linking the tongue to the floor of the mouth, allegedly in order to "help them perfect their English".

Ankyloglossia ("tongue tie") is a recognized medical condition, for which frenotomy is indicated, but it seems nothing short of preposterous to suppose that this condition would affect speaking English but not speaking Korean.

However, the article cites some negative reactions that are even more hair-raising:

Dr. Shin Min-sup, a professor at Seoul National University who specializes in issues of adolescent psychiatry, is worried about the trend for surgery and also for pushing young children too hard to learn languages.

"There's the potential for life-damaging after-effects," Shin said. "Learning a foreign language too early, in some cases, may not only cause a speech impediment but, in the worst case, make an child autistic."

"What's wrong with speaking English with an accent anyway? Many parents tend to discount the importance of a well-rounded education," Shin said.

So a psychiatrist from Seoul National University is quoted as saying that early bilingualism causes speech impediments and autism, and is also incompatible with a well-rounded education.

Words fail me.

OK. As a working hypothesis, I'd start with the idea that the anonymous Reuters journalist who wrote this article is guilty of criminal quote-mangling. But it's possible that Dr. Shin Min-sup actually said that stuff, in which case the journalist is merely in need of an emergency infusion of common sense. I mean, how can you get to be a Reuters reporter -- writing in English from Korea -- without noticing that kids grow up speaking several languages without developing speech impediments or autism or unbalanced education at an unusual rate?

I'd like to be able to say that this story shows why journalists should be required to take a good introductory linguistics course. However, the writer's failure to apply elementary reasoning to general world knowledge suggests that more education would probably just give him or her more stuff to get confused about.

Read the whole thing (sigh)... or in the UK or French versions, which list Kim Kyoung-wha as the author.

[Note: as usual in cases of apparent journalistic malfeasance, the guilty party may in fact be an editor who deleted essential material or "improved the prose" in ways that changed its meaning. [or substituted a completely different story - ed.] If that's true, I apologize to Kim and transfer all the above complaints to the Reuters editor. Who is guilty at least of failing to notice the article's idiocy, if nothing else.]

[Update: There is indeed someone named Min-sup Shin in the department of Neuropsychiatry at Seoul National University. That doesn't mean that the quotes are valid, of course.]

[Update 11/7/2003: It's occurred to me that the Reuters article doesn't offer any evidence that frenotomies are really rampant in South Korea. One doctor is quoted as saying that he performs the procedure "once or twice" a month, and that only "ten or twenty percent" of parental inquiries lead to surgery. Taking this at face value, it gives us a yearly total of 12-24 surgeries and 60-240 inquiries. Now, maybe there are dozens of other doctors and thousands of inquiring parents. Or maybe this is the one guy who's the frenotomy specialist, and he's boosting his stats, and we're talking about 10-15 surgeries and 50-60 inquiries a year, mostly medically valid or at least not connected to crazed parents frantically pushing English.

In that case, why would it be news? Well, plausibly, because it's a thump in the nose to globalization and (implicitly) to the U.S. The truly odd quote from the psychiatrist is consistent with this. Reuters has been accused of an anti-American bias more than once recently, with some apparent justification.]
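
The yearly totals in the update above can be re-derived in a few lines. This is only a sketch of the arithmetic, assuming the doctor's quoted figures ("once or twice" a month, "ten or twenty percent" of inquiries) are taken at face value; the variable names are invented for illustration.

```python
# Back-of-envelope check of the figures quoted from the one doctor in the article.
# Assumption: "once or twice a month" and "ten or twenty percent" are taken at face value.
surgeries_per_month = (1, 2)
conversion_rate = (0.10, 0.20)  # share of parental inquiries that lead to surgery

surgeries_per_year = tuple(12 * n for n in surgeries_per_month)

# Fewest inquiries: fewest surgeries at the highest conversion rate.
# Most inquiries: most surgeries at the lowest conversion rate.
min_inquiries = round(surgeries_per_year[0] / max(conversion_rate))
max_inquiries = round(surgeries_per_year[1] / min(conversion_rate))

print(surgeries_per_year)            # (12, 24)
print(min_inquiries, max_inquiries)  # 60 240
```

The wide spread in inquiries comes from pairing the extremes: the low end divides 12 surgeries by 20 percent, the high end divides 24 surgeries by 10 percent.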

Posted by Mark Liberman at 08:27 PM

ICANN steps back to move forward on globalization?

In a news release from Carthage:

"ICANN announced it will launch a broad strategic initiative to enable new generic top level domains (gTLDs). The strategic initiative will include a two-stage approach to move to the full globalization of the market for top-level domains."

This seems to be a step back from the earlier announcement from ICANN about "deployment of internationalized domain names" that Steven Bird criticized here a few months ago. However, I'm not enough of a globalization expert to figure out whether the now-promised "full assessment of technical standards to support multilingual TLD's" is really a step back from the earlier claim that "the commencement of global deployment of Internationalized Domain Names (IDNs) ... will allow use on the Internet of domain names in languages used in all parts of the world".

Posted by Mark Liberman at 06:56 PM

Secret cabals of the linguistic elite

Laurence Urdang is the founder of Verbatim, a quarterly publication devoted to mildly humorous writing on language. A collection of writing from Verbatim has just been published (Erin McKean [ed], Verbatim, Harcourt, 2001; ISBN 0-15-60129-X; paperback US$14.00.), and Urdang contributes a foreword. In that foreword, as Lynne Murphy points out in a review (Language 79.2, 2003, 660-61), Urdang makes this astonishing remark about the linguistics profession:

Not all those who are interested in -- even fascinated by -- language are willing to make the effort to study linguistics, which is probably just as well. Professional linguists guard their domain zealously, often forbidding any untrained "amateur" admittance to the secret annual cabals sponsored by such august institutions as the Linguistics Society of America, the Dialect Society, the American Name Society, and so forth.

Now, as it happens, there is no Linguistics Society of America, and there is no Dialect Society, so that might explain any difficulty with finding secret annual cabals under those names. But there is a Linguistic Society of America, and there is an American Dialect Society, and we can assume Urdang was intending to refer to them. So where the hell does he get the nonsense that he writes about them?

The LSA holds a large annual meeting which is widely publicized (go to < /2004annmeet /index.html> for full details of the upcoming one; the Sheraton Boston is not exactly a secret location). Members of the press attend when newsworthy stuff is on the program (as when there was a debate on the great Ebonics controversy). Despite the need for the meeting to finance itself through registration fees, I have never seen LSA staff making even the slightest effort to check on whether members of the public are walking into sessions right off the street. Try it for yourself: just walk in and listen to a lecture or two. If you can pull together $65 you can become a full LSA member for a year and vote in its business meetings. There are no qualifications or training prerequisites. Put down $1500 and you can be a member for life (an incredible bargain, because you get a valuable and expensive journal delivered four times a year, and as its price increases for hoi polloi your membership gets cheaper and cheaper until it is as if the LSA is actually paying you money). And as for the American Dialect Society, they meet jointly with the LSA, so you get to attend their sessions too, at the same hotel, if you go to an LSA meeting.

So what is Urdang talking about? Did "professional linguists" ever do something to him to embitter him thus? There is a legend that Alfred Nobel refused to endow a Nobel Prize in mathematics because his wife ran off with a mathematician. It is completely untrue. But conceivably Urdang had a comely wife who succumbed to the blandishments and intricate covert structures of some theoretical syntactician, or yielded to the persuasive prosody of some beguiling phonologist, and went off with him or her to some secret cabal at the Sheraton Boston? No. That wouldn't account for the fact that further down the page (p.xiii) he proudly lists ten professional linguists who contributed to Verbatim under his editorship. The trouble is, with the very first one he spells the name wrong (it's an extremely famous name: Dwight Bolinger).

So why would Urdang want to assert that linguists hold secret meetings that the "untrained" are barred from, when this has never been true? I have no idea. Why is it "just as well" that language enthusiasts should stay away from linguistics? (Should animal lovers stay away from zoology too, is that the idea?) Why does Urdang stress that "language is the property of us all, and thoughts and opinions about it must not be reserved for the few who regard themselves as the elite" (McKean, ed., p.xiii)? When did the LSA, with its active program of outreach to the public (see the website), ever say or do anything that suggests linguists "regard themselves as the elite"?

Of course people who have no education in linguistics should be able to express their opinions about language whenever they want. I am aware of no linguist who has ever denied that right. I'd really like to know why a man so ignorant of linguists and linguistics that he can't even name Bolinger or the LSA correctly is accusing the members of the strikingly egalitarian linguistics profession of declaring themselves an elite and keeping him out of widely publicized meetings.

Posted by Geoffrey K. Pullum at 02:43 PM

Davies' Corollaries

Daniel Davies at Crooked Timber has wisely conjectured two corollaries to Stein's law.

I'd like to register my own observation that the field of linguistics provides many supporting examples.

I'd also like to suggest that his typo "corollorary" was caused by the (always surprising to Americans) British pronunciation (the one with stress on the second syllable), which leaves us with a nagging suspicion that there must be another "oll" or "or" in there somewhere.

[Update: a clever comment on Crooked Timber says that "Reading this almost gave me a coronorary".]

Posted by Mark Liberman at 01:18 PM

The rhetoric of cold reading

Start with a few Barnum statements, and then move on to the push.

According to an article by James Wood and others in the Skeptical Inquirer, that's how Rorschach Inkblot testers, astrologers and fortune tellers do it.

P.T. Barnum said that "a circus should have something for everyone" and "there's a sucker born every minute". Barnum statements (like "you work hard but your salary doesn't fully reflect your efforts", or "though you appear confident, you're really somewhat insecure inside") are designed to apply to (nearly) everyone; and to convey to every sucker what seems like a special and individual sympathy.

According to the article:

After being warmed up with Barnum statements, most clients relax and begin to respond with nonverbal feedback, such as nods and smiles. In most psychic readings, there arrives a moment when the client begins to "work" for the reader, actively supplying information and providing clarifications. It's at this critical juncture that a skillful cold reader puts new stratagems into action, such as the technique called the "push" (Rowland 2002). A psychic using the push begins by making a specific prediction (even though it may miss the mark), then allows feedback from the client to transform the prediction into something that appears astoundingly accurate:

   Psychic: I see a grandchild, a very sick grandchild, perhaps a premature baby. Has one of your grandchildren recently been very sick?

   Client: No. I. . . .

   Psychic: This may have happened in the past. Perhaps to someone very close to you.

   Client: My sister's daughter had a premature girl several years ago.

   Psychic: That's it. Many days in the hospital? Intensive Care? Oxygen?

   Client: Yes.

By using the push, a cold reader can make a guess that's wildly off target appear uncannily accurate. The push and other techniques are effective because, by the time the cold reader begins using them, the client has abandoned any lingering skepticism and is in a cooperative frame of mind, thereby helping the psychic to "make things fit."

In reading this article, I was struck by the kind of (informal) discourse analysis that the authors are doing. They discuss dimensions of interpersonal interaction that are often crucial to communication, but seem to be missing from the worldview of (most?) linguistic discourse analysts.

For example, what is the formal pragmatics of a Barnum statement? Why is its effect measured in nods and smiles?

The "push" is said to depend on the client being in a "cooperative frame of mind" -- is this the same kind of cooperation that's assumed by the Gricean cooperative maxims that arguably underlie all communicative interaction?

How could a theory of discourse frame (and test?) the hypothesis that Barnum statements "set up" the push?

Such questions are not mysterious from a common-sense perspective, but (in my limited understanding, anyhow) they aren't easy to ask in the framework of linguistic pragmatics. They deal with rhetorical (?) structures that don't reflect the management of reference, or the logic of an argument, or even the expression of attitudes, but instead seem to have something to do with the dynamics of interpersonal emotion, and the way it affects communication. Or subverts it...

Posted by Mark Liberman at 12:14 PM

November 03, 2003


In case you missed Talk Like a Pirate Day, Common-place has a review. (via A.L.D.)

However, Type Like a Pirate Day is not covered...

Posted by Mark Liberman at 11:29 PM

Another eggcorn

Public Service Announcement: If you've come here because you're interested in solemn promises of faithful attachment in marriage, and you've searched for "wedding vowels", you really should make this search for "wedding vows" instead. A vow is "a solemn engagement, undertaking, or resolve, to achieve something or to act in a certain way." A vowel is "a speech sound produced by the passage of air through the vocal tract with relatively little obstruction, or the corresponding letter of the alphabet", usually contrasted with consonant. Your vows will need to contain both vowels and consonants. I wish you all the best in your ceremony and in your life together!

The Miss Manners column in the San Francisco Chronicle today (Monday, November 3) reveals evidence of another very clear eggcorn. A reader wrote in to ask for information on the etiquette (about gifts, etc.) surrounding the practice of "renewing the wedding vowels". Miss Manners was gentle as always: "Forgive Miss Manners for skewering you with a simple typographical error...". But it wasn't a typographical error; it was clearly an eggcorn.

[Update 1/26/2004: A reader has pointed out that the word avowal is no doubt part of the pattern that results in this confusion. (myl)]

Posted by Geoffrey K. Pullum at 08:22 PM

An Early Language Experiment: Failure or Triumph?

Late in the 16th century, the Mogul emperor Akbar the Great tested his hypothesis that babies raised without hearing speech would be unable to speak. He had twelve infants raised by mute nurses in a house where no speech could be heard. Several years later, he went to the house and found that none of the children spoke. Instead, they conversed only in signs. Akbar's hypothesis seemed to be supported: no oral input, no oral language learning.

But most accounts of Akbar's experiment miss the most interesting point. The silent house where the children were raised was called the Gong Mahal, the "Dumb House". But Gong (as Gernot Windfuhr tells me) meant not only `dull, stupid'; it also meant `one who converses by signs'. The mute nurses likely conversed with each other in signs; they must have communicated with their infant charges in signs -- and the children must have developed a kind of sign language. So although Akbar was right in predicting that the children would not learn an oral language, it seems likely that they did in fact learn, or create, a sign language -- either from normal signed input from the mute nurses (if the nurses had a fully developed sign language) or by further developing a rudimentary sign language used by the caregivers.

A rather different version of this story holds that Akbar's goal (or at least one of his goals) was to find out what language the children would speak when they grew up, thinking that that would be the world's original language. (This experiment is in the spirit of the similar, though smaller-scale, experiments supposedly conducted by the ancient Egyptian pharaoh Psammetichos and by King James I of England.) The fact that the children turned out to converse only in signs was deemed a complete failure of the experiment if that was its goal. It doesn't seem to have occurred to anyone that that result might instead suggest that the world's original language was a sign language, not an oral language. (Unlikely, perhaps, but it would be a reasonable conclusion given Akbar's premises.)

Posted by Sally Thomason at 07:29 PM

On second thought, make that "fuckingly brilliant"

The word "f---ing" may be crude and offensive but, in the context presented here, did not describe sexual or excretory organs or activities. Rather, the performer used the word "f---ing" as an adjective or expletive to emphasize an exclamation [and that] ... is not within the scope of the commission's prohibition of indecent program content.

The FCC, in its explanation last week of why it dismissed a complaint by the right-wing Parents Television Council after Bono received his Golden Globe award last year by saying "This is really, really fucking brilliant."
Posted by Geoff Nunberg at 12:51 PM

Weblogs were invented by... Plato!

Camille Paglia recently explained, with characteristic modesty, that

"Now and then one sees the claim that Kausfiles was the first blog. I beg to differ: I happen to feel that my Salon column was the first true blog. My columns had punch and on-rushing velocity. They weren't this dreary meta-commentary, where there's a blizzard of fussy, detached sections nattering on obscurely about other bloggers or media moguls and Washington bureaucrats. I took hits at media excesses, but I directly commented on major issues and personalities in politics and pop culture."

Mickey Kaus retorted that

"I still say Herb Caen's column was the first blog. ..."

Roger Simon observed that

"Though I was a Caen fan, my vote goes to the immortal Jimmy Cannon, New York sportswriter and progenitor of the three dot column. Second choice: Dr. Hunter S. Thompson--"Fear and Loathing in Las Vegas" as the first political blog."

and also reports on research pointing to the first website at CERN in 1992, a Swarthmore student's online diary in 1994, and Dave Winer's Scripting News.

Comments on Simon's site mention Jerry Pournelle's Chaos Manor Musings and Saucer Smear; Dave Winer says

"Roger, all three have legit claims. TBL's first site was a weblog (as were the What's New Pages at Urbana and then at Netscape), and Justin Hall's site predated mine by a couple of years. My claim is that all the things you see called blogs today can trace their roots back to Scripting News, as it inspired bloggers (and provided easy to use tools) to start blogs, and they inspired others and so on."

(Links from Glenn Reynolds)

Well, enough dreary meta-commentary, let's start directly comment[ing] on major issues and personalities!

Following the chain of family resemblances from Camille Paglia through Herb Caen, Jimmy Cannon and Hunter Thompson, I want to skip back a couple of millennia to an even more original model: Plato, whose Republic begins

"I went down yesterday to the Peiraeus with Glaucon, the son of Ariston, to pay my devotions to the Goddess, and also because I wished to see how they would conduct the festival since this was its inauguration. I thought the procession of the citizens very fine, but it was no better than the show, made by the marching of the Thracian contingent."

Is that on-rushing velocity, or what?

I want to point out in passing that the paragraph cited above, in the version I've linked from the Perseus web site, offers five footnotes (from the underlying edition "Plato in Twelve Volumes", translated by Paul Shorey, Harvard University Press, 1969) along with four additional hyperlinks added by the good folks at Perseus. Talk about meta-commentary...

Obligatory linguistic relevance: it's in The Republic that the term prosody originates. Read the whole thing :-).

[Update: Way back in June 2003, Languagehat documented the antiquity of weblogs by quoting Aristotle's attack on the Pythagorean metaphysics of blogging. And you can't get more meta than that!]

Posted by Mark Liberman at 02:49 AM

Linguistic punditocracy: the Rockridge Institute

At last there is (as Paul Postal recently pointed out to me) a liberal think-tank, the Rockridge Institute, to go up against all those conservative institutes and centers in Washington DC that can always provide a Senior Research Associate to talk authoritatively on National Public Radio about the right-wing view on absolutely anything, or to write an op-ed piece for the morning newspapers explaining why the conservatives are right. I always thought we linguists were forever to be denied the delights of Senior Research Associatehood, and it would always be the law faculty and politics profs who would get the talking head assignments. But surprisingly, the Rockridge taxis out onto the runway with a rock-ribbed linguist on the flight deck: George Lakoff is at the heart of it, with his ideas about how conservatives are winning all the battles for linguistic reasons -- they design all the metaphors for framing political discussion. Check out the interview with Lakoff that the UC Berkeley news service has posted.

The Rockridge has the right address for liberal credentials: it's in Berkeley, California. But that has the disadvantage of putting it three thousand miles away from NPR, and three time zones behind Washington, which may mean the conservatives will continue to be ahead. He laughs last who gets to talk on the phone to Bob Edwards first.

Posted by Geoffrey K. Pullum at 12:07 AM

November 02, 2003

Lady Mondegreen says her peace about egg corns

What label to put on errors like egg corn (for acorn)? In a recent posting, Geoff Pullum accepts Mark Liberman's earlier argument that "folk etymology" isn't quite right, because the reinterpretation hasn't spread yet, and (possibly with the example of "mondegreen" in mind) suggests the label "egg corn". Certainly, if any existing label fits for egg corn, "folk etymology" is it. Fact is, we're pretty short of labels for kinds of reshapings of expressions.

A folk etymology is a reinterpretation of an expression as having parts that aren't etymologically justified. Usually this involves messing with the phonology (as in cockroach from cacarootch from Spanish cucaracha), but it wouldn't have to; I've collected Shiffer robe for anglicized chifferobe, with no change in pronunciation (but also spellings with a, ae, and e, indicating a mid vowel rather than a high vowel). In any case, every folk etymology started with reinterpretations by individual folk. Some, like the hardy cockroach, win the day, some, like sparrow grass for asparagus, spread only within a region or social group, and some never get off the ground socially, which is what I take to be the case for egg corn at the moment.

Maybe we should talk about nonce folk etymologies vs. successful folk etymologies, with lots of stuff in between, but the original impulse is the same in all of these cases: to find meaningful parts in otherwise unparsable expressions.

As for reshapings in general, they can affect pretty much any aspect of an expression: (1) how it's divided into parts; (2) how the parts are related structurally and semantically; (3) what lexical items or morphemes are involved; (4) how any of this stuff is pronounced (or spelled); or (5) what the whole thing means.

For (1), there are the classic recuttings, for example, with a(n) either attracting an n from the next word or losing it to the next word.

Pure cases of (2) are harder to find. Here's a possible example: inside and out for inside out (as in His pockets was all inside and out -- from a witness in a trial, so I never had a chance to interview the speaker further). My guess was that this fellow understood inside out as 'both inside and out' (asyndetic conjunction) instead of as 'having the inside out', and simply restored the conjunction. The existence of a fixed expression inside and out 'both inside and outside' (hundreds of thousands of hits on Google) would have encouraged his reshaping.

For (3), there are classical malapropisms: There's a connection, no matter how obtuse [obscure] it is. For (4), phonological reshapings: nucular for nuclear. For (5), private meanings -- ritzy taken to mean 'tacky', from its occurrence in derisive contexts -- metaphoric extensions, and metonymic extensions and contractions.

Some reshapings involve several aspects at once. Mondegreens are global mishearings affecting all aspects except possibly phonology -- and usually phonology is swept along as well.

Some reshapings are subtle, and have no standard names. What's the label for misidentification of lexical items, all other things remaining constant? As in the following tale.

I have a friend who creatively (and cleverly, but unconsciously) reinterprets the parts of all sorts of expressions. I write I've said my piece, and my friend thinks it should have been I've said my peace. Several other -- highly educated -- folks chime in on his side, and they provide rationales for their version of the idiom. (This kind of reshaping wouldn't have to result in a respelling, but things are very clear when it does, and when the writer defends the new spelling.)

So we start with the five-way division above (the parts of which aren't mutually exclusive, but let's keep things simple). We cross that with the distinction between advertent (I meant to say that) and inadvertent (oops!) reshapings -- between classical malapropisms and Fay/Cutler malapropisms, for example. And then cross these with at least a two-way distinction for idiosyncrasy -- between, say, egg corns and standard examples of folk etymologies, like cockroach. And then cross these with a distinction between production-based reshapings (the typo Zqicky) and perception-based reshapings (the misreading that leads to the spelling Fwicky). That gives us at least forty types of reshapings. We're way short in the terminology department. And we haven't even considered reshapings that are done deliberately, usually for humorous effect, as in puns.
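
For anyone who wants to see where the forty comes from, here is a minimal sketch of the cross-classification. The short labels are shorthand invented for this illustration (in particular "established" for the non-idiosyncratic end of the scale), not established terminology.

```python
from itertools import product

# The five aspects of an expression that a reshaping can affect, (1)-(5) above.
aspects = [
    "division into parts",
    "structural/semantic relations",
    "lexical items or morphemes",
    "pronunciation or spelling",
    "overall meaning",
]
# The three two-way distinctions crossed with those aspects.
# All labels here are illustrative shorthand.
advertence = ["advertent", "inadvertent"]
idiosyncrasy = ["idiosyncratic", "established"]
source = ["production-based", "perception-based"]

# Every combination of one value from each dimension is one type of reshaping.
types = list(product(aspects, advertence, idiosyncrasy, source))

print(len(types))  # 40
```

Since the idiosyncrasy distinction is "at least" two-way, any finer gradation there only pushes the count higher.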

Well, the labels are useful for pointing to similar phenomena in different languages and in different circumstances (speaking vs. writing, for example). But they have no special status in linguistic theory or in psycholinguistics, and they shouldn't be multiplied beyond necessity.

Posted by Arnold Zwicky at 10:52 PM

Improve your love life through the power of pragmatics

In an earlier post, I mentioned a college roommate who regularly committed "reverse sarcasm" as a sort of a joke, for instance tasting a new dish and saying "mm, disgusting!" with a blissful smile.

David Beaver offered an insightful analysis of the general issue, concluding that "you can sarcastically express a departure from a salient hope, not from a salient fear."

In a subsequent email exchange, David observed that "[i]f this roommate acts that way on a first date, I'm predicting he is still single." Actually, my old roommate has had a normal history of long-term committed relationships. However, David's comment does suggest that the field of linguistic pragmatics has missed a significant market opportunity: self-help books on how to have better relationships through better communication. (I'm sure that there are many such books, it's just that they're not written by academic practitioners of the discipline of linguistic pragmatics, as far as I know :-)).

Here's the whole exchange:


I wonder... whether it's exactly hope and fear that are involved. Why should it be this emotional scale (as opposed to joy vs. pain, or pride vs. envy) whose expressions are subject to positive-to-negative inversion? Is it something special about hope/fear (such as the property of looking forward in time)? or are you using hope/fear to refer to some more abstract property of attitudes towards states of affairs?

David Beaver:

I agree that it's not exactly hope and fear that are involved, although it seemed like a good way to put it. So it's fortunate that I said "You can sarcastically express a departure from a salient hope, not from a salient fear", rather than "You can *only* sarcastically express a departure from a salient hope...." Looking forward in time seems relevant, but I think the generalization is that you can be sarcastic about situation X just in case something other than X would have been preferable.

"This is so delicious" (of a packet soup that tastes of cardboard, sugar and salt)
= "If the soup had tasted delicious/significantly more delicious, it would have been preferable"

"That's so tiny" (of a cellphone that is too large to fit in a pocket.)
= "If the cellphone had been tiny/significantly tinier, it would have been preferable"

"That's so huge" (of a plot of hotelroom that's barely big enough for the bed)
= "If the room had been huge/significantly huger, it would have been preferable"

I have to admit that this scheme breaks down on your roommate. Perhaps his is some sort of second order sarcasm - "I'm playing a game whereby I like to be served disgusting stuff and by saying sarcastically that this is disgusting I'm implying that in fact it fails to live up to the levels of disgusting-ness which I would like. Hence it tastes good." But this feels like over-analysis. I think it may be crucial that there was a pattern involved in his behavior. If this roommate acts that way on a first date, I'm predicting he is still single. My scheme also fails for cases of understatement.

As for Kerberos, I also find the devil's words hugely but pleasantly confusing. I started off with Addams Family scenarios in my mind, but found it impossible to figure out whether I'd expect the writers to use "good" and "bad" to mean "bad" and "good" or vice versa when sarcasm is included. (There is an implied sarcasm on the part of the script writers when Morticia calls something hideous "beautiful", but Morticia is portrayed as meaning it, not as being sarcastic.)

Here's what I always thought my roommate meant: "Sarcasm is supposed to be meaning the opposite of what you say, but I've noticed that it doesn't always work, and I'm pointing this out to you, cleverly, by attempting an inverted communication that fails (and thus, paradoxically, succeeds)."

I could ask him about it, but I bet he's forgotten the whole thing; it would certainly be the first time that I ever asked anyone to explain a joke that they performed more than three decades earlier. In the category of being slow on the uptake, this would be a new personal best :-).

Posted by Mark Liberman at 10:56 AM

Can the world's oldest person die?

Superlatives and comparatives in a changing world: semantic problems at Crooked Timber.

Posted by Mark Liberman at 10:20 AM

November 01, 2003

Earwitnesses, voiceprints, automatic speaker recognition

There's an interesting article in Legal Affairs about the history and current legal status of earwitness testimony, "expert" testimony on "voiceprints", and speaker recognition technology. [Via Arts and Letters Daily].

Here is the home page for Speaker Recognition Evaluations at the National Institute of Standards and Technology (NIST), where you can find some information about NIST's speaker recognition tests. A little poking around will find you this 2002 paper by Mark Przybocki and Alvin Martin, summarizing the history. Their figure 10 is reproduced below, showing the results of the 2001 evaluation with an operating point of about 1% false alarms (i.e. the system says "that's the one" when it isn't) and around 20% misses (i.e. the system says "that's not the one" when it is). I think this is better than humans could do on this type of task (where there are hundreds of unknown speakers and the speech is recorded over the phone), though I don't know that there are exactly comparable human benchmarks.

This kind of plot, showing the trade-off between misses and false alarms as a detection system's threshold is varied, is called a DET curve. This is a paper by some NIST researchers explaining the nature and value of DET curves.
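
To make the trade-off concrete, here is a minimal sketch of how the points on such a curve can be computed from a set of detection scores. The function name `det_points` and the scores are made up for illustration, and this is not NIST's evaluation code; a real DET plot also puts both axes on a normal-deviate scale, which this sketch leaves out.

```python
def det_points(target_scores, nontarget_scores):
    """Return (threshold, false_alarm_rate, miss_rate) for each candidate threshold.

    target_scores: scores for trials where the test speaker IS the claimed one.
    nontarget_scores: scores for trials where the test speaker is NOT.
    The system says "that's the one" whenever score >= threshold.
    """
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    points = []
    for t in thresholds:
        # A miss: a true target trial that falls below the threshold.
        misses = sum(1 for s in target_scores if s < t)
        # A false alarm: a non-target trial accepted at this threshold.
        false_alarms = sum(1 for s in nontarget_scores if s >= t)
        points.append((t,
                       false_alarms / len(nontarget_scores),
                       misses / len(target_scores)))
    return points


# Hypothetical scores, invented purely for illustration.
target = [0.9, 0.8, 0.7, 0.4]
nontarget = [0.6, 0.3, 0.2, 0.1]

for threshold, fa_rate, miss_rate in det_points(target, nontarget):
    print(f"threshold={threshold:.1f}  false alarms={fa_rate:.2f}  misses={miss_rate:.2f}")
```

Raising the threshold can only lower the false alarm rate and raise the miss rate, which is why the choice of operating point is a policy decision rather than a property of the system.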

[Update 11/2/2003: Like many people, I've personally experienced some extraordinary feats of speaker recognition. A few years ago, for example, when I answered the phone and someone said "Hello, is this Mark?", I instantly recognized the voice as someone I had known well in college, but had been out of touch with ever since. I had no reason to expect them to call me, and in fact I hadn't thought of them in years.

Cases like this always seem to involve friends or at least people I once spent a lot of time with, just like the similar cases of face recognition, which are commoner and therefore less surprising. And I've also had some embarrassing misses, where I meet someone on the street whom I once knew well, and they say "don't you remember me?", and I don't. Or phone calls where I fail to recognize the voice of a current acquaintance.

There have also been a few false alarms, though not many and never very lasting -- cases where I hear someone talking in a public space and for a minute I think I recognize their voice, but it turns out that they're no one that I ever knew. And I'm sure I'd be really bad at the kind of test the NIST folks are running, where a few seconds of speech from an unknown test speaker has to be implicitly compared with hundreds of equally unfamiliar reference speakers.

Still, the occasional extraordinary anecdote does give me the intuitive though perhaps irrational belief that somehow the identity is in there. This kind of feeling may be responsible for the credence the law seems to give to earwitness testimony. Along with uncritical respect for technology -- especially technology with complicated pictures -- this belief also may have helped make "voiceprints" seem plausible.]

Posted by Mark Liberman at 12:19 PM

How does the devil admonish Kerberos?

I've gotten a lot of mail on the reverse sarcasm thing. "Reverse sarcasm" is clearly a bad name, and so I started using "scalar inversion" instead, but David Beaver argues convincingly that no explicit scalar predicates need to be involved.

I described someone who comes home from a long hard day to find that the puppy has pooped on the rug, and says "oh, terrific!" (or "wonderful" or "great" or similar positively-evaluated adjective), meaning the opposite; and contrasted this with the same person finding a bouquet of roses, and saying "oh, disgusting!" (or "ugly" or "annoying" or similar negatively-evaluated adjective) to mean the opposite. I argued that the first is normal and the second is weird. David responded:

[L]et me first point out that this has nothing to do with adjectives. As you observe, there's a very general process going on, so that the particular positive expression does not seem to matter very much. The pooped ironoclast with the pooped-on rug could equally well utter any of the following less adjectival variants on your original examples:

"Just what I needed!"
"That's exactly what I wanted."
"I wouldn't have asked for it any other way."
"At least someone still loves me."
"I knew there was something missing from my apartment."
"That's what it takes to make a house into a home."
"Now I have everything I hoped for."
"Praise the lord!"
"Thank you, Rover!"
"The magic fairy has granted my wish once more."

So far then: no to adjectives being crucially involved, but yes to positives being used sarcastically to express negatives rather than the other way around. However, positives and negatives, while largely common across humanity, vary according to the details of context. It's all a matter of goals.

If you think small is good, e.g. for a cox, a jockey or a cellphone, then when you say "This one's really tiny!" you can sarcastically mean that this one is very large. I would not commend the smallness of your cellphone by saying "It's so large!" However, if I was interested in finding the tallest building in the city, and you showed me an eight story block, I could say "That is HUMUNGOUS" and mean I thought you were rather provincial. But if you showed me a 200 story block, I couldn't commend your choice by saying "It's tiny!" So "small" is negative when er um, well, when it's negative. Otherwise it's positive. And contrariwise for "big". Similarly for most other classic cases of positive adjectives: "hot", "tall", "long", "wide": in all these cases one can easily use reverse sarcasm provided one's goals are themselves inverted. Nothing particularly specific to the nature of adjectives in this, although perhaps it does shed light on the nature of sarcasm.

There may be some adjectives that are inherently to do with positive goals. "Good", an old philosophical chestnut, would seem to be the obvious example. For cases like "good", "nice", "wonderful", it's hard to find cases where our goals can be inverted. Even the devil, when his plans come to fruition, may rub his hands with glee and say "Good!" And when he admonishes Kerberos, he surely says "Bad boy(s)", not "Good boy(s)."

However, even "good" can be inverted at times. Sometimes the devil could use what seem to be positive adjectives negatively and vice versa, though in doing this he surely risks misinterpretation. He might just about sarcastically say of someone who he had sent out to wreak havoc in the world but who had only managed to squash a fly "That was REALLY evil!" But suppose this malefactor managed to commit multiple atrocities, steal from the mouths of millions of poor people and become president of the greatest country in the world (not necessarily in that order). If the devil, suitably impressed, said "That was really, really good", then he would not be using this as a means to praise with sarcasm, he would just be praising simpliciter (on a conventional use of "good" coinciding with his goals, albeit that they are perverted goals).

So here's my latest thesis: you can sarcastically express a departure from a salient hope, not from a salient fear. There's a correlation with adjective choice just to the extent that some adjectives correspond to what we normally hope (prefer, desire, set as a goal, believe ought to be the case), and other adjectives correspond to what we normally fear (disprefer, do not want, avoid, believe ought not to be).

One of the nice things about teaching undergraduates is that they ask questions that have such interesting answers!

I keep thinking about Satan (or should it be Pluto?) rebuking Kerberos by shaking his finger sternly and saying "No! Good dogs!" But wait, shouldn't he smile and pat the critter's heads and ... No, I'm confused.

Posted by Mark Liberman at 07:29 AM

Discourse as turbulent flow

We've all known people like Sir Hector. If truth be told, most of us have probably been like Sir Hector sometimes.

Arthur Hugh Clough, 1819-1861
[lines 83-97]

Spare me, O mistress of Song! nor bid me remember minutely
All that was said and done o'er the well-mixed tempting toddy;
How were healths proposed and drunk 'with all the honours,'
Glasses and bonnets waving, and three-times-three thrice over,
Queen, and Prince, and Army, and Landlords all, and Keepers;
Bid me not, grammar defying, repeat from grammar-defiers
Long constructions strange and plusquam-Thucydidean;
Tell how, as sudden torrent in time of speat in the mountain
Hurries six ways at once, and takes at last to the roughest,
Or as the practised rider at Astley's or Franconi's
Skilfully, boldly bestrides many steeds at once in the gallop,
Crossing from this to that, with one leg here, one yonder,
So, less skilful, but equally bold, and wild as the torrent,
All through sentences six at a time, unsuspecting of syntax,
Hurried the lively good-will and garrulous tale of Sir Hector.

This offers a very different set of metaphors for discourse structure than the sedate stack discipline of Grosz and Sidner (1986), though Clough does suggest that more education and less alcohol should yield a more laminar flow of language, and therefore simpler structures.

Recently, Florian Wolf and Ted Gibson have questioned whether trees are an appropriate model for discourse coherence, and they support their case with a systematic study of AP Newswire and Wall Street Journal text, whose scribes are sober (or at least can hold their liquor better than Sir Hector), but still show about 12.5% crossed dependencies. More on this later...
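
For readers who want to see what a "crossed dependency" amounts to, here is a minimal sketch, assuming that coherence relations are represented simply as arcs between numbered discourse segments. The function names and the example arcs are invented for illustration, and the measure below (the fraction of arcs that cross some other arc) is only one plausible way of counting; it should not be taken as the definition Wolf and Gibson use.

```python
from itertools import combinations


def crosses(a, b):
    """True if arcs a and b cross when drawn above the line of segments.

    Each arc is a pair of segment indices. Two arcs cross when exactly one
    endpoint of one arc lies strictly inside the span of the other.
    """
    (i, j), (k, l) = sorted(a), sorted(b)
    return i < k < j < l or k < i < l < j


def crossed_fraction(arcs):
    """Fraction of arcs that cross at least one other arc."""
    crossed = set()
    for a, b in combinations(arcs, 2):
        if crosses(a, b):
            crossed.add(a)
            crossed.add(b)
    return len(crossed) / len(arcs) if arcs else 0.0


# Hypothetical relations among six discourse segments (0..5).
# The arc (1, 3) crosses (2, 4); the other two arcs cross nothing.
arcs = [(0, 1), (1, 3), (2, 4), (4, 5)]
print(crossed_fraction(arcs))
```

A strictly nested, tree-like structure of the kind a stack discipline produces yields no crossing arcs at all, so any nonzero value here is evidence against that model.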

[Update 11/03/2003: Florian has put .pdf versions of their papers up here.]

Chafe's The flow of thought and the flow of language (1979) echoes the "flow" metaphor, but I don't know anyone who has picked up the "trick riding" idea :-). On the other hand, Lewis and Short tell us that Latin discursus (from discurro) meant "a running to and fro, running about, straggling", perhaps suggesting that a stack discipline is not necessarily always obeyed...

Posted by Mark Liberman at 06:24 AM