October 30, 2003

Sexing text with the Gender Genie

Via Instapundit, the Gender Genie. Check it out.

Posted by Mark Liberman at 10:18 PM

English, Portuguese, Polish, Farsi, French,...

Could it be true?

A note at Language Hat references the NITLE blog census histogram of the languages used (in the roughly 1.5M weblogs surveyed), and expresses surprise that Russian is so far down the list (in 18th place, between Danish and (?) Latin).

I'm more struck by the fact that Portuguese is in second place, and that Polish and Farsi are next, ahead of French, Spanish and German -- and with twice as many Farsi blogs as French ones! Tonnerre de Brest!

[Update: comments on the Language Hat site give details of the LID algorithm used, and explain some of the oddities (e.g. "Latin" is null blogs with "Lorem Ipsum" text, "Breton" is usually misidentified French or Spanish, etc.). It still seems likely that Farsi is well ahead of French.]

[Update 11/03/2003: Boingboing cites Hossein Derakhshan on Cafe Blog in Teheran.]

Posted by Mark Liberman at 09:32 PM

Teraling goes public

Peter Sells has posted an FYI notice on Linguist List announcing an "information and discussion session" on 1/8/2004, before the Boston LSA, to discuss "ways to increase funding for research in linguistics, through opportunities created by new funding channels within NSF". A new website was also announced.

This is the same thing as the Terascale Linguistics Initiative discussed in this blog earlier. I think it's good that Peter downplayed the "terascale" name and emphasized the "ways to increase funding" concept in this announcement, even though that makes the initiative's vision even less crisp and sharp than it was. This gives the profession a chance to participate in defining the initiative, with minimal prejudging of the outcome. Now we just need some google hits for "teraling" that are not a kind of tropical timber, or a word in Bahasa Indonesia, or ... :-)

Posted by Mark Liberman at 10:55 AM

Metonymy notes from all over

Could philosophers of language beat their analytical scalpels into data-mining shovels? Are language engineers discovering analytical distinctions that philosophers have missed?

Well, in any case, it's fall and the smell of metonymy is in the air.

Geoff Nunberg, who's been working in the metonymy trade since 1978, has a new piece entitled "Indexical Descriptions and Descriptive Indexicals". It's due to appear in a forthcoming OUP collection which is blurbed as "brand-new essays on important topics at the intersection of philosophy and linguistics."

Meanwhile, committees of engineers involved in the DARPA ACE (Automatic Content Extraction) project have extended their annotation task guidelines to cover Arabic and Chinese. The general approach is summarized by the slogan "Facts = entities + relations". A key issue in the ACE "entity" arena is the problem of metonymy, which the ACE guidelines divide into two types: "composite metonymy" or "role assignment", which is viewed as focusing attention on one of the intrinsic aspects or attributes of a "geo-political entity" (such as its territory or its government or its population), and "classic metonymy", which is viewed as using a GPE's name to refer to an entirely different thing (such as a sports team).
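One way to picture the two-way split described above is as a small data structure. This is only a sketch under assumptions: the class and field names (`GpeRole`, `Mention`, `role`, `referent`) and the example sentences are invented here for illustration, and nothing below reproduces the actual ACE annotation format.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class GpeRole(Enum):
    """Aspects of a geo-political entity that a mention can pick out."""
    TERRITORY = "territory"
    GOVERNMENT = "government"
    POPULATION = "population"


@dataclass
class Mention:
    text: str                       # the name as it appears in the sentence
    entity: str                     # what the name literally denotes
    role: Optional[GpeRole] = None  # filled in for "composite metonymy"
    referent: Optional[str] = None  # filled in for "classic metonymy"

    def kind(self) -> str:
        # A name used for an entirely different thing (e.g. a sports team).
        if self.referent is not None and self.referent != self.entity:
            return "classic metonymy"
        # A name used for one intrinsic aspect of the entity itself.
        if self.role is not None:
            return "composite metonymy"
        return "literal"


# Invented examples, not taken from the ACE guidelines.
examples = [
    Mention("France", "France", role=GpeRole.GOVERNMENT),  # "France signed the treaty"
    Mention("France", "France", role=GpeRole.TERRITORY),   # "It rained all over France"
    Mention("France", "France", referent="a national sports team"),  # "France won 2-0"
]

for m in examples:
    print(m.text, "->", m.kind())
```

Written this way, the question of whether the distinction is qualitative becomes a question of whether `role` and `referent` really need to be separate fields.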

I have no doubt that Geoff and other philosophico-linguists could help the engineers sort this stuff out -- these seem like deep ontological waters -- but at the same time, it looks like the information-extraction community (it's not quite yet an industry) has some issues and ideas to offer that don't seem to be treated in the philosophical literature. For instance, is there really a qualitative distinction between "composite metonymy" and "classic metonymy"? What's the role of convention in this area -- do metonymic norms differ across languages or across genres, and does this matter?

Geoff has worked on automatic genre classification but not (I think) on automatic information extraction, and (although I've known him since he was thinking about metonymy while driving a cab as a graduate student at CCNY), I don't know if he thinks that his work on metonymic aspects of reference has practical applications in data mining, or what he thinks of the distinctions that the ACE crowd is making these days. Geoff?

Posted by Mark Liberman at 09:35 AM

Another scientific revolution?

Steven Bird has drawn my attention to a method for "publicly registering data sets with a persistent identifier and structured basic description."

This might be more exciting than it sounds.

According to the press release

"This use of DOI will provide for the effective publication of primary data using a persistent identifier for long-term data referencing, allowing scientists to cite and re-use valuable primary data. The DOI's persistent and globally resolvable identifier, associated to both a stable link to the data and also a standardised description of the identified data, offers the necessary functionality and also ready interoperability with other material such as scientific articles."

Steven points out that LDC might use this as a way to go beyond LDC catalog numbers and ISBN numbers as a way to provide durable references for published linguistic data.
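The general idea of a persistent identifier, as the press release describes it, can be illustrated with a toy in-memory registry: the citation stays fixed while the location behind it is free to change. This is a sketch only. `ToyRegistry`, its methods, the identifier string and the URLs are all hypothetical, and none of it is the DOI system's actual interface.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Record:
    location: str     # where the data currently lives
    description: str  # structured basic description (here just a string)


class ToyRegistry:
    """Hypothetical stand-in for a resolver: identifier -> current record."""

    def __init__(self) -> None:
        self._records: Dict[str, Record] = {}

    def register(self, identifier: str, location: str, description: str) -> None:
        self._records[identifier] = Record(location, description)

    def relocate(self, identifier: str, new_location: str) -> None:
        # The data moves; the identifier that papers cite does not change.
        self._records[identifier].location = new_location

    def resolve(self, identifier: str) -> str:
        return self._records[identifier].location


registry = ToyRegistry()
cited_id = "10.9999/example-corpus"  # made-up identifier for illustration
registry.register(cited_id, "http://old.example.org/corpus", "Example speech corpus")
registry.relocate(cited_id, "http://new.example.org/data/corpus")
print(registry.resolve(cited_id))
```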

It looks interesting, though I haven't had a chance to figure out how it really works in detail, and what people can really do with it.

One thing I'd like to understand better is the relationship to the Open Archives Initiative and the Open Linguistic Archives Community. Steven?

It's been a long-time dream of mine to be able to read a scientific article, and to access the underlying data and analyses through a process as simple as clicking on a hyperlink. From the other side, I'd like to be able to give readers the same sort of access to my data and analyses. It looks like we're gradually moving in that direction (though persistent identifiers for data are only part of what is needed). When we get there, I believe that it will have a much more profound impact than most people realize, affecting science in something like the way that URLs and hypertext and browsers have affected mass communications.

Posted by Mark Liberman at 08:16 AM

October 29, 2003

What language are we in, mon ami?

A friend of mine recently learned that the perl programming language permits system calls to run other programs, which means a perl program can call up a simple Unix shell script, which means you could write a perl program that does nothing but call up a shell script and run it. So he jokingly suggested to me that he might put all his shell scripts in perl wrappers in this way, and thus become a bona fide perl programmer instantly, with no learning curve (learn perl in one minute flat! no studying! amaze your friends! enhance your job prospects!)

This led me to thinking that the notion of being a program in one particular language isn't really very well defined. Let me explain...

A shell script can read in arbitrary material in another language as a kind of quotation; consider this sequence of commands:

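    #!/bin/csh -f
    # Quote a C program into foo.c by way of a here-document.
    cat >! foo.c <<QQ
    main(){printf("Hello world\n");}
    QQ
    # Compile it, run the result, then remove the evidence.
    cc foo.c
    ./a.out
    /bin/rm foo.c a.out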

This mumbo-jumbo has the top-level form of a C-shell script. It causes "Hello world" to appear on your screen. But the way it does so is by quotation: it quotes the code of a program in another language (the C language), gets the cat program to put the quoted material into a file, compiles the file with the C compiler, and then executes the compiler output and covers its tracks by removing the files. Is this a C-shell script? Is it a C program? If wrapped in a perl shell would it be perl? One could stipulate answers to these questions (it's a C shell script because the first word is #!/bin/csh), but it seems to me like the point is being missed.
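The wrapper trick from the opening anecdote looks much the same in any language that can make system calls. Here is a minimal sketch in Python rather than perl. The function name `run_wrapped` and the one-line script are invented for illustration, and the sketch assumes a POSIX `/bin/sh` is available.

```python
import subprocess

# Hypothetical one-line shell script; any script text would do.
SHELL_SCRIPT = 'echo "Hello world"'


def run_wrapped(script: str) -> str:
    """Hand the script to /bin/sh and return whatever it prints."""
    result = subprocess.run(
        ["/bin/sh", "-c", script],
        capture_output=True,  # collect stdout instead of letting it pass through
        text=True,            # decode bytes to str
        check=True,           # raise if the shell exits with an error
    )
    return result.stdout


if __name__ == "__main__":
    print(run_wrapped(SHELL_SCRIPT), end="")
```

All of the actual work happens in the quoted shell text; the Python contributes nothing but the act of quotation.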

If this holds for computer programming languages, it surely holds much more for natural languages. What language am I using if I say the following?

What I say is sauve qui peut, mon vieux!

(There are characters who talk this way, annoyingly, in some novels, as I recall. Hercule Poirot does, I think.)

There are two ways to go. One is to include the French bits and their structure and meaning in the structure of the English sentence, and thus for consistency include all French sentences in English, with all their structure, and thus completely blur the line between English and French, and between English and any other language (they would all have to be included). The other is to say that the example is English but it has exactly the same grammar and meaning as What I say is aaaaahhhh!. This would amount to saying that in certain contexts random noises you make with your mouth are permitted to appear in English sentences as if they were words, with no length limits or structural constraints.
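The second option can be made concrete with a toy recognizer that treats whatever follows "What I say is" as an unanalyzed slot. The pattern and the function name `accepted` are made up for this sketch. It is not a proposal about English grammar, just a picture of what "no length limits or structural constraints" amounts to.

```python
import re

# A single sentence frame with an unconstrained quotation slot.
FRAME = re.compile(r"^What I say is (?P<quoted>.+)!$")


def accepted(sentence: str) -> bool:
    """True if the sentence fits the frame, whatever fills the slot."""
    return FRAME.match(sentence) is not None


for s in [
    "What I say is sauve qui peut, mon vieux!",
    "What I say is aaaaahhhh!",
    "What I say is xq zzt glorp!",
]:
    print(accepted(s), s)
```

French, a scream and keyboard noise all pass, because the frame never looks inside the slot.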

Either way, the set of strings of noises that get classified as English seems intuitively way too big. There is too much included that doesn't relate at all to what the rules say about the grammatical structure or pronunciation of sentences of English. And since any string of nonsense could pop up, most conclusions about what can appear in an English string are blown away (a point that has been made by Alexis Manaster-Ramer in the context of pointing out that one cannot really support Chomsky's claim from 1956 that English is not accepted by a finite state machine, because most or all of the drivel one needs to define as ungrammatical to argue for this result turns up in strings in the form of quotations and names).

These ruminations are not quite the pointless speculation that they might appear to be. I think they carry the message that whatever we think a natural language is, we should not think of it as simply a collection of sentences. The computer science idea of defining a `language' as a set of symbol strings does have mathematical advantages, and it meshes beautifully with work on generative grammars (which is where it came from). But as my initial examples suggest, it has counterintuitive aspects even in computer science. And it's certainly not the way to conceive of natural languages. But for the same reasons, natural languages also shouldn't be thought of as mentally inscribed generative grammars (Chomsky's "I-languages").

Naturally, I wouldn't be saying all this if I didn't have a better story about how to think about natural languages. I have a most wonderful story about that topic... but unfortunately this blog entry is too small to contain it.

Posted by Geoffrey K. Pullum at 07:57 PM

Scalar inversion and the unique cephalopod of negation

I sent a query about "reverse sarcasm" to Larry Horn, figuring that he would have worked on this or related things at some point.

I was right:

Thanks for the blog. I've written about similar data, in Chapter 5 of my Natural History of Negation, basically along the same lines Ellen invokes here. The imbalance in the use of irony/sarcasm here is very similar to the asymmetries that show up with negation: we often deny a positive to convey a negative (the rhetorical figure of litotes), as when I say "I'm not optimistic they'll resolve the problem" to mean that I'm pretty pessimistic about it but not vice versa, or "I don't like X" to mean I actively dislike it (while "I don't dislike it" is more of a straight contradictory negation), since there's no cultural taboo against providing positive evaluations the way there is with negative ones. (In certain cases involving something like the knock-on-wood taboo, there is, whence the positive connotations of "not bad!") In that chapter, I trace the recognition of this asymmetry back to the late 19th century Romance philologist Adolf Tobler, who recognized that the irony in examples parallel to "You spent two days in Monterey? How awful for you!" (e.g. "I don't hate you" as a passionate declaration of love) is more indirect and self-conscious than that in the characteristic negative strengthening that yields the contrary (as opposed to contradictory) readings of e.g. "I don't like it" or "I don't believe that p". I also treat the asymmetry in negative prefixation (unhappy vs. *unsad, unkind vs. *uncruel) as a similar phenomenon, albeit with a higher degree of conventionalization than in the simple litotic or neg-raising contexts. In a real sense this all involves an application of the "politeness" dimension that Ellen associated with neg-raising back in her '76 Language paper, when we were all so much younger.

So go buy A Natural History of Negation, from which you can learn all about such things! I just did. My (inadequate) excuse for not owning it already is that I'm just a poor simple phonetician, who has strayed into these deep interpretive waters only because an undergraduate poked holes in my classroom example of "speaker meaning".

[P.S. Barnes & Noble seems to have gotten its reviews jumbled up a bit, so that the discussion of Larry's book begins "An exquisite natural history of this unique cephalopod ..." What a heart-warming, if accidental, affirmation of the fundamental unity of rational inquiry.]

Posted by Mark Liberman at 11:12 AM

October 27, 2003

Phrases for lazy writers in kit form

Mark Liberman asked in an early posting to LanguageLog what we should call a linguistic error like egg corns (for "acorns"), arguing convincingly that it is not exactly a malapropism or a mondegreen or a folk etymology. I answered that it should be called (of course) an eggcorn.

It now occurs to me that we also need a name for another linguistic figure, also noted by Mark but not yet named. Roughly speaking, the thing we need a name for is a multi-use, customizable, instantly recognizable, time-worn, quoted or misquoted phrase or sentence that can be used in an entirely open array of different jokey variants by lazy journalists and writers.

Let me explain.

Mark pointed out to me in connection with a post of mine on LanguageLog that hundreds or even thousands of unimaginative writers are using If Eskimos have N words for snow... (pick any number you like for the N), especially as the first sentence in a piece. It has become a journalistic cliché phrase with an attention-grabbing hook and totally free parameters for you to set as you wish -- that is, the value for N and the main clause that you continue the sentence with (like ...Santa Cruzans must have even more for surf or whatever).

Well, I just discovered another one, quite by accident. Checking on the source of the original poster slogan for Alien, I was distracted by finding that the web has ten thousand or more instances of jokey variants on it. The original was In space, no one can hear you scream. I found people saying (often in headlines for film reviews) that in space no one can hear you belch, bitch, blog, cream, DJ, dream, drink, explode, gag, groan, laugh, moo, opine, pop, sell, sing, smeg, snore, speak, squeak, suck, sweat, tap, whimper, yawn... And there are plenty more, like "say thank you", "ask for bail", "ask `What the heck was that supposed to be?'"... Not to mention In space, no one can see your breasts and many other farther-out twistings of the slogan.
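The "free parameters" in these two phrases can be written down literally as templates with slots. The sketch below is illustrative only: the slot names (`n`, `place`, `thing`, `verb`) and the helper names `fill` and `find_variant` are invented here, and the pattern is a loose one that only catches variants keeping the "hear you" frame.

```python
import re
from typing import Optional

# Hypothetical templates; the slot names are made up for this sketch.
ESKIMO = "If Eskimos have {n} words for snow, {place} must have even more for {thing}."
ALIEN = "In space, no one can hear you {verb}."


def fill(template: str, **slots: str) -> str:
    """Set the free parameters of a template."""
    return template.format(**slots)


# Loose pattern for spotting variants of the Alien slogan in running text.
ALIEN_PATTERN = re.compile(
    r"\bin space,? no one can hear you (?P<verb>[^.!?]+)", re.IGNORECASE
)


def find_variant(text: str) -> Optional[str]:
    """Return whatever fills the verb slot, or None if the frame is absent."""
    match = ALIEN_PATTERN.search(text)
    return match.group("verb").strip() if match else None


print(fill(ESKIMO, n="forty", place="Santa Cruzans", thing="surf"))
print(fill(ALIEN, verb="blog"))
print(find_variant("Review: In space no one can hear you snore!"))
```

Farther-out twistings that change the frame itself, such as swapping "hear you" for something else, fall outside the pattern and come back as `None`.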

What's needed is a convenient one-word name for this kind of reusable customizable easily-recognized twisted variant of a familiar but non-literary quoted or misquoted saying. (I say "or misquoted" because there is actually no original source for The Eskimos have N words for snow, people only think it once appeared in some reputable source.) "Cliché" isn't narrow enough -- these things are certainly clichés, but a very special type of cliché. And "literary allusion" won't do: these things don't by any means have to be literary.

Usually I'm moderately good at thinking up terminology. I am the proud creator of the term vortensity, which now has dozens of citations in astrophysics (it denotes the ratio of swirling rate to unit surface density in accretion disks like the rings of Saturn or other rotating clouds of debris). But on this particular meme type I have failed so far.

I'll probably think of something. Or one of the other LanguageLog bloggers will. Or someone will email me a good idea in plain ASCII text form and I'll credit them here.

Posted by Geoffrey K. Pullum at 06:22 PM

Reverse sarcasm?

A student in my Linguistics 001 class asked me a hard question: why doesn't "reverse sarcasm" work?

We can use any positively evaluated word to mean its opposite, given a halfway appropriate context and performance:

   "how wonderful!" (said of something horrible)
   "how delicious!" (said of something disgusting)
   "how thoughtful of him!" (said of thoughtless behavior)

But the other direction rarely works:

   #"how horrible!" (said of something wonderful)
   #"how disgusting!" (said of something delicious)
   #"how thoughtless of him!" (said of thoughtful behavior)

There are specific reversals like "bad" for "good", but they're much more culturally, lexically or situationally restricted.

All the obvious Gricean accounts that I can think of seem to be invertible, which is not consistent with the facts. So I did what I usually do in such cases: I asked Ellen Prince. She came back with a connection to a classic observation by Edward Sapir:

In a paper from way back when ('Grading' I think), Sapir noted that the noun for a scalar property corresponds to the adjective at the positive end -- so one's beauty can be zero, meaning one is ugly, but one's ugliness being great doesn't make one beautiful. Likewise, height (< high) is unmarked for how high/tall one is but shortness must be short; one's intelligence can be so low that one is stupid but one's stupidity can never get high enough to make one intelligent, etc etc etc.

As Ellen suggested, the problem with "reverse sarcasm" is probably connected to this, somehow: sarcasm can reduce the implicit value of a positive scalar property to the point that it turns into a negative one, but doing the same thing to a negative scalar property doesn't turn it into a positive one.

[Update 10/28/2003: several people, including Ellen Prince and Prentice Riddle, have supplied examples where negative-to-positive reversal seems to work. Prentice's contribution was personal and convincing: "You spent two days in Monterey? How awful!"

But I still think that there's a difference here. His example reminds me of one of my college roommates. I can remember John tasting the taramasalata at the Greek restaurant across the street from our dorm, and saying "Mmm, disgusting!" with a beatific smile on his face. He was known for this sort of thing, and it was generally regarded as weird. I'm pretty sure that *he* thought it was weird, and did it precisely because it doesn't really work by the normal rules of conversational interpretation, though it seems like it should.
(I don't mean that Prentice is weird. His example seems normal, it just reminds me of an old friend's long-ago odd jokes).

In general, I somewhat mistrust my intuitions on this, and recognize that it might be a "mind set" problem like the old quantifier dialect investigations.

It's also been pointed out to me that "reverse sarcasm" is a pretty bad term for this phenomenon, since sarcasm usually doesn't involve inversion of scalar predicates, and the scalar-predicate-reversal cases need not have the contemptuous or mocking tone required for sarcasm.

So to sum up, the cited facts are somewhat wrong, and the proposed name for the phenomenon contains at least two mistaken presuppositions. Oh well, feel free to apply for a pro-rata refund of your subscription fees :-).]

Posted by Mark Liberman at 07:21 AM

Terascale Two

[Photo: Peter Sells chairing the breakfast discussion]

This is a quick summary of the second (and last) day of the "Terascale Linguistics Initiative" workshop, composed in the airport waiting to fly home. We met for an hour and a half over breakfast, took a tour of the Monterey Aquarium (the jellyfish were my favorites), checked out of our hotel rooms and talked over lunch, then had some smaller break-out discussions and a final plenary summation, ending about 4:30.

An easy day, compared to the 12-hour marathons that I'm used to at DARPA workshops, but alas this was only the first phase for me. I'm writing from the Monterey airport, where my flight to San Francisco is due to leave more than two hours late, delayed by cascading effects from the fires throughout southern California. If the flight isn't delayed further, I'll have a few minutes in SFO to make it to the other end of the airport for the red-eye that will get me to Philly at 6:00 a.m., in time to take a shower and give a Linguistics 001 midterm exam. And that's the optimistic scenario :-).

We took some group pictures at the aquarium, but the guide seems not to have actually pressed the button on my camera, because the only new picture on it is the one above, which I took earlier, showing Peter Sells leading this morning's discussion over breakfast.

Ok, so what about TSL?

Well, over the course of the day, the group generated a lot of interesting raw materials for the process of defining and proposing the initiative, and at the end of the day, it was decided who would take various next steps. According to the plan, there will be a "town meeting" to discuss the idea at the Boston LSA on January 8, and then not too long after that, some people will draft a "prospectus" for the initiative and present it to NSF. After some arm-twisting, I agreed to be one of these people. When I figure out what this means, I'll explain it further :-).

The more immediate task is to prepare for the town meeting. The format, as I understand it, will be five (?) short (10-minute) presentations, and then an open discussion. So one key issue is figuring out exactly what will be in those prepared presentations. There were some decisions made about who will talk about what, but I wasn't taking notes at that point. I'll post the details when I recover them.

Another, even more pressing need is to publicize the January 8 session. Since people are already making their travel plans, and 1/8 is the day before the main LSA meeting, it's especially important to get the word out quickly. In fact, I feel that it would have been a good idea to have made the announcement back in July or so, when the NSF workshop proposal was funded.
[Update 10/31/2003: Peter Sells has pointed out to me that it wasn't possible to publicize the LSA session until the LSA Program Committee accepted the proposal, which didn't happen until 9/30/2003.]
Today, it was agreed that Peter would ask the LSA secretariat to put something on the meeting web site right away, and he will also announce the session on the LINGUIST list within a few days. The TSL web site will also come into public existence at Stanford ASAP, though I don't know the timetable.

The discussions today did not arrive at anything like a concrete and specific idea of what TSL is, though many ideas were raised and discussed. These included both very specific suggestions for new kinds of data -- e.g. "eye-tracking data from both parties in a dialogue" among many other ideas -- and also general research questions that might be answered by new kinds or quantities of data, new methods of analysis, etc. In principle it's appropriate to leave things in this state, since various kinds of community discussion are still in the future.

However, it seems to me that it wouldn't hurt to have a straw man proposal -- or maybe a whole straw family of alternative short, clear descriptions of what TSL might be. I'm sure that LINGUIST list will not just be a passive channel for announcing the session at the LSA, but also an interactive medium for a lively discussion of the issues. And I imagine that there will be discussions in other places as well, including weblogs and other email lists as well as linguists' offices, labs, backyards and favorite bars. With luck, by January there'll be some good clear targets to fire at.

[Update: I made the connection in SFO, with whole minutes to spare, and I'm posting this from home the next morning. After the shower, but before the exam.]

Posted by Mark Liberman at 06:59 AM

October 26, 2003

Up they turned

Ah, how devilishly subtle is the epistemology of grammatical investigation, I muttered to myself the other day, as I read a story about Northern Ireland in my favorite serious news magazine. Prime ministers Tony Blair (UK) and Bertie Ahern (Eire) were due to turn up arm in arm to be present at an encouraging announcement of agreement (it was to prove illusory) between Unionists and Sinn Fein, said The Economist (October 25th, 2003, p.52, column 1): "Downing Street duly announced it, and up the prime ministers turned."

But of course, that's not grammatical. Sometimes you have to refuse to trust the evidence of your own eyes, and this is one such time.

Turn up is one of the fossilized prepositional verbs of English, like come across meaning "encounter". You can say Let me know about anything that you come across but not *Let me know about anything across which you come. The come across sequence has to be in that order and untampered with. So does turn up in the sense of "arrive". You can normally switch prepositions like up to the front and kick subjects to the end, as in Off the prime ministers went, or Up the monkey climbed; but *Up the prime ministers turned is not grammatical English.

Yet it's not an error, either.

I'm a firm advocate of the use of corpora of real live text for evaluating the correctness of proposed grammars, but this case shows how careful you have to be. The Economist's writer is being jocular. Fooling about with the language.

How do I know? All I can tell you is that after fifty uninterrupted years of paying close attention to the use of a language you know a thing or two. Trust me. Sometimes what you see isn't what you ought to get.

So what we have here is a case of a sentence that is not grammatical and not a careless slip, yet it occurs in print in a carefully edited (and indeed, error-free) article in a serious publication. The naive empiricist here is going to say that I am cutting the empirical bottom out of the discipline of linguistics if I posit such a thing. But they're wrong. Relying on a corpus as if it were handed down by God is corpus fetishism, not linguistic science. When you're a descriptive grammarian like me, sometimes you have to trust the corpus and modify your intuitive idea of what is grammatical, and sometimes you have to use your intuitive knowledge of the language to ward off false impressions the corpus might give you. It's not a straightforward matter. Science never is.

Posted by Geoffrey K. Pullum at 07:36 PM

The "Terascale Linguistics Initiative"

I'm in Monterey, along with about twenty other linguists and people from allied fields such as anthropology and psychology, to participate in a two-day workshop on something called the "Terascale Linguistics Initiative". This is a brief report on the first day's session, posted from my hotel room the morning after.

I knew very little about this initiative before coming here. I was invited by Peter Sells, and I learned a bit more about the history and context from his presentation at the start of the first day's session. It seems that back in 2001, Cecile McKee (then the head of the Linguistics Program at the National Science Foundation) asked the NSF Linguistics Panel to suggest possible initiatives. I think that at NSF "initiative" then meant something like "effort to foster research by setting a theme," and that the new term for this is "priority area", though maybe I have the lexicography wrong here.

Peter, who was then a member of the linguistics panel, came up with the idea of an initiative to promote linguistic research based on "the opportunity created by new technology", as he put it in his presentation yesterday. To carry this idea forward, there was a small brainstorming meeting at Stanford at some point in 2002, as a result of which Peter and a couple of others applied to NSF for funding to hold this larger workshop, and to arrange a "town meeting" at the Linguistic Society of America 2004 meeting in Boston (though I can't find anything about it on the meeting site as yet).

The organizers are familiar enough with the "Terascale Linguistics Initiative" concept to use the acronym TSL in a familiar way -- as I will from now on -- but there is not very much of a web presence as yet for this concept. In fact, as I learned before coming here, "terascale linguistics initiative" does not occur in google's index. The URL that I used for the hyperlink in the first sentence of this post was cited in the paper packet of materials that I got at the workshop, but it's on a site at Stanford that is password-protected, and therefore isn't indexed by web search engines. I don't have the password, so I haven't seen the site either. I believe that Peter intends to open it to the public at some point, perhaps by the time you try it, gentle reader.
[Update 11/1/2003: the public site's URL is http://www.teraling.org.]

As for what TSL means, I learned a lot at yesterday's session, but the proposed initiative is still pretty diffuse for me. From the point of view of funding within NSF, the goals seem modest at least initially: what was discussed yesterday was the idea of having a kind of subordinate focus ("area of emphasis" might be the term) starting in 2004 under the NSF "Priority Area in Human and Social Dynamics", for which the 2003 call is here. From the point of view of intellectual content, the idea is still not clearer to me than what I can infer from Peter's phrase "the opportunity created by new technology," which is that the advance of networked computing technology makes big quantitative improvements in the accessibility and cost-performance of corpora and linguistic databases of all kinds. This enables old ideas to be carried out on a larger scale or by more researchers, and it also enables some entirely new kinds of research, in linguistics as in almost every other scientific and scholarly discipline.

At yesterday's session, in addition to Peter's brief history of TSL, we got a systematic survey of language-related stuff at NSF, presented by the current linguistics program director Joan Maling and the director of the Human Language and Communication Program, Karen Kukich. This presentation was enlightening but did not relate specifically to TSL. Then we heard presentations about "what TSL means to me and my subfield" (that's my characterization, not anyone's title) from five participants: Beth Levin, Katherine Demuth, Mary Beckman, Jack Dubois, Norma Mendoza Denton. These presentations were uniformly interesting, and in some cases I was happy to learn about pieces of work that were new to me. However, they were also very detailed and specific, so that as a whole they presented a sort of pointillistic sketch of what TSL might be. We got five widely separated clusters of points, each representing some specific examples of "the opportunity created by new technology" in language-related research, but no synthesis or systematic vision as yet.

The eight background presentations took about twice as long as Peter had allocated for them, so that except for a few minutes at the end of the afternoon planning today's sessions, and a few brief exchanges during the presentations, that's pretty much all that happened yesterday.

I'll report about today's meeting, probably at some point after I get back home, as I'm taking the red-eye back tonight and won't have net access during the session today.

I should say that TSL seems to me to be basically a Good Thing. The trends that it reflects are important and deserve to be encouraged, and I'm in favor of anything that promotes an improvement in the very modest funding level of NSF's linguistics program.

[Update 10/29/2003: Peter Sells points out to me that the official acronym is TSL not (as I thought) TLI. My memory of having heard people talking about "TLI" is apparently a symptom of a low-grade acronymic aphasia, perhaps caused by the fact that all possible three-letter combinations are already taken many times over (google tells me that "TSL" is "Texas School Libraries", "Tokyo Specialty Love", "Triple Super Lead", and some 236,000 others, while "TLI" is only "Taipei Language Institute", "TOUGHLOVE International" and some 114,000 alternative readings). Anyhow, I'll make the change throughout my two posts on TSL, to avoid confusing others, and I'll add this one to my notes for the poignant memoir "The Man who Mistook his PDA for a PFD".]

Posted by Mark Liberman at 05:36 AM

October 24, 2003

Calling all parsers

From Alexander Williams, via Seth Kulick:

On the front door of the stadium-sized old-school steakhouse "The Pub" in Pennsauken, NJ, at the armpit of Routes 30, 130 and 38, we find a shiny red plaque from 1964 with the inscription:

Volume Feeding Management Success Formula Award

I'd like reports on parsing results from all the major labs in carbon-copy triplicate by Monday.

To which I reply: "Carbon copy?"

Seriously, parsing complex nominals is a serious and interesting problem, about which more in another post; right now I'm on my way to the airport to travel to a workshop on "terascale linguistics", about which more in another post; right now I'm recursively generating too many distractions....

[Update 10/25/2003 from Monterey: google tells me that the very same six-noun sequence was cited (though not sourced) in a footnote on p. 22 of Regina Barzilay's 2003 Columbia University PhD thesis, in a discussion that gives a useful review of the (computational, psycho- and plain) linguistic literature on the interpretation of complex nominals. This is more like terascale library science than terascale linguistics, but maybe there's a connection. I hope to find out later today, when the "terascale linguistics" workshop I'm attending gets down to defining its terms.]

Posted by Mark Liberman at 01:07 PM

October 23, 2003

Economist blender blunder

Of all the magazines and newspapers that have declined to publish letters of mine, I am bitter about only one. In March 1997 The Economist declined to publish a letter that would have been a true first in natural language text: a normal piece of prose containing a meaningful contiguous minimal word quintuple. Yes, we're talking about a grammatical and meaningful sequence of five consecutive words in a natural context that are differentiated from each other by just a single character. And in the case at hand these were 7-character words, no less, and the differentiating characters were vowels, all in the same position.

This could have placed the Economist permanently in the linguistic book of records. (Well, there isn't one, actually, but there could have been, and they could have secured a place in it.) What a myopic, blinkered clod their letters page editor must be. The letter was, at the time, fully topical. It was a response to an article about Russian oil pipeline problems that appeared in the magazine the week before. It deserved to appear in print. Read it. You be the judge. Here it is.

Stevenson College
University of California, Santa Cruz
Santa Cruz, CA 95064

March 20, 1997


"Connections needed" (March 15) reports that Russia's Transneft pipeline operator is not able to separate crude flows from different oil fields: "they all come out swirled into a single bland blend." This is quite true. And worse yet, the characterless, light-colored mix thus produced is concocted blindly, without quality oversight, surely a grave mistake. In fact, I do not recall ever encountering a blinder blander blonder blender blunder.

------Geoffrey K. Pullum
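
For readers who want to check the claim mechanically, here is a minimal sketch in Python. The function names are hypothetical, invented for this illustration. It confirms only the formal property described above, that the five words are distinct, equal in length, and differ at a single shared position; it says nothing about whether the sequence is grammatical or meaningful.

```python
def differing_positions(words):
    """Return the set of character positions at which the words are not all identical."""
    lengths = {len(w) for w in words}
    if len(lengths) != 1:
        raise ValueError("all words must have the same length")
    # zip(*words) walks the words column by column.
    return {i for i, chars in enumerate(zip(*words)) if len(set(chars)) > 1}


def is_minimal_tuple(words):
    """True if the words are pairwise distinct and differ at exactly one position."""
    return len(set(words)) == len(words) and len(differing_positions(words)) == 1


quintuple = ["blinder", "blander", "blonder", "blender", "blunder"]
print(is_minimal_tuple(quintuple))            # True
print(sorted(differing_positions(quintuple)))  # [2]
print([w[2] for w in quintuple])               # ['i', 'a', 'o', 'e', 'u']
```

The differing characters all sit at index 2 (the third letter), and all five are vowels, as advertised.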

Posted by Geoffrey K. Pullum at 11:42 PM

In Search of the Fimpossant

Yesterday my weekly expedition into the pages of The New Yorker (October 27, 2003), in search of specimens of that threatened species the first-mention possessive antecedent ("fimpossant" for short), was a spectacular success, with three healthy specimens sighted during a mere fifteen minutes of reading. A catch of this size might not come again; I fear that the editors of the magazine, faced with the results of my previous expeditions, will step up their wrong-headed campaign to expunge this graceful and useful creature from their domain. Any day now they will install an army of fimpossant checkers, to augment their celebrated army of fact checkers, and the fimpossant will be doomed.

I urge them to protect and treasure the fimpossant, but they inexplicably persist in viewing it as an invasive species from the wrong side of the literary tracks. Why, I ask you, would anyone object to the presence of the three beauties below in their neighborhood?

First, a rare instance of the Showy Fimpossant, at the very beginning of a piece of writing. It can be viewed on page 58, where Judith Thurman starts her "In Fashion" piece with

Elsa Schiaparelli's signature color was a violent magenta that she admired, she said, because it was...

Now, a hypercritical person might object that the two occurrences of she here have as their antecedent, not the Elsa Schiaparelli that appears in its possessive form at the beginning of the piece, but the Elsa Schiaparelli that appears in the subtitle of Thurman's article. But surely Thurman wrote the article before someone attached a title and subtitle to it, and anyway, if an occurrence of Elsa Schiaparelli in the subtitle were enough to establish Schiaparelli as the topic of the piece, then Thurman could have used a possessive pronoun at its beginning -- Her signature color was a violent magenta that she admired, she said, because it was... -- but that wouldn't have done at all.

Actually, all this fussy argumentation is beside the point, because the occurrence of Elsa Schiaparelli in the subtitle is itself in its possessive form:

Elsa Schiaparelli's seminal clothes are on display in Philadelphia

So much for the Showy Fimpossant. The other two examples are of the more inconspicuous subspecies the Presumed Fimpossant. You can find the first on page 41, towards the end of Ben McGrath's Talk of the Town report from Staten Island, on the ferry wreck:

By Thursday afternoon, as the investigation into the pilot's bizarre behavior continued--after the crash, he'd rushed home, slit his wrists, and shot himself with a pellet gun--a steely cynicism had returned with the daily grind.

This is the first, and only, mention of the pilot in the piece. But of course given a ferry we can infer a pilot, and in any case readers can be expected to know already that the ferry's pilot plays a significant role in the story, so there's no problem in referring to this pilot via a definite description, the pilot, in this case in its possessive form. And then once the pilot has been introduced into the story, McGrath can refer to him with pronouns, in particular the pronoun he.

Finally, on page 98, in John Lanchester's review of Jonathan Bate's life of poor (in several senses) John Clare, you can observe the following specimen:

... but there is no denying that Clare's struggles with poverty were all-encompassing and lifelong. He was born in 1793, the son of Parker Clare, who was a casual farm laborer in Northamptonshire, a pretty but unspectacular patch of countryside in the Midlands. The family's key asset was an apple tree outside their cottage which produced enough fruit to support them when Parker's rheumatism prohibited steady work.

This is the first, and only, mention of Clare's family, as a unit, in the piece. We presume that everyone has a family, and this presumption is reinforced by the mention of Clare's father, so there's no problem in referring to this family via a definite description, the family, once again, as it happens, in its possessive form. Once the family has been explicitly mentioned, Lanchester can refer to them with pronouns, in particular the pronoun them.

That's it. One gorgeous Showy Fimpossant and two specimens of the gentle Presumed Fimpossant. Why should anyone want to eliminate these useful and ornamental creatures from our environment?

Posted by Arnold Zwicky at 11:20 AM

Inveralent jumumble at the Guardian

A month after everyone else with an internet connection, Michael Johnson at the Guardian has noticed that an "ietm is going around ... citting new reserach from Cambirge Uvinersity" about the readability of scrambled words. Johnson hasn't spent the past month studying the subject in depth, since from the discussions by Language Hat and Uncle Jazzbeau (among others) he would have learned that this is a sort of Urban Legend, or perhaps a new kind of internet-mediated distributed amateur science. He would also have learned, as Matt Davis authoritatively explains, that "there's no-one in Cambridge UK who is currently doing research on this topic." Or at least there wasn't until the folks there started participating in the world-wide digital discussion.

In fact there is a story by John Crace in the education section of the Guardian on 10/21/2003 (the day before Johnson's commentary) that has the correct attribution -- the original research on this effect was apparently done by Graham Rawlinson in a 1976 PhD dissertation at Nottingham University. But on 10/22/2003, Johnson not only misses the attribution, he also gets the content wrong. If he'd poked around in the blogosphere a bit, or seen the previous day's piece before writing his own, Johnson would have learned enough not to write

"the eye deosn't need or evn want the whoole wrord. It noets the frist and last lettres, and fills in the rest by inrefence. You can even add or dorp lettres. The jumumble in btweeen is irrveralent."

He would have seen Matt Davis' example "the magltheuansr of a tageene ceacnr pintaet", which the eye doesn't read "vrey fast and quite misteriollusly" as "the manslaughter of a teenage cancer patient", or the extensive (and psycholinguistically interesting) commentary at Davis' site and elsewhere about why muddling middle letters sometimes works and sometimes doesn't.
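
The transformation at issue is easy to state precisely: keep each word's first and last letters and permute the letters in between. A minimal sketch in Python follows, with function names that are hypothetical and chosen only for this illustration. It checks that Davis' example obeys that rule, and that a form with an added letter, such as Johnson's "whoole", does not.

```python
import random


def scramble_middle(word, rng):
    """Shuffle the interior letters of a word, keeping the first and last letters in place."""
    if len(word) <= 3:
        return word  # zero or one interior letters: nothing to shuffle
    middle = list(word[1:-1])
    rng.shuffle(middle)
    return word[0] + "".join(middle) + word[-1]


def is_middle_scramble(original, scrambled):
    """True if scrambled keeps original's first and last letters and permutes the rest."""
    return (
        bool(original)
        and len(original) == len(scrambled)
        and original[0] == scrambled[0]
        and original[-1] == scrambled[-1]
        and sorted(original[1:-1]) == sorted(scrambled[1:-1])
    )


davis_pairs = [
    ("manslaughter", "magltheuansr"),
    ("teenage", "tageene"),
    ("cancer", "ceacnr"),
    ("patient", "pintaet"),
]
print(all(is_middle_scramble(o, s) for o, s in davis_pairs))  # True
print(is_middle_scramble("whole", "whoole"))                   # False: a letter was added

rng = random.Random(0)  # seeded only so that repeated runs give the same scramble
print(scramble_middle("linguistics", rng))
```

So the hard-to-read example follows the advertised rule to the letter, which is the point: obeying the rule does not guarantee readability.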

Instead, Johnson uses a half-understood version of the muddled-spelling business as a hook on which to hang a collection of linguistic observations: some harmless national stereotypes about "Eglinsh scools" vs. those in his native "Idniana"; the fact that "three is an accnet for evry neihbourhod in Lodnon, plus a few hunderd form upcuotnry"; some conventionally astonishing displays of ignorance from the "biusness execuvites" to whom he gives classes in "the use of the Eglinsh lagnuage"; and so on. All in all, the piece works well enough that it was sent to me by a friend from the English department, who probably wouldn't have noticed it if not for the muddled-spelling trick.

I know that it's churlish and pedantic to complain about an inadequately-researched joke, and in this case the truth would have made the joke harder to frame. Still, I feel that Johnson should have gotten his facts in order, especially because he complains about the decline in educational standards. His complaints are ironic, and thus ambiguous, but are irony and ambiguity an excuse for ignorance?

A lot of journalistic commentary is like this: a few scraps of false rumor, social stereotype and personal anecdote, eked out with enough conventional wisdom to fill the measure. In this "slleping" case at least, the informal network of weblogs and personal pages has had higher standards than the organs of conventional journalism, as well as faster reactions.

[Note: it occurs to me that Michael Johnson's piece may have been written in early September, and then languished for six weeks in some Guardian editor's inbox. If so, I owe Johnson an apology for suggesting that he missed the flurry of mid-September internet information, not to speak of the previous day's piece in the Guardian's education section; and Johnson owes a raspberry to the hypothetical Guardian editor in question.]

Posted by Mark Liberman at 08:13 AM

October 22, 2003

"What are you, French?"

Scene: earlier today on a Philadelphia playground. Dramatis Personae: three seven- or eight-year-old boys.

Boy A:  [kicks boy B]
Boy B:  Hey!
Boy C:  Kick him back! What are you, French?

I feel that this is quite unfair to the French government, which can't be accused of being inadequately vengeful, and even more unfair to the French people, who presumably hold the usual range of opinions on such things.

But at this point the meme is apparently unstoppable, and all we can do is watch as its linguistic consequences unfold.

Posted by Mark Liberman at 05:27 PM

IAEA yippy yippy yie

Seldom can there have been a phonetically more unfortunate set of initials than the one that the International Atomic Energy Agency (IAEA) is stuck with. The agency has been much in the news recently because of Iran's suspicious experiments in transuranic chemistry, forcing frequent efforts by newsreaders and radio hosts, reporters, and interviewees to read aloud the abbreviation. But at normal conversational speed its ghastly sequence of four diphthongized long vowels (in the nicely symmetrical but un-IPA Trager-Smith transcription, /ay ey iy ey/) sounds something like ah-ee-yay-ee-yee-yay-ee. Aaieeeee!!

If only they had thought ahead. They could easily have called the agency the Transnational Atomic Power Authorizing Service, and then we would have had an acronym for them, TAPAS. People should think about this sort of thing before they create organization names. This is the sort of issue that we are giving a lot of thought to here at the recently-formed Society for Terminological Uniformity, Practicality, and Informational Durability.

Posted by Geoffrey K. Pullum at 03:21 PM

Is bad writing really good?

The Oct. 24 issue of the Chronicle of Higher Education has Carlin Romano's entertaining review of a recently-published collection Just being difficult?, which "poses the question ... : When is bad writing not so bad, even if it's terrible?"

Just being difficult? is a response to Denis Dutton's Bad Writing Contest. It was written by some of the contest winners and their partisans.

Posted by Mark Liberman at 08:49 AM

October 21, 2003

Bleached conditionals

There is a special kind of conditional that does not appear to have conditional force at all; it is more like a coordination. Here is a nice one from The Economist last week:

If Eskimos have dozens of words for snow, Germans have as many for bureaucracy.

[The Economist, October 11th, 2003, p. 56, col. 2]

It is surely being presupposed here that EVERYONE knows Eskimos have dozens of words for snow. The alleged bureaucratic ubiquity of the Teutonic world is not really being treated as conditional on the truth of a contingent claim about those hardy Boreal nomads of the Arctic. The if scarcely means "if"; the sentence is essentially equivalent to this:

Eskimos have dozens of words for snow, as everyone knows. Well, Germans have just as many for bureaucracy.

I know, I know, those of you acquainted with my oeuvre are probably expecting me to rave on for a while about the continuing spread of the ridiculous journalistic conceit of drawing rhetorical conclusions from the frozen factoid that for some fascinatingly large number N, the Eskimos have N words for snow. It has been going on for fifty years, at increasing pace. Laura Martin has been inveighing against it since 1982, when she spoke to the American Anthropological Association about it as a kind of academic urban legend that anthropological linguists had been spreading. In 1986 (after four years of arm-wrestling with embarrassed anthropologist referees who would really have been happier if this did not come out) she finally published a short note on the topic in American Anthropologist.

Later I wrote a deliberately humorous article myself called "The great Eskimo vocabulary hoax", attempting to publicize Laura's work among linguists. My article has been published in five or six different places, including as the title essay of my 1991 book, which was written up in Newsweek and various other places, and has been drawn to the attention of journalists and editors. But it's clear to me that Laura and I are just wasting our time. People have written letters to The New York Times over and over again about this, quoting from their previous letters, and it makes no difference: the Times has repeated the Eskimo claim several times. Jane Brody alone has used it at least twice, citing a different number of snow words each time, and has ignored letters about the topic.

The truth about snow words in the Eskimo languages simply doesn't matter. If it did, I would carefully explain that there seem to be only a handful of roots that really are snow roots in the languages of the Yup'iks and Inuits, maybe four or five, not very different from the number found in English (snow, sleet, slush, blizzard). But it doesn't matter. All that matters to journalists is that they continue to have the snowbound simile in question at their disposal for constant use whenever a line or two needs to be filled up with linguistic babble.

But this is what makes the point I made about the conditional example above so clear. You are supposed to know that there are dozens of words for snow in a language called "Eskimo". (Sure, there is no such language, and you have never seen any data, but never mind, you are just supposed to know that it's true.) It's meant to be publicly known, in the common ground. That is what makes it so clear that the conditional sentence I cited is in fact bleached of its conditionality. The if P, instead of meaning "give [me the assumption that] P" (the etymological origin), means "given that P".

If you look around you will find many other such examples. In fact [added October 23, 2003, 9pm] Mark Liberman has pointed out to me that there is a semi-fixed journalistic form of words developing here: bleached conditionals beginning "If Eskimos have [YOUR CHOICE OF POSITIVE INTEGER HERE] words for snow..." are becoming ridiculously common. I have made available Mark's small corpus of these examples for your reading pleasure, ordered in ascending number of alleged words for snow. It is really rather frightening. Perhaps they are teaching this as a useful trope in journalism schools now.

Bleached conditionals probably tell us something about the semantics or the pragmatics of conditionals, though I have never been able to put my finger on exactly what.

Posted by Geoffrey K. Pullum at 07:41 PM

Grammaticality, anaphora, and all that

Can a sentence be ungrammatical in isolation, but grammatical in context? In an exchange of e-mail with me, Louis Menand suggests that this is the case for the now-famous example Toni Morrison's genius enables her to... (see earlier posts by Geoff Pullum and by me on this topic), and for examples I extracted from his book The Metaphysical Club. Trying to make sense of this proposal leads to some interesting observations about grammaticality and anaphora.

As background, consider a blast from the past, Jerry Morgan's example Spiro conjectures Ex-Lax. In isolation, this is certainly a puzzler. But in a context in which Ex-Lax can be understood as Spiro's conjectured answer to some question, there's no problem. Isn't this a case of a sentence that's ungrammatical in isolation, but grammatical in context?

Not really. It's certainly true that Spiro conjectures Ex-Lax is ungrammatical on an interpretation in which Ex-Lax is the direct object of the verb word conjectures; the verb CONJECTURE requires clausal, not simple NP, direct objects. But Spiro conjectures Ex-Lax is fine on the interpretation 'Spiro conjectures that the answer to some question is: Ex-Lax'. The problem is that there are (at least) two expressions Spiro conjectures Ex-Lax.

Linguistic expressions are not just form -- phonological content or marks on paper -- but form paired with meaning; they are signs. Unfortunately, the folk notions of sentence, word, phrase, etc. are purely formal, and even professional linguists often speak in ways that presuppose the folk notions. We say that the word pen, the phrases saw her duck and visiting relatives, and so on are "ambiguous", when, to be precise, we should be saying that there are several distinct words pen, several distinct phrases saw her duck, and so on. Normally, this doesn't get us into trouble. But on occasion it can be seriously misleading.

Now to return to Menand, who takes the position that examples like Einstein's discoveries made him famous are ungrammatical -- on the interpretation in which the pronoun has the possessive as its antecedent. Such examples are acceptable, Menand maintains, if the pronoun has an antecedent somewhere in the preceding context; Menand's own example Emerson's reaction, when Holmes showed him the essay, is choice... is "a solecism" if him is anaphoric to Emerson's, but acceptable if an antecedent for him is provided in the preceding context. And there are in fact several occurrences of Emerson not very far before this sentence.

This is an ingenious proposal -- I'll put aside, for the moment, some problems in rephrasing it in conjecturing-Ex-Lax terms -- and it even makes three empirical predictions: first, that if a possessive is the first mention of some entity, a subsequent pronoun cannot refer to that entity; second, that even if the possessive is not the first mention of the entity, if the most recent previous mention is far back in the text, a subsequent pronoun cannot refer to that entity (we can't expect readers to hold discourse referents in memory indefinitely); and third, that a possessive can't serve as the antecedent for a reflexive pronoun (even if there is a prior mention of the relevant entity, this can't serve as the antecedent for the reflexive, since reflexives require antecedents within their clauses). All three predictions are falsified -- the last one in the first fifty pages of Menand's book, so I'll start with that one.

On page 7 of The Metaphysical Club, we have a reference to the city of Boston, via the noun Boston, and, shortly thereafter:

...in a phrase that became the city's name for itself...

Here, itself refers to Boston, but the antecedent of the reflexive pronoun can't be the earlier occurrence of Boston, which isn't in the same clause as itself, and must be the city, which is; even a very recent occurrence of Boston isn't enough to license a reflexive if they're not clause-mates:

Given the history of Boston, our modern name for it/*itself seems odd.

The other two predictions are harder to test in The Metaphysical Club, since the book has a small cast of central characters who are referred to again and again throughout its pages; there are almost always prior mentions. You wouldn't expect historical writing, or novels for that matter, to be a fertile field for first-mention possessive antecedents. So let's turn to shorter forms -- for instance, Calvin Trillin's columns. A quick search of his essays collected in Too Soon to Tell (1995) nets three examples of first-mention possessive antecedents:

p. 87: ...the most recently transcribed tapes from Richard Nixon's Oval Office reveal him once more as a vindictive, unscrupulous paranoid.

p. 138: When I read that the young Ross Perot's stated reason for wanting a hardship discharge before he had fulfilled his naval obligation...

p. 231: A day or two after the Webers' son--Jeffrey, aged twenty-six--finally moved out of the house, they realized that they had lost the ability to tape.

All three examples are from the very first sentences of their essays; possessives are being used to introduce discourse referents.

And for an example that Menand would have to interpret as action at a considerable distance, there's this sentence on page 46:

Ivana Trump's books represent a departure only in that, because of the publicity machinery that has made her one of the last decade's premier examples of wretched excess, everyone knows she's not going to write them.

The last mention of Ivana Trump was a full page earlier, with significant digressions on other people and institutions in between. I hope that Menand will not want to claim that it is this distant occurrence of Ivana Trump, and not the possessive, that is the antecedent for the pronouns her and she.

In his e-mail to me, Menand describes possessive antecedents as "technically" solecisms, which I interpret as suggesting that only the most fastidious, meticulous, and scrupulous writers and editors would observe the proscription against them. Ordinary writers, ignorant of the rule and concerned only about communicating clearly, might get away with them, common usage might condone them, but those who truly care about grammatical correctness will avoid them; Menand tells me, in fact, that The New Yorker would not allow them. This is false.

In the October 20, 2003 issue of The New Yorker, there are at least two examples of first-mention possessive antecedents (there are also at least five examples of non-first-mention possessive antecedents, but let them pass):

p. 177 (David Denby on Pauline Kael): As late as 1980, I was capable of writing (in New York), about De Palma's "Dressed to Kill," "You can see that he's using film techniques and tricks to..."

p. 198 (Peter Schjeldahl on El Greco at the Met, first sentence): "We can define El Greco's work by saying that what he did well none did better, and that what he did badly none did worse," the Spanish painter and scholar Antonio Palomino wrote in 1724.

I fear that, faced with this evidence, Menand would find reasons to exclude every bit of it: comic writing, even by Calvin Trillin, is not held to the same high standards as serious writing; shifting between main text and quoted text is a special circumstance; The New Yorker can't be held responsible for the form of material it quotes (especially in translation); and, in any case, even the most cautious, including Menand himself, sometimes slip. I am so pessimistic because I've had the experience of confronting exponents of the Possessive Antecedent Proscription with the very remarkable facts that only people who've been taught the "rule" find anything wrong with examples like Einstein's discoveries made him famous and Harry's sister gave him a nice birthday present and that people who inveigh against possessive antecedents routinely use them in their own writing. I would think that these observations would give exponents of the PAP pause, would make them question the "rule" and the presumed authority that stands behind it. But, for the most part, no; they are Blinded by the Rules, to the point where they can no longer judge the grammaticality and effectiveness of expressions without reference to those rules.

But back to linguistics. It turns out to be tricky to translate Menand's previous-mention proposal -- pronouns can't have possessive antecedents but instead must have antecedents earlier in the context -- into a proposal about different interpretations of the same form. Take Einstein's discoveries made him famous. The previous-mention proposal says that him refers to Albert Einstein, but not because the pronoun has the possessive as its antecedent, but because it has some previous expression referring to Einstein as its antecedent. So, let's suppose that there are two interpretations that might be associated with Einstein's discoveries made him famous. One of these associations, in which him refers specifically to Einstein, is disallowed by the PAP; this is an ungrammatical sentence. What's the other, permissible, association?

It has to involve an open interpretation, in which him refers to some unspecified male referent, which the reader/hearer picks out from the context, by deixis, general knowledge, previous mention, or subsequent mention. On this interpretation -- for Menand, the only interpretation of the expression -- Albert Einstein is just one of an endless number of possible referents for the pronoun.

This is subtle, but coherent. Unfortunately, not very plausible. It's certainly true that Einstein's discoveries made him famous doesn't require that him refer to Einstein; consider Schaumberg was sent many of Einstein's results and published about them extensively. Einstein's discoveries made him famous. On the other hand, Einstein is a highly salient possible referent of the pronoun, entirely without any context; Einstein is the first person we think of, in fact. This effect is impossible to explain if, as PAP exponents would have it, the pronoun can't refer to Einstein (because, they believe, the expression Einstein's isn't referential and can't be). Ok, now, we're back to the raw power of authority, and I'm getting all pessimistic again.

Posted by Arnold Zwicky at 06:39 PM

The plastic fetters of grammar

Several times a day, when I walk over the patch of sidewalk inscribed with the picture below, I'm reminded of how far linguistic analysis has faded out of public consciousness.

In the world of Shakespeare and Descartes, or Jefferson and Franklin, the foundation of a liberal education was the "trivium" of grammar, rhetoric and logic. Today, only a small fraction of American college students have ever been taught anything about any of these subjects: the trivium has become non-trivial.

Disciplinary special pleading aside, the result is to blunt and coarsen public discourse on language in all its aspects, from style and usage to reading instruction and bilingualism. Americans haven't stopped talking about language, but few of us, on any side of any issue, know what we're talking about.

It's consoling to reflect that as analytic understanding of language has decreased, so have the negative emotions associated with educational force feeding.

If you're lucky enough to belong to an institution with a subscription to the literature online service, try searching English poetry for the word "grammar". Over many centuries, you'll find phrases like "grammar's servile fetters," and be told how an "insect dry discoursing gammer / tells what's not rhyme and what's not grammar." This passage from Beaumont's Psyche is typical in tone:

This forc'd through many tedious sweating Years
The patience of the earnest Student; who
Consumed with a thousand pallid Cares,
Amidst his painful Work could nothing do.
For to inrich his Tongue, his Brains he brake,
And aged grew e'r he had learn'd to speak.

Strange scrambling Alphabets this multiply'd,
And to an Art improv'd Necessity;
Each parted Tongue this did again divide
Into Eight several Stations, and by
Unworthy Grammar's busy Niceties
All generous Apprehensions exercise.

Yea Grammar too found all her Laws too weak
To govern Language's extravagance;
Such odd and unruly Idioms did kick
Against her setled Discipline, and prance
So wildly through Expression's fields, that Art
Was fain to play the child, and conne by heart.

In contrast, recent English-language poetry generally discusses linguistic analysis in neutral or even positive terms:

The plastic character of grammar  
  seems to deride
the lexical excesses  
  of botany.

    (Miles Champion, Transcendental Express, 1996)

I don't know what "the plastic character of grammar" is, but I think it's a step up from "servile fetters."

If linguistic analysis is now generally ignored, at least it's no longer generally hated. In this respect, its role has been taken over by mathematics :-).

Posted by Mark Liberman at 05:48 PM

October 20, 2003

Better Good Turing (?)

Check out this article in the Oct. 17 issue of Science, "Always Good Turing: Asymptotically Optimal Probability Estimation", by Alon Orlitsky, Narayana P. Santhanam, and Junan Zhang.

As Saul Gorn and Abraham Lincoln used to say, "people who like this sort of thing will find it just the sort of thing they like."

[Update: For those who don't like this sort of thing, or who don't know whether they do or don't, let me add that the paper deals with methods for estimating the probability of unseen events. An impressive but ill-informed remark by Chomsky on a related topic played a large role in the history of the field. There was earlier work on the subject by Laplace. Alan Turing and I.J. (Jack) Good made a classic contribution as part of WWII work on the Enigma decryption, which was published by Good in 1953 using an ecological "cover story": estimating the population frequency of species that have never been seen in a given amount of trapping.]
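
To give a flavor of the problem, here is a minimal sketch in Python of the simplest Good-Turing idea, applied to Good's trapping story. It is the basic textbook estimate, not the refined estimator of the Science paper: the total probability of all species not yet seen is estimated as N1/N, where N1 is the number of species seen exactly once and N is the total number of animals trapped. The function name and the sample data are made up for illustration.

```python
from collections import Counter


def good_turing_missing_mass(observations):
    """Estimate the total probability of all unseen types as N1 / N."""
    counts = Counter(observations)
    n_total = sum(counts.values())
    if n_total == 0:
        raise ValueError("need at least one observation")
    # N1: the number of types that were observed exactly once.
    n_singletons = sum(1 for c in counts.values() if c == 1)
    return n_singletons / n_total


# Hypothetical trapping record: 8 animals, of which 3 species were seen only once.
trapped = ["robin", "robin", "robin", "wren", "wren", "finch", "owl", "heron"]
print(good_turing_missing_mass(trapped))  # 0.375
```

The intuition is that species seen only once are the best available evidence about species not seen at all: the more singletons in the sample, the more probability should be held in reserve for novelties.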

Posted by Mark Liberman at 11:44 AM

October 19, 2003

Another informative tautology

Via versiontwo.org, a story about emergent bracelet semiotics among Florida middle schoolers.

The linguistic hook is a quote from a concerned parent: "If they need to ban these jelly bracelets, they need to ban them."

Lyn Walker's dissertation provides an enlightening discussion of why "informationally redundant utterances" are often communicatively useful, but doesn't have much to say about informative tautologies. See this earlier post for another example, also involving the logic of obligation and permission.

These things are all over -- google finds 811 hits for "if you must you must", and strings like "who like him like him" turn up examples of relative-clause tautologies.

I believe that Ward & Hirschberg's 1988 paper "The pragmatics of tautology" is relevant, but I don't have a copy at hand.

[Update (10/20/2003): It seems to me that research on such pragmatic curiosities has almost died out, as indicated by the fact that I couldn't find any web-accessible discussion of Ward & Hirschberg (1988) via google. No doubt if I looked harder, I could find something, but it must be relatively thin on the hyper-ground. What this means is that in post-www time -- since 1995 or so -- people haven't been thinking or writing much about topics that would lead them to discuss or cite this paper.

I hypothesize that this is connected to the death of AI, because (classical) AI researchers were the key market for this kind of research, even though much of the work was actually done by linguists.

On the whole I don't mourn AI's passing, but I do feel that the field of linguistics is poorer for devoting less attention to things like these informative tautologies. Like optical illusions for vision scientists, these puzzling bits of language tell us about how things work, if we can decode the message.]

Posted by Mark Liberman at 11:46 AM

October 18, 2003

To see Italics in a Grain of Sand...

I've gotten a few emails about the italics thread, and as often happens when you focus on a tiny point of usage, the responses open up some big issues.

Dan Swingley pointed out that the originally offending phrase "fewer annoying italics" would have been ambiguous if it had been rendered correctly. "Less annoying italics" could mean "italics that are less annoying" as well as "a smaller quantity of annoying italics". Dan suggested that this might have played a role in the writer's choice. Anti-prescriptivists often play the ambiguity-avoidance card as a defense against the charge of solecism in cases where usage is descriptively mixed, so this move is a classical one.

However, Dan's note raises the whole fraught issue of ambiguity-avoidance in language use and language change. Like the idea that Eskimos must have many words for snow, the idea that speakers and speech-communities should avoid ambiguity is an amiable one. The trouble is, the facts in many (most?) cases suggest that people act, singly and in groups, as if they could care less. At least, that's my reading of the literature. If that's true, it raises some interesting questions about "theory of mind" reasoning in human communication.

Shifting focus to another small aspect of the original not-large-to-start-with point, one Jonathan Wright (whom I don't otherwise know) writes that 'As a Briton, I'm inclined to disagree with you on how unacceptable it is to say or write "fewer italics". . . Maybe you should check the national origin of those various usages. You might find that "fewer italics" is more frequent in England.' Another classic anti-prescriptivist move, appealing to dialect variation as a defense.

Jonathan didn't provide any evidence for his conjecture, and the intuitions of a handful of local informants of various national origins fail to support it (not that this means much). Google doesn't allow me to restrict searches to sites in the United Kingdom (much less England), and I'll leave it to others to check all the google hits by hand. The British National Corpus has 187 instances of "italics" but none of either "fewer italics" or "less italics", so it's useless in this case.

Though I'm skeptical of Jonathan's particular suggestion, it raises (or at least resonates with) a fundamental issue. "Less/fewer italics/politics/physics" is a case of apparently gradient grammaticality, supported by converging evidence from intuitions, statistical usage patterns and (perhaps some day) psycholinguistic experiments. Suggestions for how to model this kind of situation can be divided on several dimensions. One recently-debated question is whether grammars "play dice." That is, should models of linguistic patterns be treated as intrinsically stochastic? Or should variable data be modeled as a mixture of non-stochastic grammars (whether mixed in a population of speakers or mixed in the head of a single speaker)?

This may remind some outsiders of the famous theological debate over an iota -- whether the Father and Son have homoousion "same substance" or homoiousion "similar substance" -- but (without prejudice to the theology) the linguistic question really matters! The issue is foundational: what should a model of language structure look like?

The "less/fewer italics" business engages this general "stochastic grammars vs. mixtures of grammars" question, and (I believe) supports my own view, which is that models of language structure should be intrinsically stochastic.

I'll make this argument in a future post, if the creeks don't rise. For now, I just want to point out how each of these two quibbles about a quibble helps us

To see a World in a Grain of Sand
And a Heaven in a Wild Flower

or at least to see the nature of language in a quantifier choice...

NOTE: I certainly don't want to accuse my friend and colleague Geoff Pullum of quibbling -- his original post aimed to make an important general point about the need for careful grammatical analysis in evaluating questions of usage.

Also, when I cast him in the role of prescriptivist, I have to insist that I mean it in the nicest possible way, as Geoff has recently mentioned to me that

[a] libertarian who calls me a prescriptivist is a libertarian who is going to be asked to step outside in the back alley for a few minutes of profound unpleasantness, most of which he will spend lying on the ground by the dumpster.

Posted by Mark Liberman at 02:25 PM

October 16, 2003

Google psycholinguistics

Never mind reaction-time measurements, we can do psycholinguistics with google :-).

I'm following up on the conjecture that "fewer politics" is (psychologically) more wrong than "fewer italics." Both are wrong because in standard usage, it should be "less politics" and "less italics". However, "italics" is somehow closer to being the plural of a count noun than "politics" is, perhaps because one can think of "italics" as referring to the individual italicized letters.

Google gives these counts:

                  raw count   (corrected count)
fewer italics     12          (11)
less italics      38
fewer politics    59          (53)
less politics     3,140

The raw string counts are not necessarily right: "fewer politics" could be from a phrase like "fewer politics courses" instead of "fewer politics and better teamwork." So I checked the 12 hits for "fewer italics" and the 59 hits for "fewer politics" -- corrected totals are 11 for "fewer italics" and 53 for "fewer politics". This correction only strengthens my point, so I'm going to ignore it for now.


According to google, "italics" is almost 5 times more likely to be modified by "fewer" than "politics" is -- (12/1840000)/(59/43600000) = 4.82.

And "fewer italics" is used about 1/3 as often as "less italics" (38/12 = 3.17), while "fewer politics" is used only about 1/50th as often as "less politics" (3,140/59 = 53.2).

Q.E.D. "Fewer italics" is less ungrammatical -- as a matter of common usage -- than "fewer politics" is.
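For anyone who wants to redo the arithmetic, here is a small Python sketch. The phrase counts are the raw google counts quoted in this post; the two totals are the denominators from the ratio given above, which I take to be the hit counts for the bare words "italics" and "politics". The variable names are just for illustration.

```python
# Raw google hit counts quoted in the post.
counts = {
    "fewer italics": 12,
    "less italics": 38,
    "fewer politics": 59,
    "less politics": 3140,
}

# Denominators from the post's ratio (assumed to be hits for the bare words).
totals = {"italics": 1_840_000, "politics": 43_600_000}

# How much more often "fewer" modifies "italics" than "politics",
# relative to how common each noun is overall.
fewer_rate_italics = counts["fewer italics"] / totals["italics"]
fewer_rate_politics = counts["fewer politics"] / totals["politics"]
relative_rate = fewer_rate_italics / fewer_rate_politics

# How strongly each noun prefers "less" over "fewer".
less_per_fewer_italics = counts["less italics"] / counts["fewer italics"]
less_per_fewer_politics = counts["less politics"] / counts["fewer politics"]

print(round(relative_rate, 2))
print(round(less_per_fewer_italics, 2))
print(round(less_per_fewer_politics, 2))
```

Swapping in the hand-corrected counts (11 and 53) for the two "fewer" entries shows how much the correction moves the first ratio.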

I'm sure that a similar exercise would show that "fewer physics" is wronger than either of these. This would be a little more work, because phrases like "fewer physics courses" are very common, so one would have to create and use corrected totals.

[Note: I'm not taking a position on the question of whether the grammatical feature "is a count noun" should be replaced by some gradient property of "countiness." For what it's worth, I tend to think that this would be a mistake. My point is just that "italics" seems more like a count noun than "politics" does, to me and also (says google) to the average English-language web document writer.]

[Update: Bill Labov points out that google finds 52 instances of "less polemics" to only 3 instances of "fewer polemics". Go figure...]

Posted by Mark Liberman at 10:10 AM

Language Rights Lose in Nebraska

The AP reported this week (e.g. in the New York Times, 10/15/03) that Sarpy County, Neb., Judge Ronald E. Reagan ordered a father to speak to his 5-year-old daughter in English or else lose his visitation rights. What's next? Will divorced parents be required to speak in a particular dialect (say, Standard English) as a condition of visitation with their children, for fear that the children might otherwise say (gasp) "ain't" occasionally?

In the opinion of this judge, and of a great many other Americans too, First Amendment rights don't extend to use of a language other than English -- in spite of the fact that the United States, perhaps uniquely among the world's nations, in fact has no official language. Federal courts have struck down some of the more extreme "English Only" laws passed by state legislatures as being unconstitutional; possibly this Nebraska judge's ruling will meet the same fate.
But aside from constitutional issues, the judge's notion of language learning is notable (if not unusual) in its ignorance of children's capacities: a five-year-old child exposed in daily life to two languages -- ANY two languages -- will learn both languages with equal facility and without significant delays in acquisition of either language. So the judge's ruling is not only cruel; it is pointless. The child's welfare will be unaffected, except of course that she will miss a valuable opportunity to exercise her mind and enhance her humanity by learning a second language.

Posted by Sally Thomason at 08:02 AM

Fewer physics, fewer politics, fewer italics

Geoff Pullum has pointed out a grammatical mistake in the Economist: "fewer annoying italics". I agree with his judgment and his reasoning, but I can also easily imagine making this mistake myself. There is a sort of continuum of perceptual plurality here, I think.

Consider these:

  1. We need fewer italics in this article.
  2. We need fewer politics in this field.
  3. We need fewer physics in this curriculum.

When I imagine hearing or reading these, my reactions are:

  1. seems uncertain, is wrong on rational reflection
  2. seems wrong, is wrong on rational reflection
  3. clearly wrong, is also wrong on rational reflection

I bet that a reaction-time study would show that other English speakers share this ordering.

This is an example of the gradient judgments for which Haj Ross coined the term "squish" in a 1971 article. I just checked google and found only two hits for "Haj Ross squish". Two! Finding any combination of three random terms with such a small number of hits on google is not trivial: for example, "sam gundy mullah" (which I made up as random first name, random last name, random low frequency word) yields THREE. How the mighty have fallen!

Phenomena like the physics << politics << italics continuum are also relevant to the Great Morphophonemics Debate between Steve Pinker and Mark Seidenberg, about which more later.

Posted by Mark Liberman at 07:43 AM

October 15, 2003

Italics and stuff

At the end of a recent book review in The Economist [October 4th, 2003, p.81] I read that the book under review “could have done with less sociological jargon and fewer annoying italics.”

Fewer what? That doesn't sound quite right. What has gone wrong? I think I see what might have happened in the editing process here, and it makes an interesting illustration of the way you need quite a sophisticated understanding of grammar just to apply the standard prescriptive rules.

It is a long-established prescriptive rule of English that you use fewer with count nouns and less with non-count: less tea, but fewer tea bags. It is regarded as a solecism to say We have less tea bags than I thought. It is a reasonable enough distinction for people to want to maintain, it really does prevent ambiguities in some cases, and editors tend to enforce it fairly tightly. I think an editor tried to enforce it at The Economist the night before October 4th.

But of course, to enforce it you have to be able to distinguish count from non-count nouns. Non-count nouns generally don't appear in the plural (it's true that teas can occur, but it always means `cups of tea' or `kinds of tea' or something like that -- it has to refer to some things, not just to some stuff, which means it has to take on a use as a count noun). So it might seem that you could rely on the principle that less should be corrected to fewer when followed by a plural noun. And I suspect that an editor at The Economist made this understandable error.

But italics is not like antics or critics, which are count plurals; it is one of those morphological plurals in -ics, like politics and linguistics, that function as non-count singulars (for a thorough discussion of these words, see The Cambridge Grammar of the English Language, p. 347). You don't talk about wishing there were “fewer politics” around the office; you say less politics, just as you would say less hostility or less backstabbing, because politics is conceived of as stuff, not as things. And material in italic typeface is too.

The test for a count noun is simply to try the word with numbers (see The Cambridge Grammar, p. 334): We talk of material being in italics, but we don't say *one italic, or *two italics. The word italics is a plural ending in -s morphologically, but it doesn't have a singular and it isn't syntactically the plural of a count noun.

So the less/fewer distinction is not relevant here at all. To say less sociological jargon and less annoying italics would have been fully grammatical, and in fact that could be reduced to less sociological jargon and annoying italics, saving a repeated word.

It all goes to show that if you're going to apply prescriptive rules, you really can't do it blindly or automatically. Automatism is for the lower animals. The lesson here is that you actually need to have a pretty good control of descriptive grammar before you can intelligently engage in prescriptive grammar.

Posted by Geoffrey K. Pullum at 11:38 PM

Founding fathers of the Amerind debate

Joseph Greenberg's 1987 book lumping nearly all the indigenous languages of the New World into one super-family called Amerind has engendered great popular interest as well as seemingly endless controversy.

Far be it from me to fan the embers into flame. All I want to do is to point out that the controversy has deep roots, with Penn's Dr. Benjamin Barton as the original Greenbergian lumper, while the skeptical splitter viewpoint was championed by none other than Thomas Jefferson.

In Notes on the State of Virginia (written 1781-82), Jefferson wrote

How many ages have elapsed since the English, Dutch, the Germans, the Swiss, the Norwegians, Danes and Swedes have separated from their common stock? Yet how many more must elapse before the proofs of their common origin, which exist in their several languages, will disappear? It is to be lamented then . . . that we have suffered so many of the Indian tribes already to extinguish, without our having previously collected and deposited in the records of literature, the general rudiments at least of the languages they spoke. Were vocabularies formed of all the languages spoken in North and South America, preserving their appellations of the most common objects in nature, of those which must be present to every nation barbarous or civilised, with the inflections of their nouns and verbs, their principles of regimen and concord, and these deposited in all the public libraries, it would furnish opportunities to those skilled in the languages of the old world to compare them with these, now or at a future time, and hence to construct the best evidence of the derivation of this part of the human race.

He based his conclusions on careful examination of all the vocabularies he could collect, not only for New World languages but also (for example) from Peter the Great's Siberian expeditions.

One of his many correspondents on this topic was Benjamin Smith Barton M.D., Professor of Materia Medica, Natural History and Botany in the University of Pennsylvania. In his 1798 book New Views of the Origin of the Tribes and Nations of America, Barton wrote that

By a careful inspection of the vocabularies, the reader will find no difficulty in discovering that in Asia the languages of the . . . tribes of the Delaware-stock may be all traced to ONE COMMON SOURCE. Nor do I limit this observation to the languages of the American tribes just mentioned . . . HITHERTO, WE HAVE NOT DISCOVERED IN AMERICA. . . ANY TWO, OR MORE LANGUAGES BETWEEN WHICH WE ARE INCAPABLE OF DETECTING AFFINITIES (AND THOSE VERY OFTEN STRIKING) EITHER IN AMERICAN, OR IN THE OLD WORLD. [emphasis original]

Barton went on to assert that "[m]y inquiries seem to render it probable, that all the languages of the countries of America may . . . be traced to one or two great stocks. . ."

Jefferson disagreed (from Notes on the State of Virginia):

. . . imperfect as is our knowledge of the tongues spoken in America, it suffices to discover the following remarkable fact. Arranging them under the radical ones to which they may be palpably traced, and doing the same by those of the red men of Asia, there will be found probably twenty in America, for one in Asia, of those radical languages, so called because, if they were ever the same, they have lost all resemblance to one another. A separation into dialects may be the work of a few ages only, but for two dialects to recede from one another till they have lost all vestiges of their common origin, must require an immense course of time; perhaps not less than many people give to the age of the earth. A greater number of those radical changes of language having taken place among the red men of America, proves them of greater antiquity than those of Asia.

Later on, he considered a sociolinguistic explanation. Having heard that some Indians considered it dishonorable to use any language but their own, he suggested that when a part of a tribe separated itself, the seceded group might refuse to use the original language and invent their own (ms. notes circa 1800):

Perhaps this hypothesis presents less difficulty than that of so many radically distinct languages preserved by such handfuls of men from an antiquity so remote that no data we possess will enable us to calculate it.

Plus ça change . . .

[Update 10/16/2003: In Good Bye! magazine's obituary for Joseph Greenberg, the anonymous author writes that "[t]he splitters of linguistics have this problem: they're just not as interesting as the lumpers."

This is clearly true in today's popular press, and it tends to be true in interdisciplinary research, for similar reasons. It was probably also true in the marketplace of ideas in late 18th century America. However, I for one find that Jefferson's explorations of this problem are much more interesting to read than Barton's. This is partly because Jefferson was smarter, and writes better -- Mozart to Barton's Salieri -- but perhaps it's also because his few paragraphs on the subject show a keen mind inquiring after the truth, rather than a mere enthusiast piling up evidence.]

Posted by Mark Liberman at 04:21 PM

As If We Needed Further Proof of The Importance of Context in Ambiguity Resolution

"Arafat's Premier Says He Is Close to Resigning Post," ran the headline on the upper left of the NY Times front page on Oct. 13. Possessive Antecedent Principle or no, my initial reading of that had the pronoun he referring to Arafat. But no, I thought, even before I turned to the body of the story -- in that case they'd surely have run it on the other side of the page.

Posted by Geoff Nunberg at 01:42 PM

October 12, 2003

Flash: Gopnik inverts a tag!

In the 2003-10-13 issue of the New Yorker, Adam Gopnik writes

"LAURA BRAVES WEASEL KISS!" ran the headline.

which is a counter-example to this weblog's speculation about the magazine's anti-inversion policy.

Or was the copy editor for Gopnik's comment asleep at the switch?

[Update 10/14/2003: Hersh does it too! Scanning some older New Yorker articles for sequences like /", said/ yields several examples of quotative inversion, such as this one written by Seymour Hersh:

"With the exception of exchange of fire over the Shebaa Farms"--a disputed area on the Lebanese border--"it's been quiet since the Israeli evacuation in 2000," said Richard W. Murphy, a senior fellow at the Council on Foreign Relations, who served as Ambassador to Syria in the nineteen-seventies.

So I think we need to apologize to the New Yorker's copy editors for suggesting that they were responsible for creating awkwardness. At worst, it seems, they didn't fix it.]

Posted by Mark Liberman at 10:38 AM

October 11, 2003

Emergence of birdsong phonology

It's like watching life emerge from the primordial ooze. Ofer Tchernichovski at CCNY has produced some amazing animations of the dynamics of birdsong development.

Ofer studies song learning in zebra finches -- here is some background information, and here is a 2001 Science paper by Ofer and others (requires a subscription).

Ofer raises the birds in a controlled environment -- made from beer coolers! -- so that he can specify everything that they hear, and record every sound they make, over a period of months.

The recorded songs are automatically detected and segmented into "syllables" with fairly high accuracy, and the individual syllables are automatically characterized in terms of 12 acoustic properties like duration, average pitch, amount of frequency modulation, and so on. The result is a sequence of many millions of feature vectors for each bird, representing its entire vocal output over the time when it is learning and mastering its song.

At the end, the bird's song is a sequence of complex "syllables," which are individually very different from one another in repeatable ways, and are produced in a stereotyped (though not invariant) order. In the beginning, the bird's "proto-song" is a series of proto-syllables that seem to have less individual structure, and are more variable in their properties and less well differentiated from one another.

Each frame of one of Ofer's movies shows a few thousand syllables from a given bird, plotted on a couple of dimensions such as duration and amount of frequency modulation. These dimensions don't give a complete picture of a syllable's sound by any means, but they express some of its properties that seem to be important. The sequence of frames in the movie follows the progression of time over the 100 days or so that it takes the bird to master its song completely.

The picture below shows four frames from the beginning, middle and end of such a movie, arranged left-to-right and top-to-bottom. The blue bar on the left of each frame shows the progression through 100 days of the bird's life, starting before first exposure to an adult song model, and ending after full mastery of the song. The colors in the scatter plots are automatically assigned by a clustering algorithm.
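To make the "feature vectors plus clustering" idea concrete, here is a toy Python sketch. I don't know which clustering algorithm Ofer's software uses, so plain k-means stands in for it here; the syllable measurements, the choice of two dimensions, and the starting centroids are all invented for illustration, and a real analysis would scale the features first.

```python
import math


def kmeans(points, centroids, iterations=20):
    """Plain k-means on 2-D points, starting from the given centroids."""
    centroids = [tuple(c) for c in centroids]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each syllable goes to its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for k, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                mean = tuple(sum(axis) / len(members) for axis in zip(*members))
                new_centroids.append(mean)
            else:
                new_centroids.append(old)  # leave an empty cluster where it was
        if new_centroids == centroids:
            break  # nothing moved, so the clustering has converged
        centroids = new_centroids
    return labels, centroids


# Made-up syllables: (duration in ms, amount of frequency modulation).
syllables = [(50, 10), (55, 12), (60, 11), (200, 40), (210, 42), (190, 38)]
labels, centroids = kmeans(syllables, centroids=[syllables[0], syllables[3]])
print(labels)
print(centroids)
```

Each label plays the role of a color in one frame of the scatter plot: syllables with the same label are tokens of the same type.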

In some sense, we're seeing symbols emerge from signals. The syllable clusters are not "symbols" in the strong sense of "signs with meanings"; but they are symbols in the weaker sense, a finite set of well-differentiated types to which behavioral tokens belong. The fully-developed song then functions for some purposes as a sequence of syllable types, independent of the variable details of their performance. The zebra finch audience probably responds to virtuosity and perhaps to other aspects of performance variation, but birdsong (like human speech) develops "phonological" categories that clearly play a central role in organizing the behavior.

Why? In human spoken language, Hockett's "duality of patterning" is a plausible evolutionary motivation: if you want to have a vocabulary of a hundred thousand items with good transmission fidelity, the coding had better be digital. But these birds don't connect particular "syllable" sequences with particular meanings, and in fact a zebra finch only has one song, whose meaning Ofer glosses as "I love you" when it is directed to a female, and "get lost" when it is directed to another male. Does their system "go digital" just as a side-effect of the need to create signals that are impressively complex from the perspective of other birds?

The movie is much more compelling than the excised frames; I'll ask Ofer if he can post it somewhere.

Posted by Mark Liberman at 08:24 PM

The conventions for expressive content words

Geoff Nunberg recently commented on the DC District Court's surprising decision to permit the Washington Redskins to retain their current name. The decision is surprising because it is so clearly opposed to the established conventions for using and understanding epithets and other expressive content words. In general, such words have the property that their interpretation on a given occasion of use is out of the speaker's control. It rests instead with the audience.

It is quite common to find instances in which some hapless public figure has forgotten this convention and tried to use an expressive content item in a new way. Consider, for instance, the report in the Las Vegas Review-Journal (July 27, 2000) titled Garcia's epithet creates outrage, which opens with 'The new superintendent of Clark County says his use of a racial slur was not meant to be offensive'. According to the report, Garcia said the following during a speech intended to "make his stand against racism clear":
"Niggers come in all colors. To me, a nigger is someone who doesn't respect themselves or others."
Garcia's attempt to redefine nigger on the fly failed miserably. His audience refused to budge on the word's usual interpretation. The passage is worth considering alongside something like, "Artists come in many forms. To me, an artist is anyone who can eat fifty eggs in one sitting". This redefinition of artist is decidedly nonstandard, but an audience is likely to accept the special usage.

In one famous incident, the audience's interpretation held sway even when that interpretation was agreed to be basically incorrect by all involved. In 1999, a Washington D.C. mayoral aide resigned after using the word niggardly. The aide himself told the Washington Post, "Although the word, which is defined as miserly, does not have any racial connotations, I realize that staff members present were offended by the word." Niggardly has neither historical nor semantic links with any racial epithet. Yet the fact that some speakers were offended sufficed to generate controversy.

Why are the meanings of expressive content items basically out of their users' control? The answer probably lies in the fact that they are a kind of performative word. Performatives permit speakers to accomplish certain acts merely by uttering them. The verb promise is a typical example: uttering I promise to take out the trash just is the act of promising to take out the trash. Similarly, in uttering the word nigger, Garcia expressed an extreme form of disapprobation. The damage was done even before he had reached the verb phrase offering his redefinition. The flap over niggardly shows that even origin and meaning can be beside the point if the word's sound pattern has certain properties.

Thus, it is surprising that lawyers arguing against the name Redskins did not win their case merely by presenting evidence that redskin is likely to be interpreted by a large segment of the audience as offensive. The court's assumption seems to have been that every possible use of a word must be offensive in order to make it an inappropriate brand name. But this just isn't how the conventions of language work.

Posted by Christopher Potts at 07:51 PM

Political corpus linguistics

Josh Marshall adds to the small list of interesting examples of political corpus linguistics.

"Modifiergate" was an earlier example, in which Geoff Nunberg challenged claims about media bias by counting the labels used in newspaper mentions of prominent liberal and conservative public figures and organizations.

This sort of thing is likely to be a growth industry, given nexis, google and weblogs, but I haven't seen much evidence that academic social scientists have caught on to the opportunity yet. Will they pick this up only after it becomes routine in non-academic circles, or am I just ignorant of an underground trend?

Posted by Mark Liberman at 11:46 AM

October 10, 2003

"Too much of a coincidence to be a coincidence"

John Street is the mayor of Philadelphia, in the middle of a hotly contested election campaign, and the past few days have been difficult for him. First a bug was found in his office, and it turned out to have been planted by the FBI. Then the FBI confiscated his Blackberry. Then the feds raided the homes and offices of several of his supporters and associates.

Today's Philadelphia Inquirer quotes him as saying

"In the true spirit of candor, there are some people, particularly in the African American community, who believe that this is too much of a coincidence to be a coincidence."

This sentence makes perfect sense (though I suspect that it would have made Slate's "Bushism of the day" if George Bush had said it).

At first I thought that the mayor's phrase trades on two different senses of "coincidence". But (our local on-line version of) the American Heritage dictionary defines "coincidence" as a sequence of events that although accidental seems to have been planned or arranged. On this meaning, as something becomes more and more of a coincidence (because it seems more and more planned and arranged), it paradoxically becomes less and less of a coincidence (because it is less and less likely to be accidental). More simply, the mayor is saying that the timing of his troubles seems too planned to be an accident.

Thus the two uses of "coincidence" in Street's sentence seem not to have different senses, but rather to emphasize different aspects of the same sense.

Posted by Mark Liberman at 10:35 PM

Quoi ce-qu'elle a parlé about?

According to Ruth King's plenary address at NWAVE32, preposition-stranding infiltrated Prince Edward Island French in a shipment of infected lexical borrowings.

King started from the fact that sentences like

Le gars que je te parle de ...
the guy that I you talk of
"the guy I'm talking to you about ..."

Quelle heure qu'il a arrivé à?
what time that he has arrived at
"what time did he arrive?"

which are unthinkable in standard French, are normal and common in some (but not all) varieties of Canadian French.

This looks like a case of borrowing a syntactic pattern, but King argues that the effect is indirect. According to her analysis, Prince Edward Island French borrowed a bunch of English prepositions, which carried with them the ability to be "stranded" in questions and relative clauses -- as in the title of this entry, which is modeled on one of King's examples. This "strandability" then spread to native prepositions in a second step.

As I understood her talk, King's main argument for this view is a correlation between preposition-borrowing and preposition-stranding among different geographical variants of Canadian French. I believe that the details of this argument are presented in her recent book The Lexical Basis of Grammatical Borrowing, which I haven't read. From the evidence she presented in her talk, I gather that the number of varieties for which this correlation has been checked is fairly small -- perhaps half a dozen, of which two show both preposition-stranding and preposition-borrowing, while the others show neither trait.

King proposes that all cases of grammatical borrowing in language contact situations are similarly mediated by borrowed words. The general claim is very interesting, though I find it hard to believe. The particular claim about Canadian French preposition stranding is also a fascinating one, but it raises some questions for me. Is strandability really a property of prepositions? Could individual prepositions in some language be strandable or not? How can one arrange this without also allowing individual verbs to choose whether or not their objects can be moved (e.g. questioned or relativized)?

On a more descriptive (and entertaining) note, King pointed out that PEI French has borrowed not only English prepositions, but also many verb-particle combinations. Her handout gave a long list of borrowed verb-preposition combinations, including these:

bailer out       ganger up       puller through
bosser around    grower up       setter up
chickener out    hanger around   shipper out
se dresser up    kicker out      singler out
fooler around    layer off       slower down

These examples (along with dozens of others) were found in the transcripts of hundreds of hours of interviews, in which all participants, including the interviewers, were native speakers of the local version of French.

[Update 4/23/2006: Benoit Essiambre writes

Let me start out by apologising for this late comment just after the publication of the first Language Log book. Going through the table of contents and stumbling upon the title "Quoi ce-qu'elle a parlé about?", I was pleasantly surprised by this perfect example sentence of my hometown language, le Chiac. I went on to read the online post (I don't have a copy of the book yet) which was a great illustration of the idiosyncrasies of Chiac, however I was shocked that the language I grew up with was attributed to Prince Edward Island when everyone in the maritime provinces know that it is mainly spoken around the New-Brunswick city of Moncton. Of course, it is quite probable that small P.E.I communities use it as they are situated just a little more than an hour away from Moncton, however I can assure you that it's not its focal point.]


Posted by Mark Liberman at 07:32 AM

October 09, 2003

A parrot after my own heart

Last Sunday, I spent a beautiful fall afternoon walking around Valley Forge National Historical Park with my seven-year-old son Mac and Dick Oehrle, whom I've known since we were undergraduates together.

Dick's daughter once worked as a research assistant for Irene Pepperberg at the University of Arizona. Dick relayed this story about the language skills of Alex the African Grey Parrot.

It seems that Cheerios cereal was a favorite treat among the parrots in the lab. At a certain point, someone went to a new local health food store, and brought back some healthy organic O-shaped whole-grain cereal. Alex tried a mouthful, spit it out, looked at the provider, and said, very distinctly:


Of course, what you're reading is my re-telling of Dick's re-telling of his daughter's story, which itself might have been second hand. But still.

[10/16/2003: More about Irene Pepperberg and Alex is here.]

Posted by Mark Liberman at 06:05 PM

October 08, 2003

Louis Menand's pronouns

Geoff Pullum has posted about Louis Menand's swipe at possessive NPs serving as antecedents for personal pronouns -- what I've called the Possessive Antecedent Proscription (PAP) in a series of postings to the American Dialect Society since May -- and now it's time to check on Menand's own practice. I've done the obvious thing and pulled out my copy of his book The Metaphysical Club and started looking for violations of the PAP.

[What follows is from a posting of 10/7/03 to the ADS list]

Menand seems to be much given to this useful construction, despite labeling it a "solecism" in his New Yorker review. Here are the first six examples I found; they take us through page 38 of this book of 445 pages of text (many of which have extended quotations from the people he's writing about; I didn't look at these).

All these examples have subject or object pronouns (set off by underlining). Examples with possessive pronouns are everywhere, but many handbooks exempt them from the PAP, so I ignored them.

The first one is an example of a type I hadn't considered before, with a reflexive pronoun rather than a plain definite pronoun -- but I can't see why the PAP shouldn't cover these in the same way as the others.

  1. p. 7: ...in a phrase that became the city's name for itself...
  2. p. 7: Dr. Holmes's views on political issues therefore tended to be reflexive: he took his cues from his own instincts...
  3. p. 25: Emerson's reaction, when Holmes showed him the essay, is choice...
  4. p. 28: Brown's apotheosis marked the final stage in the radicalization of Northern opinion. He became, for many Americans,...
  5. p. 31: Wendell Holmes's riot control skills were not tested. Still he had, at the highest point of prewar contention...
  6. p. 38: Holmes's account of his first wound was written, probably two years after the battle in which it occurred, in a diary he kept during the war.

There's really no point in pursuing this further. There are probably close to a hundred examples in the book.

Further comment: Menand writes well, and these sentences do nothing to tarnish his reputation. There's nothing wrong with them.

I've sent copies of my ADS postings about him to Menand. I hope he won't conclude that he should now be trying to avoid possessive antecedents! This is a possible response: when a colleague posted to the newsgroup sci.lang that possessive antecedents were just ungrammatical, and I mailed him an example from his own writing, he was inclined to think that he should just be more vigilant.

Posted by Arnold Zwicky at 11:30 AM

October 07, 2003

Fenimore Cooper, Call Your Office

The decision of the DC District Court to reinstate the trademark of the Washington Redskins was annoying, to put it mildly. I served (pro bono, for what it's worth) as the expert witness for the Indians who brought the petition to cancel the mark before the Trademark Trial and Appeal Board, which ruled in 1999 that the mark was improperly registered back in 1965, since the Lanham Act forbids the registration of marks that are "disparaging."

But I can't say that the District Court's decision surprised me. I know judges are always spinning the factual background of a case to support the argument they want to make. But when the facts concern an anti-trust matter, say, they're obliged to make at least a pretense of knowing what they're talking about. Whereas when it comes to language, all bets are off.

In making our case, we put together what I think was a pretty strong portfolio of evidence to support the claim that redskin was a disparaging term when the mark was originally registered and remained so afterward. We had print citations for the word going back to the nineteenth century, like a passage from the 1910 edition of the Encyclopedia Britannica that described the word as not being "in good repute."

We showed that the modern press uses redskin in reference to Indians only as an example of a racial epithet or in campy references to old movies -- you don't find newspaper articles that say "Redskin Jay Silverheels was honored last night." We made a compilation video of clips that documented the disparaging use of redskin in movie Westerns, like the scene from the 1956 film Mohawk that had a character identified as an "Indian hater" referring to "dirty, mean, ignorant, slinkin' redskin skunks." And a survey showed that a substantial proportion of Native Americans find the word objectionable today.

But District Court Judge Colleen Kollar-Kotelly ruled that the evidence on the disparaging status of the word was inconclusive. Her arguments betrayed the mix of ignorance and illogicality that is depressingly common when courts stray into linguistic territory. Some examples:

The fact that a "not insignificant number of Americans have understood 'redskin(s)' to be an offensive reference to Native Americans," has nothing to do with whether Native Americans, themselves, consider the term "offensive," which would obviously be more probative or relevant.

Right. Even if non-Indians use redskin in a disparaging way, that doesn't mean that Indians don't take the label as a compliment. A rare people indeed, who don't care what others think of them -- you wonder why they filed the petition at all.

[T]he dictionary evidence only states that the term 'redskin(s)' is 'often offensive,' which, as Pro-Football observes, means that in certain contexts the term 'redskin(s)' was not considered offensive. In fact, the TTAB concluded that the term 'redskin(s)' means both a Native American and the Washington-area professional football team. The fact that it is usually offensive may mean the term is only offensive in one of these contexts.

Give me a break. Hedges like "usually" or "often" in a usage note are intended to exempt certain specialized uses of a term -- as a reclaimed epithet by members of the group, say, or when the word is mentioned in a linguistic discussion of epithets. They don't mean that the term is inoffensive when it's used as a trademark to capitalize on some features of the group's stereotype, in this case the inhuman savagery that people associate with the Indians of movies and popular fiction. (By that logic, you could get away with marketing an SAT prep program as YidSmarts.) In fact, inasmuch as the Lanham Act takes a non-disparaging status as a precondition for registration of a term, you'd never be able to invoke the condition, since no mark could be judged disparaging until consumers decided whether it was offensive as the name of the product in question.

Under the "butter-wouldn't-melt" heading, in fact, you could put the Redskins' claim that the success of the team brought honor to Indians -- in the same way, I assume, that the achievements of the New Jersey Devils bring honor to the Prince of Darkness.

The decision went on along these lines -- the judge opined, for example, that even though the Redskins' fans and newspapers used the team's name in ways that "often portrays Native Americans as aggressive savages and buffoons," with references to scalping opponents and by using pidgin English, the team itself has in recent years used Native American imagery respectfully. By that logic, a team could call itself the Washington Niggers without running afoul of the Lanham Act, so long as it was careful to post photos of Martin Luther King and Marian Anderson over the stadium doors. If the fans came to games in blackface and kinky wigs -- hey, that's not the team's look-out.

At this writing, the petitioners haven't decided whether they're going to appeal.

Posted by Geoff Nunberg at 06:14 PM

October 06, 2003

The awful German New Yorker Language?

In his classic discussion of The Awful German Language, Mark Twain complained about the distance between subjects and verbs in written German, citing the example

Wenn er aber auf der Strasse der in Sammt und Seide gehüllten jetzt sehr ungenirt nach der neusten Mode gekleideten Regierungsräthin begegnet.

which he glosses as

But when he, upon the street, the (in-satin-and-silk-covered-now-very-unconstrained-after-the-newest-fashioned-dressed) government counselor's wife met.

Twain's comment: "You observe how far that verb is from the reader's base of operations; well, in a German newspaper they put their verb away over on the next page; and I have heard that sometimes after stringing along the exciting preliminaries and parentheses for a column or two, they get in a hurry and have to go to press without getting to the verb at all. Of course, then, the reader is left in a very exhausted and ignorant state."

In Twain's example, the verb is separated from its subject by 19 words. In the example that Chris Potts put forward as evidence for the New Yorker's prejudice against quotative inversion, the subject-verb distance in the uninverted quotative tag is 20 words (the span from 'Sekulow' to 'says'):

"I would hope that, based on the President's judicial nominations so far, you will see him appoint Justices more in line with a conservative judicial philosophy," Jay Sekulow, the chief counsel to the American Center for Law and Justice, an advocacy group funded by the Reverend Pat Robertson, says.
(Jeffrey Toobin. Advice and dissent. The New Yorker, May 26, 2003 (p. 48, column 1))
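
The two counts above are easy to check mechanically. Here is a minimal Python sketch, assuming that "distance" means the number of whitespace-separated words strictly between the subject's head word and the verb. The function name `words_between` is made up for illustration, and only the quotative tag of the New Yorker sentence is passed in.

```python
import re

def words_between(text, subject, verb):
    """Count the words strictly between `subject` and the next `verb`."""
    # Split on whitespace, then strip punctuation from both ends of each token.
    tokens = [re.sub(r"^\W+|\W+$", "", t) for t in text.split()]
    tokens = [t for t in tokens if t]
    i = tokens.index(subject)        # first exact match for the subject word
    j = tokens.index(verb, i + 1)    # first verb match after the subject
    return j - i - 1

twain = ("Wenn er aber auf der Strasse der in Sammt und Seide gehüllten "
         "jetzt sehr ungenirt nach der neusten Mode gekleideten "
         "Regierungsräthin begegnet.")
toobin_tag = ("Jay Sekulow, the chief counsel to the American Center for Law "
              "and Justice, an advocacy group funded by the Reverend Pat "
              "Robertson, says.")

print(words_between(twain, "er", "begegnet"))       # 19
print(words_between(toobin_tag, "Sekulow", "says"))  # 20
```

Counting orthographic words is only one possible yardstick; a count of syllables or of syntactic constituents would give different numbers but the same general picture.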

Twain's essay was unfair to the German language and its distinguished tradition of excellent prose, just as this post is unfair to the New Yorker, a distinguished publication with genuinely high standards. So what's the point?

I assume that Germans, like speakers of other languages with S(ubject) O(bject) V(erb) order, normally have no real trouble "flounder[ing] through to the remote verb," as Mark Twain put it, though some German authors may make life harder for their readers than it needs to be. Similarly, English-language journalists are prone to pile up appositives and similar stuff between subject and verb, in a way that doesn't happen in speech and rarely happens in other kinds of prose. However, this ordinarily doesn't cause any real trouble for the reader, perhaps because these journalistic appositives are pragmatically very close to more colloquial constructions. Thus a reader can see:
David Devonshire, the company's chief financial officer, said the separation will improve the way customers view the company.
and think:
David Devonshire [is] the company's chief financial officer, [and he] said the separation will improve the way customers view the company.

No similar re-construal is available inside a quotative tag, where piles of post-subject appositives force the reader to "flounder through to the remote verb" without assistance. That is, a reader seeing:
The separation will improve the way customers view the company, David Devonshire, the company's chief financial officer, said.
can't (helpfully) think:
*The separation will improve the way customers view the company, David Devonshire [is] the company's chief financial officer [and he] said.

It's interesting that a linguistically-arbitrary stylistic rule -- the New Yorker's (conjectural) ban on quotative inversion -- may be forcing fine writers into these awkward constructions. This is an object lesson in the perils of trying to improve prose style by legislative fiat.

Posted by Mark Liberman at 11:36 AM

October 05, 2003

Menand's acumen deserts him

For a man who will write in a national magazine that “Microsoft Word is a terrible program” I will cut a lot of slack. Louis Menand can't be all bad. And his generally entertaining review article on the new 15th edition of The Chicago Manual of Style, which appeared in the October 6 issue of The New Yorker, is fun, cleverly interleaved with a slashing attack on Word's brainlessly irritating efforts to take charge of your writing. But he works himself up into such a lather of pedantry that he cannot resist making a side remark that raises a tired and false grammar story once again. I refer to the baseless claim that the College Board made a grammar mistake in a PSAT test. According to Menand, the Board “replaced the phrase ‘Toni Morrison's genius’ with ‘her’,” and it is a failing of the Chicago Manual that it does not warn the reader against this sort of thing. Well, the College Board did nothing of the sort, and Menand should be ashamed of his sloppiness. If we're going to play the grammatical pedant, then let's be careful to get it right.

What the College Board actually did was to use this sentence as the basis for a grammar question:

Toni Morrison's genius enables her to create novels that arise from and express the injustices African Americans have endured.

The questions following it asked about the location of whatever grammatical errors it might contain. The Board correctly took the answer to be that there are no errors in it; her refers back to Toni Morrison, which is fine, and nothing else is wrong. But Maryland high school teacher Kevin Keegan persuaded them to change the scores of everyone who took the test because, he claimed, it did have an error in it.

Keegan's case was based on the fact that several usage books insist that a noun phrase that is the genitive determiner of another noun phrase must never be the antecedent of a pronoun. (That is, in the College Board's sentence, the pronoun her simply cannot refer back to Toni Morrison as it is obviously intended to do.) Sometimes the books just assert this as a brute fact; sometimes they seem to imply that the rationale has to do with avoidance of ambiguity; and sometimes they seem to say that it is simply a logical truth: any modifier of a noun is ipso facto an adjective, they claim, and a pronoun replaces a noun, so a pronoun can never replace a genitive noun phrase in determiner function.

All of this is nuts. It is patently ridiculous to suggest that sentences like Roy Horn's white tiger attacked him on stage are ungrammatical. Such sentences are commonplace in the work of the finest writers. The prohibition against them is a mistaken over-generalization of a style recommendation about not getting antecedents too deeply embedded (“In several of Hitler's diary entries he says...” is not very good, for reasons that are rather hard to put a finger on), or about not writing ambiguously (in “Mary's mother thinks she's too fat” we can't tell who's supposed to be fat). And prenominal genitive determiner noun phrases are not adjectives, so to think that they can't be antecedents of pronouns for that reason is even madder than merely imagining that some obscure rule is being violated.

Geoff Nunberg published a very nice article in The New York Times that dealt with both the grammar and the politics of this case. It is a pity Louis Menand didn't read it. Menand's reverence for prescriptive usage books has blinded him to the fact that some of them (not all that many, actually) repeat silly rules from long ago that were never genuine principles of sentence formation in the language. (And was that last sentence of mine, which you read without a qualm, ungrammatical? Of course not.)

Posted by Geoffrey K. Pullum at 02:22 AM

The European Council legislates English morphology

In this post on the EDline list, Victor Dewsbery references an official page informing us that "[a]ccording to the European Council conclusions reached in Madrid in December 1995, . . . [i]n English, the terms euro and cent are invariable (no plural 's')."

Thus the Council decrees that its subjects should write (and say) "This apple costs 30 cent."

Curiously, the morphology of other languages is not similarly redefined: the French and Spanish get to keep their traditional 's', the Finns get a singular partitive 'a', etc.

Dewsbery cites some other official pages that backpedal on the 's' prohibition, presumably because it's hopeless to try to enforce the original policy. But why was it imposed in the first place? It's hard to believe that the drafters of the policy thought that this is how English works. Perhaps they saw the euro as a game animal, like grouse, deer, trout and salmon?

Another odd stricture, documented further down the same page, concerns the syntax of the ISO code EUR:

In English, the ISO code or euro sign is placed before the figure, separated by a non-breaking space, e.g. EUR 30.

In all other languages the order is reversed, e.g. 30 EUR.

[Update: this page explains the political history of plural euro and cent:

How did this unusual usage come into being? At a meeting of the Monetary Committee in 1998, the ECB -- fearing that the use of different spellings for the single currency might lead to legal problems -- claimed that "euro" and "cent" should be invariable in all languages, as decided in Madrid and Verona. The principle of invariable spelling was therefore accepted, but -- as so often happens -- some countries (France, Spain, Portugal) immediately obtained derogations allowing them the plural inflections natural to their languages (though not of course on the notes and coins themselves). In practice, therefore, 'invariable' meant 'invariable for some languages but not for others' right from the start.

So the European Council was not motivated by a subtle analogy to invariant "yen" and similar models in English-language currency usage. Instead, the Council aimed to establish a rationalist, morphology-free norm for all languages, and then issued exceptions to a few countries who were paying attention or who take language legislation seriously.

The English-only order EUR 30 is still a mystery to me.]

Posted by Mark Liberman at 01:15 AM

October 04, 2003

Colorless green probability estimates

43 years later, someone finally checked. And it turns out that Chomsky was wrong.

In Syntactic Structures (1957) Chomsky famously wrote:

  (1) Colorless green ideas sleep furiously.
  (2) Furiously sleep ideas green colorless.

. . . It is fair to assume that neither sentence (1) nor (2) (nor indeed any part of these sentences) has ever occurred in an English discourse. Hence, in any statistical model for grammaticalness, these sentences will be ruled out on identical grounds as equally `remote' from English. Yet (1), though nonsensical, is grammatical, while (2) is not.

This was one of the most compelling passages in an enormously influential book, which killed the early-50s information-theoretic explorations of language.

Chomsky's typically confident conclusion is both extraordinarily broad -- "in any statistical model for grammaticalness, these sentences will be ruled out on identical grounds" -- and also unsupported by any argument other than assertion. Yet anyone who knows that a statistical model can assign different probabilities to different unseen events will suspect that his assertion is wrong.

In an article "Formal grammar and information theory: together again?", Fernando Pereira describes an experiment that disproves Chomsky on this point, by fitting a simple statistical model (an "aggregate bigram model") to a corpus of newspaper text.

The result? The sentence "Furiously sleep ideas green colorless" is estimated by this model to be about 200,000 times less probable than "Colorless green ideas sleep furiously" (p. 7).
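
To see how a statistical model can separate two sentences it has never seen, here is a toy sketch in Python. It is not Pereira's model: his aggregate bigram model learns soft word classes from newspaper text by EM, whereas this sketch assumes hand-assigned hard classes, add-one smoothing, and a seven-sentence corpus invented for the purpose. All the names in it (`WORD_CLASS`, `CORPUS`, `sentence_prob`) are hypothetical. The corpus contains none of the word bigrams found in either test sentence.

```python
from collections import Counter

# Hypothetical hand-assigned classes (an assumption; Pereira's classes are learned).
WORD_CLASS = {
    "colorless": "ADJ", "green": "ADJ", "old": "ADJ", "big": "ADJ",
    "ideas": "N", "dogs": "N", "trees": "N",
    "sleep": "V", "grow": "V", "bark": "V",
    "furiously": "ADV", "slowly": "ADV", "loudly": "ADV",
}

# Invented toy corpus; no bigram from either test sentence occurs in it.
CORPUS = [
    "big old dogs bark loudly",
    "old trees grow slowly",
    "big dogs sleep",
    "green trees grow",
    "colorless old trees grow slowly",
    "ideas grow slowly",
    "dogs bark furiously",
]

def class_path(sentence):
    """Map a sentence to its class sequence, with boundary markers."""
    return ["<s>"] + [WORD_CLASS[w] for w in sentence.lower().split()] + ["</s>"]

trans, context = Counter(), Counter()          # class-bigram and class-context counts
word_count, class_count = Counter(), Counter()  # word counts and per-class token counts
for sentence in CORPUS:
    path = class_path(sentence)
    for a, b in zip(path, path[1:]):
        trans[a, b] += 1
        context[a] += 1
    for w in sentence.split():
        word_count[w] += 1
        class_count[WORD_CLASS[w]] += 1

N_NEXT = len(set(WORD_CLASS.values())) + 1  # possible next symbols: word classes plus </s>
CLASS_SIZE = Counter(WORD_CLASS.values())   # number of word types in each class

def sentence_prob(sentence):
    """P(sentence) = product of P(class | previous class) * P(word | class), add-one smoothed."""
    path = class_path(sentence)
    p = 1.0
    for a, b in zip(path, path[1:]):
        p *= (trans[a, b] + 1) / (context[a] + N_NEXT)
    for w in sentence.lower().split():
        c = WORD_CLASS[w]
        p *= (word_count[w] + 1) / (class_count[c] + CLASS_SIZE[c])
    return p

good = sentence_prob("Colorless green ideas sleep furiously")
bad = sentence_prob("Furiously sleep ideas green colorless")
print(good > bad, good / bad)
```

Both sentences use the same five words, so the word-given-class factors cancel and the whole difference comes from the class transitions: adjective-adjective-noun-verb-adverb is a familiar path through the toy corpus, and the reverse is not. The size of the ratio here means nothing in itself, since it depends entirely on the made-up corpus.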

Read the whole thing, which gives a picture of the history of these issues since 1950, including a sympathetic account of Zellig Harris' research program, and makes some interesting suggestions for the future.

[Note: Pereira's article was prepared for this volume on "The Legacy of Zellig Harris", which contains other interesting articles as well.]

Posted by Mark Liberman at 06:58 AM