December 10, 2003

Irresponsible Punditry

The paper "Language Tree Divergence Times Support the Anatolian Theory of Indo-European Origin", discussed in a previous posting, was the subject of an article by Boston Globe staff writer Gareth Cook in the Thanksgiving Day issue (p. A16). The title, "A new word on birth of Western languages", is a little odd, since the Indo-European languages include not only most of the languages of Europe but also most of the languages of such non-Western countries as Iran (Persian), Armenia (Armenian), Afghanistan (Dari, Pashto), Pakistan (Panjabi, Urdu), India (Sanskrit, Hindi, Gujarati, Bengali, Assamese, Marathi, Oriya), Nepal (Nepali), Bangladesh (Bengali), and Sri Lanka (Sinhalese), as well as Kurdish, spoken in Turkey, Iran, and Iraq, and Tocharian A and B, once spoken in Chinese Turkestan. But the article itself is pretty good, apart from one irritating bit:

Gray was trained as a biologist, not a linguist, which some scientists said could explain the generally cautious reception yesterday's paper was greeted with among linguists. "Partly, I think they are irritated", said Luigi Luca Cavalli-Sforza, who is a leading expert on historic population migrations and a professor emeritus at Stanford Medical School. "It is a very good paper."

Cavalli-Sforza is indeed a distinguished geneticist, whom I first encountered via his book Cultural Transmission and Evolution: A Quantitative Approach, which I read with pleasure many years ago and still own. But as far as I can tell, Cavalli-Sforza has no reason whatever to think that the cautious reaction of linguists to the paper was based on anything other than legitimate scientific issues. There are some such issues, discussed in my previous posting. I know of no evidence that anyone's reaction was based on irritation. He's just blowing smoke.

For the record, here are the comments that I sent Gareth Cook when he was writing the piece in question. It seems to me that they make a few technical points, are in many ways positive about the paper, and withhold final judgment until I can find out more about what exactly the authors did. It's fair to characterize them as cautious, but I don't see any irritation. You can judge for yourself.

The paper by Gray and Atkinson is a serious paper. It shows familiarity with the literature and attempts to address the known problems with glottochronology and methods of dating based on lexical turnover. And they used a reasonably reliable source of data and information about cognation. They have also taken a number of precautions to ensure that their results are not the result of chance and to see that their assumptions are not influencing the results. So it compares quite favorably with the junk that we sometimes see in which people apply a technique from another field to a problem that they don't really understand, often with poor data sources.
The main question that this paper leaves me with is whether their technique adequately addresses the fact that the rate of lexical replacement is known not to be constant. They acknowledge the issue and say that "the assumption of a strict clock can be relaxed by using rate-smoothing algorithms to model variation across the tree." The reference they give is to what appears to be the manual for a piece of software. I'm not familiar with this, so on short notice I simply can't tell whether it adequately addresses the problem.
The other problem, pointed out by Don Ringe, is that it isn't clear what exactly they have done with their cognate sets. Dyen et al. contains Swadesh 200 word lists for 95 languages. They excluded 11 languages that Dyen et al. did not code, which leaves 84. Then they added Hittite, Tocharian A, and Tocharian B. So they should have 200 cognate sets across 87 languages. If they were using methods of the sort I am most familiar with, each cell in the matrix would have a value indicating either "for this lexical item this language retains a reflex of the reconstructed Proto-Indo-European etymon" or not. But that can't be what they have done since they talk about 2,449 cognate sets. So they've apparently split each gloss into multiple cognate sets, and they don't explain how.
I have an idea of what they might have done, but it's just a guess. Perhaps they have used each subset of cognate words as a "cognate set". For example, the PIE word for "bear" is believed to be the ancestor of Latin ursus, Greek arktos, Sanskrit rkshas, Welsh arth (as in the name Arthur) etc. However, this doesn't show up in Germanic and Balto-Slavic. Germanic languages have words like English "bear", German baer, Old Norse bjorn - evidently they referred to bears as "the brown ones". In Slavic you get words like Russian medved, literally "honey eater". Presumably, this reflects taboo-ing of the original word for bear. Anyhow, in a case like this they might have treated cognates of ursus as one cognate set, cognates of bear as another cognate set, and cognates of medved as a third cognate set. There's nothing wrong with that, as far as it goes. But "has a cognate of ursus", "has a cognate of bear", and "has a cognate of medved" are not independent - e.g., if a language has a cognate of ursus as its word for "bear", it doesn't have a cognate of medved. So if your technique assumes the independence of the characters, you can't do this.
It's quite possible that whatever they've done is not problematic - I can't tell because they don't give sufficient detail.
A minor comment is that it is a little odd to use the Romance languages when they are known to descend from Latin, which of course is well attested. Using the daughters rather than the ancestor can only add noise. Presumably they didn't use Latin because Dyen et al. don't give Latin data.
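
To make the independence point in the comments above concrete, here is a small Python sketch of the coding I guessed at. It is only an illustration: the data are a toy table built from the "bear" example, the class labels and the function name binary_characters are made up for the purpose, and nothing here is claimed to be what Gray and Atkinson actually did. The sketch also assumes each language has exactly one word for the meaning, which real word lists do not guarantee.

```python
# Toy data (hypothetical): the word for "bear" in a few languages,
# labeled by the cognate class it belongs to in the example above.
BEAR_COGNATE_CLASS = {
    "Latin": "ursus",
    "Greek": "ursus",
    "Sanskrit": "ursus",
    "Welsh": "ursus",
    "English": "bear",
    "German": "bear",
    "Old Norse": "bear",
    "Russian": "medved",
}


def binary_characters(classes):
    """Split one meaning slot into one presence/absence character per cognate class."""
    labels = sorted(set(classes.values()))
    matrix = {
        lang: [int(cls == label) for label in labels]
        for lang, cls in classes.items()
    }
    return labels, matrix


labels, matrix = binary_characters(BEAR_COGNATE_CLASS)

print(" " * 10, labels)
for lang, row in matrix.items():
    print(f"{lang:10s}", row)

# Under the one-word-per-meaning assumption, every row contains exactly one 1.
# Knowing that one column is 1 forces the others to be 0, so the columns
# are not independent characters.
row_sums = {sum(row) for row in matrix.values()}
print("distinct row sums:", row_sums)
```

With one meaning split three ways like this, 200 glosses could easily yield a couple of thousand binary characters, which is at least consistent in order of magnitude with the 2,449 "cognate sets" the paper mentions. Whether that is what the authors did, I still can't say.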

We expect scientists to provide objective commentary based on a knowledge of the subject, not insinuations about the alleged motives of those who disagree with them. We can leave that to the Postmodernists in the literature departments. So you might think that Cavalli-Sforza's remark was merely an addendum to a discussion of the scientific issues and suppose that the newspaper is at fault for reporting only the fluff. That probably isn't what happened though: this wouldn't be the first time that Cavalli-Sforza has substituted unfounded, ad hominem remarks for intelligent commentary.

Cavalli-Sforza is a staunch defender of the late Joseph Greenberg, whose 1987 book Language in the Americas is generally considered by historical linguists to be worthless, partly because its methodology is invalid, and partly because Greenberg's handling of the data is so appallingly bad. Cavalli-Sforza hasn't made any attempt to defend Greenberg's data, and his attempts to defend Greenberg's methodology contain nothing of substance. Let's take an example. In his book Genes, Peoples, and Languages he says (pp. 137-138):

...some anti-Greenberg linguists believe it is impossible to posit a quantitative relationship between any two languages. By disallowing reliable measurements, and by limiting the relationship between two languages only to "related or not related", the American linguists opposing Greenberg have ruled out the possibility of hierarchical classification, an essential prerequisite to taxonomy.

Now, this is perfect nonsense. I think it is fair to say that all of the linguists who have criticized Greenberg's work believe in degrees of relationship, that is, that some languages are more closely related to each other than to other languages. I have never heard ANY linguist express the view described by Cavalli-Sforza. Virtually every book and paper on historical linguistics assumes a hierarchical classification. To claim that historical linguists are critical of Greenberg because they don't believe in degrees of relationship is like claiming that biologists are critical of Lysenko because they don't believe in evolution.

It is also striking that such an amazing claim is supported by no evidence. Cavalli-Sforza doesn't even name any of the linguists who allegedly hold this view, much less supply quotations from their work or references to it. That's because there isn't any supporting evidence.

Just to be sure, I asked Cavalli-Sforza if he could offer any support for his claim:

From wjposer Sat Feb  1 13:26:19 2003
To: cavalli@stanford.edu
Subject: degrees of relationship
Content-Length: 698
Status: RO

Dear Professor Cavalli-Sforza:

In your book Genes, Peoples, and Language at pp. 137-138 you say:

      ...some anti-Greenberg linguists believe it is impossible to
      posit a quantitative relationship between any two languages.
      By disallowing reliable measurements, and by limiting the
      relationship betweeen two languages only to "related or
      not related", the American linguists opposing Greenberg have
      ruled out the possibility of hierarchical classification, an
      essential prerequisite to taxonomy.

I wonder if you could supply the names of the linguists who
take this position and references to publications in which
they have done so. Thank you.

Bill Poser

Here is his reply:

From cavalli@stanford.edu  Sun Feb  2 03:15:05 2003
Return-Path: 
Received: from smtp-roam.Stanford.EDU (smtp-roam.Stanford.EDU [171.64.14.91])
	by unagi.cis.upenn.edu (8.10.1/8.10.1) with ESMTP id h128F4D23142
	for ; Sun, 2 Feb 2003 03:15:05 -0500 (EST)
Received: from smtp-roam.Stanford.EDU (localhost [127.0.0.1])
	by smtp-roam.Stanford.EDU (8.12.6/8.12.6) with ESMTP id h128F3gG029435
	for ; Sun, 2 Feb 2003 00:15:04 -0800 (PST)
Received: from cavalli.stanford.edu (DNab42a421.Stanford.EDU [171.66.164.33])
	(authenticated bits=0)
	by smtp-roam.Stanford.EDU (8.12.6/8.12.6) with ESMTP id h128EpM5029427
	(version=TLSv1/SSLv3 cipher=DES-CBC3-SHA bits=168 verify=NOT);
	Sun, 2 Feb 2003 00:14:58 -0800 (PST)
Message-Id: <5.1.1.5.2.20030202001114.01aa9378@localhost>
X-Sender: cavalli@localhost
X-Mailer: QUALCOMM Windows Eudora Version 5.1.1
Date: Sun, 02 Feb 2003 00:14:47 -0800
To: William J Poser 

From: "L. Luca Cavalli-Sforza" 
Subject: Re: degrees of relationship
Cc: anca.ruhlen@forsythe.stanford.edu
In-Reply-To: <200302011826.h11IQKH06074@unagi.cis.upenn.edu>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"; format=flowed
Status: RO
Content-Length: 1246

Dear Dr. Poser,
My understanding of the errors by American linguists who criticized 
Greenberg is mostly derived from Greenberg's 1987 book on Lnguage in 
Americas, Stanford University Press. You may get a better response from 
Dr.Merritt Ruhlen, to whom I am cc-ing this letter.
Sincerely
                         Luca Cavalli-Sforza

He provides no support for the claims in his book, no references, no names. In fact, he admits that he doesn't have any firsthand knowledge of what he is talking about and has taken his views from Joseph Greenberg, the very person the critics are criticizing. Caveat lector.

Posted by Bill Poser at December 10, 2003 10:33 PM