December 10, 2003

Dating Indo-European

The journal Nature (vol. 426, 27 November) contains a paper entitled "Language Tree Divergence Times Support the Anatolian Theory of Indo-European Origin" by Russell D. Gray and Quentin D. Atkinson that has attracted a good deal of interest. The paper dates the initial divergence of the Indo-European language family to 8700 years ago, with Hittite as the first language to split off. This they take to support the theory that Indo-European originated in Anatolia and that Indo-European languages arrived in Europe with the spread of agriculture. They take this to argue against the alternative "Kurgan hypothesis", according to which the "Kurgan Culture" of the steppes was Indo-European speaking, though they say that it is consistent with the view that the Kurgan people represented a branch of Indo-European.

If it is really possible to obtain accurate dates for linguistic divergence from linguistic data, that would be very nice. It would provide a useful new tool for the study of prehistory. However, the reactions of historical linguists to this paper have generally been skeptical. I'll explain why.

Languages change in a number of ways: words are replaced by entirely different words, a word shifts in meaning, one grammatical construction is replaced by another. Much language change is systematic: a certain sound, in a certain context, changes into another sound in every word in which it occurs in that context. This is known as sound change, and the rules that describe the changes are known as sound laws. For example, Latin /k/ became French /sh/ (spelled <ch>) before the vowel /a/. Thus, Latin castellum became French chateau, Latin campus became French champ, Latin captivus became French chetif and so forth. To take another example, Japanese used to allow /y/ before /e/, as in yen, the unit of money, yedo, the old name for Tokyo, and yezo, the old name for Hokkaido, which shows up in the scientific neo-Latin adjective yezoensis "of or pertaining to Hokkaido", as in Porphyra yezoensis, the scientific name for susabinori, one of several species of the seaweed you eat wrapped around sushi. (Incidentally, susabinori and its relatives have a fascinating life history, which you can learn about here.) However, /y/ disappeared before /e/, so these words are pronounced /en/, /edo/, and /ezo/ in modern Japanese. Sound change plays an important role in working out the family trees of related languages. Languages that have undergone the same sound changes are likely to have been a single language at the point at which they underwent them. Interactions among sound changes can tell us the order in which they occurred.

Although sound change is the main way in which words change over time, it is also possible for a word to be replaced by an entirely different word. For example, the Proto-Indo-European word for "dog" was something like *kuon. (The star indicates that this is a hypothetical form.) We reconstruct this form from attested (actually recorded) forms like Greek kuon, Sanskrit shvan, and German hund by asking what proto-form would yield the attested forms after undergoing the sound changes observed in the various languages, and also taking into account changes in word-formation. The direct descendant of this word in English is hound. But at some point the common Germanic word for "dog" took on a more specialized meaning and was replaced, as the general term, by dog, a word whose origin we do not know.

Although we can do a reasonably good job of reconstructing the way in which languages are related to each other, the standard techniques only tell us the order in which the splits occurred; they don't give us dates. The main approach to assigning dates to linguistic divergence events is known as glottochronology or lexicostatistics, proposed in the early 1950s. Glottochronology was based on the idea that words are replaced by entirely different words at a constant rate, just as radioactive atoms decay at a constant rate. To apply the technique, you take a list of basic vocabulary known as the Swadesh list after Morris Swadesh, the linguist who proposed glottochronology, and you translate it into the languages you are working with. You then figure out which words are cognate. For example, if you were to compare English and German, you would record that English "foot" and German "fuss" are cognate while English "dog" and German "hund" are not. When you're done, you count up the number of cognates and compute the fraction of words that are cognate. You then plug this into an equation that allegedly gives you the number of years of separation between the two languages.

The equation is basically the inverse of the equation for radioactive decay, with a time constant based on the observed rate of lexical replacement in a number of languages whose history we know fairly well, primarily the Romance languages.
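
For concreteness, the usual textbook form of the equation is t = ln(c) / (2 ln(r)), where c is the fraction of shared cognates, r is the assumed fraction of basic vocabulary retained per millennium, and the factor of 2 reflects the two languages changing independently. The sketch below assumes that form. The function name is mine, and the default rate of 0.805 per millennium (a figure commonly quoted for the 200-word list) is an assumption for illustration; neither comes from the Gray and Atkinson paper.

```python
import math

def separation_millennia(shared_fraction, retention_rate=0.805):
    """Classic glottochronological estimate of time since two languages split.

    shared_fraction: fraction of list items that are cognate (0 < c <= 1).
    retention_rate:  assumed fraction retained per millennium (0 < r < 1);
                     the default is an illustrative assumption.
    Returns the estimated separation in millennia.
    """
    if not 0.0 < shared_fraction <= 1.0:
        raise ValueError("shared_fraction must be in (0, 1]")
    if not 0.0 < retention_rate < 1.0:
        raise ValueError("retention_rate must be in (0, 1)")
    # Inverse of exponential decay; both branches lose vocabulary, hence the 2.
    return math.log(shared_fraction) / (2.0 * math.log(retention_rate))

# Hypothetical pair of languages sharing 60% of the list as cognates.
print(round(separation_millennia(0.60), 2), "millennia")
```

Note how much rides on the single number passed as retention_rate: the whole method stands or falls on that rate being the same everywhere.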

There are a number of variants of glottochronology, using vocabulary lists of different lengths, different rates of lexical change, and so forth, and a variety of difficulties in applying the technique, but the central problem is that the lexical replacement rate is not constant. The rates observed in languages with a known history vary considerably. For example, studies show that English preserved only 68% of its basic vocabulary over a 1,000 year period, while Icelandic preserved 97%. Time depths calculated using the "standard" rate proved to be far off the mark in a number of test cases. As a result, glottochronology is considered to have been discredited by most historical linguists. (Further discussion of glottochronology, including problems not mentioned here, can be found in Lyle Campbell's textbook Historical Linguistics: An Introduction at pp. 177-186.)
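
To see how badly a fixed rate can go wrong, here is a small sketch that feeds the English and Icelandic retention figures quoted above into the single-language decay relation c = r^t and solves for t. The "standard" rate of 0.805 per millennium is an assumption on my part (it is the figure usually quoted for the 200-word list), and the function name is purely illustrative. The true elapsed time in both cases is one millennium.

```python
import math

ASSUMED_RATE = 0.805  # assumed "standard" retention per millennium (illustrative)

def implied_millennia(retained_fraction, rate=ASSUMED_RATE):
    """Elapsed time implied by a retention figure if the rate were constant."""
    return math.log(retained_fraction) / math.log(rate)

# Retention over a known 1,000-year period, as quoted in the text.
for name, retained in [("English", 0.68), ("Icelandic", 0.97)]:
    print(f"{name}: {implied_millennia(retained):.2f} millennia implied "
          f"(actual: 1.00)")
```

Under that assumed rate, English comes out looking considerably older than a millennium and Icelandic far younger, even though the same amount of time has passed for both.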

Gray and Atkinson used an existing database of words compiled by linguist Isidore Dyen (an advocate of glottochronology) and colleagues and used techniques and software developed for work in genetics to construct a family tree and assign dates to it. Their approach is similar to glottochronology in that it makes use exclusively of information about lexical replacement. It differs from glottochronology in the methods used to construct the tree and compute the dates.

This paper avoids many of the problems that frequently arise in work of this type. It shows familiarity with the literature and awareness of some of the problems with glottochronology and related methods. It also uses a reasonably reliable source of data and information about cognation.

Nonetheless, we can't accept these results at face value. One reason is that we're generally skeptical about any sort of purely lexical method such as this because we know that lexical replacement is much more subject to cultural influence, external and internal, than other aspects of language change. It's a little hard to believe that something as peripheral and unsystematic as lexical replacement provides sufficient information not only to reconstruct a realistic family tree but to date the splits. Keep in mind that the DNA sequence that serves as the input for tree construction and dating in genetics contains all of the information about biological change, whereas lexical replacement is a small part of language change.

More specifically, there is the question of whether their technique really deals adequately with the fact that the lexical replacement rate is not constant. We have to keep in mind that we're not talking about just a little bit of variation. As the examples of English and Icelandic show, the range of variation of lexical replacement rates is pretty large. (Unfortunately, the number of languages for which the rate has been determined is not large, so we don't have a good knowledge of the statistical distribution.) The paper does address this. They say that:

the assumption of a strict clock can be relaxed by using rate-smoothing algorithms to model variation across the tree.

but they give only a brief description of the approach, and the only reference is to the manual for the software that they used. The manual can be downloaded from the r8s website, but it isn't all that helpful. It looks like understanding this approach will require reading the papers referred to in the r8s manual as well as, very likely, experimenting with the program. It is possible that using r8s adequately deals with the problem of lexical replacement rate variation, but at this point we can't tell whether it really does.

Their treatment of the data also raises a red flag. Their data source contains Swadesh 200 word lists for 95 languages, but cognation information is omitted for 11 languages, so they reasonably enough left them out. Then they added data for three languages not contained in the Dyen et al. database: Hittite, the best attested of the Anatolian languages, and Tocharian A and B, the two Indo-European languages attested from Chinese Turkestan in the 6th through 8th centuries C.E. So they should have 200 sets of words across 87 languages. Each cell in the matrix would have a value indicating either "for this lexical item this language retains a reflex of the reconstructed Proto-Indo-European etymon" or not. But that can't be what they have done since they say that they used 2,449 cognate sets. They have somehow split each set of 87 words into an average of about 12 subsets.
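
The arithmetic behind those figures is easy to check. The sketch below uses only the numbers quoted above; the variable names are mine.

```python
# Figures as quoted in the text.
languages = 95 - 11 + 3   # Dyen et al. lists, minus those lacking cognation data, plus the additions
meanings = 200            # Swadesh 200-word list
cognate_sets = 2449       # number of cognate sets the authors report

print(languages, "languages")
print(round(cognate_sets / meanings, 1), "cognate sets per meaning on average")
```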

They don't say how they did this, but we can guess. What they may have done is to take each subset of cognate words as a "cognate set". For example, the PIE word for "bear" (not on the Swadesh list, just a convenient example) is believed to be the ancestor of Latin ursus, Greek arktos, Sanskrit rkshas, Welsh arth (as in the name Arthur) etc. However, this doesn't show up in Germanic and Balto-Slavic. Germanic languages have words like English "bear", German baer, Old Norse bjorn - evidently they referred to bears as "the brown ones". In Slavic you get words like Russian medved, literally "honey eater". What they may have done is treat cognates of ursus as one cognate set, cognates of bear as another cognate set, cognates of medved as a third cognate set, and so forth.

This is a perfectly reasonable way of describing the data, but you can't use binary characters based on such cognate sets as the input for clustering algorithms because characters like "has a cognate of ursus as its word for 'bear'" are not independent. If, for example, a language has a cognate of ursus as its word for "bear", it doesn't have a cognate of medved.
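
The dependence is easy to see if you write the coding out. The sketch below builds binary characters for the "bear" example, using the words mentioned above and labelling each cognate class by one of its members. It illustrates my guess at the coding, not a procedure confirmed by the paper, and the variable names are hypothetical.

```python
# Word for "bear" in each language, labelled by cognate class
# (class labels are just convenient names taken from the examples in the text).
bear_class = {
    "Latin": "ursus", "Greek": "ursus", "Sanskrit": "ursus", "Welsh": "ursus",
    "English": "bear", "German": "bear", "Old Norse": "bear",
    "Russian": "medved",
}

# One binary character per cognate class: 1 if the language's word belongs to it.
classes = sorted(set(bear_class.values()))
matrix = {
    lang: [int(cls == c) for c in classes]
    for lang, cls in bear_class.items()
}

print("language".ljust(10), classes)
for lang, row in matrix.items():
    print(lang.ljust(10), row)

# Within a single meaning, every row contains exactly one 1, so a 1 in any
# column forces a 0 in every other column: the characters are not independent.
print(all(sum(row) == 1 for row in matrix.values()))
```

In this toy coding each language has exactly one word per meaning, so the three "bear" characters carry the information of a single three-valued character rather than three independent binary ones.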

As I said, this is just a guess as to what they did. It seems pretty likely, since it is the obvious, non-arbitrary way to split up sets of semantically equivalent words, and it would probably produce the right number of cognate sets. But we don't know for sure, and until we do know exactly what they did, we can't decide how much credence to give their results.

The way in which this work was published also raises some issues. Why was this work published as a letter? Clearly, exactly what the authors did, and how what they did addresses the known problems, requires a lengthier discussion than they were able to provide. Letters to Nature are appropriate for announcing important new results using well-understood techniques that don't require lengthy discussion. And why was this published in Nature, a journal with no expertise in historical linguistics? If you're proposing a new technique, you would normally publish a full length paper in a journal with expertise in the area. That way, problems are likely to be caught at the review stage, and unclear points will be clarified prior to publication. A full length publication is much more likely to provide the reader with sufficient information to evaluate the paper, and publication in a journal in the appropriate specialty makes it more likely that people with relevant expertise will read the paper. A letter to Nature makes a nice splash, but it isn't the best way to put a new idea into play and let people kick it around.

We also have to ask how a paper that clearly doesn't contain sufficient information to allow it to be evaluated got published in Nature. It seems to be part of a pattern in which journals that don't routinely deal with linguistics fail to obtain referees with appropriate expertise. Mark Liberman discussed another instance of this in a previous posting about a review in the American Scientist Online of a paper, "Toward a phylogenetic chronology of ancient Gaulish, Celtic, and Indo-European", that appeared in the Proceedings of the National Academy of Sciences. PNAS is equally culpable for publishing a weak paper in the first place; they too failed to obtain appropriate advice.

[Although I alone am responsible for its final form, this note reflects discussion with Morris Halle, Jay Jasanoff, Don Ringe, Sally Thomason, and Tandy Warnow.]

Posted by Bill Poser at December 10, 2003 05:20 PM