April 28, 2004

Gray and Atkinson - Use of Binary Characters

As Mark already mentioned, yesterday Russell Gray gave a talk about the work on subgrouping and dating that appeared in a paper in Nature on which I commented a while back. The talk and subsequent discussion clarified exactly what they are doing.

One thing that emerged is that I was right about how they are treating the characters. In biology, "characters" are the features that are used for classification. In a traditional morphologically-based classification, a character might be "has a backbone" or "has a nucleus". In a DNA sequence based classification, the characters typically take the form of "has such and such a nucleotide at such and such a position". In a linguistic classification, the characters have to do with what words particular languages have. I've said this somewhat awkwardly because there is more than one way to set up lexical characters.

When linguists set up sets of words for lexical comparison, whether for classical subgrouping or for lexicostatistics, they are typically arranged by glosses. That is, we list the form that each meaning takes in the various languages. For instance, here is some data for the word for "dog" in a few of the Indo-European languages:

Sanskrit    ʃvān
Greek       kuōn
German      hund
Latin       kanis
English     dag

The first three forms are cognate. They descend from the same proto-Indo-European source by regular sound changes. The Latin form looks like it might be cognate to the first three but it isn't - the known sound changes from PIE to Latin do not yield this form. And the English form is not cognate either. In fact, this form is unique to English and of unknown origin.

If we were to code "dog" as a single multistate character, we would have three states, which we can call A, B, and C. The three states represent which of the three cognate sets (two of which, in our example, have only one member) represents the meaning "dog".

Sanskrit    A
Greek       A
German      A
Latin       B
English     C

Gray and Atkinson did not code their data this way. Instead, they made all of their characters binary. In order to do this with data that are naturally multistate, they split each multistate character into a set of binary characters, one per cognate set. If we recode our "dog" data into binary characters as Gray and Atkinson did, we have to create three characters, one for each cognate set. Each character then represents whether that cognate set is represented in a particular language. For instance, character A corresponds to the question: "Does the language have a form cognate to Sanskrit [ʃvān]?". A 1 means "yes"; a 0 means "no".

Language/Character  A  B  C
Sanskrit            1  0  0
Greek               1  0  0
German              1  0  0
Latin               0  1  0
English             0  0  1
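The recoding step can be sketched in a few lines of Python. This is only an illustration of the splitting described above, not Gray and Atkinson's actual code; the function name split_into_binary and the dictionary layout are my own assumptions.

```python
# Multistate coding of the "dog" character, copied from the table above.
multistate = {
    "Sanskrit": "A",
    "Greek": "A",
    "German": "A",
    "Latin": "B",
    "English": "C",
}


def split_into_binary(coding):
    """Split one multistate character into one binary character per state.

    Hypothetical helper for illustration: each state (cognate set) becomes
    its own character, valued 1 if the language shows that state, else 0.
    """
    states = sorted(set(coding.values()))
    return {
        language: {s: int(state == s) for s in states}
        for language, state in coding.items()
    }


binary = split_into_binary(multistate)
for language, row in binary.items():
    # Print each language followed by its values for A, B, C.
    print(language, "".join(str(row[s]) for s in sorted(row)))
```

Note that a recoding done this way can only ever produce a single 1 per language, since it starts from one state per language.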

The use of binary characters raises one additional point. Once the characters become "does a certain cognate set occur", it ceases to be relevant whether the cognate in a particular language preserves a particular meaning. For example, in the data above, English is shown as having a completely different word for "dog" from most of the other languages. However, English does have a cognate to the form that occurs in Greek, Sanskrit, and German, namely "hound". It is not listed as part of the data for "dog" because it no longer means "dog" but instead denotes a particular kind of dog. However, if we are just asking whether or not this cognate set occurs in English, the answer is "yes", so we must revise the table of character states:

Language/Character  A  B  C
Sanskrit            1  0  0
Greek               1  0  0
German              1  0  0
Latin               0  1  0
English             1  0  1
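To get this revised table, the coding has to start from the cognate sets themselves rather than from one form per meaning. A minimal sketch, assuming each cognate set is simply listed as the set of languages in which it survives in any meaning (the names cognate_sets and presence_matrix are hypothetical):

```python
# Membership of each cognate set, regardless of what the word now means.
# English belongs to set A through "hound" and to set C through "dog".
cognate_sets = {
    "A": {"Sanskrit", "Greek", "German", "English"},
    "B": {"Latin"},
    "C": {"English"},
}
languages = ["Sanskrit", "Greek", "German", "Latin", "English"]


def presence_matrix(languages, cognate_sets):
    """Return, for each language, 1 or 0 for the presence of each cognate set."""
    return {
        language: {
            name: int(language in members)
            for name, members in cognate_sets.items()
        }
        for language in languages
    }


matrix = presence_matrix(languages, cognate_sets)
for language in languages:
    # Print each language followed by its values for A, B, C.
    print(language, "".join(str(matrix[language][name]) for name in "ABC"))
```

Coded this way, a language can have a 1 for more than one cognate set, which is exactly what happens with English here.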

The dataset used by Gray and Atkinson in their Nature paper consists of a set of data created by Dyen et al. to which Gray and Atkinson added data for Hittite, Tocharian A, and Tocharian B. That dataset is organized by meaning, so it does not contain full cognate sets, only those cognates that retain their original meaning. "hound", for instance, would not be listed. That means that to convert multistate characters to binary characters properly, the original dataset has to be supplemented with cognates that differ in meaning. This of course does not affect the validity of the method.

The real significance of the use of binary characters is that the mathematical model that underlies the methods they use is based on the assumption that the characters are independent of each other. Whether an animal has a backbone is taken to be independent of whether or not it has a segmented body, and similarly what word a language has for "hand" is taken to be independent of what word it has for "fire". But when multistate characters are split into multiple binary characters in the manner described, the characters resulting from a split are not statistically independent. For the most part, languages have only a single word with a certain meaning and when a new word comes in, the old word disappears entirely rather than moving into a different meaning the way "hound" did in English. That means that in general, if we know that a language has a word belonging to one cognate set, we know that it does not have a word belonging to any of the others. Since we can predict the value of a character given information about the others, they are not statistically independent. The procedure that Gray and Atkinson used to create binary characters therefore violates the assumptions of the mathematical model.
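The dependence is easy to see on the "dog" data. The sketch below is my own illustration, not anything from Gray and Atkinson's analysis: it assumes the usual case of exactly one cognate set per meaning per language, and shows that character C can then be computed from A and B alone. The helper name predict_last is hypothetical.

```python
# Binary characters (A, B, C) for "dog", from the first binary table.
rows = {
    "Sanskrit": (1, 0, 0),
    "Greek": (1, 0, 0),
    "German": (1, 0, 0),
    "Latin": (0, 1, 0),
    "English": (0, 0, 1),
}


def predict_last(a, b):
    """Predict C from A and B.

    Assumes each language has exactly one cognate set for the meaning,
    so the three values must sum to 1.
    """
    return 1 - (a + b)


# Under that assumption the prediction is right for every language.
all_correct = all(predict_last(a, b) == c for a, b, c in rows.values())

# Three independent binary characters could show 2**3 = 8 patterns;
# only the patterns with a single 1 actually occur.
observed_patterns = set(rows.values())

# The revised English row (1, 0, 1), with "hound" counted, is the kind of
# exception where the prediction fails.
hound_is_exception = predict_last(1, 0) != 1

print(all_correct, len(observed_patterns), hound_is_exception)
```

Cases like "hound" loosen the dependence a little, but since they are the exception rather than the rule, the characters produced by a split remain far from independent.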

This is a reason to be nervous about the validity of the results that they obtained, but it does not show that the results are wrong. Some violations render a model useless; others have insignificant effects. In this case, we don't know what the impact of the violation is. Gray and Atkinson are doing some experiments which they expect to provide information about the impact of the use of binary characters.


Posted by Bill Poser at April 28, 2004 03:15 PM