As Mark already mentioned, yesterday Russell Gray gave a talk about the work on subgrouping and dating that appeared in a paper in Nature on which I commented a while back. The talk and subsequent discussion clarified exactly what they are doing.
One thing that emerged is that I was right about how they are treating the characters. In biology, "characters" are the features that are used for classification. In a traditional morphologically based classification, a character might be "has a backbone" or "has a nucleus". In a DNA-sequence-based classification, the characters typically take the form of "has such and such a nucleotide at such and such a position". In a linguistic classification, the characters have to do with what words particular languages have. I've said this somewhat awkwardly because there is more than one way to set up lexical characters.
When linguists set up sets of words for lexical comparison, whether for
classical subgrouping or for lexicostatistics, they are typically
arranged by glosses. That is, we list the form that each meaning
takes in the various languages. For instance, here is some data
for the word for "dog" in a few of the Indo-European languages:
Sanskrit | ʃvān |
Greek | kuōn |
German | hund |
Latin | kanis |
English | dag |
If we were to code "dog" as a single multistate character, it would have three
states, which we can call A, B, and C. Each state identifies which
of the three cognate sets (two of which, in our example, have only one member)
a language uses to express the meaning "dog".
Sanskrit | A |
Greek | A |
German | A |
Latin | B |
English | C |
Gray and Atkinson did not code their data this way. Instead, they made all
of their characters binary. In order to do this with data that are
naturally multistate, they split each multistate character into a set of
binary characters, one per cognate set. If we recode our "dog" data
into binary characters as Gray and Atkinson did, we have
to create three characters, one for each cognate set.
Each character then represents whether that cognate set is represented
in a particular language. For instance, character A corresponds
to the question: "Does the language have a form cognate to Sanskrit
[ʃvān]?". A 1 means "yes"; a 0 means "no".
Language/Character | A | B | C |
Sanskrit | 1 | 0 | 0 |
Greek | 1 | 0 | 0 |
German | 1 | 0 | 0 |
Latin | 0 | 1 | 0 |
English | 0 | 0 | 1 |
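For readers who think more easily in code, the recoding step can be sketched in Python. This is only an illustration using the toy "dog" data above. The names `multistate` and `to_binary` are my own inventions and are not taken from Gray and Atkinson's paper or software.

```python
# Toy multistate coding of "dog": language -> cognate set label.
# (Illustrative data from the table above; names are hypothetical.)
multistate = {
    "Sanskrit": "A",
    "Greek": "A",
    "German": "A",
    "Latin": "B",
    "English": "C",
}


def to_binary(coding):
    """Split one multistate character into one binary character per state."""
    states = sorted(set(coding.values()))
    return {
        language: {s: int(state == s) for s in states}
        for language, state in coding.items()
    }


binary = to_binary(multistate)
for language, row in binary.items():
    print(language, row)
```

Each language ends up with exactly one 1 in its row, since at this stage each language contributes a single form for the meaning.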
The use of binary characters raises one additional point.
Once the characters become "does a certain cognate set occur", it ceases
to be relevant whether the cognate in a particular language
preserves a particular meaning. For example, in the data above,
English is shown as having a completely different word for "dog" from
most of the other languages. However, English
does have a cognate to the form that occurs in Greek, Sanskrit, and German,
namely "hound". It is not listed as part of the data for "dog" because it
no longer means "dog" but instead denotes a particular kind of dog.
However, if we are just asking whether or not
this cognate set occurs in English, the answer is "yes", so we must revise
the table of character states:
Language/Character | A | B | C |
Sanskrit | 1 | 0 | 0 |
Greek | 1 | 0 | 0 |
German | 1 | 0 | 0 |
Latin | 0 | 1 | 0 |
English | 1 | 0 | 1 |
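The revision for "hound" can be sketched the same way. Again this is a toy illustration, not anyone's actual procedure: the list `shifted` is a hypothetical record of cognates that survive in a language with a changed meaning, and `add_shifted` is a name I have made up.

```python
# Binary characters for "dog" as derived from the meaning-based list.
binary = {
    "Sanskrit": {"A": 1, "B": 0, "C": 0},
    "Greek":    {"A": 1, "B": 0, "C": 0},
    "German":   {"A": 1, "B": 0, "C": 0},
    "Latin":    {"A": 0, "B": 1, "C": 0},
    "English":  {"A": 0, "B": 0, "C": 1},
}

# Hypothetical supplementary data: (language, cognate set) pairs for
# cognates that exist but no longer carry the original meaning.
shifted = [("English", "A")]  # English "hound"


def add_shifted(table, shifted_cognates):
    """Return a copy of the table with meaning-shifted cognates marked present."""
    revised = {language: dict(row) for language, row in table.items()}
    for language, cognate_set in shifted_cognates:
        revised[language][cognate_set] = 1
    return revised


revised = add_shifted(binary, shifted)
print(revised["English"])
```

The original table is left untouched, so the two codings can be compared side by side.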
The dataset used by Gray and Atkinson in their Nature paper consists of a set of data created by Dyen et al. to which Gray and Atkinson added data for Hittite, Tocharian A, and Tocharian B. That dataset is organized by meaning, so it does not contain full cognate sets, only those cognates that retain their original meaning. "hound", for instance, would not be listed. That means that to convert multistate characters to binary characters properly, the original dataset has to be supplemented with cognates that differ in meaning. This of course does not affect the validity of the method.
The real significance of the use of binary characters is that the mathematical model that underlies the methods they use is based on the assumption that the characters are independent of each other. Whether an animal has a backbone is taken to be independent of whether or not it has a segmented body, and similarly what word a language has for "hand" is taken to be independent of what word it has for "fire". But when multistate characters are split into multiple binary characters in the manner described, the characters resulting from a split are not statistically independent. For the most part, languages have only a single word with a certain meaning, and when a new word comes in, the old word disappears entirely rather than moving into a different meaning the way "hound" did in English. That means that in general, if we know that a language has a word belonging to one cognate set, we know that it does not have a word belonging to any of the others. Since we can predict the value of a character given information about the others, they are not statistically independent. The procedure that Gray and Atkinson used to create binary characters therefore violates the assumptions of the mathematical model.
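The dependence is easy to see on the toy "dog" data. The sketch below is my own illustration with made-up names (`strict`, `revised`, `predictable`); it says nothing about how large the effect is in the real dataset. It simply checks, for each language, whether the last binary character can be recovered from the others.

```python
# Binary characters A, B, C for "dog", one row per language.
# "strict" is the meaning-based coding; "revised" adds English "hound".
strict = {
    "Sanskrit": (1, 0, 0),
    "Greek":    (1, 0, 0),
    "German":   (1, 0, 0),
    "Latin":    (0, 1, 0),
    "English":  (0, 0, 1),
}
revised = dict(strict, English=(1, 0, 1))


def predictable(table):
    """Languages whose last character equals 1 minus the sum of the others."""
    return [
        language
        for language, row in table.items()
        if row[-1] == 1 - sum(row[:-1])
    ]


print(predictable(strict))
print(predictable(revised))
```

Under the strict coding every row is predictable in this sense. Once "hound" is added, English becomes the exception, which is the "for the most part" in the paragraph above.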
This is a reason to be nervous about the validity of the results that they obtained, but it does not show that the results are wrong. Some violations render a model useless; others have insignificant effects. In this case, we don't know what the impact of the violation is. They are doing some experiments which they expect to provide information about the impact of the use of binary characters.