April 25, 2004

Glottochronology revisited, very carefully

Those with a serious interest in the "neo-glottochronology" research by Forster and Toth, and more recently Gray and Atkinson (see Language Log posts here, here, and here), will want to read these papers:

Tandy Warnow, Steven Evans, Don Ringe and Luay Nakhleh, "Stochastic models of language evolution and an application to the Indo-European family of languages" (.pdf)

Steven Evans, Don Ringe and Tandy Warnow, "Inference of divergence times as a statistical inverse problem" (.pdf)

Russell Gray will be visiting Penn this week, and giving a talk on Tuesday, so watch this space for more information.

The second paper's conclusion is quoted below. It's interesting to see a case in which a statistician, a linguist and a computer scientist agree on the appropriateness of Rumsfeld's "unknown unknowns" saying, and like its formulation well enough to quote it.

Much of what we have said has focused on two issues: one is formulating appropriate stochastic models of character evolution (by formally stating the properties of the stochastic processes operating on linguistic characters), and the other is inferring evolutionary history from character data under stochastic models.

As noted before, under some conditions it may be possible to infer highly accurate estimations of the tree topology for a given set of languages. In these cases, the problem of dating internal nodes can then be formulated as: given the true tree topology, estimate the divergence times at each node in the tree. This approach is implicit in the recent analyses in (Gray & Atkinson, 2003; Forster & Toth, 2003), although they used different techniques to obtain estimates of the true tree for their datasets.

The problems with estimating dates on a fixed tree are still substantial. Firstly, dates do not make sense on unrooted trees, and so the tree must first be rooted, and this itself is an issue that presents quite significant difficulties. Secondly, if the tree is wrong, the estimate of even the date of the root may have significant error. Thirdly, and most importantly perhaps, except in highly constrained cases, it simply may not be possible to estimate dates at nodes with any level of accuracy...


Therefore we propose that rather than attempting at this time to estimate times at internal nodes, it might be better for the historical linguistics community to seek to characterise evolutionary processes that operate on linguistic characters. Once we are able to work with good stochastic models that reflect this understanding of the evolutionary dynamics, we will be in a much better position to address the question of whether it is reasonable to try to estimate times at nodes. More generally, if we can formulate these models, then we will begin to understand what can be estimated with some level of accuracy and what seems beyond our reach. We will then have at least a rough idea of what we still don't know.


As we know,
There are known knowns.
There are things we know we know.
We also know
There are known unknowns.
That is to say
We know there are some things
We do not know.
But there are also unknown unknowns,
The ones we don't know
We don't know.

-- Donald Rumsfeld
U.S. Secretary of Defense


Posted by Mark Liberman at April 25, 2004 10:19 PM