Quantitative comparative linguistics

Statistical methods have been used in comparative linguistics since at least the 1950s (see Swadesh list). Since about the year 2000, there has been renewed interest in the topic, based on the application of methods from computational phylogenetics and cladistics to define an optimal tree (or network) that represents a hypothesis about the evolutionary ancestry of a set of languages and perhaps their contacts with other languages. The probability that languages are related can be quantified, and sometimes the proto-languages can be approximately dated. The topic came to the attention of the popular press in 2003 after the publication of a short study on Indo-European in Nature (Gray and Atkinson 2003). A volume of articles on Phylogenetic Methods and the Prehistory of Languages was published in 2006 as the result of a conference held in Cambridge in 2004.

A goal of comparative historical linguistics is to identify instances of genetic relatedness amongst languages. The steps in quantitative analysis are: (i) to devise a procedure based on theoretical grounds, on a particular model, on past experience, etc.; (ii) to verify the procedure by applying it to data for which a large body of linguistic opinion exists for comparison (this may lead to a revision of the procedure devised in step (i) or, at the extreme, to its total abandonment); and (iii) to apply the procedure to data for which linguistic opinions have not yet been produced, have not yet been firmly established, or are perhaps even in conflict.

Applying phylogenetic methods to languages is a multi-stage process: (a) the encoding stage: getting from real languages to some expression of the relationships between them in the form of numerical or state data, so that those data can be used as input to phylogenetic methods; (b) the representation stage: applying phylogenetic methods to extract from those numerical and/or state data a signal that is converted into some useful form of representation, usually a two-dimensional graphical one such as a tree or network, which synthesises and "collapses" what are often highly complex multi-dimensional relationships in the signal; (c) the interpretation stage: assessing those tree and network representations to determine what they actually mean for real languages and their relationships through time.
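The encoding stage can be illustrated with a minimal sketch in Python. It assumes one common kind of state data, in which each cognate class for each meaning becomes a binary presence/absence character, and it then derives a simple normalised Hamming distance between languages as one possible numerical input to a tree-building or network method. The language names, meanings, and cognate class labels are invented for illustration, and the function names are hypothetical; real studies use curated word lists and more elaborate models.

```python
from itertools import combinations

# Hypothetical toy data (not from any real study): for each meaning,
# the cognate class label assigned to each language's word.
COGNATE_SETS = {
    "hand":  {"LangA": 1, "LangB": 1, "LangC": 2},
    "water": {"LangA": 1, "LangB": 2, "LangC": 2},
    "stone": {"LangA": 1, "LangB": 1, "LangC": 1},
}


def encode_binary(cognate_sets):
    """Turn cognate judgements into a binary character matrix.

    Each (meaning, cognate class) pair becomes one character; a language
    scores 1 if its word belongs to that class and 0 otherwise. In this
    sketch a missing word is also scored 0, which is a simplification.
    """
    languages = sorted(
        {lang for classes in cognate_sets.values() for lang in classes}
    )
    characters = [
        (meaning, label)
        for meaning in sorted(cognate_sets)
        for label in sorted(set(cognate_sets[meaning].values()))
    ]
    matrix = {
        lang: [
            1 if cognate_sets[meaning].get(lang) == label else 0
            for meaning, label in characters
        ]
        for lang in languages
    }
    return characters, matrix


def hamming_distances(matrix):
    """Proportion of binary characters on which each language pair differs."""
    return {
        (a, b): sum(x != y for x, y in zip(matrix[a], matrix[b])) / len(matrix[a])
        for a, b in combinations(sorted(matrix), 2)
    }


characters, matrix = encode_binary(COGNATE_SETS)
distances = hamming_distances(matrix)

for pair, distance in sorted(distances.items()):
    print(pair, round(distance, 2))
```

The resulting distance table or the binary matrix itself would then be passed to the representation stage. Which of the two is appropriate depends on the phylogenetic method chosen: distance-based methods take the table, while character-based methods work on the matrix directly.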

The standard method for assessing language relationships has been the comparative method. However, this has a number of limitations. Not all linguistic material is suitable as input, and there are issues concerning the linguistic levels on which the method operates. The reconstructed languages are idealized, and different scholars can produce different results. Language family trees are often used in conjunction with the method, and "borrowings" must be excluded from the data, which is difficult when the borrowing takes place within a family. It is often claimed that the method is limited in the time depth over which it can operate. The method is difficult to apply, and there is no independent test of its results. Thus alternative methods have been sought that follow a formalised procedure, quantify the relationships and can be tested.

...
Wikipedia