Language Log: 5 billion lost articles, 6 interesting posts

March 29, 2005

Jean Véronis at Technologies du Langage has continued to develop the new field of Googlology, or perhaps more precisely Googlometry. In chronological order, supplying links to English versions where they are available, and to French versions otherwise:

Google: 5 billion "the" have disappeared overnight,
Google: Blogues ou bogues dans les News?
Google: A snapshot of the update,
Quel est le Data Center qui me répond?

For non-googlometricians who can read French, Jean offers an Easter basket of word frequency analysis of gospel texts, and on the secular side, a lexicometric analysis of the speeches of Jacques Chirac, following the work of Damon Mayaffre.

Jean presents a graph that

...montre un changement rhétorique majeur dans les discours présidentiels au cours de la Vème République : "le discours des trois premiers présidents, de Gaulle, Pompidou, Giscard, dans les années 1960-70 est nominal et conceptuel, tandis que le discours des trois suivants (Mitterrand1, Mitterrand2 et Chirac) à partir des années 1980 est verbal et énonciatif". Le discours se vide de sa substance...

...shows a major rhetorical change in the presidential discourse over the course of the 5th Republic: "the discourse of the first three presidents, de Gaulle, Pompidou, Giscard, in the years 1960-70 is nominal and conceptual, while the discourse of the three following [presidents] (Mitterrand 1, Mitterrand 2 and Chirac), starting in the 1980s, is verbal and expressive." The discourse is emptied of substance...

(The quoted passage is from Mayaffre's book.)

I'm not sure that Jean's view (that the discourse has been emptied of substance) is required by the data. A Chirac partisan (I suppose there must be some of them) might argue that his text is more active, more muscular and so on. This might even be in some sense true, if the difference in noun/verb balance is mainly due to expressing propositions less often in nominalized form. In any case, the change that Mayaffre has found seems to be not only statistically significant but also meaningful; the question is, what does it mean?

Another graph, easier to interpret, tracks the development of the word insécurité in Chirac's texts:

In both graphs, as I understand it, the y-axis represents what the French call "écarts réduits" ("reduced deviations"), which seem to be what we would call "z scores" in English. In other words, zero is the mean value (of the textual frequency of verbs or nouns or the word insécurité or whatever), while positive or negative values are frequencies greater or less than the mean, expressed in terms of standard deviations (the square root of the mean squared difference from the mean in the overall distribution). The numbers seem too large to be z scores (15 standard deviations above or below the mean would be a gargantuan effect), and z scores don't make sense as a metric on untransformed frequencies, so I'll check this further.
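To make the z-score reading concrete, here is a minimal sketch in Python. The frequencies are made up purely to illustrate the arithmetic (they are not taken from Jean's or Mayaffre's data), and the function name `z_scores` is my own.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Express each value as a distance from the mean, in standard deviations."""
    m = mean(values)
    # Population standard deviation: the square root of the mean squared
    # difference from the mean.
    sd = pstdev(values)
    return [(v - m) / sd for v in values]


# Hypothetical frequencies of some word (per 10,000 words) in five sections.
freqs = [2.0, 4.0, 4.0, 4.0, 6.0]
scores = z_scores(freqs)
print([round(z, 2) for z in scores])
```

With numbers like these the scores stay within a couple of standard deviations of zero, which is why values near 15 look suspicious for a plain z score.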

[Update: Jean explains that there is a "well-established tradition in French literary computing (which goes back to Muller in the 1960s)" to use "écarts réduits" as follows:

For word w in section S of corpus C, where
r is the observed count of w in S,
p is the proportion of S with respect to C, and q = 1-p,
t is the count predicted for w in S on the basis of w's frequency in C as a whole;
then the "écart réduit" e of w in S is

e = (r - t) / sqrt (t * q)

In other words, if w occurs 100 times in a certain segment S of the corpus C, and we would expect it to occur 200 times based on the overall corpus frequency of w, and S is 1/100th of C, then

e = (100-200)/sqrt(200*.99) = -7.1

Jean observes that Brunet's hyperbase program (available for the curiously exact sum of 144,83 € from the Institut National de la Langue Française) calculates this value. (I'd give you the hyperlink to INALF -- it's http://www.inalf.fr, as Google will tell you -- but its site was hacked some time ago by someone named Garzt3 and replaced with an ominous-looking flash animation, which has neither been fixed nor taken down.) ]
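The formula in the update above is easy to check by hand. Here is a minimal sketch in Python; the function names `ecart_reduit` and `expected_count` are mine, and the helper assumes that the predicted count t is simply the corpus-wide count of w scaled by p, which is how I read "predicted on the basis of w's frequency in C as a whole". This is not a reimplementation of Hyperbase.

```python
from math import sqrt


def expected_count(corpus_count, p):
    """Assumed prediction: the count of w in all of C, scaled by the share p of S in C."""
    return corpus_count * p


def ecart_reduit(r, t, p):
    """e = (r - t) / sqrt(t * q), where q = 1 - p."""
    q = 1.0 - p
    return (r - t) / sqrt(t * q)


# The worked example: w occurs 100 times in S, 200 were expected, S is 1/100th of C.
e = ecart_reduit(r=100, t=200, p=0.01)
print(round(e, 1))  # -7.1

# Under the assumption above, t = 200 corresponds to 20,000 occurrences of w in C.
t = expected_count(corpus_count=20000, p=0.01)
```

Because the denominator grows only with the square root of the expected count, frequent words in large sections can easily reach values like 15, which would explain the size of the numbers in the graphs.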

[More about Damon Mayaffre's work can be found here and here. Some other graphs of political lexicometry, or what Mayaffre calls "L'Herméneutique numérique" ("digital hermeneutics") -- the French right uses the various inflected forms of avoir ("have") much more often than the left does:

and the French right also uses past tenses (the passé composé and the imperfect) more often, while the left uses the future tense somewhat more:

]

Posted by Mark Liberman at March 29, 2005 05:35 AM