June 07, 2005

Captain Crunch among the literati

In her 6/5/2005 NYT essay "The Word Crunchers", Deborah Friedell sends a mixed message about the value of quantitative methods for literary analysis. She telegraphs her ambivalence in the first paragraph by citing the passage in David Lodge's "Small World" where "a novelist" (Frobisher) learns from "a literature professor fond of computer programming" (Dempsey) that his novels are lexically "saturated with grease". (Friedell says that the novelist's "favorite word" is the form greasy -- Lodge actually says "Grease. Greasy. Greased. Various forms and applications of the root, literal and metaphorical" -- you can read the passage here.) In Lodge's novel, this is an episode without a hero. Perhaps analyzing word counts kills creativity: Frobisher "[loses] faith in [his] style" and "has never been able to write fiction since". On the other hand, Lodge intimates that this is no loss to literature: perhaps the refiner's fire of stylometry purifies art?

In the next section of Friedell's essay, she places Amazon's concordance function in the context of the history of systems for word-indexed access to books, starting with Hugh of St. Cher and his 500 Dominican monks in the 13th century. (You can read more about this history here). We're meant to be impressed that concordances, which used to be so hard to build, are now so easily available to everyone. But what is this access worth? Friedell describes Lane Cooper, who produced a Wordsworth concordance in 1911, as "a geneticist of language, isolating and mapping the smallest parts with the confidence that they will somehow reveal the design of the whole". Though this phrase associates Cooper with a group whose intellectual prestige is now high -- the geneticists' reductionism has worked out pretty well, on the whole -- that "somehow" conveys a clear note of skepticism.

Friedell's most positive evaluation of concordances and such is here:

Why did they labor so? Monks used concordances to ferret out connections among the Gospels. Christian theologians relied on them in their quest for proof that the Old Testament contained proleptic visions of the New. For philologists, concordances provide a way of defining obscure words; if you gather enough examples of a word in context, you may be able to divine its meaning. Similarly, concordances help scholars attribute texts of uncertain provenance by allowing them to see who might have used certain words in a certain way. For readers, concordances can be a guide into a writer's mind. "A glance at the Lane Cooper concordance" led Lionel Trilling to conclude that Wordsworth, "whenever he has a moment of insight or happiness, talks about it in the language of light." (The concordance showed the word "gleam" as among Wordsworth's favorites.)
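The word-in-context lookup that the quoted passage describes -- gathering every occurrence of a word along with its surroundings -- is trivial to build today. Here is a minimal keyword-in-context (KWIC) sketch in Python, using two real lines from Wordsworth's Intimations Ode as sample text; the function and its parameters are my own illustration, not any particular concordance's design:

```python
import re
from collections import defaultdict

def kwic(text, width=3):
    """Keyword-in-context concordance: map each word to windows of
    `width` words on either side of every occurrence."""
    words = re.findall(r"[a-z']+", text.lower())
    index = defaultdict(list)
    for i, w in enumerate(words):
        left = " ".join(words[max(0, i - width):i])
        right = " ".join(words[i + 1:i + 1 + width])
        index[w].append(f"{left} [{w}] {right}".strip())
    return index

# Two lines of Wordsworth; per Trilling, "gleam" was among his favorites.
sample = ("Whither is fled the visionary gleam? "
          "Where is it now, the glory and the dream?")
for entry in kwic(sample)["gleam"]:
    print(entry)  # fled the visionary [gleam] where is it
```

Hugh of St. Cher needed 500 monks for this; a laptop does it in milliseconds, which is Friedell's point about easy availability.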

She also cites with approval some insights derived by James Painter from an analysis of words that occur only once in Yeats, but it's downhill from there. She complains that

To read a concordance is to enter a world in which all the included words are weighted equally, each receiving just one entry per appearance. While Amazon's concordance can show us the frequency of the words "day" and "shall" in Whitman, "contain" and "multitudes" don't make the top 100. Neither does "be" in Hamlet, nor "damn" in "Gone with the Wind." The force of these words goes undetected by even the most powerful computers.
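The ranking Friedell objects to here is just raw token frequency, which function words inevitably dominate. A toy sketch (using Whitman's famous line as hypothetical input) makes the point that the memorable words need not top the list:

```python
import re
from collections import Counter

def top_words(text, n=100):
    """Raw-frequency ranking of the kind a simple concordance produces."""
    words = re.findall(r"[a-z']+", text.lower())
    return [w for w, _ in Counter(words).most_common(n)]

sample = ("I am large, I contain multitudes. "
          "Do I contradict myself? Very well then I contradict myself.")
# "contain" and "multitudes" each occur once, so even a top-3 list
# is filled by "i", "contradict", and "myself" instead.
print(top_words(sample, 3))
```

Frequency alone says nothing about salience -- which is exactly the gap Friedell notices, though it is a gap that more refined statistics can in principle address.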

And she closes with some random examples from Amazon's extremely random "statistically improbable phrases" feature (discussed at somewhat greater length here). The word on the street is that the "statistically improbable phrases" stuff was an undergraduate intern's summer project; as far as I know, the algorithm used has never been documented, and judging from its results, its absence from the technical literature is not a serious loss to science.
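Since the algorithm is undocumented, any reconstruction is a guess. One plausible family of methods compares a phrase's frequency in the book against its frequency in a background corpus, e.g. a smoothed frequency ratio over bigrams. Everything below -- function, scoring, and data -- is an illustrative assumption, not Amazon's method:

```python
import re
from collections import Counter

def improbable_phrases(book, background, top=5, min_count=2):
    """A guessed-at SIP-like score: rank bigrams by how much their
    relative frequency in the book exceeds their (add-one smoothed)
    relative frequency in a background corpus."""
    def bigrams(text):
        w = re.findall(r"[a-z']+", text.lower())
        return Counter(zip(w, w[1:]))
    b, bg = bigrams(book), bigrams(background)
    n_b, n_bg = sum(b.values()) or 1, sum(bg.values()) or 1
    scores = {
        pair: (count / n_b) / ((bg[pair] + 1) / n_bg)
        for pair, count in b.items() if count >= min_count
    }
    return sorted(scores, key=scores.get, reverse=True)[:top]

# Toy data (hypothetical): a "book" with a repeated pet phrase, and a
# background sample of common function words.
book = "the visionary gleam and the visionary gleam again"
background = "the and of to a in the and of to a in"
print(improbable_phrases(book, background))
```

Even this crude ratio surfaces the repeated phrase; whether Amazon's feature does anything more principled is anyone's guess, which is rather the point.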

Nevertheless, there are real, interesting, and approachable problems in explaining why certain low-frequency linguistic events are psychologically salient. I discussed one class of such problems in this Language Log post. My conclusion: none of the simple and obvious methods will work, but there are some promising directions in the recent machine learning literature.

Friedell has a very different take on the situation:

Once it would have seemed unnecessary to point out that a statistical tool has no ear for allusions, for echoes, for metrical and musical effects, for any of the attributes that make words worth reading. Today, perhaps it bears reminding.

This is the real point of her essay, I think, and its logic is bizarre. Friedell has discussed word-based indices, simple considerations of (maximum and minimum) word frequency, and two undocumented attempts at slightly more sophisticated analysis of lexical statistics (Dempsey's fictional de-styling of Frobisher, and Amazon's SIP). She's never raised or discussed the question of how "allusions", "echoes" and so on might be described and modeled statistically. She certainly hasn't mentioned the extensive quantitative work on "metrical and musical effects" in metered verse. There's no evidence in the essay that she really knows what a "statistical tool" is.

Her point appears to be that the methods of rational investigation have no real place in the analysis of literature. This widely held view would be more convincing if its proponents understood the methods they reject.

[Some relevant Language Log posts:

The shadow of stylometry
A briefe and a compendious table
And yet.
Strange bookfellows

The essay's final phrase "it bears reminding" deserves a post of its own.]

Posted by Mark Liberman at June 7, 2005 07:16 AM