April 26, 2005

Strange bookfellows

Q: What do Geoff Pullum and Emily Dickinson have in common?

A: They are the only two authors in whose works the phrase gratuitous capitalization is currently identified by amazon.com as "statistically improbable".

Let me re-phrase that, with the help of amazon's "learn more" pop-up for "Statistically Improbable Phrases (SIPs)":

Amazon.com's Statistically Improbable Phrases, or "SIPs", are the most distinctive phrases in the text of books in the Search Inside! program. To identify SIPs, our computers scan the text of all books in Search Inside. If they find a phrase that occurs a large number of times in a particular book relative to all Search Inside books, that phrase is a SIP in that book.

SIPs are not necessarily improbable within a particular book, but they are improbable relative to all books in Search Inside. For example, most SIPs for a book on taxes are tax related. But because we display SIPs in order of their improbability score, the first SIPs will be on tax topics that this book mentions more often than other tax books. For works of fiction, SIPs tend to be distinctive word combinations that often hint at important plot elements.

Click on a SIP to view a list of books in which the phrase occurs. You can also view a list of references to the phrase in each book. Learn more about the phrase by clicking on the A9.com search link.

But the funny thing is, "gratuitous capitalization" only occurs once in Geoff Pullum's The Great Eskimo Vocabulary Hoax and Other Irreverent Essays on the Study of Language, and once in The Complete Poems of Emily Dickinson. So can it really be true that this is a phrase that "occurs a large number of times in [those] particular [books] relative to all Search Inside books"?

It seems misleading, in ordinary language terms, to say that once is "a large number of times".

Nevertheless, there can be a plausible argument for characterizing a phrase that occurs only once -- or perhaps never occurs at all -- as more or less "statistically improbable". This is a point that Noam Chomsky got wrong in 1957, but it's a commonplace idea by now.

It's ironic that Geoff -- the author of the Once-is-Cool-Twice-is-Queer (OICTIQ) principle for linguists and philologists -- is tagged by amazon for a "statistically improbable phrase" that he used only once. All the same, this might be a feature of the SIP algorithm rather than a bug. In an earlier post, I asked "how many times does a word or phrase need to be repeated in order to seem characteristic of a speaker or author?" and answered "not very many times, maybe only once or twice, if the use in context is salient enough".

This might be such a case -- it must be admitted that the phrase gratuitous capitalization does, as amazon puts it, "hint at important plot elements" in Geoff's oeuvre.

Still, I'd like to know more about the algorithm that amazon is using. As I observed in the previously-cited post:

Simple ratios of observed frequencies to general expectations will not work..., because ... such tests will pick out far too many words and phrases whose expected frequency over the span of text in question is nearly zero.

This is an instance of the problem that troubled Noam Chomsky in 1957. There are many, many two-word sequences in Geoff's book that do not occur at all in the other works indexed so far by amazon's "search inside" program. Looking at the context in which gratuitous capitalization occurs in Geoff's book, the immediately following sentence is

The harsh yoke of (e.g.) Academic Press and MIT Press copy-editing practices imposes on authors pointless and information-destructive capitalization of `significant' words (roughly, words that belong to the categories, N, A, or V) in titles.

Choosing at random, I find (by searching on A9.com) that the sequence "authors pointless" occurs in no other work known to amazon.com (check the returns for books, not the results borrowed from Google...). So why is "authors pointless" not in the SIP list for The Great Eskimo Vocabulary Hoax? Amazon must be doing something clever.
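
To make the difficulty concrete, here is a toy sketch of the "simple ratio" approach. Everything in it is an assumption for illustration: the two tiny word lists stand in for a book and for the rest of the corpus, and `floor` is a made-up pseudo-count that keeps the ratio defined for sequences never seen elsewhere. It is not a claim about what amazon does.

```python
from collections import Counter

def bigram_counts(tokens):
    """Count adjacent word pairs in a list of tokens."""
    return Counter(zip(tokens, tokens[1:]))

def naive_ratio_scores(book_tokens, background_tokens, floor=0.5):
    """Score each bigram in the book by observed rate / expected rate.

    `floor` is a hypothetical pseudo-count used for bigrams that never
    occur in the background text, so that the ratio is defined.
    """
    book = bigram_counts(book_tokens)
    background = bigram_counts(background_tokens)
    n_book = sum(book.values())
    n_background = sum(background.values())
    scores = {}
    for bigram, count in book.items():
        observed = count / n_book
        expected = max(background[bigram], floor) / n_background
        scores[bigram] = observed / expected
    return scores

# Toy stand-ins for one book and for everything else in the index.
book_tokens = "authors pointless and gratuitous capitalization of the words".split()
background_tokens = "of the words and of the titles and of the books".split()

scores = naive_ratio_scores(book_tokens, background_tokens)
top = max(scores.values())
tied_for_top = sorted(b for b, s in scores.items() if s == top)
for bigram in tied_for_top:
    print(" ".join(bigram))
```

Every sequence that is absent from the background text gets the same maximal score, so "authors pointless" ties with "gratuitous capitalization", along with "and gratuitous" and the rest. A ratio test alone cannot tell them apart.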

Ah, you may say, but "gratuitous capitalization" is a syntactically and semantically meaningful unit, while "authors pointless" is not. This is certainly an issue for such algorithms -- SIPs ought to be meaningful phrases of some sort, not just random uncommon word sequences. However, it's not obvious that amazon's cleverness is based on this sort of linguistic analysis of the content of the books indexed. Take a look at the actual occurrence of "gratuitous capitalization" in The Complete Poems of Emily Dickinson (pages x-xi of the Front Matter, written by the editor Thomas H. Johnson):

I have silently corrected obvious misspelling (witheld, visiter, etc.) and misplaced apostrophes (does'nt). Punctuation and capitalization remain unaltered. Dickinson used dashes as a musical device, and though some may be elongated end stops, any "correction" would be gratuitous. Capitalization, though often capricious, is likewise untouched. [emphasis added]

So in this case the "statistically improbable phrase" is no phrase at all, but a word sequence spanning a sentence boundary.
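
One way such a sequence could end up counted as a two-word unit is sketched below, on the assumption (mine, not anything amazon has documented) that the text is lowercased and stripped of punctuation before adjacent words are paired. The function names are hypothetical, and the second function is there only for contrast: it refuses to pair words across a sentence-ending mark.

```python
import re

def word_bigrams(tokens):
    """Pair each token with the one that follows it."""
    return list(zip(tokens, tokens[1:]))

def degraded_bigrams(text):
    """Lowercase, drop all punctuation, then pair adjacent words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return word_bigrams(tokens)

def sentence_aware_bigrams(text):
    """Same as above, but never pair words across '.', '!' or '?'."""
    pairs = []
    for sentence in re.split(r"[.!?]+", text):
        tokens = re.findall(r"[a-z]+", sentence.lower())
        pairs.extend(word_bigrams(tokens))
    return pairs

passage = ('any "correction" would be gratuitous. '
           'Capitalization, though often capricious, is likewise untouched.')
target = ("gratuitous", "capitalization")

print(target in degraded_bigrams(passage))        # True
print(target in sentence_aware_bigrams(passage))  # False
```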

On the other hand, looking over some longer lists of Statistically Improbable Phrases, it does seem that they are limited to things that are plausibly phrases to start with. (See for example the SIP list for Ray Jackendoff's Foundations of Language.)

So here's what seems to be going on:

  1. amazon is indexing books by a method that throws away all punctuation and case (and stop words?), and is identifying possible SIPs by reference to (2- and 3-element?) subsequences of the resulting degraded strings;
  2. amazon is limiting SIPs to things that are plausibly phrases in a linguistic sense, as they might occur in undegraded text, independent of their context of occurrence in any particular work -- or it is imposing some other condition that has this effect;
  3. candidate SIPs (identified as in [1], and limited as in [2]) are accepted iff their probability (estimated from a model derived from all books indexed) is below some threshold (and perhaps if some other conditions are met).

I'm pretty sure about [1] and [3] (though I'd like to know more about the probability estimation method, and any other conditions that may be used). [2] is the part that is least clear to me. All the methods that occur to me will either miss genuinely characteristic phrases (problems with "recall"), or flag sequences that should not be considered phrases at all (problems with "precision").

A few minutes of poking around turned up plenty of other mistakes like the Emily Dickinson one, where a SIP is not actually being used as a phrase in the cited context, but no examples at all where a SIP might not plausibly be a meaningful phrase in some context. Thus amazon must be tuning its algorithm (sensibly) for high precision at low(er) recall. But I'd still like to know how it works.

And I'm distressed to learn that Geoff and Emily are not really textual siblings after all.

Posted by Mark Liberman at April 26, 2005 06:07 AM