November 25, 2003

Parsers that count

A month ago, I cited the difficulty of parsing complex nominals like the one found on a plaque in a New Jersey steakhouse: "Volume Feeding Management Success Formula Award". We're talking about sequences of nouns (with adjectives mixed in as well), and the problem is that these strings mostly lack the structural constraints that parsers traditionally rely on.

When you (as a person or a parser) see a sequence like "A lapse in surveillance led to the looting" (from this morning's New York Times, more or less), you don't necessarily need to figure out what it means or even pay much attention to what the words are: "A NOUN in NOUN VERBED to the NOUN" has a predictable structure in English, however you fill in the details. But "NOUN NOUN NOUN NOUN NOUN NOUN" is a smooth, seamless block -- you (or the parser) can carve anything you please out of that.

One traditional solution is to look at the meaning. Why is it "[stone [traffic barrier]]" rather than "[[stone traffic] barrier]"? Well, it's because traffic barriers made of stone make easy sense in contemporary life, while barriers for stone traffic evoke some kind of science-fiction scenario. The practitioners of classical AI figured out how to do this kind of analysis for what some called "limited domains", and others called "toy problems". But this whole approach has stalled, because it's hard.

There's another way, though.

Here's a set of simple illustrative examples, taken from work in a local project on information extraction from biomedical text. (These examples come from Medline.) Each of the four possible 3-element complex nominal sequences (with two nouns or adjectives preceding a noun) is exemplified in each of the two possible structures: one with the two leftward words grouped (left-branching), the other with the two rightward words grouped (right-branching).

Structure   Example                           Left-pair count   Right-pair count
[NN]N       sickle cell anemia                10561             2422
N[NN]       rat bile duct                     203               22366
[NA]N       information theoretic criterion   112               5
N[AN]       monkey temporal lobe              16                10154
[AN]N       giant cell tumour                 7272              1345
A[NN]       cellular drug transport           262               746
[AA]N       small intestinal activity         8723              120
A[AN]       inadequate topical cooling        4                 195

And the numbers? The numbers are just counts of how often each adjacent pair of words occurs in (our local version of) the Medline corpus (which has about a billion words of text overall). Thus the sequence "sickle cell" occurs 10,561 times, while the sequence "cell anemia" occurs 2,422 times.
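To make the counting concrete, here's a minimal sketch in Python (not the project's actual code) of tallying adjacent word pairs; the file name and the whitespace tokenization are placeholders for whatever corpus and tokenizer you actually have.

    from collections import Counter

    def bigram_counts(tokens):
        """Count how often each adjacent pair of words occurs."""
        counts = Counter()
        for left, right in zip(tokens, tokens[1:]):
            counts[(left, right)] += 1
        return counts

    # Hypothetical usage: "medline.txt" is a stand-in for (a slice of) the
    # corpus, and whitespace splitting is a stand-in for real tokenization.
    with open("medline.txt") as f:
        tokens = f.read().split()

    counts = bigram_counts(tokens)
    print(counts[("sickle", "cell")])   # 10561 in the full billion-word corpus
    print(counts[("cell", "anemia")])   # 2422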

Most of the time, in a 3-element complex nominal "A B C", you can parse the phrase correctly just by answering one question: which is commoner in a billion words of text, "A B" or "B C"? If "A B" wins, group the first two words ([[A B] C]); if "B C" wins, group the last two ([A [B C]]).
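Given a table of those pair counts, the rule itself is just a comparison. Here's a sketch (the function and the little dictionary of counts are mine, not the project's), using the figures from the table above:

    def bracket(a, b, c, counts):
        """Parse the 3-element complex nominal 'a b c' by grouping
        whichever adjacent pair is commoner in the corpus."""
        if counts.get((a, b), 0) > counts.get((b, c), 0):
            return "[[{} {}] {}]".format(a, b, c)   # left-branching
        return "[{} [{} {}]]".format(a, b, c)       # right-branching

    # Pair counts copied from the table above:
    counts = {("sickle", "cell"): 10561, ("cell", "anemia"): 2422,
              ("rat", "bile"): 203, ("bile", "duct"): 22366}
    print(bracket("sickle", "cell", "anemia", counts))  # [[sickle cell] anemia]
    print(bracket("rat", "bile", "duct", counts))       # [rat [bile duct]]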

In a crude test of 64 such sequences from Medline (8 of each type in the table above), this method worked about 88% of the time.

Actually, this is an underestimate of the performance of such approaches. In the first place, the different sequence types are not at all equally frequent, nor are the parsing outcomes equally likely for a given sequence type. Thus in the Penn Treebank WSJ corpus (a thousand times smaller than Medline, and much less infested with complex nominals, but still...) there are 10,049 3-element complex nominals, which are about 70% right-branching ([A [B C]]) vs. 30% left-branching ([[A B] C]). More information about the part-of-speech sequence or the particular words involved gives additional leverage. And other counts (such as the frequency of the individual words, of the pattern "A * C", etc.) may also help. There are also more sophisticated statistics besides raw bigram frequency (though in this case the standard ones, such as chi-squared, mutual information, etc., work slightly worse than raw counts do).
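For concreteness, here's what one of those alternative association statistics looks like: a sketch of pointwise mutual information computed from pair and single-word counts. The numbers in the example call are made up purely to show the arithmetic, and as noted above, on this task such scores did slightly worse than the raw counts.

    import math

    def pmi(pair_count, left_count, right_count, total):
        """Pointwise mutual information: how much more often the pair
        occurs than it would if the two words were independent."""
        p_pair = pair_count / total
        p_left = left_count / total
        p_right = right_count / total
        return math.log2(p_pair / (p_left * p_right))

    # Hypothetical counts, purely to show the call (not Medline figures):
    print(pmi(pair_count=500, left_count=2000, right_count=8000, total=1_000_000))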

Yogi Berra said that "sometimes you can observe a lot just by watching". The point here is that sometimes you can analyze a lot just by counting. And while understanding is hard, counting is easy.

Posted by Mark Liberman at November 25, 2003 06:25 AM