Language Log: Computer does something-or-other to Moby Dick -- in 9.5 seconds

March 05, 2005


According to a March 3 NYT article by Noah Shachtman, a company named Attensity has developed software that can "parse" Moby Dick in 9 and a half seconds. Why? "By labeling subjects and verbs and other parts of speech, Attensity's software gives the documents a definable structure, a way to fit into a database. And that helps turn day-to-day chatter into information that is relevant and usable." And the CIA, which helped fund the company, doesn't care about Moby Dick, but wants to use Attensity's software to "comb through e-mail messages and chat room talks".

This general sort of technology comes under the heading of "Information Extraction from Text", sometimes abbreviated "IE", or "text data mining", or "Automatic Content Extraction". (I'm a member of a group at Penn that's been working on information extraction from biomedical text.) The NYT article also describes applications that are really "information retrieval" (IR) rather than IE: "Looking through a company's customer file for a person named Bonds, for example, is fairly simple. But if the data is unstructured - if the word 'bonds' hasn't been classified as the name of a ballplayer or as an investment option - searching becomes much more difficult."
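The "Bonds" example can be sketched in a few lines of Python. This is only an illustration of the difference between matching strings and matching classified entities: the two records, the entity labels, and the function names are all hypothetical, and the labels are assigned by hand rather than by any extraction software.

```python
# Hypothetical records; the entity labels are hand-assigned for illustration.
records = [
    {"text": "Bonds hit two home runs last night.",
     "entities": {"Bonds": "PERSON"}},
    {"text": "Municipal bonds rallied on Tuesday.",
     "entities": {"bonds": "FINANCIAL_INSTRUMENT"}},
]

def keyword_search(term):
    """Plain string matching: return every record whose text contains the term."""
    return [r["text"] for r in records if term.lower() in r["text"].lower()]

def entity_search(term, entity_type):
    """Return only records where the term was classified as the given entity type."""
    return [
        r["text"]
        for r in records
        if any(name.lower() == term.lower() and label == entity_type
               for name, label in r["entities"].items())
    ]

print(keyword_search("bonds"))           # both records match
print(entity_search("bonds", "PERSON"))  # only the ballplayer
```

The hard part, of course, is producing the entity labels in the first place, and that is exactly where accuracy matters.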

The NYT article also mentions Inxight, Intelliseek, and some other companies. To put this work in context, you'd want to sketch the history of DARPA's TIPSTER program, the series of "Text Retrieval", "Multilingual Entity" and "Message Understanding" conferences (TREC, MET and MUC), DARPA's ACE program, DARPA's CALO project, the failure of Whizbang! (some of whose technology went to Inxight, who in turn licensed it to Intelliseek, who also hired some of the people from Whizbang!'s Pittsburgh lab), and a few other things as well. Against this background, you'd want to know what kinds of analysis Attensity's software is really doing, and how accurate the various types of analysis are. That would enable you to (begin to) evaluate how well the software works in one sort of application or another. The software's speed is also worth knowing, but it's a second-level question. (And it also matters what the hardware is: "analyzing 200,000 words in 10 seconds" doesn't mean much, unless we know whether this is being done on a single machine, or a cluster of 1,000.)
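To make the hardware point concrete, here is a back-of-the-envelope calculation in Python. It uses only the round numbers quoted above (200,000 words, 10 seconds, 1,000 machines); the function name is mine, and the arithmetic assumes perfectly even parallel scaling, which real clusters don't achieve.

```python
def words_per_second_per_machine(words, seconds, machines=1):
    """Throughput attributable to each machine, assuming perfectly even scaling."""
    return words / seconds / machines

single = words_per_second_per_machine(200_000, 10)
cluster = words_per_second_per_machine(200_000, 10, machines=1_000)

print(f"single machine:   {single:,.0f} words/sec")
print(f"cluster of 1,000: {cluster:,.0f} words/sec per machine")
```

The same headline number covers a thousandfold difference in per-machine throughput, depending on the hardware behind it.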

Alas, the article gives us no real idea at all what Attensity's software is doing. Shachtman doesn't give us any coherent technical description of its analyses, or any examples of the analysis that it performs on any specific sentences. As a result, we can't tell whether it's trying to provide a full parse, or is just doing part-of-speech tagging and perhaps some noun-phrase or clause chunking. It's apparently doing some tagging of some types of entity references, but we have no idea which ones, or how well. It may be trying to infer some relations among text strings, or relate text strings to stable cross-document references (e.g. as identifiable people, places, organizations and so on). It's possible that it's even trying to do some sort of predicate-argument analysis, or at least analysis of the relations implicit in certain specified types of events and actions. That's certainly the implication of the description given by an Attensity customer:

"Attensity shows how the words all relate to one another - all the actors, objects and actions in a document, and how they connect."

But Shachtman seems confused about what the differences might be among these various sorts of analysis:

MAYBE sixth-grade English was more helpful than you thought. One of the dullest grammar exercises is being used to help find potential terrorists, and save companies a bundle.

Diagramming sentences - picking out subject, verb, object, adjective and other parts of speech - has been a staple of middle and high school grammar lessons for decades. Now, with financing from the Central Intelligence Agency, a California firm is using the technique to comb through e-mail messages and chat room talks, which can be a rich lode of corporate and government information, and a tough one to mine.

Shachtman seems to think that "diagramming sentences" is a matter of assigning part-of-speech labels to words. But actually, it's a kind of parsing, which assigns structural labels and relationships, recursively, to groups of words. On the other hand, "subject" and "object" are not "parts of speech", but rather (simplifying a bit) relationships between a noun phrase and verb. So Shachtman is recursively confused -- "diagramming sentences" is more than "picking out parts of speech", but two of the four examples he gives of "parts of speech" are actually examples of the type of relationships among groups of words that "diagramming sentences" is supposed to describe. And showing "all the actors, objects and actions in a document" would be another level of analysis entirely.
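A toy example may help keep these levels apart. The sketch below is hand-annotated, not the output of Attensity's software or of any parser; the sentence, the label names, and the tuple-based tree format are all my own assumptions, chosen for brevity.

```python
# Hand-annotated toy example; not the output of any parser.
sentence = ["The", "captain", "chased", "the", "white", "whale"]

# Part-of-speech tagging: exactly one label per word.
pos_tags = ["determiner", "noun", "verb", "determiner", "adjective", "noun"]

# Parsing: labels on groups of words, nested recursively.
# Each node is (label, children); integer leaves index into `sentence`.
parse = ("S", [
    ("NP", [0, 1]),
    ("VP", [2, ("NP", [3, 4, 5])]),
])

def leaves(node):
    """Collect the words covered by a node of the parse tree."""
    if isinstance(node, int):
        return [sentence[node]]
    _label, children = node
    return [word for child in children for word in leaves(child)]

# "Subject" and "object" are relations between a noun phrase and the verb,
# read off the structure: the NP directly under S, and the NP inside the VP.
subject_np = parse[1][0]
verb_phrase = parse[1][1]
object_np = verb_phrase[1][1]

relations = {
    "verb": sentence[verb_phrase[1][0]],
    "subject": " ".join(leaves(subject_np)),
    "object": " ".join(leaves(object_np)),
}

print(list(zip(sentence, pos_tags)))
print(relations)
```

Note that "subject" and "object" never appear in the list of part-of-speech tags: they can only be stated once the words have been grouped into phrases.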

Shachtman's confusion, I'm afraid, reflects a historical mistake in his lede. Grammatical analysis of whatever kind -- whether diagramming sentences, assigning parts of speech, determining (co-)reference, or analyzing semantic relationships among words or their referents -- is far from being a "staple" of middle and high school education. Rather, it's become more and more rare in the American educational system at all levels. When it's done at all, it's less and less likely that the teachers themselves actually know how to do what they're trying to teach, since they themselves have never learned. And neither, it seems, have the reporters.

The botched description of "grammar exercises" in Shachtman's lede is not important at all. But it does help us understand why he apparently had no conception at all of what sorts of analysis Attensity's software might be doing, and therefore didn't ask any of the relevant questions while reporting his article, or present any of the relevant answers when he wrote it.

[Update: Martha Palmer emailed:

Check out David L. Bean and Ellen Riloff, "Corpus-Based Identification of Non-Anaphoric Noun Phrases", ACL-99, pp. 373-380.
It looks like good ol' muc technology, souped up regular expression pattern matching....w/ some pos tagging and some semantic grammar rules...
(and lots of hype, of course!)

(David L. Bean is co-founder and CTO of Attensity)

Well, there might have been some changes in their algorithms since 1999. And there's nothing wrong with a little good old American hype when you get a chance to be featured in the NYT. But in the best of all possible worlds, the tech writer assigned to a story like this at the NYT would understand the linguistic issues well enough to identify the underlying technology briefly but accurately -- instead of incoherently and misleadingly. ]

[Update 3/7/2005: Shachtman's article appeared on 3/5 in the IHT under the headline "Grammar become tool for CIA and businesses." ]

[Update 2/7/2005: Cassandre Creswell points out a more recent article: David Bean and Ellen Riloff, "Unsupervised Learning of Contextual Role Knowledge for Coreference Resolution", HLT-NAACL-2004. ]

Posted by Mark Liberman at March 5, 2005 10:04 AM