April 30, 2004

Intelligence vs. random fluctuations

BlogPulse scans about 750,000 weblogs for "Key Phrases, Key People, BlogBites, and Top Links", and displays the results on a day-by-day basis. The algorithms are said to look for "bursty" items rather than simply common ones. As I understand it, this means that "George Bush" (for example) shouldn't show up in "key people" unless the number of mentions of him increases significantly, relative to some estimate of the expected background level.
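To make the "bursty versus common" distinction concrete, here is a minimal sketch in Python. It is not BlogPulse's algorithm, which the site doesn't describe. It assumes that daily counts are roughly Poisson (so the variance equals the mean), and the function name `burst_score` and all the counts are invented for illustration.

```python
from math import sqrt

def burst_score(today_count, history):
    """Standardized excess of today's count over the historical daily mean.

    Assumption (not from BlogPulse): counts are roughly Poisson, so the
    standard deviation is estimated as the square root of the mean.
    """
    mean = sum(history) / len(history)
    # Floor the variance at 1 so that very rare items don't divide by ~0.
    sd = sqrt(max(mean, 1.0))
    return (today_count - mean) / sd

# Hypothetical daily mention counts, made up for this example.
common = burst_score(5200, [5000, 5100, 4900, 5000])  # frequent but steady
bursty = burst_score(60, [4, 6, 5, 5])                # rare but spiking

print(f"common name: {common:.2f}, bursty name: {bursty:.2f}")
```

Under this scoring, a name mentioned thousands of times a day ranks below a rare name whose count jumps tenfold. Real term counts tend to be more variable than a Poisson model allows, so a serious system would need a better estimate of the background variance.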

This is a demo project from the "Intelliseek Applied Research Center", which was set up after Intelliseek "brought on board key members of WhizBang! Labs, a Pittsburgh technology team specializing in natural language programming, text mining, data retrieval and other technologies." Other refugees from WhizBang!, an unfortunate casualty of the dot.com bust, include Fernando Pereira and Andrew McCallum.

Here's the picture that Intelliseek uses to convey their "technology vision":

It's got a lot in common with the vision of DARPA's current TIDES project ("Translingual Information Detection, Extraction and Summarization"). DARPA has been supporting research in related areas for several decades, and it's clearly about time for this investment to start paying off.

The technology in BlogPulse seems to work well in some areas. Detection of personal names, for instance, appears to produce few false positives. This is expected, since named entity tagging is a pretty mature technology. I'm more impressed that their listing of "Key Phrases" seems to be picking up strings that really are English phrases, as opposed to word sequences that happen to occur more often than expected but cross-cut phrase boundaries. Phrase-finding with a low rate of false positives is not easy.

However, it looks like BlogPulse is not trying to connect alternate forms of names across documents. For example, yesterday's references to Elton John are listed as instances of the name "Sir Elton", and the references to Kofi Annan are listed as instances of "General Kofi Annan" (apparently because the algorithms truncated "Secretary-General Kofi Annan"). Elton John and Kofi Annan are famous enough that if BlogPulse were tracking entity mentions across documents, and doing a decent job of it, they should be getting these right. So I conclude that they've punted on this one -- and this is a big thing to leave out if you really want to turn unstructured text data into "intelligence". In my opinion, (what's sometimes called) cross-document entity tracking is a key problem for technologies of this general kind, maybe the key problem. That's not just because most users want to see the indexing done right -- it's also because if you make the connections accurately, you get a graph (of entity mentions across documents) that you can use for all kinds of other neat (and non-obvious) stuff.
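As a rough illustration of what that graph might look like, here is a small Python sketch. It assumes the hard part is already solved: the `ALIASES` table mapping surface forms to one canonical name is hand-made and hypothetical, as are the function name `mention_graph` and the document IDs. It says nothing about how BlogPulse or anyone else actually resolves names.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical, hand-built alias table. Producing this mapping automatically
# is the real cross-document entity tracking problem.
ALIASES = {
    "Sir Elton": "Elton John",
    "Elton John": "Elton John",
    "General Kofi Annan": "Kofi Annan",
    "Secretary-General Kofi Annan": "Kofi Annan",
    "Kofi Annan": "Kofi Annan",
}

def mention_graph(docs):
    """Build entity->documents links and entity co-mention counts.

    docs maps a document id to the list of name strings found in it.
    Names missing from ALIASES are kept as their own entity.
    """
    entity_docs = defaultdict(set)
    co_mentions = defaultdict(int)
    for doc_id, mentions in docs.items():
        # Collapse surface forms to canonical names, once per document.
        entities = sorted({ALIASES.get(m, m) for m in mentions})
        for entity in entities:
            entity_docs[entity].add(doc_id)
        # Count each unordered pair of entities sharing a document.
        for a, b in combinations(entities, 2):
            co_mentions[(a, b)] += 1
    return entity_docs, co_mentions

# Made-up example documents.
docs = {
    "d1": ["Sir Elton", "Kofi Annan"],
    "d2": ["Elton John", "General Kofi Annan"],
    "d3": ["Sir Elton"],
}
entity_docs, co_mentions = mention_graph(docs)
```

With the aliases merged, all three documents attach to a single "Elton John" node, and the co-mention counts give weighted edges between entities that can feed whatever graph analysis you like.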

I'm also not convinced that BlogPulse is doing a very good job of distinguishing random statistical blips in term frequency from significant trends. It's hard to judge this for "key people", since any name that occurs fairly often probably reflects discussions that are connected at least via the individual named, and without a fair amount of fussing with the data, it's hard for me to judge whether (say) Kofi Annan really was discussed significantly more often yesterday than usual.

However, for the "Key Phrases", it's a lot easier to make a judgment on this point, and my evaluation is that BlogPulse hasn't got it right yet. For example, "Key Phrase" #3 (of 40) for yesterday (4/29/2004) was "very good friend", and as far as I can tell from the list of "sample citations" given, none of them have anything to do with any of the others. I'll assume that "very good friend" usually occurs fewer than 19 times a day, but the fact that it came up 19 times yesterday (in the new entries on 750,000 blogs) seems to have been just a random statistical fluctuation, not any sort of leading indicator of warm feelings of fellowship sweeping through the blogosphere.

I feel the same way about many of yesterday's other "Key Phrases". Maybe the BlogPulse algorithm for estimating likelihood ratios needs a tune-up? Or maybe they forgot the Bonferroni correction or some appropriate approximation to it? This is likely a source of problems, since the number of tests implicitly done is quite large (perhaps as large as the count of all the N-grams in the day's blogtext, for 2<=N<=4), and so it won't be easy to steer between the Scylla of fantasy and the Charybdis of obliviousness.
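To see why the correction matters, here is a minimal Python sketch of a Bonferroni-style check. Everything numeric in it is an assumption made for illustration: I don't know the real background rate of "very good friend" (the 10 per day below is invented), the count of one million candidate N-grams is a guess, and a Poisson model of daily counts is a simplification. The helper names are hypothetical too.

```python
from math import exp, log, lgamma

def poisson_tail(k, lam, max_terms=1000):
    """P(X >= k) for X ~ Poisson(lam), lam > 0, summing the upper tail directly."""
    if k <= 0:
        return 1.0
    # First term is P(X = k), computed in log space for stability.
    term = exp(-lam + k * log(lam) - lgamma(k + 1))
    total = 0.0
    i = k
    for _ in range(max_terms):
        total += term
        i += 1
        term *= lam / i          # P(X = i) from P(X = i - 1)
        if term < total * 1e-17:  # remaining terms are negligible
            break
    return min(total, 1.0)

def bonferroni_significant(p_value, num_tests, alpha=0.05):
    """True if p_value survives a Bonferroni correction over num_tests tests."""
    return p_value < alpha / num_tests

# Assumed numbers, not measurements: background rate and test count are guesses.
background_rate = 10.0      # hypothetical usual daily count of the phrase
observed = 19               # yesterday's count, from the BlogPulse listing
num_tests = 1_000_000       # hypothetical number of candidate N-grams

p = poisson_tail(observed, background_rate)
print(f"p = {p:.4g}")
print("significant, uncorrected:", p < 0.05)
print("significant, Bonferroni: ", bonferroni_significant(p, num_tests))
```

Under these assumptions, a count of 19 against a background of 10 looks significant on its own but falls far short of the corrected threshold. With an uncorrected threshold of 0.05 and a million candidate phrases, up to about 50,000 could be flagged by chance alone, which is more than enough to fill a top-40 list with noise. Bonferroni is conservative, so a less strict approximation might be a better fit in practice.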

I'm not sure what to make of the BlogBites, which are "weblog entries from the Blogsphere which showcase the past day's burstiest themes." The site doesn't tell us what a "theme" is, algorithmically, and I can't say that their selection strikes me as getting at the essence of anything. I wouldn't be shocked to see the same list presented by some human as his or her idea of the most important posts of the day. But on the other hand, I also wouldn't be surprised to see the list emerging from a selection of first paragraphs at random from the day's scraping of blogtext.

One final comment: the limitation to day-by-day textual listing of Key X's is too bad. It would be nice to see graphs of mentions of Key X's over time -- weeks or months. Then you could really see the pulse of the blogs.

Posted by Mark Liberman at April 30, 2004 09:00 AM