November 23, 2004

Google Scholar

After reading Daniel Akst's article on computer text generation in yesterday's NYT ("Computers as Authors? Literary Luddites Unite!"), I decided to use it to try out Google Scholar.

Akst describes his first example of computer-generated text like this:

That pregnant opening paragraph was written by a computer program known as Brutus.1 that was developed by Selmer Bringsjord, a computer scientist at Rensselaer Polytechnic Institute, and David A. Ferrucci, a researcher at I.B.M.

Probing Google Scholar with {Brutus Bringsjord Ferrucci}, the first hit is a .pdf for the 52-page preface of a 1999 book by Bringsjord and Ferrucci, "Artificial Intelligence and Literary Creativity: Inside the Mind of Brutus, a Storytelling Machine". There are 37 other hits, and more than half look pretty good. It's pretty clear that we're talking about five-year-old work with some more recent commentary, but that's the fact of the matter, not any flaw in Google Scholar's search and retrieval.

After the next computer-generated passage, Akst writes:

What you just read is the work of StoryBook, "an end-to-end narrative prose generation system that utilizes narrative planning, sentence planning, a discourse history, lexical choice, revision, a full-scale lexicon and the well-known Fuf/Surge surface realizer." Believe it or not, that description was written not by a computer but by the humans who created StoryBook, Charles B. Callaway and James C. Lester, who are computer scientists.

Asking Google Scholar about {StoryBook Callaway Lester} gives four hits, of which the first leads to a .pdf of a 2001 book chapter by Charles Callaway, "A Computational Feature Analysis for Multilingual Character-to-Character Dialogue". The others are also relevant.

So even without actually reading any of the links, we've learned that Akst is not reporting breaking news here. He doesn't pretend to -- his article is billed as "an essay", not "a news flash" -- but the context will likely lead some readers to imagine that there's been some recent breakthrough. On the contrary, the most interesting aspect of this strand of computational linguistics, from my perspective at least, is how old-fashioned it is. For example, Callaway's work (which is more recent than Bringsjord's) is based on FUF, an implementation of "functional unification grammar", which is High Classic AI. I'll postpone an explanation of the issues for another time, but let's say that a plausible analogy would be unification:computational linguistics::I.M. Pei:architecture. Or perhaps unification:computational linguistics::John Havlicek:basketball.

Anyhow, Google Scholar has done just fine so far. How does it compare with regular Google? Well, if we probe regular Google with the same two search strings, {Brutus Bringsjord Ferrucci} and {StoryBook Callaway Lester}, it turns up pretty much the same stuff, and more besides, without too much irrelevant junk. But these are pretty well-specified search strings. What about less specific ones? The single words "Brutus" and "Storybook" generate a big pile of irrelevant stuff at the top of the returns from both searches, while {Brutus Bringsjord} and {StoryBook Callaway} are specific enough to get useful information at the top in both cases. So in this test, we're not seeing any evidence of a real benefit from limiting the search to the scholarly literature.

In principle, Google Scholar ought to offer not just more focused search, but also links to some material not normally indexed, because it spiders some journals and other sources not generally accessible. That didn't turn out to matter in this case, but sometimes it ought to make a big difference. Just as important, Google Scholar sometimes offers "cited by N" links that let you see how many other indexed sources cited a given document. Even better, you can click the link to see the list, and iterate the process to explore the citation space. I didn't report on those links in this case, though a couple of minutes of poking around turned up some interesting things.
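
Clicking a "cited by" link and then repeating the process on each result amounts to a breadth-first walk over a citation graph. Here is a minimal Python sketch of that idea, under loud assumptions: it does not query Google Scholar at all, the function name explore_citations and the cited_by dictionary are made up for illustration, and the document labels are toy placeholders rather than real search results.

```python
from collections import deque


def explore_citations(start, cited_by, max_depth=2):
    """Breadth-first walk over 'cited by' links.

    cited_by maps a document label to the labels of documents citing it
    (a hypothetical, hand-built stand-in for the "cited by N" lists).
    Returns a dict mapping each reachable document to its distance
    (number of clicks) from the starting document.
    """
    depth = {start: 0}
    queue = deque([start])
    while queue:
        doc = queue.popleft()
        if depth[doc] == max_depth:
            continue  # do not follow links beyond max_depth clicks
        for citer in cited_by.get(doc, []):
            if citer not in depth:  # skip documents already visited
                depth[citer] = depth[doc] + 1
                queue.append(citer)
    return depth


# Toy data, invented for illustration only.
toy_cited_by = {
    "book": ["review", "survey"],
    "review": ["survey", "thesis"],
    "survey": ["thesis"],
}

if __name__ == "__main__":
    for doc, hops in explore_citations("book", toy_cited_by).items():
        print(hops, doc)
```

The visited check matters because citation lists overlap heavily (here "survey" and "thesis" are each reachable two ways), and the depth cap keeps the walk from wandering through the whole literature.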

Anyhow, I'm adding Google Scholar to my Firefox toolbar, and will continue to try it out.


Posted by Mark Liberman at November 23, 2004 10:23 AM