February 13, 2007

NLP in search

A few days ago, the NYT's Miguel Helft had an interesting article on the PARC/Powerset deal ("In a Search Refinement, a Chance to Rival Google", 2/9/2007):

[Xerox] PARC is ... licensing a broad portfolio of patents and technology to a well-financed start-up with an ambitious and potentially lucrative goal: to build a search engine that could some day rival Google.

The start-up, Powerset, is licensing PARC’s “natural language” technology — the art of making computers understand and process languages like English or French. Powerset hopes the technology will be the basis of a new search engine that allows users to type queries in plain English, rather than using keywords.

Helft cast Fernando Pereira in the "yes, but" role:

PARC’s natural-language technology is among the “most comprehensive in existence,” said Fernando Pereira, an expert in natural language and the chairman of the department of computer and information science at the University of Pennsylvania. But by itself, it will not guarantee Powerset’s success, Mr. Pereira said.

“The question of whether this technology is adequate to any application, whether search or anything else, is an empirical question that has to be tested,” Mr. Pereira added.

In the old days, that would have been the end of the discussion. But now?

As Fernando observed ("searching", 2/9/2007):

Miguel Helft wrote a clear and balanced piece, which was not easy given the complexity of the issues. I talked with him for over half an hour. He chose representative quotes from what I said, but it's of course impossible to go into details within the length limits of the daily press. I wrote about the issues in previous postings, and I'm writing a new posting that works out some of the arguments more fully.

The promised new post came out a bit later the same day: "Powerset in PARC deal".

But from a certain point of view, Helft's NYT article was just a burst of off-stage noise in the ongoing blog debate between Fernando and Matt Hurst.

This drama started small, with a brief note by Matt linking to an earlier NYT article about Powerset ("Powerset In the New York Times", 1/1/2007):

A nice little article summarizing the playing field for novel search going in to 2007.

Fernando disagreed with the "nice" part (Fernando: "Powerset In the New York Times", 1/1/2007):

It's good to see Barney and his colleagues in the Times. However, I didn't think much of the article. As is unfortunately common in the MSM, there is no substance in the story, except for who invested and how much. What is "natural language search," (NLS) in terms that would make sense to the average reader of the business section of the Times? If current search engines do not use NLS, is it just because they are too fat and distracted? Or are there technical, let alone scientific reasons for the lack of NLS? The writer missed the opportunity to illustrate the issues and challenges with some concrete examples, for instance some of those that Barney discussed in his blog a while ago.

And he went on to discuss the issues that the NYT article omitted. Matt responded, and a veritable avalanche of interesting posts was underway:

Matt: "Natural Language Search", 1/1/2007
Fernando: "Natural Language Search" (response), 1/2/2007
Matt: "The Two Faces of Natural Language Search", 1/3/2007
Fernando: "The Two Faces of Natural Language Search" (response), 1/3/2007
Matt: "Ask Innovates Search UI", 2/1/2007
Fernando: "Ask Innovates Search UI" (response), 2/1/2007
Matt: "Why NLP Is A Disruptive Force", 2/1/2007
Fernando: "Why NLP Is A Disruptive Force" (response), 2/3/2007
Matt: "NLP and Search: Free Your Mind", 2/11/2007
Fernando: "NLP and Search: Free Your Mind" (response), 2/11/2007
Matt: "The time to build NLP applications", 2/13/2007

If you're interested in linguistic technology (or in the rhetorical evolution of that emerging form, the weblog debate), you'll want to spend a leisurely brunch reading the whole series.

Some of Fernando's earlier posts are also relevant, including "Germany quits EU-based search engine project" (1/7/2007), and "The cost of search computations" (1/26/2007).

Posted by Mark Liberman at February 13, 2007 08:34 AM