June 18, 2007

Data Catalysis

I'm back in Philadelphia, after a quick jaunt to Kyoto for ISUC2007. One of the most interesting presentations there was Patrick Pantel's "Data Catalysis: Facilitating Large-Scale Natural Language Data Processing":

Large-scale data processing of the kind performed at companies like Google is within grasp of the academic community. The potential benefits to researchers and society at large are enormous. In this article, we present the Data Catalysis Center, whose mission is to stage and enable fast development and processing of large-scale data processing experiments. Our prototype environment serves as a pilot demonstration towards a vision to build the tools and processing infrastructure that can eventually provide level access to very large-scale data processing for academic researchers around the country. Within this context, we describe a large scale extraction task for discovering the admissible arguments of automatically generated inference rules.

Imagine astronomy if all the large telescopes were owned by private companies and used to develop trade secrets; or particle physics if all the accelerators had a similar socio-economic role. Fernando Pereira used that analogy a few years ago to describe the emerging situation in computational linguistics.

Patrick's idea, as I understand it, is not to create yet another supercomputer center. Instead, his goal is a model that other researchers can replicate, which he sees as part of "large data experiments in many computer science disciplines ... requiring a consortium of projects for studying the functional architecture, building low-level infrastructure and middleware, engaging the research community to participate in the initiative, and building up a library of open-source large data processing algorithms".

Posted by Mark Liberman at June 18, 2007 07:33 AM