June 21, 2007

More on data catalysis

In commenting on Patrick Pantel's "Data Catalysis" paper, I quoted a remark that Fernando Pereira made a few years ago, summing up the problem of giving computational linguists effective access to web-scale data. This was after a talk by Peter Norvig on aspects of Google's infrastructure; Fernando said something like "I feel as if we're particle physicists and you have the only accelerator".

Fernando read Patrick's paper, and laid out on his blog the way that he feels about the problem now. There are two key quotes:

"I'm worried about grid-anything. In other fields, expensive grid efforts have been more successful at creating complex computational plumbing and bureaucracies than at delivering new science."

"Our problem is not the lack of particle accelerators, but the lack of the organizational and funding processes associated with particle accelerators."

I strongly agree with the first of these -- it's why I emphasized that Patrick is trying to create modular and shareable architectures and software, not yet another supercomputer center. And I also agree with the points that Fernando makes about choosing problems, and about the serious mismatch between current research opportunities and current academic models for funding, staffing and research management.

However, I continue to believe that Patrick is addressing an important set of issues. As both Patrick and Fernando observe, the hardware that we need is not prohibitively expensive. But there remain significant problems with data access and with infrastructure design.

On the infrastructure side, let's suppose that we've got $X to spend on some combination of compute servers and file servers. What should we do? Should we buy X/5000 $5K machines, or X/2000 $2K machines, or what? Should the disks be local or shared? How much memory does each machine need? What's the right way to connect them up? Should we dedicate a cluster to Hadoop and map/reduce, and set aside some other machines for problems that don't factor appropriately? Or should we plan to use a single cluster in multiple ways? What's really required in the way of ongoing software and hardware support for such a system?
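For readers unfamiliar with the map/reduce idiom, here is a minimal sketch of the pattern that Hadoop implements, counting words as the standard toy example. A problem "factors appropriately" when it can be expressed as an independent per-record map step followed by a per-key reduce step; the function names below are illustrative, not Hadoop's actual API.

```python
# A toy single-machine version of the map/reduce pattern.
from itertools import groupby
from operator import itemgetter

def map_step(line):
    # Emit (key, value) pairs independently for each input record;
    # independence is what lets the map phase run in parallel.
    return [(word.lower(), 1) for word in line.split()]

def reduce_step(key, values):
    # Combine all values that share one key; here, sum the counts.
    return (key, sum(values))

def map_reduce(records):
    pairs = [kv for rec in records for kv in map_step(rec)]
    pairs.sort(key=itemgetter(0))          # the "shuffle": group by key
    return [reduce_step(key, [v for _, v in group])
            for key, group in groupby(pairs, key=itemgetter(0))]

counts = dict(map_reduce(["the cat sat", "the cat ran"]))
# counts["the"] == 2, counts["sat"] == 1
```

Problems whose intermediate results can't be partitioned by key this way are the ones that would need the separately provisioned machines mentioned above.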

To some extent, the answers to such questions depend on your local problems and opportunities. (Maybe the key constraint turns out to be power and cooling, for example.) But with some luck, people like Patrick will come up with experiences that others can copy (or avoid, depending on how they turn out), and even with whole designs that can be replicated on various scales.

These are problems worth solving, even if they're not the ones that Fernando lays out in his post.

Posted by Mark Liberman at June 21, 2007 07:44 AM