December 30, 2003

More on the computational linguistics of smells

Fernando Pereira picks up the question of the computational linguistics of smells:

Surprising as it might seem to outsiders, this question is central to modern computational linguistics. One side will argue that without perceptual grounding, anything we glean from texts is a poor, fake proxy. The other feels that the grounding of much of the language we use, especially that pertaining to social and technical topics, is other language. ...

Current information-extraction techniques based on labeling a bunch of documents and learning pattern matchers from the examples take less advantage than we'd like of co-occurrence statistics. Some research ... suggests that one can do much better using lots of unlabeled data, but at present those techniques are a black art: sometimes they work, sometimes they don't, and it's not yet clear why. I think that part of the problem is that existing techniques focus on just one kind of entity and very superficial features, while the way we learn that CPEB may denote a kind of protein involves seeing the term used in relation to several other terms, themselves belonging to rich terminological networks of which we have some knowledge.
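The co-occurrence idea here can be made concrete with a toy sketch: represent each term by the words it appears near in unlabeled text, and judge an unknown term (say, CPEB) by how closely its distributional profile matches terms whose category we already know. Everything below — the miniature corpus, the choice of "actin" as a known protein, the window size — is an illustrative assumption, not anything from the post; real systems of the kind Pereira describes use vastly larger corpora and richer features.

```python
from collections import Counter
from math import sqrt

# Toy unlabeled "corpus"; in practice this would be millions of sentences.
corpus = [
    "CPEB binds mRNA and regulates translation",
    "the protein CPEB is expressed in neurons",
    "actin binds mRNA associated complexes rarely",
    "the protein actin is expressed in muscle",
    "the city grew rapidly in the 1990s",
]

def context_vector(term, sentences, window=2):
    """Count words co-occurring with `term` within +/- `window` tokens."""
    counts = Counter()
    for s in sentences:
        toks = s.split()
        for i, t in enumerate(toks):
            if t == term:
                lo, hi = max(0, i - window), min(len(toks), i + window + 1)
                for j in range(lo, hi):
                    if j != i:
                        counts[toks[j]] += 1
    return counts

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# The unknown term's co-occurrence profile aligns with the known protein's,
# not with an unrelated word's.
unknown = context_vector("CPEB", corpus)
protein = context_vector("actin", corpus)   # assumed known protein name
other = context_vector("city", corpus)
print(cosine(unknown, protein) > cosine(unknown, other))  # → True
```

Of course, this only exploits superficial lexical context; Pereira's point is that reliably learning what CPEB denotes requires relating it to whole terminological networks, not just bags of nearby words.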

Read the whole thing, including Fernando's Proustian ruminations on mildew smells across time and space.

I conjecture that biomedical text may be the best initial testbed for the kind of research that Fernando describes (as he broadly hints in his note), since it's easy to get access not only to billion-word text corpora but also to a rich and varied universe of bioinformatic databases and ontology-attempts. The fact that the results may often be intrinsically worthwhile is another motivation to look in that domain first.

Posted by Mark Liberman at December 30, 2003 12:30 PM