Back in the fall of 2001, some of us at Penn put together a proposal to the National Science Foundation for research on automatic information extraction from biomedical text. Most of the proposal was about what we planned to do and how we planned to do it. But in the atmosphere of two years ago, we felt that we also had to say a few words to validate the problem itself, the problem of creating software to "understand" ordinary scientific journal articles. This was not because the task is too hard (though that is a reasonable fear!), but because some NSF reviewers might have thought that it was about to become too easy. After all, the inventor of the World Wide Web was evangelizing for another transformative vision, the Semantic Web, which promised to make our problem a trivial one.
As we wrote in the proposal narrative:
Some believe that IE technology promises a solution to a problem that is only of temporary concern, caused by the unfortunate fact that traditional text is designed to convey information to humans rather than to machines. On this view, the text of the future will wear its meanings on its sleeve, so to speak, and will therefore be directly accessible to computer understanding. This is the perspective behind the proposed "Semantic Web" [BLHL01], an extension of the current hypertext web "in which information is given well-defined meaning," thereby "creat[ing] an environment where software agents . . . can readily carry out sophisticated tasks for users." If this can be done for job descriptions and calendars, why not for enzymes and phenotypes?
In the first place, one may doubt that the Semantic Web will soon solve the IE problem for things like job descriptions. The Semantic Web is the current name for an effort that began defining the W3C's Resource Description Framework (RDF) more than five years ago, and this effort has yet to have a significant general impact in mediating access to information on the web. Whatever happens with the Semantic Web, no trend in the direction of imposing a complete and explicit knowledge representation system on biomedical publishing is now discernible. In contrast, we will argue that high-accuracy text analysis for the biomedical literature is a plausible goal for the near future. Partial knowledge-representation efforts such as the Gene Ontology Consortium's GO [Con00] will help this process, not replace it. The technology needed for such text analysis does not require HAL-like artificial intelligence, but it will suffice to extract well-defined patterns of information accurately from thousands or even millions of documents in ordinary scientific English.
The past two years have confirmed this perspective. Even in bioinformatics, where some might think that everything should be clear and well defined, the attempt to provide a universal ontology (and a universal description language based on it) is not even close to providing a basis for expressing the content of a typical scientific article in the biomedical field. Don't get me wrong -- the kind of information extraction that we (and many others) are working on is certainly possible and valuable. But it's all interpretive and local, in the sense that it creates a simple structure, corresponding to a particular way of looking at some aspect of a problem (like the relationships among genomic variation events and human malignancies), and then interprets each relevant chunk of text to fill in pieces of that structure. It doesn't aim to provide a complete representation of the meaning of the text in a consistent and universal framework.
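To make "interpretive and local" concrete, here is a minimal sketch in Python -- with invented patterns, names, and an invented example sentence, not our actual system -- of that kind of template-filling extraction: define a small structure for one relationship of interest, and fill its slots from whatever sentences happen to match.

```python
import re
from dataclasses import dataclass
from typing import List

# A deliberately simple template: one way of looking at one aspect of the
# literature (gene variation events and their associated malignancies).
@dataclass
class VariationMalignancyRelation:
    gene: str
    variation: str          # e.g. "mutation", "deletion", "amplification"
    malignancy: str
    evidence_sentence: str

# Hypothetical surface patterns; a real extractor would use many more,
# plus a tagger/parser and curated lexicons of gene and disease names.
PATTERNS = [
    re.compile(
        r"(?P<variation>mutation|deletion|amplification)s? of (?P<gene>[A-Z0-9]+)"
        r".{0,80}?(?:in|with) (?P<malignancy>[a-z ]*(?:carcinoma|leukemia|melanoma))",
        re.IGNORECASE,
    ),
]

def extract(sentence: str) -> List[VariationMalignancyRelation]:
    """Fill in pieces of the template from one sentence, if it matches."""
    relations = []
    for pattern in PATTERNS:
        for m in pattern.finditer(sentence):
            relations.append(
                VariationMalignancyRelation(
                    gene=m.group("gene"),
                    variation=m.group("variation").lower(),
                    malignancy=m.group("malignancy").strip().lower(),
                    evidence_sentence=sentence,
                )
            )
    return relations

if __name__ == "__main__":
    # Invented example sentence, for illustration only.
    s = "Deletions of CDKN2A were frequently observed in pancreatic carcinoma."
    for r in extract(s):
        print(r)
```

The structure filled in here says nothing about anything else the sentence might mean; that is the sense in which such extraction is local rather than a universal semantics.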
Recently, Clay Shirky has written an interesting general critique of the Semantic Web concept that is much more radical than what we dared to put into the staid columns of an NSF proposal. He starts with a bunch of stuff about syllogisms, which rather confused me, since syllogisms have been obsolete at least since Frege published his Begriffsschrift in 1879, and I haven't heard that the Semantic Webbers are trying to resurrect them. But Shirky ends with some ideas that I think are clear and true:
Any attempt at a global ontology is doomed to fail, because meta-data describes a worldview. The designers of the Soviet library's cataloging system were making an assertion about the world when they made the first category of books "Works of the classical authors of Marxism-Leninism." Charles Dewey was making an assertion about the world when he lumped all books about non-Christian religions into a single category, listed last among books about religion. It is not possible to neatly map these two systems onto one another, or onto other classification schemes -- they describe different kinds of worlds.
Because meta-data describes a worldview, incompatibility is an inevitable by-product of vigorous argument. It would be relatively easy, for example, to encode a description of genes in XML, but it would be impossible to get a universal standard for such a description, because biologists are still arguing about what a gene actually is. There are several competing standards for describing genetic information, and the semantic divergence is an artifact of a real conversation among biologists. You can't get a standard til you have an agreement, and you can't force an agreement to exist where none actually does.
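Shirky's gene example is easy to make concrete. Here is a toy sketch -- hypothetical element names and illustrative values, not any real standard -- of two equally reasonable XML descriptions of "a gene": one treats a gene as a located stretch of sequence, the other as whatever gives rise to a set of products. Either is trivial to write; mapping one onto the other is not, because each answers the "what is a gene?" question differently.

```python
import xml.etree.ElementTree as ET

# Hypothetical schema A: a gene is a located stretch of genomic sequence.
# (Element names and coordinate values are made up for illustration.)
gene_a = ET.Element("gene", id="TP53")
locus = ET.SubElement(gene_a, "locus", chromosome="17", strand="-")
ET.SubElement(locus, "start").text = "7668000"
ET.SubElement(locus, "end").text = "7688000"

# Hypothetical schema B: a gene is whatever gives rise to a set of products.
gene_b = ET.Element("gene", symbol="TP53")
products = ET.SubElement(gene_b, "products")
ET.SubElement(products, "transcript", id="TP53-201")
ET.SubElement(products, "protein", name="tumor protein p53")

print(ET.tostring(gene_a, encoding="unicode"))
print(ET.tostring(gene_b, encoding="unicode"))
```

Neither document is wrong, but schema A has nothing to say about products and schema B nothing about coordinates; the incompatibility lies in the worldview, not the syntax.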
Shirky also points out the connection between the Semantic Web and classical AI, a research program that seemed to be dead but is to some extent reincarnated in the Semantic Web and in the many similar efforts now out there.
There's an interesting question to be asked about why people persist in assuming that the world is generally Linnaean -- why mostly-hierarchical ontologies are so stubbornly popular -- in the face of several thousand years of small successes and large failures. I have a theory about this, which this post is too short to contain :-) ... It has to do with evolutionary psychology and the advantage of Linnaean ontologies for natural kinds -- that's for another post.
[Thanks to Uncle Jazzbeau for the reference to Shirky's article.]
[Unnecessary pedantic aside: it seems that the inventor of the Dewey Decimal System was Melvil Dewey, not "Charles Dewey" as Clay Shirky has it. Google doesn't seem to know any Charles Deweys in the ontology trade. I have to confess that I always thought it was John Dewey who designed the Dewey Decimal System, and I'm disappointed to find out that it was Melvil after all.]
[Update: Charles Stewart pointed me to a reasoned defense of the Semantic Web by Paul Ford. In effect, Ford argues that there is a less grandiose vision of the semantic web, according to which it just provides a convenient vehicle for encoding exactly the kind of local, shallow, partial semantics that IE ("information extraction") aims at.
Ford closes by saying that "on December 1, on this site, I'll describe a site I've built for a major national magazine of literature, politics, and culture. The site is built entirely on a primitive, but useful, Semantic Web framework, and I'll explain why using this framework was in the best interests of both the magazine and the readers, and how its code base allows it to re-use content in hundreds of interesting ways." I'll be interested in seeing that, because it's exactly what I haven't seen from Semantic Webbers up to now: any real applications that make all the Semantic Web infrastructure look like it works and is worth the trouble.]
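For what it's worth, that less grandiose vision is easy to illustrate. Here is a minimal sketch -- hypothetical vocabulary URIs and invented values, no particular RDF toolkit -- of how the output of the kind of local, template-filling extraction sketched above could be written out as RDF triples in N-Triples form. That is "local, shallow, partial semantics" in Ford's sense: a convenient interchange format for one narrow view of the text, not a universal representation of its meaning.

```python
# Serialize one filled extraction template as RDF triples (N-Triples syntax).
# The vocabulary namespace below is hypothetical, not any published ontology.
EX = "http://example.org/ie-vocab/"

def ntriples_for_relation(rel_id: str, gene: str, variation: str,
                          malignancy: str, pubmed_id: str) -> str:
    """Emit N-Triples for one extracted gene-variation/malignancy relation."""
    subject = f"<{EX}relation/{rel_id}>"
    lines = [
        f'{subject} <{EX}gene> "{gene}" .',
        f'{subject} <{EX}variationType> "{variation}" .',
        f'{subject} <{EX}malignancy> "{malignancy}" .',
        f'{subject} <{EX}extractedFrom> <https://pubmed.ncbi.nlm.nih.gov/{pubmed_id}/> .',
    ]
    return "\n".join(lines)

if __name__ == "__main__":
    # Invented values, matching the toy extraction example above.
    print(ntriples_for_relation("r1", "CDKN2A", "deletion",
                                "pancreatic carcinoma", "00000000"))
```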
[Update 11/16/2003: Peter van Dijck has posted an illustrated guide to "Themes and metaphors in the semantic web discussion."]
Posted by Mark Liberman at November 11, 2003 09:32 AM