Language Log: Progress in malignancy tagging

November 09, 2006

Progress in malignancy tagging

There's some new news from BIOIE, an NSF-sponsored research project on information extraction from biomedical text, which is one of my day jobs. I share faculty responsibility with Fernando Pereira and Aravind Joshi at Penn, and Pete White at Children's Hospital; but as usual in such projects, most of the real research is done by graduate students. One of those students, Yang Jin, has just had a paper accepted by BMC Bioinformatics: "Automated recognition of malignancy mentions in biomedical literature".

Last month, I posted about "Fable 2.0", an on-line system that automatically tags articles with mentions of genes, normalizes the mentions so that various different ways of referring to the same gene are connected, and lets you search millions of articles to find genes associated with arbitrary boolean combinations of keywords. Yang's new paper applies the same named-entity tagger to finding clinical descriptions of malignancies, as part of a larger strategy to link molecular and phenotypic observations, both in reports of laboratory research and in clinical records.

Yang used the "same tagger" as Fable does, in the sense that he used a general-purpose program that will attempt to learn how to "tag" any sort of text regions at all, generalizing from a body of hand-tagged training material. To make a gene tagger, the program was trained on text hand-annotated for genes. To make a malignancy tagger, it was trained on text hand-annotated for malignancies. (This tagger was developed by Ryan McDonald while he was a grad student at Penn -- Ryan is now at Google -- based on the Mallet machine learning toolkit.)

Yang's malignancy tagger works pretty well: 0.84 precision, 0.83 recall, 0.84 F-measure. ("Precision" is the proportion of hits that are valid; "recall" is the proportion of valid mentions that are found; the "F-measure" is the harmonic mean of precision and recall. These days, across various entity types and document collections, such taggers generally have F-measures in the 0.7-0.9 range.)
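For readers who want the arithmetic spelled out, here is a minimal sketch of how these three scores can be computed from a set of hand-annotated mentions and a set of predicted ones. The span representation (document id, start offset, end offset) and the exact-match criterion are my assumptions for illustration, not necessarily the scoring protocol used in the paper.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def score(gold, predicted):
    """Score predicted mentions against gold mentions.

    Both arguments are sets of (doc_id, start, end) tuples; a prediction
    counts as a hit only if it matches a gold span exactly (an assumption).
    """
    hits = len(gold & predicted)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(gold) if gold else 0.0
    return precision, recall, f_measure(precision, recall)


# Toy data, invented for illustration only.
gold = {(1, 0, 5), (1, 10, 20), (2, 3, 9), (2, 15, 22)}
predicted = {(1, 0, 5), (1, 10, 20), (2, 4, 9)}
p, r, f = score(gold, predicted)
print(f"precision={p:.2f} recall={r:.2f} F={f:.2f}")
```

Plugging the rounded figures above into `f_measure(0.84, 0.83)` gives roughly 0.835.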

Yang's tagger also worked notably better than the obvious baseline of string-matching against a term list. Yang took the National Cancer Institute's neoplasm ontology, a term list of 5,555 malignancies, and tested it (on a random subset of abstracts from the larger test set) using case-insensitive string matching. Of the 202 malignancy mentions in this subset, the term-list method found only 85, for a recall of 0.42, while his tagger found 190, for a recall of 0.94. The mentions missed by term-list matching but found by the tagger included some variations in form for items already on the NCI list (e.g. "leukaemia" vs. "leukemia" or "AML" vs. "acute myeloid leukemia"), but also quite a few that simply weren't on the list in any form, such as "temporal lobe benign capillary haemangioblastoma" and "parietal lobe ganglioglioma".
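To see why the term-list baseline misses so much, here is a minimal sketch of case-insensitive string matching against a list of terms. The three-item term list and the sample sentence are stand-ins I made up for illustration (the real NCI list has 5,555 entries), and the word-boundary and longest-match-first choices are my assumptions rather than details from the paper.

```python
import re


def build_matcher(terms):
    """Compile one case-insensitive regex that matches any term in the list."""
    unique = {t.lower() for t in terms if t}
    if not unique:
        raise ValueError("term list is empty")
    # Try longer terms first so "acute myeloid leukemia" beats "leukemia".
    ordered = sorted(unique, key=len, reverse=True)
    pattern = r"\b(?:" + "|".join(re.escape(t) for t in ordered) + r")\b"
    return re.compile(pattern, re.IGNORECASE)


def find_mentions(matcher, text):
    """Return (start, end, matched_text) for every term-list hit in text."""
    return [(m.start(), m.end(), m.group(0)) for m in matcher.finditer(text)]


# Hypothetical miniature term list and sentence.
terms = ["leukemia", "acute myeloid leukemia", "neuroblastoma"]
text = ("Acute Myeloid Leukemia (AML) and leukaemia differ from "
        "parietal lobe ganglioglioma.")
matcher = build_matcher(terms)
print(find_mentions(matcher, text))

# Recall figures from the post: mentions found / mentions present.
print(round(85 / 202, 2), round(190 / 202, 2))
```

On this toy sentence the matcher finds only the spelled-out "Acute Myeloid Leukemia"; the abbreviation, the British spelling, and the term that isn't on the list all slip through, which are exactly the kinds of misses described above.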

One of the most interesting and promising results was an essentially negative one. Yang trained the tagger in one trial with a completely generic set of features (words, character n-grams, and so on) that could be used for any entity tagging task at all, and in another trial with additional cancer-specific feature sets, in particular the NCI term list and a list of indicative suffixes. The generic tagger scored an F-measure of 0.834, while the addition of the cancer-specific feature sets only improved its performance to 0.838. This suggests that for some biomedical tagging tasks, domain-specific lexicons and other task-specific feature sets may not be needed.
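To make the generic-versus-specific distinction concrete, here is a rough sketch of what per-token feature extraction might look like in the two trials. This is not the tagger's actual feature set: the function names, the n-gram sizes, the context window, and the example suffix "oma" are all assumptions of mine, chosen only to show the shape of the idea.

```python
def char_ngrams(word, n_min=2, n_max=4):
    """Character n-grams of a lowercased word, padded with < and > markers."""
    padded = f"<{word.lower()}>"
    return [padded[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(padded) - n + 1)]


def token_features(tokens, i, term_list=None, suffixes=()):
    """Features for tokens[i].

    With term_list=None and suffixes=() this yields only generic features
    (word identity, neighbors, character n-grams). Passing a term list or
    suffixes adds the hypothetical domain-specific features.
    """
    word = tokens[i].lower()
    feats = {f"word={word}"}
    feats.update(f"ngram={g}" for g in char_ngrams(word))
    feats.add(f"prev={tokens[i - 1].lower()}" if i > 0 else "prev=<s>")
    feats.add(f"next={tokens[i + 1].lower()}" if i + 1 < len(tokens) else "next=</s>")
    if term_list is not None and word in term_list:
        feats.add("in_term_list")
    for suffix in suffixes:
        if word.endswith(suffix):
            feats.add(f"suffix={suffix}")
    return feats


tokens = ["parietal", "lobe", "ganglioglioma"]
generic = token_features(tokens, 2)
specific = token_features(tokens, 2, term_list={"leukemia"}, suffixes=("oma",))
print(sorted(specific - generic))
```

Note that the generic character n-grams already include strings like "oma>" at the end of the word, which may be one reason an explicit suffix list adds so little.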

But the single most important part of this story, in my opinion, is who Yang is. He's a graduate student in neuroscience, not in computer science or even in bioinformatics. We're beginning to enter an era when text-mining techniques are just another scientific tool, like a centrifuge or perhaps more analogously a package of software for fMRI analysis, available for use by researchers whose goals have no intrinsic connection to the analysis of language.

Posted by Mark Liberman at November 9, 2006 07:50 AM