July 02, 2005


After returning from a trip to Colorado and Wyoming, I'm now at the EMELD 2005 Workshop in Cambridge, MA. "EMELD" stands for "Electronic Metastructure for Endangered Languages Data". This particular workshop is focusing on just one aspect of EMELD, namely GOLD (the "General Ontology for Linguistic Description"). So far, I've heard about some very interesting work, which hasn't quite overcome my general worries about the application of Semantic Web ideas and tools in science (or elsewhere, for that matter).

Roughly, E-MELD aims to solve three problems: how to make the documentation of endangered languages durable (so that it can still be read and used in 20 or 50 or 100 years), how to make it interpretable as data (other than to an informed human eyeball), and how to make it interoperable (so that you could search or amalgamate data across descriptions of many different languages by many different people). I hope it's obvious that these are real problems. In terms of durability, a corpus or a dictionary in a proprietary format may be difficult or impossible to use just a few years from now. One extreme example: the archive of scripts and transcripts at the Voice of America used to be kept in the storage format associated with a now-defunct Xerox multi-language word processing system, which (as I understand it) was basically a binary dump of the run-time heap of the program. In terms of interpretability, interlinear glossed text (whether expressed in a word processor file, in a typesetting format like .pdf, or as plain text) may be quite readable, but can be very difficult to transform into a database that can be searched, linked to a dictionary, etc. And in terms of interoperability, the problem is that linguists use a wide variety of terminology (e.g. some linguists might use "nominative" where others use "absolutive") and an even wider variety of abbreviations (NOM might be short for "nominative" or "nominalization" or "nominal").

The durability problem is the most important one, and it also has the easiest solution: just use open, documented standards (and archival-quality storage methods, of course). The interpretability problem is the next most important one, and it's fairly easy to solve: use (tools that produce) well-designed descriptive mark-up in a well-defined format such as XML, rather than presentational mark-up.
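To make the descriptive-vs-presentational distinction concrete, here is a toy sketch in Python. The element names (`phrase`, `word`, `morph`) and attributes are my own illustrative inventions, not any actual E-MELD or GOLD schema; the point is only that when the mark-up names the linguistic structure rather than the typography, turning a glossed example into queryable data becomes trivial.

```python
# A minimal sketch of descriptive mark-up for one line of interlinear
# glossed text. The element and attribute names are invented for
# illustration -- they are NOT a real E-MELD/GOLD schema.
import xml.etree.ElementTree as ET

igt = """
<phrase translation="dogs">
  <word form="perro-s">
    <morph form="perro" gloss="dog"/>
    <morph form="s" gloss="PL"/>
  </word>
</phrase>
"""

root = ET.fromstring(igt)

# Because the structure is explicit, extracting the gloss line is one
# expression; with presentational mark-up (tabs, italics, small caps)
# the same task would require fragile guesswork about layout.
glosses = [m.get("gloss") for m in root.iter("morph")]
print(glosses)  # ['dog', 'PL']
```

Nothing here depends on XML in particular; any well-defined, documented format with the same explicit structure would do as well.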

The interoperability problem is by far the hardest one. The proposed solutions involve connecting descriptive entities and relations to a shared ontology, or a lattice of partially shared sub-ontologies, or a set of mappings among ontologies; and using tools like RDF and OWL in all their variants to keep track of all the connections and correspondences. Some nice examples have been presented -- for instance, Scott Farrar and William Lewis discussed an experimental effort to create a cross-language database of Interlinear Glossed Text (IGT) from examples available on the web, and Gary Simons showed how to use RDF metaschemas to combine entries from three apparently incompatible dictionary databases.
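The core idea behind these proposals can be sketched in a few lines of Python, with no RDF machinery at all. Each project supplies a mapping from its own abbreviations to concepts in a shared ontology, and queries are phrased over the concepts rather than the local tags. The concept names below are invented stand-ins, not actual GOLD identifiers.

```python
# Toy sketch of ontology-mediated interoperability. Each project maps
# its local gloss abbreviations to shared concepts; the concept names
# ("gold:...") are invented placeholders, not real GOLD identifiers.

# Note that "NOM" means different things in the two projects -- exactly
# the ambiguity the shared ontology is meant to resolve.
project_a = {"NOM": "gold:NominativeCase", "PL": "gold:PluralNumber"}
project_b = {"NOM": "gold:Nominalization", "pl": "gold:PluralNumber"}

a_data = [("perro-s", "PL")]   # (form, local tag) pairs
b_data = [("hund-e", "pl")]

# A cross-project query for plural marking finds both forms, even
# though the projects abbreviate differently ("PL" vs "pl"):
target = "gold:PluralNumber"
hits = [form
        for data, mapping in [(a_data, project_a), (b_data, project_b)]
        for form, tag in data
        if mapping[tag] == target]
print(hits)  # ['perro-s', 'hund-e']
```

The RDF/OWL tooling adds inference, distributed identifiers, and formally checked relations among concepts on top of this basic move, but the move itself is just the translation step shown here.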

I'm still thinking about all this, but I have two specific concerns. First, the ontologists' focus on terminological logic distracts attention from a number of other problems that are at least as important, such as how to represent the complex connections among recordings, transcripts, texts, analyses and lexicons (even within a single descriptive framework applied to a single language). In fact, the ontologists' methods can sometimes make it much harder to solve these other problems, for example by imposing inappropriate structures on sets of concepts or on linguistic objects. (I'll say more about these problems in a later post.) Second, the process of "ontologizing" a linguistic description is complex, difficult and time-consuming, as exemplified in a workshop presentation by Laura Buszard-Welcher on her "experience as a field researcher mapping morphosyntactic categories of Potawatomi, an Algonquian language, to the GOLD ontology through FIELD, an ontology-based lexical database program". I'm worried that these difficulties will delay the adoption by "ordinary working linguists" of (much more accessible) tools and practices that solve the durability and interpretability problems.

Meanwhile, with all this traveling, my net access has been frustratingly erratic, and I've had frustratingly few of those little chunks of blogging time in the interstices of my schedule. As a result, my to-blog list is growing frustratingly long. To all of you who've sent me suggestions, links and other messages: sorry, I'll get to it! (I hope...)

For an introduction to the broader controversy about what semantic-web-style ontologies may or may not be good for, see Peter van Dijck's "Themes and metaphors in the semantic web discussion". It's almost two years old, which is like a decade in internet years, but it's still relevant.

Posted by Mark Liberman at July 2, 2005 06:07 AM