Language Log: Searching for Santa Cruz

March 28, 2004

Searching for Santa Cruz

A new service for searching language archives has just been set up on the LDC website. Enter a language name such as Warlpiri to find 41 results in 7 different language archives, ranging from a bunch of primary resources in the Australian Studies Electronic Data Archive to a paper in the ACL Anthology on "Parsing a Free-Word Order Language." If you use a variant or incorrect spelling of the language name (e.g. Walbiri), the service will direct you to the correct version, thanks to Ethnologue's list of alternate language names, approximate string matching, and various other tricks. Enter a country name to find resources for languages spoken in that country. Search for Santa Cruz (a language of the Solomon Islands) and find Voorhoeve and Wurm's recordings held in the Pacific And Regional Archive for Digital Sources in Endangered Cultures. Now try the same search using Google: you will discover a host of irrelevant sites (like the UCSC homepage) and realize the value of this new service, which searches a union catalog of major language archives. Visit LINGUIST List for a more fine-grained interface for searching within the same collection. All this is made possible by OLAC, the Open Language Archives Community...
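
To make the name-matching step concrete, here is a minimal sketch in Python of how a variant spelling might be resolved. It is not the service's actual code: the tiny `ALTERNATE_NAMES` table, the function names, and the 0.8 similarity cutoff are all assumptions made for illustration, and the standard-library `difflib` stands in for whatever approximate matching the service really uses.

```python
import difflib

# Hypothetical sample data. The real service draws on Ethnologue's list of
# alternate language names; only the Warlpiri/Walbiri pair comes from the post.
ALTERNATE_NAMES = {
    "Warlpiri": ["Walbiri"],
    "Santa Cruz": [],
}


def build_index(table):
    """Map every known spelling (lowercased) to its canonical name."""
    index = {}
    for canonical, alternates in table.items():
        index[canonical.lower()] = canonical
        for alternate in alternates:
            index[alternate.lower()] = canonical
    return index


def resolve_language(query, table=ALTERNATE_NAMES, cutoff=0.8):
    """Return the canonical name for a query, or None if nothing is close."""
    index = build_index(table)
    key = query.strip().lower()
    # Step 1: exact hit on a canonical or alternate name.
    if key in index:
        return index[key]
    # Step 2: approximate match against every known spelling.
    close = difflib.get_close_matches(key, list(index), n=1, cutoff=cutoff)
    return index[close[0]] if close else None


print(resolve_language("Walbiri"))  # alternate name, resolved via the table
```

Trying the exact table lookup first keeps known alternates cheap and predictable; the fuzzy step is only a fallback for misspellings, and the cutoff decides how forgiving it is.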

Back in October, Mark Liberman wrote: "One thing I'd like to understand better is the relationship to the Open Archives Initiative and the Open Language Archives Community. Steven?" (Another scientific revolution?). Later, Mark gave OLAC some more air-time: "The OLAC Metadata set is a modest set of extensions to the Dublin Core, useful for cataloguing language-related archives of various types" (Borges on metadata). Let me take this as my cue to tell you some more about OLAC.

In December 2000, an NSF-funded Workshop on Web-Based Language Documentation and Description, held in Philadelphia, brought together a group of nearly 100 language software developers, linguists, and archivists responsible for creating language resources in North America, South America, Europe, Africa, the Middle East, Asia, and Australia. The outcome of the workshop was the founding of the Open Language Archives Community, with the following purpose:

OLAC, the Open Language Archives Community, is an international partnership of institutions and individuals who are creating a worldwide virtual library of language resources by: (i) developing consensus on best current practice for the digital archiving of language resources, and (ii) developing a network of interoperating repositories and services for housing and accessing such resources.

Today OLAC has over two dozen participating archives in seven countries, with 26,656 records describing language resource holdings. Anyone in the wider linguistics community can participate, not only by using the search facilities, but also by documenting their own resources (providing data), or by helping create and evaluate new best practice recommendations (sign up for OLAC mailing lists, starting with OLAC General).

OLAC is built on two frameworks developed within the digital libraries community by the Dublin Core Metadata Initiative and the Open Archives Initiative. The DCMI provides a way to represent metadata in electronic form, while the OAI provides a convenient method to aggregate metadata from multiple archives.
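
As a rough illustration of the aggregation side, the sketch below builds an OAI-PMH `ListRecords` request and pulls titles out of a response. The base URL, the sample response, and the function names are hypothetical; `oai_dc` is the plain Dublin Core format that OAI-PMH repositories are required to support, and an OLAC service provider would typically ask for a richer format. No network call is made here, so the parsing runs against an embedded sample.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

OAI_NS = "http://www.openarchives.org/OAI/2.0/"
DC_NS = "http://purl.org/dc/elements/1.1/"


def list_records_url(base_url, metadata_prefix="oai_dc"):
    """Build the harvesting request a service provider would send."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return base_url + "?" + query


# Hypothetical response from an imaginary archive, trimmed to the essentials.
SAMPLE_RESPONSE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:archive-a.example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Sample recording</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>
""".strip()


def titles_by_identifier(xml_text):
    """Extract {record identifier: title} from one ListRecords response."""
    root = ET.fromstring(xml_text)
    titles = {}
    for record in root.iter(f"{{{OAI_NS}}}record"):
        identifier = record.findtext(f"{{{OAI_NS}}}header/{{{OAI_NS}}}identifier")
        titles[identifier] = record.findtext(f".//{{{DC_NS}}}title")
    return titles


def aggregate(responses):
    """Merge records harvested from several archives into one catalog."""
    catalog = {}
    for xml_text in responses:
        catalog.update(titles_by_identifier(xml_text))
    return catalog


print(list_records_url("https://archive-a.example.org/oai"))
print(aggregate([SAMPLE_RESPONSE]))
```

The point of the design is that each archive only has to answer one simple kind of request; the union catalog is assembled entirely on the harvesting side.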

"Metadata" is structured data about data - descriptive information about a physical object or a digital resource. Library card catalogs are a well-established type of metadata, and they have served as collection management and resource discovery tools for decades. The OLAC Metadata standard defines the elements to be used in descriptions of language archive holdings, and how such descriptions are to be disseminated using XML descriptive markup for harvesting by service providers in the language resources community. The OLAC metadata set contains the 15 elements of the Dublin Core metadata set plus several refined elements that capture information of special interest to the language resources community. In order to improve recall and precision when searching for resources, the standard also defines controlled vocabularies for descriptor terms covering language identifiers, linguistic data types, discourse types, linguistic fields, and participant roles. You can see three of these vocabularies in use by searching for Pullum and picking the record for Pullum & Derbyshire's paper Object-initial languages.

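The idea of Dublin Core elements combined with controlled vocabularies can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the OLAC schema: the `record` wrapper element, the `code` attribute, the `make_record` helper, and the three sample vocabulary terms are all hypothetical stand-ins, and the authoritative term lists and XML encoding are defined by the OLAC Metadata standard itself. Only the 15 Dublin Core element names and the Dublin Core namespace are taken as given.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)

# The 15 elements of the Dublin Core metadata set.
DUBLIN_CORE_ELEMENTS = {
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
}

# Illustrative terms only, standing in for a controlled vocabulary of
# linguistic data types; see the OLAC vocabularies for the real lists.
LINGUISTIC_TYPES = {"lexicon", "primary_text", "language_description"}


def make_record(fields, linguistic_type):
    """Build a record from (element, value) pairs plus one controlled term."""
    if linguistic_type not in LINGUISTIC_TYPES:
        raise ValueError(f"not in controlled vocabulary: {linguistic_type!r}")
    record = ET.Element("record")  # hypothetical wrapper element
    for name, value in fields:
        if name not in DUBLIN_CORE_ELEMENTS:
            raise ValueError(f"not a Dublin Core element: {name!r}")
        ET.SubElement(record, f"{{{DC_NS}}}{name}").text = value
    # Hypothetical encoding of the refinement: a coded value on dc:type.
    ET.SubElement(record, f"{{{DC_NS}}}type").set("code", linguistic_type)
    return record


record = make_record(
    [
        ("title", "Object-initial languages"),
        ("creator", "Pullum"),
        ("creator", "Derbyshire"),
    ],
    "language_description",  # type chosen purely for illustration
)
print(ET.tostring(record, encoding="unicode"))
```

Rejecting terms outside the vocabulary is what buys the gain in recall and precision: every archive describing the same kind of resource ends up using the same descriptor, so a search on that descriptor finds all of them and nothing else.
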
I'm indebted to Gary Simons, along with dozens of institutions and individuals, for helping to build and support OLAC.

Posted by Steven Bird at March 28, 2004 05:46 AM