December 30, 2004

CASS corpus

According to a recent wire story from Xinhua:

Chinese linguists are going to complete China's largest database of spoken Chinese, on the basis of which they will compile the country's first modern spoken Chinese dictionary and grammar book.

Shen Jiaxuan, director of the Chinese Academy of Social Sciences (CASS) Institute of Linguistics, said the database include three sub bases such as a live Chinese conversation base whose data were collected in Beijing, a base consisting of six dialects of Shanghai, Xi'an, Guangzhou, Beijing, Chongqing and Xiamen, and a base of phonetic symbols of modern spoken Chinese.

The live conversation base now has 650 hours of live conversations recorded in Beijing, which were transferred to 8.9 million words in transcript.

The English-language web page for the CASS Institute of Linguistics is here.

I wasn't able to find out whether the recordings and transcripts will be published. I hope so -- the corpus-based dictionary and grammar will be even more valuable if the base materials are also available to scholars, as the British National Corpus and many other large linguistic corpora are.

The cited numbers for the Beijing conversational recordings and their transcripts (650 hours of conversations, 8.9 million "words") add up to about 228 "words" per minute. This leaves me uncertain about whether this should be taken to mean "words" in the lexicographic sense, or "characters" as they would be used in transcribing the conversations into normal Chinese orthography. The Chinese writing system doesn't separate "words" by spaces or any other marks, but the project aims at producing a dictionary, which will certainly be organized in terms of multi-character words, just as (for instance) CEDICT or the ABC Chinese Dictionary is. Thus e.g. dian4 shi4 ji1 电视机, meaning "television (set)", is three syllables, three characters, but one (lexicographic) word.

228 whatevers per minute seems too fast for the units to be words, which I think should average about two syllables each in Beijing conversation, but it's kind of slow for syllables in conversational speech. It's probably syllables (=characters), though, with some pausing accounting for the slower rate.
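For anyone who wants to check the arithmetic, here is a minimal sketch in Python. The hours and unit count come from the Xinhua story; the figure of two syllables per word is only my rough guess from above, not a measured value, and the variable and function names are made up for illustration.

```python
# Figures quoted in the Xinhua story.
HOURS_RECORDED = 650
UNITS_TRANSCRIBED = 8.9e6  # reported as "words"; could be characters

# Assumption (a rough guess, not a measurement): about two
# syllables per word in Beijing conversation.
SYLLABLES_PER_WORD = 2


def units_per_minute(units, hours):
    """Average number of transcribed units per minute of recording."""
    return units / (hours * 60)


rate = units_per_minute(UNITS_TRANSCRIBED, HOURS_RECORDED)

# If the units really were words, the implied syllable rate doubles.
syllable_rate_if_words = rate * SYLLABLES_PER_WORD

print(f"{rate:.0f} units per minute ({rate / 60:.1f} per second)")
print(f"{syllable_rate_if_words:.0f} syllables per minute if the units are words")
```

On the word reading, the implied rate comes out at roughly 456 syllables per minute averaged over all 650 hours, pauses included, which is why the syllable (=character) reading looks more plausible.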

Anyhow, it's terrific to see this development.

[via Victor Mair]


Posted by Mark Liberman at December 30, 2004 11:20 AM