December 05, 2004

Semen, green rice and the rate of internet decay

Hanzi Smatter is a blog "[d]edicated to the misuse of Chinese characters (Hanzi or Kanji) in Western culture". For those of us who are ignorant of Chinese characters in all their forms, it's especially nice that characters cited are identified with links to the Unihan database. For example, an entry from December 1 pictures someone who meant to tattoo 精 (jing1) "essence, semen, spirit" on his elbows, but by splitting the character into the two radicals 米 (mi3) "uncooked rice" and 青 (qing1 or jing1) "blue, green, black; young", managed instead to display "green rice".

You can find the same character discussed on Rick Harbaugh's website, where the semantic part (mǐ, i.e. mi3, mi with third tone) is glossed as "rice" or "kernel", and the phonetic part (qīng, i.e. qing1, qing with first tone) is glossed as "color of lush growth that burns red", "green, blue", or "young".

The site also provides a list of relevant cross-reference links for each character, e.g. for the original jing1 "essence", but unfortunately most of the links are broken. Of the 16 links given as cross-references for jing1, only 4 worked for me: an animated display of the character being drawn, a corresponding Cantonese entry, the entry in an etymological database, and an AltaVista search for the character.

The Foreword to Web Version says that the site "was created in the fall of 1996". If 25% of links are still active after 8 years, a simple model of internet decay would say that the average rate of link preservation per year is .25^(1/8) = 0.84. That's quite a bit better than the rate of link rot that I generally see when I update the links in my on-line course lecture notes each year (say for the intro linguistics course ling 001), but the links are mostly to big lexicographical reference sites, which are likely to be more stable. I guess it's also likely that the author of the site, Rick Harbaugh, has updated some of the references since 1996 -- the site's copyright notice says 1995-2003 -- which would bring the yearly retention rate back down towards the .6 or so that I'm used to seeing.
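The arithmetic behind that estimate can be sketched in a few lines of Python. This is only an illustration of the constant-rate model described above: the function name annual_retention is my own, and the inputs are the 4 working links out of 16, observed after 8 years.

```python
def annual_retention(surviving_fraction, years):
    """Return the constant yearly survival rate r with r ** years == surviving_fraction.

    Assumes every link has the same, independent chance of surviving each year.
    """
    if not 0 < surviving_fraction <= 1:
        raise ValueError("surviving_fraction must be in (0, 1]")
    if years <= 0:
        raise ValueError("years must be positive")
    return surviving_fraction ** (1.0 / years)


# 4 of the 16 cross-reference links still worked after 8 years.
rate = annual_retention(4 / 16, 8)
print(round(rate, 2))  # 0.84
```

Note that the model treats all links as equally fragile, which is exactly the simplification questioned above: links to big reference sites probably survive better than average.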

All the same, even a link retention rate as high as .84 means that internet cross-references become useless on a time scale that's small compared to the traditional life cycle of scholarship. After 10 years, only 18% of references would still be valid. After 16 years -- roughly how long it's been since the 2nd edition of the OED was published in 1989 -- only 6 percent of the links would still work. After 76 years -- the time elapsed since the first edition of the OED in 1928 -- only about 2 links in a million would be valid.
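Those projections follow from raising the yearly rate to the number of years. Here is a small Python sketch under the same constant-rate assumption; the helper names surviving_fraction and half_life are mine, not part of any library.

```python
import math


def surviving_fraction(rate, years):
    """Fraction of links expected to survive after the given number of years."""
    return rate ** years


def half_life(rate):
    """Years until half of the links are expected to be dead."""
    return math.log(0.5) / math.log(rate)


# Yearly retention implied by 25% of links surviving 8 years.
rate = 0.25 ** (1 / 8)

for years in (10, 16, 76):
    print(f"{years:2d} years: {surviving_fraction(rate, years):.2e}")

print(f"half-life: {half_life(rate):.1f} years")
```

The three fractions come out at roughly 0.18, 0.06, and 2 in a million, and the implied half-life is 4 years: under this model, half of the remaining links die every four years.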

In my opinion, it's past time for the creators of serious content on the web to start using something like the DOI system to establish stable links. This would solve a portion of the problem in a way that doesn't require authors of sites with cross-reference links to run fast just to stay in one place. Some of the dead links on that site are to content that is still on the web, but can't be accessed at the old URLs because sites have been moved or internally reorganized or both. For example, the links to the CEDICT and Unicode database entries are of this type. There's still another piece of the problem, though -- sometimes content just goes dark, because the provider moves on, in some sense or another of that phrase. We don't evaporate all copies of a book when the author retires or dies, and the Internet Archive offers one model of how to retain web content in a library-like fashion. However, it's far from clear (at least to me) how to integrate one or more systems of stable links with one or more systems of archival storage. I also worry about other problems, for example the trend towards dynamically-composed rather than static content, where effective archiving may require access to a compatible version of time-varying programs of various sorts.
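The idea behind stable links is indirection: citing pages point at a persistent identifier, and only a registry knows the current location. The Python sketch below is a toy illustration of that idea, not the DOI system's actual interface; the class name, the "demo:" identifier, and the example.org URLs are all hypothetical.

```python
class StableLinkResolver:
    """Toy registry mapping persistent identifiers to current URLs (hypothetical)."""

    def __init__(self):
        self._locations = {}

    def register(self, identifier, url):
        """Record or update the current location for an identifier."""
        self._locations[identifier] = url

    def resolve(self, identifier):
        """Return the current URL, or raise LookupError if the identifier is unknown."""
        try:
            return self._locations[identifier]
        except KeyError:
            raise LookupError(f"unknown identifier: {identifier}") from None


resolver = StableLinkResolver()
resolver.register("demo:dictionary/jing1", "http://old.example.org/entries/jing1.html")

# The provider reorganizes its site: only the registry record changes,
# and every page that cites the identifier keeps working.
resolver.register("demo:dictionary/jing1", "http://new.example.org/char/jing1")

current = resolver.resolve("demo:dictionary/jing1")
print(current)
```

This addresses only moved or reorganized content. If the content itself goes dark, the registry has nothing to point at, which is where archival storage would have to come in.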

So the optimistic side of things is that a site like Hanzi Smatter is now easy to set up, and can easily display not only photographs but also Chinese characters in any reasonably compliant browsing environment, and can link to marvelously informative external pages on each of the characters discussed. The older site was obviously harder to set up -- for example, the author had to use gifs instead of character codes because browser and OS technology was not reliably able to deal with Chinese characters as text rather than as images. Nevertheless he did it, and also provided an extraordinary range of systematic external as well as within-site links. The worm in the apple: most of the external links are now dead, just a few years later.


Posted by Mark Liberman at December 5, 2004 11:58 AM