March 08, 2005

The progress and prospects of the digital BNF

A terrific post by Andrew Joscelyne at Blogos discusses a Le Monde interview with Jean-Noël Jeanneney, the head of the French National Library (Bibliothèque Nationale de France, or BNF), who wrote a few weeks ago about Google's challenge to Europe. The most interesting part, to me anyhow, is Andrew's memory of an earlier BNF disaster. More precisely, this was an earlier case in which an obsession with industrial competitiveness, combined with top-down decision making by technically clueless bureaucrats, wasted many tens of millions of new francs.

Andrew wrote:

I remember attending a demo in the early pre-web 1990s given by Cap Gemini (or whatever the IT services company was called at the time) which had been charged with designing a ‘scholar’s workstation’ for the brand new BNF, looming with its four monstrous bookend towers and damp wooden platform over the Seine opposite the old wine depot of Bercy. The idea was to offer serious readers digitized and bitmapped versions of books from every age, allowing both access to the text as a corpus, and as a set of specially designed original pages. All wonderful stuff, yet predicated on a massive digitization campaign. However, as Jeanneney admits today, the BNF has managed to digitize only 80 thousand works in a decade, compared with Google’s project of 15 million in about half that time. Why so few?

Alas, the BNF has created nowhere near 80,000 e-texts, as we'll see shortly. But let's focus first on this BNF workstation. I never saw a demo, but I did attend more than one presentation by its funders and developers. "Wonderful stuff" is a phrase that never occurred to me at the time, and it seems even less appropriate in retrospect.

If my memory is correct, the following things were true:

1. The workstation (called PLAO, "poste de lecture assistée par ordinateur" or "computer-assisted reading environment") was based on new, proprietary hardware and software. Why? To promote the French IT industry.
2. According to the original plan, the BNF digital library would be accessible only via special PLAO workstations at the BNF site in Paris. There would be no provision for any sort of remote access, not even dial-up, much less via the (American!) internet. I recall asking questions about this at one of these presentations, around 1994 or 1995: the speaker looked pained, as if I'd suggested putting ketchup on my croque-monsieur.

The PLAO thus set its face directly against the two biggest technology trends of the decade, namely commodity computing and the internet.

I emphasize that this is my memory from a couple of presentations that I heard a decade ago or more, and may be inaccurate or incomplete -- I welcome additions and corrections from more knowledgeable readers. I presume that PLAO is in the dustbins of technological history; in any case, BNF's digital library is now accessible on the web via the Gallica site.

I've used Gallica with gratitude in the past, and expect to use it again in the future, but based on my interactions with the site, I'm confident that far fewer than 80,000 works are really available -- in text form, anyhow. The hit counts returned by my searches suggest a significantly smaller figure, and indeed Gallica has a list of documents in text mode, which includes merely 1,118 works.

A quick check on the Interrogation du catalogue page suggests that this list might be a bit out of date, but not by much. There are six types of document: "Ouvrages en mode texte", "Monographies en mode image", "Périodiques en mode image", "Lots d'images", "Documents sonores", and "Documents manuscrits" ("works in text mode", "monographs in image mode", "periodicals in image mode", "sets of images", "audio documents", "manuscript documents"). If you search the "Ouvrages en mode texte", you'll get hit counts like 935 for France, 1,135 for terre, 1,131 for chose, 1,043 for parmi, and so on. (It won't work to search for common function words like "mais" or "non" -- these are apparently stop words, and -- silently -- return nothing.)

So apparently the BNF's efforts over the past decade have given Gallica e-text holdings of about 1,200 works. The all-volunteer Project Gutenberg has produced more than 13,000 e-texts over roughly the same period of time.

A couple of other comparisons: if I ask Google for pages in the French language that contain the word France, I get 24,000,000 hits. If I ask for books on the subject of France, I get 26,100 hits. If I ask the Literature Online ("LION") database for texts containing the word France, I get 26,880 instances in 7,959 distinct works. Asking LION for earth gets 201,326 instances in 69,425 distinct works -- out of the "more than 350,000 works of poetry, drama and prose" that LION offers.

If Brussels gives M. Jeanneney the "plan pluriannuel" with a "budget généreux" that he's asking for, let's say that it'll be a triumph of hope over experience.

[Let me be clear that I'm all for pluralistic efforts, and especially European efforts, in the digital libraries arena. And I also believe that government-supported efforts could have a crucial role to play, especially if the result were to be e-texts in the public domain or otherwise openly available (and not just through one entity's web site). But I also believe that the best way to predict what (well-established) institutions will do in the future is usually to look at what they've done in the past.]


Posted by Mark Liberman at March 8, 2005 02:39 PM