May 10, 2005

Realistic surrealism

Jean-Frédéric Jauslin, director of the Swiss Office Fédéral de la Culture ("Federal Culture Office"), needs to learn to do research on the web, or to use a calculator, one or the other. Or perhaps he just needs a little more common sense and a little less arrogance. All of this, of course, is supposing that he was quoted accurately by the reporter from silicon.fr. In such cases, my normal rule of thumb is to blame the journalist, but I might need to make an exception for culture bureaucrats.

Jauslin was apparently an observer at the recent meeting of EU Culture ministers. In any case, he has cast his lot with the Chirac/Jeanneney initiative, saying in a press conference that "la Bibliothèque nationale suisse pourrait participer aux futures réalisations européennes en matière de bibliothèques virtuelles pour contrer le projet de Google" ("the Swiss National Library could participate in future European activities in the area of virtual libraries, to counter Google's project").

The French geekoid publication silicon.fr quotes Jauslin as echoing the now-usual sentiments about the "danger non négligeable pour la pluralité culturelle" ("non-negligible danger for cultural pluralism") and the "risque de prédominance de la notion de profit et de l'anglais" ("risk of predominance of the idea of profit and of English"). However, he goes that one chin-pull too far:

Il modère ainsi l'enthousiasme légitime de son confrère français le président de la Bibliothèque nationale Jean Noël Jeanneney à l'initiative du projet européen, en déclarant qu'une "numérisation systématique est surréaliste". En effet, numériser 100.000 pages par jour prendrait pas moins de... 400 ans!

He thus moderates the legitimate enthusiasm of his French colleague Jean Noël Jeanneney, the president of the National Library and the initiator of the European project, declaring that "a systematic digitization is surrealist". In fact, to digitize 100,000 pages per day would take not less than... 400 years!

Well, "systematic digitization" might well be surrealist, but it's not unrealistic.

The picture on the right shows the Digitizing Line automated book scanner. The page turning system is made by the Swiss company 4digitalbooks, and the scanning system and camera by the French company i2s. It can scan 30 pages a minute -- "up to 1500 pages per hour in unattended operation", with "constant process reliability operating 24 hours a day", according to the maker. One of these machines was at the center of the Stanford digital library project where Larry Page and Sergey Brin got their start, before they went off to found Google, according to a May 12, 2003, NYT story, a paragraph of which can be found for free here.

Another "robot scanner", made by Kirtas Technologies of Rochester, NY, claims 1,200 pages per hour combined with "ultra-gentle handling". This machine is said to be "about the size of a small kitchen refrigerator" (much smaller than the Swiss unit, which is described as "the size of an SUV"), able to be "easily moved between locations", and (as of 1/15/2004) on sale for "$150,000 a pop".

The Google Library Project page says that "We have developed innovative technology to scan the contents without harming the book". This may imply that Google has sponsored the design of some new machines, which presumably are at least as fast as the 4digitalbooks and Kirtas products; or it may be that whoever wrote that page is using "we" in a rather inclusive sense.

Anyhow, let's do the arithmetic. The Google Library Project has been quoted as aiming at 4.5 billion pages -- the content of 15 million 300-page books -- for a cost of $150-200M, over a number of years. 1,200-1,500 pages per hour is 28,800 to 36,000 pages per day -- let's bring that down to 25,000 pages per day, to allow for maintenance and whatnot. Then 4.5 billion pages is 180,000 machine-days, or 493 machine-years. Perhaps that's where Jauslin (or the silicon.fr reporter?) gets his sneering estimate of "not less than... 400 years".

OK, let's round 493 up to 500. There are 25 EU countries. If they split the chore among them, that would be 20 machine-years each. If each of them had 20 machines, they could do it in a year.

There are five libraries participating in Google's project, so for them, it reduces to 100 machine-years per library. If they spread the work over five years, each would need 20 machines "the size of a small kitchen refrigerator". I'm sure that each of those libraries already owns and operates many more than 20 copiers of that size or larger.

At the 1/2004 quantity-one price quoted by Kirtas, this would require $15M for the 100 scanners. But this is not an unduly large proportion of the budgeted $150-200M, and surely Google will get a volume discount. More likely, Google will be able to take advantage of economies of scale in other, more serious ways.
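For anyone who would rather check the arithmetic than take my word for it, here is a minimal sketch in Python. Every input is a figure quoted above; the variable and function names are purely illustrative, and the 25,000 pages per day is my own deliberately conservative assumption, not a manufacturer's number.

```python
# Back-of-the-envelope check of the scanning estimate.
# All inputs are figures quoted in the post; the names are illustrative.

TOTAL_PAGES = 15_000_000 * 300   # 15 million books at 300 pages each
PAGES_PER_DAY = 25_000           # assumed conservative rate per machine
SCANNER_PRICE_USD = 150_000      # Kirtas quantity-one price as of 1/2004

# Raw daily throughput at the two advertised hourly rates, running 24 hours.
daily_low = 1_200 * 24
daily_high = 1_500 * 24

machine_days = TOTAL_PAGES / PAGES_PER_DAY
machine_years = machine_days / 365


def machines_needed(total_machine_years, parties, years):
    """Machines each party needs to finish its share in the given time."""
    return total_machine_years / parties / years


ROUNDED_MACHINE_YEARS = 500      # the post rounds the result up to 500

eu_machines = machines_needed(ROUNDED_MACHINE_YEARS, parties=25, years=1)
library_machines = machines_needed(ROUNDED_MACHINE_YEARS, parties=5, years=5)
fleet_cost_usd = 5 * library_machines * SCANNER_PRICE_USD

print(f"Advertised throughput: {daily_low:,} to {daily_high:,} pages/day")
print(f"{TOTAL_PAGES:,} pages = {machine_days:,.0f} machine-days "
      f"= {machine_years:.0f} machine-years")
print(f"25 EU countries, one year: {eu_machines:.0f} machines each")
print(f"5 libraries, five years: {library_machines:.0f} machines each")
print(f"Scanners for the 5 libraries: ${fleet_cost_usd:,.0f}")
```

Changing PAGES_PER_DAY or the number of parties shows how sensitive the conclusion is to those assumptions, which is to say, not very.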

I found all this out in less than half an hour of searching on line, and did the math in a few seconds. I'm sure that M. Jauslin has subordinates who know how to use Google and a calculator as well as I do, but apparently it never occurred to him to ask them to deploy their skills.

Of course, the EU sometimes seems to run according to a different system of arithmetic. According to this recent article on book digitization technology for European libraries at another French geekoid publication, 01net:

C'est dans ce contexte qu'Infotechnique, une filiale de Getronics spécialisée dans la gestion électronique des documents, notamment pour le compte de l'Union européenne, vient d'inaugurer Eurodema (pour « Europe dématérialisation ») à La Walck, à 40 kilomètres de Strasbourg. Le premier contrat d'ampleur engrangé par ce centre porte sur la numérisation des 32 millions de pages issues des livres d'actes notariés accumulés en Alsace-Moselle depuis plus d'un siècle. Montant de l'addition : 23 millions d'euros, facturés au Gilfam, le groupement d'intérêt public constitué par les départements du Bas-Rhin, du Haut-Rhin et de la Moselle.

It's in this context that Infotechnique, a subsidiary of Getronics specialising in electronic document management, especially for the European Union, has just inaugurated Eurodema (for "Europe Dematerialization") at La Walck, 40 kilometers from Strasbourg. The first large contract landed by this center covers the digitization of 32 million pages from the books of notarized deeds accumulated in Alsace-Moselle over more than a century. The bill comes to 23 million euros, charged to Gilfam, the public-interest group made up of the departments of Bas-Rhin, Haut-Rhin and Moselle.

This looks like an extraordinarily good deal to me -- for Infotechnique!

32 million pages for 23 million euros -- €0.72/page = $1.05/page. If I could get that contract, I'd be tempted to take a leave from Penn and do the job myself. I often scan articles and book chapters to put on reserve for students in my courses. Using my cheap, unautomated commodity scanner and Adobe Acrobat, I generally allow for a rate of 2 scans per minute. For most book formats, each scan is two pages, so I can do 240 pages per hour. Thus at Infotechnique's rate I could earn $252/hour, which I view as a pretty good wage. Since 32 million pages would get tiresome, even at that sort of rate, I'd be happy to split the work with some colleagues and friends.
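The same calculation as a short Python sketch, under stated assumptions: the names are illustrative, and the $1.05 per page is simply the dollar figure used above (the exchange rate behind it is not stated, so it is taken as an input rather than derived).

```python
# Per-page price of the Alsace-Moselle contract and the hourly wage it
# would imply for one person with a flatbed scanner.
# Inputs are the post's figures; the names are illustrative.

CONTRACT_EUR = 23_000_000
CONTRACT_PAGES = 32_000_000
USD_PER_PAGE = 1.05        # dollar figure used in the post, taken as given
SCANS_PER_MINUTE = 2       # manual rate on a commodity scanner
PAGES_PER_SCAN = 2         # an open book shows two pages per scan

eur_per_page = CONTRACT_EUR / CONTRACT_PAGES
pages_per_hour = SCANS_PER_MINUTE * PAGES_PER_SCAN * 60
usd_per_hour = pages_per_hour * USD_PER_PAGE

print(f"Contract price: {eur_per_page:.2f} euros per page")
print(f"Manual throughput: {pages_per_hour} pages per hour")
print(f"Implied wage: ${usd_per_hour:.2f} per hour")
```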

And in fact we could do much better for ourselves. We could invest in one of those Kirtas scanners for $150K. Then all we need to do is load a new book in, one every 15 minutes or so, and the scanner would earn us up to $28,800 per day, making its cost back in less than a week. So we could easily buy several such scanners. At the rate of 25K pages per day, the whole job would take 1280 machine-days. With four machines, and (say) a dozen congenial partners to do the work in shifts -- in a nice place, with all the amenities -- we could do it in a year, and divide more than $29M among us, or almost $2.5M each.

Well, I know that there would be other costs. Let's allow $1M for renting a tasteful chateau in the neighborhood, and another $1M for spares, supplies, utilities, legal fees and whatnot. The split would still be about $2.25M each. Now if only Jean Véronis had kept his eye on the politico-digital-library ball, rather than doing all that clever reverse engineering of indexing methods! Or perhaps Chris Waigl might have been cultivating contacts in Alsace-Moselle rather than planting eggcorns...
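The business plan can be sketched the same way. This is only a rough model using the rounded figures from the last two paragraphs as inputs (the $28,800 daily take, the $29M total, and the $2M of overhead are all taken as given, and the names are illustrative); it ignores taxes, scanner purchases, and the price of the wine cellar.

```python
# Rough model of the chateau scheme, using the post's rounded figures.

CONTRACT_PAGES = 32_000_000
PAGES_PER_DAY = 25_000           # assumed rate per machine
MACHINES = 4
SCANNER_PRICE_USD = 150_000
DAILY_REVENUE_USD = 28_800       # the post's "up to" figure per scanner
TOTAL_REVENUE_USD = 29_000_000   # the post's "more than $29M", rounded down
PARTNERS = 12
OVERHEAD_USD = 2_000_000         # chateau plus spares, utilities, legal fees

machine_days = CONTRACT_PAGES / PAGES_PER_DAY
calendar_days = machine_days / MACHINES
payback_days = SCANNER_PRICE_USD / DAILY_REVENUE_USD

share_before_costs = TOTAL_REVENUE_USD / PARTNERS
share_after_costs = (TOTAL_REVENUE_USD - OVERHEAD_USD) / PARTNERS

print(f"Job size: {machine_days:,.0f} machine-days, "
      f"or {calendar_days:.0f} days with {MACHINES} machines")
print(f"One scanner pays for itself in {payback_days:.1f} days")
print(f"Share per partner: ${share_before_costs:,.0f} before costs, "
      f"${share_after_costs:,.0f} after")
```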

Note that Google is projecting 4.5 billion pages for $150-200M -- between $.033 and $.044/page. If we call it $.04/page, that's 26 times cheaper than Infotechnique's rates. At Google's estimated prices, I'd only earn $9.60/hour with my old HP flatbed scanner, which would not tempt me, though many honest and respectable people work for less. In fact, come to think of it, it's almost $10/hour more than Language Log contributors get...
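A last sketch for the comparison with Google's projected costs, again with the post's figures as inputs and illustrative names. The $.04 midpoint and the $1.05 Infotechnique rate are the rounded values used above, not independently derived.

```python
# Google's projected per-page cost versus the Infotechnique contract rate.

GOOGLE_PAGES = 4_500_000_000
BUDGET_LOW_USD = 150_000_000
BUDGET_HIGH_USD = 200_000_000
GOOGLE_USD_PER_PAGE = 0.04        # rounded midpoint used in the post
INFOTECHNIQUE_USD_PER_PAGE = 1.05 # dollar rate used in the post
FLATBED_PAGES_PER_HOUR = 240      # 2 scans a minute, 2 pages a scan

cost_low = BUDGET_LOW_USD / GOOGLE_PAGES
cost_high = BUDGET_HIGH_USD / GOOGLE_PAGES
price_ratio = INFOTECHNIQUE_USD_PER_PAGE / GOOGLE_USD_PER_PAGE
flatbed_wage = FLATBED_PAGES_PER_HOUR * GOOGLE_USD_PER_PAGE

print(f"Google: ${cost_low:.3f} to ${cost_high:.3f} per page")
print(f"Infotechnique costs about {price_ratio:.0f} times as much per page")
print(f"Flatbed wage at Google's rate: ${flatbed_wage:.2f} per hour")
```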

[Note: I recognize that the Alsace-Moselle contract may well involve all sorts of special and labor-intensive circumstances. Perhaps, for example, large numbers of hand-written documents need to be transcribed and edited; perhaps a textual (as opposed to image) form of the output needs to be certified by notaries; etc. All this might mean that the contract is not outrageously padded, but just atypical. However, the 01net article does not mention any considerations of this kind, and instead presents this contract as a representative sample of the virtual library activities to come...

And as further evidence of sometimes-odd EU arithmetic in this general area, you can refer to my earlier discussion of on-going digitization efforts at Jeanneney's BNF, where a decade of work seems to have resulted in fewer than 1,500 books processed, or some 18 days' work for one of the automated scanners. ]

Posted by Mark Liberman at May 10, 2005 08:34 AM