July 11, 2004

Typescript finished: a milestone and a rant

A major milestone was reached today in the process of completing A Student's Introduction to English Grammar, the grammar textbook that Rodney Huddleston and I have been working on for over a year. After a marathon work session at his beach house overlooking the Tasman Sea, Rodney completed the analog of what software engineers call "the build": he put together all the pieces and prepared and debugged the final complete electronic version of our book for Cambridge University Press. It has just been emailed from Brisbane (where Rodney's mail server is) to Cambridge, England (where the publisher is), with a copy to Santa Cruz (for the second author, yours truly). And how do I feel with respect to the technology that permitted this? I'm impressed, and furious, and grateful, and disgusted. Let me explain. Or not, if the last thing you want today is to read a rant. This rant is dedicated to Wolf Angel, who understands that the world is a place that one might occasionally want to rant at.

The Internet is of course a major boon to scholars and has changed my life for the better. Once it would have taken two months to get a bulky double-spaced manuscript from Australia to England as a parcel; now it takes roughly six seconds. Six seconds during which our book changed status from "in preparation" to "in press". What's more, hardly any of the six seconds was travel time. It was taken up with waiting for router boxes and CPUs and disk drives on intermediate servers to get to their tasks. The majority of their work involves forwarding spam. Some counts say that as much as 80% of today's email is spam.

(While waiting for my message with the final-build book typescript to arrive I received a message headed "Nude blonde raped" that eluded my spam filter completely. It contained just some ASCII gibberish strings to fool the spam filter into thinking there was ordinary text there and some HTML code with a URL to click on if I was intrigued by the idea of seeing pictures of a nude blonde being raped. I didn't visit. I did resent the intrusion of this unwanted extra communication slipping through the spam net. But even if we could find out who controlled the website that the rented domain name pointed to at the time, and we could also find the sender of the message, we could never gather convincing proof that the owner of the website had authorized the sending of the spam that was supposed to direct traffic to him; he would say he knew nothing about it, and the spam-sender would say he knew nothing about the website or the sending of the messages, and prosecution of either or both would fail. We legitimate Internet users can never stop the spammers by ordinary legal means. We have to kill them. But I digress.)

If it were not a matter of waiting for intermediate machines to get to the task of routing the message, a journey of a mere 12,000 miles, at the speed of electrons, would take only about 64 milliseconds. But this is a digression too. Let me struggle to get back to whatever my point was; I know I had one.
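That figure can be checked with a rough sketch. It assumes the signal covers the 12,000 miles in a straight run at the speed of light in a vacuum, which is an idealization (real signals in cable or fiber are slower); the constant and function names are made up for illustration.

```python
# Back-of-envelope check of the one-way transit time.
# Assumption: the signal travels at the vacuum speed of light.
METERS_PER_MILE = 1609.344
SPEED_OF_LIGHT_M_PER_S = 299_792_458


def transit_ms(miles):
    """Return the idealized one-way travel time in milliseconds."""
    meters = miles * METERS_PER_MILE
    return meters / SPEED_OF_LIGHT_M_PER_S * 1000


print(f"{transit_ms(12_000):.1f} ms")
```

Under those assumptions the result rounds to 64 milliseconds, which matches the number quoted above.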

Internet technology was indispensable to the production of our book. I've done the trip to Australia to work directly with Rodney on The Cambridge Grammar five times already, spending in total about a year of my life there, and it's a long, long way, and it's expensive. I couldn't make the trip during the writing of this book. We had to do it entirely by exchanging WordPerfect files in email attachments. The final email message was 2,233,589 bytes — way bigger than what limits on message size used to permit. Came through without a hitch. Decoded like magic under Linux (I use the nail program; old-fashioned, but because I never use a Windows-based downloading email program, I am totally virus-proof). Downloaded to the Windows machine in a blink. Loaded under WordPerfect 11 in a split second. I should be grateful to the industry that made this possible, the industry that gave us the Internet and word processing software. And yet... Let's just say that I'm not a generous-hearted enough human being to be capable of unalloyed gratitude, given what happened toward the end of the build.

During the last frantic day of work, Rodney found he had a single chapter for which the file appeared to be corrupted. On its own, it could be loaded, viewed, and edited. But if he added it to the previous chapters, WordPerfect would freeze when he tried to look at the whole thing. If he split the book into two parts and tried to make the problem file the first file of the second part, then the second part couldn't be loaded. He worked on the problem in lonely agony for four hours straight. He struggled to find a way of using the file, by desperate strategies like retyping parts of it near where the freeze-up seemed to occur. Then eventually he told me about it.

As soon as I read his email with the description of the problem (it came in just after one telling me where I could obtain videos of incest rape if I was fresh out of them), I began thinking: open brackets. Like LaTeX groups that were never closed, or maybe nested keeps. It wasn't hardware memory limits; the whole book could be loaded all at once as long as the bad chapter on clause type wasn't in there. But something in the file was causing a buffer to open but never close. And I suspected Block Protect.

WordPerfect allows you to mark a block of text as not permitted to be interrupted by a page break. But it is unfortunately possible to not notice that you're in one of these protected blocks, and start putting other material, perhaps dozens or scores more pages, into it. And you can start another protected block inside it. WordPerfect really doesn't like that. (Why am I using WordPerfect if I object to its behavior? Compatibility with a colleague who selected it in 1989 and has built up a library of millions of words and hundreds of macros. I have already explained a bit about my experience with WordPerfect in two earlier posts.)

So I began looking for over-extended Block Protects in the apparently corrupted file, and that's what I rapidly found. A Block Protect opened, pages and pages went by, and eventually another one opened; the memory organization couldn't keep track of the illegal structure, and WordPerfect would just freeze up and stop working. Dozens and dozens of times over, without ever giving an error message that would provide a clue as to what was going on.

I deleted the first Block Protect character, and everything was fine. I sent the file to Australia and five minutes later the build could proceed. A happy ending? Well sorry, but not happy enough. I am actually furious, looking back. In the 1970s, the Unix formatter troff was of amply high enough quality to format a professional-looking printed book, it ran fast and was not crippled with bugs, and it had the capability to spot what it calls "illegal nested keeps", or blocks that are opened but haven't been closed when the file ends, or blocks that are closed when they have never been opened. TeX could (and can) do similar error diagnosis too. It is not beyond the powers of software engineering to spot problems of this kind and warn about them in the parsing phase, exiting gracefully whatever happens (by which I mean, leaving the operating system still functional; when I was still using WordPerfect 6.1 on a Windows 95 machine I noticed that every WordPerfect crash would necessitate rebooting the entire machine with loss of all unsaved data in open programs, because after a crash it was not possible for Windows to reinitialize WordPerfect and open it again).
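The kind of check being asked for is not hard to sketch. The example below is only an illustration under stated assumptions: the markers `[BLOCK_ON]` and `[BLOCK_OFF]` are hypothetical plain-text stand-ins, not WordPerfect's real internal codes, and the function is not how troff or TeX actually implement their diagnostics. It simply scans a text once and reports the three problems named above: a block opened inside another, a block closed that was never opened, and a block still open at the end of the file.

```python
import re

# Hypothetical stand-ins for "protected block on/off" codes.
# These are NOT WordPerfect's real internal codes.
TOKEN = re.compile(r"\[BLOCK_ON\]|\[BLOCK_OFF\]")


def check_blocks(text):
    """Return a list of (line_number, message) diagnostics."""
    problems = []
    open_line = None  # line on which the currently open block started
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in TOKEN.finditer(line):
            if match.group() == "[BLOCK_ON]":
                if open_line is not None:
                    problems.append(
                        (lineno, f"nested block: one opened on line {open_line} is still open")
                    )
                else:
                    open_line = lineno
            else:
                if open_line is None:
                    problems.append((lineno, "block closed but never opened"))
                open_line = None
    if open_line is not None:
        problems.append((open_line, "block opened but never closed"))
    return problems


sample = "[BLOCK_ON]\npages and pages\n[BLOCK_ON]\nmore text\n"
for lineno, message in check_blocks(sample):
    print(f"line {lineno}: {message}")
```

The point of the sketch is that the checker always returns a list of warnings and never hangs, whatever the input looks like.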

By 1990, what was the improved state of affairs? We had word processors that were orders of magnitude bigger in object code and disk space required (libraries of absurd crap like clip art), and ran slower, with editing capabilities that didn't have the regular-expression searching that Unix editors have, and the formatting algorithm didn't come with the capability to warn about what errors it had encountered in a file containing a typing error. WordPerfect would just crash or freeze or "become unstable" when something ugly happened in the document, without an error message. (Don't ask about Microsoft Word. It is, as always, much worse: WordPerfect has the crucially important Reveal Codes feature, which permits the user to look into the file as encoded internally and see what hidden formatting codes are in there. Microsoft won't supply that, although they could. Instead they put an entry in their Help files explaining that you don't need that. Right. Ve vill tell you vhat you need, worthless enduser scum!) I'm saying that word processing technology moved backward to a considerable extent in the period 1980-1990.

Word processors. Can't live with 'em, can't live without 'em. The project Rodney Huddleston and I have just completed was only possible because of modern word processing technology, which enabled us to exchange dozens of drafts a day across the Pacific in seconds. Yet at the same time, modern word processing technology nearly killed it by allowing an algorithmically detectable but unreported and obscure document content error to crash the machine, giving no error message, when one file was concatenated with another, though not when the former was opened on its own. We nearly couldn't deliver a typescript at all. If software technology was automobile technology, the roads would be running with blood and thousands of car manufacturers would be in jail. Where they would deserve to be.

Now we move forward to the stage where a copy editor will work through the typescript, and unless Cambridge University Press has warned them about who we are, they will start changing our whiches to thats and our sinces to becauses and moving our punctuation to the other side of our quotes where we didn't want them and so on and so on. I will try to keep you (or at least WolfAngel) informed with regular rants.

Posted by Geoffrey K. Pullum at July 11, 2004 07:24 PM