March 09, 2006

The December 1 DWIM effect

The damage done by well-intentioned (mis)features of MS Office is not limited to occasional dadafication of EU bureaucratese. According to Barry R Zeeberg, Joseph Riss, David W Kane, Kimberly J Bussey, Edward Uchio, W Marston Linehan, J Carl Barrett and John N Weinstein, "Mistaken Identifiers: Gene name errors can be introduced inadvertently when using Excel in bioinformatics", BMC Bioinformatics 2004, 5:80:

When we were beta-testing [two new bioinformatics programs] on microarray data, a frustrating problem occurred repeatedly: Some gene names kept bouncing back as "unknown." A little detective work revealed the reason: ... A default date conversion feature in Excel ... was altering gene names that it considered to look like dates. For example, the tumor suppressor DEC1 [Deleted in Esophageal Cancer 1] was being converted to '1-DEC.' Figure 1 lists 30 gene names that suffer an analogous fate.
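To make the failure mode concrete, here is a minimal Python sketch that flags gene symbols a spreadsheet might read as dates. The regular expression is a rough approximation of my own (a month name or abbreviation, an optional hyphen, then one or two digits); it is not Excel's actual parsing rule, and the name `looks_like_date` is hypothetical.

```python
import re

# Assumption: a month name or common abbreviation followed by one or two
# digits approximates what a spreadsheet may treat as a date.
# This is an illustrative heuristic, not Excel's real conversion rule.
_MONTHS = "JAN|FEB|MAR|MARCH|APR|MAY|JUN|JUL|AUG|SEP|SEPT|OCT|NOV|DEC"
_DATE_LIKE = re.compile(rf"^(?:{_MONTHS})-?[0-9]{{1,2}}$", re.IGNORECASE)


def looks_like_date(symbol: str) -> bool:
    """Return True if the symbol matches the date-like heuristic above."""
    return _DATE_LIKE.match(symbol.strip()) is not None


# Keep only the symbols that would need protecting before a spreadsheet import.
at_risk = [s for s in ["DEC1", "TP53", "Sept2"] if looks_like_date(s)]
```

A check like this would be run over a gene list before it goes anywhere near a spreadsheet, so that the at-risk symbols can be protected or at least counted.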

A worse problem apparently afflicts information from microarray experiments:

There is another default conversion problem for RIKEN clone identifiers of the form nnnnnnnEnn, where n denotes a digit. These identifiers are comprised of the serial number of the plate that contains the library, information on plate status, and the address of the clone. A search ... identified more than 2,000 such identifiers out of a total set of 60,770. For example, the RIKEN identifier "2310009E13" was converted irreversibly to the floating-point number "2.31E+13." A non-expert user might well fail to notice that approximately 3% of the identifiers on a microarray with tens of thousands of genes had been converted to an incorrect form, yet the potential for 2,000 identifiers to be transmogrified without notice is a considerable concern. Most important, these conversions to an internal date representation or floating-point number format are irreversible; the original gene name cannot be recovered.
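Why the original cannot be recovered is easy to see in a short Python sketch. The pattern below (digits, a single E, digits) is an assumption modelled on the nnnnnnnEnn form described in the quotation, `looks_like_float` is a hypothetical helper, and Python's `float` merely stands in for a spreadsheet's numeric parser. The second identifier is invented purely to show a collision.

```python
import re

# Assumption: digits, one E, digits is what a numeric parser will accept
# as scientific notation. Modelled on the nnnnnnnEnn form quoted above.
_SCI_LIKE = re.compile(r"^[0-9]+[Ee][0-9]+$")


def looks_like_float(identifier: str) -> bool:
    """Return True if the identifier can be read as scientific notation."""
    return _SCI_LIKE.match(identifier) is not None


clone_id = "2310009E13"
as_number = float(clone_id)  # read as mantissa 2310009, decimal exponent 13

# Once the text has become a number, the original spelling is gone.
round_trip_ok = str(as_number) == clone_id

# A made-up identifier with a different spelling parses to the same number,
# so the conversion cannot be undone from the number alone.
collides = float("23100090E12") == as_number
```

Here `round_trip_ok` is false and `collides` is true: once the text has become a number, nothing in the number says which spelling it came from.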

RIKEN microarrays are systematically affected, but other microarray results are apparently often garbled as well:

The floating-point conversion is not restricted to RIKEN clone identifiers but will affect any clone designation derived from plate coordinates. ... [If plate library references are omitted or numerical], all clones from row E of any plate are converted to floating point numbers by Excel. ... Since 96-well plates contain 8 rows and 12 columns, row E represents 12/96 or 12.5% of the clones on the plate; similarly, 6.25% of clones from 384-well plates would be affected. Most libraries contain hundreds of plates, each of which would be subject to this problem.
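The percentages in that passage are simple to check. The sketch below assumes the usual 16-row by 24-column layout for a 384-well plate (the quotation gives only the 96-well dimensions); `row_fraction` is a hypothetical helper.

```python
from fractions import Fraction


def row_fraction(rows: int, columns: int) -> Fraction:
    """Share of a plate's wells that sit in any single row."""
    return Fraction(columns, rows * columns)


# 96-well plate: 8 rows x 12 columns, as stated in the quotation.
share_96 = row_fraction(8, 12)    # 12/96

# 384-well plate: 16 rows x 24 columns (standard layout, assumed here).
share_384 = row_fraction(16, 24)  # 24/384

percent_96 = float(share_96) * 100
percent_384 = float(share_384) * 100
```

This gives 12.5 and 6.25 percent respectively, matching the figures quoted.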

If some computer virus or trojan did this sort of damage to the results of thousands of high-cost biomedical experiments, I imagine that we'd see a serious effort to put some people in jail. I'm not suggesting that any similar sort of retribution is appropriate here, but perhaps some rehabilitation would be in order, along the lines suggested below.

There's an acronym from the old days of classic AI, DWIM, standing for "Do What I Mean". The Jargon File explains:

Warren Teitelman originally wrote DWIM to fix his typos and spelling errors, so it was somewhat idiosyncratic to his style, and would often make hash of anyone else's typos if they were stylistically different. Some victims of DWIM thus claimed that the acronym stood for 'Damn Warren's Infernal Machine!'.

In one notorious incident, Warren added a DWIM feature to the command interpreter used at Xerox PARC. One day another hacker there typed delete *$ to free up some disk space. (The editor there named backup files by appending $ to the original file name, so he was trying to delete any backup files left over from old editing sessions.) It happened that there weren't any editor backup files, so DWIM helpfully reported *$ not found, assuming you meant 'delete *'. It then started to delete all the files on the disk! The hacker managed to stop it with a Vulcan nerve pinch after only a half dozen or so files were lost.

The disgruntled victim later said he had been sorely tempted to go to Warren's office, tie Warren down in his chair in front of his workstation, and then type delete *$ twice.

DWIM is often suggested in jest as a desired feature for a complex program; it is also occasionally described as the single instruction the ideal computer would have. Back when proofs of program correctness were in vogue, there were also jokes about DWIMC (Do What I Mean, Correctly).

It seems to me that all interactive programs should have a prominently-displayed switch labelled something like DEWITYD, "Do Exactly What I Tell You, Damnit!" (pronounced as "de-witted"). No doubt the results will be wrong (or even disastrous) at least as often as the results of DWIM will be; but at least you'll know exactly who to blame.
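For a small taste of what the DEWITYD attitude looks like in practice, here is a hedged Python sketch. By default, the standard `csv` module hands back every field as a string, so any conversion is something you ask for explicitly. The tiny table is invented for illustration, reusing the two identifiers quoted earlier.

```python
import csv
import io

# Hypothetical table; the signal values are made up for illustration.
raw = "identifier,signal\nDEC1,0.42\n2310009E13,1.7\n"

rows = list(csv.reader(io.StringIO(raw)))
header, body = rows[0], rows[1:]

# Identifiers stay exactly as typed because nothing converts them.
identifiers = [row[0] for row in body]

# Numbers become numbers only where the caller says so.
signals = [float(row[1]) for row in body]
```

If a number goes wrong here, it went wrong in the line that called `float`, which is at least easy to find.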

[Update: Joshua Fruhlinger, the Comics Curmudgeon, writes:

In my non-comics-mocking life, I'm an editor, and I can tell you that the first thing most editors do with a new install of Office -- especially those of us who work in jargon-heavy fields -- is turn off the auto-correction features. These options are usually buried under several levels of preference menus, but for people who are in the business of writing precisely what they want to write, they must be turned off if life is to be worth living. I am always horrified that they are left on out of the box by default.

]

Posted by Mark Liberman at March 9, 2006 05:51 PM