January 01, 2008

The miserable English alphanumeric system and its inadequacies

My current task — and my first New Year's resolution is to get on with it — is evaluating a few dozen handwritten examination scripts from the final exam in my course last semester. And (may I take five minutes to blog this without violating my resolution?) it has reminded me of the terrible design errors in the Latin alphabet, and even more so in the Arabic numeral shapes used along with it. I don't mean the spellings (they are a total mess too, but that is a totally different and much more complex issue); I am talking about the actual character shapes.

In handwriting, and also in many ill-designed fonts and LED displays, 0 looks like O and sometimes like D, 1 looks like l and also like I, 2 looks like Z, 5 looks like S, 6 looks like b, 7 is a bit too similar to 1, and 8 looks too much like B. More than half a dozen serious possibilities for error, at the very least. (One could say there are more: sometimes 9 looks too much like certain handwritten tokens of a or q.)

This is utterly unnecessary. We need only a small number of distinct shapes here. It would not be hard to select from among the nondenumerable infinity of available planar shapes a suite of 62 glyphs that are easy to write yet very unlikely to be confused with each other under ordinary visual conditions. Instead, for the 62 alphanumerics (our 26 capital letters, 26 lower-case letters, and 10 digits) we have only 46 clearly distinct shapes: A, a, B, b, c, D, d, E, e, F, f, G, g, H, h, I, i, J, j, K, k, L, l, M, m, N, n, o, p, Q, q, R, r, s, T, t, u, v, w, x, Y, y, z, 3, 4, and 9. (I grant you that one might say I have been too generous in allowing that K and k have different shapes.)

The capital letters not in the above list (C, O, P, S, U, V, W, X, Z) are distinguished from their lower-case counterparts almost entirely by size, which does not permit accurate identification in many people's non-cursive handwriting (the discipline of respecting x-heights and placing ascenders and descenders outside the x-height is a subtlety only dimly remembered). Taking this together with the digits that are confusable with letter shapes, the overall evaluation must be that the system is disgraceful.
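For anyone who wants to check the arithmetic, here is a minimal Python sketch that recounts the tally above. The list of 46 shapes is copied from the post; the variable names are just illustrative. It confirms that the 16 characters left out of the list are exactly the nine size-only capitals plus the seven digits that look like letters.

```python
import string

# The 46 shapes the post lists as clearly distinct (copied from the text).
DISTINCT = set("AaBbcDdEeFfGgHhIiJjKkLlMmNnopQqRrsTtuvwxYyz349")

# All 62 alphanumerics: 26 capitals, 26 lower-case letters, 10 digits.
ALPHANUMERICS = set(string.ascii_uppercase + string.ascii_lowercase + string.digits)

# Whatever is not in the "clearly distinct" list, split by kind.
missing = ALPHANUMERICS - DISTINCT
missing_capitals = sorted(c for c in missing if c.isupper())
missing_digits = sorted(c for c in missing if c.isdigit())
missing_lowercase = sorted(c for c in missing if c.islower())

print(len(ALPHANUMERICS), len(DISTINCT), len(missing))
print("capitals:", "".join(missing_capitals))
print("digits:", "".join(missing_digits))
```

So 62 minus 9 minus 7 does come to 46, and no lower-case letter is missing from the list.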

And there are contexts (URLs, login names, passwords, filenames, student ID numbers, bank codes, mathematical and logical formulae, brand names, street addresses and proper names in foreign languages) where such matters are absolutely vital and the confusions can really cause trouble.
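To make the risk concrete, here is a rough Python sketch that flags the characters in a string (a login name or an ID number, say) that take part in any of the digit/letter confusions listed earlier. The pairings come from the post; the table name `CONFUSABLE_WITH` and the function `risky_characters` are made up for illustration, and the less serious 9/a/q case is deliberately left out.

```python
# Digit-to-letter look-alikes as listed in the post. The 7 ~ 1 pairing is a
# digit/digit confusion, so it is recorded under both digits.
CONFUSABLE_WITH = {
    "0": "OD",
    "1": "lI7",
    "2": "Z",
    "5": "S",
    "6": "b",
    "7": "1",
    "8": "B",
}


def risky_characters(text):
    """Return, in order, the characters of text that appear in any pairing."""
    risky = set(CONFUSABLE_WITH)
    for partners in CONFUSABLE_WITH.values():
        risky.update(partners)
    return [ch for ch in text if ch in risky]


print(risky_characters("B0b"))    # every character here is confusable
print(risky_characters("x3y4"))   # none of these are
```

This is only a membership check on the pairings named in the post, not a measure of how confusable a given person's handwriting actually is.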

Some writing systems (I will not provoke jealousies or ethnic rivalries by citing examples) do much better. And others do even worse, of course.

Posted by Geoffrey K. Pullum at January 1, 2008 12:50 PM