Unless you've been living in a cave you've probably heard that Bernardo Provenzano,
suspected of being the *capo di tutti capi* of the Sicilian mafia,
was recently captured after eluding the police for over forty years.
What hasn't been emphasized in many of the news stories is the fact that
his capture was due to his lack of knowledge of linguistics.

According to
this account
at the *Discovery Channel*, Provenzano was caught after police intercepted
messages between Provenzano and other members of his organization written in a
code that they were able to break. Details can be found at
http://www.bernardoprovenzano.net
(in Italian).

Actually, in technical terms, they weren't using a code but a cipher, a variant
of the *Caesar Cipher*, said to have been used by Julius Caesar. The Caesar
Cipher is one in which each letter of the alphabet is mapped onto the letter three
places farther down. In the version used by Provenzano, each letter is first
replaced by the corresponding ordinal, to which three is then added. Here is the
resulting cipher:

Code | Letter | Code | Letter |
---|---|---|---|

4 | a | 15 | n |

5 | b | 16 | o |

6 | c | 17 | p |

7 | d | 18 | q |

8 | e | 19 | r |

9 | f | 20 | s |

10 | g | 21 | t |

11 | h | 22 | u |

12 | i | 23 | v |

13 | l | 24 | z |

14 | m |

Here is the cipher text of a bit of a message from Angelo Provenzano to his father:

1012234151512 14819647415218

and here is the decryption:

10 12 23 4 15 15 12 14 8 19 6 4 7 4 15 21 8 g i v a n n i m e r c a d a n t e

(Mr. Provenzano misspelled *Giovanni*.)

Ciphers of this type have been around for a very long time and are known to be easy to break. If you suspect you're dealing with a shift cipher, you can check without all that much work since there are only N of them for an alphabet of N letters, and one, the one with shift N, doesn't conceal anything. The class of monoalphabetic substitution ciphers, in which each letter is replaced by some other letter, is much larger: for an alphabet of size N, there are N! of them. The number of different substitution ciphers on an alphabet of 21 letters is 510,909,400,000,000,000,000,000 (over 500 sextillion). That is far too many to break by brute force without a computer. Even with a computer able to check 1,000,000,000 possibilities per second it would take 1618.974 years. This kind of problem is easily parallelizable, meaning that many computers can work on it at the same time, so you can cut the time by using more machines. That only gets you so far though. If you put 1,000 computers to work it would still take 1.619 years. It is worth noting, too, that Italian makes the task easier since it has only 21 letters. For an alphabet of 26 letters like that of English, there are 403,291,500,000,000,000,000,000,000 (over 403 septillion) monoalphabetic substitution ciphers. With those same 1,000 computers trying them all would take 12,779,535 years.

In point of fact, monoalphabetic substitution ciphers are easy to break if you've
got a reasonable amount of cipher text. The reason is that the frequency of the
letters in natural languages is widely dispersed. Here are the frequencies of the
characters in an English version of Machiavelli's *The Prince*:

Rank | Percent | Count | Code | Character |
---|---|---|---|---|

1 | 17.213 | 29,900 | 0x20 | SPACE |

2 | 10.417 | 18,095 | 0x65 | e |

3 | 7.357 | 12,779 | 0x74 | t |

4 | 6.121 | 10,633 | 0x6F | o |

5 | 5.993 | 10,411 | 0x61 | a |

6 | 5.755 | 9,996 | 0x69 | i |

7 | 5.619 | 9,760 | 0x6E | n |

8 | 5.328 | 9,255 | 0x68 | h |

9 | 5.014 | 8,710 | 0x73 | s |

10 | 4.619 | 8,024 | 0x72 | r |

11 | 3.129 | 5,435 | 0x64 | d |

12 | 2.714 | 4,715 | 0x6C | l |

13 | 2.247 | 3,904 | 0x63 | c |

14 | 2.129 | 3,698 | 0x75 | u |

15 | 1.918 | 3,331 | 0x66 | f |

16 | 1.895 | 3,291 | 0x6D | m |

17 | 1.624 | 2,821 | 0x0A | LINEFEED |

18 | 1.623 | 2,819 | 0x77 | w |

19 | 1.501 | 2,607 | 0x2C | , |

20 | 1.474 | 2,560 | 0x79 | y |

21 | 1.357 | 2,357 | 0x70 | p |

22 | 1.315 | 2,285 | 0x62 | b |

23 | 1.253 | 2,177 | 0x67 | g |

24 | 0.883 | 1,533 | 0x76 | v |

25 | 0.398 | 692 | 0x2E | . |

26 | 0.379 | 659 | 0x6B | k |

27 | 0.212 | 369 | 0x3B | ; |

28 | 0.175 | 304 | 0x78 | x |

29 | 0.088 | 153 | 0x71 | q |

30 | 0.075 | 131 | 0x6A | j |

31 | 0.059 | 102 | 0x7A | z |

32 | 0.033 | 57 | 0x2D | - |

33 | 0.024 | 41 | 0x3A | : |

34 | 0.014 | 25 | 0x27 | ' |

35 | 0.008 | 14 | 0x2A | * |

36 | 0.006 | 11 | 0x28 | ( |

37 | 0.006 | 11 | 0x29 | ) |

38 | 0.006 | 11 | 0x3F | ? |

39 | 0.003 | 6 | 0x5D | ] |

40 | 0.003 | 6 | 0x5B | [ |

41 | 0.002 | 4 | 0x31 | 1 |

42 | 0.002 | 4 | 0x22 | " |

43 | 0.002 | 3 | 0x35 | 5 |

44 | 0.001 | 2 | 0x34 | 4 |

45 | 0.001 | 1 | 0x32 | 2 |

46 | 0.001 | 1 | 0x38 | 8 |

47 | 0.001 | 1 | 0x33 | 3 |

48 | 0.001 | 1 | 0x30 | 0 |

49 | 0.001 | 1 | 0x36 | 6 |

As you can see, after the space character the letter **e** is the most frequent by a substantial margin, followed by
the letter **t**. The details of the rank ordering will vary with the nature and amount of text that you analyze,
but the first few are pretty stable. If you guess that the most frequent letter of a cipher text is **e** and that the
next most frequent letter is **t**, you will be right most of the time, and once you have these two, you won't find it
too difficult to figure out the rest. Of course the letter frequencies in very short messages
may be rather far off so it helps to have a reasonable amount of cipher text.

If only Mr. Provenzano had known a little about linguistics and mathematics, he could have chosen a much more secure form of encryption.

PS: Those wishing to calculate numbers of substitution ciphers and the like yourselves and wondering how to do it
should check out R, the Free statistical system that I used. I obtained the frequency counts using the *unihist* program from my unidesc package, after first using the Unix utility tr to convert the entire text to lower-case.