Unless you've been living in a cave you've probably heard that Bernardo Provenzano, suspected of being the capo di tutti capi of the Sicilian mafia, was recently captured after eluding the police for over forty years. What hasn't been emphasized in many of the news stories is the fact that his capture was due to his lack of knowledge of linguistics.
According to this account at the Discovery Channel, Provenzano was caught after police intercepted messages between Provenzano and other members of his organization written in a code that they were able to break. Details can be found at http://www.bernardoprovenzano.net (in Italian).
Actually, in technical terms, they weren't using a code but a cipher, a variant of the Caesar Cipher, said to have been used by Julius Caesar. The Caesar Cipher is one in which each letter of the alphabet is mapped onto the letter three places farther down. In the version used by Provenzano, each letter is first replaced by the corresponding ordinal, to which three is then added. Here is the resulting cipher:
Code | Letter | Code | Letter |
---|---|---|---|
4 | a | 15 | n |
5 | b | 16 | o |
6 | c | 17 | p |
7 | d | 18 | q |
8 | e | 19 | r |
9 | f | 20 | s |
10 | g | 21 | t |
11 | h | 22 | u |
12 | i | 23 | v |
13 | l | 24 | z |
14 | m |
Here is the cipher text of a bit of a message from Angelo Provenzano to his father:
1012234151512 14819647415218
and here is the decryption:
10 12 23 4 15 15 12 14 8 19 6 4 7 4 15 21 8 g i v a n n i m e r c a d a n t e
(Mr. Provenzano misspelled Giovanni.)
Ciphers of this type have been around for a very long time and are known to be easy to break. If you suspect you're dealing with a shift cipher, you can check without all that much work since there are only N of them for an alphabet of N letters, and one, the one with shift N, doesn't conceal anything. The class of monoalphabetic substitution ciphers, in which each letter is replaced by some other letter, is much larger: for an alphabet of size N, there are N! of them. The number of different substitution ciphers on an alphabet of 21 letters is 510,909,400,000,000,000,000,000 (over 500 sextillion). That is far too many to break by brute force without a computer. Even with a computer able to check 1,000,000,000 possibilities per second it would take 1618.974 years. This kind of problem is easily parallelizable, meaning that many computers can work on it at the same time, so you can cut the time by using more machines. That only gets you so far though. If you put 1,000 computers to work it would still take 1.619 years. It is worth noting, too, that Italian makes the task easier since it has only 21 letters. For an alphabet of 26 letters like that of English, there are 403,291,500,000,000,000,000,000,000 (over 403 septillion) monoalphabetic substitution ciphers. With those same 1,000 computers trying them all would take 12,779,535 years.
In point of fact, monoalphabetic substitution ciphers are easy to break if you've got a reasonable amount of cipher text. The reason is that the frequency of the letters in natural languages is widely dispersed. Here are the frequencies of the characters in an English version of Machiavelli's The Prince:
Rank | Percent | Count | Code | Character |
---|---|---|---|---|
1 | 17.213 | 29,900 | 0x20 | SPACE |
2 | 10.417 | 18,095 | 0x65 | e |
3 | 7.357 | 12,779 | 0x74 | t |
4 | 6.121 | 10,633 | 0x6F | o |
5 | 5.993 | 10,411 | 0x61 | a |
6 | 5.755 | 9,996 | 0x69 | i |
7 | 5.619 | 9,760 | 0x6E | n |
8 | 5.328 | 9,255 | 0x68 | h |
9 | 5.014 | 8,710 | 0x73 | s |
10 | 4.619 | 8,024 | 0x72 | r |
11 | 3.129 | 5,435 | 0x64 | d |
12 | 2.714 | 4,715 | 0x6C | l |
13 | 2.247 | 3,904 | 0x63 | c |
14 | 2.129 | 3,698 | 0x75 | u |
15 | 1.918 | 3,331 | 0x66 | f |
16 | 1.895 | 3,291 | 0x6D | m |
17 | 1.624 | 2,821 | 0x0A | LINEFEED |
18 | 1.623 | 2,819 | 0x77 | w |
19 | 1.501 | 2,607 | 0x2C | , |
20 | 1.474 | 2,560 | 0x79 | y |
21 | 1.357 | 2,357 | 0x70 | p |
22 | 1.315 | 2,285 | 0x62 | b |
23 | 1.253 | 2,177 | 0x67 | g |
24 | 0.883 | 1,533 | 0x76 | v |
25 | 0.398 | 692 | 0x2E | . |
26 | 0.379 | 659 | 0x6B | k |
27 | 0.212 | 369 | 0x3B | ; |
28 | 0.175 | 304 | 0x78 | x |
29 | 0.088 | 153 | 0x71 | q |
30 | 0.075 | 131 | 0x6A | j |
31 | 0.059 | 102 | 0x7A | z |
32 | 0.033 | 57 | 0x2D | - |
33 | 0.024 | 41 | 0x3A | : |
34 | 0.014 | 25 | 0x27 | ' |
35 | 0.008 | 14 | 0x2A | * |
36 | 0.006 | 11 | 0x28 | ( |
37 | 0.006 | 11 | 0x29 | ) |
38 | 0.006 | 11 | 0x3F | ? |
39 | 0.003 | 6 | 0x5D | ] |
40 | 0.003 | 6 | 0x5B | [ |
41 | 0.002 | 4 | 0x31 | 1 |
42 | 0.002 | 4 | 0x22 | " |
43 | 0.002 | 3 | 0x35 | 5 |
44 | 0.001 | 2 | 0x34 | 4 |
45 | 0.001 | 1 | 0x32 | 2 |
46 | 0.001 | 1 | 0x38 | 8 |
47 | 0.001 | 1 | 0x33 | 3 |
48 | 0.001 | 1 | 0x30 | 0 |
49 | 0.001 | 1 | 0x36 | 6 |
As you can see, after the space character the letter e is the most frequent by a substantial margin, followed by the letter t. The details of the rank ordering will vary with the nature and amount of text that you analyze, but the first few are pretty stable. If you guess that the most frequent letter of a cipher text is e and that the next most frequent letter is t, you will be right most of the time, and once you have these two, you won't find it too difficult to figure out the rest. Of course the letter frequencies in very short messages may be rather far off so it helps to have a reasonable amount of cipher text.
If only Mr. Provenzano had known a little about linguistics and mathematics, he could have chosen a much more secure form of encryption.
PS: Those wishing to calculate numbers of substitution ciphers and the like yourselves and wondering how to do it should check out R, the Free statistical system that I used. I obtained the frequency counts using the unihist program from my unidesc package, after first using the Unix utility tr to convert the entire text to lower-case.
Posted by Bill Poser at April 21, 2006 04:27 PM