April 21, 2006

The Provenzano Code

Unless you've been living in a cave you've probably heard that Bernardo Provenzano, suspected of being the capo di tutti capi of the Sicilian mafia, was recently captured after eluding the police for over forty years. What hasn't been emphasized in many of the news stories is the fact that his capture was due to his lack of knowledge of linguistics.

According to this account at the Discovery Channel, Provenzano was caught after police intercepted messages between Provenzano and other members of his organization written in a code that they were able to break. Details can be found at http://www.bernardoprovenzano.net (in Italian).

Actually, in technical terms, they weren't using a code but a cipher, a variant of the Caesar Cipher, said to have been used by Julius Caesar. The Caesar Cipher is one in which each letter of the alphabet is mapped onto the letter three places farther down. In the version used by Provenzano, each letter is first replaced by the corresponding ordinal, to which three is then added. Here is the resulting cipher:

CodeLetterCodeLetter
4a15n
5b16o
6c17p
7d18q
8e19r
9f20s
10g21t
11h22u
12i23v
13l24z
14m

Here is the cipher text of a bit of a message from Angelo Provenzano to his father:

1012234151512 14819647415218

and here is the decryption:

10 12 23 4 15 15 12	14 8 19 6 4 7 4 15 21 8
 g  i v  a  n  n  i	 m e  r c a d a  n  t e

(Mr. Provenzano misspelled Giovanni.)

Ciphers of this type have been around for a very long time and are known to be easy to break. If you suspect you're dealing with a shift cipher, you can check without all that much work since there are only N of them for an alphabet of N letters, and one, the one with shift N, doesn't conceal anything. The class of monoalphabetic substitution ciphers, in which each letter is replaced by some other letter, is much larger: for an alphabet of size N, there are N! of them. The number of different substitution ciphers on an alphabet of 21 letters is 510,909,400,000,000,000,000,000 (over 500 sextillion). That is far too many to break by brute force without a computer. Even with a computer able to check 1,000,000,000 possibilities per second it would take 1618.974 years. This kind of problem is easily parallelizable, meaning that many computers can work on it at the same time, so you can cut the time by using more machines. That only gets you so far though. If you put 1,000 computers to work it would still take 1.619 years. It is worth noting, too, that Italian makes the task easier since it has only 21 letters. For an alphabet of 26 letters like that of English, there are 403,291,500,000,000,000,000,000,000 (over 403 septillion) monoalphabetic substitution ciphers. With those same 1,000 computers trying them all would take 12,779,535 years.

In point of fact, monoalphabetic substitution ciphers are easy to break if you've got a reasonable amount of cipher text. The reason is that the frequency of the letters in natural languages is widely dispersed. Here are the frequencies of the characters in an English version of Machiavelli's The Prince:

Rank Percent Count   Code   Character  
117.21329,9000x20SPACE
210.41718,0950x65e
37.35712,7790x74t
46.12110,6330x6Fo
55.99310,4110x61a
65.7559,9960x69i
75.6199,7600x6En
85.3289,2550x68h
95.0148,7100x73s
104.6198,0240x72r
113.1295,4350x64d
122.7144,7150x6Cl
132.2473,9040x63c
142.1293,6980x75u
151.9183,3310x66f
161.8953,2910x6Dm
171.6242,8210x0ALINEFEED
181.6232,8190x77w
191.5012,6070x2C,
201.4742,5600x79y
211.3572,3570x70p
221.3152,2850x62b
231.2532,1770x67g
240.8831,5330x76v
250.3986920x2E.
260.3796590x6Bk
270.2123690x3B;
280.1753040x78x
290.0881530x71q
300.0751310x6Aj
310.0591020x7Az
320.033570x2D-
330.024410x3A:
340.014250x27'
350.008140x2A*
360.006110x28(
370.006110x29)
380.006110x3F?
390.00360x5D]
400.00360x5B[
410.00240x311
420.00240x22"
430.00230x355
440.00120x344
450.00110x322
460.00110x388
470.00110x333
480.00110x300
490.00110x366

As you can see, after the space character the letter e is the most frequent by a substantial margin, followed by the letter t. The details of the rank ordering will vary with the nature and amount of text that you analyze, but the first few are pretty stable. If you guess that the most frequent letter of a cipher text is e and that the next most frequent letter is t, you will be right most of the time, and once you have these two, you won't find it too difficult to figure out the rest. Of course the letter frequencies in very short messages may be rather far off so it helps to have a reasonable amount of cipher text.

If only Mr. Provenzano had known a little about linguistics and mathematics, he could have chosen a much more secure form of encryption.


PS: Those wishing to calculate numbers of substitution ciphers and the like yourselves and wondering how to do it should check out R, the Free statistical system that I used. I obtained the frequency counts using the unihist program from my unidesc package, after first using the Unix utility tr to convert the entire text to lower-case.

Posted by Bill Poser at April 21, 2006 04:27 PM