Language Log: The Provenzano Code

April 21, 2006

The Provenzano Code

Unless you've been living in a cave you've probably heard that Bernardo Provenzano, suspected of being the capo di tutti capi of the Sicilian mafia, was recently captured after eluding the police for over forty years. What hasn't been emphasized in many of the news stories is the fact that his capture was due to his lack of knowledge of linguistics.

According to this account at the Discovery Channel, Provenzano was caught after police intercepted messages between Provenzano and other members of his organization written in a code that they were able to break. Details can be found at http://www.bernardoprovenzano.net (in Italian).

Actually, in technical terms, they weren't using a code but a cipher, a variant of the Caesar Cipher, said to have been used by Julius Caesar. The Caesar Cipher is one in which each letter of the alphabet is mapped onto the letter three places farther down. In the version used by Provenzano, each letter is first replaced by the corresponding ordinal, to which three is then added. Here is the resulting cipher:

Code	Letter	Code	Letter
4	a	15	n
5	b	16	o
6	c	17	p
7	d	18	q
8	e	19	r
9	f	20	s
10	g	21	t
11	h	22	u
12	i	23	v
13	l	24	z
14	m

Here is the cipher text of a bit of a message from Angelo Provenzano to his father:

1012234151512 14819647415218

and here is the decryption:

10 12 23 4 15 15 12	14 8 19 6 4 7 4 15 21 8
 g  i v  a  n  n  i	 m e  r c a d a  n  t e

(Mr. Provenzano misspelled Giovanni.)

Ciphers of this type have been around for a very long time and are known to be easy to break. If you suspect you're dealing with a shift cipher, you can check without all that much work since there are only N of them for an alphabet of N letters, and one, the one with shift N, doesn't conceal anything. The class of monoalphabetic substitution ciphers, in which each letter is replaced by some other letter, is much larger: for an alphabet of size N, there are N! of them. The number of different substitution ciphers on an alphabet of 21 letters is 510,909,400,000,000,000,000,000 (over 500 sextillion). That is far too many to break by brute force without a computer. Even with a computer able to check 1,000,000,000 possibilities per second it would take 1618.974 years. This kind of problem is easily parallelizable, meaning that many computers can work on it at the same time, so you can cut the time by using more machines. That only gets you so far though. If you put 1,000 computers to work it would still take 1.619 years. It is worth noting, too, that Italian makes the task easier since it has only 21 letters. For an alphabet of 26 letters like that of English, there are 403,291,500,000,000,000,000,000,000 (over 403 septillion) monoalphabetic substitution ciphers. With those same 1,000 computers trying them all would take 12,779,535 years.

In point of fact, monoalphabetic substitution ciphers are easy to break if you've got a reasonable amount of cipher text. The reason is that the frequency of the letters in natural languages is widely dispersed. Here are the frequencies of the characters in an English version of Machiavelli's The Prince:

Rank	Percent	Count	Code	Character
1	17.213	29,900	0x20	SPACE
2	10.417	18,095	0x65	e
3	7.357	12,779	0x74	t
4	6.121	10,633	0x6F	o
5	5.993	10,411	0x61	a
6	5.755	9,996	0x69	i
7	5.619	9,760	0x6E	n
8	5.328	9,255	0x68	h
9	5.014	8,710	0x73	s
10	4.619	8,024	0x72	r
11	3.129	5,435	0x64	d
12	2.714	4,715	0x6C	l
13	2.247	3,904	0x63	c
14	2.129	3,698	0x75	u
15	1.918	3,331	0x66	f
16	1.895	3,291	0x6D	m
17	1.624	2,821	0x0A	LINEFEED
18	1.623	2,819	0x77	w
19	1.501	2,607	0x2C	,
20	1.474	2,560	0x79	y
21	1.357	2,357	0x70	p
22	1.315	2,285	0x62	b
23	1.253	2,177	0x67	g
24	0.883	1,533	0x76	v
25	0.398	692	0x2E	.
26	0.379	659	0x6B	k
27	0.212	369	0x3B	;
28	0.175	304	0x78	x
29	0.088	153	0x71	q
30	0.075	131	0x6A	j
31	0.059	102	0x7A	z
32	0.033	57	0x2D	-
33	0.024	41	0x3A	:
34	0.014	25	0x27	'
35	0.008	14	0x2A	*
36	0.006	11	0x28	(
37	0.006	11	0x29	)
38	0.006	11	0x3F	?
39	0.003	6	0x5D	]
40	0.003	6	0x5B	[
41	0.002	4	0x31	1
42	0.002	4	0x22	"
43	0.002	3	0x35	5
44	0.001	2	0x34	4
45	0.001	1	0x32	2
46	0.001	1	0x38	8
47	0.001	1	0x33	3
48	0.001	1	0x30	0
49	0.001	1	0x36	6

As you can see, after the space character the letter e is the most frequent by a substantial margin, followed by the letter t. The details of the rank ordering will vary with the nature and amount of text that you analyze, but the first few are pretty stable. If you guess that the most frequent letter of a cipher text is e and that the next most frequent letter is t, you will be right most of the time, and once you have these two, you won't find it too difficult to figure out the rest. Of course the letter frequencies in very short messages may be rather far off so it helps to have a reasonable amount of cipher text.

If only Mr. Provenzano had known a little about linguistics and mathematics, he could have chosen a much more secure form of encryption.

PS: Those wishing to calculate numbers of substitution ciphers and the like yourselves and wondering how to do it should check out R, the Free statistical system that I used. I obtained the frequency counts using the unihist program from my unidesc package, after first using the Unix utility tr to convert the entire text to lower-case.

Posted by Bill Poser at April 21, 2006 04:27 PM