November 05, 2003

Zettascale Linguistics

In a presentation on cluster computing, I found the phrase:

"5 Exabytes: All words ever spoken by human beings"

The authors are Philip Papadopoulos, Greg Bruno and Mason Katz, of the San Diego Supercomputer Center, and the presentation seems to be one of a series that was given in Singapore in April of 2002.

The phrase means that digital storage amounting to 5 * 10^18 bytes would suffice to store everything that every human being has ever said. This is compared with the expected storage capacity of a modest ($300K-cost) computer cluster in 2007, which is listed at 1.2 exabytes, only about 4 times smaller. In fact this calculation seems to be wrong, by a factor of 8 million or so -- but never mind, the correction just puts things off for another couple of decades :-)... Despite the mistake, I have to exclaim "oh brave new world, that has such calculations in it!"

The context is an extrapolation of current trends forward to 2007. The authors discuss the likely future of commodity disk technology, and conclude (on slide 29) that in 2007, a "conservative" serial ATA disk will offer 1680 GB for a price of $46 (US), while an "aggressive" disk will provide 5120 GB for $142 (US).

After discussing trends in other components as well, they give a picture of a "2007 cluster" (slide 37 translated from ppt into html, emphasis mine):

  • 4 TFLOPS
    • 128 dual processor compute nodes
    • 3rd on current TOP500 list
      • 2nd place is PSC Terascale cluster
  • 2.3 TB main memory
  • 1.2 EB storage
    • 2 disks per node
    • 5 Exabytes: All words ever spoken by human beings
  • 12.8 Tb/s aggregate network I/O
  • System cost: USD$300,000
    • PSC Terascale cluster = USD$35 million

The idea seems to be that each of 128 cluster nodes will have two "aggressive" 5.12 terabyte disks, which will collectively provide 1.2 exabytes. In order to impress us with how much this is, the authors tell us in an aside that 5 exabytes would suffice to store "all words ever spoken by human beings."

Truly an impressive (if horrifying) thought.

And I'm impressed enough, in advance, by being able to get 5-terabyte disks for $142 each.

However, I believe that this slide contains two numerical errors. First, the proposed configuration would amount to 1.2 petabytes, which is a thousand times smaller than 1.2 exabytes. Second, a 5 exabyte store would be roughly eight thousand times too small to store "all words ever spoken by human beings", at least in audio form. Therefore the 2007 cluster's storage would be too small by a factor of about 32 million rather than a factor of 4. I freely confess that maybe the authors were thinking about text -- but in the first place I'm a phonetician, and in the second place most human languages have not had a written form. So bear with me here for a while.

First, the cluster storage sum.
128 * 5120 * 10^9 * 2 = 1.31072 * 10^15
(128 cluster nodes, 5120 GB per disk, 2 disks per node). This is ~ 1.3 petabytes -- a petabyte is 10^15 bytes -- not 1.3 exabytes -- an exabyte is 10^18 bytes. (The change from 1.3 to 1.2 presumably has to do with disk format issues).
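
For anyone who wants to check the sum, here is a minimal Python sketch of the same arithmetic. The variable names are mine, not the presenters', and it assumes decimal units (1 GB = 10^9 bytes).

```python
# Cluster storage check, assuming decimal units (1 GB = 10**9 bytes).
# Variable names are illustrative, not taken from the slides.
NODES = 128            # compute nodes in the "2007 cluster"
DISKS_PER_NODE = 2
GB_PER_DISK = 5120     # the "aggressive" disk from slide 29

total_bytes = NODES * DISKS_PER_NODE * GB_PER_DISK * 10**9

PETABYTE = 10**15
EXABYTE = 10**18

print(f"{total_bytes:.5e} bytes")            # 1.31072e+15 bytes
print(f"{total_bytes / PETABYTE:.2f} PB")    # 1.31 PB
print(f"{total_bytes / EXABYTE:.5f} EB")     # 0.00131 EB
```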

Second, the storage requirements for all human speech. There are said to have been 1 billion people in 1800, 1.6 billion people in 1900, and 6.1 billion people in 2000. So let's assume that 10 billion people have lived an average of 50 years, speaking for 2 hours a day on average throughout their lives. This is
10 * 10^9 * 50 * 365 * 2 * 60 * 60 = 1.314 * 10^18 seconds.
If we assume 16 KHz 16-bit linear single-channel audio, at 32KB per second, we've got
1.314 * 10^18 * 3.2 * 10^4 = 4.205 * 10^22 bytes.

This is 42 zettabytes (a zettabyte is 10^21 bytes), and is more than 8 thousand times more than 5 exabytes, and thus more than 32 million times larger than the projected storage of the 2007 computer cluster.
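
The whole estimate fits in a few lines of Python, which makes it easy to change the inputs and see what happens. This is only a sketch: every constant is one of the rough assumptions above, the names are mine, and the cluster figure uses decimal gigabytes.

```python
# Rough estimate of the storage needed for all human speech as audio.
# Every input is an assumption from the text above and can be adjusted.
PEOPLE = 10 * 10**9            # people who have ever lived (assumed)
YEARS_PER_LIFE = 50            # average lifespan in years (assumed)
HOURS_TALKING_PER_DAY = 2      # average daily speaking time (assumed)
BYTES_PER_SECOND = 16_000 * 2  # 16 kHz, 16-bit (2-byte) mono samples

seconds_of_speech = PEOPLE * YEARS_PER_LIFE * 365 * HOURS_TALKING_PER_DAY * 3600
total_bytes = seconds_of_speech * BYTES_PER_SECOND

EXABYTE = 10**18
ZETTABYTE = 10**21
claimed_bytes = 5 * EXABYTE               # the "5 exabytes" figure
cluster_bytes = 128 * 2 * 5120 * 10**9    # the 2007 cluster, decimal GB

print(f"seconds of speech: {seconds_of_speech:.3e}")
print(f"audio bytes:       {total_bytes:.3e} ({total_bytes / ZETTABYTE:.0f} ZB)")
print(f"vs 5 EB:           {total_bytes / claimed_bytes:,.0f}x")
print(f"vs 2007 cluster:   {total_bytes / cluster_bytes:,.0f}x")
```

Halving or doubling any one input moves the result by the same factor, so it takes a lot of adjustment to get out of the zettabyte range.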

All these numbers -- number of people, amount of talking, audio encoding, etc. -- could be adjusted up or down by modest factors, but I believe that any way you slice it, "all words ever spoken by human beings" is a zettascale project. Unless I've screwed up the arithmetic, which is entirely possible, since Papadopoulos et al. did, and I'm sure they're less likely to drop a few orders of magnitude early in the morning than I am :-).

[Note: the 5-exabytes-for-all-human-speech meme seems to be proverbial -- scroll down the hyperlink to the definition for exabyte, where you'll find that "It has been said that 5 Exabytes would be equal to all of the words ever spoken by mankind".]

[Also: given that disk price/performance continues to improve by a factor of two every year, it will take an additional 25 years to take care of the needed factor of 32 million (2^25 = 33,554,432). So we're talking about the typical cluster of the year 2032 -- except that some form of Stein's theorem is likely to intervene -- unless Davies' Corollaries apply...]
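
The doubling arithmetic in the note above can be sketched as follows. It assumes, as the note does, that capacity per dollar doubles exactly once a year, which is an idealization rather than a forecast.

```python
import math

# Years of annual doubling needed to close a given capacity gap.
# Assumes capacity per dollar doubles exactly once a year (an idealization).
gap = 32_000_000                     # shortfall factor estimated earlier
years = math.ceil(math.log2(gap))    # smallest whole number of doublings

print(years)          # 25
print(2 ** years)     # 33554432
print(2007 + years)   # 2032
```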

[Update 11/12/2003: the canard that "Five exabytes... is equivalent to all words ever spoken by humans since the dawn of time" was repeated in this 11/11/2003 NYT article. It's amazing how people pass this stuff around without checking it or thinking it through: Eskimo snow words all over again, though on a much smaller scale.

The Dutch periodical Onzetaal linked to the NYT article and also to this post -- maybe the internet culture can start to keep these small thoughtless "idées reçues" in check.]

[Update 1/3/2003: Adam Morris wrote to explain:

Gigabyte is a confusing unit, similar to billion (one thousand million or one million million? I'm used to both now and assume that unless explicitly mentioned Brits mean the larger while Americans mean the smaller...) A gigabyte should be 10^9 bytes, but as computer people frequently deal in binary, it is also used to mean 2^30. As 2^10 is 1024 this is frequently used as a multiplier in disk sizes and memory. This would make a terabyte, not 10^12 but 2^40 bytes. A 5120 GB disk would thus be five terabytes, and two of them would be ten terabytes. This gives us 1,280 terabytes, or 1.25 petabytes (2^50 not 10^15). Thus the change from 1.3 to 1.2 is to do with the actual size of the units involved. Disk drive manufacturers usually use 10^X as it makes the disks seem bigger than the 2^Y maths used elsewhere.

I guess I sort of knew that, but neglected to bring it to bear on the calculations above. I'm grateful for the clarification.
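
To make Adam's point concrete, here is a small sketch that counts the same disks both ways. It assumes, following his explanation, that the slide's "5120 GB" is meant in binary units (5120 * 2^30 bytes); the variable names are mine.

```python
# The same cluster, read with decimal versus binary prefixes.
# Assumes, following Adam Morris, that "5120 GB" may mean 5120 * 2**30 bytes.
disks = 128 * 2
gb_per_disk = 5120

decimal_pb = disks * gb_per_disk * 10**9 / 10**15   # GB = 10**9, PB = 10**15
binary_pb = disks * gb_per_disk * 2**30 / 2**50     # GB = 2**30, PB = 2**50

print(f"decimal: {decimal_pb:.2f} PB")   # decimal: 1.31 PB
print(f"binary:  {binary_pb:.2f} PB")    # binary:  1.25 PB
```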

I've heard from various other people with observations about better ways to estimate the total number of person-years in human history to the present; about alternative notions of how much talking people do; about audio encoding and audio compression methods; and so on. None of these seems to make more than an order of magnitude difference at most (mostly a factor of two or thereabouts), and the effects are sometimes to increase the estimate, and sometimes to decrease it. So I'll stand pat for now.

With respect to the number of people who have ever lived, Brian Carnell argues (with a reference) that it's closer to 100 billion than 10 billion. I haven't studied the source, but I'll accept the correction -- except that as Brian also observes, the figures deal with the number of humans who have ever been born, and during much of human history, most folks died pretty young, making my 50-year-life-span estimate far too high. The cited reference (a paper by Carl Haub) says that "[l]ife expectancy at birth probably averaged only about 10 years for most of human history". So rather than a ten-fold increase, there might be as little as a two-fold increase.
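
The two-fold figure comes from comparing total person-years under the two sets of assumptions. A minimal sketch, treating the 10-year life expectancy as if it applied to everyone ever born (which is why it gives the low end of the range):

```python
# Person-years of human life under the two sets of assumptions above.
original = 10 * 10**9 * 50    # 10 billion people, 50-year average lifespan
revised = 100 * 10**9 * 10    # ~100 billion births, ~10-year life expectancy

print(revised / original)     # 2.0
```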

For those who care, here's a table of representative audio encoding rates. I chose 32 KB/sec -- roughly the quality of FM broadcasts -- as the data rate. One could use lossless encoding to lower this by a factor of two or so; one could use lossy coding (like MP3) to get higher perceptual quality in the 16-32KB/sec range; but it'd be a crime against humanity to go to cell phone or LPC-10 data rates.

NAME                                            | Rate in bits/sec | Rate in bytes/sec | Rate in bytes/hour
1. CD standard (stereo), 44.1KHz 16b/sample     | 1411.2K          | 176.4K            | 635.04M
2. FM-quality wideband (mono), 16KHz 16b/sample | 256K             | 32K               | 115.2M
3. Same as above, with lossless coding          | ~128K            | ~16K              | ~57.6M
4. Typical MP3, AAC etc.                        | 128K             | 16K               | 57.6M
5. Basic digital telephony (one channel)        | 64K              | 8K                | 28.8M
6. ADPCM (one channel)                          | 32K              | 4K                | 14.4M
7. Typical Digital cellular (one channel)       | 8K               | 1K                | 3.6M
8. LPC-10 (one channel)                         | 2.4K             | 300               | 1.08M
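
The byte columns in the table follow mechanically from the bit rates, so they can be rederived with a short script. This is a sketch under the assumption that "K" and "M" are decimal (10^3 and 10^6), which matches the table's figures; the dictionary and function names are mine, and only the rows with exact bit rates are included.

```python
# Derive bytes/sec and bytes/hour from the bit rates in the table.
# "K" and "M" are taken as decimal (10**3, 10**6), an assumption that
# matches the table's figures.
rates_bits_per_sec = {
    "CD standard (stereo)": 44_100 * 16 * 2,   # sample rate * bits * channels
    "FM-quality wideband (mono)": 16_000 * 16,
    "Basic digital telephony": 64_000,
    "ADPCM": 32_000,
    "Typical digital cellular": 8_000,
    "LPC-10": 2_400,
}

def per_hour_megabytes(bits_per_sec):
    """Convert a bit rate to megabytes (10**6 bytes) per hour."""
    return bits_per_sec / 8 * 3600 / 10**6

for name, bps in rates_bits_per_sec.items():
    print(f"{name:28s} {bps / 1000:8.1f}K bits/s "
          f"{bps / 8 / 1000:7.1f}K bytes/s "
          f"{per_hour_megabytes(bps):8.2f}M bytes/hour")
```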

So maybe it's two times more people and two times fewer bits per second. Any way you slice it, I think it's still a zettascale problem... ]

Posted by Mark Liberman at November 5, 2003 07:40 AM