November 12, 2003

More on the 5 exabyte mistake

The canard that "Five exabytes... is equivalent to all words ever spoken by humans since the dawn of time" was repeated in this 11/12/2003 NYT article by Verlyn Klinkenborg. It's amazing how people pass this stuff around without checking it or thinking it through: Eskimo snow words all over again, though on a much smaller scale (so far).

[The Dutch periodical OnzeTaal linked to the NYT article and also to my earlier post on the topic -- maybe the internet culture can start to keep these small thoughtless quantitative "idées reçues" in check.]

Klinkenborg is struck by the claim that telephone traffic in the year 2002 "added up to about 17 exabytes, more than three times all the words ever spoken by humans until that point". If the sum given for 2002 telephone traffic is correct (and since I haven't checked that, I'm not sure it's true), then a plausible estimate of "all the words ever spoken" would be more than 2,000 times greater than that sum. I'm not sure whether Klinkenborg would find that comforting or upsetting, in the abstract, but concretely it would cancel the premise of today's NYT piece.
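To make the arithmetic concrete, here is a minimal sketch that uses only the figures quoted above. It assumes that "more than 2,000 times greater" is measured against the 17-exabyte telephone figure and treats 2,000 as a lower bound; the variable names are purely illustrative.

```python
# Figures quoted in the post, in decimal exabytes (1 EB = 10**18 bytes).
TELEPHONE_2002_EB = 17      # claimed telephone traffic for 2002
CLAIMED_ALL_SPEECH_EB = 5   # the "all words ever spoken" figure
MIN_FACTOR = 2000           # "more than 2,000 times greater" (assumed lower bound)

# The ratio behind "more than three times all the words ever spoken".
claimed_ratio = TELEPHONE_2002_EB / CLAIMED_ALL_SPEECH_EB

# Lower bound on a plausible "all the words ever spoken" estimate,
# assuming the factor applies to the telephone figure.
implied_floor_eb = MIN_FACTOR * TELEPHONE_2002_EB
implied_floor_zb = implied_floor_eb / 1000  # 1 zettabyte = 1,000 exabytes

# How far short the 5-exabyte figure would fall of that lower bound.
shortfall_factor = implied_floor_eb / CLAIMED_ALL_SPEECH_EB

print(f"telephone / claimed speech: {claimed_ratio:.1f}x")
print(f"implied floor: {implied_floor_eb} EB ({implied_floor_zb:.0f} ZB)")
print(f"5 EB falls short by at least {shortfall_factor:.0f}x")
```

On these assumptions the 5-exabyte figure is too small by a factor in the thousands, so 2002 telephone traffic would be a tiny fraction of all speech rather than more than three times as large.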

[Note: it's curious how easily we linguists tend to fall into the role of "insect dry discoursing gammer / [who] tells what's not rhyme and what's not grammar." Well, someone's got to do it, I guess. But I prefer inspiration to deflation, I really do!]

[Update: Klinkenborg is relying on this report from Berkeley, about which I might say more when I've had a chance to go through it. The Berkeley report supports the assertion that 17 exabytes of telephone traffic flowed in 2002, but on a quick read, I did not find anything connected to Klinkenborg's belief that all prior human talk would amount to 5 exabytes.

Though this (false) idea apparently was not mentioned in the Berkeley report (?), it is not something that Klinkenborg just made up. Hit Google with the search string
"5 exabytes" spoken
and you'll get 215 repetitions of this idea, which is what might be called an "urban legend statistic". It gets started somewhere and then spreads by memesis, almost completely independent of whether or not it has any factual basis. As it pretty clearly does not, in this case -- I looked at the first few pages of Google hits above, and a sample of the later ones, and I couldn't find any explanation or justification of the figure, correct or incorrect. The bare numerical assertion is just cited as if it were common knowledge among the well-informed.]

Posted by Mark Liberman at November 12, 2003 10:10 AM