In yesterday's post on "Comparing communication efficiency across languages", I compared the sizes of the English and Chinese sides of parallel (i.e. translated) text corpora, and observed that English seems to require 20-40% more bits to express the same information, even after the application of compression techniques that ought to eliminate most of the superficial and local reasons for such a difference. Bob Moore sent an interesting comment:
I can think of at least a couple of reasons that might explain how there can appear to be a difference between the communication efficiency of two languages. One suggests that it might only be apparent; the other explains why it might be real.
Bob's first suggestion, about how to explain the difference away:
First, I would not trust gzip to yield an accurate measure of the information content of a given text. You probably recall that Peter Brown et al. showed in their 1992 CL paper on the entropy of English that a simple tri-gram language model yields more than twice the compression of standard text compression algorithms, 1.75 bits per character vs. more than 4 bits per character. Furthermore, if memory serves, the Shannon game of guessing the next character yields a number of around 1 bit per character for English, if there is a human brain in the loop. Playing the Shannon game with the two sides of a bilingual corpus might produce a more accurate comparison than gzip.
I agree that gzip doesn't distill all the redundancy out of text, though it's a lot better than compress, which is the algorithm that produced the estimate of "more than 4 bits per character" -- on standard test corpora, the output of compress is about 85% larger than the output of gzip, which in turn is about 20% larger than szip's output and about 43% larger than SBC's. But what we're after is a comparison of information content between two sources, and that comparison should be estimated somewhat better by standard compression techniques (even older ones like gzip), especially with respect to local factors.
Take a look at the tables in Balkenhol and Kurz, "Universal Data Compression Based on the Burrows-Wheeler Transformation", IEEE Transactions on Computers, 49(10), 2000, which compare compression rates for different algorithms across the text files in two standard test corpora. The antique compress algorithm yields file sizes more than twice those of BK98, and gzip gives sizes about 15-20% greater -- but the bits-per-character values for the different subcorpora are highly correlated across algorithms (e.g. r=0.96 between gzip and BK98 across the various files in the Calgary corpus).
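For anyone who wants to poke at this at home, here's a rough sketch of the sort of comparison involved, using only compressors from the Python standard library (bz2 and lzma stand in for szip, SBC and BK98, which aren't as easy to script), with a placeholder path for the test corpus:

```python
# Compute bits per character for each file in a test corpus under several
# compressors, then check how correlated the per-file values are across
# compressors. Requires Python 3.10+ for statistics.correlation.
import gzip, bz2, lzma
from pathlib import Path
from statistics import correlation

def bits_per_char(data: bytes, compress) -> float:
    return 8 * len(compress(data)) / len(data)

compressors = {"gzip": gzip.compress, "bzip2": bz2.compress, "lzma": lzma.compress}

files = sorted(p for p in Path("calgary_corpus").iterdir() if p.is_file())
bpc = {name: [bits_per_char(f.read_bytes(), c) for f in files]
       for name, c in compressors.items()}

for name, vals in bpc.items():
    print(f"{name:6}", [round(v, 2) for v in vals])

# Pearson r between two algorithms' per-file bits-per-character values
print("r(gzip, lzma) =", round(correlation(bpc["gzip"], bpc["lzma"]), 3))
```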
While we're on the subject, let me recommend to you the classical references that Bob cites. The trigram model reference is Peter Brown, Vincent Della Pietra, Robert Mercer, Stephen Della Pietra, and Jennifer Lai, "An estimate of an upper bound for the entropy of English", Computational Linguistics, 18(1): 31-40, 1992. And while their model is "simple" compared to some other things one might think of, there's actually quite a bit to it. It models English in four stochastic steps:
1. It generates a hidden string of tokens using a token trigram model.
2. It generates a spelling for each token.
3. It generates a case for each spelling.
4. It generates a spacing string to separate cased spellings from one another.
A genuinely "simple" trigram model -- one in which case, spelling, and spacing were handled just by multiplying the number of distinct token types -- would not have done as well.
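Schematically, the decomposition they describe looks something like this -- the component distributions here are stand-in callables, not the models Brown et al. actually estimated:

```python
# A minimal sketch of the four-step decomposition described above. Each
# p_* argument is a stand-in for one component model, returning a probability;
# the per-character entropy estimate comes from dividing -log2 P(text) by the
# number of characters in the surface text.
import math
from typing import Callable, Sequence

def log2_text_probability(
    tokens: Sequence[str],      # hidden token string (step 1)
    spellings: Sequence[str],   # one spelling per token (step 2)
    cases: Sequence[str],       # case pattern for each spelling (step 3)
    spacings: Sequence[str],    # spacing strings between cased spellings (step 4)
    p_trigram: Callable[[str, str, str], float],   # P(token | two previous tokens)
    p_spelling: Callable[[str, str], float],       # P(spelling | token)
    p_case: Callable[[str, str], float],           # P(case | spelling)
    p_spacing: Callable[[str], float],             # P(spacing string)
) -> float:
    logp = 0.0
    padded = ["<s>", "<s>", *tokens]
    for i, tok in enumerate(tokens):               # step 1: token trigram model
        logp += math.log2(p_trigram(padded[i], padded[i + 1], tok))
    for tok, sp in zip(tokens, spellings):         # step 2: spelling given token
        logp += math.log2(p_spelling(sp, tok))
    for sp, cs in zip(spellings, cases):           # step 3: case given spelling
        logp += math.log2(p_case(cs, sp))
    for sep in spacings:                           # step 4: spacing strings
        logp += math.log2(p_spacing(sep))
    return logp  # bits per character = -log2 P(text) / length of the surface text
```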
In their paper's famous conclusion, Brown et al. take care to note all the things that they are leaving out:
From a loftier perspective, we cannot help but notice that linguistically the trigram concept, which is the workhorse of our language model, seems almost moronic. It captures local tactic constraints by sheer force of numbers, but the more well-protected bastions of semantic, pragmatic, and discourse constraint and even morphological and global syntactic constraint remain unscathed, in fact unnoticed. Surely the extensive work on these topics in recent years can be harnessed to predict English better than we have yet predicted it.
We see this paper as a gauntlet thrown down before the computational linguistics community. The Brown Corpus is a widely available, standard corpus and the subject of much linguistic research. By predicting the corpus character by character, we obviate the need for a common agreement on a vocabulary. Given a model, the computations required to determine the cross-entropy are within reach for even a modest research budget. We hope by proposing this standard task to unleash a fury of competitive energy that will gradually corral the wild and unruly thing that we know the English language to be.
The history of efforts to estimate the entropy of English by asking humans to guess the next character in a text begins with another one of Bob's references: Claude Shannon, "Prediction and entropy of printed English", Bell System Technical Journal, 30:50-64, 1951. Shannon came up with a range of 0.6 to 1.3 bits per character.
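For the record, here's roughly how the guessing-game numbers turn into entropy bounds; the guess-rank frequencies in the example are hypothetical, included just to show the arithmetic:

```python
# Shannon's (1951) bounds from "Shannon game" data: q[i] is the empirical
# probability that the subject's (i+1)-th guess is the correct character.
import math

def shannon_game_bounds(q):
    """Return (lower, upper) bounds in bits per character; q must sum to 1."""
    upper = -sum(qi * math.log2(qi) for qi in q if qi > 0)
    q_ext = list(q) + [0.0]
    lower = sum((i + 1) * (q_ext[i] - q_ext[i + 1]) * math.log2(i + 1)
                for i in range(len(q)))
    return lower, upper

# Hypothetical guess-rank frequencies over a 27-character alphabet
q = [0.79, 0.08, 0.03, 0.02, 0.02] + [0.06 / 22] * 22
lo, hi = shannon_game_bounds(q)
print(f"{lo:.2f} to {hi:.2f} bits per character")
```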
A quarter century later, Cover and King ("A convergent gambling estimate of the entropy of English", IEEE Transactions on Information Theory, 24(4):413-421, 1978) used a slightly different technique to sharpen the estimate to 1.25-1.35 bits per character.
Unfortunately, the Shannon/Cover-King results are not strictly comparable to the Brown et al. results, nor to the other algorithmic compression-rate results. The experiments are run on different texts, deal with various formatting issues in different ways, and estimate entropy by different methods -- and, especially, they have radically different standards for how to amortize the entropy of the "dictionary" and other prior knowledge involved. The techniques based on human guessing don't add any cost for transmitting knowledge of English (as opposed to Chinese or French or whatever); the same is basically true of the Brown et al. study, as I recall, since they simply base their estimate on the cross-entropy of the test text given the model, without any provision for transmitting the model. In contrast, the other algorithmic compression techniques start out equally ready to compress text from any source, and must build up and encode their expectations incrementally in the compressed output.
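Here's a toy illustration of that amortization issue: a character model trained on other text and then scored on the test text pays nothing for the model itself, while an adaptive compressor has to build its expectations inside the compressed output. The file names are placeholders, and the add-one-smoothed character bigram model is just a stand-in for a real language model:

```python
# Compare bits per character from (a) a pre-trained character bigram model,
# whose parameters are not charged to the output, and (b) gzip, which must
# encode its learned expectations incrementally in the compressed stream.
import gzip, math
from collections import Counter
from pathlib import Path

train = Path("train.txt").read_text(encoding="utf-8")   # hypothetical files
test  = Path("test.txt").read_text(encoding="utf-8")

alphabet = sorted(set(train) | set(test))
bigrams  = Counter(zip(train, train[1:]))
unigrams = Counter(train)

def model_bits_per_char(text: str) -> float:
    """Cross-entropy of text under an add-one-smoothed character bigram model."""
    total = 0.0
    for prev, ch in zip(text, text[1:]):
        p = (bigrams[(prev, ch)] + 1) / (unigrams[prev] + len(alphabet))
        total -= math.log2(p)
    return total / (len(text) - 1)

gzip_bits_per_char = 8 * len(gzip.compress(test.encode("utf-8"))) / len(test)

print(f"bigram model (model cost not counted): {model_bits_per_char(test):.2f} bits/char")
print(f"gzip (model built incrementally):      {gzip_bits_per_char:.2f} bits/char")
```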
Anyhow, I think that modern compression techniques should be able to remove most of the obvious and simple reasons for differences in document size among translations in different languages, like different spacing or spelling conventions. Any residual differences among languages then either reflect redundancies that are not being modeled -- along the lines sketched in Brown et al.'s conclusion -- or a different sort of difference between languages and cultures, which Bob Moore picks up in his second point:
Second, why should we expect two languages to use the same number of bits to convey the same thoughts? I believe that when we speak or write we always simplify the complexity of what is actually in our heads, and different languages might implicitly do this more than others. Applying Shannon's source/channel model, suppose that when we have a thought T that we want to convey with an utterance U, we act as if our hearer has a prior P(T) over the possible thoughts we may be conveying and estimates a probability P(U|T) that we will have used U to express T. As you well know, according to Shannon, the hearer should find the T that maximizes P(U|T)*P(T) in order to decide what we meant. But the amount of effort that the speaker puts into U will determine the probability that the hearer will get the message T correctly. If the speaker thinks the prior on T is high, then he may choose a shorter U that has a less peaked probability of only coming from T. If I say to my wife "I got it," I can get by with this short cryptic message, if I think there is a very high probability that she will know what "it" is, but I am taking a risk.
My conjecture is that the acceptable trade-off between linguistic effort and risk of being misunderstood is socially determined over time by each language community and embodied in the language itself. If the probability of being misunderstood varies smoothly with linguistic effort (i.e., bits) without any sharp discontinuities, then there is no reason to suppose that different linguistic communities would end up at exactly the same place on this curve.
Tru dat.
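To put some toy numbers on Bob's picture (they're entirely made up, just to make the effort/risk trade-off concrete): the hearer decodes by maximizing P(U|T)*P(T), and the shorter utterance leaves more posterior probability on the wrong thought.

```python
# "I got it" vs. a longer, more explicit utterance: same intended thought,
# different risk of being misunderstood. All probabilities are invented.
priors = {"got_tickets": 0.90, "got_the_flu": 0.10}      # hearer's prior P(T)

likelihood = {                                            # P(U | T)
    ("I got it",                  "got_tickets"): 0.6,
    ("I got it",                  "got_the_flu"): 0.4,
    ("I got the concert tickets", "got_tickets"): 0.4,
    ("I got the concert tickets", "got_the_flu"): 0.0,
}

def hearer_decode(u: str) -> str:
    """The thought T maximizing P(U|T) * P(T)."""
    return max(priors, key=lambda t: likelihood.get((u, t), 0.0) * priors[t])

def misunderstanding_risk(u: str, intended: str) -> float:
    """Posterior probability the hearer assigns to thoughts other than `intended`."""
    scores = {t: likelihood.get((u, t), 0.0) * priors[t] for t in priors}
    return 1.0 - scores[intended] / sum(scores.values())

for u in ("I got it", "I got the concert tickets"):
    print(f"{u!r:30} -> {hearer_decode(u)}, "
          f"risk of misunderstanding = {misunderstanding_risk(u, 'got_tickets'):.2f}")
```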
And before we get into cultural differences in tolerance for optional ambiguity and risk, note that simple differences in morphology and syntax are likely to have a significant effect here. Definiteness and plurality are somewhat redundant, but by no means entirely so; and they are sometimes relevant, but by no means always so. Therefore, if a language insists on marking definiteness and plurality on every noun, it forces more entropy into the channel. Similarly, requiring explicit as opposed to null pronouns will increase the explicit information content.
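Schematically, the back-of-the-envelope calculation for that kind of obligatory marking looks like this; the numbers below are placeholders for illustration, not the estimates from my earlier post:

```python
# Extra channel entropy from obligatory definiteness and number marking:
# only the portion that isn't already predictable from context counts.
noun_fraction           = 0.25   # assumed share of tokens that are nouns
bits_definiteness       = 0.5    # assumed unpredictable bits per noun for definiteness
bits_plurality          = 0.3    # assumed unpredictable bits per noun for number
baseline_bits_per_token = 10.0   # assumed average bits per token overall

extra = noun_fraction * (bits_definiteness + bits_plurality)
print(f"extra bits per token: {extra:.2f} "
      f"(about {100 * extra / baseline_bits_per_token:.0f}% of the total)")
```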
In my earlier post, I tried to estimate how much effect such differences between Chinese and English could have, and I came up short of the observed differences in compressed text size. Maybe my back-of-the-imaginary-envelope estimates were inadequate; or maybe gzip isn't good enough to eliminate some of the effects of differences in superficial orthographic conventions. But I have to admit that there might also be a difference in some sort of cultural set-point for communicative specificity.
[Note to self: try some better text compression programs (like SBC) on parallel-text archives, and see if the differences hold up.]
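A starting point for that experiment, with bz2 and lzma standing in for SBC (which isn't in the Python standard library) and placeholder names for the two sides of the parallel corpus:

```python
# Compress both sides of a parallel corpus with several algorithms and see
# whether the English/Chinese size gap persists as the compressor improves.
import gzip, bz2, lzma
from pathlib import Path

english = Path("corpus.en.txt").read_bytes()   # hypothetical parallel files
chinese = Path("corpus.zh.txt").read_bytes()

for name, compress in [("gzip", gzip.compress),
                       ("bzip2", bz2.compress),
                       ("lzma", lzma.compress)]:
    en, zh = len(compress(english)), len(compress(chinese))
    print(f"{name:6} en={en:>10,}  zh={zh:>10,}  en/zh = {en / zh:.2f}")
```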
Posted by Mark Liberman at April 5, 2008 06:57 AM