October 04, 2003

Colorless green probability estimates

43 years later, someone finally checked. And it turns out that Chomsky was wrong.

In Syntactic Structures (1957) Chomsky famously wrote:

  (1) Colorless green ideas sleep furiously.
  (2) Furiously sleep ideas green colorless.

. . . It is fair to assume that neither sentence (1) nor (2) (nor indeed any part of these sentences) has ever occurred in an English discourse. Hence, in any statistical model for grammaticalness, these sentences will be ruled out on identical grounds as equally `remote' from English. Yet (1), though nonsensical, is grammatical, while (2) is not.

This was one of the most compelling passages in an enormously influential book, which killed the early-50s information-theoretic explorations of language.

Chomsky's typically confident conclusion is extraordinarily broad -- "in any statistical model for grammaticalness, these sentences will be ruled out on identical grounds" -- and supported by nothing beyond assertion. Yet anyone who knows that a statistical model can assign different probabilities to different unseen events will suspect that his assertion is wrong.

In an article "Formal grammar and information theory: together again?", Fernando Pereira describes an experiment that disproves Chomsky on this point, by fitting a simple statistical model (an "aggregate bigram model") to a corpus of newspaper text.

The result? The sentence "Furiously sleep ideas green colorless" is estimated by this model to be about 200,000 times less probable than "Colorless green ideas sleep furiously" (p. 7).
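
To see how a statistical model can tell two unseen sentences apart, here is a minimal sketch in Python. It is not Pereira's model: his aggregate bigram model learns soft word classes from newspaper text, while this toy uses a made-up five-sentence corpus, a hand-assigned lexicon of hard classes, and add-one smoothing. All of those choices (and every name in the code) are assumptions for illustration only. The point is just that once words share statistics through classes, a word sequence that never occurred can still get a non-trivial probability.

```python
from collections import Counter
from math import exp, log

# Hypothetical hand-made lexicon (word -> class); an assumption of this toy.
LEXICON = {
    "colorless": "ADJ", "green": "ADJ", "new": "ADJ", "old": "ADJ",
    "ideas": "N", "dogs": "N", "plans": "N",
    "sleep": "V", "change": "V", "bark": "V",
    "furiously": "ADV", "quietly": "ADV", "slowly": "ADV",
}

# Made-up training corpus; none of the word bigrams in GOOD or BAD occur in it.
CORPUS = [
    "old dogs bark furiously",
    "new plans change slowly",
    "green plans change quietly",
    "colorless old dogs sleep quietly",
    "new ideas change slowly",
]

GOOD = "colorless green ideas sleep furiously"
BAD = "furiously sleep ideas green colorless"

START, END = "<s>", "</s>"


def bigrams(seq):
    """Adjacent pairs of a sequence."""
    return list(zip(seq, seq[1:]))


def classes(sentence):
    """Class sequence of a sentence, with start and end markers."""
    return [START] + [LEXICON[w] for w in sentence.split()] + [END]


trans = Counter()        # counts of (class, next class)
context = Counter()      # counts of a class as left context
emit = Counter()         # counts of each word
class_total = Counter()  # token counts per class
seen_word_bigrams = set()

for s in CORPUS:
    words = s.split()
    seen_word_bigrams.update(bigrams(words))
    for c1, c2 in bigrams(classes(s)):
        trans[c1, c2] += 1
        context[c1] += 1
    for w in words:
        emit[w] += 1
        class_total[LEXICON[w]] += 1

OUTCOMES = sorted(set(LEXICON.values())) + [END]  # possible next classes
CLASS_SIZE = Counter(LEXICON.values())            # word types per class


def log_prob(sentence):
    """Log probability under the toy class bigram model, add-one smoothed."""
    total = 0.0
    for c1, c2 in bigrams(classes(sentence)):
        total += log((trans[c1, c2] + 1) / (context[c1] + len(OUTCOMES)))
    for w in sentence.split():
        c = LEXICON[w]
        total += log((emit[w] + 1) / (class_total[c] + CLASS_SIZE[c]))
    return total


ratio = exp(log_prob(GOOD) - log_prob(BAD))
print(f"GOOD is about {ratio:.0f} times more probable than BAD")
```

Both test sentences contain the same words, so the word-given-class terms cancel and the whole difference comes from the class transitions: adjective-noun-verb-adverb sequences are common in the toy corpus, and the reverse order is not. In this setup the grammatical order comes out several thousand times more probable, even though neither sentence, nor any pair of adjacent words in either, appears in the training data. The size of the gap depends entirely on the toy corpus and smoothing, so it should not be compared with Pereira's figure.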

Read the whole thing, which gives a picture of the history of these issues since 1950, including a sympathetic account of Zellig Harris' research program, and makes some interesting suggestions for the future.

[Note: Pereira's article was prepared for this volume on "The Legacy of Zellig Harris", which contains other interesting articles as well.]

Posted by Mark Liberman at October 4, 2003 06:58 AM