April 25, 2006

Probability theory and Viswanathan's plagiarism

I have recently mentioned just how much undergraduate plagiarism disgusts me, and I will not repeat any of those remarks in the context of 19-year-old Harvard undergraduate Kaavya Viswanathan's debut novel How Opal Mehta Got Kissed, Got Wild, and Got a Life, now widely known to have included passages plagiarized from Megan McCafferty's Sloppy Firsts (2001). But let me just point out that at least one of the plagiarized passages was 14 words long.

That may seem short to you, but according to modern estimates of the entropy in ordinary running English text [thanks to Fernando Pereira for information that led me to revise this post on April 26], if you graph the word positions in English text against the number of words that would be grammatically possible as the next word given the last few words of the text, although the numbers vacillate wildly, the average across them all tends to settle in at something approaching 100. If that's right, then at any arbitrary starting point in an arbitrary text, if text were being composed at random, the probability that you will find the next 14 words match some previously designated sequence of 14 words is very roughly in the region of 1 in 100^14, which is 1 in 10^28, i.e., 0.0000000000000000000000000001.
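For anyone who wants to check the arithmetic, here is a minimal sketch in Python. It assumes exactly what the estimate above assumes: about 100 equally likely choices at each word position, each made independently. The function name match_probability is just an illustrative label.

```python
from fractions import Fraction

def match_probability(branching_factor: int, length: int) -> Fraction:
    """Chance that `length` independently chosen words all match a fixed
    target sequence, assuming `branching_factor` equally likely choices
    at every position (an idealization, not a model of real writing)."""
    return Fraction(1, branching_factor) ** length

# 100 possible next words, 14 words in a row.
p = match_probability(100, 14)
print(p.denominator == 10**28)  # True: the odds are 1 in 10^28
```

Exact fractions are used rather than floating point so that the comparison with 10^28 is exact.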

That number is so close to zero that we don't really need to ask any more. This is evidence of copying. And when there are a dozen other cases of plagiarism from the same source, as The Harvard Crimson has shown there are, the probability plummets to something vastly lower. One could quibble with some of the assumptions behind the application of probability theory here (I'm assuming a novelist is free to choose each word independently from all the grammatically legitimate ones available at that point), but it won't really change the fact that the chances of this being accidental are not just small but nonexistent.
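To see how little the quibbling matters, here is a rough sketch that reruns the estimate with smaller numbers of possible next words. The values 10 and 30 are purely illustrative assumptions, not figures from any entropy study, and the independence assumption is kept as is.

```python
import math

def log10_odds(branching_factor: float, length: int) -> float:
    """Base-10 exponent N in 'one chance in 10^N' for matching `length`
    words, given `branching_factor` equally likely choices per word."""
    return length * math.log10(branching_factor)

# Hypothetical branching factors; 100 is the figure used in the post.
for b in (10, 30, 100):
    print(f"{b} choices per word: about 1 in 10^{log10_odds(b, 14):.0f}")
```

Even with only 10 choices per position, a 14-word match comes out at one chance in 10^14 under these assumptions.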

Viswanathan now says "I was very surprised and upset to learn that there are similarities between some passages in my novel, and passages in these books." Give me a break. We're not talking about "similar", we're talking about identical. She claims "any phrasing similarities between [McCafferty's] works and mine were completely unintentional and unconscious." I don't believe her. It's impossible. Nobody memorizes 14-word sequences accidentally and writes them under the delusion that they're original. Nobody accidentally borrows the phrase "and 170 specialty shops later", where any number would have done as well as "170" because it was picked arbitrarily by McCafferty. Nobody comes up with a phrase like "a pink tube top emblazoned with a glittery Playboy bunny" through some unlucky accidental half remembering. Sorry, but I'm not buying it. This is a sorry case of fraud, lying, copyright infringement, and abdication of the writer's intellectual responsibility. It's sickening.

And it really boggles the mind that the honest Dan Brown gets sued for plagiarism while Kaavya Viswanathan does not.

P.S. Most of the people emailing me about this are saying they object to the idea that I can read the author's mind with statistics. I'm not saying that. I'm not saying anything about her motivation or state of mind. I'm saying the hard evidence of straightforward copying of text is so extreme that we can regard it as conclusive.

Posted by Geoffrey K. Pullum at April 25, 2006 03:25 PM