December 16, 2006

Britain's scientists risk becoming hypocritical laughing-stocks, research suggests

Back in April and May of 2005, there was a flurry of preposterous stories about how using cell phones and email lowers your IQ more than smoking marijuana does. You can read all about it here. The basic ingredients were:

  • a company with something to sell, which
  • hired a reputable scientist to do a (private, unreleased) study designed to publicize its products, and then
  • distributed misleading and partly false press releases, exaggerating the results of this research, which
  • lazy, credulous or opportunistic journalists vied with one another to publish in ever more sensationalist and misleading forms.

We seem to be going down that primrose path again. As often, the BBC is leading the way -- "UK's Vicky Pollards 'left behind'", 12/12/2006:

Britain's teenagers risk becoming a nation of "Vicky Pollards" held back by poor verbal skills, research suggests.

And like the Little Britain character the top 20 words used, including yeah, no, but and like, account for around a third of all words, the study says.

If, like me, you're a bit fuzzy about just who this Vicky Pollard person is, I can recommend the Best of Vicky Pollard I and Best of Vicky Pollard II on YouTube. It's curious. The Vicky character -- a broad satire of the accent, dress and manners of British lumpen-teen females -- is portrayed as hyper-verbal. One of the basic Vicky bits is her jabbering rapidly on automatic pilot, saying far more than she should. Yet the BBC sees her as someone who is unable to communicate due to an inadequate word stock, not someone who over-communicates with socially inappropriate content, accent, word choice and sentence structure. This is another piece of evidence that journalists these days are incapable of elementary observation and common-sense description, at least when it comes to speech and language.

Now, we're told, "research suggests" that the stereotype of low-verbal Vicky is correct. I'm not sure what's really going on here, since the primary source is the BBC science section, which has become a consistently unreliable source of information. The article attributes the research in question to Tony McEnery, who is a fine computational linguist. It quotes or paraphrases him saying a number of things that don't really make sense as written, like this:

His analysis of a database of teenage speech suggested teenagers had a vocabulary of just over 12,600 words compared with the nearly 21,400 words that the average person aged 25 to 34 uses.

It's essentially impossible to estimate someone's total vocabulary accurately from a sample of their speech or writing -- certainly not with a precision like "just over 12,600 words" or "nearly 21,400 words". And in any case, numbers like 12,600 and 21,400 are way too small to represent the vocabulary of contemporary English speakers -- credible estimates published long ago put receptive vocabularies in the range of 40,000 "word families" for typical high-school graduates (corresponding to several times that number of distinct word forms). So I'll guess that what Tony did was to measure the number of different orthographic words (or word forms?) used in a given amount of text by two different groups.

It's well known that the rate of vocabulary display varies with age, socio-economic status, formality, and so on. And it's also well known that rate of vocabulary display is poorly correlated -- sometimes negatively correlated -- with communicative effectiveness. But without knowing what the databases were, and how they were collected, and what kind of analysis was done on them, it's hard to know what the cited disproportion in "vocabulary size" really means, and in particular whether it's a new property of today's British teens, or the same old story about vocabulary display as a function of age and class and context.

In any case, the factoid that makes the biggest impact is the assertion that "the top 20 words used ... account for around a third of all words".

Thus "Are iPods shrinking the British vocabulary?", Ars Technica, 12/15/2006, says:

McEnery found that one-third of most teenage speech was made up of only 20 common words like "yeah," "no," and "but." This is problematic for teenagers seeking jobs in the corporate world, where at least some level of professionalism is required when communicating with others.

And Sarah O'Grady, "The teenagers who just can't speak proper", Daily Express, 12/13/2006:

The most frequent 20 words they speak account for a third of all words used in their conversations, a university study found. And the 10 most popular words are yeah, no, like and but.

[Um, that's 4 words -- where are the other 6? I don't expect journalists to be mathematically literate, but you'd think they could count to ten.]

The Daily Record tells us ("Vicky-speak warning"):

TEENAGERS need lessons in how to speak properly because so many sound like Little Britain's Vicky Pollard, experts say.

A study by Lancaster University linguistics specialist Professor Tony McEnery found teenagers rely on a limited vocabulary, as schools fail to teach verbal communication skills.

And Ruki Sayid, "IT'S LIKE, YEAH, WHAT, YOU KNOW ..AND THAT: 20 words in third of teen talk", Daily Mirror, 12/13/2006:

TEENAGERS use just 20 words for a third of everything they say, research reveals.

And the best, I think, is the satirical take at Anorak:

So it is time for the immigrant to learn how to speaka da Ingleesh. And the good news is that it is well easy, innit.

The Mail sees the work of linguists at Lancaster University. And it notes that while the over-25s use 21,391 words in daily conversation, the teenagers use just 12,682.

This seems impressive, until you realise that no less than 11,216 of those teen words are for chips. The teenage vocabulary, to which any immigrant should aspire, is pared down to 20 key words.

Now, I'm sure that Britain's teens would benefit from additional vocabulary instruction. But (as Arnold Zwicky pointed out a couple of days ago), the assertion that they "use just 20 words for a third of everything they say" is a spectacularly lousy argument for this conclusion.

Here's why. The Zipf's-law distribution of words, whether in speech or in writing, whether produced by teens or the elderly or anyone in between, means that the commonest few words will account for a substantial fraction of the total number of word-uses. And in modern English, the fraction accounted for by the commonest 20 orthographical word-forms is in the range of 25-40%, with the 33% claimed for the British teens being towards the low side of the observed range.

For example, in the Switchboard corpus -- about 3 million words of conversational English collected from mostly middle-aged Americans in 1990-91 -- the top 20 words account for 38% of all word-uses. In the Brown corpus, about a million words of all sorts of English texts collected in 1960, the top 20 words account for 32.5% of all word-uses. In a collection of around 120 million words from the Wall Street Journal in the years around 1990, the commonest 20 words account for 27.5% of all word-uses.
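For anyone who wants to check figures like these against a text of their own, here is a minimal Python sketch of the calculation. It is not the code behind the numbers above: the tokenizer (lowercased runs of letters, with internal apostrophes allowed) and the names `tokenize` and `top_k_share` are assumptions made for illustration, and a different definition of "orthographic word" will shift the percentages somewhat.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercased word forms.

    Assumption: a "word" is a run of ASCII letters, optionally with
    internal apostrophes (don't, it's). Real corpora need more care.
    """
    return re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())


def top_k_share(tokens, k=20):
    """Fraction of all word-uses accounted for by the k commonest word forms."""
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    covered = sum(n for _, n in counts.most_common(k))
    return covered / len(tokens)


# Toy example: 11 tokens, of which "the" (4) plus one 2-count word cover 6.
sample = "the cat and the dog and the bird saw the cat"
tokens = tokenize(sample)
print(round(top_k_share(tokens, k=2), 3))  # 0.545
```

On any running English text of reasonable length, calling `top_k_share(tokens, k=20)` should land in the general neighborhood of the corpus figures cited above, whoever wrote or spoke it.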

And in Tony McEnery's autobiographical sketch, the commonest 20 words account for 426 of 1190 word tokens, or 35.8% . . .

In fact, Tony used 521 distinct words in composing his 1190-word "Abstract of a bad autobiography"; and it only takes the 16 commonest ones to account for a third of what he wrote. News flash: "COMPUTATIONAL LINGUIST uses just 16 words for a third of everything he says." Does this mean that Tony is in even more dire need of vocabulary improvement than Britain's teens are?
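The "how many words does it take to cover a third" figure can be computed the same way. The sketch below is a hedged illustration, not the procedure actually used for this post: `words_needed` is a hypothetical helper name, the input is assumed to be an already-tokenized list of word forms, and the only real data in it are the token and type counts quoted in the text.

```python
from collections import Counter


def words_needed(tokens, share=1 / 3):
    """Smallest number of commonest word forms whose uses cover `share` of the text."""
    counts = Counter(tokens)
    target = share * len(tokens)
    running = 0
    for rank, (_, n) in enumerate(counts.most_common(), start=1):
        running += n
        if running >= target:
            return rank
    return 0  # empty input


# Toy example: "a" alone is 4 of 10 tokens, so one word covers a third.
toy = "a a a a b b c d e f".split()
print(words_needed(toy))             # 1
print(words_needed(toy, share=0.5))  # 2

# Arithmetic check on the counts quoted in the post.
print(f"{426 / 1190:.1%}")   # 35.8%  McEnery, top-20 share
print(f"{587 / 1435:.1%}")   # 40.9%  Huck Finn ch. 1, top-20 share
print(f"{521 / 1190:.1%}")   # 43.8%  McEnery, distinct words per token
print(f"{439 / 1435:.1%}")   # 30.6%  Huck Finn ch. 1, distinct words per token
```

The last two lines are the "rate of vocabulary display" comparison: distinct word forms divided by total word-uses. That ratio falls as a text gets longer, so it is only comparable across samples of similar size, as these two roughly are.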

I doubt it. In comparison, the first chapter of Huckleberry Finn amounts to 1435 words, of which 439 are distinct -- so that Tony displayed his vocabulary at a substantially faster rate than Huck did. And Huck's commonest 20 words account for 587 of his first 1435 word-uses, or 40.9%. So Tony beats Huck, by a substantial margin, on both of the measures cited in the BBC story. (And just the 12 commonest words account for a third of Huck's first chapter: and, I, the, a, was, to, it, she, me, that, in, and all.) We'll leave it for history to decide whose autobiography is communicatively more effective.

The BBC article ends this way:

"When things are funny it is because they ring true with people," said Prof McEnery who conducted the research for retailer Tesco. [...]

Tesco, which commissioned the report, said it was responding by launching a scheme which allows all UK comprehensive schools to interact and communicate with other schools around the country using its internet phone technology.

So once more, we seem to have an unpublished study commissioned by a company that is using it to sell something, and is publicizing it using a striking but meaningless -- or actively misleading -- quantitative assertion. "Reading email and answering a cell phone reduces IQ by 10 points, compared to four for a joint"; "Teenagers use just 20 words for a third of what they say".

OK, right. So let's see, we'll improve the vocabulary of British teens by wiring them up for easier internet cell-phone access. And we'll also make sure they get plenty of cannabis. No, wait, um, I'm confused. Too much science journalism, do you see; research shows that it eliminates logical thought in favor of knee-jerk associations between press releases and popular culture. I wish I could give it up, but the bastards have got me hooked.

In particular, the main source of information here is the BBC, and only a fool would trust what the BBC prints about scientific topics. I wrote to Tony McEnery on December 12, shortly after the BBC article came out. I haven't heard from him yet, but when I do -- especially if he's able to give me some documentation of the cited research -- I'll update this commentary.

[See here for Tony's response... Indeed, as I suspected, the media reports substantially distorted what he had to say.]

[Anatol Stefanowitsch writes:

Your remarks on the BBC's claims about the poor verbal skills of British teenagers and Geoff Pullum's replication of the original study using the BBC article itself would not be complete, I feel, without taking a look at the undisputed master of the English language, the great Bard himself.

Surely William Shakespeare's verbal skills must have exceeded those of "UK's Vicky Pollards"?

Alas, no. At least not by much. The Comedy of Errors, for example, consists of 16,298 word-form tokens (here and below I use the files provided by the Project Gutenberg with the header material removed). The top twenty words account for 5,578 tokens, i.e. 34.2 per cent. What is more, the Bard's creativity seems to have been overestimated by scholars of literature: the top twenty are very bland words, such as of, I, and, the, to and you!

OK, but the Comedy of Errors is, well, a comedy. Perhaps the great tragedies will show more clearly the linguistic genius of the greatest poet of the English language?

Only by a very narrow margin. Hamlet consists of 32,040 word-form tokens. The top twenty account for 9,937 tokens, i.e. 31%.

I look forward to the reappraisal of the entire history of British literature that is sure to follow these discoveries -- one that measures the literary worth of authors by the degree to which their writing deviates from Zipf's laws.

Actually, all of these texts probably obey a form of Zipf's law about equally well -- the difference would be the parameters of the word-frequency distribution, not the basic type of distribution. As Cosma Shalizi is fond of reminding people, things that look like a power-law (Zipfian) distribution are often really log-normal; but it seems that for words, power-law distributions really are more predictive than log-normal distributions. In any case, we would be comparing Zipfian parameters, not deviations from Zipf's predictions.

By the way -- I started to compile a corpus of Vicky Pollard transcripts, since her speech (what I can understand of it) seems quite lexically inventive, and in fact is likely to compare favorably with the BBC (and perhaps with Shakespeare) in that respect. The problem is, there's about a third of what she says that I can't figure out. So if someone more familiar with her dialect will provide some transcriptions, I'll gladly do the statistics.]
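As a rough illustration of what "comparing Zipfian parameters" could look like in practice, here is a small Python sketch that estimates the exponent s in the relation frequency ∝ 1/rank^s by fitting a straight line to log frequency against log rank. This is an assumption-laden shortcut, not a method endorsed anywhere in this post: least squares on a log-log plot is a crude estimator (maximum-likelihood fits are generally preferred), and `zipf_exponent` is a hypothetical name.

```python
import math
from collections import Counter


def zipf_exponent(tokens):
    """Crude estimate of s in freq ~ C / rank**s via least squares on log-log data."""
    freqs = sorted(Counter(tokens).values(), reverse=True)
    if len(freqs) < 2:
        raise ValueError("need at least two distinct word forms")
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return -sxy / sxx  # slope is negative; report the exponent as positive


# Synthetic check: frequencies 120, 60, 40, 30, 24 follow 120 / rank exactly.
tokens = []
for rank, freq in enumerate([120, 60, 40, 30, 24], start=1):
    tokens += [f"w{rank}"] * freq
print(round(zipf_exponent(tokens), 3))  # 1.0
```

Two texts would then be compared by their fitted exponents (and by how well the line fits), rather than by whether their top 20 words cover "a third" of the total.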

Posted by Mark Liberman at December 16, 2006 07:22 AM