I report here a small experiment I conducted to follow up on Mark Liberman's discussion (and Arnold Zwicky's earlier mention) of the shocking news that Britain's teenagers use a mere 20 words for a third of everything they say. (Scientists like to make sure that results can be replicated.) I took the entire text of the actual BBC article reporting this news of verbal poverty (see it at this web page), found the 20 most frequent words in it, and worked out what percentage of the total text they account for. The answer is between 36 and 40 percent. (The difference depends on how much you collapse different word forms together into lexemes. Collapsing genitives and plurals with non-genitive singulars makes hardly any difference to the results, but treating "is", "are", "was", and "were" as different words rather than as representatives of the verb "be" lowers the figure slightly. If you do the collapsing, the top 20 words make up over 39.5% of the text. If you don't, the top 20 account for just over 36%.)
So this is the situation. This staggeringly stupid news report states that Britain's teenagers are "held back by poor verbal skills" because the evidence shows that the top 20 words in their speech account for 33% of all the words they use — the implication being that they aren't using enough words, they're just repeating a few words like "yeah" and "no" and "but" and "like". But in the staggeringly stupid article itself, the top 20 words account for substantially more than that. So Britain's science writers (at least at the BBC) are even more verbally retarded. Hello? Is there anyone out there who thinks Mark is exaggerating when he says BBC science reporters are writing junk that brings science into disrepute?
In case you want to see the results I got (which you can easily check for yourself), here they are (with the lexeme collapsing done). There are 402 words in the text (if you replace hyphens by spaces), and this table shows the numbers of occurrences for the top 20 in frequency:
Occurrences | Word
25 | the
16 | forms of the verb be
13 | of
10 | and
10 | in
10 | to
9 | forms of the noun word
8 | a
7 | but
6 | as
6 | forms of the pronoun it
5 | forms of the pronoun he
5 | no
5 | forms of the verb say
5 | speech
4 | by
4 | forms of the noun school
4 | that
4 | which
4 | with
These words account for 25 + 16 + 13 + 10 + 10 + 10 + 9 + 8 + 7 + 6 + 6 + 5 + 5 + 5 + 5 + 4 + 4 + 4 + 4 + 4 = 160 occurrences, and 160/402 = 39.8%.
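For anyone who would rather not add the column by hand, here is a minimal Python sketch that rechecks the sum and the percentage. The counts and the total of 402 are copied from the table and text above; the variable names are just illustrative.

```python
# Counts copied from the table above (lexeme-collapsed), in descending order.
top20_counts = [25, 16, 13, 10, 10, 10, 9, 8, 7, 6,
                6, 5, 5, 5, 5, 4, 4, 4, 4, 4]
total_words = 402  # total word count reported for the article

top20_total = sum(top20_counts)
share = top20_total / total_words

print(f"{top20_total} of {total_words} words = {share:.1%}")
# prints: 160 of 402 words = 39.8%
```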
The reason it seems to me sensible to collapse down to lexemes is that it would be absurd to say that a teenager wasn't using the word "parallelism" if the record showed that he regularly used the word "parallelisms", or to insist of someone who used "emancipator's" that she didn't have the word "emancipator" in her vocabulary. However, even if you insist on going with raw word forms with not even the singulars and plurals collapsed, my count shows the percentage only going down to 36%, which is still higher than the teenagers' alleged 33%.
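The kind of count described here can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a record of how the figures above were actually produced: the tokenization rule (lowercase, hyphens replaced by spaces, apostrophes kept inside words), the function name `top_n_share`, and the tiny hand-built `LEXEMES` mapping are all hypothetical choices, and a serious replication would need a much fuller mapping or a proper lemmatizer.

```python
import re
from collections import Counter

# Hypothetical, hand-built mapping from word forms to lexemes (an assumption
# for illustration only; it covers just a handful of forms).
LEXEMES = {
    "is": "be", "are": "be", "was": "be", "were": "be",
    "words": "word", "word's": "word",
    "says": "say", "said": "say",
}

def top_n_share(text, n=20, lexemes=None):
    """Return (the n most frequent items, fraction of all tokens they cover)."""
    lexemes = lexemes or {}
    # Replace hyphens by spaces, then pull out lowercase word tokens,
    # keeping an internal apostrophe so "emancipator's" stays one token.
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower().replace("-", " "))
    # Map each form to its lexeme if one is listed; otherwise keep the form.
    counts = Counter(lexemes.get(tok, tok) for tok in tokens)
    top = counts.most_common(n)
    total = sum(counts.values())
    covered = sum(count for _, count in top)
    return top, (covered / total if total else 0.0)

sample = "The word is short. The words are short. It was said that words were few."
raw_top, raw_share = top_n_share(sample, n=3)
lex_top, lex_share = top_n_share(sample, n=3, lexemes=LEXEMES)
print(f"raw forms: {raw_share:.1%}, lexemes: {lex_share:.1%}")
# prints: raw forms: 40.0%, lexemes: 60.0%
```

On this toy sample, as on the BBC article, collapsing forms into lexemes raises the share covered by the most frequent items, because the forms of "be" pool into a single high-frequency entry.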
Posted by Geoffrey K. Pullum at December 16, 2006 03:00 PM