Language Log: Flashy frequency finder

December 16, 2006

Flashy frequency finder

Here's a quick footnote to the fatuous folderol about British teens' vocabulary deployment currently being laid waste by Arnold, Mark and Geoff.¹ Squires, over at Polyglot Conspiracy, recently linked to a funky little online app called WordCount, which shows you, in order of frequency and in a graphically enhanced format, the 86,800 words that occur more than once in the British National Corpus.

The description of the app at the site says:

WordCount™ is an artistic experiment in the way we use language. It presents the 86,800 most frequently used English words, ranked in order of commonness. Each word is scaled to reflect its frequency relative to the words that precede and follow it, giving a visual barometer of relevance. The larger the word, the more we use it. The smaller the word, the more uncommon it is.

It's supposed to be an intuitive interface, and certainly it's easy to scroll through, and easy to type in a word and find its relative frequency. The interest of the graphic scaling wears off quickly, though, since the long-tail distribution of the English vocabulary means that after the first few words, the size difference between any two words in the window at the same time is minimal to nonexistent. It's not useful for anyone actually interested in doing anything with the numbers; they are much better off working with some other data source. But for just a quick look (like the one the teen-word-use journalists didn't bother to take), it can be fun. At the moment most use of the site seems to have to do with 'conspiracies', funny or suggestive conglomerations of words in the same frequency range, which doesn't strike me as being of any intrinsic linguistic interest.
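For a rough feel for why the size differences vanish so fast, here's a back-of-the-envelope sketch in Python. It assumes an idealized Zipf-style distribution in which frequency is proportional to 1/rank; that's my simplifying assumption, not a claim about the actual BNC counts, and the function name and the 20-word window are made up for illustration.

```python
def zipf_frequency_ratio(rank, gap=1):
    """Ratio of the frequency at `rank` to the frequency at `rank + gap`.

    ASSUMPTION: frequency is proportional to 1 / rank (idealized Zipf),
    so f(rank) / f(rank + gap) simplifies to (rank + gap) / rank.
    """
    if rank < 1 or gap < 0:
        raise ValueError("rank must be >= 1 and gap must be >= 0")
    return (rank + gap) / rank


# Compare the first and last word of a hypothetical 20-word window
# starting at various ranks.
for start in (1, 10, 100, 1000, 10000):
    ratio = zipf_frequency_ratio(start, gap=19)
    print(f"window starting at rank {start}: ratio {ratio:.3f}")
```

Under that assumption the top-ranked word is twice as frequent as the second, but a thousand words in, the two ends of a 20-word window differ by only a couple of percent, which is about what you see when scrolling.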

But one could imagine applications that would be more substantial and would still have some kind of 'kewl!' factor for the casual clicker-by. For example, how about allowing users to identify their vocabulary recognition rate in various frequency ranges? The site could scroll to a random sequence of ten words in some medium frequency range, and allow readers to identify which words they recognize; then another ten words higher or lower in the frequency count, and so on. Then the site could report to the reader the frequencies at which their average recognition rate was, e.g., 10 in 10, 7 in 10, 5 in 10, 3 in 10... and compare the user's recognition rate to the average of other users on the site. Or something like that.
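Just to show the idea isn't hard to prototype, here's a minimal sketch in Python. Everything in it is hypothetical: the function names, the 0-based ranks, and the notion that the user's answers arrive as a set of recognized words are all my assumptions, and nothing here reflects how WordCount actually works.

```python
import random


def sample_window(ranked_words, start_rank, end_rank, size=10, rng=None):
    """Return `size` consecutive words starting at a random rank in the band.

    `ranked_words` is a list ordered from most to least frequent (0-based).
    The band is [start_rank, end_rank). If the band is narrower than `size`,
    the window starts at `start_rank` and may run past the band.
    """
    rng = rng or random.Random()
    last_start = max(start_rank, end_rank - size)
    first = rng.randint(start_rank, last_start)  # inclusive at both ends
    return ranked_words[first:first + size]


def recognition_rate(window, recognized_words):
    """Fraction of the words in `window` that the user said they recognize."""
    if not window:
        return 0.0
    hits = sum(1 for word in window if word in recognized_words)
    return hits / len(window)


def rates_by_band(ranked_words, bands, recognized_words, rng=None):
    """Map each (start_rank, end_rank) band to the user's recognition rate."""
    return {
        band: recognition_rate(
            sample_window(ranked_words, band[0], band[1], rng=rng),
            recognized_words,
        )
        for band in bands
    }
```

A real version would collect the recognized set interactively, one window at a time, and would store the per-band rates so they could be averaged across users.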

Anyway, the main thing is, it's a user-friendly visual presentation of the point that the senior Loggers have been making about the relative percentages of any corpus taken up by the most frequent words in the English lexicon. They're all 'function' words -- you can hardly make a grammatical English sentence without one or more of them -- and they have essentially nothing to tell us about the vocabulary range of any English speaker. If you don't use a goodly sampling of these words in nearly every sentence, you're almost certainly not a native speaker of English.

Update: Daniel Ezra Johnson writes to let us know that the scaling on WordCount adjusts the height of a word in proportion to its frequency. As a consequence, the overall area occupied by a scaled word is proportional to the square of its frequency, likely resulting in a rather different set of perceived comparisons than intended. A word with 4 occurrences in 1,000,000 would be twice as high as a word with 2 occurrences in 1,000,000 (4:2), but that makes it four times as big, area-wise, as a 2:1,000,000 word of the same length (16:4). The bigger the difference in the numbers, of course, the worse it gets. It's a little trickier even than that, since the words are of different lengths, so a long high-frequency word is really going to look a lot bigger than a short low-frequency word.
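To make the arithmetic concrete, here's a small sketch in Python. It assumes that a word's width grows in step with its height and that every letter takes up roughly the same width; those are my simplifications, not anything the WordCount site documents, and the function names are invented for illustration.

```python
def relative_area(freq_a, freq_b, len_a=1, len_b=1):
    """Ratio of the on-screen area of word A to that of word B.

    ASSUMPTIONS: height is proportional to frequency, each letter is as
    wide as the font is tall (up to a constant), so
    area ~ height * (height * letters) = frequency**2 * letters.
    """
    area_a = freq_a ** 2 * len_a
    area_b = freq_b ** 2 * len_b
    return area_a / area_b


def height_for_proportional_area(freq, base_height=1.0):
    """Height that would make area, rather than height, track frequency."""
    return base_height * freq ** 0.5


# Same-length words at 4 and 2 occurrences per million: 16:4, i.e. 4x.
print(relative_area(4, 2))
# An 8-letter word at 4 per million vs. a 4-letter word at 2 per million.
print(relative_area(4, 2, len_a=8, len_b=4))
```

If the goal is for area to be proportional to frequency, one option is to scale height by the square root of the frequency instead, as in the second function; word length would still skew the comparison.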

¹ It's a bit dicey going into the Senior Writers' Lounge these days; they're all looking a little wild around the eyes. This on top of the battle with the Brizendine hydra has almost been too much. Some major news source had better publish a report on some well-documented, novel and interesting language research soon, or the fog of despair that occasionally threatens at the Plaza may congeal decisively, with something of a negative effect on posting rates, I'm afraid.

Posted by Heidi Harley at December 16, 2006 10:14 PM