January 24, 2007

State of the Union word count tools

See this New York Times page for tools to analyze last night's State of the Union address (and previous ones by this president) on the basis of the frequency with which particular words are used. From the examples given on the page one cannot tell whether it is lexemes or word forms that they count, but a small amount of testing with the search facility soon reveals that it is word forms (indeed, simply character strings), because those are easy to count. So there are separate counts for hero and heroes, for example, rather than a single figure for the lexeme hero that embraces both. But if you want to study (say) insure, you can get some of the way by looking for insur, which is matched by insures, insuring, etc., though it will also be matched by insurance (a different lexeme).
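
For readers who want to see the difference concretely, here is a minimal Python sketch of string-based counting. It is not the Times's implementation, which the page does not describe; the function names word_form_counts and substring_matches are hypothetical, and the sample sentence is made up rather than quoted from any address.

```python
import re
from collections import Counter


def word_form_counts(text):
    """Count each distinct word form (a lowercased run of letters) separately."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def substring_matches(counts, fragment):
    """Return every counted word form that contains `fragment`, with its count."""
    return {form: n for form, n in counts.items() if fragment in form}


# Made-up sample text, used only for illustration.
sample = "A hero insures his home. Heroes buy insurance before insuring a car."

counts = word_form_counts(sample)

# "hero" and "heroes" are distinct strings, so they get separate counts.
hero_hits = {form: counts[form] for form in ("hero", "heroes")}

# Searching for the fragment "insur" picks up forms of the verb insure,
# but also the noun "insurance", which belongs to a different lexeme.
insur_hits = substring_matches(counts, "insur")

print(hero_hits)
print(insur_hits)
```

Getting a true lexeme count would require a lemmatizer or a hand-built list of forms for each lexeme, which is exactly the extra work that counting raw character strings avoids.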

Posted by Geoffrey K. Pullum at January 24, 2007 11:50 AM