October 02, 2004

hard opponent must hope: today even respect issues

2004 is the 40th anniversary of Gerard Salton's SMART system for full-text information retrieval, or at least of the earliest documentation of it that I've seen. One of the key insights of this system was that the content of a document can be surprisingly well approximated by nothing more than the frequency counts of the words in it. This insight is still a fundamental part of the document retrieval systems that we all use today, and in this post, I'm going to apply it to the transcript of Thursday's presidential debate.

A simple but compelling demonstration of this idea emerged as a side-effect of a collaboration, about 17 years ago, between some Bell Labs researchers in my (then) department and the HarperCollins publishing company. We were using what would now be called "data mining" to find new terms for inclusion in the 5th edition of Roget's Thesaurus. As part of that effort, HarperCollins gave us the typographer's tapes for most of the books they published one year. We used statistical techniques to find frequent words and phrases that were missing from the 4th edition, and the 5th edition's editor, Robert Chapman, then looked those lists over to decide what to add.

Ron Hardin, one of my colleagues at Bell Labs (and the original author of festoon), took the texts of these 400-odd books, and implemented a simple but elegant little hack. He counted the frequency of all the words in each book, and then sorted them according to the ratio between their frequency in that book and their frequency in the overall set. The top of each book's list gave a pretty good idea of what the book was about:

    "College: the Undergraduate Experience": undergraduate faculty campus student college academic curriculum freshman classroom professor

    "Earth and other Ethics": moral considerateness bison whale governance utilitarianism ethic entity preference utilitarian

    "When Your Parents Grow Old": diabetes elderly appendix geriatric directory hospice arthritis parent dental rehabilitation

    "Madhur Jaffrey's Cookbook": peel teaspoon tablespoon fry finely salt pepper cumin freshly ginger

If we treat each candidate's (concatenated) contributions to the debate as a document, and apply a similar metric, here's what we get:

word        Bush count  Kerry count      word        Bush count  Kerry count
hard                20            0      today                0           14
opponent            19            0      even                 0           11
must                10            0      respect              0            9
hope                10            0      issues               0            8
free                29            2      president           12           87
defeat               9            0      secretary            0            7
essential            8            0      forces               0            7
positions            7            0      union                0            6
Libya                7            0      two                  0            6
her                  7            0      training             0            6
course               7            0      states               2           20
work                25            3      sort                 0            6
won't                6            0      most                 0            6
signals              6            0      90                   0            6
justice              6            0      than                 2           18
hoping               6            0      nuclear              4           28
citizens             6            0      new                  1           10
achieve              6            0      general              1           10
we've               21            4      united               4           26

We can see the greater repetitiveness of Bush's language -- although he used about 14% fewer words (6,135 to 7,136), he repeated some of those words much more often.
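
For readers who want to try this kind of ranking themselves, here is a minimal Python sketch of the ratio metric described earlier: each word's relative frequency in one document divided by its relative frequency in the whole collection. It is not the original Bell Labs code, nor the scripts behind the debate table. The function names, the `min_count` cutoff, and the tie-breaking by raw count are all assumptions added for illustration. Some cutoff is needed because, with a plain ratio, every word that occurs in only one document gets the same maximal score, however rare it is.

```python
import re
from collections import Counter
from fractions import Fraction

def tokenize(text):
    """Lowercase the text and split it into word tokens (letters, digits, apostrophes)."""
    return re.findall(r"[a-z0-9']+", text.lower())

def distinctive_words(docs, name, top_n=10, min_count=2):
    """Rank words in docs[name] by (frequency in that doc) / (frequency in all docs)."""
    counts = {key: Counter(tokenize(text)) for key, text in docs.items()}
    overall = sum(counts.values(), Counter())
    doc = counts[name]
    doc_total = sum(doc.values())
    all_total = sum(overall.values())

    def ratio(word):
        # (count in doc / doc size) / (count overall / collection size),
        # kept as an exact fraction so that ties compare cleanly.
        return Fraction(doc[word] * all_total, overall[word] * doc_total)

    # Assumption: ignore words seen fewer than min_count times in this document.
    candidates = [w for w, c in doc.items() if c >= min_count]
    # Highest ratio first; ties go to the more frequent word, then alphabetically.
    candidates.sort(key=lambda w: (-ratio(w), -doc[w], w))
    return candidates[:top_n]

# Toy texts invented for illustration (not the real books or transcripts).
docs = {
    "cookbook": "peel the ginger and fry the ginger with salt",
    "ethics": "the moral whale and the moral bison",
}
print(distinctive_words(docs, "cookbook", top_n=1))  # ['ginger']
print(distinctive_words(docs, "ethics", top_n=1))    # ['moral']
```

For the debate, `docs` would hold two entries, one per candidate, each containing that candidate's concatenated turns.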

If I had done "stemming", some of the results would have been even more striking. For instance, we have

word        Bush count  Kerry count
hope                10            0
hoping               6            0
hopes                3            0
hopeful              1            0
total               20            0
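
To make the idea of stemming concrete, here is a small Python sketch that merges the counts of word forms sharing a crude stem. The suffix list and the `crude_stem` and `merge_by_stem` helpers are assumptions invented for this example. A real analysis would use an established stemmer (Porter's algorithm is a common choice), and this toy version will mangle plenty of other words.

```python
from collections import Counter

# Assumption: a tiny hand-picked suffix list, just enough for the "hope" family.
SUFFIXES = ("ful", "ing", "es", "ed", "s", "e")

def crude_stem(word):
    """Strip listed suffixes repeatedly, always leaving at least three letters."""
    changed = True
    while changed:
        changed = False
        for suffix in SUFFIXES:
            if word.endswith(suffix) and len(word) - len(suffix) >= 3:
                word = word[: -len(suffix)]
                changed = True
                break
    return word

def merge_by_stem(counts):
    """Add up the counts of all word forms that reduce to the same stem."""
    merged = Counter()
    for word, n in counts.items():
        merged[crude_stem(word)] += n
    return merged

# Bush's counts for the "hope" family, taken from the table above.
bush_hope = Counter({"hope": 10, "hoping": 6, "hopes": 3, "hopeful": 1})
print(merge_by_stem(bush_hope))  # Counter({'hop': 20})
```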

In this particular case, Kerry's avoidance of the word is just as interesting as Bush's overuse of it. Bill Clinton was the "man from Hope" -- but lexically, John Kerry had no hope at all on Thursday evening. I wonder if that was a conscious attempt to avoid association with Clinton's theme, or perhaps the result of a desire to sound strong by avoiding any mention of hypothetical states of affairs. Maybe his handlers told him something like "not hopes, but policies".

If they didn't, perhaps they should have, given how weak Bush's use of this word turned out to be. Four times, he referred to the hopes of enemies: "their hope is that we grow weary and leave"; "hoping to shake our will"; "hoping that the world would turn a blind eye". He repeated a stock phrase "the hopes and aspirations" (of foreigners who want democracy) three times. He used the word "hope" twice to mock Kerry's proposed reliance on alliances ("...the hope that somehow resolutions and failed institutions will make this world a more peaceful place"; "...let's, you know, hope to talk him out"). However, the rest of the hopes were expressions of his own desire to experience unreal states, mostly in respect to better days in Iraq or problems of nuclear proliferation:

"I hope it's as soon as possible"; "I would hope I never have to"; "I was hopeful diplomacy would work"; "I would hope never to have to use force"; "I would hope we never have to"; "I certainly hope so"; "I hope we can do the same thing"; "I hope we can do it"; "I was hoping diplomacy would work"; "I went there hoping that..."

I share all of these hopes, as it happens, but his way of talking about them made him seem weak. Al Qaeda hopes to shake our will: never happen. Arab reformers hope for democracy: lots of luck, folks, it looks like a long road. Kerry hopes that the U.N., NATO and the Arab nations will bail us out in Iraq: yeah, right. And what does Bush offer us, lexically at least, when asked about bringing U.S. troops home from Iraq, or dealing with nuclear weapons in North Korea and Iran? His hopes. Um, uh, wait a minute...

I don't think that most of Bush's other lexical themes worked for him, either. His 20-fold repetition of hard, for example, was mainly about "hard work" (11 times), "how hard it is" (3 times), "working hard" (twice) and similar senses. This reminded me of the argument that I get from a student who has flat-out failed an exam, but wants to persuade me that due to ceaseless toil and dedication, they deserve a better grade. I'll confess that I'm a sucker for this kind of argument, but really, there's a difference between diligence and accomplishment.

Kerry's top two words -- today and even -- surprised me. At first I thought they would show how abstract he can be. But in fact both of these were part of attack modes that worked pretty well. Today was all about facing the unpleasant realities on the ground:

And so, today, we are 90 percent of the casualties and 90 percent of the cost...
And you go visit some of those kids in the hospitals today who were maimed because they don't have the armament.
And today, there are four to seven nuclear weapons in the hands of North Korea.
We've got a backdoor draft taking place in America today...
Now, there are terrorists trying to get their hands on that stuff today.

And even adds bite to accusations, without seeming mean-spirited:

Iraq was not even close to the center of the war on terror before the president invaded it.
They avoided even the advice of their own general.
Even the administration has admitted they haven't done the training...

Kerry used "sort of" six times, five of them being used to soften a criticism of his opponent -- and Bush used this sequence not at all. At first I interpreted this in terms of the "Kerry is weak and wishy-washy" meme, but on balance I think it adds politeness -- which many people appreciate -- without really removing the sting from his criticism:

What I think troubles a lot of people in our country is that the president has just sort of described one kind of mistake.
He cut it off, sort of arbitrarily.
Now, that, I think, is one of the most serious, sort of, reversals or mixed messages that you could possibly send.
And there, again, he sort of slid by the question.
But let me talk about something that the president just sort of finished up with.
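
Counting a two-word sequence like "sort of" takes only a small change to word counting: slide a window over the tokens and compare each window with the phrase. A minimal Python sketch follows, where `count_phrase` is a hypothetical helper named for this example and the sample is just the five sentences quoted above, not the full transcript.

```python
import re

def count_phrase(text, phrase):
    """Count occurrences of a multi-word phrase, ignoring case and punctuation."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    target = phrase.lower().split()
    n = len(target)
    # Compare every window of n consecutive tokens with the target phrase.
    return sum(tokens[i:i + n] == target for i in range(len(tokens) - n + 1))

# The five Kerry sentences quoted above, joined into one sample text.
kerry_sample = " ".join([
    "What I think troubles a lot of people in our country is that the "
    "president has just sort of described one kind of mistake.",
    "He cut it off, sort of arbitrarily.",
    "Now, that, I think, is one of the most serious, sort of, reversals "
    "or mixed messages that you could possibly send.",
    "And there, again, he sort of slid by the question.",
    "But let me talk about something that the president just sort of "
    "finished up with.",
])
print(count_phrase(kerry_sample, "sort of"))  # 5
```

Because the tokenizer drops punctuation, the hedged "sort of," in the third sentence is counted along with the others.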

In the end, what I've been doing here is a very limited and superficial form of analysis. It models the meaning of a text as the statistics of a "bag of words", a model that deserves the adjective Fred Jelinek is fond of applying to the (similarly simple and effective) linguistic models used in speech recognition technology: moronic.

Still, there may be some aspects of impression-formation that are not much smarter. So if political consultants are not doing this kind of analysis already, perhaps they should be.

[Update: Cameron Marlow at overstated does something similar with the debate transcripts, except that he uses an algorithm to "parse the document and extract the noun phrases", and he ranks the results by frequency for each candidate separately, rather than looking for the most different things. He also offers access to his software. (You could have mine too, except that I just wrote simple ad hoc unix scripts...)

Similar phrase-counting was done at Amy's Robot.]

 

Posted by Mark Liberman at October 2, 2004 05:59 AM