June 13, 2004

Modeling banality

Mimi Smartypants starts a recent post with this:

I like those moments where you say or hear an unusual sentence and think: Wow, I bet no one else in the entire world said that today.

She goes on to give examples (with context and/or links) like these:

I want this to be more like a yo-yo than it can realistically be.
I blame the mango.
The patient had a history of ingesting inadequately cooked frogs.
Don't feed your racist toothpaste to the cat.

Both formal syntax and statistical language modeling have their flaws, but I think they (and we) can agree that Ms. Smartypants is too modest in her aspirations. It's a fair bet that no one else in the entire world said any of her sentences, not only on the day she noticed them, but in the whole previous history of the world. (Well, "I blame the mango" might be an exception.) Furthermore, this is true not only of the striking examples that she cites, but of the great majority of all the sentences that she and her interlocutors use in their everyday lives.

But we can also agree that she's right to notice something striking about the particular examples that she quotes. Each of them involves some pragmatically or semantically unexpected juxtapositions, like toothpaste being racist or toothpaste being fed to a cat.

Even the shortest and most ordinary lead sentences in today's Philadelphia Inquirer are surely unique:

Carlos Silva chatted up his former teammates Friday afternoon at the Metrodome.
Home-sale prices have exploded throughout the Philadelphia region in the last five years.

The thing is, these are examples of completely ordinary and even banal ideas: a recently traded athlete talking with former teammates, home prices rising in a relevant span of space and time. The particulars are variable: which athlete, which stadium, what place and time. The choice of words is also variable -- "chatted up" or "talked with" or "spent some time talking to"; "have exploded", "have risen explosively", "have soared"; etc. Take the cross-product of all the variables and you get an astronomically large number of possible banalities, only a tiny fraction of which will ever actually occur, even in the conversations and writings of billions of people.
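To make the cross-product concrete, here is a minimal sketch in Python. The slot fillers are hypothetical: a few come from the examples above, the rest are invented for illustration, and a real inventory of athletes, venues, days and phrasings would be vastly larger.

```python
from itertools import product

# Hypothetical slot fillers: illustrative only, not drawn from any corpus.
athletes = ["Carlos Silva", "Michael Silva"]
verbs = ["chatted up", "talked with", "spent some time talking to"]
days = ["Friday afternoon", "Saturday morning"]
venues = ["at the Metrodome", "at the ballpark"]

# One sentence per combination of choices, i.e. the cross-product of the slots.
sentences = [
    f"{who} {verb} his former teammates {day} {venue}."
    for who, verb, day, venue in product(athletes, verbs, days, venues)
]

print(len(sentences))  # 2 * 3 * 2 * 2 = 24
print(sentences[0])
```

Four slots with two or three options each already give 24 sentences. With thousands of athletes, hundreds of venues and dozens of phrasings, the same multiplication runs far beyond anything that has actually been said or written.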

You don't have to work very many combinations to get beyond what's on the web. Thus the string "chatted up his former" is unknown to Google, even though you can find things like "chatted up his old Quake 2 buddies" and "talked with his former teammates" and "spent some time talking to his former teammates". The string "Carlos Silva chatted" is also unknown to Google, but you can find "Carlos Silva spoke" and "Michael Silva chatted" and so on.

Because such alternatives are pretty common, the sentence "Carlos Silva chatted up his former teammates Friday afternoon at the Metrodome" is never going to strike anyone as unique or original, in the way that "Don't feed your racist toothpaste to the cat" is. The "Carlos Silva" sentence has probably never been said or written before, but there's a real sense in which it's more likely than the "racist toothpaste" sentence.

Even a crude bigram model might yield this result, though it's hard to overcome the difference between twelve words and eight. Slightly more sophisticated statistical models can account for the perceived difference between the famous sentence "Colorless green ideas sleep furiously" and the various ungrammatical re-orderings of the same words. Still more sophisticated models consider the frequency of modifier-head or verb-argument combinations -- to the extent that they can estimate them -- and would be capable in principle of noticing that toothpaste is rarely racist, or that it is rarely fed to cats. However, I'm not sure that our models are yet able to capture the true banality of the two Inky ledes, or the true originality of Ms. Smartypants' examples. Such models have no notion of the frequency of conceptual fragments, except insofar as these are pretty directly represented in word strings or at least in local syntactic relationships among words.
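As a rough illustration of the bigram point, here is a sketch of an add-one-smoothed bigram model in Python. Everything in it is an assumption made for the sake of the example: the five-line training text is invented, the two test sentences are simplified stand-ins for the Inquirer lede and the toothpaste sentence, and unseen words are mapped to a placeholder token. It shows the mechanics only and is no substitute for a model estimated from real data.

```python
import math
from collections import Counter

# Toy training text, invented for illustration; a real model needs far more data.
corpus = [
    "he talked with his former teammates",
    "he chatted up his old buddies",
    "she spoke at the stadium friday afternoon",
    "feed the cat",
    "do not blame the toothpaste",
]

def tokens(sentence):
    # Add sentence-boundary markers around the lowercased words.
    return ["<s>"] + sentence.lower().split() + ["</s>"]

contexts = Counter()  # how often each token occurs as the left side of a bigram
bigrams = Counter()   # how often each adjacent pair occurs
for line in corpus:
    toks = tokens(line)
    contexts.update(toks[:-1])
    bigrams.update(zip(toks, toks[1:]))

# Vocabulary of seen tokens plus a placeholder for unseen words.
vocab = {t for line in corpus for t in tokens(line)} | {"<unk>"}

def transitions(sentence):
    toks = [t if t in vocab else "<unk>" for t in tokens(sentence)]
    return list(zip(toks, toks[1:]))

def log_prob(sentence):
    # Sum of add-one-smoothed bigram log probabilities.
    return sum(
        math.log((bigrams[(a, b)] + 1) / (contexts[a] + len(vocab)))
        for a, b in transitions(sentence)
    )

def avg_log_prob(sentence):
    # Per-transition average, so sentence length does not dominate.
    return log_prob(sentence) / len(transitions(sentence))

banal = "he chatted up his former teammates friday afternoon at the stadium"
odd = "do not feed the racist toothpaste to the cat"
for s in (banal, odd):
    print(f"{log_prob(s):8.3f} {avg_log_prob(s):7.3f}  {s}")
```

On this toy data the raw total favors the shorter odd sentence, which is the length problem mentioned above, while the per-transition average favors the banal one. Nothing in the model knows anything about toothpaste or racism. The odd sentence loses on average only because more of its adjacent word pairs are unseen.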

Posted by Mark Liberman at June 13, 2004 12:02 PM