December 26, 2003

Have some word salad with that word soup

I recently visited a heritage village and found myself inside a reconstructed 1870s house browsing old books. I chanced upon an old grammar of English which contained a discussion of a sentence: "She said that that 'that' that that boy used was wrong." Later, a Google search turned up thousands more sentences containing long sequences of identical words. Other repetitive sentences, like "police_NOUN police_VERB police_NOUN" and longer variants such as "(those) fish (that other) fish (like to) fish, (themselves) fish (other) fish", are sometimes used to test natural language processing (NLP) systems. I'll refer to this genre as word soup, to distinguish it from another interesting category called word salad.

Word salad is the technical term for the result of randomly tossing words into a sentence, e.g. `The a are of I'. As Steve Abney and others have pointed out, it is often possible to come up with a plausible interpretation for such sentences. In this particular case, one has to know that an "are" is 100 square metres (one hundredth of a hectare), and that "a" and "I" are names. So we can interpret `the a are of I' along the lines of `the "a" section of paddock "I"'. In general, this kind of trick is easy to do, since every word can be used to name itself and is therefore a noun, and because just about any noun can be `verbified' (i.e. we can verb most nouns).
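
For anyone who wants to toss their own, here is a minimal sketch of "randomly tossing words into a sentence", assuming a small hand-picked vocabulary. The function name `word_salad` and the `seed` parameter are my own inventions for illustration, not part of any standard tool.

```python
import random


def word_salad(vocabulary, length, seed=None):
    """Return `length` words drawn uniformly at random (with replacement)."""
    # A private Random instance keeps the result reproducible for a given
    # seed without touching the global random state.
    rng = random.Random(seed)
    return " ".join(rng.choice(vocabulary) for _ in range(length))


# Hypothetical vocabulary: the five words of the example sentence.
vocab = ["the", "a", "are", "of", "I"]
salad = word_salad(vocab, 5, seed=0)
print(salad)
```

Most outputs will be less cooperative than `The a are of I', which makes hunting for a plausible reading all the more entertaining.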

Another category we could call word minestrone, for obvious reasons: `That that that is is that that is not is not that that that is not is not that that is is is not that so.'

Posted by Steven Bird at December 26, 2003 03:40 PM