May 30, 2006

So the search engine can understand

An article by Steve Lohr in the New York Times last April 9, about the way newspapers are using duller headlines online to make sure they get the right pickup by news-hunting web crawlers, contains the following quote from the head of product development and technology at BBC News Interactive, Nic Newman:

"The search engine has to get a straightforward, factual headline, so it can understand it," Mr. Newman said.

Now, if I seem a bit over-cautious here, keep in mind that BBC News is the organization that brought you the telepathic parrot and the three-headed frog, and Language Log is a little bit concerned that loonies have infiltrated the fine organization in question. But if Mr Newman's remark here is taken at face value, he would appear to believe that search engines understand things.

I will present no view here on whether machines might or might not be in principle capable of understanding (you'll want to vote yes if you go with Turing, no if you go with Searle; for now, I'm neutral), but my understanding is that search engines today, on this planet, cannot conceivably be described as understanding anything at all. The headline scanners of Google News do scan headlines and the first paragraphs of stories, and they do pick up enough information to classify the stories (the Google News page is put together entirely by machines, which is a really remarkable achievement). But the scanners simply look for words (letter strings) that normally have low frequency and thus might be clues to the topic at hand. (For example, they conclude nothing at all from finding "the", which occurs in nearly all sentences, but they conclude quite a lot from seeing "Iran", which in texts on most subjects is rare.) They don't read for content, get the drift of the story, compare the sense of the paragraphs with their background knowledge and common sense, and chat about the issues with their friends. They tabulate letter strings and do statistical computations.
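
For the programmers among my readers, the kind of tabulation I have in mind can be sketched in a few lines of Python. This is emphatically not Google's code, which I have not seen; it is a toy illustration of the general idea, under my own assumptions. The function names (tokenize, document_frequencies, topic_clues), the four-sentence corpus, and the smoothed logarithmic rarity score are all invented for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase the text and pull out runs of letters.
    # Nothing here knows what any of the strings mean.
    return re.findall(r"[^\W\d_]+", text.lower())


def document_frequencies(corpus):
    # For each letter string, count how many documents contain it.
    df = Counter()
    for doc in corpus:
        df.update(set(tokenize(doc)))
    return df


def topic_clues(headline, corpus, top_n=3):
    # Rank the headline's letter strings by rarity across the corpus.
    df = document_frequencies(corpus)
    n_docs = len(corpus)
    scores = {}
    for word in set(tokenize(headline)):
        # Smoothed inverse document frequency (an assumed scoring rule):
        # rare strings score high, ubiquitous ones score zero or near it.
        scores[word] = math.log((1 + n_docs) / (1 + df[word]))
    # Highest score first; ties are broken alphabetically.
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_n]


# A made-up background corpus, standing in for a large body of text.
corpus = [
    "The council voted on the budget",
    "The team won the final match",
    "The minister said the talks with Iran would resume",
    "The weather will turn cold in the north",
]

print(topic_clues("The talks with Iran", corpus, top_n=2))
# prints ['iran', 'talks']
```

The string "the" turns up in every document of the toy corpus and so scores zero, while "iran" turns up in only one and lands at the top of the list. At no point does the program consult anything except counts of letter strings.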

The very least one has to admit about machine understanding is that there is a big difference between a search engine algorithm and a genuine understander like you or me, and I'm not saying it necessarily reflects well on me. If you switch a Google-style search engine algorithm from working on English to working on Arabic, it will very largely work in the same way, provided only that you make available a large body of Arabic text from which it can draw its frequency information. (I have actually met people working at Google on machine processing of stories in Arabic. They do not know how to read Arabic. They don't need to.) I, on the other hand, will become utterly useless after the switch. I will no longer be able to classify news stories at all (I don't even know the Arabic writing system, so I can't even see whether "Iran" is in a paragraph or not).

Call the machines cleverer, or call me cleverer, I don't care, but we're not the same kind of animal, and it seems to me that the verb "understand" is utterly inappropriate as a term for what Google News algorithms do.

[Added later: People from the programming culture have been mailing me to point out that a metaphorical use of the word ("The compiler won't understand that unless you put brackets round it") is commonplace among programmers. And if Nic Newman is a programmer, the above could well be regarded as unfair. Maybe so. In that case, just ignore the above cautions. But be very aware that metaphor is in play. Google's algorithms are ingenious and they work very well; but they understand things only in a very attenuated metaphorical sense under which you might also say that a combination door lock set to 4357 understands you when you punch in 4357 but not when you punch in 4358.]

Posted by Geoffrey K. Pullum at May 30, 2006 05:35 PM