January 23, 2005

What researchers really need from web search

As long as we're talking about research use of Google, warts and all, let me point out something that we really, really need from our internet search engines. I'm not talking about a coherent implementation of boolean search -- I'm assuming that Google will get that one straightened out at some point. And when I say "we", I don't just mean us linguists, computational or otherwise. Anyone who wants to use web search for rational inquiry needs this. That includes anthropologists, psychologists, sociologists, political scientists, rhetoricians, language teachers, marketing researchers, and just plain folks. Can you guess what I mean?

We all need to be able to get accurate approximate counts of things that Google can't search for. For example, we need to be able to count uses of words in particular constructions, or with designated senses, or in particular sorts of discourse contexts, or written by people with particular allegiances or attitudes, or with specified connotations or emotional loading.

"Well," you may be thinking, "duh". As the proverb says, people in hell want ice water. What I'm asking for isn't possible, not because Google doesn't have enough computers, but because there aren't any accurate algorithms for computing any of these things.

All the same, there's something simple that a search engine could do for us that would solve all these problems. More or less.

All we need is for the engine to forget, temporarily, about page rank and other fancy algorithms for sorting query results, and just give us a random sample of the set of pages returned by a query. Then we could use sampling techniques to harness human judgment, in an efficient way, to give us the numbers we need. The problem with the current situation is that in many (most?) cases, higher page-rank pages and lower page-rank pages have different distributions and interactions of the relevant features of linguistic form, content and context. The upshot is that human evaluation of the (high page rank) pages that Google will let you see can't reliably be extrapolated to the (low page rank) pages that you can't see. Some earlier discussions of these issues on Language Log can be found here, here, here, and here, among other places.
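To make the sampling idea concrete, here is a minimal sketch (not anything any search engine actually offers): assuming we could get a truly random sample of the pages matching a query, a human judge labels each sampled hit for the property of interest, and we extrapolate the proportion to the engine's reported total, with a simple normal-approximation confidence interval. The function name, the hit count, and the labeled sample below are all hypothetical.

```python
import math

def extrapolate_count(total_hits, sample_labels, z=1.96):
    """Estimate how many of total_hits pages have some property,
    given True/False hand-labels for a random sample of them.
    Uses a normal-approximation (Wald) confidence interval."""
    n = len(sample_labels)
    p = sum(sample_labels) / n          # observed proportion in the sample
    se = math.sqrt(p * (1 - p) / n)     # standard error of the proportion
    estimate = total_hits * p
    low = total_hits * max(0.0, p - z * se)
    high = total_hits * min(1.0, p + z * se)
    return estimate, (low, high)

# Hypothetical numbers: the engine reports 2,000,000 hits, and in a
# random sample of 1000 of them a human judge finds 150 with the
# word sense (or construction, or connotation) of interest.
labels = [True] * 150 + [False] * 850
est, (lo, hi) = extrapolate_count(2_000_000, labels)
print(f"estimated {est:,.0f} pages, 95% CI ({lo:,.0f} - {hi:,.0f})")
```

The point of the sketch is that a sample of a thousand hits already pins the extrapolated count down to within a few percent, which is why a random-sample option would be so valuable; none of this works if the thousand hits you can see are the thousand with the highest page rank.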

From what I understand about how such search engines work, it should be possible to offer a reasonable-sized random sample (say a thousand hits) without prohibitive computational cost. Given how well optimized the current algorithms are for returning results in a useful order, I'm sure the cost of the extra computation would still be significant. But maybe one of the search engine companies could offer this as an extra-cost service? Or a public service?

While we're waiting, let me share with you Groucho Marx's response to those who want ice water. "Ice Water? Get some Onions - that'll make your eyes water!"


Posted by Mark Liberman at January 23, 2005 05:59 PM