May 24, 2004

GKP, Gh, GPB

Geoff Pullum (GKP) suggests that we start using new units Gh, KGh, MGh and so on, as a way of formalizing the counting of Google hits, now traditional among web language folk, and dubbed "ghits" earlier this year by Trevor at kaleboel. This is a terrific idea, though I'm afraid that Geoff's suggested pronunciation "gee hits" has little chance of making headway against the competitor [gɪts] (hard g, rhymes with grits).

However, I want to propose a more substantive addition to Geoff's proposal, based on the fact that to get a measure of frequency (which is what we often but not always want), we need to normalize by the size of the set searched. In this case, the set is the documents in Google's index, and Google puts the size of this set right up on its front page. As I write this, the number is 4,285,199,774. Now if we take a modest number of Ghits -- a few hundred to a few thousand -- and divide by 4,285,199,774, we'll get an unpleasantly small number. For example, {Pullum} has a count of 23,800 Gh, or 23.8 KGh, or .0238 MGh, which is a pretty respectable number. However, 23,800 divided by 4,285,199,774 is 5.554e-06, or .000005554 Ghits per document indexed.
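Just to make that arithmetic concrete, here's a minimal Python sketch of the division, using the two figures quoted above (this is nothing more than the long division, not anything Google computes for you):

```python
# Normalizing a raw Ghit count by the size of Google's index,
# using the two figures quoted above.
index_size = 4_285_199_774   # documents in Google's index as of this writing
pullum_ghits = 23_800        # raw Ghits for {Pullum}

frequency = pullum_ghits / index_size
print(f"{frequency:.3e} Ghits per document indexed")   # about 5.554e-06
```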

We can deal with this the way we deal with other uncomfortably large standard measures like farads and lenats, by using prefixes such as micro- and nano-. Thus the frequency of Pullum becomes 5.554 microGh/document, or 5,554 nanoGh/document. In general, I think that the nano-scale measure is the right one to use for term frequencies, since the plausible range of useful frequency counts then corresponds to a sensible range of natural numbers: one nanoGh/document corresponds today to about 4 or 5 Ghits; ten nanoGh/document corresponds to about 40 or 50 Ghits; a thousand nanoGh/document corresponds to about 4,000-5,000 Ghits; a million nanoGh/document corresponds to about 4 or 5 million Ghits; and so on.

We need a shorter term for this measure than "nanoGh/document" -- so I suggest that the web frequency of terms should be measured in GPB, for "Ghits per billion documents". I'll illustrate the use of this measure in some subsequent posts.
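For anyone who wants to compute this at home, here's a small Python sketch; the helper name ghits_to_gpb is mine, and the default index size is just today's front-page figure, which will of course go stale:

```python
def ghits_to_gpb(ghits, index_size=4_285_199_774):
    """Convert a raw Ghit count to GPB (Ghits per billion documents indexed).

    GPB is the same thing as nanoGh/document: divide by the number of
    documents in the index, then scale up by 10**9.
    """
    return ghits / index_size * 1e9

print(round(ghits_to_gpb(23_800)))   # {Pullum}: about 5554 GPB
print(round(ghits_to_gpb(1_000)))    # a thousand raw Ghits: about 233 GPB
```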

The value added by a normalized measure is that it will continue to give comparable estimates of how frequent a pattern is as the number of pages that Google indexes continues to grow. Here's a graph of (self-reported) search engine index size from 12/95 through 6/03, in terms of billions of textual documents indexed:

[Graph: self-reported index sizes, 12/95 through 6/03, in billions of textual documents. Key: GG=Google, ATW=AllTheWeb, INK=Inktomi, TMA=Teoma, AV=AltaVista.]

So a bit less than a year ago, Google indexed 3.3 billion textual pages, vs. 4.3 billion now. That's about a 30% increase. At that rate, the index will double in size in about 2.5 years, and will increase by a factor of 100 in about 17.5 years. In 2021, Google may or may not still be in business, and the web will certainly be organized in very different ways -- average document sizes may be quite different, to mention one trivial matter -- but to the extent that we want to make comparisons of frequencies over time -- even over a couple of years -- we'd better do our best to normalize counts somehow. And we linguists aspire to work on a time-scale of centuries, if not millennia.
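Those doubling and factor-of-100 figures are just compound-growth arithmetic, on the (shaky) assumption that the roughly 30% annual growth rate holds steady:

```python
import math

annual_growth = 4.3 / 3.3   # about 1.30, i.e. roughly a 30% increase per year
doubling_years = math.log(2) / math.log(annual_growth)        # about 2.6 years
hundredfold_years = math.log(100) / math.log(annual_growth)   # about 17.4 years
print(f"doubles in {doubling_years:.1f} years, x100 in {hundredfold_years:.1f} years")
```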

Of course, if we are just looking at the ratios of counts -- or frequencies -- for different cases at a given time, it doesn't make any difference whether we use counts or frequencies: the results are exactly the same. In that case, it's clearer and simpler just to use counts -- and there Geoff's Gh, KGh and so on are just the right thing. An excellent case study using such comparisons of counts can be found here at Tenser, said the Tensor. We've posted a number of examples of the same sort of analysis, for example here.
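To see why the index size drops out of any such same-day comparison, here's a toy calculation (the second count is invented purely for illustration):

```python
index_size = 4_285_199_774
count_a, count_b = 23_800, 11_900   # count_b is an invented comparison figure

ratio_of_counts = count_a / count_b
ratio_of_gpb = (count_a / index_size * 1e9) / (count_b / index_size * 1e9)
assert abs(ratio_of_counts - ratio_of_gpb) < 1e-9   # the index size cancels out
```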

Finally, I should mention that there's another issue about frequency -- document frequency and term frequency are not entirely interchangeable measures, and the cases in which they differ more or less than expected are sometimes especially interesting. For more on this, see e.g. this reference (or wait for another post on the subject). However, GPB remains a pretty decent proxy for a measure of the frequency of bits of text -- much better, and much more accessible, than anything we had just a few years ago.
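By way of a toy illustration of that distinction (and not, of course, of how any search engine actually counts): document frequency is the number of pages containing a term at least once, term frequency is the total number of occurrences, and a Ghit count is essentially the former.

```python
# Toy three-"document" corpus; document frequency vs. term frequency.
corpus = [
    "the cat sat on the mat",
    "the dog barked",
    "cats and dogs",
]
term = "the"
doc_freq = sum(1 for doc in corpus if term in doc.split())    # pages containing the term
term_freq = sum(doc.split().count(term) for doc in corpus)    # total occurrences
print(doc_freq, term_freq)   # 2 documents, but 3 occurrences
```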

[Update: Semantic Compositions suggests using capitalization to distinguish between "raw" ghits (e.g. kGh) and validated ghits (e.g. KGH). I guess one could similarly use Gpb and GPB, though I'm skeptical that folks will be able to keep the capitalization straight.]

Posted by Mark Liberman at May 24, 2004 07:52 PM