May 24, 2004

KiloGhits and megaGhits: measuring web frequency

First let me say that in this post I want to make a spelling reform proposal. Previous spelling reform proposals for English have had a disastrously unsuccessful history, but I only want to respell one word, and only by a capitalization. It relates to the matter of getting a little more serious about the terminology for measurement units in practical everyday use of the web as a corpus.

The term "ghit" for "Google hit" is slowly beginning to get established, at least here on Language Log. It seems to me unfair to Google™ to use the lower-case "g"; it should be "Ghits". We should honor Google (a company that includes "You can make money without doing evil" as part of its corporate philosophy deserves our respect) the way Heinrich Hertz (1857-1894) is honored in the abbreviation "Hz", the basic unit of measurement for frequency of wave vibrations. The term for the unit should be the Ghit, with capital G, pronounced G hit ("jee hit"), and the abbreviation should be Gh.

A thousand Ghits will be a kiloGhit (under usual US capitalization conventions, KGh, by analogy with KB for kilobytes and KHz for kiloHertz; outside the US we may expect the spelling kGh). A million Ghits will be a megaGhit (MGh); a billion (10^9) Ghits, which the word "the" would probably get, would be a gigaGhit (GGh). In due course, as the web gets bigger, we may come to need a term for a trillion Ghits: a teraGhit (TGh), though the web will have to get about 250 times larger before we need that term.
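For anyone who wants to report counts in these units mechanically, here is a minimal sketch in Python. The function name `format_ghits` and the table `PREFIXES` are my own inventions for illustration, and the sketch assumes the US-style capital "K" for kilo.

```python
# Hypothetical helper: express a raw hit count in the proposed units.
# Assumes the US-style capital "K" for kilo (KGh rather than kGh).
PREFIXES = [
    (10**12, "TGh"),  # teraGhit
    (10**9, "GGh"),   # gigaGhit
    (10**6, "MGh"),   # megaGhit
    (10**3, "KGh"),   # kiloGhit
]

def format_ghits(count: int) -> str:
    """Return the count scaled to the largest prefix that fits."""
    for factor, unit in PREFIXES:
        if count >= factor:
            # "g" formatting drops trailing zeros: 4.5, not 4.500000
            return f"{count / factor:g} {unit}"
    return f"{count} Gh"

print(format_ghits(636))         # 636 Gh
print(format_ghits(4_500))       # 4.5 KGh
print(format_ghits(12_500_000))  # 12.5 MGh
```

Note that a value such as 0.616 MGh comes out as 616 KGh under this scheme, since the sketch always picks the largest prefix that does not exceed the count.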

A measurement in Ghits will be by definition a count of the number of web pages returned by a search pattern. A pattern gets n Ghits if and only if searching the web using Google yields n distinct web pages that contain tokens of the pattern. I do think the pages should be distinct: it seems to me that duplicate pages should in principle be eliminated if the notion of a Ghit is to mean anything. Since it is perfectly possible for a page on the web to have an identical copy at a different URL (this probably happens quite a bit), it is clearly possible for copies of pages to come up as separate hits in the list when you run a Google search. That means that the number of items on the list returned by the Google search engine will only be a rough approximation to the actual Ghit count for your search string. It also will not be a measure of the number of occurrences of the string on the web: the number of occurrences will be higher than the Gh value because a page will often contain multiple occurrences.
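The three quantities distinguished above (items on the returned list, distinct pages, and occurrences) can be told apart on a toy example. The sketch below is only an illustration: the four-page "web" in `PAGES`, the example URLs, and the function names are all made up, and treating two pages as duplicates only when their text is exactly identical is a simplifying assumption, not a claim about how Google works.

```python
# Toy "web": URL -> page text. Two URLs carry identical copies of one page.
# All URLs and texts here are invented for illustration.
PAGES = {
    "http://a.example/1": "a ghit is a google hit and a ghit is a unit",
    "http://b.example/mirror": "a ghit is a google hit and a ghit is a unit",
    "http://c.example/2": "one more ghit here",
    "http://d.example/3": "nothing relevant",
}

def raw_hits(term: str) -> int:
    """Items on the result list: one per URL whose page contains the term."""
    return sum(1 for text in PAGES.values() if term in text.split())

def ghits(term: str) -> int:
    """Distinct pages containing the term (exact duplicates counted once)."""
    distinct_texts = {text for text in PAGES.values() if term in text.split()}
    return len(distinct_texts)

def occurrences(term: str) -> int:
    """Total tokens of the term across every URL."""
    return sum(text.split().count(term) for text in PAGES.values())

print(raw_hits("ghit"), ghits("ghit"), occurrences("ghit"))  # 3 2 5
```

Here the list length (3) overstates the Gh value (2) because of the mirrored page, and the occurrence count (5) is higher still because one page uses the word twice.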

Notice that a pattern is a set of strings, not a string. The pattern {ghit} gets 636 Gh, most of them spurious (as Mark pointed out here). But the set {ghit, "Google hit"} is also a pattern, and it gets only 7 Gh, the number of pages that contain BOTH "ghit" and "Google hit". Those are all genuine hits for the word "ghit" that we're talking about, the one that I say should be respelled "Ghit". Switching to plurals gives us {ghits, "Google hits"}, which gets 9 Gh.

Adding strings to a pattern set either keeps the Gh the same or decreases it. There may be quite a few people using Google who do not fully understand that. It would be reasonable to think that a search using the pattern {flowers tulips daffodils pansies dahlias roses} might do even better at getting pages about flowers than {flowers} would, but that is not true; it gets far fewer pages, a difference of about three orders of magnitude: 4.5 KGh for {flowers tulips daffodils pansies dahlias roses}, 12.5 MGh for {flowers}. That's because an otherwise relevant page missing just one of the words, say "dahlias", will be ruled out under Google's search principles if "dahlias" is included in the search pattern. Remember also that putting a string of words in quotes turns them into a single word-like unit (call it a pseudoword): searching with the 2-word pattern {chocolate cake} will give utterly different results (far higher: 2.16 MGh) than searching with the 1-pseudoword pattern {"chocolate cake"} (0.616 MGh = 616 KGh).
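Both points, that every string in the pattern must be present and that a quoted pseudoword must occur as an unbroken sequence, can be sketched on a tiny made-up corpus. Everything below (the five pages in `PAGES`, the helper names `matches` and `ghits`) is hypothetical, and the matching is a crude whitespace-token approximation of a real search engine.

```python
# Five invented pages standing in for the web.
PAGES = [
    "roses and tulips are flowers",
    "flowers tulips daffodils pansies dahlias roses",
    "flowers for sale tulips daffodils pansies roses",  # no dahlias
    "chocolate cake recipe",
    "a cake with white chocolate",
]

def matches(page: str, term: str) -> bool:
    """True if the term's words occur in the page as one contiguous run.

    A one-word term is the trivial case; a multi-word term plays the
    role of a quoted pseudoword.
    """
    words, target = page.split(), term.split()
    n = len(target)
    return any(words[i:i + n] == target for i in range(len(words) - n + 1))

def ghits(pattern: set[str]) -> int:
    """Distinct pages containing EVERY string in the pattern."""
    return sum(1 for page in set(PAGES) if all(matches(page, t) for t in pattern))

print(ghits({"flowers"}))  # 3
print(ghits({"flowers", "tulips", "daffodils", "pansies", "dahlias", "roses"}))  # 1
print(ghits({"chocolate", "cake"}))  # 2
print(ghits({"chocolate cake"}))     # 1
```

The six-word pattern loses the third page solely because "dahlias" is missing from it, and the quoted pattern loses the fifth page because "chocolate" and "cake" are both there but not adjacent.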

Posted by Geoffrey K. Pullum at May 24, 2004 02:17 PM