October 15, 2005

Web search counts: half empty or half full of __?

Heidi Harley at Heideas has posted some thoughts on Scalar Adjectives with Arguments, illustrated by a cartoon:

Heidi observes that "half empty of X" seems bad to her, though "empty of X" is fine; and she points out that Google counts support her judgments, since the ratio of the frequencies of "half full of" and "half empty of" is much higher than the ratio of the frequencies of "half full" and "half empty" -- by a factor of more than 100.

This reminded me of the ongoing concerns about the usefulness of web counts for linguistic analysis. One way to evaluate this is to look at the consistency of the numbers across different web search engines, as I did in the post just linked to. As discussed in that post (and especially in the posts by Jean Veronis cited there), there are several reasons for inconsistencies across search engine counts:

  • differences in what's indexed, especially in how duplicate documents and nests of fake "search engine optimization" documents are excluded;
  • differences in how much of each document is indexed;
  • differences in how counts are estimated (since the numbers are often an extrapolation from a small sample).

The advantage of the web search engines is that they index a lot of documents, so that you can get a reasonable sample size for fairly small corners of the language. The disadvantage is that (despite their best efforts) they index a lot of crap, and (at least in some cases) their counts may be estimated by methods that are not very accurate.

In order to offer a small numerical window on some of these issues, I thought I'd repeat Heidi's experiment with two other search engines (Yahoo and MSN in addition to Google), and with three other corpora of various types and sizes.

I added the GigaWord corpus of English-language newswire available from the Linguistic Data Consortium, which currently comprises about 2.5 billion words (2,458,744,437 words in 5,710,419 documents, to be exact); the LDC's collection of English conversational telephone speech (CTS), which currently comprises about 26 million words (26,151,602 words in 28,274 conversations, to be exact); and the World Edition of the British National Corpus, which includes about 100 million words (100,467,090 orthographic words in 4,054 texts, to be exact).

 
| | Google | Yahoo | MSN | GW (words) | GW (docs) | CTS | BNC |
|---|---|---|---|---|---|---|---|
| full | 1,890,000,000 | 2,000,000,000 | 367,645,836 | 524,609 | 436,154 | 2,577 | 28,215 |
| empty | 137,000,000 | 192,000,000 | 37,086,190 | 71,923 | 64,181 | 280 | 5,379 |
| full / empty ratio | 13.8 | 10.4 | 9.9 | 7.3 | 6.8 | 9.2 | 5.2 |
| half full | 2,030,000 | 4,090,000 | 683,765 | 2,111 | 2,031 | 14 | 66 |
| half empty | 1,580,000 | 2,530,000 | 384,730 | 1,966 | 1,908 | 8 | 35 |
| half full / half empty ratio | 1.28 | 1.62 | 1.78 | 1.1 | 1.1 | 1.8 | 1.9 |
| half full of | 248,000 | 397,000 | 63,841 | 123 | 119 | 2 | 22 |
| half empty of | 1,910 | 1,420 | 1,682 | 3 | 2 | 0 | 1 |
| half full of / half empty of ratio | 130 | 280 | 38 | 41 | 60 | NA | 22 |
| (half full of / half empty of) / (half full / half empty) meta-ratio | 101 | 173 | 21 | 38 | 56 | NA | 12 |
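For readers who want to recompute the derived rows, here is a minimal Python sketch. The counts are copied from the table; the names `SOURCES`, `COUNTS`, `ratio`, and `meta_ratios` are made up for this illustration. It assumes the meta-ratio is the "of" ratio divided by the plain "half full" / "half empty" ratio, which is the comparison Heidi describes and appears to be what reproduces the tabulated values.

```python
# Counts copied from the table; column order is fixed by SOURCES.
SOURCES = ["Google", "Yahoo", "MSN", "GW (words)", "GW (docs)", "CTS", "BNC"]
COUNTS = {
    "half full":     [2_030_000, 4_090_000, 683_765, 2_111, 2_031, 14, 66],
    "half empty":    [1_580_000, 2_530_000, 384_730, 1_966, 1_908, 8, 35],
    "half full of":  [248_000, 397_000, 63_841, 123, 119, 2, 22],
    "half empty of": [1_910, 1_420, 1_682, 3, 2, 0, 1],
}


def ratio(numerator, denominator):
    """Return numerator / denominator, or None when the denominator is zero."""
    return numerator / denominator if denominator else None


def meta_ratios(counts):
    """Map each source to (full of / empty of) / (half full / half empty).

    Assumption of this sketch: the denominator is the bare-phrase ratio.
    Sources with a zero count (CTS here) map to None, i.e. "NA".
    """
    out = {}
    for i, source in enumerate(SOURCES):
        with_of = ratio(counts["half full of"][i], counts["half empty of"][i])
        bare = ratio(counts["half full"][i], counts["half empty"][i])
        out[source] = None if with_of is None or bare is None else with_of / bare
    return out


for source, value in meta_ratios(COUNTS).items():
    print(source, "NA" if value is None else round(value, 1))
```

Small differences from the table are to be expected, since the tabulated ratios were rounded before being divided.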

I've got four points to make about this tiny little experiment.

The first point is that Heidi's results are validated: it's clear that "half empty of" is a lot less frequent than "half full of". This doesn't seem to be a matter of complete ungrammaticality, since some of the hits certainly seem perfectly fine to me:

(BNC) We swayed down the long baggage car, which was half empty of freight and very noisy, and George, having told me to remove and lay aside my waistcoat in case I got oil on it, unlocked the door at the far end.
(GW) When Mike Tyson and Buster Mathis Jr. finally entered the ring Saturday night, the 18,000-seat Spectrum was half empty of spectators and fully barren of suspense.

What's going on? See below, and look at Heidi's post and the comments from Q. Pheevr and Lance Nathan for some thoughts.

The second point is that size matters. A corpus of 26 million words (the LDC CTS corpus) is too small to address Heidi's question in a reliable way. There are only 2 instances of "half full of", and no instances of "half empty of", and from this we can determine little other than that neither of the patterns involved is terribly common. A corpus of 100 million words (the BNC) is not a great deal better. There are 22 instances of "half full of", and 1 instance of "half empty of". This is enough to support a judgment that the first pattern is genuinely more frequent than the second, but not enough to support much research into why this is or what it means. A corpus of 2.5 billion words (the LDC GW newswire corpus) is again not a great deal better: the string counts of 123 and 3 give us even better confidence that there's a difference in expected frequency, but not a great deal to go on in determining what factors are involved in discouraging or permitting the infrequent pattern.
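One rough way to put a number on "enough to support a judgment" is an exact binomial tail probability. This is only a sketch under an assumption of mine, not a method used in the original experiment: the null hypothesis is that adding "of" leaves the full/empty split unchanged, so the chance that a hit is "half full of" equals the share of "half full" among the bare-phrase hits in the same corpus. The function names are hypothetical.

```python
from math import comb


def binomial_tail(k, n, p):
    """Exact P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def tail_for(full_of, empty_of, half_full, half_empty):
    """Probability of a split at least this lopsided under the null.

    Assumed null: "of" hits split like the bare "half full"/"half empty" hits.
    """
    p = half_full / (half_full + half_empty)
    return binomial_tail(full_of, full_of + empty_of, p)


# Counts from the table: BNC and GW (word counts).
bnc = tail_for(22, 1, 66, 35)
gw = tail_for(123, 3, 2_111, 1_966)
print(f"BNC: {bnc:.2e}   GW: {gw:.2e}")
```

Under this assumed null the BNC split already comes out well below 1%, and the GW split is smaller by many orders of magnitude. That fits the point above: larger corpora raise confidence that a difference exists without by themselves explaining it.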

The third point is that the welcome size of web indices has a price: web search results are sometimes heavily weighted with crap of various sorts. In this case, over 40% of the web search counts for "half empty of" are from the typo "half empty of half full" (for "half empty or half full"): 836 of 1,910 for Google, 573 of 1,420 for Yahoo, and 725 of 1,682 for MSN. There are other problems as well:

Crisis management would be far too Glass-is-half-empty of a term to describe life since the relocation to LA ...
Only the most "glass half empty" of HR professionals will perceive that their careers are at risk
.. the large bottle, half empty, of diet cola, reminder of the previous evening's riotous revelry...
"Half empty, of course," said Pamela.

Still, we can find plenty of relevant examples to suggest ideas about why the pattern is relatively infrequent and nevertheless sometimes validly used. One clue is that many of the valid examples involve parallel contrast with "half full", e.g.

It’s half full of water and half empty of water.
...is it half empty of sadness or half full?
...remember this: when a glass is half empty of water, it's also half full of hot air!

Many of the other examples seem to be cases where "half empty" means "half emptied", referring to a point in a process of emptying, whether literally or as a metaphoric evocation of loss:

Due to the basic laws of physics, by the time your reel is half empty of line, the drag has effectively doubled.
The glass is half empty of a brown liquid, two ice cubes float in the mix, waiting to melt and become part of the whole.
The moon is half-empty of white. The houses and roads are serrated.
She looked around the familiar scene -- two glasses half empty of soft drink, overflowing ashtrays, the TV flickering with the sound off.
Sure enough, when I got back, the tub was half empty of water.
The Veneroso number suggests that central bank vaults are “one-third to one-half empty” of their reported gold.

This supports Q. Pheevr's suggestion that "The contexts in which I would use half empty are mostly ones in which the container being described is expected to be full, or was recently full and has been partially emptied". In any case, web search results need to be examined carefully, and counts need to be considered in the light of such examination.
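As a small illustration of reconsidering counts in the light of such examination, the typo figures quoted earlier for "half empty of half full" can be subtracted from the raw "half empty of" counts. This is a sketch only: the dictionary and function names are invented here, and it removes just the one typo pattern, so the other kinds of spurious hits listed above remain in the adjusted counts.

```python
# (typo hits, total hits) for "half empty of", as quoted in the post.
TYPO_HITS = {"Google": (836, 1_910), "Yahoo": (573, 1_420), "MSN": (725, 1_682)}
# "half full of" counts from the table.
HALF_FULL_OF = {"Google": 248_000, "Yahoo": 397_000, "MSN": 63_841}


def typo_share(engine):
    """Fraction of "half empty of" hits that are the "of half full" typo."""
    typo, total = TYPO_HITS[engine]
    return typo / total


def adjusted_ratio(engine):
    """"half full of" / "half empty of" after removing the typo hits only."""
    typo, total = TYPO_HITS[engine]
    return HALF_FULL_OF[engine] / (total - typo)


for engine in TYPO_HITS:
    print(engine, f"{typo_share(engine):.1%}", round(adjusted_ratio(engine)))
```

Removing the typo hits pushes each engine's ratio further from the bare "half full" / "half empty" ratio, so the raw counts understate rather than overstate the asymmetry.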

The fourth point is that web search counts are (as we already knew) quantitatively unreliable. As I've observed before, we can tell this from the instability of ratios of counts across different search engines -- and of course ratios of ratios of counts are even less stable. We can look at the issue in a different way, given exact counts from corpora whose properties we know better (very few duplicates, little or no garbage text, decent proofreading). In particular, we can use the exact (word and document) frequency counts from the GW corpus to (crudely!) estimate the number of documents indexed by the web search engines for comparable searches. Note that the structure of the GW corpus (2,458,744,437 words in 5,710,419 documents) means that it averages about 431 words per document, which should be the same order of magnitude as the length of the web documents that the search engines index.

The word full occurs in 436,154 GW documents out of 5,710,419, which corresponds to a rate of about 76.4 per thousand documents. If the rate for the search engines were the same, and their (document) counts were accurate, then MSN would be indexing about 4.8 billion documents; Yahoo about 26 billion documents; and Google about 25 billion documents. If we do the same extrapolations for empty, we get MSN at 3.3 billion, Yahoo at 17 billion, and Google at 12 billion. Doing it for "half full" gives us MSN at 1.9 billion documents, Yahoo at 11.5 billion, and Google at 5.7 billion. The estimates for "half empty" are MSN at 1.2 billion, Yahoo at 7.6 billion, and Google at 4.7 billion.
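The extrapolation just described is simple enough to write down. Here is a minimal Python sketch using the document counts from the table; the names `GW_DOC_FREQ`, `ENGINE_COUNTS`, and `estimated_index_size` are hypothetical, and the calculation inherits the crude assumption stated above: each engine's document rate for a term matches the GW rate, and its reported count is an accurate document count.

```python
GW_DOCS = 5_710_419

# GW document frequencies, listed from most to least frequent.
GW_DOC_FREQ = {"full": 436_154, "empty": 64_181, "half full": 2_031, "half empty": 1_908}

# Counts reported by each search engine, from the table.
ENGINE_COUNTS = {
    "Google": {"full": 1_890_000_000, "empty": 137_000_000,
               "half full": 2_030_000, "half empty": 1_580_000},
    "Yahoo": {"full": 2_000_000_000, "empty": 192_000_000,
              "half full": 4_090_000, "half empty": 2_530_000},
    "MSN": {"full": 367_645_836, "empty": 37_086_190,
            "half full": 683_765, "half empty": 384_730},
}


def estimated_index_size(engine, term):
    """Engine count divided by the GW per-document rate for the same term."""
    rate = GW_DOC_FREQ[term] / GW_DOCS
    return ENGINE_COUNTS[engine][term] / rate


for term in GW_DOC_FREQ:
    row = ", ".join(
        f"{engine} {estimated_index_size(engine, term) / 1e9:.1f}B"
        for engine in ENGINE_COUNTS
    )
    print(f"{term}: {row}")
```

Each output row gives the implied index size in billions of documents for one search term, with the terms ordered from most to least frequent in GW.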

All of these are order-of-magnitude consistent with the general idea that text searches on the web are now indexing roughly 10 billion documents. But there's an interesting trend: as we look at words or phrases whose document frequency in the GW corpus gets smaller, the resulting estimates of the size of the search engines' collections also get smaller. I don't think this gives us any special insight into the search engines' true index sizes, but it does suggest that their methods for estimating counts might have increasing positive bias with increasing frequency.

Posted by Mark Liberman at October 15, 2005 05:51 PM