December 06, 2003

Google-sampling: avoiding pseudo-text in cyberspace

Neat! David Beaver uses google-sampling corpus linguistics to argue that "far from" has already become an accepted pseudo-adverb, and that it occurs in Google's sample of the web at a rate of about 1 per 10 million words (roughly as often as Hammurabi or Frege, for example).

Now, I'd already learned (by asking) that younger Americans find nothing at all wrong with phrases like "he far from fulfilled his promise". I could come to like this innovation. We used to be able to say "they nearly succeeded" but not, alas, "*they farly succeeded". Now we can say "they far from succeeded": big deviations get equal adverbial time! Mere syntactic coherence is a small price to pay.

However, I want to warn you aspiring google-samplers to be careful. There are some mean texts out there, kiddies. In particular, you need to watch out for the textual wiles of gambling dens and porn parlors, who create big networks of interlinked web pages in order to boost their google score. Google tries to ignore obvious examples of this sort, so the bad guys hire renegade computational linguists to write programs that churn out pages full of searchable stuff looking enough like real text to fool Google. Stuff like "For example, a progressive jackpot indicates that a tablet a cosmopolitan hoofer. Another oed hestitates, because an ungraciously blindfold optimist a quodlibet of another progressive jackpotistry. When you see the modiste, it means that a restroom hides."

These linguistic grifters (and some other less criminal effects, such as Google's habit of indexing sequences across punctuation) have polluted David's samples to the point that his estimates are off by a factor of about 14. This doesn't invalidate google-sampling as a technique. But you have to watch out!

David used the following reasoning, in my reconstruction:

  1. According to Google's index, appropriately filtered, sentences of the form "They far from FiniteVerb ..." are about 10 times commoner than sentences of the form "They ungraciously FiniteVerb ...".
  2. The word ungraciously occurs about 10,000 times in Google, "most of which come from 'ungraciously + finite verb'".
  3. Therefore (given a few other assumptions that need to be checked!), there are about 10*10,000 = 100,000 occurrences of 'far from + finite verb' in Google's index.

This is an excellent example of creative google-sampling analysis, in form. But the content has a problem -- the samples weren't carefully enough filtered.

Looking at the very same data more carefully, it appears that a better estimate of the count of 'far from + finite verb' in Google's index would be 7,250, not 100,000 (see below for details). If Google indexes a trillion words, roughly, then the frequency of this construction is roughly one in 140 million, not one in 10 million as David estimated.

Of course, if we were serious about this question, we'd want to try some other approaches. For example, we might try inspecting a sample of occurrences of "far from" directly, to see what fraction precede finite verbs. This is harder, as I learned when I was writing my original piece on adverbial far from. Google returns 6.95 million pages for this string, and it's clear that only a very small fraction of these are adverbial uses, as you can see if you look yourself. I checked the first 150 hits and found none. On David's estimate of 100K total "far from" pseudo-adverbs, roughly 1 in 70 should be adverbial, while on my estimate of 7,250, roughly 1 in 1,000 should be. In order to get an accurate enough estimate of the rate of occurrence of a phenomenon like that, we'd have to check a sample of ten thousand pages or more. I'm sure that's why David took the more indirect approach of comparing far from to another word in a particular context where the adverbial ore is enriched, and then trying to scale the results in proportion ... So, the truth is clearly out there, but perhaps we've got enough of it now. Or more than enough; though I'm waiting for someone to point out to me that adverbial far from was used by Winston Churchill, Jane Austen, William Shakespeare and even the author of Beowulf :-)...
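To see why 150 hits settle so little, here is a rough sketch that treats the checked hits as independent random draws from the 6.95 million pages. That is an assumption (Google's ranking is anything but random, and page counts are not occurrence counts), and the function names are mine, for illustration only.

```python
import math

TOTAL_FAR_FROM_PAGES = 6_950_000  # Google's page count for "far from", as quoted above

def prob_no_hits(rate, sample_size):
    """Chance of seeing zero adverbial uses in a random sample of the given size."""
    return (1.0 - rate) ** sample_size

def relative_std_error(rate, sample_size):
    """Binomial standard error of the estimated rate, as a fraction of the rate itself."""
    return math.sqrt((1.0 - rate) / (rate * sample_size))

for label, total in [("David's 100K", 100_000), ("my 7,250", 7_250)]:
    rate = total / TOTAL_FAR_FROM_PAGES
    print(f"{label}: rate is about 1 in {1 / rate:.0f}")
    print(f"  P(no hits in 150)        = {prob_no_hits(rate, 150):.2f}")
    print(f"  relative error at 10,000 = {relative_std_error(rate, 10_000):.0%}")
```

Under these assumptions, finding nothing in 150 hits is the expected outcome on my estimate and fairly unlikely, though hardly impossible, on David's. Even at ten thousand pages, the lower rate would yield only about ten examples, so the estimate would still be rough.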

At this point, most of you readers who are still with me will want to turn your attention to something interesting, like this. But for you aspiring google-samplers, here are the details...


Google finds 9,670 pages containing ungraciously, sure enough. But only 15% are human-generated uses of the form "ungraciously+finite verb".

If this sample is typical, then a better estimate of the google count for "ungraciously + finite verb" is actually .15*9670 = 1450.

The next stage of David's analysis involves "they far from". He suggests that about 200 of 481 google hits for this sequence involve pseudo-adverbial modification of a finite verb. This sequence gets 479 google hits for me (Google gives slightly different results on different trials, for various reasons!). I checked a sample of 40 (pages 1, 5, 13, and 18 of the google hits) and found that 25% (10/40) were genuine pseudo-adverbial examples (see below for analysis of the rest). Thus "they far from" produces .25*479 = 120 cases.
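A sample of 40 leaves a fair amount of slack around that 25%. As a rough illustration, the sketch below computes a Wilson score interval for 10 out of 40 and scales it to the 479 hits. It assumes the four result pages behave like a random sample, which Google's ranking doesn't guarantee; wilson_interval is an illustrative helper, not anything David or I actually used.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives roughly 95%)."""
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

# 10 genuine pseudo-adverbial examples in a sample of 40 hits.
low, high = wilson_interval(10, 40)
hits = 479
print(f"share of genuine examples: {low:.2f} to {high:.2f}")
print(f"implied count among {hits} hits: {low * hits:.0f} to {high * hits:.0f}")
```

The same function applies to any of the other hand-checked samples; the smaller the sample, the wider the band around the point estimate.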

Finally, there is the count for "they ungraciously". Google gives me 24, all of which seem to be pre-finite-verb cases, as David indicated.

So David's 200*10000/23 = 86,956 should be 120*1450/24 = 7250, or about 7% of the 100K that he rounded up to.
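The whole chain of reasoning is a single proportion: scale the total count for ungraciously by the ratio of the two "they ..." counts. A small sketch, with scaled_count as an illustrative name and the inputs taken from the figures above:

```python
def scaled_count(target_in_frame, reference_in_frame, reference_total):
    """Scale the reference word's total by the target/reference ratio in a shared frame.

    target_in_frame:    genuine "they far from + finite verb" hits
    reference_in_frame: "they ungraciously + finite verb" hits
    reference_total:    all "ungraciously + finite verb" hits
    """
    return target_in_frame * reference_total / reference_in_frame

david = scaled_count(200, 23, 10_000)   # David's inputs
revised = scaled_count(120, 24, 1_450)  # inputs after hand-checking samples

print(f"David's inputs:  {david:,.0f}")
print(f"revised inputs:  {revised:,.0f}")
print(f"revised as a share of 100K: {revised / 100_000:.1%}")
```

The result is only as good as its three inputs, and it still rests on the unchecked assumption that "far from" and ungraciously keep the same ratio after other subjects as they do after they.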

Further details and examples are below...


Of the "they far from" hits in my sample, 25% (10 out of 40) were genuine pseudo-adverbials like this:

I am sad to report that I am not a huge Dryspell fan. They far from suck or anything, they are just not my cup of tea.

The rest were punctuation-spanning:

...they, far from being stupid, are actually hundvísir "most wise"...


They ... emphasized that not only were they far from areas where mercenaries operated, but ...

or copula-deleted:

yo bwoi u mite wanna fix up ur spellings blood, they far from desired. no offence, just relax and type when u is redy, innit man?


For ungraciously, I checked a sample of 20 -- pages #2 and #11 of the google hits, with the sample from page #2 reproduced in full below. Only 35% of them (7 out of a sample of 20) are even human uses of the word "ungraciously" at all! And only 15% (3 out of 20) occur before an active (1) or passive (2) verb in a finite clause.

Another 4 are in non-finite clauses or are post-verbal uses that are not relevant to "far from" and the like, such as "..., she said ungraciously". No one is yet starting to write things like "*..., she said far from", so we can ignore these. The other 13/20 instances of ungraciously in the sample are dictionary entries, word lists -- and especially, on-line gambling pseudo-text pages (like this one for Best Betting), generated by program to fool google and similar search engines.

2nd page of google hits for ungraciously:

conscience would permit, rather ungraciously perhaps, the indulgence of a number of carefully selected desires.

simply ignores smoker simply ignores returned
ungraciously speaking returned ungraciously speaking
returned ungraciously speaking returned ungraciously speaking
parent powers

Future citizenship manner. Chinese children
hoarse groan. Gurgle man who being
murdered Mercy ungraciously late July.
Secretary acknowledged made threats
Thai Post staff.

ungraciously - gracelessly, ungracefully, without graciousness, woodenly

ungraciously - gracelessly, ungracefully, without graciousness, woodenly

If an exudation behind a durum a stringy derby, then the immoderation beyond the sovietism self-flagellates. When you see a stitchwort, it means that an ungraciously nescient fennel feels nagging remorse. Furthermore, a sympatric fulcrum daydreams, and a consoling wingman phylogenetically a vista.

Trelawny showing Campo Santo settled Life villa Goethe work flew grand spacious Life villa Thy mountains seas vineyards ungraciously rendered gift less ungraciously rendered gift less towers bent dun faint ethereal gloom precious implanting fatal trait representation scene passion

> Ive noted that in Soc. Motts one of our users has been rather
> ungraciously badgered by a number of individuals for an occurance that
> was beyond his control.

Furthermore, the superficies returns home, and the redeeming scientist another farmland. When a cocklebur is hypermetropic, an earlier play as the dealer ungraciously a taffy over the merging. Now and then, an osteoclasis about a pompano alternatively a tangible moniliales.

"Mabbe," observed Jimmie Dale, as ungraciously as before, "mabbe dere's some more t'ings youse don't know!"



Posted by Mark Liberman at December 6, 2003 05:41 AM