April 12, 2004

Terrasyllable Introspectionist Gelded: ungraciously nescient fennel feels nagging remorse

entangledbank writes in a frustrated mood that (s)he can't rely on websearches to check for scanning errors in the public-domain Webster 1913 dictionary, because "[t]here are now two thousand indistinguishable porn sites all having massive imports of random dictionary words, for some spider-attracting or filter-avoiding purpose. This means any obscure word whatsoever, even the scanning errors, is now largely unsearchable..."

This is not just an issue for checking word lists. It's a real problem for otherwise wonderful google-sampling techniques in empirical studies of syntactic variation, as discussed in this post, in which 65% of one crucial sample turned out to be porn- or gambling-site pseudotext. In principle, one can easily create a statistical classifier to distinguish such sites from "real ones", as you can see by looking at the examples in EB's post and mine. But this is just one stage in an ongoing arms race between the web indexers and the internet demimonde (the demiweb?), and so the whole thing will have to be redone again and again. As things stand, there is no real alternative to human inspection of the samples, at least on a sampled basis.
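To make the classifier idea concrete, here is a minimal sketch in Python. It is not a trained statistical model, just a single-feature heuristic: strings of random dictionary words contain almost no function words, while ordinary English prose is full of them. The word list, the function names and the 0.15 threshold are all assumptions chosen for illustration; a real classifier would be fit to labeled samples, and would have to be refit as the pseudotext changes.

```python
import re

# Hypothetical, hand-picked list of common English function words.
# A real system would use a fuller list or learn its features from data.
FUNCTION_WORDS = frozenset({
    "the", "a", "an", "of", "to", "in", "and", "that", "is", "it",
    "for", "on", "with", "as", "was", "be", "by", "this", "at", "not",
    "but", "or", "are", "have", "i",
})


def function_word_ratio(text):
    """Return the fraction of tokens in `text` that are function words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in FUNCTION_WORDS)
    return hits / len(tokens)


def looks_like_pseudotext(text, threshold=0.15):
    """Flag text whose function-word ratio falls below `threshold`.

    The default threshold is an illustrative guess, not a tuned value.
    """
    return function_word_ratio(text) < threshold


if __name__ == "__main__":
    samples = [
        "ungraciously nescient fennel feels nagging remorse",
        "This is not just an issue for checking word lists.",
    ]
    for sample in samples:
        print(looks_like_pseudotext(sample), "-", sample)
```

A word string like the one in this post's title contains no function words at all and gets flagged, while an ordinary sentence does not. A heuristic this simple is exactly the sort of thing the next round of the arms race would defeat, which is why spot-checking by hand remains necessary.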

This particular tale has a twist. My post on the perils of googlesampling ended with a challenge: "I'm waiting for someone to point out to me that [an apparently novel construction] was used by Winston Churchill, Jane Austen, William Shakespeare and even the author of Beowulf."

Geoff Pullum looked around on his bedroom bookshelf (or perhaps it might have been his laptop's hard drive), and found an 18th-century example in Fanny Hill. Advantage: Pullum.

Posted by Mark Liberman at April 12, 2004 06:54 PM