November 20, 2005

So I says to myself, "Self, what's up with these Googlecounts?"

In my recent post on the difficulties of Googlinguistics, I heeded Mark Liberman's warning to be suspicious about the reliability of Googlecounts much greater than 100,000. But an attempt at some Google-aided snowclone research suggests that the upper limit for reliability may in some cases be on the order of 1,000 or less.

First, let me explain a quirk in the way that Google displays search results. Overall Googlecounts are especially meaningless if one searches on bits of text that appear verbatim on multiple websites, as with song lyrics, poems, public-domain literature, etc. This is particularly evident with disjunctive queries, which use the minus sign to exclude particular search terms. Compare these search results, for instance, on pages with lyrics to the song "Junco Partner" (a traditional New Orleans song, also known as "Junker Partner," recorded by many performers including Harry Connick, Jr.):

(A) "junco partner" lyrics            9,440
(B) "junco partner" lyrics connick      279
(C) "junco partner" lyrics -connick     930

Googlecounts are notorious for exhibiting variations based on the time and place that the search engine is accessed. Still, results from one user at one time, such as mine above, should at least be internally consistent. These results clearly are not, since we would expect A to be roughly equal to the sum of B (mostly pages referring to Connick's cover version) and C (mostly pages referring to versions by other performers, such as the Clash). Ideally, of course, A should be exactly equal to the sum of B and C.

But now try adding "&start=950" to the end of the URL for each search. At the bottom of the page, Google gives a message like this:

In order to show you the most relevant results, we have omitted some entries very similar to the N already displayed.
If you like, you can repeat the search with the omitted results included.
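The URL tweak described above can be sketched in a few lines of Python. This is only an illustration: the base address and the "q" parameter are my assumptions about the usual form of a Google search URL, and the helper name deep_results_url is hypothetical. Only the "start=950" part comes from the description above.

```python
from urllib.parse import urlencode

# Assumption: the usual Google search endpoint, with the query in "q".
BASE_URL = "https://www.google.com/search"


def deep_results_url(query, start=950):
    """Build a search URL that jumps straight to result number `start`."""
    # urlencode percent-encodes the quotation marks and turns spaces into "+".
    return BASE_URL + "?" + urlencode({"q": query, "start": start})


# The three "Junco Partner" queries, each with start=950 appended.
queries = (
    '"junco partner" lyrics',
    '"junco partner" lyrics connick',
    '"junco partner" lyrics -connick',
)
for query in queries:
    print(deep_results_url(query))
```

Pasting each printed address into a browser should have the same effect as adding "&start=950" by hand.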

For these queries, Google has found many search results that are more or less identical; such pages most likely contain the same song lyrics with slightly different surrounding content. Focusing on the counts for only the "most relevant" results, i.e., the ones that Google deems to be non-identical, I currently get:

(A) "junco partner" lyrics            413
(B) "junco partner" lyrics connick     85
(C) "junco partner" lyrics -connick   374

Using this search method, the sum of B and C at least approximates A without an enormous margin of error. The lesson here is that the "most relevant" results should provide more trustworthy numbers when dealing with text that appears frequently on the Web with only minor differences across sites. However, this technique will not help if the search string appears non-identically on more than 1,000 pages. For more commonly appearing search strings, the "most relevant" results will cut off at some number under 1,000 (generally between 800 and 950).
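To put a number on "without an enormous margin of error," here is a small Python sketch of the consistency check, where A is the count for the plain query, B the count with connick added, and C the count with connick excluded. The function name partition_gap is my own, and the counts are simply the ones I reported above.

```python
def partition_gap(total, with_term, without_term):
    """Return the sum of the two partial counts and its relative gap from the total."""
    parts = with_term + without_term
    return parts, abs(total - parts) / total


# (A, B, C) for each way of counting, as reported in this post.
counts = {
    "total Googlecounts": (9440, 279, 930),
    '"most relevant" results': (413, 85, 374),
}

for label, (a, b, c) in counts.items():
    parts, gap = partition_gap(a, b, c)
    print(f"{label}: A={a}, B+C={parts}, gap={gap:.0%}")
```

With the total Googlecounts, B plus C falls short of A by about 87 percent; with the "most relevant" results, B plus C overshoots A by about 11 percent.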

Now to the snowclone research I mentioned. I was curious about a snowclonish turn of phrase that is often used to indicate a jokey interior monologue (or dialogue, actually): (So) I says to myself, "Self (I says)..." In September, I mentioned this expression on the American Dialect Society mailing list, asking if anyone knew of its origin. Surprisingly, despite the fact that it sounds like it comes from some old vaudeville routine, no one was able to find an example before the 1980s. John Baker tracked down this example from the Boston Globe of May 31, 1981, quoting a New Hampshire gardener:

"Becuz of the mild weatha at the end of the winta, every blossom and blade of grass was weeks ahead of schedule. Back about the middle of this month, I says to myself, 'Self, mebbe this is the yea to plant early,' but then I hud the ghost of my fatha and his fatha sayin', 'Plant the tumatuz on Memorial Day.' I held off."

It's notable that the example uses New England "dialect writing." The use of says with a first-person singular pronoun is common in representations of reported speech in numerous American dialects. Here are examples of I says from two of our most illustrious (and perceptive) dialect writers, Mark Twain and Ring Lardner:

"Geewhillikins," I says, "but what does the rest of it mean?"
"We ain't got no time to bother over that," he says; "we got to dig in like all git-out."
"Well, anyway," I says, "what's SOME of it? What's a fess?"
—Mark Twain, The Adventures of Huckleberry Finn, Ch. 38

I says Well I won the pot didn't I? He says Yes and he called me something. I says I got a notion to take a punch at you.
He says Oh you have have you? And I come back at him. I says Yes I have have I? I would of busted his jaw if they hadn't stopped me. You know me Al.
—Ring Lardner, You Know Me Al, Ch. 1

It's not surprising, then, that the dialectal form I says turns up not just in representations of "authentic" American speech but also in jocular expressions like (So) I says to myself, "Self..." Very often with this snowclone, though, the standard forms of the verb (i.e., present-tense I say or past-tense I said) are used instead, as in this example from Buffy the Vampire Slayer:

Willow: Yeah.. I- I know I've been sort of a party-poop lately, so I said to myself, "Self!" I said, "It's time to shake and shimmy it off."
Buffy the Vampire Slayer, "Something Blue," Season 4 (aired Nov. 30, 1999)

It's also common to see the snowclone appear with other verbs appropriate for the self-reporting of an interior monologue, such as think/thought or ask/asked. (However, the mock-dialectal equivalents of I says, namely I thinks and I asks, appear very rarely.)

Here is where Googlecounts would be particularly valuable to calculate the relative frequencies of variant forms. Below are the results that I found in searches conducted back in September and now, for both total Googlehits and "most relevant" results:

Search string                     9/26 total   9/26 most relevant   11/20 total   11/20 most relevant
"so I said to myself, self"       1,480        493                  4,680         585
"so I says to myself, self"       3,240        169
"so I say to myself, self"        996
"so I thought to myself, self"    1,850
"so I think to myself, self"      377
"so I asked myself, self"         495
"so I ask myself, self"           370

We would expect the "total" results to be skewed by the repetition effect noted above. Interestingly, though, the total figures have stayed relatively constant since September, except for an expansion of results for said (perhaps in part due to various fan sites repeating the bit of dialogue from Buffy the Vampire Slayer, or an undercounting of previous such repetitions).

But if we discount the total figures and focus on the "most relevant" results, the figures have not been particularly stable. Results for said have increased, though more moderately than the jump in "total" results. Raw hits for says, say, and thought, however, have increased significantly, and it's highly doubtful that there were actually that many more examples for Google to count in the past two months. Only the low-frequency results for think, asked, and ask have stayed at constant levels, and they all return fewer than 200 "most relevant" results. (Note that the automatic stemming discussed in the previous post wouldn't have an effect here, since Google applies stemming only to individual search terms, not to words in a string with quotation marks.)

Perhaps it's better to ignore the raw numbers and simply rank the results. Then we find a jump in the rankings for said in the total results and for say in the most relevant results:

Total results:
 9/26: (says-thought)-said-(say-asked-think-ask)
11/20: said-(says-thought)-(say-asked-think-ask)

"Most relevant" results:
 9/26: (said-thought-says)-(think-asked-ask)-say
11/20: (said-thought-says)-say-(think-asked-ask)

The jump for say in the most relevant results possibly rectifies a previous undercounting, since it brings all the rankings into rough alignment: said/says/thought in the top three spots, followed by say, rounded out by think/asked/ask in the bottom three spots.

My suspicion is that the shifts in Googlecounts since September are largely due to various "invisible" factors, such as changes in Google's searching algorithms and its methods of extrapolating results based on small samples. It's a bit distressing, though, that only the search strings with the lowest frequencies (under 500 total results, under 200 most relevant results) show much stability. But I suppose these are matters of interest only to reporters and computational linguists.

[Update #1: Just to be clear, the turn of phrase under investigation is crucially marked by the vocative use of the word self. Dialect writing has plenty of examples of "(so) I says to myself (I says)..." without vocative self, e.g.:

"'Where be the stoat?' he says — 'I ain't seen 'em,' I says. Well, next day we goos again — and I says to myself, I says, — 'I wunt be afeared of a stoat,' I says — so I caught 'em that time — gor' how he did bite surely — they be wonderful bitten things, stoats."
— "A Summer Stroll in Sussex" by Edward Clayton, The Living Age, June 7, 1890, p. 637

So I says to myself I says, there you are, greedyguts, I says, if that pot had smashed Friday night so's you couldn't eat the business cooked in it they wouldn't be smashing all the pots and pans and plates this Saturday night.
— "A Pot Story" by S.J. Agnon, in A Golden Treasury of Jewish Literature, edited by Leo W. Schwarz (1937), p. 358

The use of self as a playful mode of self-address adds an extra layer of self-conscious irony to the expression. In fact, the 1981 example from the Boston Globe given above is the only example I've seen with vocative self that purports to be a genuine representation of unironic speech.]

[Update #2: It's possible that children's rhymes could have provided an early template for this sort of pattern. Here are two examples that seem like they could be related to the snowclone, though neither uses vocative self:

As I walked by myself,
And talked to myself,
Myself said unto me:
"Look to thyself,
Take care of thyself,
For nobody cares for thee."
I answered myself,
And said to myself
In the selfsame repartee:
"Look to thyself,
Or not look to thyself,
The selfsame thing will be."
The Real Mother Goose by Blanche Fisher Wright (1916)

James James
Morrison Morrison
Weatherby George Dupree
Took great
Care of his Mother,
Though he was only three.
James James
Said to his Mother,
"Mother", he said, said he;
"You must never go down to the end of the town,
if you don't go down with me."
James James
Morrison's Mother
Put on a golden gown,
James James
Morrison's Mother
Drove to the end of the town.
James James
Morrison's Mother
Said to herself, said she:
"I can get right down to the end of the town and be
back in time for tea."
When We Were Very Young by A.A. Milne (1924)

Thanks to Mark Liberman for reminding me of the latter.]

[Update #3: Barbara Zimmer points out that So I say(s) to myself, "Self..." has been popularized in recent years by the chef Emeril Lagasse. On his show on the Food Network, Emeril has developed the catchphrase into a call-and-response between him and his studio audience, with the audience expectantly chiming in "Self!" (much as Johnny Carson's audience would chime in "How cold IS it?") ]

[Final update: See further commentary here.]

Posted by Benjamin Zimmer at November 20, 2005 06:00 PM