Language Log: In the weeks and months

May 27, 2007

In the weeks and months

At President Bush's May 24 press conference, his opening statement included the following passage:

This summer is going to be a critical time for the new strategy. The last of five reinforcement brigades we are sending to Iraq is scheduled to arrive in Baghdad by mid-June. As these reinforcements carry out their missions the enemies of a free Iraq, including al Qaeda and illegal militias, will continue to bomb and murder in an attempt to stop us. We're going to expect heavy fighting in the weeks and months. We can expect more American and Iraqi casualties. We must provide our troops with the funds and resources they need to prevail.

His prediction has been featured in broadcast news sound bites over the past couple of days, and every time I hear it, I notice the apparent failure to complete the thought. "In the weeks and months ahead"? "In the weeks and months to come"? Presumably something like that was in the original text.

The disfluency before "weeks" (a broken-off production of "month"?) makes it clear that something went wrong in the president's performance:

Another phonetic signpost of trouble was the size of the silent pause preceding that sentence: 2.25 seconds, vs. about 1.5 to 1.7 seconds for W's within-paragraph pauses up to that point.

In reporting on the news conference, some stories supplied a suitable time modifier in parentheses or outside the quotation marks:

“We’re going to expect heavy fighting in the weeks and months” to come, Bush told a White House news conference.
"This summer is going to be a critical time for the new strategy," said Bush. "We're going to expect heavy fighting in the (coming) weeks and months."

Others relied on a semantically complete paraphrase, or gave up and just used the quote intact, relying on their readers to understand the meaning in context.

In fact, not much context is required. I suspect that if you ask practiced readers of journalistic, political and commercial discourse in English to complete the phrase

in the weeks and months __

most of them would supply "ahead" or "to come" as their first guess.

That's certainly what an algorithm based on contextual frequency would do. Google finds 310,000 pages containing {"in the weeks and months"}, and almost two thirds of them continue either as "in the weeks and months ahead" (140K) or "in the weeks and months to come" (65K). If we add "in the weeks and months following" (27.3K), "in the weeks and months after" (27.2K), and "in the weeks and months that followed" (14.1), we get 273.6K, or 88%. (Yes, I know that assuming superposition of Google counts is naive, but the additivity of counts is probably good enough, these days, to support this particular argument.)

To use one of the president's signature words, I find it interesting that "in the weeks and months __" is so radically biased towards the future. In comparison, Google has only 2 hits for "in the weeks and months earlier", 7 for "in the weeks and months previous", 8 for "in the weeks and months that preceded". The only even reasonably common past-oriented continuation that I can come up with is "in the weeks and months before", with a mere 966.

But I wonder what larger patterns this is part of. Is it just a fact about particular English conjunction "weeks and months", or perhaps the prepositional phrase "in the weeks and months"? Or could there be a more general bias towards following events into the future rather than tracking them back into the past? It's obviously time for a Breakfast Experiment^™.

The president's example can be generalized in many directions. Given the available tool of textual search on the web, the easiest dimensions to check are the ones defined by simple string substitutions. So I'll pour another cup of coffee and give it a try.

If we look at a range of time-units from seconds to centuries, and limit ourselves to the common past- vs. future-oriented continuations "before" and "after", we get this:

	seconds	minutes	hours	days	weeks	months	years	decades	centuries
"in the __ before"	15.4K	30.6K	138K	952K	320K	314K	668K	115K	37.6K
"in the __ after"	799	18.1K	106K	413K	193K	223K	667K	137K	38.5K
before/after ratio	19.3	1.7	1.3	2.3	1.7	1.4	1.0	0.84	0.98
before percentage	95%	63%	57%	70%	62%	58%	50%	46%	49%

The ratio of total before counts to total after counts is 1.44 (2.59M to 1.8M), and the overall percentage of before in the before+after total is 59%.

So it's clear that there's no general bias in favor of looking towards the future -- in this particular set of string-substitution contexts, the past is winning.

But if we do the same thing with conjunctions of adjacent pairs of time units in order of increasing size, like "in the seconds and minutes before" or "in the hour and days after", we get this very different pattern:

	seconds and minutes	minutes and hours	hours and days	days and weeks	weeks and months	months and years	years and decades	decades and centuries
"in the __ before"	1	51	155	11.4K	966	597	198	69
"in the __ after"	8	208	14K	31.4K	27.2K	15.2K	688	85
before/after ratio	0.13	0.25	0.01	0.36	0.04	0.04	0.29	0.81
before percentage	11%	20%	1%	27%	3%	4%	22%	45%

Now the ratio of total before counts to total after counts is 0.15 (13.4K to 88.8K), and the overall percentage of before in the before+after total is 13%. When we look at the all the conjunctions of time units in ascending order of size -- not just "weeks and months" -- the future is winning by a landslide!

Here's the same data, presented graphically. When the phrasal head is a single time unit, the past generally wins:

But when the phrasal head is a conjunction of time units in ascending order, the future kicks the past's ass:

What's going on here?

The consistency of the pattern across time-units makes it clear that it's not a fact about any particular lexical item, or about any particular collocation of lexical items. Is the effect due only to the conjunction of (adjacent?) time units, or does it matter that tme units are ordered from smaller to larger? Let's try it the other way around:

	minutes and seconds	hours and minutes	days and hours	weeks and days	months and weeks	years and months	decades and years	centuries and decades
"in the __ before"	15	367	2.94K	376	314	882	2	6
"in the __ after"	1	1	474	3	6	4	0	0
before/after ratio	15	367	6.2	125	53	221	-	-
before percentage	93.8%	99.7%	86%	99%	98%	99.5%	100%	100%

Changing the order of the time units, so that the larger one comes first, flips the effect towards the past. Now the ratio of total before counts to total after counts is 10 to 1 (4,902 to 489), and the overall percentage of before in the before+after total is 91%. Graphically:

OK, I think it's clear what's going on.

It's natural to think of the time-scale narrowing down -- zeroing in -- as our perspective gets closer in time to the event under discussion. And it's natural to think of (say) "hours and days" as hours followed (temporally) by days, while "days and hours" is days followed temporally by hours.

From those two assumptions, it follows that a conjunction of time units in order of increasing size (like "hours and days") will more naturally be used to describe time after an event; while a conjunction of units in decreasing-size order (like "days and hours") will more naturally be used for time before an event.

And that's what happens!

If this theory is right, then the same thing ought to happen in French or Chinese or Turkish -- as long as the assumptions continue to hold, about zeroing in on events and listing time periods in chronological order.

[Update -- Anatol Stefanowitsch writes:

cool.
It also works in German (see attached .csv file).

Turned into html tables, the file that Anatol attached looks like this:

Table 1 -- smaller units first:

	sekunden und minuten	minuten und stunden	stunden und tagen	tagen und wochen	wochen und monaten	monaten und jahren	jahren und jahrzehnten	jahrzehnten und jahrhunderten
in den _ (vor\|bevor\|davor\|vorher)	1	1	10	871	992	87	181	18
in den _ (nach\|nachdem\|danach\|nachher)	0	7	203	1350	1270	411	310	18
vor/nach ratio	-	0.14	0.05	0.65	0.78	0.21	0.58	1
vor percentage	100%	13%	5%	39%	44%	17%	37%	50%

Table 2 -- larger units first:

	minuten und sekunden	stunden und minuten	tagen und stunden	wochen und tagen	monaten und wochen	jahern und monaten	jahrzehnten und jahren	jahrhunderten und jahrzehnten
in den _ (vor\|bevor\|davor\|vorher)	0	9	116	382	66	33	0	3
in den _ (nach\|nachdem\|danach\|nachher)	0	0	1	0	2	0	0	0
vor/nach ratio	-	-	116	-	33	-	-	-
vor percentage	-	100%	99%	100%	97%	100%	-	100%

I venture to suggest that the kind of research represented by this collective Breakfast Experiment^™ deserves a name. "Yes", I hear you say, "how about 'Computational Linguists With Way Too Much Time On Their Hands'?" No, actually, what I had in mind was something more like "Google Cognitive Linguistics".]

Posted by Mark Liberman at May 27, 2007 07:03 AM