September 20, 2004

Which vs. that: a test of faith

I agree completely with Geoff Pullum's views on the relationship between the which/that choice and the distinction between "integrated" and "supplementary" relative clauses. Copy-editors' strictures against using which in integrated relatives are an invention -- what in ordinary life we would call a lie -- with no basis in the facts of the English language. Specifically, that is no longer used in supplementary relatives; but in integrated relatives, both which and that continue to be in common use by all the best writers, as has been true for centuries.

However, I partly disagree with Geoff on one secondary question. He thinks that "reading a few books and noting the thats and whiches and forming semantic hypotheses" is not worth the trouble, because "it would amount to looking for a meaning difference that isn't there". This violates my belief -- maybe it should be called a prejudice or an article of faith -- that if there's a difference in form, there will generally turn out to be a difference in meaning, at least in one of the weaker senses of that protean word. These differences may be the lingering residue of a lost history -- of etymology or dialect or register -- or they may be an emerging association, engendered by compositional convenience, phonetic resonance or collocational accident. The differences are likely to be contextual and gradient. But my theology of linguistics, which is simple-minded but deeply felt, tells me that we'll find the differences if we look for them.

On the other hand, common grammatical morphemes like that and which are about as unlikely as any words can be to gather this sort of meaning-moss. So to test my faith, I decided to take up Geoff's challenge. It's likely that someone has already explored this area more thoroughly -- I didn't take the trouble to do a literature search -- but I'll present you with the fruits of a few minutes spent Googling.

I looked for evidence relating to two "semantic hypotheses", one having to do with humanity (or perhaps a more general hierarchy of animacy) and the other with (degree of relative clause) integration. I'll discuss the "humanity" finding in the rest of this post, and you can make up your own mind whether the results are worth the trouble. I'll take up the integration-gradation in another post.

It's well known that there's a contrast between which and who as relative pronouns -- CGEL (pp. 497-499) characterizes this as the difference between "personal gender" and "non-personal gender". The facts are interestingly complicated, but the main point is that who is used for humans except in certain special circumstances, and which similarly for non-humans. However, the word that is obviously available for relative clauses with both human and non-human referents: "the man that corrupted Hadleyburg"; "the dog that didn't bark"; "the land that time forgot".

This is a kind of "meaning difference" between which and that -- which requires "non-personal gender" while that imposes no gender constraints. But Geoff already knows this -- he wrote the book, literally.

However, the facts are not quite so simple. If we look at integrated relative clauses of the form "those that/which/who ...", we expect to find who used for persons, which used for non-persons, and that used freely for either one. And for who and which, that's just the way it works out. However, in the case of "those that...", there seems to be a strong overall preference (roughly 90%) for human referents. This is far from the 50/50 split that lack of personhood might seem to predict. Is my simple faith rewarded? Not yet, as it turns out -- but read on...

Google finds 1,570,000 pages containing the string "those which". I checked three pages of ten instances each (numbers 1, 5 and 10 in my search); unsurprisingly, I found 26 instances of non-human referents, as in

Temperate bonsai are those which require cool winter temperatures.
They ought to regulate their decisions by the fundamental laws, rather than by those which are not fundamental.
This took the form of a questionnaire sent to all Anglican cathedrals in England, followed by individual visits to those which showed a particular interest in being involved in the project

and no instance of human referents (the other four cases were irrelevant things like "those 'which are you' quizzes", or duplicate pages).

Google finds 15,400,000 pages containing the string "those who", and again, in a sample of 30 I found 29 instances of human referents, and no instances of non-humans.

Google found 6,220,000 pages containing the string "those that". When I checked my three pages of ten examples each, I found 26 with human referents, like these:

Program students are those that live in dormitories or group homes on Heartland property.
Look, Paul, let me put it another way, those that aren't with us are against us.
Will initial teacher training for those that are not yet qualified teachers be different to that done by those joining the programme as qualified teachers?

as opposed to 3 instances of nonhuman referents, e.g.

Great countries are those that produce great people.

So about 90% of the "those that..." examples refer to people. Is Geoff wrong? This looks like a meaning difference (other than the obvious one) influencing the which/that choice. I mean, when a choice is supposed to be completely unspecified, but 90% of the tests go one way, that looks like a pretty big effect.

But it isn't -- because there's a contextual bias. Remember that in the relevant cases where we can tell, personal gender ("those who") is about ten times commoner than non-personal gender ("those which"). 15,400,000 to 1,570,000 ghits, to be precise, or 9.8 times commoner. Combining these two cases, we have 15,400,000/(15,400,000+1,570,000) = 90.7% personal gender.
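The base-rate arithmetic here is easy to reproduce. The following is a minimal Python sketch using only the two hit counts quoted above; the function and variable names are illustrative and not part of the original post.

```python
# Google hit counts quoted in the post (2004 figures).
WHO_HITS = 15_400_000    # "those who"   -> personal gender
WHICH_HITS = 1_570_000   # "those which" -> non-personal gender


def personal_share(who_hits: int, which_hits: int) -> float:
    """Share of personal-gender hits among the unambiguous who/which cases."""
    return who_hits / (who_hits + which_hits)


ratio = WHO_HITS / WHICH_HITS
share = personal_share(WHO_HITS, WHICH_HITS)

print(f"who is {ratio:.1f} times commoner than which")
print(f"personal gender share: {share:.1%}")
```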

So it's hardly a surprise that in the case of "those that...", where personhood is ambiguous, 26/29 examples in my sample (89.7%) turned out to have human referents. This is exactly the sort of result we expect from an underlying random process that is biased to produce human referents 9.8 times more often than non-human ones. Chalk up a score for Geoff and the "meaning difference that isn't there."
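One rough way to put a number on "exactly the sort of result we expect" is to compare the observed count with the mean and standard deviation of a binomial distribution. This is only a sketch: it assumes the 29 relevant examples are independent draws at the who/which base rate, which a handful of Google result pages certainly are not, and the variable names are mine.

```python
from math import sqrt

# Base rate of personal gender from the "those who" / "those which" counts.
P_PERSONAL = 15_400_000 / (15_400_000 + 1_570_000)

N_RELEVANT = 29   # relevant "those that" examples in the sample
N_HUMAN = 26      # of which had human referents

# Mean and standard deviation of a Binomial(N_RELEVANT, P_PERSONAL) count.
expected_human = N_RELEVANT * P_PERSONAL
std_dev = sqrt(N_RELEVANT * P_PERSONAL * (1 - P_PERSONAL))

# Distance of the observed count from the expectation, in standard deviations.
z_score = (N_HUMAN - expected_human) / std_dev

print(f"expected human referents: {expected_human:.1f} of {N_RELEVANT}")
print(f"standard deviation: {std_dev:.2f}")
print(f"observed {N_HUMAN} is {z_score:+.2f} standard deviations from expected")
```

Under these assumptions the observed 26 falls well within one standard deviation of the expected count.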

But let's continue a little bit further, and add a bit more context, in the form of a verb that selects subjects at the animate end of the great chain of being, like live. The string "those who live" gets 397,000 ghits, and "those which live" only 984; so in the "those (who|which) live" context, personal gender wins 99.8% of the time.

However, the string "those that live" gives me 27,300 ghits, and in a sample of 30 of these, 17 referred to humans and 12 were animals. Only 58.6% human. What gives?

And it gets worse: "those which live", in a sample of 30 (of 984), had 26 instances referring to animals, but 3, unexpectedly, referring to humans -- 10% "personal gender" where we expected none.

Here's a tabular summary of this case:

 
                         ghits      personal   non-personal   % personal
                                    (of 30)    (of 30)
"those who live"         397,000    30         0              100%
"those which live"       984        3          26             10.3%
100*who/(who+which)                                           99.8%
"those that live"        27,300     17         12             58.6%
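The percentage column of this summary can be recomputed from the raw counts. The sketch below assumes, as the figures in the post imply, that irrelevant hits are dropped from the denominator (so 3 of 29, not 3 of 30), and it uses the ghit counts quoted in the running text; the names are illustrative.

```python
# (personal, non-personal) counts from each sample of 30;
# irrelevant hits are excluded, so the two need not sum to 30.
samples = {
    "those who live": (30, 0),
    "those which live": (3, 26),
    "those that live": (17, 12),
}

# Google hit counts for the unambiguous forms, as quoted in the text.
WHO_LIVE_HITS = 397_000
WHICH_LIVE_HITS = 984


def pct_personal(personal: int, non_personal: int) -> float:
    """Percentage of relevant examples with a human referent."""
    return 100 * personal / (personal + non_personal)


percentages = {phrase: pct_personal(*counts) for phrase, counts in samples.items()}
base_rate = 100 * WHO_LIVE_HITS / (WHO_LIVE_HITS + WHICH_LIVE_HITS)

for phrase, pct in percentages.items():
    print(f'"{phrase}": {pct:.1f}% personal')
print(f"100*who/(who+which): {base_rate:.1f}%")
```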

Looking at the 3 human heads that I found in my sample of "those which live", it's easy to come up with some possible explanations. In the first place, all three were from old texts, like Malthus' 1798 work "An Essay on the Principle of Population":

The rest of the inhabitants might be 1200 naked miserable and despicable Arabs, like the rest of those which live in villages.

and a passage from a 16th-century work informatively entitled "THE TRUE PICTURES AND FASHIONS OF THE PEOPLE IN THAT PART OF AMERICA NOW CALLED VIRGINIA, DISCOVERED BY ENGLISHMEN sent thither in the years of our Lord 1585, at the special charge and direction of the Honorable SIR WALTER RALEIGH Knight Lord Warden of the stannaries in the duchies of Carenwal and Oxford who therein has been favored and authorized by her MAJESTY and her letters patents. Translated out of Latin into English by RICHARD HACKLVIT. DILIGENTLY COLLECTED AND DRAWn by JOHN WHITE who was sent thither specially and for the same purpose by the said SIR WALTER RALEIGH the year abovesaid 1585. and also the year 1588. now cutt in copper and first published by THEODORE de BRY at his own charges":

The apparel of the chief ladies of that town differ but little from the attire of those which live in Roanoke.

and a sermon preached in 1658:

Those which live in impiety, and depart in their iniquity, they which have here provoked the wrath of God, and goe hence with that wrath abiding on them, as they could create nothing to their relations but sorrow in their life, so must they necessarily increase it at their death.

In addition to being old, all three examples also refer to ethnically or morally subordinated people. Though the N is too small to be very confident about either of these explanations, both seem plausible, and could be explored further if this were a real piece of research and not just an hour's test of linguistic faith.

In any case, we still don't have any explanation for the shortfall in human instances of "those that live..." Here's another little contextual test where it looks like there is a similar problem. This time we'll use the word concern, which predisposes the construction towards a non-human referent:

 
                         ghits      personal   non-personal   % personal
                                    (of 30)    (of 30)
"those who concern"      661        30         0              100%
"those which concern"    3,740      0          30             0%
100*who/(who+which)                                           15.0%
"those that concern"     4,790      1          29             3.3%

Here indeed the non-personal forms are overall much commoner than the personal ones (about 85% by the who/which test), but again, that is a lot less likely to be human than the who/which ratio in the same context would suggest.
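To gauge how large the two shortfalls are, one can ask how probable it would be to see so few human referents if that simply followed the who/which base rate in the same context. The sketch below computes an exact binomial lower tail for both verbs. It is a back-of-the-envelope check of my own, not a method used in the post, and it assumes independent sampling and that ghit ratios are a fair estimate of the base rate; both assumptions are shaky.

```python
from math import comb


def lower_tail(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


# Counts quoted in the post: ghits for who/which, and the number of human
# referents among the relevant "those that ..." examples in each sample.
contexts = {
    "live": {"who": 397_000, "which": 984, "human": 17, "relevant": 29},
    "concern": {"who": 661, "which": 3_740, "human": 1, "relevant": 30},
}

tails = {}
for verb, c in contexts.items():
    base_rate = c["who"] / (c["who"] + c["which"])
    tails[verb] = lower_tail(c["human"], c["relevant"], base_rate)
    print(
        f"{verb}: base rate {base_rate:.1%}, "
        f"observed {c['human']}/{c['relevant']} human, "
        f"P(this few or fewer) = {tails[verb]:.3g}"
    )
```

With these inputs the "live" shortfall lies far outside what chance alone would produce, while the "concern" shortfall is much less extreme.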

What's going on here?

Is that always less likely to be human than predicted by who/which, as a sort of statistical version of the prescriptivist stricture that I ridiculed in an earlier post? Perhaps, but I'd want to look at more than two contexts before coming to this conclusion. Is that more likely to be omitted in introducing relative clauses when the head is human? Maybe, but subject relatives (where that is hardly ever omitted) form the bulk of these sets, so I don't think this can be the explanation for the effect.

On balance, I think my faith is upheld, though ambiguously and mysteriously.

Coming up: integration gradation.

 

Posted by Mark Liberman at September 20, 2004 04:21 PM