May 15, 2004

All your base are belong to which lexical category?

The first sentence of this BBC story about Intel's profits took me slightly aback:

For the three months to 27 March, the Californian-based company made a profit of $1.7bn, almost double the $915m recorded for the same period in 2003.

Shouldn't that be "California-based"? I thought to myself.

Checking relative frequencies on the web, I got an answer: 416,000 ghits for "California-based" vs. 2,580 for "Californian-based". 99.4% of the web agrees with my judgment -- and also with grammar and logic, it seems to me, since "X-based" should be a compositional compound, meaning "based in (or on) X", where X is a noun. Nobody would say "based in Californian." QED. The feeble 0.6% are just confused, I thought smugly. Perhaps they are attracted by the irrelevant analogy of other adjective-noun sequences. So much for the Beeb, how the mighty have fallen, etc.

But wait a minute, said the still small voice of conscience. How about "European-based"? Doesn't that sound just as good as "Europe-based", or maybe even better? Checking the web, I found 42,600 ghits for "Europe-based" vs. 60,700 for "European-based": 41% for the noun, 59% for the adjective. Even-steven from a grammatical point of view (though an adjectival landslide in electoral terms!)
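For anyone who wants to check the percentages, here is a minimal sketch of the arithmetic. The function name `noun_share` is just an illustrative label of mine, and the hit counts are the ones quoted above.

```python
def noun_share(noun_hits: int, adjective_hits: int) -> float:
    """Return the noun form's share of all hits, as a percentage."""
    total = noun_hits + adjective_hits
    if total == 0:
        raise ValueError("no hits for either form")
    return 100.0 * noun_hits / total


# Hit counts quoted in the text.
california = noun_share(416_000, 2_580)
europe = noun_share(42_600, 60_700)

print(f"California-based: {california:.1f}% noun form")
print(f"Europe-based: {europe:.1f}% noun form")
```

The share is computed over the two competing forms only, so it says nothing about how common either form is in absolute terms.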

And looking at the next few examples of relevant noun/adjective pairs that occurred to me makes the picture even murkier. "Boston-based" is 80,000 times commoner than "Bostonian-based", but "Canada-based" is about 25% less common than "Canadian-based", and so on:

 
X-based, for X =                noun   adjective     ratio
Athens/Athenian                6,460          11       587
Boston/Bostonian             240,000           3    80,000
California/Californian       416,000       2,580       161
Canada/Canadian               70,300      94,400     0.745
China/Chinese                 39,000       7,450      5.23
Egypt/Egyptian                 4,920       4,520      1.09
Europe/European               42,600      60,700     0.702
France/French                 24,800      29,100     0.852
Germany/German                44,800      44,300      1.01
Greece/Greek                   3,320       2,970      1.12
Ireland/Irish                 34,400      16,300      2.11
Israel/Israeli                20,100       6,750      2.98
Japan/Japanese                43,800       8,940      4.90
Korea/Korean                  14,900       5,680      2.62
Latvia/Latvian                   558         250      2.23
Nigeria/Nigerian               2,070         853      2.43
Norway/Norwegian               8,400       3,800      2.21
Paris/Parisian                91,000         297       306
Pennsylvania/Pennsylvanian    45,100          38     1,187
Russia/Russian                10,400       6,600      1.58
Scotland/Scottish             31,100      28,900      1.08
Tunisia/Tunisian                 336         130      2.59
Turkey/Turkish                 4,840       1,570      3.08
Vienna/Viennese               21,000          32       656

(Some of these should probably be removed from consideration, at least pending reanalysis, because the "adjective" forms are really nouns much of the time, as in "Greek-based" meaning "based on the Greek language". I don't think this will change the overall picture much. It's possible that a more careful accounting for other sense differences and other details of semantic relationships would clear things up, but I doubt it.)

Adding it all up, it's about 79% for the nouns, 21% for the adjectives. A victory for logical grammar, but hardly a resounding one. There are several pockets of stalwart adjectival resistance (or craven concession to adjectival irrationality?): Europe, France, and Canada, with noun/adjective ratios of 0.70, 0.85, and 0.75 respectively. Germany is on the edge at 1.01.
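The pooled figure can be reproduced with a short sketch like the one below. The counts are copied from the table; the variable names are mine, and the pooling simply sums raw hit counts, so high-volume items such as California and Boston weigh far more than Latvia or Tunisia.

```python
# (noun hits, adjective hits) for "X-based", copied from the table.
COUNTS = {
    "Athens": (6_460, 11), "Boston": (240_000, 3),
    "California": (416_000, 2_580), "Canada": (70_300, 94_400),
    "China": (39_000, 7_450), "Egypt": (4_920, 4_520),
    "Europe": (42_600, 60_700), "France": (24_800, 29_100),
    "Germany": (44_800, 44_300), "Greece": (3_320, 2_970),
    "Ireland": (34_400, 16_300), "Israel": (20_100, 6_750),
    "Japan": (43_800, 8_940), "Korea": (14_900, 5_680),
    "Latvia": (558, 250), "Nigeria": (2_070, 853),
    "Norway": (8_400, 3_800), "Paris": (91_000, 297),
    "Pennsylvania": (45_100, 38), "Russia": (10_400, 6_600),
    "Scotland": (31_100, 28_900), "Tunisia": (336, 130),
    "Turkey": (4_840, 1_570), "Vienna": (21_000, 32),
}

# Pool all hits, then take the noun forms' share of the total.
total_noun = sum(noun for noun, _ in COUNTS.values())
total_adj = sum(adj for _, adj in COUNTS.values())
noun_pct = 100.0 * total_noun / (total_noun + total_adj)

# Places where the adjective form outnumbers the noun form (ratio below 1).
adjective_preferred = sorted(
    place for place, (noun, adj) in COUNTS.items() if noun < adj
)

print(f"nouns: {noun_pct:.0f}%, adjectives: {100 - noun_pct:.0f}%")
print("adjective form more common for:", ", ".join(adjective_preferred))
```

An unweighted alternative would average the per-place shares instead, which would give every place-name an equal vote regardless of how often it turns up.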

Seriously, it's clear that different place-names are behaving differently here. What's the principle, if any? Word length? Unigram (word) frequency? Longitude? Affix? Country vs. City? Few of my first few hypotheses are even true, and none of them explain much of the variance.
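As an illustration of how one of these hypotheses might be checked, here is a rough sketch that correlates the length of the place-name with the log of the noun/adjective ratio. The choice of log10, the use of Pearson's r, and the helper name `pearson` are all my own assumptions for the sake of the example, not a description of how the hypotheses were actually tested.

```python
import math

# (place-name, noun hits, adjective hits), copied from the table.
ROWS = [
    ("Athens", 6_460, 11), ("Boston", 240_000, 3),
    ("California", 416_000, 2_580), ("Canada", 70_300, 94_400),
    ("China", 39_000, 7_450), ("Egypt", 4_920, 4_520),
    ("Europe", 42_600, 60_700), ("France", 24_800, 29_100),
    ("Germany", 44_800, 44_300), ("Greece", 3_320, 2_970),
    ("Ireland", 34_400, 16_300), ("Israel", 20_100, 6_750),
    ("Japan", 43_800, 8_940), ("Korea", 14_900, 5_680),
    ("Latvia", 558, 250), ("Nigeria", 2_070, 853),
    ("Norway", 8_400, 3_800), ("Paris", 91_000, 297),
    ("Pennsylvania", 45_100, 38), ("Russia", 10_400, 6_600),
    ("Scotland", 31_100, 28_900), ("Tunisia", 336, 130),
    ("Turkey", 4_840, 1_570), ("Vienna", 21_000, 32),
]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Predictor: number of letters in the noun. Outcome: log10 of noun/adjective ratio.
lengths = [len(name) for name, _, _ in ROWS]
log_ratios = [math.log10(noun / adj) for _, noun, adj in ROWS]

r = pearson(lengths, log_ratios)
print(f"r = {r:.2f}, share of variance explained (r squared) = {r * r:.2f}")
```

Swapping in a different predictor (unigram frequency, longitude, a country/city flag) only requires replacing the `lengths` list.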

And what if we picked a different head, such as "X-oriented" or "X-bound" or "X-educated"? Would the statistics be similar, or different?

And does all this have anything to do with the compounds that don't involve a de-verbal head at all, but are created by adding "-ed" to a modified noun, as in "red-haired"? In other words, is the construction (at least sometimes and for some people) [[Canadian base]+ed]? If so, does that offer any traction in explaining the enormous variation in usage statistics sketched above? I don't see how, but at least it would provide a grammatically and logically plausible analysis for such phrases.

Then again, maybe I'm just being old-fashioned in expecting a coherent compositional account of how regular phrasal patterns acquire their form and meaning, as opposed to the currently-spreading view that "our interpretive capacities take into account holistic informational characteristics of linguistic constructions and don't simply generate meanings by way of 'bottom up' recursion principles."

Posted by Mark Liberman at May 15, 2004 07:38 AM