Language Log: All your base are belong to which lexical category?

May 15, 2004

All your base are belong to which lexical category?

The first sentence of this BBC story about Intel's profits took me slightly aback:

For the three months to 27 March, the Californian-based company made a profit of $1.7bn, almost double the $915m recorded for the same period in 2003.

Shouldn't that be "California-based"? I thought to myself.

Checking relative frequencies on the web, I got an answer: 416,000 ghits for "California-based" vs. 2,580 for "Californian-based". 99.4% of the web agrees with my judgment -- and also with grammar and logic, it seems to me, since "X-based" should be a compositional compound adjective, meaning "based in (or on) X", where X is a noun. Nobody would say "based in Californian." QED. The feeble 0.6% are just confused, I thought smugly. Perhaps they are attracted by the irrelevant analogy of other adjective-noun sequences. So much for the Beeb, how the mighty have fallen, etc.

But wait a minute, said the still small voice of conscience. How about "European-based"? Doesn't that sound just as good as "Europe-based", or maybe even better? Checking the web, I found 42,600 ghits for "Europe-based" vs. 60,700 for "European-based": 41% for the noun, 59% for the adjective. Even-steven from a grammatical point of view (though an adjectival landslide in electoral terms!)
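For anyone who wants to check the arithmetic, here is a minimal sketch of the percentage calculation. The function name `noun_share` is my own invention for illustration, and the inputs are simply the hit counts quoted above.

```python
def noun_share(noun_hits: int, adj_hits: int) -> float:
    """Fraction of all hits that use the noun form (e.g. "California-based")."""
    return noun_hits / (noun_hits + adj_hits)


# Hit counts as quoted in the text above.
print(f"California: {noun_share(416_000, 2_580):.1%}")  # prints "California: 99.4%"
print(f"Europe:     {noun_share(42_600, 60_700):.1%}")  # prints "Europe:     41.2%"
```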

And looking at the next few examples of relevant noun/adjective pairs that occurred to me makes the picture even murkier. "Boston-based" is 80,000 times commoner than "Bostonian-based", but "Canada-based" is about 25% less common than "Canadian-based", and so on:

	noun	adjective	ratio
Athens/Athenian based	6,460	11	587
Boston/Bostonian based	240,000	3	80,000
California/Californian based	416,000	2,580	161
Canada/Canadian based	70,300	94,400	0.745
China/Chinese based	39,000	7,450	5.23
Egypt/Egyptian based	4,920	4,520	1.09
Europe/European based	42,600	60,700	0.702
France/French based	24,800	29,100	0.852
Germany/German based	44,800	44,300	1.01
Greece/Greek based	3,320	2,970	1.12
Ireland/Irish based	34,400	16,300	2.11
Israel/Israeli based	20,100	6,750	2.98
Japan/Japanese based	43,800	8,940	4.90
Korea/Korean based	14,900	5,680	2.62
Latvia/Latvian based	558	250	2.23
Nigeria/Nigerian based	2,070	853	2.43
Norway/Norwegian based	8,400	3,800	2.21
Paris/Parisian based	91,000	297	306
Pennsylvania/Pennsylvanian based	45,100	38	1,187
Russia/Russian based	10,400	6,600	1.58
Scotland/Scottish based	31,100	28,900	1.08
Tunisia/Tunisian based	336	130	2.58
Turkey/Turkish based	4,840	1,570	3.08
Vienna/Viennese based	21,000	32	656

(Some of these should probably be removed from consideration, at least pending reanalysis, because the "adjective" forms are really nouns much of the time, as in "Greek-based" meaning "based on the Greek language". I don't think this will change the overall picture much. It's possible that a more careful accounting for other sense differences and other details of semantic relationships would clear things up, but I doubt it.)

Adding it all up, it's about 79% for the nouns, 21% for the adjectives. A victory for logical grammar, but hardly a resounding one. There are several pockets of stalwart adjectival resistance (or craven concession to adjectival irrationality?): Europe, France, Canada, at .70, .85, .75 noun/adjective ratios respectively. Germany is on the edge at 1.01.
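The pooled figure can be rechecked with a short sketch like the one below. It assumes nothing beyond the hit counts in the table; the variable names are mine. Note that pooling raw counts lets the big-count rows (California, Boston) dominate, which is one reason to look at the per-place ratios as well.

```python
# (noun form, noun hits, adjective hits), copied from the table above.
COUNTS = [
    ("Athens", 6_460, 11), ("Boston", 240_000, 3),
    ("California", 416_000, 2_580), ("Canada", 70_300, 94_400),
    ("China", 39_000, 7_450), ("Egypt", 4_920, 4_520),
    ("Europe", 42_600, 60_700), ("France", 24_800, 29_100),
    ("Germany", 44_800, 44_300), ("Greece", 3_320, 2_970),
    ("Ireland", 34_400, 16_300), ("Israel", 20_100, 6_750),
    ("Japan", 43_800, 8_940), ("Korea", 14_900, 5_680),
    ("Latvia", 558, 250), ("Nigeria", 2_070, 853),
    ("Norway", 8_400, 3_800), ("Paris", 91_000, 297),
    ("Pennsylvania", 45_100, 38), ("Russia", 10_400, 6_600),
    ("Scotland", 31_100, 28_900), ("Tunisia", 336, 130),
    ("Turkey", 4_840, 1_570), ("Vienna", 21_000, 32),
]

noun_total = sum(n for _, n, _ in COUNTS)
adj_total = sum(a for _, _, a in COUNTS)
pooled_share = noun_total / (noun_total + adj_total)

# Places where the adjective form outnumbers the noun form (ratio below 1).
holdouts = [(place, round(n / a, 3)) for place, n, a in COUNTS if n < a]

print(f"pooled noun share: {pooled_share:.0%}")  # prints "pooled noun share: 79%"
print(holdouts)  # [('Canada', 0.745), ('Europe', 0.702), ('France', 0.852)]
```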

Seriously, it's clear that different place-names are behaving differently here. What's the principle, if any? Word length? Unigram (word) frequency? Longitude? Affix? Country vs. City? Few of my first few hypotheses even hold up, and none of them explains much of the variance.
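As an illustration of how one such hypothesis might be checked, here is a rough sketch that rank-correlates the length of the place-name with the log of its noun/adjective ratio. This is only one possible test, not necessarily the one I ran: the choice of Spearman's rank correlation, the use of letter count as "word length", and all the function names are assumptions made for the example. The ratios are recomputed from the hit counts in the table.

```python
import math

# noun form -> noun/adjective ratio, recomputed from the table's hit counts
RATIOS = {
    "Athens": 587, "Boston": 80_000, "California": 161, "Canada": 0.745,
    "China": 5.23, "Egypt": 1.09, "Europe": 0.702, "France": 0.852,
    "Germany": 1.01, "Greece": 1.12, "Ireland": 2.11, "Israel": 2.98,
    "Japan": 4.90, "Korea": 2.62, "Latvia": 2.23, "Nigeria": 2.43,
    "Norway": 2.21, "Paris": 306, "Pennsylvania": 1187, "Russia": 1.58,
    "Scotland": 1.08, "Tunisia": 2.58, "Turkey": 3.08, "Vienna": 656,
}


def ranks(values):
    """1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def spearman(xs, ys):
    """Spearman's rho: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


lengths = [len(name) for name in RATIOS]
log_ratios = [math.log10(r) for r in RATIOS.values()]
rho = spearman(lengths, log_ratios)
print(f"Spearman rho, name length vs. log ratio: {rho:+.2f}")
print(f"rough share of rank variance accounted for: {rho ** 2:.0%}")
```

The same scaffolding would work for the other candidate predictors (unigram frequency, longitude, and so on) by swapping in a different list for `lengths`, though those values would have to come from somewhere outside the table.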

And what if we picked a different head, such as "X-oriented" or "X-bound" or "X-educated"? Would the statistics be similar, or different?

And does all this have anything to do with the compound adjectives that don't involve a de-verbal head at all, but are created by adding "-ed" to a modified noun, as in "red-haired"? In other words, is the construction (at least sometimes and for some people) [[Canadian base]+ed] ? If so, does that offer any traction in explaining the enormous variation in usage statistics sketched above? I don't see how, but at least it would provide a grammatically and logically plausible analysis for such phrases.

Then again, maybe I'm just being old-fashioned in expecting a coherent compositional account of how regular phrasal patterns acquire their form and meaning, as opposed to the currently-spreading view that "our interpretive capacities take into account holistic informational characteristics of linguistic constructions and don't simply generate meanings by way of 'bottom up' recursion principles."

Posted by Mark Liberman at May 15, 2004 07:38 AM