March 18, 2005

Surnames

Linking to a post on Japanese surnames at Butterfly Blue, Steve at Language Hat asks "Did you know (to take one startling fact) that Japan has more different surnames than any other country in the world (about 120,000)?" I'm afraid that this meme, backed by Steve's well-deserved authority on the net, may start to propagate, although it's far from true.

20-odd years ago, I worked on software to determine the pronunciation of names for text-to-speech applications. We worked from lists of American surnames, derived from phone books and other sources, that comprised several million distinct (orthographic strings representing) surnames, and were far from complete. This is also the kind of number that emerges from the description of NameX, a software product that can be seen as a "thesaurus containing 132 million variants for 2.6 million distinct Surnames," though this is for names "from all over the world with comprehensive coverage of names with European origins".

You can download a list of "Frequently Occurring Names" from the U.S. Census Bureau that includes 88,799 distinct surnames from 6,290,251 records. According to the documentation, this is based on the records of people living in the 5,300 blocks where the "post enumeration sample" was done, along with "additional surrounding ring blocks", amounting to about 1/40 of the overall 1990 census. This was then pruned further: "For purposes of both confidentiality and elimination of data noise we restricted the number of unique names available at this internet site to the minimum number of entries that contain 90 percent of the population in that data file."

The file is sorted in inverse order of frequency, with the names for each frequency count sorted in inverse alphabetical order. The last batch before the file is cut off includes 13,123 names (ranks 75,677 through 88,799), from

ZYSETT 0.000 89.231 75677

to

AALDERINK 0.000 90.483 88799

The count of individuals with each name is not given, but the numbers mean that these 13,123 names comprise 90.483 - 89.231 = 1.252% of the total set of people in the sample. That is about 78,754 of the 6,290,251 records, or almost exactly 6 per name, so we can determine that the sample count for each name in this set must have been 6. The names cut from the list cover the remaining 100 - 90.483 = 9.517% of the sample, or about 598,643 people, and none of those names can occur more than six times. This allows us to provide a lower bound on the number of distinct names in the rest of the sample of about 99,774 (i.e. assuming that all additional names occur six times each), which would yield 88,799 + 99,774 = 188,573. In fact this is surely much too small, since the tail of names will normally include increasing numbers at lower frequency counts (of 5, 4, etc. people per name). In addition, the other 39/40 of the census will add considerably to the tail of infrequent names, not only because of the expected effects of a larger sample, but also because of the bias introduced by the fact that the existing list sampled whole households from a particular set of compact geographical areas ("blocks").
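For anyone who wants to check the arithmetic, here is a rough sketch in Python. It assumes that the four columns in the two lines quoted above are name, percent frequency, cumulative percent frequency, and rank, and that the percentages are taken against the 6,290,251 records mentioned earlier; the function and variable names are mine, not anything from the Census documentation.

```python
# Rough check of the lower-bound arithmetic, using only figures quoted in the post.
# Assumption: columns are name, percent frequency, cumulative percent, rank.

TOTAL_RECORDS = 6_290_251  # records in the sample, as quoted above

FIRST_IN_BATCH = "ZYSETT 0.000 89.231 75677"
LAST_IN_BATCH = "AALDERINK 0.000 90.483 88799"


def parse(line):
    """Split one line of the list into (name, freq_pct, cum_pct, rank)."""
    name, freq, cum, rank = line.split()
    return name, float(freq), float(cum), int(rank)


_, _, cum_first, rank_first = parse(FIRST_IN_BATCH)
_, _, cum_last, rank_last = parse(LAST_IN_BATCH)

# Number of names in the last batch, counting both endpoints.
batch_names = rank_last - rank_first + 1

# Share of the sample covered by that batch, following the subtraction in the text.
batch_people = (cum_last - cum_first) / 100 * TOTAL_RECORDS
per_name = round(batch_people / batch_names)

# People whose names were cut from the list, and the fewest distinct names
# they could have if every such name occurred per_name times.
unlisted_people = (100 - cum_last) / 100 * TOTAL_RECORDS
extra_names = round(unlisted_people / per_name)
lower_bound = rank_last + extra_names

print(batch_names, per_name, extra_names, lower_bound)
```

Since no omitted name can be more frequent than the last listed ones, dividing the unlisted people by that count gives the smallest number of names that could account for them; any mix of lower counts only pushes the total up.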

I don't present this as any sort of estimate of the number of distinct surnames among American households (which I believe to be well over a million), but rather as a demonstration that whatever the true number is, it's much larger than 120,000.

 

Posted by Mark Liberman at March 18, 2005 08:43 AM