November 12, 2007

Nationality, gender and pitch

Do the Japanese exaggerate the natural differences between men and women in the pitch of the voice? Some people say so, and so in this morning's Breakfast Experiment™, I take a look at the facts of the matter. The results are interesting -- and I hope you'll enjoy the discussion, which takes in empirical sociology, cognitive neuroscience, and the U.S. presidential campaign.

First, some background. This all started when we took a look at a report by Matthew Rusling about the problems he created for himself by trying to learn Japanese from his girlfriend ("The perils of mixing romance with language learning", 11/7/2007):

I thought my Japanese was fine, while in reality the effeminate, almost childish twang I had been learning made me sound very much like a 20-something, pink miniskirted Japanese woman.

The female features that he copied included certain modes of self-reference, elongation of certain utterance-final syllables, use of certain female-associated particles, and perhaps a more feminine pattern of honorifics. But the biggest issue, he reports -- at least the one that annoyed his girlfriend the most -- was pitch:

Most of all, she said, I needed to take the pitch of my voice down several notches from the tone I had learned.

I can see someone learning the wrong particles and modes of self-reference -- though the gendered nature of these things in Japanese is widely discussed in language-learning texts -- but it seemed to me

... a little surprising that Rusling never realized that women have higher-pitched voices than men do, and that exaggerating the difference is a way of explicitly drawing attention to sexual identity -- this general pattern is hardly unique to Japanese.

But various readers took me to task for this. Randy Alexander wrote:

Come on, that's a little below the belt! I'm 100% sure that he's factoring in a male-female difference of about an octave. His girlfriend meant that he was talking too high in his range. I think if you listen to a few Japanese people having a conversation in Japanese, you might notice that the women talk higher in their range than the men do.

Karen Kay wrote:

My voice is higher when I speak Japanese. When John Wayne's voice is dubbed into Japanese, he sounds like Barry White instead of John Wayne because that's his cultural image. This is one of the things I taught when I taught Japanese language, pitching your voice higher or lower.

And there were some observations on the blog of an English teacher in Korea ("Deep-Voiced Japanese and Pitch Parrots", 11/10/2007):

While I was in Japan to renew my visa I thought the bus drivers talked in an unnaturally low voice. I remember the first bus I rode, the driver was a big guy, looked like he might've been a former sumo wrestler, so when he got on the intercom and announced a stop I thought his extremely low voice could've been natural. But then other bus drivers much smaller in stature also spoke in the range of two octaves below middle C. Thanks to this recent post at Language Log it seems there's some method to this madness.

Referring to the Language Log post: In class when I ask students to model my pronunciation I'm always taken aback when a young woman not more than 90 lbs not only copies my pronunciation but also drops her pitch to match mine. But then again, I've done the same thing when Ju-yeong has taught me Korean; not thinking about my tone I tend to match hers - which gives her and her friends a good laugh.

My own impression, for what it's worth, is that some Japanese women speak in a higher pitch range than I expect, while Japanese men use about the same range of pitches as Americans. But I don't trust such impressions very far, so I decided to check.

For a start, I took some recordings of telephone conversations made about a dozen years ago in the "CallHome" project, and published by the Linguistic Data Consortium in 1996 and 1997. There were 120 Japanese conversations of about half an hour each; I decided to focus on 18 conversations that involved one male and one female participant. (The rest of the conversations involved two males, two females, or more than one participant on each side of the conversation.) For comparison, I took the 27 CallHome English conversations with the same characteristic -- just two participants, one male and one female.

I pitch-tracked all of the conversations using the get_f0 program from the ESPS software system. [This was originally written by Dave Talkin based on an algorithm by George Doddington -- this is the pitch tracker used in WaveSurfer from KTH in Stockholm, but I used a standalone version available as part of a free package here.]

This produces quite a bit of data -- around four and a half million pitch values, divided among the four categories of nationality and sex.

One simple way to compare those four categories is to lump all the pitch data from all the male Japanese speakers together and look at the quantiles of fundamental frequency values -- the 10th percentile was 88.5 Hz, the 50th percentile was 122.1 Hz, and the 90th percentile was 207.0 Hz -- and do the same for the female Japanese speakers, the male Americans, and female Americans.
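The pooling-and-percentile step can be sketched in a few lines of Python. This is only an illustration, not the script used for the experiment: the function names and the toy numbers are invented, and it assumes that each speaker's pitch track is a list of F0 values in Hz, with unvoiced frames marked as 0.

```python
def percentile(sorted_values, p):
    """Linearly interpolated percentile (p in 0..100) of a pre-sorted list."""
    if not sorted_values:
        raise ValueError("no values")
    rank = (len(sorted_values) - 1) * p / 100.0
    lower = int(rank)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = rank - lower
    return sorted_values[lower] + fraction * (sorted_values[upper] - sorted_values[lower])


def pooled_f0_percentiles(tracks, percentiles=(10, 50, 90)):
    """Pool the voiced F0 values (Hz) of several speakers; return {p: Hz}."""
    # Assumption: a value of 0 marks an unvoiced frame and is discarded.
    pooled = sorted(f0 for track in tracks for f0 in track if f0 > 0)
    return {p: percentile(pooled, p) for p in percentiles}


# Made-up toy data for two hypothetical speakers (not from CallHome).
toy_tracks = [
    [0.0, 90.0, 100.0, 110.0, 0.0],
    [120.0, 0.0, 130.0],
]
toy_result = pooled_f0_percentiles(toy_tracks)
```

Running the same function once per category (Japanese male, Japanese female, American male, American female) would give the four sets of quantiles being compared.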

If we plot the results in terms of semitones (relative to A 110), we get this:

The same data plotted with pitch estimates in Hertz (cycles per second):

So sure enough, the Japanese speakers are more gender-polarized -- the male Japanese speakers are pitching their voices somewhat lower (overall) than the male Americans, while female Japanese speakers are overall somewhat higher-pitched than female Americans.
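For readers who want to move between the two scales, here is a minimal sketch of the usual conversion between Hertz and semitones relative to a reference frequency. The reference of 110 Hz is the "A 110" named above; the function names are mine and purely illustrative.

```python
import math

REFERENCE_HZ = 110.0  # "A 110", the reference named in the text


def hz_to_semitones(f0_hz, reference_hz=REFERENCE_HZ):
    """Semitones above (positive) or below (negative) the reference."""
    return 12.0 * math.log2(f0_hz / reference_hz)


def semitones_to_hz(semitones, reference_hz=REFERENCE_HZ):
    """Inverse of hz_to_semitones."""
    return reference_hz * 2.0 ** (semitones / 12.0)


# The Japanese male percentiles quoted earlier, converted to semitones.
japanese_male_hz = {10: 88.5, 50: 122.1, 90: 207.0}
japanese_male_st = {p: hz_to_semitones(hz) for p, hz in japanese_male_hz.items()}
```

On this scale the 10th percentile for the Japanese men falls a few semitones below the reference, and the 90th percentile falls nearly an octave (12 semitones) above it.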

How big is the effect? The table below shows the overall female-male F0 difference in semitones at percentiles from 10% to 90%:

Percentile   .1   .2   .3   .4   .5   .6   .7   .8   .9
Americans   6.0  8.4  8.1  7.5  6.9  6.1  4.8  4.4  3.7
Japanese    9.2  9.5  9.2  9.1  9.3  9.0  8.3  7.6  6.2

Overall, the Japanese (in this sample) separate the sexes by one to three semitones more than the Americans do. Since each semitone corresponds to a pitch difference of about 6%, this is a difference with a certain amount of oomph. (With 4.5 million data points, the difference is highly "significant" in the statistical sense, though that fact is of no value or consequence whatsoever.)
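The arithmetic behind that summary can be checked directly from the table. The sketch below subtracts the American female-male gap from the Japanese one at each percentile, and converts semitones to a frequency ratio (one semitone is a factor of 2^(1/12)). The variable and function names are illustrative only.

```python
# Female-male F0 gaps in semitones, copied from the table above.
PERCENTILES = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
AMERICAN_GAP_ST = [6.0, 8.4, 8.1, 7.5, 6.9, 6.1, 4.8, 4.4, 3.7]
JAPANESE_GAP_ST = [9.2, 9.5, 9.2, 9.1, 9.3, 9.0, 8.3, 7.6, 6.2]


def semitones_to_ratio(semitones):
    """Frequency ratio corresponding to an interval in semitones."""
    return 2.0 ** (semitones / 12.0)


# How many more semitones separate the sexes in the Japanese sample.
extra_gap_st = [round(j - a, 1) for j, a in zip(JAPANESE_GAP_ST, AMERICAN_GAP_ST)]

# Size of a single semitone, expressed as a percentage change in frequency.
one_semitone_percent = (semitones_to_ratio(1.0) - 1.0) * 100.0

for p, extra in zip(PERCENTILES, extra_gap_st):
    extra_percent = (semitones_to_ratio(extra) - 1.0) * 100.0
    print(f"{p:.1f}: {extra:+.1f} semitones ({extra_percent:.0f}% in frequency)")
```

The extra separation runs from 1.1 to 3.5 semitones across the nine percentiles, which is where the rough "one to three semitones" figure comes from.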

Does this mean that the stereotype about national differences is correct, this time?

Maybe so, but I'd recommend caution.

And I don't mean to repeat my usual warnings about pop platonism, which mistakes overall group differences for essential properties of individual group members. It's certainly true that the distributions overlap: in the sample I used in this experiment, there are plenty of pairs of Japanese male and female speakers whose pitch ranges are closer than some pairs of American male and female speakers.

But that's not what worries me most in this case. My main concern is that these speakers may not be typical of the categories we're trying to learn about by studying them.

The speakers in these conversations were not randomly selected Japanese and American male and female speakers. They were recruited by offering free overseas telephone calls in the mid-1990s, the pre-Skype days when international calling rates were often several dollars per minute. All calls originated in the U.S., and so the Japanese participants were (I think) mostly students calling their parents, while the American participants were a more mixed group.

I picked the calls purely on the basis of nationality and sex, but my sample was not controlled for age, class, caller's relationship to callee, or for the interaction of those categories. So perhaps we've discovered that male Japanese students and their mothers tend to polarize their pitch ranges; or that American married couples tend to harmonize their pitch ranges; or something else entirely. I haven't looked into the ages and relationships of the participants in these conversations, so I don't mean to suggest that these explanations are likely ones -- I'm just spinning out some ideas about things that might be going on.

This is why social scientists put a lot of effort into controlling the demographic characteristics of survey participants. This is also partly why they use large sample sizes -- no matter how carefully you control for the obvious things, there are always lots of subgroup or individual differences that you have to treat as noise (and you hope you're lucky enough that all the other stuff averages out in your sample -- it probably usually doesn't, alas).
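The point about sample size can be illustrated with a small simulation. This is a toy model under loudly stated assumptions, not an analysis of any real data: two groups are drawn from the very same normal distribution (mean 0, standard deviation 1), so any difference between their sample means is pure noise. The function name, the trial count, and the group sizes of 10 and 400 are arbitrary choices of mine.

```python
import random
import statistics


def spread_of_mean_difference(group_size, trials=500, seed=0):
    """SD across trials of (mean of group A - mean of group B).

    Both groups come from the same distribution, so this measures how
    large a spurious "group difference" tends to be at this sample size.
    """
    rng = random.Random(seed)
    differences = []
    for _ in range(trials):
        group_a = [rng.gauss(0.0, 1.0) for _ in range(group_size)]
        group_b = [rng.gauss(0.0, 1.0) for _ in range(group_size)]
        differences.append(statistics.fmean(group_a) - statistics.fmean(group_b))
    return statistics.pstdev(differences)


spread_small = spread_of_mean_difference(group_size=10)
spread_large = spread_of_mean_difference(group_size=400)
```

In theory the spread is sqrt(2/n) times the within-group standard deviation, or about 0.45 for groups of 10 and about 0.07 for groups of 400. So with ten subjects per group, an apparent difference of nearly half a standard deviation is entirely unremarkable even when there is no real difference at all. And this toy model leaves out the uncontrolled subgroup differences described above, which larger samples do not cure by themselves.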

For some reason, experimental psychologists in general, and neuroscientists in particular, don't seem to have learned these same lessons. And their willingness to draw conclusions about group properties from tiny and uncontrolled samples is amplified by journalists and politicians.

I've cited many cases where brain imaging studies involving a handful of subjects -- and often with marginal results on those -- have been interpreted as telling us something about men and women in general, or boys and girls in general, or members of other general categories. For example, here's a study of 9 boys and 10 girls used to argue that "Girls and boys behave differently because their brains are wired differently"; here's a study of 10 female and 10 male medical students at UCLA used to argue that "Women really do enjoy a good laugh as much as you do; they are just wired to focus on different aspects of humor."

A beautiful example of the same thing was published yesterday in the New York Times: Marco Iacoboni et al., "This is your brain on politics":

In anticipation of the 2008 presidential election, we used functional magnetic resonance imaging to watch the brains of a group of swing voters as they responded to the leading presidential candidates. Our results reveal some voter impressions on which this election may well turn.

Our 20 subjects -- registered voters who stated that they were open to choosing a candidate from either party next November -- included 10 men and 10 women. In late summer, we asked them to answer a list of questions about their political preferences, then observed their brain activity for nearly an hour in the scanner at the Ahmanson Lovelace Brain Mapping Center at the University of California, Los Angeles. Afterward, each subject filled out a second questionnaire.

We don't learn anything more about who these 10 males and 10 females were -- medical students again? It doesn't matter, anyhow, since in such a small sample, it's impossible to control for the range of demographic, cultural, individual and random factors that are likely to swamp whatever conclusions you'd like to draw about the political reactions of American voters in general.

This doesn't prevent the authors from making sweeping generalizations about voters' reactions to particular parties, issues and candidates, e.g.

The two areas in the brain associated with anxiety and disgust -- the amygdala and the insula -- were especially active when men viewed "Republican."

And as this example indicates, they're particularly eager to draw conclusions about gender differences:

Men show little interest in Mrs. Clinton initially but after watching her video they react positively. Women respond to her strongly at first, but their interest wanes after they watch her video.

With Mr. Giuliani, the reactions are reversed. Men respond strongly to his initial still photos, but this fades after they see his video. Women grow more engaged after watching his video.

This is evidence that swing voters' responses change when they see these two candidates in action. For men, Mrs. Clinton is a pleasant surprise. For women, Mr. Giuliani has unexpected appeal.

I wonder how consistent these reactions were in their sample. I'll bet there was some overlap, and maybe a lot. But again, it doesn't matter -- it's irresponsible to take the responses of 10 medical students (or whatever) recruited at UCLA as a proxy for the reactions of 55 million male or 55 million female U.S. voters.

Their conclusions might be true, or they might not be, but the fact that some of their evidence comes from high-tech brain-imaging machines doesn't make the results any more likely to generalize to American voters as a whole than if they asked for a show of hands in their Introduction to Neuroscience class.

I'm not surprised to see that some political scientists understand this. Thus Brendan Nyhan writes ("Watch out for brain scan hype"):

When you read about brain-scanning studies like the one in today's New York Times, remember that interpreting fMRI data is more art than science and that the sample sizes are tiny and unrepresentative. I don't know how to interpret any of the claims in the article without way more information, which will hopefully be forthcoming in the authors' academic work.

Many people are concerned about the misuse or misinterpretation of this technology because brain scanning carries perceived scientific authority -- so much so that even irrelevant neuroscientific information can be perceived as more persuasive. As a result, even preliminary and not particularly newsworthy studies like this one may receive a great deal of hype.

What are the prospects that neuroscientists and journalists might learn these lessons? I try to stay optimistic, but each group has an interest in failing to understand the limitations of such studies.


[In passing, let me point out that the discussion above ignores the possibility that a national difference in gender polarization might be genetic rather than cultural. That's because I think a genetic explanation is vanishingly unlikely to be true -- if the pitch-range polarization effect is real, it's probably a fact about Japanese vs. American societies, not Japanese vs. American genomes.

But in many analogous cases, similar or even smaller group differences have been presented as evidence for genetic differences (between women and men, or between Africans and Europeans, or whatever). And in this case, we know that the basic sex difference in pitch range is genetically based, caused by the effect of testosterone on the male larynx at puberty; and we know that this particular form of sexual dimorphism is a relatively recent innovation in human evolution, not shared with chimps and gorillas, so that it must have been under selective pressure in recent times, and might still be.

It wouldn't be nuts to imagine that the selective pressures on this trait might have been unusually great in Japanese society over the past 1,000 years or so, creating an increase in average laryngeal dimorphism. But I doubt it.]

Posted by Mark Liberman at November 12, 2007 07:39 AM