August 12, 2005

Rorschach Science

The stimulus? A journal article about functional brain imaging of men listening to variously-hacked men's and women's voices.

The response? Worldwide resonant evocation of sexual stereotypes, congruent and contradictory alike.

Some headlines: "Er, you what, luv?" -- "Man Leaves Wife, Realizes Six Hours Later" -- "Female Voices are Easier to Hear" -- "What We Have is Failure to Communicate" -- "Men do Have Trouble Hearing Women" -- "Why Imaginary Voices are Male" -- "It's official! Listening to women pays off" -- "Men do have trouble hearing women, scientists find".

The blogospheric reactions are just as creative: "I can't hear you, honey...you're just too difficult to listen to" -- "What to tell your wife when you didn't hear her" -- "Men who are accused of never listening by women now have an excuse -- women's voices are more difficult for men to listen to than other men's, a report said" -- "I've been waiting for this for a long time. I'm often accused of 'selective hearing' in which certain statements just disappear from my consciousness - often statements made by Mrs. HolyCoast. It usually occurs when I'm multi-tasking, such as watching TV or blogging while listening to my better half..." -- "Science explains patriarchal monotheism!" ...

So I went and read the journal article: Dilraj S. Sokhi, Michael D. Hunter, Iain D. Wilkinson and Peter W.R. Woodruff, "Male and female voices activate distinct regions in the male brain", In Press, NeuroImage. I'm deeply puzzled by some of the research that paper describes -- if Sokhi et al. really did what they seem to be saying they did, I don't see how the results can be interpreted at all -- but I'm pretty sure that the experiment doesn't mean most of the things that people are saying it does. Maybe it doesn't mean any of them.

Here's what they did. They recorded 12 male and 12 female speakers reading some "emotionally neutral" sentences, "balanced in being directed to three main cortical modalities: vision ('look in the newspaper'), auditory ('listen to the music') and motor ('open the kitchen door')". As expected, the average pitch (F0) was different for the two groups -- "112.01 ± 8.11 Hz for male speakers and 204.68 ± 19.31 Hz for female speakers". They took from the literature the observation that there is a "gender-ambiguous" F0 range in the region of 135 to 181 Hz., where the typical "tessitura" of male and female speakers overlaps, and so they scaled each phrase in four steps from its original speed to a speeded-up or slowed-down version whose average pitch would be at 135 Hz. (for female speech) or 181 Hz. (for male speech).
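If you want to see what those "four equidistant steps" might look like numerically, here's a little sketch (in Python) of the obvious interpretation, namely equal-sized F0 steps from each speaker's own average up or down to the target; the interpolation formula is my reconstruction, not something spelled out in the paper, and the F0 values are just the group means quoted above.

    # A sketch of the "four equidistant steps", assuming (my guess) that
    # "equidistant" means equal-sized F0 steps from the speaker's own
    # average F0 to the target F0.
    def scale_factors(speaker_f0, target_f0, steps=4):
        """Scalar factors SFq (q = 1..steps) taking speaker_f0 toward target_f0."""
        return [
            (speaker_f0 + q * (target_f0 - speaker_f0) / steps) / speaker_f0
            for q in range(1, steps + 1)
        ]

    print(scale_factors(112.0, 181.0))  # average male speaker, target 181 Hz
    print(scale_factors(205.0, 135.0))  # average female speaker, target 135 Hz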

If that sounds like a strange thing to do, it is. But here's what they say:

We calculated the difference between a speaker's F0 and the ‘target F0’ which, defined by the GAR F0 (see above), was 181 Hz for male speakers and 135 Hz for female speakers. We then derived speaker-specific scalar factors (SFqs) to pitch-scale a speaker's F0 in four equidistant steps (q = 1 to 4) to the ‘target F0’ without preserving Fn.

When they say "without preserving Fn", what they (seem to) mean is that they didn't do any fancy processing to change the pitch (F0) without changing the vocal-tract resonances (F1, F2, F3 etc.); instead they just increased or decreased the overall playback speed in proportion to the needed F0 changes. And the maximum amount of change was considerable -- an average male recording would have been sped up to as much as 181/112 = 162% of the original rate, while an average female recording would have been slowed down to as much as 135/205 = 66% of the original rate. (It's possible that I've misunderstood this, but I don't see any other way to interpret what they say...)
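And just to make the manipulation concrete, here's a minimal sketch of what such a rate change amounts to, on my reading: resample the waveform so that it plays back r times faster, which multiplies F0 (and all the vocal-tract resonances) by r and divides the duration by r. The code is my illustration of the idea, not a reconstruction of their actual signal processing.

    import numpy as np
    from scipy.signal import resample

    def rate_change(y, r):
        """Play y back r times faster: F0 (and all resonances) scale by r,
        duration scales by 1/r."""
        return resample(y, int(round(len(y) / r)))

    # The maximal shifts implied by the group-average F0s:
    male_r = 181.0 / 112.0    # ~1.62: male speech sped up by ~62%
    female_r = 135.0 / 205.0  # ~0.66: female speech slowed to ~2/3 speed

    # Stand-in signal: one second of a 112 Hz tone at 16 kHz.
    sr = 16000
    t = np.arange(sr) / sr
    y = np.sin(2 * np.pi * 112.0 * t)
    y_fast = rate_change(y, male_r)   # ~0.62 s long, ~181 Hz when played at sr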

In fact they didn't use quite this much shifting, because they did perceptual tests to find the amount of shift that would produce "gender-ambiguous" stimuli, "defined by reaching the 50% mark .. for accuracy in reporting the gender of a given set of voices", and this was achieved by shifting a selected subset of stimuli to "159.13 ± 5.52 Hz for male speakers and 156.83 ± 4.09 Hz for female speakers, where the F0s for the corresponding selected natural stimuli were ... ‘male gender-apparent’ = 107.55 ± 6.46 Hz and ‘female gender-apparent’ = 211.77 ± 14.07 Hz."

So they speeded up the male recordings, on average, by a ratio of 159.13/107.55 = 1.48, and they slowed down the female recordings by an average ratio of 156.83/211.77 = 0.74. Still quite a big shift -- I'd expect these stimuli to be species-ambiguous, not just gender-ambiguous. Also, note that the durations of the phrases will be changed by the inverses of those factors, so the female phrases slowed down so as to be sexually ambiguous will be roughly twice as long as the male phrases speeded up so as to be sexually ambiguous.
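Spelling out that back-of-the-envelope arithmetic (and assuming the original male and female phrases were about equally long to begin with):

    male_r = 159.13 / 107.55      # ~1.48: speed-up actually applied to male recordings
    female_r = 156.83 / 211.77    # ~0.74: slow-down actually applied to female recordings

    male_dur = 1.0 / male_r       # ~0.68 of the original male duration
    female_dur = 1.0 / female_r   # ~1.35 of the original female duration

    print(female_dur / male_dur)  # ~2.0: the "ambiguous" female phrases come out
                                  # about twice as long as the "ambiguous" male ones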

Why did they do this? Well, they say that the shifted recordings "were selected for the fMRI experiment as these stimuli were matched for F0, thus removing the confound of simple pitch effects during perception of gender from heard speech". The basic idea is a sound one, since they want to be able to claim that they're seeing the effects of perceived speaker sex, not just the effects of higher versus lower pitched voices. But this is a strange way to go about it, since the shifted stimuli (according to their perceptual experiments) were selected so as to be identified as male and female about equally often! (There are indeed other cues to sex in the voice besides F0, as the authors mention, but they've specifically selected the artificial stimuli so that sex judgments are roughly equal...). Logically, I would have expected them to choose naturally-occurring male-perceived voices and female-perceived voices with F0 in an overlapping range, but they didn't try to do this. And the rate-shifting manipulation that they did (apparently) use not only doesn't preserve perceived sex, it introduces some other non-sex-linked acoustic factors (like duration differences) that seem just as problematic as the F0 difference it eliminated. They could have used pitch-shifting technology to change the pitch without changing the duration or the vocal-tract resonances, but they didn't, again I'm not sure why.
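For comparison, here's a minimal sketch of the sort of duration-preserving pitch shift I have in mind, using librosa, a present-day Python library, purely for illustration (in 2005 you'd have reached for a PSOLA-style tool such as Praat instead). Note that this simple version still moves the vocal-tract resonances along with F0; keeping the formants fixed takes fancier processing still.

    import numpy as np
    import librosa

    def shift_to_target_f0(y, sr, orig_f0, target_f0):
        """Shift pitch by the ratio target_f0/orig_f0 without changing duration.
        (This moves the resonances along with F0; formant-preserving shifts
        need PSOLA-style or other more elaborate processing.)"""
        n_steps = 12.0 * np.log2(target_f0 / orig_f0)   # the ratio, in semitones
        return librosa.effects.pitch_shift(y, sr=sr, n_steps=n_steps)

    # e.g. nudge an average male recording (112 Hz) up to the low edge of the
    # gender-ambiguous range (135 Hz), instead of applying a 48% speed-up:
    # y_shifted = shift_to_target_f0(y, sr, 112.0, 135.0)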

In any case, they've got four classes of stimuli:

The original samples of the sentences recorded from each speaker, with the original F0, together with the set of new stimuli of frequency F0(g-amb) gave 96 stimuli falling into four categories: ‘male gender-apparent’ (unaltered in pitch), ‘male gender-ambiguous’ (pitch-scaled and ‘gender-ambiguous’), ‘female gender-apparent’ and ‘female gender-ambiguous’ stimuli.

They played these 96 stimuli to 12 male subjects. It's not clear why they only studied males -- at least I couldn't find any reason for this. Maybe they're planning to look at female subjects in a different study. But 12 subjects is not a big fMRI experiment, so I'm not clear why they didn't look at both sexes. (And as you'll see, having female subjects would make a big difference in interpreting the results...)

Anyhow, the key thing in such functional imaging studies is that you can't just look at one condition. You need to compare the distribution of cerebral blood flow when subjects are doing X to the distribution when they are doing Y, or some more complicated sort of comparison of a similar kind. This is roughly for the same reason that in studying the effect of a drug on a disease, you can't just give it to some patients and see how many get well; you need to compare the results for a matched set of patients who didn't get the drug. In this experiment, they defined their comparisons as follows (there's a schematic sketch of this conjunction logic after the list):

(i) [(‘female gender-apparent’ > ‘male gender-apparent’) AND (‘female gender-ambiguous’ > ‘male gender-ambiguous’)] = “female versus male”;
(ii) [(‘male gender-apparent’ > ‘female gender-apparent’) AND (‘male gender-ambiguous’ > ‘female gender-ambiguous’)] = “male versus female”;
(iii) [(‘male gender-apparent’ > ‘male gender-ambiguous’) AND (‘female gender-apparent’ > ‘female gender-ambiguous’)] = “‘gender-apparent’ versus ‘gender-ambiguous’”; and
(iv) [(‘male gender-ambiguous’ > ‘male gender-apparent’) AND (‘female gender-ambiguous’ > ‘female gender-apparent’)] = “‘gender-ambiguous’ versus ‘gender-apparent’”.
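Schematically, each of these "conjoint contrasts" just requires two voxelwise comparisons to come out significant at the same time -- a logical AND over the brain map. Here's a toy sketch of that logic (my own illustration with made-up numbers, not the paper's actual SPM pipeline):

    import numpy as np

    # Toy per-voxel activation means for the four conditions (1000 "voxels").
    rng = np.random.default_rng(0)
    female_apparent, male_apparent = rng.normal(size=(2, 1000))
    female_ambiguous, male_ambiguous = rng.normal(size=(2, 1000))

    thresh = 1.5  # stand-in for a proper statistical threshold

    # Conjoint contrast (i), "female versus male": both component contrasts
    # must exceed threshold in the same voxel (a logical AND).
    female_vs_male = ((female_apparent - male_apparent) > thresh) & \
                     ((female_ambiguous - male_ambiguous) > thresh)

    print(female_vs_male.sum(), "voxels survive the conjunction")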

Thus when they say (in their press release) that "when a man hears a female voice" such-and-such a region of his brain is activated, what they mean is that the specified region is (among the regions) where both of the conditions specified in (i) are met: first, 'female gender-apparent' recordings create significantly more activation than 'male gender-apparent' recordings, and second, 'female gender-ambiguous' recordings yield significantly greater activation than 'male gender-ambiguous' recordings.

But there are some other descriptions you could give of that set of conditions. For example, you could say that these are the brain regions that respond more to higher-pitched speech than to lower-pitched speech; and for speech in a medium pitch range, respond more to recordings that have been slowed down to reach that level than to recordings that have been speeded up to reach that level. Or perhaps, respond more to phrases that are longer in duration than to phrases that are shorter in duration. This last is not a trivial issue, especially since the subjects were listening to the stimuli against the background of scanner noise, which is roughly like being in a boiler factory inside one of the boilers. (It's possible to arrange the scanning acquisition so that audio stimuli are played in silent intervals, but that was not done in this experiment.) So higher-pitch or longer-duration stimuli will probably be more acoustically salient, especially in this very noisy environment, and therefore might show increased auditory activation, quite apart from any sexuality judgments. And lower-pitch or shorter-duration stimuli will be harder to hear, and therefore might engage some additional attention-focusing mechanisms, again apart from any sexuality judgments.

Whatever the reasons, their results were these:

Conjoint contrast: brain region
“Female vs. male”: Right anterior superior temporal gyrus
“Male vs. female”: Right precuneus
“‘Gender-apparent’ vs. ‘gender-ambiguous’”: Posterior superior temporal plane contiguous with inferior parietal lobule
“‘Gender-ambiguous’ vs. ‘gender-apparent’”: Right anterior cingulate gyrus

So as I said, I'm really puzzled about how to think about what these results mean. Whatever is going on, though, there's nothing in their results to stand behind statements like "[t]he female voice is actually more complex than the male voice, due to differences in the size and shape of the vocal cords and larynx between women and men", as the Sheffield press release asserts.

And the same press release says that "when a man hears a female voice the auditory section of his brain is activated, which analyses the different sounds in order to 'read' the voice and determine the auditory face" -- are we supposed to conclude that males hear male voices in a way that bypasses the auditory cortex? Well, they go on to say that "[w]hen men hear a male voice the part of the brain that processes the information is towards the back of the brain and is colloquially known as the 'mind's eye'. This is the part of the brain where people compare their experiences to themselves, so the man is comparing his own voice to the new voice to determine gender."

But even if their conjoint contrast (ii) is really male-vs-female and not lower-pitch-and-shorter-phrases-vs.-higher-pitch-and-longer-phrases (and similarly for the other three contrasts), the results are still not about males-hearing-sex-identified-voices. They're (at best) about males-hearing-males after you subtract out everything this condition has in common with males-hearing-females; and males-hearing-females after you subtract out everything this condition has in common with males-hearing-males. And because they don't have any data on females-hearing-males vs. females-hearing-females (or females-hearing-lower-pitch-and-shorter-phrases, and so on), interpretations in terms of "people comparing their experiences to themselves" are at best highly speculative.

Unless I'm missing something, it seems to me that the increased STG activation in their condition (i) -- which they explain as males hearing females in areas adjacent to the auditory cortex -- might just as well be explained as subjects responding to acoustically more salient stimuli (higher pitch or longer duration) with more activation in acoustically-specialized areas of the brain. As for the increased precuneus activation in their condition (ii) -- which they explain as males responding to males by self-comparison in "the mind's eye" -- the precuneus (a structure in the parietal lobe) has been implicated in all sorts of things, from representation of the visual periphery to motor imagery of finger movement, with some stuff about attention along the way, so that I'd think you might just as plausibly explain this effect in terms of subjects attending more closely to acoustically less salient stimuli in a noisy environment, while thinking harder (or for a longer time) about which button to press to register the perceived sex of each stimulus.

The journal article starts out with some statistics about auditory verbal hallucinations in schizophrenia -- "The voices of AVHs are perceived as male 71% (and female 23%) of the time irrespective of the patient’s gender. The characteristics of the voices of AVHs are also commonly middle-aged, external to the person, right-lateralised, ‘‘BBC newsreader’’ accent in quality and derogatory in content (Nayani and David, 1996)." This is interesting, but I'm not convinced that the fMRI findings help us to understand this, especially the middle-aged, BBC newsreader, derogatory parts, which are properties totally orthogonal to anything in the experiments.

And as for the Rorschach-blot reactions in the popular press and the blogs, about how this explains why men have a hard time paying attention to women, or why women's speech is more valuable, or why men and women often fail to communicate... Well, what's responsible for these responses is not the STG or the precuneus, it's the limbic system. When people have strong and complex feelings about a topic, research results become a screen for them to project their preconceptions onto.

Posted by Mark Liberman at August 12, 2005 12:20 AM