July 08, 2007

What men and women actually talk about

I posted earlier today about sex-linked vocabulary items, as imagined by an anonymous BBC News writer and as measured in a sample of weblog text by Koppel et al. ("What men and women blog about"). It occurred to me that I could easily check the generality of these two sets of items by searching the LDC's collection of transcriptions of telephone conversations. (Thus making an even larger mountain out of the original molehill of an article -- but it's 96 degrees out today, and I'm putting off going out to run errands...)

The corpus that I used includes 14,136 conversations, comprising a total of 26,151,602 words. Out of the 31 items listed in the BBC News article (they claim 46, but quantification is obviously not their strong suit), 10 are either too rare or too British or too topical to occur at all in this corpus ("home birth", "pomegranate", "conventionally attractive", "Jessica Metcalfe", "footless tights", "kitten heels", "agony aunt", "handbagging", "beefeater", "concealer"). Of the remaining 21, 6 are actually used at a higher rate by males in the conversational corpus ("emotional intelligence", "what are you thinking", "Afghanistan", "flexible working", "Ms", "Middleton"). Only "babies", "absolutely beautiful", "pilates", and "heels" seem to be reasonably common words or phrases that are actually useful indicators of a female speaker in this corpus.

Here are the details, with the raw counts and the counts normalized as frequency per million words (note that there are more female than male speakers in this collection, 15,685 to 12,589).

| Item | Women: count (f/M) | Men: count (f/M) |
|---|---|---|
| book club | 11 (.79) | 7 (.6) |
| accessorize | 1 (.07) | 0 |
| body image | 3 (.21) | 1 (.09) |
| empowering | 3 (.21) | 2 (.17) |
| burlesque | 2 (.14) | 1 (.09) |
| size zero | 1 (.07) | 0 |
| pilates | 37 (2.65) | 9 (.77) |
| cellulite | 2 (.14) | 0 |
| absolutely beautiful | 27 (1.93) | 8 (.68) |
| breastfeeding | 15 (1.07) | 2 (.17) |
| emotional intelligence | 0 | 1 (.09) |
| heels | 37 (2.65) | 15 (1.28) |
| what are you thinking | 10 (.72) | 12 (1.03) |
| feminism | 3 (.21) | 2 (.17) |
| afghanistan | 168 (12.01) | 212 (18.13) |
| airbrushing | 1 (.07) | 0 |
| flexible working | 0 | 1 (.09) |
| babies | 419 (29.96) | 97 (8.3) |
| superwoman | 1 (.07) | 0 |
| Ms | 9 (.64) | 13 (1.11) |
| Middleton | 0 | 1 (.09) |
| why | 10390 (743) | 8266 (707) |
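The per-million normalization behind the table above can be sketched in a few lines of Python. This is only an illustration: the post gives the corpus total (26,151,602 words) but not the word totals per sex, so `WOMEN_WORDS` and `MEN_WORDS` below are assumptions, back-calculated from the "why" row (count divided by rate). The function names are likewise made up for this sketch.

```python
# Approximate word totals per sex. These are NOT stated in the post; they are
# assumptions back-calculated from the "why" row (raw count / rate per million).
WOMEN_WORDS = 10390 / 743 * 1_000_000   # roughly 13.98 million words
MEN_WORDS = 8266 / 707 * 1_000_000      # roughly 11.69 million words


def per_million(count, total_words):
    """Return the number of occurrences per million words."""
    return count / total_words * 1_000_000


def compare(women_count, men_count):
    """Return (women_rate, men_rate, leaning) for one item's raw counts."""
    women_rate = per_million(women_count, WOMEN_WORDS)
    men_rate = per_million(men_count, MEN_WORDS)
    if women_rate > men_rate:
        leaning = "women"
    elif men_rate > women_rate:
        leaning = "men"
    else:
        leaning = "neither"
    return round(women_rate, 2), round(men_rate, 2), leaning


if __name__ == "__main__":
    # Raw counts taken from the table: (women, men).
    for item, (w, m) in {"pilates": (37, 9), "afghanistan": (168, 212)}.items():
        print(item, compare(w, m))
```

Comparing rates rather than raw counts matters here because the female speakers contributed more words overall, so a raw count that looks female-leaning can turn out to be male-leaning once normalized.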

If we turn instead to the list of content-words from the Koppel et al. study, we get much better predictions. Only one of the items is missing from the conversational corpus ("gb", which was presumably an abbreviation for "gigabyte", specific to the textual mode, as Cory Lubliner has pointed out to me). In general, the frequencies are higher. And there are no reversals -- all the items that were sex-associated in the weblog sample are sex-associated in the same direction in this conversational sample. For comparison, I've taken the top ten items from each end of their list (male-associated and then female-associated):

| Item | Women: count (f/M) | Men: count (f/M) |
|---|---|---|
| linux | 2 (.14) | 10 (.86) |
| microsoft | 80 (5.72) | 145 (12.4) |
| gaming | 23 (1.64) | 40 (3.42) |
| server | 19 (1.36) | 26 (2.22) |
| software | 137 (9.8) | 198 (16.93) |
| programming | 86 (6.15) | 122 (10.43) |
| google | 38 (2.72) | 48 (4.11) |
| data | 84 (6.01) | 125 (10.69) |
| graphics | 44 (3.15) | 76 (6.5) |
| india | 155 (11.08) | 229 (19.59) |
| | | |
| cute | 668 (47.77) | 164 (14.03) |
| gosh | 2242 (160.34) | 530 (45.33) |
| kisses | 15 (1.07) | 5 (.43) |
| yummy | 21 (1.5) | 1 (.09) |
| mommy | 154 (11.01) | 20 (1.71) |
| boyfriend | 743 (53.14) | 102 (8.72) |
| skirt | 47 (3.36) | 7 (.6) |
| adorable | 57 (4.08) | 13 (1.11) |
| husband | 9168 (655.65) | 484 (41.4) |
| hubby | 10 (.72) | 3 (.26) |

I'm sure that if we sorted words by information gain with respect to sex determination, we'd get a different ranking from these conversations than Koppel et al. got from their weblog corpus. But it's encouraging that predictions from the weblog sample are so reliably maintained in the (very different) conversational data.
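For readers who want to see what "information gain with respect to sex determination" involves, here is a minimal sketch using the standard textbook definition: the reduction in entropy of the class (speaker sex) once we know whether a speaker used the word. The post does not say exactly how Koppel et al. computed it, so treat this as an assumption. The speaker totals are the ones given above, but the per-word speaker counts in the usage example are hypothetical, since the tables report token counts rather than how many speakers used each word.

```python
import math


def entropy(counts):
    """Shannon entropy (in bits) of a distribution given as raw counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)


def information_gain(f_with, f_total, m_with, m_total):
    """Information gain of the feature "speaker used the word" for predicting sex.

    f_with / m_with: female / male speakers who used the word.
    f_total / m_total: all female / male speakers.
    """
    n = f_total + m_total
    n_with = f_with + m_with
    n_without = n - n_with
    h_class = entropy([f_total, m_total])
    h_with = entropy([f_with, m_with])
    h_without = entropy([f_total - f_with, m_total - m_with])
    # Subtract the expected entropy after splitting on the feature.
    return h_class - (n_with / n) * h_with - (n_without / n) * h_without


if __name__ == "__main__":
    # Speaker totals come from the post; the "used the word" counts are
    # HYPOTHETICAL and only illustrate the calculation.
    hypothetical = {"husband": (3000, 400), "why": (5000, 4000)}
    for word, (f_with, m_with) in hypothetical.items():
        print(word, information_gain(f_with, 15685, m_with, 12589))
```

Ranking by this measure rewards words that are both skewed and common, which is why a rare but heavily skewed item can rank below a frequent, moderately skewed one.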

[Note that my title, "What men and women actually talk about", is mostly tongue-in-cheek. I'm not looking at anything except counts of some of the words that people choose to use. And obviously any individual's word choices would vary widely, depending on the context and the topic.

The material that I searched comes from a variety of sources, and does include some conversations in which people talked with friends and family about whatever they wanted to. But most of the transcribed conversations were with randomly-assigned strangers, where the participants were asked to talk about a randomly-assigned topic (among a set that they had previously agreed they would be willing to discuss). One of the sub-collections had 70 such topics; another had 30. You can read more about some of these collections here, here, here, here, etc.]

Posted by Mark Liberman at July 8, 2007 03:05 PM