November 28, 2006

Word counts

When scientists want to support a factual assertion in print, they either present some experimental evidence, or they add a footnote referencing some earlier publication of evidence. Journalists have an analogous pair of methods: one is to report what they themselves experienced, and the other is to quote an eye-witness, an official spokesperson, or an expert. But every once in a while, journalists act like scientists and do an experiment. In yesterday's Guardian, Stephen Moss gives an example: "Do women really talk more?" For this article, he wired up a man and a woman -- Tim Dowling and Hannah Pool, who (I think) are Guardian staffers -- and recorded and transcribed everything they said for a day.

I think this started because back in September, I wrote a piece in the Boston Globe, "Sex on the brain", about Louann Brizendine's claim that women use about 20,000 words a day and men only about 7,000. This in turn followed up on some Language Log posts during the previous month, which you can find listed here. I noted that none of Brizendine's end-notes provided any factual support for the words-per-day claim; that versions of this claim are common in psychological self-help books and even religious tracts; and that the relevant parts of the experimental literature show no meaningful sex difference in talkativeness, with several studies even showing men as slightly talkier.

The Guardian's experiment was consistent with the literature:

Hannah said 12,329* words
Tim said 11,279 words
*Hannah accidentally turned off her recorder for two hours, however, so her real total could be 14,000.

And Stephen Moss even reached Louann Brizendine by phone -- in a picturesque location! -- and she graciously conceded the point:

When I reach Brizendine, just as she is crossing the Golden Gate bridge, she tells me that she has accepted the criticism of the numbers quoted in the book - on both volume of words and rate of speech - and will be deleting them from future editions. Nor will they appear in the UK edition, to be published by Bantam in April. "I understand Mark Liberman's point and I am grateful to him," she says. "He felt I was passing on data that was not nailed down, and thus perpetuating a myth, so it will be taken out in future editions." She admits language is not her specialism, and she had been reliant on the advice of others.

This is excellent journalism. And it warms a linguist's heart to see how engaged Moss gets in the details of the project -- he's learning to do linguistic research, and he seems to have enjoyed it. But I'm afraid that this wasn't very good science, all the same.

In fact, Moss understands this (some of the following is apparently quoted from observations by "our linguist, Dr. Jane Sunderland"):

This is one man and one woman sampled on one, not necessarily, typical day. Moreover, our man admits that he is naturally reserved, while our woman is noted for her effervescence and says she always feels the need to act as a facilitator in conversations. They might almost have been chosen to act out the urban myth of taciturn man and talkative woman. [...]

Tim spent the first part of his recording at home, watching television, not talking to his family, and made two 40-minute tube journeys alone. He spent the day in the Guardian offices - which he doesn't usually - surrounded by people he did not know particularly well, and with his head down. (Hannah was also in the office, but she works there every day and is very relaxed in the environment.) Despite this (and despite at one point describing himself as "a man of few words"), Tim produced more than 11,000 words over 14 hours. [...]

In contrast to Tim, Hannah was with people most of the day (the exception being shopping in Sainsbury's). When you are with people you usually talk to them. (Incidentally, Hannah's figure suggests that for anyone to produce 20,000 words in a day would be difficult.)

We should add that the two subjects in this case knew what the point of the experiment was, and were able to adjust their behavior to influence the results. If you really wanted to draw conclusions about men and women in general, you'd need to record a demographically balanced sample of people in a balanced sample of contexts. With one woman and one man, you could get almost any result at all. It's nice that the Guardian's result was a plausible one, but it's a puzzle for philosophers, I think, why people are so ready to be influenced by the results of single-trial experiments on phenomena they know to be highly variable.
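To make the single-trial problem concrete, here is a minimal simulation sketch. Everything numeric in it is an assumption for illustration, not data: I draw daily word counts for men and women from the same hypothetical bell curve (mean 12,000, standard deviation 5,000), so any gap between the groups is pure sampling noise, and I pick 3,000 words as an arbitrary threshold for a "big" gap. The function names are likewise made up for this example.

```python
import random
import statistics


def simulate_gaps(n_per_group, trials=2000, mean=12000.0, sd=5000.0, seed=0):
    """Return (women's mean - men's mean) for each simulated study.

    Both groups come from the SAME hypothetical distribution, so every
    gap returned here is sampling noise rather than a real difference.
    """
    rng = random.Random(seed)
    gaps = []
    for _ in range(trials):
        # Clamp at zero: nobody speaks a negative number of words.
        women = [max(0.0, rng.gauss(mean, sd)) for _ in range(n_per_group)]
        men = [max(0.0, rng.gauss(mean, sd)) for _ in range(n_per_group)]
        gaps.append(statistics.fmean(women) - statistics.fmean(men))
    return gaps


def share_with_big_gap(gaps, threshold=3000.0):
    """Fraction of simulated studies whose gap is at least `threshold` words."""
    return sum(abs(g) >= threshold for g in gaps) / len(gaps)


one_pair = simulate_gaps(n_per_group=1)        # one woman, one man
bigger_study = simulate_gaps(n_per_group=100)  # 100 women, 100 men

print("one pair:      ", share_with_big_gap(one_pair))
print("100 per group: ", share_with_big_gap(bigger_study))
```

Under these assumptions, well over half of the one-pair "studies" show a gap of 3,000 words or more in one direction or the other, even though there is no real difference to find, while almost none of the 100-per-group studies do. A different assumed spread would change the exact shares but not the contrast.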

By a curious coincidence, another study featuring the interpretation of word counts from a small sample recently played a prominent role in a major English-language publication. This study was not a one-day journalistic lark, but a serious, decade-long study that has played a major role in influencing a public-policy debate that is central to our society. And yet, it has some issues in common with what Stephen Moss did.

I'm talking about Betty Hart and Todd Risley's classic research on social-class differences in language acquisition (Betty Hart and Todd Risley, "Meaningful Differences in the Everyday Experience of Young American Children", 1995; Betty Hart, "A Natural History of Early Language Experience", Topics in Early Childhood Special Education, 20(1), 2000; Betty Hart and Todd Risley, "The Early Catastrophe: the 30 Million Word Gap", American Educator, 27(1) pp. 4-9, 2003). This work was featured in Paul Tough's article in last Sunday's New York Times Magazine, "What it takes to make a student".

Here's the abstract from Hart and Risley (2003):

By age 3, children from privileged families have heard 30 million more words than children from underprivileged families. Longitudinal data on 42 families examined what accounted for enormous differences in rates of vocabulary growth. Children turned out to be like their parents in stature, activity level, vocabulary resources, and language and interaction styles. Follow-up data indicated that the 3-year-old measures of accomplishment predicted third grade school achievement.

This is obviously serious stuff. Here's some of Tough's discussion:

They found ... that vocabulary growth differed sharply by class and that the gap between the classes opened early. By age 3, children whose parents were professionals had vocabularies of about 1,100 words, and children whose parents were on welfare had vocabularies of about 525 words. The children’s I.Q.’s correlated closely to their vocabularies. The average I.Q. among the professional children was 117, and the welfare children had an average I.Q. of 79.

When Hart and Risley then addressed the question of just what caused those variations, the answer they arrived at was startling. By comparing the vocabulary scores with their observations of each child’s home life, they were able to conclude that the size of each child’s vocabulary correlated most closely to one simple factor: the number of words the parents spoke to the child. That varied greatly across the homes they visited, and again, it varied by class. In the professional homes, parents directed an average of 487 “utterances” — anything from a one-word command to a full soliloquy — to their children each hour. In welfare homes, the children heard 178 utterances per hour.

What’s more, the kinds of words and statements that children heard varied by class. The most basic difference was in the number of “discouragements” a child heard — prohibitions and words of disapproval — compared with the number of encouragements, or words of praise and approval. By age 3, the average child of a professional heard about 500,000 encouragements and 80,000 discouragements. For the welfare children, the situation was reversed: they heard, on average, about 75,000 encouragements and 200,000 discouragements. Hart and Risley found that as the number of words a child heard increased, the complexity of that language increased as well. As conversation moved beyond simple instructions, it blossomed into discussions of the past and future, of feelings, of abstractions, of the way one thing causes another — all of which stimulated intellectual development.

Hart and Risley showed that language exposure in early childhood correlated strongly with I.Q. and academic success later on in a child’s life. Hearing fewer words, and a lot of prohibitions and discouragements, had a negative effect on I.Q.; hearing lots of words, and more affirmations and complex sentences, had a positive effect on I.Q. The professional parents were giving their children an advantage with every word they spoke, and the advantage just kept building up.

This is certainly consistent with our expectations -- our stereotypes -- and unlike the 20,000-vs.-7,000 legend, it's based on experimental data. However, as Hart and Risley write:

All parent-child research is based on the assumption that the data (laboratory or field) reflect what people typically do. In most studies, there are as many reasons that the averages would be higher than reported as there are that they would be lower. But all researchers caution against extrapolating their findings to people and circumstances they did not include. Our data provide us, however, a first approximation to the absolute magnitude of children’s early experience, a basis sufficient for estimating the actual size of the intervention task needed to provide equal experience and, thus, equal opportunities to children living in poverty. We depend on future studies to refine this estimate.

They also tell us clearly that their sample was a small one:

Our final sample consisted of 42 families who remained in the study from beginning to end. From each of these families, we have almost 2 1/2 years or more of sequential monthly hour-long observations. On the basis of occupation, 13 of the families were upper socioeconomic status (SES), 10 were middle SES, 13 were lower SES, and six were on welfare.

Now, six is a bigger number than one, obviously, and it's big enough that it makes sense to do statistical significance tests on tables like this one, taken from Hart and Risley (2003):

Families' Language and Use Differ Across Income Groups

                                  13 Professional   23 Working-class  6 Welfare
Measures & Scores                 Parent   Child    Parent   Child    Parent   Child
Pretest score (a)                 41                31                14
Recorded vocabulary size          2,176    1,116    1,498    749      974      525
Average utterances per hour (b)   487      310      301      223      176      168
Average different words per hour  382      297      251      216      167      149

(a) When we began the longitudinal study, we asked the parents to complete a vocabulary pretest. At the first observation each parent was asked to complete a form abstracted from the Peabody Picture Vocabulary Test (PPVT). We gave each parent a list of 46 vocabulary words and a series of pictures (four options per vocabulary word) and asked the parent to write beside each word the number of the picture that corresponded to the written word. Parent performance on the test was highly correlated with years of education (r = .57).
(b) Parent utterances and different words were averaged over 13-36 months of child age. Child utterances and different words were averaged for the four observations when the children were 33-36 months old.

But six should not be a big enough number to lay our concerns to rest. We wouldn't try to predict the results of a national election based on an in-depth survey of six people in one city. Should we make national educational policy based on a similarly small sample, even if the data comes from 2 1/2 years of monthly visits? Does a sample of children from six poor families in 1980's St. Louis, as observed in a monthly visit from researchers with recording equipment, give a meaningful picture of the experience of the millions of people that Tough's article takes them to represent? In particular, it's not clear how to reconcile this picture of monetary poverty engendering linguistic poverty with the central role that "lower SES" people have always played in American linguistic creativity.
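The election analogy can be put in rough numbers. The sketch below uses the textbook normal-approximation margin of error for a sample proportion, which is my choice of illustration, not anything Hart and Risley computed. It assumes a simple random sample, which neither an in-depth survey in one city nor six families in one city would be, and the approximation is crude at n = 6. It therefore understates the real uncertainty.

```python
import math


def margin_of_error(n, p=0.5, z=1.96):
    """Approximate 95% margin of error for a sample proportion.

    Uses the normal approximation z * sqrt(p * (1 - p) / n), with
    p = 0.5 as the worst case. Assumes simple random sampling.
    """
    return z * math.sqrt(p * (1.0 - p) / n)


# 6 = the welfare group, 42 = the whole study, 1000 = a typical poll size
# (the last figure is an assumption for comparison, not from the article).
for n in (6, 42, 1000):
    points = 100 * margin_of_error(n)
    print(f"n = {n:>4}: about +/- {points:.1f} percentage points")
```

With six respondents the margin comes out at about 40 percentage points either way, against about 3 for a thousand, and that is before asking whether the six are representative of anyone but themselves.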

This is not a criticism of Hart and Risley, who did a marvelous piece of research. But I think that it amounts to a criticism of several related scientific disciplines, including my own. More than a decade after Hart and Risley's first publication, the "future studies" that they "depend on to refine [their] estimate" are mostly still just as hypothetical as ever.

[Update -- Mark McConville writes:

I remember the Guardian's Polly Toynbee using this research three years ago to argue for non-selective education -- "We can break the vice of the great unmentionable", 1/2/2004.

I'm not sure to what extent she has misrepresented the sample size though:

"Meaningful Differences in the Everyday Experience of Young American Children is one of the most thorough studies ever conducted."

"There is no room here to do justice to this epic analysis, but no one could fail to be convinced by it."

She didn't mention it was based on just "six poor families in 1980's St. Louis" :-)

Well, it was indeed an "epic analysis", especially for its time -- they collected 42*2.5*12 = 1,260 hours of recordings, transcribed and coded them in many ways, and then followed the same kids through their subsequent career in school, and cross-correlated everything with everything, more or less. That doesn't change the fact that the "welfare" group, whose kids are at greatest risk of low achievement in school, was N=6.]

Posted by Mark Liberman at November 28, 2006 08:39 AM