April 09, 2005

Language in the social and behavioral sciences

I'm going to start this post with two suggestive graphs, continue with some historical background, and end with a startling prediction.

The graphs come from a talk that Jamie Pennebaker gave at Penn a few weeks ago. They plot two simple time-functions derived from posts by Americans on LiveJournal between September 10 and November 5, 2001.

The first graph shows the frequency (in percent) of the words I, me, my, day-by-day from September 10 to September 24, and then week-by-week to November 5:

The second graph shows the frequency of the words we, us, our from the same sources over the same period:

Pennebaker and his co-workers calculated these counts because of their theory that "word choice can serve as a key to people's personality and social situations", and in particular that "pronouns, prepositions, conjunctions, articles, and auxiliary verbs" are especially "powerful indicators of people’s psychological state". Their work offers many other striking facts and interpretations, and raises all sorts of complex questions, all of which I'll ignore for now, because I want to talk about the history and future of a related family of ideas and techniques.
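To make the underlying arithmetic concrete, here's a minimal Python sketch of the kind of computation behind graphs like these -- not Pennebaker's actual pipeline, just the obvious back-of-the-envelope version: tally how many word tokens in each day's posts are "I", "me" or "my", and divide by the day's total token count. The posts list is invented sample data standing in for the downloaded LiveJournal entries, and the regex tokenization is my own simplification.

    import re
    from collections import defaultdict

    # Invented (date, text) pairs standing in for downloaded journal entries.
    posts = [
        ("2001-09-10", "My day was ordinary and I spent most of it at work."),
        ("2001-09-11", "We are all watching the news; our city is in shock."),
        ("2001-09-12", "We will get through this together, all of us."),
    ]

    TARGETS = {"i", "me", "my"}           # first-person singular pronouns
    counts = defaultdict(lambda: [0, 0])  # date -> [target tokens, all tokens]

    for date, text in posts:
        tokens = re.findall(r"[a-z']+", text.lower())
        counts[date][0] += sum(tok in TARGETS for tok in tokens)
        counts[date][1] += len(tokens)

    # Daily percentage of first-person singular pronouns.
    for date in sorted(counts):
        hits, total = counts[date]
        print(date, f"{100.0 * hits / total:.2f}%")

Weekly aggregation, as in the later portion of the graphs, works the same way, just with coarser date buckets; and swapping in {"we", "us", "our"} as the target set gives the second graph.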

The early 1960s saw Gerard Salton's insight that the content of a document can be usefully approximated by nothing more than the frequency counts of the words it contains, and also the influential work by Frederick Mosteller and others on the use of simple linguistic statistics to make inferences about authorship. It was during this same period that Bill Labov showed how to use counts of simple things like word choice and pronunciation variation to investigate the social and temporal dimensions of language.
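As a toy illustration of Salton's idea -- a sketch of the general approach, not his SMART retrieval system -- each document below is reduced to a word-frequency vector, and two documents are compared by the cosine of the angle between their vectors. The example sentences are my own.

    import math
    import re
    from collections import Counter

    def bag_of_words(text):
        """Map a text to its word-frequency vector (a Counter)."""
        return Counter(re.findall(r"[a-z']+", text.lower()))

    def cosine(a, b):
        """Cosine of the angle between two sparse frequency vectors."""
        dot = sum(a[w] * b[w] for w in a if w in b)
        norm = (math.sqrt(sum(v * v for v in a.values())) *
                math.sqrt(sum(v * v for v in b.values())))
        return dot / norm if norm else 0.0

    doc1 = bag_of_words("The frequency counts of the words a document contains.")
    doc2 = bag_of_words("Word frequency counts usefully approximate document content.")
    print(f"cosine similarity: {cosine(doc1, doc2):.2f}")

Real retrieval systems weight the raw counts (tf-idf and its many descendants), but the underlying representation is still just these word-frequency vectors.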

About 20 years later, in the mid 1980s, Doug Biber and others explored the idea that notions like register and genre could emerge from an analysis of the distribution of simple linguistic measures across texts. Pennebaker's work adds "people's personality and social situations" to the list of things that can be studied this way.

Anyone who uses internet search is reaping the benefits of Salton's ideas, and of course there's a robust area of academic and industrial research on how to make textual information retrieval work better. As for models of authorship, Erica Klarreich wrote in a Science News article in 2003 that

Stylometry is now entering a golden era. In the past 15 years, researchers have developed an arsenal of mathematical tools, from statistical tests to artificial intelligence techniques, for use in determining authorship. They have started applying these tools to texts from a wide range of literary genres and time periods...

Quantitative sociolinguistics has become an established discipline, with its own journals and meetings. Computational linguists have been busily and successfully applying frequentistic methods to a wide range of problems, from parsing and semantic analysis to summarization, automatic translation and "text data mining". And psycholinguists, who have always had to control for frequentistic effects in their experimental design, are increasingly interested in studying such effects directly.

Viewed in this context, what's especially interesting to me about Pennebaker's work is how isolated it is. If we look across the social and behavioral sciences -- outside of sociolinguistics and psycholinguistics -- there are remarkably few cases where linguistic analysis plays any explicit role in research. (See Damon Mayaffre's "digital hermeneutics" for another example.)

I'm exempting sociolinguistics and psycholinguistics because the whole enterprise in these subdisciplines is focused on aspects of language or language use. And I'm using the term "explicit role in research" because many social and behavioral science researchers use linguistic analysis implicitly, for instance in interpreting survey or interview results, or in examining political rhetoric. What's missing is work that uses linguistic analysis -- even something as trivial as word counts -- as an explicit component of a research program that's not mainly about language.

I predict that this will change. Pennebaker's use of LiveJournal data to investigate the social and psychological effects of 9/11 suggests some of the reasons:

  • Enormous amounts of text are now being produced in digital form, explicitly situated in space, time and various sorts of social networks.
  • Much of this text is freely available to anyone who cares to download it from the web.
  • Even the most elementary forms of analysis (such as local word counts) can serve as effective indicator variables for content, individual and social identity, style, emotional state and so on.
  • Simple and accessible computer methods make it easy to generate and analyze such data on a large scale.

There are other reasons as well:

  • There are new techniques for automatic analysis of the form and content of text (parsing, tagging of "entity mentions", determination of reference and co-reference, etc.).
  • There are new statistical techniques for finding relevant patterns in very high-dimensional data.
  • In some cases, linguistic analysis could be used simply to enhance research productivity in existing paradigms (e.g. because many of the kinds of "coding" of survey and interview transcripts that already go on every day could be automated -- see the sketch just below this list).
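Here, for instance, is what the simplest dictionary-based version of that automated "coding" might look like: each transcript answer is scored against hand-built category word lists. The categories and answers below are invented for illustration; real coding dictionaries (LIWC-style, for instance) are much larger and carefully validated.

    import re

    # Invented category word lists and transcript answers, purely for illustration.
    CATEGORIES = {
        "negative_emotion": {"afraid", "angry", "sad", "worried"},
        "social": {"we", "us", "our", "friends", "family", "together"},
    }

    answers = [
        "We were worried about our family but stayed together.",
        "I felt angry and sad for weeks afterward.",
    ]

    for answer in answers:
        tokens = re.findall(r"[a-z']+", answer.lower())
        codes = {cat: sum(tok in words for tok in tokens)
                 for cat, words in CATEGORIES.items()}
        print(answer)
        print("   ", codes)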

There's also an enormous educational opportunity here. At the high school level, you could use quantitative linguistic analysis to teach statistics, simple computer programming, and scientific methodology -- and even perhaps some linguistics! Simple techniques of this kind can be applied to many sorts of problems that most students will be interested in: information retrieval, analysis of individual and group identity, style, personality and mood, and so on. So I also predict that it will become routine to use this stuff to teach math and science in high school.

Some readers may be tempted to complain that these predictions are not at all "startling," despite what I wrote in the first sentence of this post. If you're one of them, I'm happy that you share my belief that the predicted changes are so easy and so beneficial that implementing them would be a no-brainer. But I'm afraid that I still find the predictions "startling", in the sense that I'll be pleasantly surprised if they come true in the near future.

[ You can learn more about the 9/11 LiveJournal investigation in Michael A. Cohn, Matthias R. Mehl and James W. Pennebaker, Linguistic Markers of Psychological Change Surrounding September 11, 2001, Psychological Science, Volume 15, Issue 10, pages 687-693, October 2004.

The abstract:

The diaries of 1,084 U.S. users of an on-line journaling service were downloaded for a period of 4 months spanning the 2 months prior to and after the September 11 attacks. Linguistic analyses of the journal entries revealed pronounced psychological changes in response to the attacks. In the short term, participants expressed more negative emotions, were more cognitively and socially engaged, and wrote with greater psychological distance. After 2 weeks, their moods and social referencing returned to baseline, and their use of cognitive-analytic words dropped below baseline. Over the next 6 weeks, social referencing decreased, and psychological distancing remained elevated relative to baseline. Although the effects were generally stronger for individuals highly preoccupied with September 11, even participants who hardly wrote about the events showed comparable language changes. This study bypasses many of the methodological obstacles of trauma research and provides a fine-grained analysis of the time line of human coping with upheaval.

]

Posted by Mark Liberman at April 9, 2005 11:01 AM