Arnold Zwicky recently wrote that "I recalled with pleasure [C.C.] Fries's careful development of a system of parts of speech via distributional analysis, using as raw data some fifty hours of (covertly) recorded conversations". Tom Duff emailed an interesting suggestion in response:
I wonder if there's a primary education hook here (and a way to promote general Linguistics awareness.) Unless the math is too heavyweight, it sounds like a research program that schoolkids could replicate: taking down each other's speech, analyzing the data, discovering the grammar of the language as used by their peers. I would have been so stoked by this when I was 9 or 10.
It sounds like a complete primary education program -- English, science & math all rolled together. And talking in class!
I think this is an absolutely terrific idea. There are many difficulties, some of which I'll sketch below, but the opportunities are even greater. And much of the needed computational infrastructure could be shared with other projects, pedagogical and otherwise.
First, let's generalize the idea. Although distributional analysis of word classes would be a fine thing to do, you wouldn't want either to start there or to stop there. Students could learn some acoustic physics with their math, while looking at pitch contours or measuring formant frequencies and segment durations. They could learn some simple statistics, especially if data is available to them from multiple classes and schools, by looking at the effects of age, sex, region and so on. They could analyze the rhetoric and the performance of speeches, or the dynamics of conversation, looking at how gestures and facial expressions are aligned with words and phrases. They could compare vernacular and formal speech. They could look at different languages, for example to see how differently words with similar meanings are used.
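To make the starting point concrete, here is a toy sketch of what distributional analysis of word classes might look like in code. It is not a reconstruction of Fries's actual procedure, just a minimal illustration under simple assumptions: each word's "frame" is the pair of words immediately before and after it, and words that share frames are candidates for the same class. The function names (`context_frames`, `similar_words`) and the six sample sentences are invented for the example.

```python
from collections import defaultdict

def context_frames(sentences):
    """Map each word to the set of (previous word, next word) frames it occurs in."""
    frames = defaultdict(set)
    for sentence in sentences:
        # Pad with boundary markers so first and last words also get a frame.
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        for prev, word, nxt in zip(tokens, tokens[1:], tokens[2:]):
            frames[word].add((prev, nxt))
    return dict(frames)

def similar_words(frames, target):
    """List (word, number of shared frames) for words sharing a frame with target."""
    target_frames = frames.get(target, set())
    scored = []
    for word, word_frames in frames.items():
        if word == target:
            continue
        shared = len(target_frames & word_frames)
        if shared > 0:
            scored.append((word, shared))
    # Most shared frames first; ties broken alphabetically.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))

# Invented toy data; a real project would use transcribed speech.
sentences = [
    "the dog is here",
    "the cat is here",
    "the dog ran away",
    "the cat ran away",
    "a dog is good",
    "a cat is good",
]

frames = context_frames(sentences)
print(similar_words(frames, "dog"))   # words that pattern like "dog"
print(similar_words(frames, "the"))   # words that pattern like "the"
```

With only six sentences the groupings are trivially clean; with real conversational data the frames are sparse and noisy, which is one reason the analysis needs far more material than a single class could record.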
I can say from personal (though informal) experience that bright nine- or ten-year-olds are interested in this kind of thing, at least at the level of looking at waveforms, spectrograms and pitch tracks of their own speaking, singing and assorted weird noises, or using web search to try to figure out what the right way to say something in Spanish is.
And as a technical matter, it would be fairly easy to make such analyses available to kids. Most of the needed infrastructure is already available, as free software on generic personal computers -- though you'd need to create more kid-friendly (or teacher-friendly) versions in some cases. There's one thing that's still missing, however: support for sharing data and for conveniently accessing shared data. The main motivation for this is that many interesting things, including distributional analysis of word classes, require more data than one class could collect; but even sharing data within a group of 30 or so students could be challenging without an appropriate system.
Here's one idea about what you'd want: a server where anyone can upload audio (and video too) with appropriate metadata; an Ajax-based tool for creating, editing and viewing transcriptions (and other time-aligned annotations), also saved and accessed on the net; a mechanism for defining virtual corpora out of sets of these annotated audio/video files; and a user interface (and an API) for searching such virtual corpora.
This would be useful for education through the graduate-school level, and for many scientific and engineering projects as well. I think that anyone who's ever taught or done research in this general area can see how it might be used.
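The pieces listed above (recordings with metadata, time-aligned annotations, virtual corpora, search) could be modeled in many ways. The following is only a rough in-memory sketch of one possibility, not a design for the real thing: every class, field and method name here (`Annotation`, `Recording`, `VirtualCorpus`, `search`) is hypothetical, the URLs are placeholders, and an actual service would need storage, access control and a web API on top.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class Annotation:
    """One time-aligned label on a recording (times in seconds)."""
    start: float
    end: float
    tier: str   # e.g. "words" or "gesture"; tier names are an assumption
    text: str

@dataclass
class Recording:
    """An uploaded audio/video file plus its metadata and annotations."""
    media_url: str
    metadata: Dict[str, str]
    annotations: List[Annotation] = field(default_factory=list)

@dataclass
class VirtualCorpus:
    """A named selection of recordings; holds references, not copies."""
    name: str
    recordings: List[Recording]

    def search(self, text: str, tier: str = "words",
               **metadata_filters: str) -> List[Tuple[str, float, float, str]]:
        """Find annotations on a tier containing text, in matching recordings."""
        hits = []
        for rec in self.recordings:
            # Skip recordings whose metadata does not match every filter.
            if any(rec.metadata.get(k) != v for k, v in metadata_filters.items()):
                continue
            for ann in rec.annotations:
                if ann.tier == tier and text.lower() in ann.text.lower():
                    hits.append((rec.media_url, ann.start, ann.end, ann.text))
        return hits

# Placeholder data for illustration only.
rec_a = Recording(
    "https://example.org/audio/a.wav",
    {"region": "northeast", "age": "10"},
    [Annotation(0.0, 0.4, "words", "so"), Annotation(0.4, 0.9, "words", "stoked")],
)
rec_b = Recording(
    "https://example.org/audio/b.wav",
    {"region": "south", "age": "9"},
    [Annotation(1.2, 1.7, "words", "stoked")],
)
corpus = VirtualCorpus("pilot-classes", [rec_a, rec_b])
print(corpus.search("stoked"))
print(corpus.search("stoked", region="south"))
```

Keeping the corpus as a list of references to recordings is what makes it "virtual": the same annotated file can belong to a class's own collection and to a larger multi-school collection at once.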
OK, enough enthusiasm. Now for some of the (very serious) problems with the idea.
1. Most elementary-school and high-school teachers don't have the background needed to understand and teach such stuff, much less to create course materials based on it.
2. There are ethical and legal problems, in the general area covered by "human subjects" regulations, that are more acute in dealing with kids. You'd have to worry about how to prevent students from releasing information about personal identity, or inappropriate information about themselves and their families, or slander about their classmates, or whatever. This is related to the problems that MySpace and Facebook have, except that in this case, (some of) the material would be created or used under the authority of schools, which need to be much more cautious.
3. Even if problems (1) and (2) were dealt with, my guess is that the hardest problem here is how to create "lab exercises" that would work for students of different ages, backgrounds and interests, as presented by a similarly diverse set of teachers.
All the same, the general idea is a wonderful one. The (additional) infrastructure is worth implementing for other reasons -- more on this later. And I guess the way to make progress on the pedagogical problems would be to try it out with some kids in pilot projects, which could be in schools or in other contexts, like a summer camp or a museum program.
[Update -- Mike Maxwell and Bill Poser remind me that Ken Hale had the idea, more than 30 years ago, of using study of the Navajo language to teach the scientific method to Navajo students. Bill mentioned this work in an earlier LL post ("Reintroducing diagramming", 11/7/2004). Ken wrote a (still unpublished) textbook in support of this idea. Mike also cites Josie White Eagle, "Teaching Scientific Inquiry and the Winnebago Language", International Journal of American Linguistics, 48, 306-319, and a paper by Michael Barkey, "Linguistics and Scientific Inquiry" (ms. dated 9/4/2006), which includes a brief review of "what others have done" (pp. 26-28), including Nigel Fabb's "Linguistics for ten year olds" (MIT Working Papers in Linguistics, 6, 45-61, 1985). Josie White Eagle's paper in turn refers us to a 1970 paper by Samuel Jay Keyser, "The role of linguistics in the elementary school curriculum", Elementary English, January 1970, 39-45. A bit of internet searching also turned up a brief review by Wayne O'Neil, "Linguistics in the Science Classroom: Progress and Prospects".
Since the general concept has been around for more than a generation, we need to ask why it has never been adopted to any significant extent. My speculation would be that it's because curricular innovation is hard; because most teachers lack the knowledge and skills needed to teach such material; and because the cultural trend has been strongly against teaching any analytic skills at all, at least in the area of language and communication.
Are things any different now? Well, the anti-analytic tide may have turned; the internet's "long tail" effect makes it easier for enterprising teachers to find and use curricular materials; and it may be possible to design interactive web-based materials and tools that can help teachers (and students) develop the concepts and skills that they need to make such ideas work. Also, we might get some added traction from the use of corpus-based rather than intuition-based methods, especially for kids who are already used to internet search.]
Posted by Mark Liberman at November 7, 2006 07:49 AM