February 08, 2004

Reanalysis -- and not

Last week, the always-interesting Rosanne posted over at the X-Bar:

Here's a tidbit from my friend's son -- she refers to him as a linguistic adventurer, which I think is a delightful way to describe anyone acquiring language:

He objects to wearing long underwear beneath his clothing once he’s inside the house. He lets me know about this objection by peeling himself out of his topmost layer of clothing and bellowing: "Too both! Too both!"

Analysis withheld until I've had a good night's sleep. It is what it is, just too both.

As adult English speakers, we experience "too both" as a small collision between grammatical matter and anti-matter that vanishes in a flash of amusement. But in a more analytic mode, I suspect a prosaic back-story. For a kid just learning to talk, "too X" might as well mean "I don't like X" or "I'm uncomfortable about X" or "I can't deal with X" -- "too hot", "too cold", "too big", "too heavy". And "both" might as well mean "two layers of clothing" -- "let's see, should you wear your t-shirt? or your sweatshirt? or both?" So "too both" might alternatively be glossed "I-can't-deal-with two-layers-of-clothing", though this completely lacks the poetic concision of the original.

Traces of such infant reanalyses are everywhere in the speech of small children. Similar things sometimes happen in the larger history of languages, but not often in comparison. One little kid can't shift the whole cultural weight of the adult world, at least not very often. If a reanalysis takes hold, it must usually be because it spontaneously happens over and over, not because one infant's idea spreads to the speech community as a whole.

Some apparently plausible reanalyses are conspicuous by their absence. For example, the sequence "the uh" occurs about 600 times per million words of English conversation. This is roughly as frequent as words like look, ever, else, try, why, away, again, few, type, give, made, once or old. Sequences in which a filled pause ("uh") follows other common determiners (such as this, that, my, your, his) are also common. But I've never heard of a "linguistic adventurer" developing new words like "the-a", "this-a", "his-a", though it would be reasonable to guess that these might be (say) distal versions of the core determiners: "my-a X" = "that X of mine that we haven't just been talking about".
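For readers who want to check a per-million figure like this against their own transcripts, here is a minimal sketch of the arithmetic. It assumes the transcript is already split into lowercase word tokens with filled pauses written as "uh"; the function name `per_million` and the toy transcript are made up for illustration and are not the source of the 600-per-million figure above.

```python
def per_million(tokens, phrase):
    """Return occurrences of `phrase` (a tuple of tokens) per million tokens."""
    n = len(phrase)
    if n == 0 or len(tokens) < n:
        return 0.0
    # Slide a window of length n across the transcript and count exact matches.
    hits = sum(
        1
        for i in range(len(tokens) - n + 1)
        if tuple(tokens[i:i + n]) == tuple(phrase)
    )
    return hits * 1_000_000 / len(tokens)


# Toy transcript, invented for illustration only; not real corpus data.
toy = "i put the uh book on the uh the table and then i uh left".split()

the_uh_rate = per_million(toy, ("the", "uh"))
uh_rate = per_million(toy, ("uh",))
```

On a real conversational corpus the same function could be run for "the uh" and for single words such as "look" or "ever", to see whether the rates land in the same range.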

Is this because kids recognize that filled pauses are not exactly morphemes? That seems plausible, but there's quite a bit of research suggesting that "disfluencies are often really communicative choices rather than system failures." In that case, why don't filled pauses get lexicalized, alone or in combination with common adjacent words?

Is it because the communicative connotations of filled pauses are not salient enough, or at least not available to learners until later in development, when their basic grammar and lexicon of functional categories is already established? Maybe.

Or do filled pauses in fact get lexicalized (in language learning and/or in language history), so that I'm just wrong about the facts? In that case, some reader will probably set me straight.

[Update: Maryellen MacDonald emailed:

I read your language log post on child reanalysis (too both, etc.) and wanted to comment on a couple of points. You were speculating on why children don't reanalyze the-uh despite its extremely high frequency, and in connection with this you noted that

there's quite a bit of research suggesting that "disfluencies are often really communicative choices rather than system failures."

There's room to interpret much of the data here somewhat differently, I think. That is, there is good evidence that comprehenders make very good use of disfluencies to help comprehension (and so can language models, as Stolcke and Shriberg show), but it does not follow from this result that the disfluencies are provided for the comprehender's benefit. Disfluencies appear to be uttered (perhaps among other reasons) to buy the speaker more time to plan upcoming material, and thus they have a very non-random distribution--they appear in advance of more difficult material--lower frequency words, longer phrases, more complex constructions, etc. The presence of a disfluency therefore has a great deal of predictive value for the comprehender concerning the upcoming material. Thus uttering "uh" is certainly a choice on the part of the speaker, and it also has communicative consequences, but it need not be a choice with the comprehender's needs in mind. Comprehenders may just be very good at taking advantage of choices that are being made to meet the speaker's needs alone.

Now as for why a child wouldn't reanalyze the-uh as a variant of the, etc. I lean toward your first hypothesis, that kids realize that filled pauses aren't exactly morphemes, and that they do this because their distributional pattern is unlike that of real morphemes. Filled pauses show up in a variety of structures and follow a variety of words beyond just determiners, of course. A major commonality is that they follow easy (short, high frequency) material and precede longer, lower frequency words. Also their acoustic duration probably varies more than for other syllables that are bound morphemes. And this isn't like the distribution of real morphemes.

I mostly agree, though I'm not at all sure that the "acoustic duration [of filled pauses] probably varies more than for other syllables that are bound morphemes". Since pauses (especially unfilled ones) induce lengthening of final syllables, I bet that things like "-ing" are pretty variable in duration. And uhs are often pretty stereotyped in their performance. Checking this out would be a nice term project for a student in a phonetics course.
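One way such a term project might compare variability, sketched under stated assumptions: measure the duration of each token of "uh" and of a bound morpheme like "-ing", then compare the coefficient of variation (standard deviation divided by mean), which adjusts for the two items having different average lengths. The durations below are invented placeholders, not measurements, so they show nothing about which item actually varies more; the function name is hypothetical.

```python
from statistics import mean, stdev


def coefficient_of_variation(durations):
    """Sample standard deviation divided by the mean (unitless)."""
    if len(durations) < 2:
        raise ValueError("need at least two duration measurements")
    return stdev(durations) / mean(durations)


# Invented durations in seconds, for illustration only; not real measurements.
uh_durations = [0.30, 0.32, 0.29, 0.31]
ing_durations = [0.10, 0.18, 0.12, 0.25]

cv_uh = coefficient_of_variation(uh_durations)
cv_ing = coefficient_of_variation(ing_durations)
```

With real data, the durations would come from hand-labeled or force-aligned recordings, and the comparison would need enough tokens of each type, in comparable prosodic positions, to mean anything.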

In a different vein, Nicholas Widdows emailed:

Very interesting point about the possibility of lexicalization of fillers. My first reaction was that the filler isn't analysable as having any particular syntactic function, since it can occur almost anywhere. There's not enough purchase for the child to analyse it as a determiner affix.

Then I looked at the earlier statistics you quoted and wondered what entropy could actually do to syntax learning. If [@] has some predictive value at some point, how could that get grammaticalized? In DP = [D N] it's phonologically attached to D but pragmatically marks N as unfamiliar, as opposed to 'the cat', familiar, no [@]. That is, if the child begins by perceiving it as a distinct element in [D X N], the semantic relationship is [D [X N]] and you need some kind of reanalysis to create syntactic [[D-X] N].

Presumably the fillers are more likely just before the high-entropic parts, N or V, and we can pre-construct the functional category scaffolding for what we plan to say before realizing we need to dredge up the lexical word. So we get 'and then I, er...' and 'put the, er...', but that doesn't necessarily mean that the parser has a slot at that point available for syntactic content.

What I'm trying to say with all these points is that an analysis of the filler as material actually attached to a preceding functional category won't be made because the syntactic structure doesn't license it there.

I'm not sure that I buy this. It's not true that determiners are always atomic, certainly -- it's normal for them to show gender and number agreement, of course, and there are some other sorts of morphologically complex determiners as well.

It seems like all you'd need would be an accessible meaning dimension to associate, and there are some plausible ones. For example, some languages have a systematic proximal/distal distinction in determiners. So the learner could hypothesize that "uh" is a distal morpheme, since it tends to occur before items that are being newly introduced, are less familiar etc. Some languages also encode new/old information differences, so the same patterning might lead a learner to think that "uh" marks new information.

Though I guess I have to admit that I've never heard of the theme/rheme distinction being marked on determiners. And I might well be missing some of Nicholas' point, not being a syntactician.

In any case, I'm inclined to agree with Maryellen that learners catch on quickly to "uh" being some kind of out-of-band signal, even if they also learn to recognize its information content. The best argument would be that filled pauses never seem to get lexicalized in any context. I cited the case of "the uh" because it's the most common, but you could have "nice uh" or "mean uh" or "boy uh" as lexicalization candidates just as well.

Of course, another problem is that if a child went through a stage of using "uh" in some novel way, it might be pretty hard to notice it. The meaning is likely to be somewhat subtle, and there aren't any contexts in which "uh" would be strikingly out of place. We notice things like "too both" because they're ungrammatical in the adult language; we notice things like "let's fight with our chudders" (generalized with re-analysis from "each other") or "whobody's there?" because they involve recognizably non-standard lexical items; but would we even notice "the-a" and "my-a" if a three-year-old started using them as rhematic determiners? Heck, would we notice if 10% of the population of North America was doing it, day in and day out?]

Posted by Mark Liberman at February 8, 2004 10:29 AM