January 17, 2004

Hi Lo Hi Lo, it's off to formal language theory we go

In my advertisement for Fitch and Hauser's new Science paper, I suggested that "one should be careful not to overinterpret these results." I'd like to explain what I meant. The experiment is a very interesting one, but Fitch and Hauser describe it in terms that are likely to mislead many readers.

Fitch and Hauser write:

Rule systems capable of generating an infinite set of outputs ("grammars") vary in generative power. The weakest possess only local organizational principles, with regularities limited to neighboring units. We used a familiarization/discrimination paradigm to demonstrate that monkeys can spontaneously master such grammars. However, human language entails more sophisticated grammars, incorporating hierarchical structure. Monkeys tested with the same methods, syllables, and sequence lengths were unable to master a grammar at this higher, "phrase structure grammar" level.

They are careful to say that their monkeys "were unable to master a grammar" at the phrase structure level. However, they assert in a more general way that "monkeys can spontaneously master such grammars", referring to finite-state grammars as a class. But the experiment did not test so general a claim -- it showed that tamarins could recognize deviations from the pattern imposed by one particular grammar, not that they could master all grammars of that class.

In any case, the interpretive problem is a much deeper one. The two particular grammars that F & H used in their experiment were so simple -- effectively generating only two short sentences each -- that it seems wrong to elevate the discussion to the level of distinctions among grammar types at all. Their (very interesting) result could alternatively be described as follows:

Given exposure to instances of the patterns ABAB and ABABAB, tamarin monkeys showed increased interest in patterns AABB and AAABBB, perhaps because these contained two to four copies of the salient (because repeated) two-element sequences (bigrams) AA and BB, which they had not heard before. By contrast, given exposure to instances of the patterns AABB and AAABBB, other tamarins did not show significantly increased interest in the patterns ABAB and ABABAB, perhaps because they contained only one or two copies of the previously-unheard bigram BA, which may also be less salient because it does not involve a repetition.

Given the same stimulus sequences, human subjects were able to categorize the new patterns as different, regardless of the direction of training and testing, perhaps because their threshold for noting statistical sequence differences was lower, and perhaps because they were able to remember longer sequences, thus noting that the training material AABB and AAABBB did not contain the four-element sequence ABAB.

Put this way, it's an experiment about memory span and/or sensitivity to statistical deviations. No talk about grammars, much less hierarchies of grammatical complexity, is required.
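To make the bigram account concrete, here is a minimal Python sketch, with the stimuli abstracted to strings of "A" and "B" (this is an illustration of the alternative explanation above, not anything from F&H's own analysis):

```python
def bigrams(s):
    """Return the set of adjacent two-symbol sequences in a pattern."""
    return {s[i:i + 2] for i in range(len(s) - 1)}

fsg_training = {"ABAB", "ABABAB"}     # (AB)^n, n = 2, 3
psg_training = {"AABB", "AAABBB"}     # A^n B^n, n = 2, 3

def familiar(training):
    """All bigrams heard during familiarization."""
    return set().union(*(bigrams(s) for s in training))

# Group trained on (AB)^n, then tested on A^n B^n:
novel_for_fsg_group = familiar(psg_training) - familiar(fsg_training)
print(novel_for_fsg_group)   # {'AA', 'BB'}: two novel bigrams, both repetitions

# Group trained on A^n B^n, then tested on (AB)^n:
novel_for_psg_group = familiar(fsg_training) - familiar(psg_training)
print(novel_for_psg_group)   # {'BA'}: one novel bigram, not a repetition
```

The asymmetry in novel-bigram counts (two salient repetitions versus one non-repetition) is all the machinery this explanation needs.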

Here are the details. Fitch and Hauser explain about their stimuli that

The FSG was (AB)n, in which a random "A" syllable was always followed by a single random "B" syllable, and such pairs were repeated n times. The corresponding PSG, termed AnBn, generated strings with matched numbers of A and B syllables. In this grammar, n sequential "A" syllables must be followed by precisely n "B" syllables.

So the "finite state" language is (AB)n for n=2 and n=3, i.e. exactly the set of two patterns {ABAB, ABABAB}, while the "phrase structure" language is AnBn, for n=2 and n=3, i.e. exactly the set of two patterns {AABB, AAABBB}.
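Spelled out in code (a sketch, using Python string repetition as a stand-in for the two grammars):

```python
# Abstract patterns generated by each grammar for n in {2, 3}
fsg = {"AB" * n for n in (2, 3)}            # (AB)^n: the "finite state" language
psg = {"A" * n + "B" * n for n in (2, 3)}   # A^n B^n: the "phrase structure" language

print(sorted(fsg))   # ['ABAB', 'ABABAB']
print(sorted(psg))   # ['AAABBB', 'AABB']
```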

F&H motivate the lower limit on n (but not the upper one) as follows:

Because previous work demonstrates that tamarins can readily remember and precisely discriminate among strings up to three syllables in length, we restricted n to be two or three in both of the above grammars.

So it seems that these two "languages" -- intended to represent whole classes of formal grammatical power -- consisted of just two strings each, one four symbols long and the other six symbols long? Well, superficially, no -- the languages are much bigger than that, though still finite. A and B represent classes of syllables, with A being one of {ba di yo tu la mi no wu}, while B is one of {pa li mo nu ka bi do gu}. There are eight options for each class, and strings of syllables are formed by random selection without replacement, so the number of possible syllable strings in the FSG language is

8*8*7*7 + 8*8*7*7*6*6 = 116,032

and the number of possible syllable strings in the PSG language is the same.
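That count is easy to check by brute-force enumeration (a sketch; the syllable lists are from F&H's stimulus description, and "without replacement" means each ordered draw is a permutation):

```python
from itertools import permutations

A = ["ba", "di", "yo", "tu", "la", "mi", "no", "wu"]   # female-voice syllables
B = ["pa", "li", "mo", "nu", "ka", "bi", "do", "gu"]   # male-voice syllables

def count_strings(n):
    """Number of strings built from n A-syllables and n B-syllables,
    each drawn at random without replacement within its class.
    The count is the same whichever grammar arranges the positions."""
    a_ways = len(list(permutations(A, n)))   # 8*7, or 8*7*6
    b_ways = len(list(permutations(B, n)))
    return a_ways * b_ways

total = count_strings(2) + count_strings(3)
print(total)   # 116032 = 8*8*7*7 + 8*8*7*7*6*6
```

Note that the count depends only on how many syllables are drawn from each class, not on how the grammar orders them -- which is why the FSG and PSG languages come out the same size.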

Much better! Or maybe not... As Fitch and Hauser explain,

The A and B classes were perceptually clearly distinguishable to both monkeys and humans: different syllables were spoken by a female (A) and a male (B) and were differentiated by voice pitch (> 1 octave difference), phonetic identity, average formant frequencies, and various other aspects of the voice source.

In other words, the listener (human or tamarin) could forget about all the ba di yo tu stuff, and just pay attention to whether the syllable was spoken by a high-pitched female speaker or a low-pitched male one. To make it easier, there was just one female speaker and one male speaker, so you could also distinguish the classes by speaker identity.

Now the languages are down to two sentences each again. The "finite state grammar" language contains the two sentences

{ Hi Lo Hi Lo , Hi Lo Hi Lo Hi Lo }

and the "phrase structure grammar" language contains the two sentences

{ Hi Hi Lo Lo , Hi Hi Hi Lo Lo Lo }

Fitch and Hauser consider and reject a version of the alternative memory-span interpretation given above:

An alternative explanation for these results might be that tamarins fail the PSG because their ability to differentiate successive items is limited to runs of two. If this were true, it would account for the asymmetric results we obtained because they would be able to encode AB AB AB patterns but be unable to process the longer runs of AAA BBB. However, a subanalysis gave the same pattern of results even when n was limited to two (ABAB versus AABB).

This addresses a different alternative interpretation from the one I offered (quoted above). It doesn't affect my suggested alternative. In the n=2 case, the FSG deviation (AABB given experience with ABAB) involves two novel bigrams, both repetitions; while the PSG deviation (ABAB given experience with AABB) involves only one novel bigram, not a repetition. This is plenty of differentiation to base an explanation on. It's also possible, as they suggest here, that the alternating sequences are grouped as (AB)(AB), which would make bigrams starting in odd-numbered positions more salient than those in even-numbered positions. This would make my account work even better, since the novel bigram BA might not even be registered as a unit. It's plausible that such binary grouping of alternating sequences is done by humans, and if it were true of monkeys as well, that would be interesting.

The familiarization/discrimination paradigm is a promising one for animal studies, as it has been for studies of human infants, and the results so far are interesting, but let's face it, it's really stretching things to claim that we've learned anything about embedding, recursion, etc. -- or even about the various kinds of dependencies that finite-state grammars can express. It would be very unwise, for example, to place any wagers on the ability of tamarins to learn to recognize exactly those symbol sequences that an arbitrary egrep-style regular expression matches -- though the implication of F & H's claim that "monkeys can spontaneously master [finite state] grammars" is that they should be able to do this.
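For concreteness, here is the gap between what was tested and what the class-level claim would commit the tamarins to (a sketch; Python's `re` module stands in for egrep, and the second pattern is an arbitrarily chosen example of a finite-state language):

```python
import re

# The one finite-state pattern the tamarins actually experienced: (AB)^n, n = 2 or 3.
trained = re.compile(r"^(AB){2,3}$")

# Any egrep-style regular expression defines an equally "finite-state" language --
# for instance, strings of A's and B's containing an even number of B's:
arbitrary = re.compile(r"^A*(BA*BA*)*$")

print(bool(trained.match("ABAB")))     # True:  in the trained language
print(bool(arbitrary.match("ABBA")))   # True:  two B's
print(bool(arbitrary.match("ABA")))    # False: one B
```

Mastering "finite-state grammars" as a class would mean handling the second sort of dependency as readily as the first -- which nothing in the experiment tested.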

You could use the familiarization/discrimination paradigm, with appropriately varied patterns in training and testing, to explore the limits of monkeys' abilities in that area. Similarly, the plausibility of my alternative explanation in terms of sequence statistics could be tested against a more general account in terms of grammar types. We have to recognize that it's going to be hard to design these experiments, since each will involve a necessarily finite sample from each of a small number of "grammars" or other pattern spaces, and each such sample will be subject to multiple alternative descriptions. In fact, in a mathematical sense, the problem is impossible. However, it should be possible to explore the question in a way that will lead reasonable people towards provisional acceptance of interesting general conclusions about how to characterize the abilities of different animals in such experiments.

I expect that Fitch and Hauser, who are both serious researchers with a history of excellent work, will do such things, along with other cognitive scientists. But let's hold off on the general claims until the research has been done!

I wish that I could say that I'm surprised that Science let these excessive claims pass into print. There are two forces at work here -- the desire for big results, and the vagaries of reviewing at an interdisciplinary journal -- that together lead to such outcomes all too often.

[Update 9/1/2004: A later paper by Perruchet and Rey, reporting results on human subjects that call F&H's characterization of these experiments into question, is discussed here. ]

Posted by Mark Liberman at January 17, 2004 07:18 AM