July 02, 2004

Analyzing voice stress

Yesterday's NYT had an article on voice stress analyzers. As a phonetician -- someone who studies the physics and physiology of speech -- I've been amazed by this work for almost three decades. What amazes me is that research (of a sort) and commerce (at a low level) and law-enforcement applications (here and there) keep on keepin' on, decade after decade, in the absence of any algorithmically well defined, reproducible effect that an ordinary working speech researcher like me can go to the lab, implement and test.

Well, these days there's no need to go to the lab for this stuff -- you just write and run some programs on your laptop. But that makes the whole thing all the more amazing, because after 50 years, it's still not clear what those programs should do. I'm not complaining that it's unclear whether the methods work -- that's true too, but the real scandal is that it's still unclear what the methods are supposed to be.

Specifically, the laryngeal microtremors that these techniques depend on haven't ever been shown clearly to exist, as far as I know. No one has ever shown that if these microtremors exist, it's possible to measure them in the pitch of the voice, in a way that separates them from all the other phenomena that modulate the pitch at similar rates. And that's before we get to the question of how such undefined measurements might be related to truth-telling. Or not.

How can I make you see how amazing this is? Suppose that in 1957 some physiologist had hypothesized that cancer cells have different membrane potentials from normal cells -- well, not different potentials, exactly, but a sort of a different mix of modulation frequencies in the variation of electrical potentials between the inside of the cell and the outside. And further suppose that some engineer cooked up a proprietary circuit to measure and display these alleged variations in "cellular stress" (to the eyes of a trained cellular stress expert, of course), and thereby to diagnose cancer, and started selling such devices to hospitals, and selling training courses in how to use them. And suppose that now, almost half a century later, there is still no documented, well-defined procedure for ordinary biomedical researchers to use to measure and quantify these alleged cell-membrane "tremors" -- but companies are still making and selling devices using proprietary methods for diagnosing cancer by detecting "cellular stress" -- computer systems now, of course -- while well-intentioned hospital administrators and doctors are occasionally organizing little tests of the effectiveness of these devices. These tests sometimes work and sometimes don't, partly because the cellular stress displays need to be interpreted by trained experts, who are typically participating in a diagnostic team or at least given access to lots of other information about the patients being diagnosed.

This couldn't happen. If someone tried to sell cancer-detection devices on this basis, they'd get put in jail.

But as far as I can tell, this is essentially where we are with "voice stress analysis."

The "National Institute for Truth Verification" offers a page giving "a partial list" of "studies validating voice stress analysis." These go back to work by Lippold and others, starting in the 1950s, that has claimed to identify a "microtremor" with a frequency of about 8-12 (sometimes 8-14) Hz., caused by reflex arcs in the motor system, whose intensity is modulated by stress. This is supposed to be a more general muscular phenomenon, but the "voice stress" applications depend on measuring the intensity of this tremor in the fundamental frequency (pitch) of the voice, caused by "microtremors" in the muscles of the larynx.

One aspect of this stuff that I've always found counter-intuitive is that stress is supposed to diminish these microtremors, not increase them. So it's not that your voice becomes more quavery when you're nervous or upset, rather that it (supposedly) becomes more steady.

Anyhow, I've tried, on and off for almost 30 years, to measure these microtremors, and I can't find them. I don't know any reputable speech researcher who can, at least not in any reproducible way. Shortly after I started work at Bell Labs in 1975, some people in my group took a look at these claims, since voice stress analysis offered an obvious way to prevent telephone credit card fraud. These folks ran into the basic problem right away -- they couldn't find these putative microtremors. This was not because they were stupid people who didn't know how to analyze the frequencies of the voice, believe me.

This is still the situation. Pick some unusual kind of signal processing like "modulation spectrum" and Google will easily find you dozens of pages with equations defining the concepts and source code implementing them. Pick a hard-to-analyze acoustical property of the voice, like "jitter and shimmer" in fundamental frequency (related to hoarseness and so on), and the same thing is true. Look for "microtremor" or "voice stress" and you'll find lots of pages discussing whether or not the methods work -- but nothing, as far as I can tell, telling you in mathematical or algorithmic detail what the methods really are, much less offering code that implements them.

There are some reasons to think that it ought to be hard to measure 8-14 Hz. "microtremors" in the fundamental frequency of ordinary speech, even if they exist. Frequencies of 8-14 Hz. correspond to periods between about 70 and 125 msec. But the durations of phonetic segments corresponding to consonants and vowels overlap that same range -- typically 50 to 250 msec. And most such segments involve vocal gestures that strongly modulate the pitch of the voice, through changes in supralaryngeal impedance, changes in voicing, or the effects of stress, intonation and/or tone. So you're looking for a low-amplitude modulation of a signal that's simultaneously being subjected to a highly variable (because information-bearing) high-amplitude modulation in the same frequency range. Not impossible, but hard.
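For anyone who wants to check the arithmetic, here is a minimal Python sketch. It only converts the claimed 8-14 Hz. band into periods and compares that range with the typical segment durations quoted above. The function and variable names are mine, chosen for illustration, and nothing here measures an actual voice.

```python
# Back-of-the-envelope check of the ranges quoted in the text.
# All names are illustrative; no speech signal is analyzed here.

def period_ms(freq_hz):
    """Period in milliseconds of a modulation at freq_hz."""
    return 1000.0 / freq_hz

def ranges_overlap(a, b):
    """True if closed intervals a=(lo, hi) and b=(lo, hi) intersect."""
    return a[0] <= b[1] and b[0] <= a[1]

# Claimed microtremor band, in Hz.
tremor_band_hz = (8.0, 14.0)

# The highest frequency gives the shortest period, so the order flips.
tremor_periods_ms = (period_ms(tremor_band_hz[1]), period_ms(tremor_band_hz[0]))

# Typical consonant and vowel durations, in msec, as quoted in the text.
segment_durations_ms = (50.0, 250.0)

print("tremor periods: %.1f to %.1f ms" % tremor_periods_ms)
print("overlaps segment durations:",
      ranges_overlap(tremor_periods_ms, segment_durations_ms))
```

The tremor periods fall entirely inside the range of segment durations, which is the whole problem: the segmental modulation lives in the same band as the thing you're trying to find.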

The best technical overview that I know is a 1996 special issue of Speech Communication on "Speech under Stress" (vol. 20, issues 1-2, pp. 1-175). The most relevant article in that issue is Robert Ruiz, Emmanuelle Absil, Bernard Harmegnies, Claude Legros and Dolors Poch, "Time- and spectrum-related variabilities in stressed speech under laboratory and real conditions" (pp. 111-129). They define an "index of microprosodic variation" which they dub μ -- for which they give an actual equation! -- and they show that it is affected by (situational) stress in both laboratory and real-world situations. Is this the long-sought technical validation of the microtremor theory?

In a word, no. This "index of microprosodic variation" is defined on individual vowel segments, as fc/((fi+ff)/2), where fc is the pitch in the center of the vowel, fi is the initial pitch of the vowel, and ff is the final pitch of the vowel. In other words, it's simply the ratio of the pitch in the middle of the vowel to the average of the starting and ending pitch. This will tend to be higher under conditions of higher "vocal effort" -- high vocal cord tension, high subglottal pressure, etc., as an elementary consequence of the physical mechanisms involved in vocal cord vibration. It's got zilch to do with "microtremors" putatively caused by a motor system reflex arc, and putatively modified by stress-induced changes in feedback strength. And increases in μ would be caused by lots of things other than stress -- talking more loudly because your listener is farther away, or because of higher background noise, for example.
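The index is simple enough to write down in a few lines. The sketch below follows the definition as summarized above, fc/((fi+ff)/2). The function name and the pitch values are hypothetical, made up for illustration, and are not taken from the Ruiz et al. paper.

```python
# Sketch of the "index of microprosodic variation" as summarized in the text:
# center pitch divided by the mean of the initial and final pitch of a vowel.
# The function name and the example pitch values are hypothetical.

def microprosodic_index(f_initial, f_center, f_final):
    """Return fc / ((fi + ff) / 2) for one vowel segment (pitches in Hz)."""
    edge_mean = (f_initial + f_final) / 2.0
    if edge_mean <= 0:
        raise ValueError("pitch values must be positive")
    return f_center / edge_mean

# A vowel with perfectly flat pitch gives an index of exactly 1.
flat = microprosodic_index(120.0, 120.0, 120.0)

# A vowel whose pitch bulges upward in the middle gives an index above 1.
arched = microprosodic_index(120.0, 132.0, 120.0)

print("flat vowel:  ", flat)
print("arched vowel:", arched)
```

Note that nothing in this calculation looks at modulation in the 8-14 Hz. range. It takes three pitch samples per vowel and reports how much the middle one sticks up.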

So I'm still waiting. There's some good test data out there. The Linguistic Data Consortium has published a database of "Speech Under Simulated and Actual Stress (SUSAS)", collected by John Hansen. If someone will send me an equation, an algorithm, or some Matlab code, I'll be happy to try it out -- it would just take a few days to do some initial tests of plausibility -- and if it works, I'll sing its praises.

I'm not prejudiced against the "microtremor" theory -- I'd love to have another measurement dimension for speech analysis. I'm not prejudiced against "lie detector" technology -- if there's a way to get some useful information by such techniques, I'm for it. I'm not even opposed to using the pretense that such technology exists to scare people into not lying, which seems to me to be its main application these days. But when a theory about quantitative measurements of frequency-domain effects in speech has been around for half a century, and no one has ever published an equation, an algorithm or a piece of code for making these measurements, and willing and competent speech researchers (like me) can't create reliable methods for making such measurements from the descriptions we find in the literature... something is wrong.

But maybe the techniques are being kept secret to preserve competitive advantage, and really work anyhow? That's not the way these things are supposed to happen, but it is possible in principle. However, the "National Institute for Truth Verification" does not list on its page the more negative results, like this 2003 study, which tested the Vericator (TM) voice stress analyzer in a test to find (people pretending to be) smugglers at two mock border checkpoints. The study was done by the Department of Defense Polygraph Institute, not likely in principle to be an outfit hostile to such technologies, and found that the miss rate was about 85% (50 of 59 smugglers were missed), while the false alarm rate was about 12% (13 of 111 non-smugglers were falsely flagged). More to the point, the rate at which candidates were identified as "smugglers" was quite similar whether in fact they were in that category (9 of 59, 15%) or were not in that category (13 of 111, 12%).
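Here is the same arithmetic as a small Python sketch, using only the counts quoted above. The variable names are mine. The last figure, a ratio of the two flagging rates (sometimes called a positive likelihood ratio), is my addition rather than something reported in the study, included only as one way to summarize how similar the two rates are.

```python
# Rates computed from the counts quoted in the text (mock-smuggler study).
# Variable names are illustrative; the final ratio is an added summary
# statistic, not a number reported by the study itself.

smugglers_total = 59
smugglers_flagged = 9          # so 50 of 59 were missed
non_smugglers_total = 111
non_smugglers_flagged = 13

miss_rate = (smugglers_total - smugglers_flagged) / smugglers_total
hit_rate = smugglers_flagged / smugglers_total
false_alarm_rate = non_smugglers_flagged / non_smugglers_total

# How much more often a real "smuggler" gets flagged than a non-smuggler.
# A device with no diagnostic value would give a ratio of 1.
flag_rate_ratio = hit_rate / false_alarm_rate

print("miss rate:        %.0f%%" % (100 * miss_rate))
print("hit rate:         %.0f%%" % (100 * hit_rate))
print("false alarm rate: %.0f%%" % (100 * false_alarm_rate))
print("flag rate ratio:  %.2f" % flag_rate_ratio)
```

The ratio comes out only a little above 1, which is another way of saying that being flagged tells you very little about whether someone was in the "smuggler" group.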

As I said, if you tried to sell cancer diagnosis equipment on the basis of (non double blind) clinical trial results like that, you'd be in trouble with the law.

[Note to fellow-linguists: if you haven't already figured it out, we're talking here about "stress" as in "I'm so stressed about my job interview", not "stress" as in "giraffe is stressed on the second syllable".]

 

Posted by Mark Liberman at July 2, 2004 07:44 AM
Comments

Are there any studies which show that humans can accurately determine who is stressed and who is not, using only audio cues? If so, then the search must go on (although the answer could lie not in phonetics but elsewhere). If not, then the endeavor is pointless.

I am reminded a bit of the attempt by linguists to identify what features of African American English cause the ethnicity of black speakers to be identified correctly over 90% of the time. As far as I understand, the evidence suggests that this mainly has to do with intonation; still, though, linguists are trying to figure out exactly what's going on. But the phenomenon is definitely real.

Posted by: Edward Garrett at July 2, 2004 08:46 AM

Unfortunately, many people are deeply uncomfortable with fuzzy probabilities. They require numbers that are balder, that decisively point to true/false. So the fact that this technology, fascinating though it may be, is just one data point for this particular problem (and a weak one, judging by decades of non-results) will elude them.

I'm curious, what of those of us who stutter, where the speech act itself is a stressful event? Wouldn't normal stuttering patterns (laryngospasms with subsequent increased vocal effort) skew the results? I'd hate to lose my liberty because some overzealous fool couldn't tell the difference between a probability and a person.

Posted by: Rick at July 2, 2004 11:20 AM