July 22, 2007

Two simple numbers

OK, after years of complaining about the darkness of science journalism, I'm lighting a candle. I've found a cure.

Not, I'm afraid, a cure for the whole syndrome of credulousness, carelessness and misreading. I can't even pretend that my remedy will have any effect on the underlying causes, which are ignorance of science and the motive of sensationalism. My medicine will only provide symptomatic relief, and only in a specific class of cases, those where scientists (or their press agents) claim to have found "the genetic basis of X".

But given the new technologies and social policies facilitating genome-wide association studies, there are going to be a lot of these stories over the next few years. My medicine should be easy for journalists to swallow, and easy for the public to understand. And later, we'll add other simple remedies for other common kinds of science stories (like effect sizes in group-difference studies).

Today's prescription is a trivial rule of scientific rhetoric. When there's a claim that some genomic variant is associated with some phenotypic trait -- whether it's breast cancer or homosexuality or conservatism or stuttering -- we need to know four simple numbers. Specifically: (A) the number of "case subjects" in the study (people with the trait in question); (B) the number of "control subjects" in the study; (C) the proportion of the case subjects with the genomic variant in question; and (D) the proportion of the controls with the genomic variant in question.

If four numbers are too many, leave out (A) and (B), as long as they're not really small. But stick with (C) and (D) -- they're the medicine that really does the work here.

Let's do a little experiment of our own, with respect to the reporting of two recent genome-wide association studies: Juliane Winkelmann et al., "Genome-wide association study of restless legs syndrome identifies common variants in three genomic regions", Nature Genetics, published online July 18, 2007; and H. Stefansson et al., "A Genetic Risk Factor for Periodic Limb Movements in Sleep", N Engl J Med, published online July 18, 2007.

The striking thing about these studies is that they went on genetic fishing expeditions in several different populations -- Icelanders, Germans, Americans, French Canadians -- and found (in part) the same thing. By "fishing expeditions" I mean that they used microarray techniques to trawl for a lot of genomic variants -- 500,568 single nucleotide polymorphisms (SNPs) in one case, and 311,388 SNPs in the other. And they zeroed in on (different) SNPs in the same gene, BTBD9, whose etymology I discussed the other day. (They also found some SNPs in different genes that weren't replicated across studies, but never mind that.) When you combine this replication of results with the eminence of the researchers and the reputation of the journals, you can be sure that there's really some substance here.

OK, go off now and read some of the reporting of these results in the popular press. You could read the treatment that ABC News gave it (Denise Dador, "Restless Leg Syndrome Found to Be Genetic"), or the New York Times (Nicholas Wade, "Scientists Find Genetic Link for a Disorder (Next, Respect?)"), or Scientific American (Gene Emery, "Studies find gene linked to night leg movement"). Or (if you're reading this in July or early August of 2007) you can take your pick from what Google News returns from a search string like { "restless legs"}.

On balance, I think the stories that I've read about these studies are mostly pretty good, presenting a complex picture in a clear and fair way. But in none of the stories (at least among those that I've read so far) do we get the numbers that I claim are critical to understanding the real meaning and impact of this work.

So, on the basis of your popular-press readings, please now guess what those crucial numbers (C) and (D) are. What proportion of RLS sufferers (the case subjects) were found to have a variant form of BTBD9? What proportion of the general population (the control subjects) had the same variant?

To make the experiment fairer, you could ask one of your colleagues to read one of the popular-press stories, and then ask them to guess the numbers. Tell them it's part of a "wisdom of crowds" experiment.

I've tried this with a couple of random local acquaintances, and gotten guesses like "50% and 5%" or "75% and 10%" or "30% and 3%". I'd guess that less scientifically-sophisticated people might conclude from the popular-press coverage that people with the disease all have the key allele, while people without the disease don't (though the stories don't say any such thing, that's a common understanding of what "the genetic basis of X" means).

The numbers are in fact reported in the scientific articles that I referenced and linked above. They are:

Study       Population               Allele in SNP  Case N       Case proportion  Control N      Control proportion
Stefansson  Icelanders (combined)    rs923809       429          0.774            16,866         0.656
Stefansson  Americans                rs923809       188          0.766            662            0.681
Winkelmann  Germans (2) + Canadians  rs9296249      401+903+128  0.838            1,644+891+287  0.765

(Winkelmann actually gives the proportions as MAF ("minor allele frequency"), so the case subjects were at 0.162 and the controls at 0.235. I've subtracted these from 1.0 in order to make the numbers more comparable to Stefansson's.)
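For anyone who wants to check that subtraction, here is a minimal sketch in Python. The two frequencies are the ones quoted above; the names `maf`, `complement` and `major` are just illustrative, and the sketch assumes a SNP with exactly two alleles, so that the two frequencies sum to 1.

```python
# Minor allele frequencies quoted above for the Winkelmann et al. subjects.
maf = {"cases": 0.162, "controls": 0.235}

def complement(frequency: float) -> float:
    """Frequency of the other allele, assuming a two-allele SNP."""
    return round(1.0 - frequency, 3)

# Convert each minor allele frequency to the frequency of the other allele.
major = {group: complement(value) for group, value in maf.items()}
print(major)  # {'cases': 0.838, 'controls': 0.765}
```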

I doubt that many readers of the popular press accounts of these studies will guess that "genetic basis" means genomic variants that occur at 66% (or 77%) in the asymptomatic general population, vs. 77% (or 84%) in patients with a diagnosis of RLS confirmed by periodic leg movements found in sleep studies with an accelerometer on the ankle.
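To put a size on those gaps, here is a rough sketch that takes the (C) and (D) proportions straight from the table and computes two summaries: the difference in percentage points, and an odds ratio. The odds ratio is my own choice of a conventional summary, calculated naively from the rounded proportions; it is not a figure taken from either paper, and the function names are made up for the illustration.

```python
# (label, proportion in case subjects, proportion in control subjects),
# copied from the table above.
rows = [
    ("Stefansson, Icelanders (combined)", 0.774, 0.656),
    ("Stefansson, Americans", 0.766, 0.681),
    ("Winkelmann, Germans (2) + Canadians", 0.838, 0.765),
]

def gap_points(case: float, control: float) -> float:
    """Case proportion minus control proportion, in percentage points."""
    return (case - control) * 100

def odds_ratio(case: float, control: float) -> float:
    """Odds of the variant among cases divided by odds among controls."""
    return (case / (1 - case)) / (control / (1 - control))

summary = [(label, gap_points(c, d), odds_ratio(c, d)) for label, c, d in rows]

for label, gap, ratio in summary:
    print(f"{label}: {gap:.1f} points, odds ratio {ratio:.2f}")
```

On these figures the gaps come out at roughly 7 to 12 percentage points, and the naive odds ratios at somewhere between about 1.5 and 1.8.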

(Nor would they guess that about 35% of the RLS diagnoses were not confirmed by the accelerometer tests, or that the genetic scan did not have significant results if the accelerometer tests were not used to trim the set of case subjects. But that's a different set of problems.)

This is not to say that RLS is not real, or that these genomic variants are not somehow connected with it -- or more likely, connected with things that are connected with things that are connected with it. But there's a big difference between numbers like these and what most readers will conclude from phrases like "discovery of the genetic basis of the disorder" (a phrase used in Nicholas Wade's NYT article).

So the next time you see an article in the popular press about "the genetic basis of X" or "the gene for Y" (and surely you will!), look for the case-subject and control-subject percentages. If you don't see them (and probably you won't!), write to the editor and complain.

Posted by Mark Liberman at July 22, 2007 10:09 AM