Language Log: A gene by any other name?

December 11, 2007

According to David Biello, "Culture Speeds Up Human Evolution", Scientific American, 12/10/2007:

Homo sapiens sapiens has spread across the globe and increased vastly in numbers over the past 50,000 years or so—from an estimated five million in 9000 B.C. to roughly 6.5 billion today. More people means more opportunity for mutations to creep into the basic human genome and new research confirms that in the past 10,000 years a host of changes to everything from digestion to bones has been taking place.

"We found very many human genes undergoing selection," says anthropologist Gregory Cochran of the University of Utah, a member of the team that analyzed the 3.9 million genes showing the most variation. "Most are very recent, so much so that the rate of human evolution over the past few thousand years is far greater than it has been over the past few million years." [emphasis added]

Analyzing 3.9 million human genes, no matter how much variation they show, would indeed be an amazing feat -- the current standard estimate is that humans have about 20,000-25,000 genes altogether ("How many genes are in the human genome?", Human Genome Project Information site).

The University of Utah press release ("Are Humans Evolving Faster?", 12/10/2007) clarifies what you've probably already guessed, namely that the researchers analyzed not genes but "single nucleotide polymorphisms" (SNPs):

The study looked for genetic evidence of natural selection - the evolution of favorable gene mutations - during the past 80,000 years by analyzing DNA from 270 individuals in the International HapMap Project, an effort to identify variations in human genes that cause disease and can serve as targets for new medicines.

The new study looked specifically at genetic variations called "single nucleotide polymorphisms," or SNPs (pronounced "snips") which are single-point mutations in chromosomes that are spreading through a significant proportion of the population.

Imagine walking along two chromosomes - the same chromosome from two different people. Chromosomes are made of DNA, a twisting, ladder-like structure in which each rung is made of a "base pair" of amino acids, either G-C or A-T. Harpending says that about every 1,000 base pairs, there will be a difference between the two chromosomes. That is known as a SNP.

Data examined in the study included 3.9 million SNPs from the 270 people in four populations: Han Chinese, Japanese, Africa's Yoruba tribe and northern Europeans, represented largely by data from Utah Mormons, says Harpending.

But here's a problem. The University of Utah press release says that the study was "published online Monday, Dec. 10 in the journal Proceedings of the National Academy of Sciences". But it wasn't. It's not in the Dec. 11 issue; nor was it published online in the PNAS Early Edition for Dec. 10, or Dec. 11, or any other previous date, as far as I can tell.

I've complained about this before (for example, "PNAS embargo policies considered annoying", 7/24/2007). The result of this policy by PNAS -- and I suppose it's some combination of inefficiency and thick-headedness rather than malicious intent -- is that interesting papers like this one are first exposed to the public, for as much as a week, only via press releases in which DNA base pairs are amino acids, and stories by journalists who can't be counted on to know the difference between genes and SNPs.

[Update -- we can also learn about the paper on the weblog of one of the authors, John Hawks, here and here. But it's still a scandal, in my opinion, for PNAS to impose an embargo, and then end it as much as a week before general readers can get access to the paper.]

[Michael Watts writes that a preprint of John Hawks et al., "Recent acceleration of human adaptive evolution", can be found on the web site of the University of Utah anthropology department -- no thanks to PNAS.]

[Update 12/13/2007 -- Stephen Powell points out that the Sciam article now has a footnoted correction reading:

*This article wrongly characterized the HapMap genotype dataset used for this analysis as "genes" rather than "DNA sequences."

But this is still not quite right, since SNPs are not DNA sequences, they're "single nucleotide polymorphisms", i.e. substitution of a single nucleotide in a DNA sequence.

And the body of the article now reads:

...the team that analyzed the 3.9 million DNA sequences* showing the most variation.

But as far as I can tell, they did not choose 3.9 million SNPs showing the most variation -- they simply started with a database of 3.9 million SNPs, period -- because that's all there is, at the moment -- and based their analysis on the distribution of those variants.
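To make the distinction between a DNA sequence and a SNP concrete, here is a minimal Python sketch. The two short sequences and the function name `find_snps` are invented for illustration and have nothing to do with the HapMap data or the methods of Hawks et al. Strictly speaking, a SNP is a site that varies across a population, so comparing just two aligned sequences only approximates the idea: each reported item is a single position and its alternative nucleotides, not a stretch of sequence.

```python
def find_snps(seq_a, seq_b):
    """Return (position, base_a, base_b) for each site where two aligned sequences differ.

    Assumes the sequences are already aligned and of equal length.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return [(i, a, b) for i, (a, b) in enumerate(zip(seq_a, seq_b)) if a != b]


# Two made-up 12-base stretches of "the same chromosome" from two people.
person_1 = "ATGCCGTAAGCT"
person_2 = "ATGCCGTGAGCT"

print(find_snps(person_1, person_2))  # -> [(7, 'A', 'G')]
```

The sequences are the things being compared; the single-nucleotide difference at position 7 is the thing that gets counted.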

Meanwhile, Empty Pockets wrote to point out to me that the University of Utah press release makes just as big a boo-boo:

The University of Utah press release that you quote in your recent post, "A gene by any other name?" refers to a DNA base pair as being made of "amino acids." Perhaps it was not the best example to cite as a clarification of Scientific American's confusion!
FYI, this mistake is (surprisingly, to me) not uncommon, for example here and here.
In fact, I've reached the point where I am more astonished when I find a story about genes, DNA, and proteins that gets all the terminology right than I am when I find one that makes a mistake! Just this week, for example, in the NYT, a short article got things exactly backwards by confusing the function of a gene with the consequences of loss-of-function mutations in that gene:
http://www.nytimes.com/2007/12/11/health/research/11lab.html "BRCA1, the scientists reported online Sunday in Nature Genetics, prevents PTEN from doing its work."
(In fact, it is just the opposite -- loss of BRCA1 prevents PTEN from doing its work. I won't get started on the NYT's causative interpretation of what are mostly correlative data...)
Your larger point about preprint embargoes is well appreciated.

Yes, I noticed the curious allusion in the press release to "a twisting, ladder-like structure in which each rung is made of a 'base pair' of amino acids", and put a snarky comment about it into a review of the rest of the MSM coverage of this research, which I wound up deleting because I didn't have either the time or the energy to finish it. The state of science reporting is depressing, it really is.

As for the PNAS "fuzzy embargo", it's now 9:00 on Thursday evening. The "embargo" on the Hawks et al. paper was lifted Monday morning -- and the paper is still not available on the PNAS "early edition" web site. This is apparently always the pattern, and I have to believe that it's a deliberate policy, though what PNAS thinks it's accomplishing by this policy is entirely opaque to me.]

Posted by Mark Liberman at December 11, 2007 08:22 PM