Monday, August 25, 2008

The rise and fall of human genetics and the common variant - common disease hypothesis.

There is an enormous amount of positive press coverage for the Human Genome Project and its successor, the HapMap Project, even though within the field the euphoria that greeted the first results has already done a full 180 and been replaced by the hangover that inevitably follows such excesses.

For those of you not familiar with the history of this field and the controversies about its prognosis, which were present from the outset, I refer you to a review paper a colleague and I wrote back in 2000 at the height of the controversy - Nature Genetics 26, 151 - 157. The gist of the argument put forward for the HapMap project was the so-called common variant/common disease hypothesis (CV/CD), which proposed that "most of the genetic risk for common, complex diseases is due to disease loci where there is one common variant (or a small number of them)" [Hum Molec Genet 11:2417-23]. Under those circumstances it was widely argued that, using the technologies being developed for the HapMap project, one would be able to identify these genes using "genome-wide association studies" (GWAS), basically by scoring the genotype of each individual in a cross-sectional study at each of 500,000 to 1,000,000 marker loci - the argument being that if common variants explained a large fraction of the attributable risk for a given disease, one could identify them by comparing allele frequencies at nearby common variants in affected vs unaffected individuals. Researchers contested this point only with regard to how many markers you might have to study for this to work, if that model of the true state of nature applied. Many overly optimistic scientists initially proposed that 30,000 such loci would be sufficient, and when Kruglyak suggested it might take 500,000 such markers people attacked his models. Yet today the current technological platforms use 1,000,000 or more markers, with products in the pipeline to increase this even more, because it quickly became clear that the earlier models of regular and predictable levels of linkage disequilibrium were not realistic, something that should have been clear from even the most basic understanding of population genetics, or even from empirical data from lower organisms.
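To make the case-control comparison described above concrete, here is a minimal sketch in Python of an allelic association test at a single marker, together with the multiple-testing burden that comes with a genome-wide scan. It is an illustration only, not the analysis used in any of the studies cited here: the function names, the allele counts, and the choice of a plain Pearson chi-square test with a Bonferroni correction are all assumptions made for the example.

```python
import math


def allelic_chi_square(case_alt, case_ref, control_alt, control_ref):
    """Pearson chi-square test (1 degree of freedom) on a 2x2 table of
    allele counts: rows are cases/controls, columns are the two alleles.
    Returns (chi-square statistic, p-value)."""
    table = [[case_alt, case_ref], [control_alt, control_ref]]
    total = case_alt + case_ref + control_alt + control_ref
    row_sums = [sum(row) for row in table]
    col_sums = [case_alt + control_alt, case_ref + control_ref]

    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_sums[i] * col_sums[j] / total
            chi2 += (table[i][j] - expected) ** 2 / expected

    # For 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


def bonferroni_threshold(alpha, n_markers):
    """Per-marker significance level that keeps the family-wise error
    rate at alpha when n_markers independent tests are performed."""
    return alpha / n_markers


# Hypothetical allele counts at one marker (made up for illustration).
chi2, p = allelic_chi_square(case_alt=60, case_ref=40,
                             control_alt=40, control_ref=60)
threshold = bonferroni_threshold(0.05, 1_000_000)
print(f"chi-square = {chi2:.2f}, p = {p:.4g}")
print(f"per-marker threshold for 1,000,000 markers = {threshold:.1e}")
print("genome-wide significant" if p < threshold else "not genome-wide significant")
```

The point of the second function is that every added marker tightens the per-marker threshold, so a difference in allele frequency that looks convincing at one locus can be unremarkable once a million loci have been tested.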

Today such studies are widespread, having been conducted for virtually every disease under the sun, and yet the number of common variants with appreciable attributable fractions that have been identified is minuscule. Scientists have trumpeted such results as have been found for Crohn's disease, in which 32 genes were detected using panels of thousands of individuals genotyped at hundreds of thousands of markers - this sounds great until you start looking at the fine print, in which it is pointed out that all of these loci put together explain less than 10% of the attributable risk of disease, and for various well-known statistical reasons, this is a gross overestimate of the actual percentage of the variance explained. Most of these loci individually explain far less than half a percent of the risk, meaning that while this may be biologically interesting, it has no impact at all on public health, as most of the risk remains unexplained. This is the complete opposite of the CV/CD hypothesis as defined above. In fact, this is about the best case for any complex trait studied; in virtually every example dataset I have personally looked at, absolutely nothing was discovered at all.

At the beginning of the euphoria for such association studies, the "poster child" used to justify the proposal was the relationship between variation at the ApoE gene and risk of Alzheimer disease. In an impressively gutsy recent paper, a GWAS was performed in Alzheimer disease and published as an important result, with a title that sent me rolling on the floor in tears laughing: "A high-density whole-genome association study reveals that APOE is the major susceptibility gene for sporadic late-onset Alzheimer's disease" [ J Clin Psychiatry. 2007 Apr;68(4):613-8 ] - in an amazingly negative study they did not even have the expected number of false positive findings - just ApoE and absolutely nothing else... And the authors went on to describe how important this result was and claimed this means they need more money to do bigger studies to find the rest of the genes. Has anyone ever heard of stopping rules, or considered that maybe there aren't any common variants of high attributable fraction??? This was a claim that Ken Weiss and I put forward many times over the past 15 years, and Ken had been making this point for a decade before that, in his book, "Genetic variation and human disease", which anyone working in this field should read if they are not familiar with the basic evolutionary theory and empirical data which show why no one should ever have expected the CV/CD hypothesis to hold...

In many other fields, the studies that have been done at enormous expense have found absolutely nothing, and in what Ken Weiss calls a form of Western Zen (in which no means yes), the failure of one's research to find anything means they should get more money to do bigger studies, since obviously there are things to find but they did not have big enough studies with enough patients or enough markers - it could not possibly be that their hypotheses are wrong, and should be rejected... It is a truly bizarre world where failure is rewarded with more money - but when it comes to promising upper-middle-aged men (i.e. Congress) that they might not die if they fund our projects, they are happy to invest in things that have pretty much now been proven not to work...

In a truly bizarre propaganda piece, Francis Collins, in a parting sycophantic commentary (J Clin Invest. 2008 May;118(5):1590-605), claimed that the controversy about the CV/CD hypothesis was "... ultimately resolved by the remarkable success of the genetic association studies enabled by the HapMap project." He went on to list a massive table of "successful" studies, including loci for such traits as bipolar disorder, Parkinson disease and schizophrenia, and of course the laughable success of ApoE and Alzheimer disease. To be objective about these claims, let me quote what researchers studying those diseases had to say.

Parkinson disease: "Taken together, studies appear to provide substantial evidence that none of the SNPs originally featured as PD loci (sic from GWAS studies) are convincingly replicated and that all may be false positives...it is worth examining the implications for GWAS in general." Am J Hum Genet 78:1081-82

Schizophrenia: "...data do not provide evidence for involvement of any genomic region with schizophrenia detectable with moderate [sic 1500 people!] sample size" Mol Psych 13:570-84

Bipolar AND Schizophrenia: "There has been great anticipation in the world of psychiatric research over the past year, with the community awaiting the results of a number of GWAS's... Similar pictures emerged for both disorders - no strong replications across studies, no candidates with strong effect on disease risk, and no clear replications of genes implicated by candidate gene studies." - Report of the World Congress of Psychiatric Genetics.

Ischaemic stroke: "We produced more than 200 million genotypes...Preliminary analysis of these data did not reveal any single locus conferring a large effect on risk for ischaemic stroke." Lancet Neurol. 2007 May;6(5):383-4.

And the list goes on and on of traits for which nothing was found, with the authors concluding they need more money for bigger studies with more markers. It is really scary that people are never willing to let go of hypotheses that did not pan out. Clearly CV/CD is not a reasonable model for complex traits. Even the diseases where they claim enormous success do not fit the model - they get very small p-values for associations that confer relative risks of 1.03 or so - not "the majority of the risk" as the CV/CD hypothesis proposed.

One must recall that in the initial paper proposing GWAS by Risch and Merikangas (Science 1996 Sep 13;273(5281):1516-7) - a paper which, incidentally, pointed out that one always has more power for such studies when collecting families rather than unrelated individuals - the authors stated that "despite the small magnitude of such (sic: common variants in) genes, the magnitude of their attributable risk (the proportion of people affected due to them) may be large because they are quite frequent in the population (sic: meaning >>10% in their models), making them of public health significance." The obvious corollary of this is that if they are not quite frequent, they do NOT have a high attributable fraction and are therefore NOT of public health significance.
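The dependence of attributable risk on both frequency and effect size can be made explicit with a small calculation. The sketch below uses Levin's standard formula for the population attributable fraction, PAF = p(RR - 1) / (1 + p(RR - 1)), where p is the proportion of the population carrying the risk variant and RR is the relative risk in carriers. The formula is not taken from the Risch and Merikangas paper, and the function name and the scenario values are hypothetical, chosen only to illustrate the argument (the relative risk of 1.03 echoes the figure mentioned above).

```python
def attributable_fraction(carrier_frequency, relative_risk):
    """Levin's population attributable fraction.

    carrier_frequency: proportion of the population carrying the variant (0..1)
    relative_risk: risk in carriers divided by risk in non-carriers (> 0)
    """
    if not 0.0 <= carrier_frequency <= 1.0:
        raise ValueError("carrier_frequency must be between 0 and 1")
    if relative_risk <= 0.0:
        raise ValueError("relative_risk must be positive")
    excess = carrier_frequency * (relative_risk - 1.0)
    return excess / (1.0 + excess)


# Hypothetical scenarios (illustrative values, not estimates from any study).
scenarios = [
    ("common variant, tiny effect", 0.30, 1.03),
    ("common variant, modest effect", 0.30, 2.0),
    ("rare variant, large effect", 0.001, 10.0),
]
for label, freq, rr in scenarios:
    paf = attributable_fraction(freq, rr)
    print(f"{label}: frequency={freq}, RR={rr}, PAF={paf:.2%}")
```

With these made-up inputs, a common variant with a relative risk of 1.03 accounts for under 1% of cases no matter how small its p-value is, which is the gap between "statistically significant" and "of public health significance".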

And yet, you still have scientists claiming that the results of these studies will lead to a scenario in which "we will say to you, 'suppose you have a 65% chance of getting prostate cancer when you're 65. If you start taking these pills when you're 45, that percent will change to 2'" (Leroy Hood, quoted in the Seattle Post-Intelligencer). Amazing claims when the empirical evidence is clear that the majority of the risk of the majority of complex diseases is not explained by anything common across ethnicities, or common in populations... Francis Collins recently claimed that by 2020, "new gene-based designer drugs will be developed for ... Alzheimer disease, schizophrenia and many other conditions", and by 2010, "predictive genetic tests will be available for as many as a dozen common conditions". This does not jibe with the empirical evidence... In breast cancer, for example, researchers claimed that knowledge of the BRCA1 and BRCA2 genes (which confer enormously high risk of breast cancer to carriers) was uninteresting, as they had such a small attributable fraction in the population. Of course, now they have performed GWAS and examined tens of thousands of individuals and have identified several additional loci which, put together, have a much smaller attributable fraction than BRCA1 and BRCA2, yet they claim this proves how important GWAS is. Interesting how the arguments change to fit the data, and everything is made to sound as if it were consistent with the theory.

I suggest that people go back and read "How many diseases does it take to map a gene with SNPs?" Nature Genetics (2000) 26, 151 - 157. There are virtually no arguments we made in that controversial commentary 8 years ago which we could not make even stronger today, as the empirical data which have come up since then support our theory almost perfectly, and conclusively refute the CV/CD hypothesis, despite Francis Collins' rather odd claims to the contrary...

In the end, these projects will likely continue to be funded for another 5 or 10 years before people start realizing the boy has been crying wolf for a damned long time... This is a real problem for science in America, however, as NIH is spending big money on these rather non-scientific, technologically-driven, hypothesis-free projects at the expense of investigator-initiated, hypothesis-driven science. Even more tragically, training grants are enormously plentiful, meaning that we are training an enormous number of students and postdocs in a field in which there will never be job opportunities for them, even if things are successful. Hypothesis-free science should never be allowed to result in Ph.D. degrees if one believes that science is about questioning what truth is and asking questions about nature, while engineering is about how to accomplish a definable task (like sequencing the genome quickly and cheaply). The mythological "financial crisis" at NIH is really more a function of the enormous amounts of money going into projects that are predetermined to be funded by political appointees and government bureaucrats rather than by the marketplace of ideas through investigator-initiated proposals. Pouring enormous amounts of government funding into a small number of projects is a bad idea - one which began with Eric Lander's group at MIT proposing to build large factories for the sequencing of the genome rather than spreading it across sites, with the goal of getting it done faster (an engineering goal) instead of getting more sites involved so that perhaps better scientific research could have come along the way.
This has led to a scenario years later in which the factories now want to do science and not just engineering, which is totally contrary to their raison d'etre, and leads to further concentration of funding in a small number of hands, when science is perhaps better served by a larger number of groups receiving smaller amounts of money, so that more brains are working in different directions, thinking of novel and innovative ideas not reliant on pure throughput. Human genetics has been transformed from a field with low funding, driven by creative thinking, into a field driven by big money and sheep following whatever the shepherd du jour is telling them they should do (i.e. innovative means doing whatever the current trend is rather than something truly original and creative). This is bad for science, and is also bad science. GWAS has been successful technologically, and it has resoundingly rejected the CV/CD hypothesis through empirical data. If we accept this and move on, we can consign the HapMap and HGP to where they belong, the same scientific fate as the Supercollider, and get back to thinking instead of throwing money at problems that are fundamentally biological and not technological!


(most notably in terms of the big money NIH is sending into these non-scientific technologically-driven hypothesis-free studies, rather than investigator initiated hypothesis-driven science - one of the main causes of the "funding crisis" at NIH where a tiny portion of new grants are funded - get rid of the big science that is not working - like the supercollider! - and there is no funding crisis)

3 comments:

Unknown said...

Joe,

nice rambling again, but I do think that you should give the community some more time. Complex diseases are by definition complex and the vast information we're getting has been in most cases analysed in a very robust way, and I do believe that there are avenues to follow to do it better. You should be opening those avenues dude!

I seriously think that the jury is still out and we do not know how to use all the information we currently have. Also, even you cannot say that none of the findings has provided interesting biological information. However, while your point always is that none (here I do not agree ;-)) of the findings is of novel character but are linked to known pathways, I do argue that even so, the information about which genes light up in a genome-wide context is very interesting! Whether any findings will turn out to be of clinical importance remains to be seen. The a priori probability of that for current findings does not seem very high, I do agree, but patience, patience my friend!

Joseph D. Terwilliger said...

Well, nine years have passed.... My guess is that probably nobody disagrees with this today....