If you're working with next-generation sequencing (NGS) human data, chances are at some point you will be interested in automatically determining which of your sequence variants are more likely to have deleterious effects. A first step is often to focus on missense single nucleotide variants (SNVs), i.e. substitutions of a single nucleotide that result in a different amino acid. Indeed those are disproportionately deleterious compared to other variants [MacArthur et al., 2012]. In addition, you can filter out common variants, which are presumably less likely to be deleterious. But that's still a lot of variants to contend with, and that's where SNV deleteriousness prediction comes into play.
There are many tools (see this list at OMICtools) that are dedicated to the problem of predicting whether a missense SNV is deleterious (a.k.a. pathogenic, damaging, disease-causing) or neutral (a.k.a. tolerated, benign, non-damaging). Some, such as SIFT, are based on sequence conservation, under the premise that disrupting a highly conserved sequence will be more damaging. Others, like PolyPhen-2, try to assess the effect of amino acid changes on protein structures. CADD mixes several types of genomic information. And a few tools, such as Condel, combine the outputs of other tools.
Back in 2012, we set out with the following question: given the simplicity of current prediction methods (compared to the complex machine learning models that we usually work with), couldn't we come up with better annotation tools than what was out there? We started toying around with a few ideas, and soon enough had to wonder how exactly to validate the methods we were proposing. So we started investigating the state of the art and benchmark data sets in more detail... and we fell down the rabbit hole.
Our story, which we just published in Human Mutation, is in essence very simple. It boils down to one of the basic commandments of machine learning: Thou shalt not test on your training set. In other words, if you evaluate your prediction tool on the same data that was used to build it, you'll have no idea whether it's any good on new data or not (the underlying problem is typically referred to as overfitting). To take an extreme example, if your algorithm looks up in a file the hard-coded values it should return, it will perform perfectly on the variants that are in this file, and be utterly unable to make predictions for other variants (which, presumably, is the interesting part).
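The extreme example above can be sketched in a few lines of Python. This is a toy illustration, not anything taken from the paper: the variant identifiers, the labels, and the function names (`lookup_predict`, `accuracy`) are all made up for the occasion.

```python
# Toy illustration: a "predictor" that only memorizes its training data.
# All variant identifiers and labels below are invented.
training_set = {
    "GENE1:p.A10V": "deleterious",
    "GENE1:p.R25H": "deleterious",
    "GENE2:p.G7S": "neutral",
}


def lookup_predict(variant):
    """Return the memorized label, or None if the variant was never seen."""
    return training_set.get(variant)


def accuracy(labelled_variants):
    """Fraction of variants whose prediction matches the true label."""
    correct = sum(
        lookup_predict(variant) == label
        for variant, label in labelled_variants.items()
    )
    return correct / len(labelled_variants)


# Variants the predictor has never seen (again, invented).
new_variants = {"GENE3:p.L3P": "deleterious", "GENE4:p.T9M": "neutral"}

train_accuracy = accuracy(training_set)  # evaluated on the training data
new_accuracy = accuracy(new_variants)    # evaluated on unseen variants
print(train_accuracy, new_accuracy)
```

Evaluated on its own training data, the lookup table looks flawless; on variants it has never seen, it has nothing to say.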
Put like that, it sounds rather obvious. However, the community has painted itself into a corner where it's becoming really difficult, if not downright impossible, to properly compare deleteriousness prediction tools.
The first reason is that the publicly available benchmark data sets typically used for evaluating tools overlap with the databases used to build some of these tools. Others have pointed this out before us, and endeavored to develop independent benchmark data sets [Nair and Vihinen, 2013]. However, there can still be some overlaps (mainly in neutral variants). Furthermore, not all authors disclose the variants they used to build their tools. It is impossible to guarantee that an evaluation data set does not contain some of these variants, and hence to guarantee fairness when comparing these tools against others.
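When the training variants of a tool are disclosed, checking for this first kind of overlap is conceptually simple. Here is a minimal sketch, assuming variants are identified by a (chromosome, position, reference allele, alternate allele) tuple. That representation, the made-up data, and the helper name `remove_overlap` are my assumptions, not the format used in the paper's scripts.

```python
# Minimal sketch: find and remove benchmark variants that were also used
# to train a tool. Variants are (chrom, pos, ref, alt) tuples; all values
# here are invented for illustration.
training_variants = {
    ("chr1", 1000, "A", "G"),
    ("chr2", 500, "C", "T"),
}

benchmark = {
    ("chr1", 1000, "A", "G"): "deleterious",  # also in the training set
    ("chr3", 42, "G", "A"): "neutral",
}


def remove_overlap(benchmark, training_variants):
    """Keep only the benchmark variants absent from the training set."""
    return {
        variant: label
        for variant, label in benchmark.items()
        if variant not in training_variants
    }


overlap = set(benchmark) & training_variants
clean_benchmark = remove_overlap(benchmark, training_variants)
print(len(overlap), "overlapping variant(s);", len(clean_benchmark), "kept")
```

Of course, this only works for tools whose training variants are known, which is precisely the problem.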
The second reason is more subtle. It turns out that, in an overwhelming majority of cases, when one variant of a gene is annotated, all other variants of that gene that are available in the database also have the same annotation. This is due to the way these data sets are put together and does not necessarily reflect biological reality. However, this means that you can very efficiently leverage the annotation of other SNVs in the same gene to build what will appear to be a very accurate tool; but there is no guarantee that this tool will perform well on new variants. The evaluation of such tools (e.g. FatHMM in its weighted version, as well as tool combinations such as the latest version of Condel) is heavily biased by this phenomenon.
Our paper demonstrates the negative effects of these two types of circularity (so called because they lead to relying on somewhat circular reasoning to draw conclusions about the performance of the tools). In fact, these effects are so pervasive that we found it impossible to draw any definite conclusion on which of the twelve tools we tested outperforms the others. Note that in most of the cases where we can measure performance without these biases, we obtain accuracies that are significantly worse than usually reported.
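To make this second effect concrete, here is a toy sketch of a "predictor" that knows nothing about the variant itself and simply returns the majority label of the other variants in the same gene. The data and the function name `gene_vote` are invented, and this is a deliberately crude caricature, not how FatHMM or Condel actually work.

```python
from collections import Counter

# Invented database in which every gene carries a single type of label,
# mimicking the composition of many benchmark data sets.
database = [
    ("GENE1", "v1", "deleterious"),
    ("GENE1", "v2", "deleterious"),
    ("GENE1", "v3", "deleterious"),
    ("GENE2", "v4", "neutral"),
    ("GENE2", "v5", "neutral"),
]


def gene_vote(gene, variant, records):
    """Majority label among the OTHER variants of the same gene, or None."""
    labels = [
        label for g, v, label in records if g == gene and v != variant
    ]
    if not labels:
        return None  # no annotated variant in this gene: nothing to say
    return Counter(labels).most_common(1)[0][0]


# Leave-one-out evaluation inside the database looks excellent...
hits = sum(
    gene_vote(gene, variant, database) == label
    for gene, variant, label in database
)
within_db_accuracy = hits / len(database)

# ...but a variant in a gene absent from the database gets no prediction.
unseen_gene_prediction = gene_vote("GENE3", "v6", database)
print(within_db_accuracy, unseen_gene_prediction)
```

The held-out variant is never used to predict itself, so this is not the first type of circularity, and yet the apparent accuracy says very little about performance on new genes.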
So how can we move forward? In our opinion, releasing not only the data that were used for training the tools, but also the precise descriptors and algorithms used by each tool would be the best way to get out of this quandary: anyone could perform stratified cross-validations, and determine the best algorithm to be trained on the union of all available data, resulting in the best possible tool.
At the very least, authors should release which variants are in their data (even if they don't release their annotations), so that others can avoid circularity when comparing new methods to theirs. They should also guard as best they can against the second type of circularity we described. For this purpose, we recommend reporting accuracies for varying values of the relative proportions of pathogenic and neutral variants in the gene to which the SNV belongs.
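One way to read this recommendation is sketched below: compute, for each gene, the fraction of its variants that are pathogenic, bin variants by that fraction, and report accuracy per bin. The binning scheme, the made-up data, and the function name `accuracy_by_gene_composition` are my assumptions for illustration; the paper's scripts may well proceed differently.

```python
from collections import defaultdict

# Invented records: (gene, true_label, predicted_label),
# with 1 = pathogenic and 0 = neutral.
records = [
    ("GENE1", 1, 1), ("GENE1", 1, 1), ("GENE1", 1, 1), ("GENE1", 1, 0),
    ("GENE2", 0, 0), ("GENE2", 0, 0),
    ("GENE3", 1, 1), ("GENE3", 0, 1), ("GENE3", 1, 0), ("GENE3", 0, 0),
]


def accuracy_by_gene_composition(records, n_bins=4):
    """Accuracy per bin of the gene's pathogenic fraction.

    Bin b covers fractions in [b / n_bins, (b + 1) / n_bins); a fraction
    of exactly 1.0 is placed in the last bin.
    """
    labels_by_gene = defaultdict(list)
    for gene, truth, _ in records:
        labels_by_gene[gene].append(truth)
    fraction = {
        gene: sum(labels) / len(labels)
        for gene, labels in labels_by_gene.items()
    }

    hits = defaultdict(int)
    totals = defaultdict(int)
    for gene, truth, predicted in records:
        bin_index = min(int(fraction[gene] * n_bins), n_bins - 1)
        totals[bin_index] += 1
        hits[bin_index] += int(truth == predicted)
    return {b: hits[b] / totals[b] for b in sorted(totals)}


per_bin = accuracy_by_gene_composition(records)
print(per_bin)
```

A tool that mostly exploits gene-level annotation should look good in the extreme bins (purely neutral or purely pathogenic genes) and noticeably worse in the mixed ones.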
There are a few more questions that remain open.
Which transcript should be used when a tool requires features of the gene in which the SNV appeared? Others have used the transcript yielding the most deleterious score. In order to use the same transcript for each tool, we settled on the canonical transcript. The results we report weren't much affected by this choice, but I think it is a question worth considering.
More importantly, what do "deleterious", or "pathogenic", or "damaging" mean exactly? Different authors have different definitions, meaning that not all these tools set out to address exactly the same problem. How can you then compare them? Along those lines, we should also systematically disclose the source of evidence for annotations in the benchmark data sets (as is generally done, for example, in gene function prediction). Indeed, it is possible that some annotations themselves come from tool predictions, thereby artificially inflating the apparent performance of these tools.
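The two strategies are easy to contrast in code. The sketch below uses hypothetical transcript identifiers and scores, and assumes that a higher score means "more deleterious" (some tools, such as SIFT, use the opposite convention, so the comparison would need to be flipped for them).

```python
# Hypothetical per-transcript scores for a single variant and one tool.
# Assumption: higher score = more deleterious.
scores_by_transcript = {"T1": 0.20, "T2": 0.85, "T3": 0.40}
canonical_transcript = "T1"  # hypothetical choice of canonical transcript


def most_deleterious(scores):
    """Pick the transcript with the highest (most deleterious) score."""
    transcript = max(scores, key=scores.get)
    return transcript, scores[transcript]


def canonical(scores, canonical_id):
    """Always use the same, pre-agreed transcript."""
    return canonical_id, scores[canonical_id]


worst_case = most_deleterious(scores_by_transcript)
fixed_choice = canonical(scores_by_transcript, canonical_transcript)
print(worst_case, fixed_choice)
```

With the first strategy, the transcript actually used can differ from tool to tool for the same variant; with the second, every tool is scored on the same transcript.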
Finally, the whole field relies on the premise that some mutations are inherently more damaging than others, but I am expecting a lot of other factors, such as other variants, all sorts of environmental or clinical variables, and the specific disease you're interested in, to come into play. The fact that we report better-than-random performance shows there is some validity in this assumption, but how far can we really get? What is the best accuracy we can reach? And, given the rate at which we are accumulating knowledge about all possible missense SNVs, how long will it take before we have annotated all of them experimentally and do not require any predictive algorithm any more?
You can read the full story and find all the data and the Python scripts we used at Human Mutation: Dominik G. Grimm, Chloé-Agathe Azencott, Fabian Aicheler, Udo Gieraths, Daniel G. MacArthur, Kaitlin E. Samocha, David N. Cooper, Peter D. Stenson, Mark J. Daly, Jordan W. Smoller, Laramie E. Duncan, Karsten M. Borgwardt. The evaluation of tools used to predict the impact of missense variants is hindered by two types of circularity. Human Mutation, 2015. doi: 10.1002/humu.22768
Disclaimer: this blog note has been written by myself alone and reflects my personal take on this work and on the domain, which is not necessarily that of all co-authors.