Feature #522
[closed] Add Polyphen 2 predictions to GET-Evidence
Description
We currently use the BLOSUM100 score to identify variants as being "disruptive", but there are algorithms out there that are specialized for making a computational prediction of pathogenic (or otherwise phenotypic) effect, e.g. Polyphen and SIFT. Some connections with the Sunyaev lab make Polyphen a good candidate for use in GET-Evidence, and the Polyphen 2 data is entirely downloadable:
http://genetics.bwh.harvard.edu/pph2/dokuwiki/downloads
In particular, "PolyPhen-2 annotations for whole human proteome sequence space (WHPSS) build 3" contains all possible amino acid changes caused by single base substitutions.
WARNING: THE FILES ARE TARBOMBS. Move them into a new directory before extracting.
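One way to sidestep the tarbomb problem is to script the extraction so it always lands in a fresh directory. A minimal sketch, assuming Python's standard `tarfile` module; the function name and the paths passed to it are hypothetical, not part of GET-Evidence:

```python
import tarfile
from pathlib import Path


def extract_into_new_dir(archive_path, dest_dir):
    """Extract a tar archive into a directory created just for it.

    Hypothetical helper: a tarbomb then spills its files into dest_dir
    instead of whatever directory we happen to be working in.
    """
    dest = Path(dest_dir)
    # exist_ok=False: refuse to mix the archive contents into an existing directory.
    dest.mkdir(parents=True, exist_ok=False)
    root = dest.resolve()
    with tarfile.open(archive_path) as tar:  # default mode detects compression
        for member in tar.getmembers():
            target = (dest / member.name).resolve()
            # Reject members that would escape dest (e.g. "../x" or absolute paths).
            if target != root and root not in target.parents:
                raise ValueError(f"unsafe path in archive: {member.name}")
        tar.extractall(dest)
    # Return the top-level names so the caller can see what was unpacked.
    return sorted(p.name for p in dest.iterdir())
```

The call fails if the destination already exists, which is intentional: the point is that nothing else should be sitting next to the extracted files.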
Even though there are licensing issues with the code itself, we think integrating the downloadable dataset should be okay.
How should it be incorporated? Not sure.
The file is huge (1.6GiB), so it seems like a bad idea to require it in every instance of GET-Evidence; maybe only the production server should have it. We could create a script that regularly checks GET-Evidence for variants with amino acid changes that are missing Polyphen 2 data and updates them; this script would not run on most instances of GET-Evidence.
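The core of such an update script could look roughly like the following. This is only a sketch: the variant records are plain dicts with made-up field names (`protein_id`, `position`, `ref_aa`, `alt_aa`, `polyphen2`), and `lookup` stands in for whatever ends up reading the Polyphen 2 download, whose file layout is not assumed here.

```python
def fill_missing_polyphen(variants, lookup):
    """Add a Polyphen 2 score to every variant that lacks one.

    variants: list of dicts with hypothetical keys protein_id, position,
              ref_aa, alt_aa and (optionally) polyphen2.
    lookup:   callable (protein_id, position, ref_aa, alt_aa) -> score,
              or None when the dataset has no entry for that change.
    Returns the number of variants that were updated.
    """
    updated = 0
    for variant in variants:
        if variant.get("polyphen2") is not None:
            continue  # already annotated, leave it alone
        score = lookup(
            variant["protein_id"],
            variant["position"],
            variant["ref_aa"],
            variant["alt_aa"],
        )
        if score is not None:
            variant["polyphen2"] = score
            updated += 1
    return updated
```

Keeping the dataset access behind `lookup` means the same loop would work whether the scores come from the full download on the production server or from a smaller file elsewhere.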
Unfortunately, if we want to prioritize insufficiently evaluated variants by autoscore, and a variant is not yet in GET-Evidence, we won't be able to use the Polyphen score in the autoscoring. Maybe we could have some backup behavior using the BLOSUM score. Maybe installations could default to using the dbSNP version ("PolyPhen-2 annotations for dbSNP build 131"), which is only 16MB?
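The backup behavior could be as simple as trying each source in priority order and recording which one answered. A rough sketch, with each source modeled as an in-memory dict; the source names and the key format are placeholders, not an existing GET-Evidence interface:

```python
def first_available_score(key, sources):
    """Return (source_name, score) from the first source that knows the key.

    key:     whatever identifies an amino acid change (hypothetical format).
    sources: list of (name, mapping) pairs in priority order, e.g. the full
             Polyphen 2 data, then the dbSNP subset, then BLOSUM100.
    Returns (None, None) when no source has an entry.
    """
    for name, table in sources:
        score = table.get(key)
        if score is not None:
            return name, score
    return None, None
```

Returning the source name alongside the score matters because Polyphen and BLOSUM scores are on different scales, so autoscoring would need to know which one it was handed.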
Note: If I recall correctly, the IDs for genes in their data are UniProt IDs, and I think they are also in knownGene.txt.