Interpret phased (haplotype) data
Sequencing companies do not currently generate haplotype information from individual genomes. They might someday, and there's no reason we can't start working on this now. Some haplotype information can be generated by combining data from a trio: a child and both parents. For example, we can use the NA19240, NA19238, and NA19239 genome data. Even if sequencing companies never produce haplotypes for an individual genome, researchers or clinicians may sequence all three individuals and determine some haplotype information on their own.
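To make the trio idea concrete, here is a minimal sketch of phasing a single site by Mendelian inheritance. It is not the code from any existing module; the function name `phase_child_genotype` and the tuple representation of genotypes are assumptions made for illustration.

```python
def phase_child_genotype(child, mother, father):
    """Phase one site of the child using both parents' genotypes.

    Each genotype is an unordered pair of alleles, e.g. ("A", "G").
    Returns (maternal_allele, paternal_allele), or None when the site
    cannot be phased (all three heterozygous, or the genotypes are not
    consistent with Mendelian inheritance).
    Hypothetical helper for illustration only.
    """
    a, b = child
    if a == b:
        # Homozygous sites carry the same allele on both copies.
        return (a, a)

    candidates = []
    if a in mother and b in father:
        candidates.append((a, b))
    if b in mother and a in father:
        candidates.append((b, a))

    # Exactly one consistent assignment means the site is phased.
    if len(candidates) == 1:
        return candidates[0]
    return None


# Mother can only have contributed "A", so "G" came from the father.
print(phase_child_genotype(("A", "G"), ("A", "A"), ("A", "G")))
# Everyone heterozygous: the trio alone cannot resolve the phase.
print(phase_child_genotype(("A", "G"), ("A", "G"), ("A", "G")))
```

Sites where all three individuals are heterozygous stay unphased under this approach, which is why a trio yields only some haplotype information.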
We'd like to create a report that uses haplotype data to determine which copy of a gene each variant is on and combines this information in an intelligent manner. This is very important for interpretation: many genetic diseases are recessive and can be caused by a variety of different variants in the same gene. Currently, a genome carrying two such variants simply shows up as having two heterozygous variants. If one is on each copy of the gene, then you may have a problem, but if both are on the same copy, then the other copy should be fine.
An example from within the PGP: PGP1 (hu43860C) is heterozygous for SERPINA1-E366K and SERPINA1-E288V. As I understand it, these are considered somewhat pathogenic ("PiSZ" genotype) when one is on each copy of the gene. Perhaps these two variants are always found on different copies, but we would like to confirm that they are indeed on different copies of the gene and not both on the same copy.
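The per-gene logic described above could be sketched roughly as follows. The function name `summarize_gene`, the copy labels, and the assignment of the SERPINA1 variants to particular copies are all hypothetical and chosen only for illustration; the actual phase for PGP1 is not known from this ticket.

```python
def summarize_gene(variant_copies):
    """Summarize heterozygous variants in one gene by the copy they sit on.

    variant_copies maps a variant name to "maternal", "paternal", or
    None (None meaning the variant could not be phased).
    Hypothetical helper for illustration only.
    """
    if not variant_copies:
        return "no variants"
    copies = set(variant_copies.values())
    if {"maternal", "paternal"} <= copies:
        # At least one variant on each copy: no unaffected copy remains.
        return "both copies affected"
    if None in copies:
        # An unphased variant could be on either copy.
        return "phase unknown"
    return "one copy affected"


# Hypothetical phasing: one variant on each copy of the gene.
print(summarize_gene({"SERPINA1-E366K": "maternal",
                      "SERPINA1-E288V": "paternal"}))
# Hypothetical phasing: both variants on the same copy.
print(summarize_gene({"SERPINA1-E366K": "maternal",
                      "SERPINA1-E288V": "maternal"}))
```

A report could then flag only the "both copies affected" case as a possible recessive problem and mark "phase unknown" genes for follow-up.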
#3 Updated by Madeleine Ball almost 8 years ago
Evan Maxwell (emaxwell) has added a Python module for phasing trio data. Commit: https://github.com/madprime/get-evidence/commit/fed5761e
All that remains is to add some input boxes on the PHP side, where genomes are uploaded. The processing step checks whether there is a metadata file in the source directory and, if so, reads it and looks for shasums that identify two parent genomes to send to the trio phasing module.
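For anyone picking this up, the metadata check might look something like the sketch below. This is not the code from the commit: the file name `metadata.json`, the JSON format, and the key names `parent_a_sha1` and `parent_b_sha1` are assumptions made for illustration, and the real processing step may use a different layout.

```python
import json
import os

# Hypothetical file name and keys; the real metadata layout may differ.
METADATA_FILENAME = "metadata.json"
PARENT_KEYS = ("parent_a_sha1", "parent_b_sha1")


def find_parent_shasums(source_dir):
    """Return the two parent shasums from the metadata file, or None.

    None is returned when there is no metadata file in source_dir or
    when it does not identify both parents, in which case the genome
    would be processed without trio phasing.
    """
    path = os.path.join(source_dir, METADATA_FILENAME)
    if not os.path.isfile(path):
        return None
    with open(path) as handle:
        metadata = json.load(handle)
    shasums = tuple(metadata.get(key) for key in PARENT_KEYS)
    if all(shasums):
        return shasums
    return None
```

The upload form would only need to write the two parent identifiers into that file; everything downstream stays unchanged when the file is absent.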