NA12878 (original):
https://workbench.su92l.arvadosapi.com/collections/su92l-4zz18-0btgexp21dhe6zu
NA12878 (phased):
https://workbench.su92l.arvadosapi.com/collections/su92l-4zz18-0c5vff0u6x0486v
I've discovered there are two main potential problems with the algorithm.
1. Flipping the phase¶
NA12878 has 3,775,119 variants, 2,249,190 of them are phased, but eagle flipped 1,076,255 of them. Statistics are obtained by running the following command.
vcf-compare -g HG001_GRCh37_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X-SOLID_CHROM1-X_v.3.3.2_highconf_PGandRTGphasetransfer.vcf.gz NA12878_phased.vcf.gz
For example, in the original NA12878
1 1171417 rs6603782 C T 50 PASS platforms=4;platformnames=Illumina,CG,10X,Solid;datasets=4;datasetnames=HiSeqPE300x,CGnormal,10XChromium,SolidSE75bp;callsets=5;callsetnames=HiSeqPE300xGATK,CGnormal,HiSeqPE300xfreebayes,10XGATKhaplo,SolidSE75GATKHC;datasetsmissingcall=IonExome,SolidPE50x50bp;callable=CS_HiSeqPE300xGATK_callable,CS_CGnormal_callable,CS_HiSeqPE300xfreebayes_callable,CS_10XGATKhaplo_callable GT:DP:ADALL:AD:GQ:IGT:IPS:PS 1|0:639:147,157:163,176:762:0/1:.:PATMAT
1 1172907 rs715643 C T 50 PASS platforms=3;platformnames=Illumina,CG,10X;datasets=3;datasetnames=HiSeqPE300x,CGnormal,10XChromium;callsets=4;callsetnames=HiSeqPE300xGATK,CGnormal,HiSeqPE300xfreebayes,10XGATKhaplo;datasetsmissingcall=IonExome,SolidPE50x50bp,SolidSE75bp;callable=CS_HiSeqPE300xGATK_callable,CS_CGnormal_callable,CS_HiSeqPE300xfreebayes_callable;filt=CS_SolidSE75GATKHC_filt GT:DP:ADALL:AD:GQ:IGT:IPS:PS 0|1:553:114,125:147,172:738:0/1:.:PATMAT
In the phased NA12878
1 1171417 rs6603782 C T 50 PASS platforms=4;platformnames=Illumina,CG,10X,Solid;datasets=4;datasetnames=HiSeqPE300x,CGnormal,10XChromium,SolidSE75bp;callsets=5;callsetnames=HiSeqPE300xGATK,CGnormal,HiSeqPE300xfreebayes,10XGATKhaplo,SolidSE75GATKHC;datasetsmissingcall=IonExome,SolidPE50x50bp;callable=CS_HiSeqPE300xGATK_callable,CS_CGnormal_callable,CS_HiSeqPE300xfreebayes_callable,CS_10XGATKhaplo_callable GT:DP:ADALL:AD:GQ:IGT:IPS:PS 0|1:639:147,157:163,176:762:0/1:.:PATMAT
1 1172907 rs715643 C T 50 PASS platforms=3;platformnames=Illumina,CG,10X;datasets=3;datasetnames=HiSeqPE300x,CGnormal,10XChromium;callsets=4;callsetnames=HiSeqPE300xGATK,CGnormal,HiSeqPE300xfreebayes,10XGATKhaplo;datasetsmissingcall=IonExome,SolidPE50x50bp,SolidSE75bp;callable=CS_HiSeqPE300xGATK_callable,CS_CGnormal_callable,CS_HiSeqPE300xfreebayes_callable;filt=CS_SolidSE75GATKHC_filt GT:DP:ADALL:AD:GQ:IGT:IPS:PS 1|0:553:114,125:147,172:738:0/1:.:PATMAT
2. Missing variants¶
The phased NA12878 missed some unphased variants of the original (meaning that it cannot phase them), but it also missed some phased ones too (this seems counter-intuitive).
For example, missed unphased variants
1 239339 rs184451216 A G 50 PASS platforms=1;platformnames=10X;datasets=1;datasetnames=10XChromium;callsets=1;callsetnames=10XGATKhaplo;datasetsmissingcall=HiSeqPE300x,CGnormal,IonExome,SolidPE50x50bp,SolidSE75bp;callable=CS_10XGATKhaplo_callable;difficultregion=hg19_self_chain_split_withalts_gt10k GT:DP:ADALL:AD:GQ:IGT:IPS:PS 0/1:0:0,0:0,0:99:0/1
Missed phased variants
1 567239 rs78150957 CG C 50 PASS platforms=3;platformnames=Illumina,CG,10X;datasets=3;datasetnames=HiSeqPE300x,CGnormal,10XChromium;callsets=4;callsetnames=HiSeqPE300xGATK,CGnormal,HiSeqPE300xfreebayes,10XGATKhaplo;datasetsmissingcall=IonExome,SolidPE50x50bp,SolidSE75bp;callable=CS_HiSeqPE300xGATK_callable,CS_CGnormal_callable;filt=CS_HiSeqPE300xfreebayes_filt,CS_SolidPE50x50GATKHC_filt,CS_SolidSE75GATKHC_filt;difficultregion=AllRepeats_lt51bp_gt95identity_merged_slop5 GT:DP:ADALL:AD:GQ:IGT:IPS:PS 1|1:534:0,214:38,252:241:1/1:.:PATMAT
Note that the missed variants are extracted by the following command, and then looking at the fn.vcf.gz (false negatives)
rtg vcfeval -b HG001_GRCh37_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X-SOLID_CHROM1-X_v.3.3.2_highconf_PGandRTGphasetransfer.vcf.gz -c NA12878_phased.vcf.gz -t ~/keep/by_id/cd0ace4987f72d1e4d8f3ef5281cef17+63099/genomes/GRCh37/rtg/GRCh37.sdf -o rtg-NA12878-NA12878_phased
There are 313,952 variants missing, 300,952 of them are phased originally, and 118,426 of them are hom alt (namely, GT being 1|1).