Story #13376

Test the effect of phasing imputation workflow

Added by Jiayong Li over 1 year ago. Updated 4 months ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Start date:
Due date:
% Done:

0%

Estimated time:
Story points:
-

generate_test_genome.py (1.56 KB) generate_test_genome.py Keldin Sergheyev, 04/26/2018 05:10 PM
get_switch_errors.py (2.77 KB) get_switch_errors.py Jiayong Li, 06/20/2018 08:35 PM

Related issues

Related to Lightning - Story #13216: Write phasing imputation workflowResolved03/22/201804/10/2018

History

#1 Updated by Jiayong Li over 1 year ago

  • Status changed from New to In Progress

#2 Updated by Jiayong Li over 1 year ago

  • Description updated (diff)
  • Status changed from In Progress to New
  • Assigned To set to Jiayong Li

#3 Updated by Jiayong Li over 1 year ago

  • Related to Story #13216: Write phasing imputation workflow added

#4 Updated by Jiayong Li over 1 year ago

Some sample lines of GS12877.

unphased:

1       52238   .       T       G       .       PASS    .       GT      1/1
1       52727   .       C       G       .       PASS    .       GT      0/1
1       53206   .       G       C       .       PASS    .       GT      1/0
1       55164   .       C       A       .       PASS    .       GT      1/1

phased:

1       52238   .       T       G       .       PASS    .       GT      1|1
1       55164   .       C       A       .       PASS    .       GT      1|1

imputed:

1       52238   rs2691277       T       G       .       PASS    AR2=0.00;DR2=0.00;AF=1  GT:DS   1|1:2
1       52253   rs530867301     C       G       .       PASS    AR2=0.00;DR2=0.00;AF=0;IMP      GT:DS   0|0:0
1       53195   rs530111162     T       C       .       PASS    AR2=0.00;DR2=0.00;AF=0;IMP      GT:DS   0|0:0
1       53234   rs199502715     CAT     C       .       PASS    AR2=0.00;DR2=0.00;AF=0;IMP      GT:DS   0|0:0
1       54353   rs140052487     C       A       .       PASS    AR2=0.00;DR2=0.00;AF=0.0029;IMP GT:DS   0|0:0.01
1       54490   rs141149254     G       A       .       PASS    AR2=0.00;DR2=0.00;AF=0.061;IMP  GT:DS   0|0:0.12
1       54564   rs558796213     G       T       .       PASS    AR2=0.00;DR2=0.00;AF=0;IMP      GT:DS   0|0:0
1       54591   rs561234294     A       G       .       PASS    AR2=0.00;DR2=0.00;AF=0;IMP      GT:DS   0|0:0
1       54712   rs552304420     T       C       .       PASS    AR2=0.00;DR2=0.00;AF=0.0029;IMP GT:DS   0|0:0.01
1       54712   rs568927205     T       TTTTC   .       PASS    AR2=0.00;DR2=0.00;AF=0.59;IMP   GT:DS   1|1:1.18
1       54716   rs569128616     C       T       .       PASS    AR2=0.00;DR2=0.00;AF=0.20;IMP   GT:DS   0|0:0.39
1       54763   rs548455890     T       G       .       PASS    AR2=0.00;DR2=0.00;AF=0;IMP      GT:DS   0|0:0
1       54815   rs568686875     T       C       .       PASS    AR2=0.00;DR2=0.00;AF=0;IMP      GT:DS   0|0:0
1       54945   rs569799965     C       A       .       PASS    AR2=0.00;DR2=0.00;AF=0;IMP      GT:DS   0|0:0
1       55136   rs574692163     T       C       .       PASS    AR2=0.00;DR2=0.00;AF=0;IMP      GT:DS   0|0:0
1       55164   rs3091274       C       A       .       PASS    AR2=0.00;DR2=0.00;AF=1  GT:DS   1|1:2

  • Since the phasing process will drop variants (when it can't phase), and then imputation will add variants, there is potential conflict in the workflow.
  • Also, in the imputed vcf, the position 54712 appeared twice.

#5 Updated by Jiayong Li over 1 year ago

  • Status changed from New to In Progress

#6 Updated by Jiayong Li over 1 year ago

Also beagle includes the following quality information in the imputed vcf:

##FORMAT=<ID=DS,Number=A,Type=Float,Description="estimated ALT dose [P(RA) + P(AA)]">

P(RA): P(REF,ALT)
P(AA): P(ALT,ALT)

More info here https://genome.sph.umich.edu/wiki/Minimac3_Info_File#Dosage

Minimac3 estimates imputed dosage at an haplotype level by finding the posterior probability of the alternate allele at that site. The genotype dosage is next evaluated as the sum of the haplotype dosages of each haplotype. For e.g. if the estimated posterior probability of the alternate allele is 0.98 and 0.96 in each haplotype, the genotype dosage is output as 0.98 + 0.97 = 1.95.

Also the imputed marker

##INFO=<ID=IMP,Number=0,Type=Flag,Description="Imputed marker">

Note that the reported variants in phased doesn't have the "IMP" marker.

Another observation is that AF (allele frequency) appears to be correlated with DS. DS is roughly 2*AF.

#7 Updated by Jiayong Li over 1 year ago

Note that in the imputed vcf, the original annotations are dropped. For example, in NA12878

unphased:

1       837214  rs72631888      G       C       50      PASS    platforms=3;platformnames=Illumina,CG,Solid;datasets=3;datasetnames=HiSeqPE300x,CGnormal,SolidSE75bp;callsets=4;callsetnames=HiSeqPE300xGATK,CGnormal,HiSeqPE300xfreebayes,SolidSE75GATKHC;datasetsmissingcall=10XChromium,IonExome,SolidPE50x50bp;callable=CS_HiSeqPE300xGATK_callable,CS_CGnormal_callable,CS_HiSeqPE300xfreebayes_callable   GT:DP:ADALL:AD:GQ:IGT:IPS:PS    0|1:558:122,138:135,166:581:0/1:.:PATMAT

phased:

1       837214  rs72631888      G       C       50      PASS    platforms=3;platformnames=Illumina,CG,Solid;datasets=3;datasetnames=HiSeqPE300x,CGnormal,SolidSE75bp;callsets=4;callsetnames=HiSeqPE300xGATK,CGnormal,HiSeqPE300xfreebayes,SolidSE75GATKHC;datasetsmissingcall=10XChromium,IonExome,SolidPE50x50bp;callable=CS_HiSeqPE300xGATK_callable,CS_CGnormal_callable,CS_HiSeqPE300xfreebayes_callable   GT:DP:ADALL:AD:GQ:IGT:IPS:PS    0|1:558:122,138:135,166:581:0/1:.:PATMAT

imputed

1       837214  rs72631888      G       C       .       PASS    AR2=0.00;DR2=0.00;AF=0.5        GT:DS   0|1:1

#8 Updated by Keldin Sergheyev over 1 year ago

This is the python script that I used to remove positions from GIAB. This version has no errors, but could be more selective about which positions it converts to ./.

#9 Updated by Jiayong Li over 1 year ago

Keldin Sergheyev wrote:

This is the python script that I used to remove positions from GIAB. This version has no errors, but could be more selective about which positions it converts to ./.

Thanks Keldin.

#10 Updated by Jiayong Li over 1 year ago

NA12878 (original):
https://workbench.su92l.arvadosapi.com/collections/su92l-4zz18-0btgexp21dhe6zu

NA12878 (phased):
https://workbench.su92l.arvadosapi.com/collections/su92l-4zz18-0c5vff0u6x0486v

I've discovered there are two main potential problems with the algorithm.

1. Flipping the phase

NA12878 has 3,775,119 variants, 2,249,190 of them are phased, but eagle flipped 1,076,255 of them. Statistics are obtained by running the following command.

vcf-compare -g HG001_GRCh37_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X-SOLID_CHROM1-X_v.3.3.2_highconf_PGandRTGphasetransfer.vcf.gz NA12878_phased.vcf.gz

For example, in the original NA12878

1       1171417 rs6603782       C       T       50      PASS    platforms=4;platformnames=Illumina,CG,10X,Solid;datasets=4;datasetnames=HiSeqPE300x,CGnormal,10XChromium,SolidSE75bp;callsets=5;callsetnames=HiSeqPE300xGATK,CGnormal,HiSeqPE300xfreebayes,10XGATKhaplo,SolidSE75GATKHC;datasetsmissingcall=IonExome,SolidPE50x50bp;callable=CS_HiSeqPE300xGATK_callable,CS_CGnormal_callable,CS_HiSeqPE300xfreebayes_callable,CS_10XGATKhaplo_callable GT:DP:ADALL:AD:GQ:IGT:IPS:PS    1|0:639:147,157:163,176:762:0/1:.:PATMAT
1       1172907 rs715643        C       T       50      PASS    platforms=3;platformnames=Illumina,CG,10X;datasets=3;datasetnames=HiSeqPE300x,CGnormal,10XChromium;callsets=4;callsetnames=HiSeqPE300xGATK,CGnormal,HiSeqPE300xfreebayes,10XGATKhaplo;datasetsmissingcall=IonExome,SolidPE50x50bp,SolidSE75bp;callable=CS_HiSeqPE300xGATK_callable,CS_CGnormal_callable,CS_HiSeqPE300xfreebayes_callable;filt=CS_SolidSE75GATKHC_filt   GT:DP:ADALL:AD:GQ:IGT:IPS:PS    0|1:553:114,125:147,172:738:0/1:.:PATMAT

In the phased NA12878

1       1171417 rs6603782       C       T       50      PASS    platforms=4;platformnames=Illumina,CG,10X,Solid;datasets=4;datasetnames=HiSeqPE300x,CGnormal,10XChromium,SolidSE75bp;callsets=5;callsetnames=HiSeqPE300xGATK,CGnormal,HiSeqPE300xfreebayes,10XGATKhaplo,SolidSE75GATKHC;datasetsmissingcall=IonExome,SolidPE50x50bp;callable=CS_HiSeqPE300xGATK_callable,CS_CGnormal_callable,CS_HiSeqPE300xfreebayes_callable,CS_10XGATKhaplo_callable GT:DP:ADALL:AD:GQ:IGT:IPS:PS    0|1:639:147,157:163,176:762:0/1:.:PATMAT
1       1172907 rs715643        C       T       50      PASS    platforms=3;platformnames=Illumina,CG,10X;datasets=3;datasetnames=HiSeqPE300x,CGnormal,10XChromium;callsets=4;callsetnames=HiSeqPE300xGATK,CGnormal,HiSeqPE300xfreebayes,10XGATKhaplo;datasetsmissingcall=IonExome,SolidPE50x50bp,SolidSE75bp;callable=CS_HiSeqPE300xGATK_callable,CS_CGnormal_callable,CS_HiSeqPE300xfreebayes_callable;filt=CS_SolidSE75GATKHC_filt   GT:DP:ADALL:AD:GQ:IGT:IPS:PS    1|0:553:114,125:147,172:738:0/1:.:PATMAT

2. Missing variants

The phased NA12878 missed some unphased variants of the original (meaning that it cannot phase them), but it also missed some phased ones too (this seems counter-intuitive).

For example, missed unphased variants

1       239339  rs184451216     A       G       50      PASS    platforms=1;platformnames=10X;datasets=1;datasetnames=10XChromium;callsets=1;callsetnames=10XGATKhaplo;datasetsmissingcall=HiSeqPE300x,CGnormal,IonExome,SolidPE50x50bp,SolidSE75bp;callable=CS_10XGATKhaplo_callable;difficultregion=hg19_self_chain_split_withalts_gt10k      GT:DP:ADALL:AD:GQ:IGT:IPS:PS    0/1:0:0,0:0,0:99:0/1

Missed phased variants

1       567239  rs78150957      CG      C       50      PASS    platforms=3;platformnames=Illumina,CG,10X;datasets=3;datasetnames=HiSeqPE300x,CGnormal,10XChromium;callsets=4;callsetnames=HiSeqPE300xGATK,CGnormal,HiSeqPE300xfreebayes,10XGATKhaplo;datasetsmissingcall=IonExome,SolidPE50x50bp,SolidSE75bp;callable=CS_HiSeqPE300xGATK_callable,CS_CGnormal_callable;filt=CS_HiSeqPE300xfreebayes_filt,CS_SolidPE50x50GATKHC_filt,CS_SolidSE75GATKHC_filt;difficultregion=AllRepeats_lt51bp_gt95identity_merged_slop5        GT:DP:ADALL:AD:GQ:IGT:IPS:PS    1|1:534:0,214:38,252:241:1/1:.:PATMAT

Note that the missed variants are extracted by the following command, and then looking at the fn.vcf.gz (false negatives)

rtg vcfeval -b HG001_GRCh37_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X-SOLID_CHROM1-X_v.3.3.2_highconf_PGandRTGphasetransfer.vcf.gz -c NA12878_phased.vcf.gz -t ~/keep/by_id/cd0ace4987f72d1e4d8f3ef5281cef17+63099/genomes/GRCh37/rtg/GRCh37.sdf -o rtg-NA12878-NA12878_phased

There are 313,952 variants missing, 300,952 of them are phased originally, and 118,426 of them are hom alt (namely, GT being 1|1).

#11 Updated by Jiayong Li about 1 year ago

For imputation of GIAB NA12878, beagle adds total of 720,714 variants, and 125,936 of them lie in the confidence region. This makes sense, since 83% of the imputed variants are outside of the confidence region.

#12 Updated by Jiayong Li about 1 year ago

Regarding eagle flipping the phases, I've found this PLOS Genetics paper http://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1007308 that evaluates phasing strategies on GIAB NA12878, and eagle and beagle are among them.

In that paper, they used the metric Switch Error, which is defined as follows

To identify switch errors, the phase of each site is compared with upstream neighboring phased sites.

This means the switch error is only 1 for a whole block of variants whose phases are all switched. In this metric, eagle did better than beagle. See http://journals.plos.org/plosgenetics/article/figure?id=10.1371/journal.pgen.1007308.t001

#13 Updated by Jiayong Li about 1 year ago

See the attached script for calculating switch errors.

Summary:

GIAB NA12878:

3,775,119 variants (2,249,190 of them phased)

After applying eagle for phasing:

3,461,167 phased variants
313,952 missing variants (300,952 of them phased originally, 118,426 of them hom alt, namely, GT being 1|1)
1,076,255 phase switches
186,362 switch errors (phase switch compared to last phased het upstream)

#14 Updated by Jiayong Li 4 months ago

  • Status changed from In Progress to Resolved

Also available in: Atom PDF