https://dev.arvados.org/https://dev.arvados.org/favicon.ico?15576888422018-04-20T15:57:29ZArvadosLightning - Idea #13376: Test the effect of phasing imputation workflowhttps://dev.arvados.org/issues/13376?journal_id=619942018-04-20T15:57:29ZJiayong Lijli@curii.com
<ul><li><strong>Status</strong> changed from <i>New</i> to <i>In Progress</i></li></ul> Lightning - Idea #13376: Test the effect of phasing imputation workflowhttps://dev.arvados.org/issues/13376?journal_id=619952018-04-20T15:59:29ZJiayong Lijli@curii.com
<ul><li><strong>Description</strong> updated (<a title="View differences" href="/journals/61995/diff?detail_id=59095">diff</a>)</li><li><strong>Status</strong> changed from <i>In Progress</i> to <i>New</i></li><li><strong>Assigned To</strong> set to <i>Jiayong Li</i></li></ul> Lightning - Idea #13376: Test the effect of phasing imputation workflowhttps://dev.arvados.org/issues/13376?journal_id=619972018-04-20T15:59:45ZJiayong Lijli@curii.com
<ul><li><strong>Related to</strong> <i><a class="issue tracker-6 status-3 priority-4 priority-default closed parent behind-schedule" href="/issues/13216">Idea #13216</a>: Write phasing imputation workflow</i> added</li></ul> Lightning - Idea #13376: Test the effect of phasing imputation workflowhttps://dev.arvados.org/issues/13376?journal_id=619992018-04-20T16:04:16ZJiayong Lijli@curii.com
<ul></ul><p>Some sample lines of GS12877.</p>
<p>unphased:<br /><pre>
1 52238 . T G . PASS . GT 1/1
1 52727 . C G . PASS . GT 0/1
1 53206 . G C . PASS . GT 1/0
1 55164 . C A . PASS . GT 1/1
</pre></p>
<p>phased:<br /><pre>
1 52238 . T G . PASS . GT 1|1
1 55164 . C A . PASS . GT 1|1
</pre></p>
<p>imputed:<br /><pre>
1 52238 rs2691277 T G . PASS AR2=0.00;DR2=0.00;AF=1 GT:DS 1|1:2
1 52253 rs530867301 C G . PASS AR2=0.00;DR2=0.00;AF=0;IMP GT:DS 0|0:0
1 53195 rs530111162 T C . PASS AR2=0.00;DR2=0.00;AF=0;IMP GT:DS 0|0:0
1 53234 rs199502715 CAT C . PASS AR2=0.00;DR2=0.00;AF=0;IMP GT:DS 0|0:0
1 54353 rs140052487 C A . PASS AR2=0.00;DR2=0.00;AF=0.0029;IMP GT:DS 0|0:0.01
1 54490 rs141149254 G A . PASS AR2=0.00;DR2=0.00;AF=0.061;IMP GT:DS 0|0:0.12
1 54564 rs558796213 G T . PASS AR2=0.00;DR2=0.00;AF=0;IMP GT:DS 0|0:0
1 54591 rs561234294 A G . PASS AR2=0.00;DR2=0.00;AF=0;IMP GT:DS 0|0:0
1 54712 rs552304420 T C . PASS AR2=0.00;DR2=0.00;AF=0.0029;IMP GT:DS 0|0:0.01
1 54712 rs568927205 T TTTTC . PASS AR2=0.00;DR2=0.00;AF=0.59;IMP GT:DS 1|1:1.18
1 54716 rs569128616 C T . PASS AR2=0.00;DR2=0.00;AF=0.20;IMP GT:DS 0|0:0.39
1 54763 rs548455890 T G . PASS AR2=0.00;DR2=0.00;AF=0;IMP GT:DS 0|0:0
1 54815 rs568686875 T C . PASS AR2=0.00;DR2=0.00;AF=0;IMP GT:DS 0|0:0
1 54945 rs569799965 C A . PASS AR2=0.00;DR2=0.00;AF=0;IMP GT:DS 0|0:0
1 55136 rs574692163 T C . PASS AR2=0.00;DR2=0.00;AF=0;IMP GT:DS 0|0:0
1 55164 rs3091274 C A . PASS AR2=0.00;DR2=0.00;AF=1 GT:DS 1|1:2
</pre></p>
<ul>
<li>Since the phasing process will drop variants (when it can't phase), and then imputation will add variants, there is potential conflict in the workflow.</li>
</ul>
<ul>
<li>Also, in the imputed vcf, the position 54712 appeared twice.</li>
</ul> Lightning - Idea #13376: Test the effect of phasing imputation workflowhttps://dev.arvados.org/issues/13376?journal_id=620002018-04-20T16:04:31ZJiayong Lijli@curii.com
<ul><li><strong>Status</strong> changed from <i>New</i> to <i>In Progress</i></li></ul> Lightning - Idea #13376: Test the effect of phasing imputation workflowhttps://dev.arvados.org/issues/13376?journal_id=620012018-04-20T17:41:40ZJiayong Lijli@curii.com
<ul></ul><p>Also beagle includes the following quality information in the imputed vcf:</p>
<pre>
##FORMAT=<ID=DS,Number=A,Type=Float,Description="estimated ALT dose [P(RA) + P(AA)]">
</pre>
<p>P(RA): P(REF,ALT)<br />P(AA): P(ALT,ALT)</p>
<p>More info here <a class="external" href="https://genome.sph.umich.edu/wiki/Minimac3_Info_File#Dosage">https://genome.sph.umich.edu/wiki/Minimac3_Info_File#Dosage</a></p>
<blockquote>
<p>Minimac3 estimates imputed dosage at an haplotype level by finding the posterior probability of the alternate allele at that site. The genotype dosage is next evaluated as the sum of the haplotype dosages of each haplotype. For e.g. if the estimated posterior probability of the alternate allele is 0.98 and 0.96 in each haplotype, the genotype dosage is output as 0.98 + 0.97 = 1.95.</p>
</blockquote>
<p>Also the imputed marker<br /><pre>
##INFO=<ID=IMP,Number=0,Type=Flag,Description="Imputed marker">
</pre><br />Note that the reported variants in phased doesn't have the "IMP" marker.</p>
<p>Another observation is that AF (allele frequency) appears to be correlated with DS. DS is roughly 2*AF.</p> Lightning - Idea #13376: Test the effect of phasing imputation workflowhttps://dev.arvados.org/issues/13376?journal_id=621742018-04-26T15:20:49ZJiayong Lijli@curii.com
<ul></ul><p>Note that in the imputed vcf, the original annotations are dropped. For example, in NA12878</p>
<p>unphased:<br /><pre>
1 837214 rs72631888 G C 50 PASS platforms=3;platformnames=Illumina,CG,Solid;datasets=3;datasetnames=HiSeqPE300x,CGnormal,SolidSE75bp;callsets=4;callsetnames=HiSeqPE300xGATK,CGnormal,HiSeqPE300xfreebayes,SolidSE75GATKHC;datasetsmissingcall=10XChromium,IonExome,SolidPE50x50bp;callable=CS_HiSeqPE300xGATK_callable,CS_CGnormal_callable,CS_HiSeqPE300xfreebayes_callable GT:DP:ADALL:AD:GQ:IGT:IPS:PS 0|1:558:122,138:135,166:581:0/1:.:PATMAT
</pre></p>
<p>phased:<br /><pre>
1 837214 rs72631888 G C 50 PASS platforms=3;platformnames=Illumina,CG,Solid;datasets=3;datasetnames=HiSeqPE300x,CGnormal,SolidSE75bp;callsets=4;callsetnames=HiSeqPE300xGATK,CGnormal,HiSeqPE300xfreebayes,SolidSE75GATKHC;datasetsmissingcall=10XChromium,IonExome,SolidPE50x50bp;callable=CS_HiSeqPE300xGATK_callable,CS_CGnormal_callable,CS_HiSeqPE300xfreebayes_callable GT:DP:ADALL:AD:GQ:IGT:IPS:PS 0|1:558:122,138:135,166:581:0/1:.:PATMAT
</pre></p>
<p>imputed<br /><pre>
1 837214 rs72631888 G C . PASS AR2=0.00;DR2=0.00;AF=0.5 GT:DS 0|1:1
</pre></p> Lightning - Idea #13376: Test the effect of phasing imputation workflowhttps://dev.arvados.org/issues/13376?journal_id=621832018-04-26T17:13:53ZKeldin Sergheyevksergheyev@veritasgenetics.com
<ul><li><strong>File</strong> <a href="/attachments/2023">generate_test_genome.py</a> <a class="icon-only icon-download" title="Download" href="/attachments/download/2023/generate_test_genome.py">generate_test_genome.py</a> added</li></ul><p>This is the python script that I used to remove positions from GIAB. This version has no errors, but could be more selective about which positions it converts to ./.</p> Lightning - Idea #13376: Test the effect of phasing imputation workflowhttps://dev.arvados.org/issues/13376?journal_id=621862018-04-26T18:12:33ZJiayong Lijli@curii.com
<ul></ul><p>Keldin Sergheyev wrote:</p>
<blockquote>
<p>This is the python script that I used to remove positions from GIAB. This version has no errors, but could be more selective about which positions it converts to ./.</p>
</blockquote>
<p>Thanks Keldin.</p> Lightning - Idea #13376: Test the effect of phasing imputation workflowhttps://dev.arvados.org/issues/13376?journal_id=626232018-05-11T15:14:06ZJiayong Lijli@curii.com
<ul></ul><p>NA12878 (original):<br /><a class="external" href="https://workbench.su92l.arvadosapi.com/collections/su92l-4zz18-0btgexp21dhe6zu">https://workbench.su92l.arvadosapi.com/collections/su92l-4zz18-0btgexp21dhe6zu</a></p>
<p>NA12878 (phased):<br /><a class="external" href="https://workbench.su92l.arvadosapi.com/collections/su92l-4zz18-0c5vff0u6x0486v">https://workbench.su92l.arvadosapi.com/collections/su92l-4zz18-0c5vff0u6x0486v</a></p>
<p>I've discovered there are two main potential problems with the algorithm.</p>
<a name="1-Flipping-the-phase"></a>
<h3 >1. Flipping the phase<a href="#1-Flipping-the-phase" class="wiki-anchor">¶</a></h3>
<p>NA12878 has 3,775,119 variants, 2,249,190 of them are phased, but eagle flipped 1,076,255 of them. Statistics are obtained by running the following command.<br /><pre>
vcf-compare -g HG001_GRCh37_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X-SOLID_CHROM1-X_v.3.3.2_highconf_PGandRTGphasetransfer.vcf.gz NA12878_phased.vcf.gz
</pre></p>
<p>For example, in the original NA12878<br /><pre>
1 1171417 rs6603782 C T 50 PASS platforms=4;platformnames=Illumina,CG,10X,Solid;datasets=4;datasetnames=HiSeqPE300x,CGnormal,10XChromium,SolidSE75bp;callsets=5;callsetnames=HiSeqPE300xGATK,CGnormal,HiSeqPE300xfreebayes,10XGATKhaplo,SolidSE75GATKHC;datasetsmissingcall=IonExome,SolidPE50x50bp;callable=CS_HiSeqPE300xGATK_callable,CS_CGnormal_callable,CS_HiSeqPE300xfreebayes_callable,CS_10XGATKhaplo_callable GT:DP:ADALL:AD:GQ:IGT:IPS:PS 1|0:639:147,157:163,176:762:0/1:.:PATMAT
1 1172907 rs715643 C T 50 PASS platforms=3;platformnames=Illumina,CG,10X;datasets=3;datasetnames=HiSeqPE300x,CGnormal,10XChromium;callsets=4;callsetnames=HiSeqPE300xGATK,CGnormal,HiSeqPE300xfreebayes,10XGATKhaplo;datasetsmissingcall=IonExome,SolidPE50x50bp,SolidSE75bp;callable=CS_HiSeqPE300xGATK_callable,CS_CGnormal_callable,CS_HiSeqPE300xfreebayes_callable;filt=CS_SolidSE75GATKHC_filt GT:DP:ADALL:AD:GQ:IGT:IPS:PS 0|1:553:114,125:147,172:738:0/1:.:PATMAT
</pre></p>
<p>In the phased NA12878<br /><pre>
1 1171417 rs6603782 C T 50 PASS platforms=4;platformnames=Illumina,CG,10X,Solid;datasets=4;datasetnames=HiSeqPE300x,CGnormal,10XChromium,SolidSE75bp;callsets=5;callsetnames=HiSeqPE300xGATK,CGnormal,HiSeqPE300xfreebayes,10XGATKhaplo,SolidSE75GATKHC;datasetsmissingcall=IonExome,SolidPE50x50bp;callable=CS_HiSeqPE300xGATK_callable,CS_CGnormal_callable,CS_HiSeqPE300xfreebayes_callable,CS_10XGATKhaplo_callable GT:DP:ADALL:AD:GQ:IGT:IPS:PS 0|1:639:147,157:163,176:762:0/1:.:PATMAT
1 1172907 rs715643 C T 50 PASS platforms=3;platformnames=Illumina,CG,10X;datasets=3;datasetnames=HiSeqPE300x,CGnormal,10XChromium;callsets=4;callsetnames=HiSeqPE300xGATK,CGnormal,HiSeqPE300xfreebayes,10XGATKhaplo;datasetsmissingcall=IonExome,SolidPE50x50bp,SolidSE75bp;callable=CS_HiSeqPE300xGATK_callable,CS_CGnormal_callable,CS_HiSeqPE300xfreebayes_callable;filt=CS_SolidSE75GATKHC_filt GT:DP:ADALL:AD:GQ:IGT:IPS:PS 1|0:553:114,125:147,172:738:0/1:.:PATMAT
</pre></p>
<a name="2-Missing-variants"></a>
<h3 >2. Missing variants<a href="#2-Missing-variants" class="wiki-anchor">¶</a></h3>
<p>The phased NA12878 missed some unphased variants of the original (meaning that it cannot phase them), but it also missed some phased ones too (this seems counter-intuitive).</p>
<p>For example, missed unphased variants<br /><pre>
1 239339 rs184451216 A G 50 PASS platforms=1;platformnames=10X;datasets=1;datasetnames=10XChromium;callsets=1;callsetnames=10XGATKhaplo;datasetsmissingcall=HiSeqPE300x,CGnormal,IonExome,SolidPE50x50bp,SolidSE75bp;callable=CS_10XGATKhaplo_callable;difficultregion=hg19_self_chain_split_withalts_gt10k GT:DP:ADALL:AD:GQ:IGT:IPS:PS 0/1:0:0,0:0,0:99:0/1
</pre></p>
<p>Missed phased variants<br /><pre>
1 567239 rs78150957 CG C 50 PASS platforms=3;platformnames=Illumina,CG,10X;datasets=3;datasetnames=HiSeqPE300x,CGnormal,10XChromium;callsets=4;callsetnames=HiSeqPE300xGATK,CGnormal,HiSeqPE300xfreebayes,10XGATKhaplo;datasetsmissingcall=IonExome,SolidPE50x50bp,SolidSE75bp;callable=CS_HiSeqPE300xGATK_callable,CS_CGnormal_callable;filt=CS_HiSeqPE300xfreebayes_filt,CS_SolidPE50x50GATKHC_filt,CS_SolidSE75GATKHC_filt;difficultregion=AllRepeats_lt51bp_gt95identity_merged_slop5 GT:DP:ADALL:AD:GQ:IGT:IPS:PS 1|1:534:0,214:38,252:241:1/1:.:PATMAT
</pre></p>
<p>Note that the missed variants are extracted by the following command, and then looking at the fn.vcf.gz (false negatives)<br /><pre>
rtg vcfeval -b HG001_GRCh37_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X-SOLID_CHROM1-X_v.3.3.2_highconf_PGandRTGphasetransfer.vcf.gz -c NA12878_phased.vcf.gz -t ~/keep/by_id/cd0ace4987f72d1e4d8f3ef5281cef17+63099/genomes/GRCh37/rtg/GRCh37.sdf -o rtg-NA12878-NA12878_phased
</pre></p>
<p>There are 313,952 variants missing, 300,952 of them are phased originally, and 118,426 of them are hom alt (namely, GT being 1|1).</p> Lightning - Idea #13376: Test the effect of phasing imputation workflowhttps://dev.arvados.org/issues/13376?journal_id=633832018-06-12T15:09:47ZJiayong Lijli@curii.com
<ul></ul><p>For imputation of GIAB NA12878, beagle adds total of 720,714 variants, and 125,936 of them lie in the confidence region. This makes sense, since 83% of the imputed variants are outside of the confidence region.</p> Lightning - Idea #13376: Test the effect of phasing imputation workflowhttps://dev.arvados.org/issues/13376?journal_id=633842018-06-12T15:15:35ZJiayong Lijli@curii.com
<ul></ul><p>Regarding eagle flipping the phases, I've found this PLOS Genetics paper <a class="external" href="http://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1007308">http://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1007308</a> that evaluates phasing strategies on GIAB NA12878, and eagle and beagle are among them.</p>
<p>In that paper, they used the metric Switch Error, which is defined as follows<br /><pre>
To identify switch errors, the phase of each site is compared with upstream neighboring phased sites.
</pre></p>
<p>This means the switch error is only 1 for a whole block of variants whose phases are all switched. In this metric, eagle did better than beagle. See <a class="external" href="http://journals.plos.org/plosgenetics/article/figure?id=10.1371/journal.pgen.1007308.t001">http://journals.plos.org/plosgenetics/article/figure?id=10.1371/journal.pgen.1007308.t001</a></p> Lightning - Idea #13376: Test the effect of phasing imputation workflowhttps://dev.arvados.org/issues/13376?journal_id=636972018-06-20T20:40:11ZJiayong Lijli@curii.com
<ul><li><strong>File</strong> <a href="/attachments/2035">get_switch_errors.py</a> <a class="icon-only icon-download" title="Download" href="/attachments/download/2035/get_switch_errors.py">get_switch_errors.py</a> added</li></ul><p>See the attached script for calculating switch errors.</p>
<p>Summary:</p>
<a name="GIAB-NA12878"></a>
<h3 >GIAB NA12878:<a href="#GIAB-NA12878" class="wiki-anchor">¶</a></h3>
<p>3,775,119 variants (2,249,190 of them phased)</p>
<a name="After-applying-eagle-for-phasing"></a>
<h3 >After applying eagle for phasing:<a href="#After-applying-eagle-for-phasing" class="wiki-anchor">¶</a></h3>
<p>3,461,167 phased variants<br />313,952 missing variants (300,952 of them phased originally, 118,426 of them hom alt, namely, GT being 1|1)<br />1,076,255 phase switches<br />186,362 switch errors (phase switch compared to last phased het upstream)</p> Lightning - Idea #13376: Test the effect of phasing imputation workflowhttps://dev.arvados.org/issues/13376?journal_id=739382019-04-30T17:31:23ZJiayong Lijli@curii.com
<ul><li><strong>Status</strong> changed from <i>In Progress</i> to <i>Resolved</i></li></ul>