Bug #13469
closedvcfbed2homref misreported homref region
Description
When testing with the GIAB dataset, vcfbed2homref (v0.1.0) run on HG004 reported an end bound in a homozygous reference region that was before the start of following line.
The offending VCF line was:
7 85067121 rs149973664 TATTGCACAACAAAAGAAAAACAAAAGATATTTAGTAC T 50 PASS platforms=3;platformnames=CG,Illumina,10X;datasets=5;datasetnames=CGnormal,HiSeqPE300x,HiSeq250x250,10XChromium,HiSeqMatePair;callsets=8;callsetnames=CGnormal,HiSeqPE300xfreebayes,HiSeq250x250freebayes,HiSeqPE300xGATK,HiSeq250x250GATK,10XGATKhaplo,HiSeqMatePairfreebayes,HiSeqMatePairGATK;datasetsmissingcall=IonExome;callable=CS_CGnormal_callable,CS_HiSeqPE300xfreebayes_callable,CS_HiSeq250x250freebayes_callable;filt=CS_CGnormal_filt,CS_HiSeqMatePairGATK_filt GT:PS:DP:ADALL:AD:GQ 0/1:.:836:220,154:0,0:1460
With the corresponding BED file region being:
7 85058314 85067126 7 85067151 85068558 7 85068659 85071179 7 85071280 85073955
I believe the issue is that the bed file position wasn't updated when the homozygous start window was updated (around here).
Updated by Abram Connelly almost 6 years ago
This is specifically the error that was causing tabix
to not index properly:
7 85065014 . G <NON-REF> . PASS END=85067120 GT 0/0 7 85067121 rs149973664 TATTGCACAACAAAAGAAAAACAAAAGATATTTAGTAC T 50 PASS platforms=3;platformnames=CG,Illumina,10X;datasets=5;datasetnames=CGnormal,HiSeqPE300x,HiSeq250x250,10XChromium,HiSeqMatePair;callsets=8;callsetnames=CGnormal,HiSeqPE300xfreebayes,HiSeq250x250freebayes,HiSeqPE300xGATK,HiSeq250x250GATK,10XGATKhaplo,HiSeqMatePairfreebayes,HiSeqMatePairGATK;datasetsmissingcall=.;callable=CS_CGnormal_callable,CS_HiSeqPE300xfreebayes_callable,CS_HiSeq250x250freebayes_callable;filt=CS_CGnormal_filt,CS_HiSeqMatePairGATK_filt GT:PS:DP:ADALL:AD:GQ 0/1:.:836:220,154:0,0:1460 7 85067159 . T <NON-REF> . PASS END=85067126 GT 0/0 7 85067152 . T <NON-REF> . PASS END=85067723 GT 0/0 7 85067724 rs79027572 T C 50 PASS platforms=2;platformnames=Illumina,CG;datasets=4;datasetnames=HiSeqPE300x,CGnormal,HiSeq250x250,HiSeqMatePair;callsets=7;callsetnames=HiSeqPE300xGATK,CGnormal,HiSeqPE300xfreebayes,HiSeq250x250GATK,HiSeq250x250freebayes,HiSeqMatePairfreebayes,HiSeqMatePairGATK;datasetsmissingcall=10,.;callable=CS_HiSeqPE300xGATK_callable,CS_CGnormal_callable,CS_HiSeqPE300xfreebayes_callable,CS_HiSeq250x250GATK_callable,CS_HiSeq250x250freebayes_callable;filt=CS_HiSeqPE300xGATK_filt GT:PS:DP:ADALL:AD:GQ 0/1:.:1049:180,211:79,84:1299
Notice 85067152
comes after in the file but falls before the start of the previous lines reported 85067159
.
tabix
reports the following error (with out.vcf.gz
being the output VCF file):
[E::hts_idx_push] Unsorted positions on sequence #1: 85067159 followed by 85067152 tbx_index_build failed: out.vcf.gz
Updated by Abram Connelly almost 6 years ago
- Status changed from New to Resolved
I subtracted the beginning BED start from the source BED file as well as the VCF file so that we could create a small test snippet with a reference that was checked in.
Bug has been fixed and tests have been added to make sure the problem region passes.