Project

General

Profile

Actions

Bug #13469

closed

vcfbed2homref misreported homref region

Added by Abram Connelly almost 6 years ago. Updated almost 6 years ago.

Status:
Resolved
Priority:
Normal
Assigned To:
-
Target version:
-
Story points:
-

Description

When testing with the GIAB dataset, vcfbed2homref (v0.1.0) run on HG004 reported an end bound in a homozygous reference region that was before the start of following line.

The offending VCF line was:

7       85067121        rs149973664     TATTGCACAACAAAAGAAAAACAAAAGATATTTAGTAC  T       50      PASS    platforms=3;platformnames=CG,Illumina,10X;datasets=5;datasetnames=CGnormal,HiSeqPE300x,HiSeq250x250,10XChromium,HiSeqMatePair;callsets=8;callsetnames=CGnormal,HiSeqPE300xfreebayes,HiSeq250x250freebayes,HiSeqPE300xGATK,HiSeq250x250GATK,10XGATKhaplo,HiSeqMatePairfreebayes,HiSeqMatePairGATK;datasetsmissingcall=IonExome;callable=CS_CGnormal_callable,CS_HiSeqPE300xfreebayes_callable,CS_HiSeq250x250freebayes_callable;filt=CS_CGnormal_filt,CS_HiSeqMatePairGATK_filt    GT:PS:DP:ADALL:AD:GQ    0/1:.:836:220,154:0,0:1460

With the corresponding BED file region being:

7       85058314        85067126
7       85067151        85068558
7       85068659        85071179
7       85071280        85073955

I believe the issue is that the bed file position wasn't updated when the homozygous start window was updated (around here).

Actions #1

Updated by Abram Connelly almost 6 years ago

This is specifically the error that was causing tabix to not index properly:

7       85065014        .       G       <NON-REF>       .       PASS    END=85067120    GT      0/0
7       85067121        rs149973664     TATTGCACAACAAAAGAAAAACAAAAGATATTTAGTAC  T       50      PASS    platforms=3;platformnames=CG,Illumina,10X;datasets=5;datasetnames=CGnormal,HiSeqPE300x,HiSeq250x250,10XChromium,HiSeqMatePair;callsets=8;callsetnames=CGnormal,HiSeqPE300xfreebayes,HiSeq250x250freebayes,HiSeqPE300xGATK,HiSeq250x250GATK,10XGATKhaplo,HiSeqMatePairfreebayes,HiSeqMatePairGATK;datasetsmissingcall=.;callable=CS_CGnormal_callable,CS_HiSeqPE300xfreebayes_callable,CS_HiSeq250x250freebayes_callable;filt=CS_CGnormal_filt,CS_HiSeqMatePairGATK_filt   GT:PS:DP:ADALL:AD:GQ    0/1:.:836:220,154:0,0:1460
7       85067159        .       T       <NON-REF>       .       PASS    END=85067126    GT      0/0
7       85067152        .       T       <NON-REF>       .       PASS    END=85067723    GT      0/0
7       85067724        rs79027572      T       C       50      PASS    platforms=2;platformnames=Illumina,CG;datasets=4;datasetnames=HiSeqPE300x,CGnormal,HiSeq250x250,HiSeqMatePair;callsets=7;callsetnames=HiSeqPE300xGATK,CGnormal,HiSeqPE300xfreebayes,HiSeq250x250GATK,HiSeq250x250freebayes,HiSeqMatePairfreebayes,HiSeqMatePairGATK;datasetsmissingcall=10,.;callable=CS_HiSeqPE300xGATK_callable,CS_CGnormal_callable,CS_HiSeqPE300xfreebayes_callable,CS_HiSeq250x250GATK_callable,CS_HiSeq250x250freebayes_callable;filt=CS_HiSeqPE300xGATK_filt       GT:PS:DP:ADALL:AD:GQ    0/1:.:1049:180,211:79,84:1299

Notice 85067152 comes after in the file but falls before the start of the previous lines reported 85067159.

tabix reports the following error (with out.vcf.gz being the output VCF file):

[E::hts_idx_push] Unsorted positions on sequence #1: 85067159 followed by 85067152
tbx_index_build failed: out.vcf.gz
Actions #2

Updated by Abram Connelly almost 6 years ago

  • Status changed from New to Resolved

I subtracted the beginning BED start from the source BED file as well as the VCF file so that we could create a small test snippet with a reference that was checked in.

Bug has been fixed and tests have been added to make sure the problem region passes.

Actions

Also available in: Atom PDF