Idea #13336
closedImplement VCF with BED file conversion to FastJ
Added by Abram Connelly about 6 years ago. Updated about 6 years ago.
Description
Write a program to convert VCF with a BED file containing homozygous reference regions to FastJ.
Use the Genome In A Bottle (GIAB) datasets to test.
Updated by Abram Connelly about 6 years ago
Taking files from:
Specifically:
A quick spot check on coverage using the BED file:
1 204436417 / 249250621 ( 0.820204243343 ) 3 184024995 / 198022430 ( 0.929313891361 ) 2 219872344 / 243199373 ( 0.90408269268 ) 5 164352129 / 180915260 ( 0.908448126488 ) 4 175594424 / 191154276 ( 0.918600554873 ) 7 134734010 / 159138663 ( 0.846645356069 ) 6 154589912 / 171115067 ( 0.903426651494 ) 9 101519611 / 141213431 ( 0.718909032102 ) 8 134167881 / 146364022 ( 0.916672548121 ) 11 119883302 / 135006516 ( 0.887981599347 ) 10 117273155 / 135534747 ( 0.865262654749 ) 13 89047037 / 115169878 ( 0.773179919492 ) 12 122017004 / 133851895 ( 0.91158219314 ) 15 70672978 / 102531392 ( 0.689281366628 ) 14 82035103 / 107349540 ( 0.764186814401 ) 17 66612606 / 81195210 ( 0.820400686198 ) 16 67025077 / 90354753 ( 0.741799128154 ) 19 46341023 / 59128983 ( 0.78372771945 ) 18 70100247 / 78077248 ( 0.897831939466 ) 20 54881000 / 63025520 ( 0.870774251446 ) 21 30640180 / 48129895 ( 0.636614312165 ) 22 28087336 / 51304566 ( 0.547462695621 ) X 137156694 / 155270560 ( 0.88333998409 ) Y 0 / 59373566 ( 0.0 ) MT 0 / 16571 ( 0.0 )
Y
and MT
aren't present in the VCF. Numerator is taken from adding the deltas from the BED file, denominator is taken from the cytoband.hg19
file we have laying around (Here's one possible source).
In case it's needed, the quick and dirty script, assuming file locations are correctly set up (the above output has been reordered by hand for readability):
#!/usr/bin/python
covcount = {}
poscount = {}
for c in range(1,23):
poscount[str(c)] = 0
covcount[str(c)] = 0
poscount["X"] = 0
poscount["Y"] = 0
poscount["MT"] = 0
covcount["X"] = 0
covcount["Y"] = 0
covcount["MT"] = 0
with open("../data/cytoband.hg19.custom.txt") as fp:
for line in fp:
if len(line)==0: continue
if line[0]=='#': continue
fields = line.split("\t")
chrom = fields[0]
val = fields[2]
chrom_giab = chrom[3:]
if chrom_giab == "M":
chrom_giab = "MT"
poscount[chrom_giab] = int(val)
with open("../data/HG001_GRCh37_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X-SOLID_CHROM1-X_v.3.3.2_highconf_nosomaticdel.bed") as fp:
for line in fp:
fields = line.split("\t")
chrom = fields[0]
s = int(fields[1])
e = int(fields[2])
covcount[chrom] += (e - s)
for x in poscount:
print x, covcount[x], "/", poscount[x], "(", float(covcount[x])/float(poscount[x]), ")"
Note that the BED file is 0 based reference whereas the VCF is 1 based reference. The BED file's end position is non-inclusive.
From the BED FAQ:
The first three required BED fields are:
1. chrom - The name of the chromosome (e.g. chr3, chrY, chr2_random) or scaffold (e.g. scaffold10671).
2. chromStart - The starting position of the feature in the chromosome or scaffold. The first base in a chromosome is numbered 0.
3. chromEnd - The ending position of the feature in the chromosome or scaffold. The chromEnd base is not included in the display of the feature. For example, the first 100 bases of a chromosome are defined as chromStart=0, chromEnd=100, and span the bases numbered 0-99.
References¶
Updated by Abram Connelly about 6 years ago
- Status changed from New to In Progress
From looking around and from talking to Brad Chapman, it looks like there's no "off-the-shelf" solution for creating a VCF with homozygous ref calls from a VCF and BED file. This would be the most expedient solution to create FastJs as we can convert VCF + BED to VCF with homozygous ref calls, then use the pasta
programs to convert.
Since it looks like no such tools exist, I think it'll be the quickest to create one. Brad Chapman said:
VCF + callable BED into a gVCF? I don't know of one. It seems reasonable to do, although you'd have limited way of assigning probabilities for the reference blocks.
The downloaded BED file from GIAB doesn't have quality scores so I'm a little hesitant to stuff one in. Some VCFs with homozygous reference have a period stuffed in for the QUAL
field (.
) but I'm not sure this is actually compliant with the VCF specification (v4.2).
Also, in order to convert to VCF with homozygous reference, there needs at least to be a reference genome specified (hg19
in FASTA format, say) to report the anchor reference base at the start blocks of the homozygous reference calls.
So there are two possibilities at this point:
- Convert VCF + BED to VCF with homozygous reference blocks in them
- Feed in VCF + BED to
pasta
to do the conversion directly to FastJ
Pros for creating a stand-alone tool to go from VCF + BED to VCF with homozygous reference calls:
- "more general" by creating VCF files that can be used elsewhere
- might be less work than updating
pasta
pasta
with VCF and a BED file might be too overspecialized so it's better to do it stand alone
Cons:
- might be just as much work to update
pasta
than to create a stand alone tool - might introduce more edge cases into the generated VCF
I'm leaning towards making a stand alone tool to do the conversion.
Here's an example usage:
vcf-bed-to-vcfhr -r hg19.fa.gz -b input.bed -Q 50 input.vcf.gz
Where hg19.fa.gz
is indexed, input.vcf.gz
is indexed and -Q
gives the quality value. The BED file can be read into memory and sorted. The END
value in the INFO
field in the resulting VCF will hold the end of the homozygous reference region. Maybe there should be a default GT
format with 0/0
? Maybe the number of samples should be specified (or forced to 1)?
References¶
Updated by Abram Connelly about 6 years ago
A tutorial on parsing BCF/VCF:
Updated by Abram Connelly about 6 years ago
I'm not sure this is the best way but I'm reading in through htslib
's bcf
streaming reader, converting to a VCF text line and emitting it.
The output has the form
2 388825 . C <NON-REF> . PASS END=389063 GT 0/0
I assume the BED file is uncompressed.
I assume the reference FASTA file is compressed, has an index and has the same chromosome naming scheme as the VCF and BED file. For the sample genome I was using, the Genome In A Bottle (GIAB) NA12878, the naming scheme was {{1,22},{X,Y,MT}}
.
I was worried that accessing the indexed FASTA file might not be fast enough but it looks to be reasonable.
Here is a very rough timing test on accessing the FASTA file using htslib:
$ zcat vcf.gz | wc 3775182 37751720 1807059769 $ cat bed | wc 599232 1797696 12496110
So let's choose to access the FASTA file 4,500,000 times. Though not ideal, for simplicity, we'll just
access the first 4.5M entries in chromosome 1 of the FASTA file, reading one character from the access:
#include <stdio.h>
#include <stdlib.h>
#include <errno.h>
#include <htslib/faidx.h>
#include <string>
int faidx_seq(faidx_t *fai, const char *chrom, int s0, int n, std::string &res) {
int len=0;
char *s, reg[1024];
sprintf(reg, "%s:%i-%i", chrom, s0+1, s0+n);
s = fai_fetch(fai, reg, &len);
if (!s) { return -1; }
if (len<0) { free(s); return -2; }
res = s;
free(s);
return 0;
}
int main(int argc, char **argv) {
int i, n = 4500000;
faidx_t *fai;
std::string chrom = "1", res;
if (argc<2) {
printf("provide FASTA file\n");
exit(-1);
}
fai = fai_load(argv[1]);
if (!fai) { fprintf(stderr, "error\n"); exit(-1); }
printf("starting\n");
for (i=0; i<n; i++) {
if ((i%100000)==0) { printf("# %i\n", i); fflush(stdout); }
faidx_seq(fai, chrom.c_str(), i, 1, res);
}
fai_destroy(fai);
printf("done\n");
}
On lightning-dev1
I needed to give explicit instructions on where the htslib
library was but otherwise a straight compilation and run:
$ g++ -O3 -L/var/local/lib ref-timing.c -o ref-timing -lz -lhts $ time ./ref-timing /path/to/human_g1k_v37.fa.gz starting ... done real 17m38.042s user 17m12.017s sys 0m25.984s
(some of the verbose output has been omitted for clarity).
So in all, about 20mins to access 4.5M entries, at least under this "synthetic" condition. I don't think streaming the whole reference sequence will be any faster. The conversion can be easily be parallelized if wall time is a concern.
Updated by Abram Connelly about 6 years ago
The scope has been limited to converting VCF + BED to VCF with homozygous reference lines in it. This can then be converted to PASTA using the pasta
tool. There is example code on how to convert VCF + BED to PASTA provided in the example
directory.
Doing a full conversion is a bit complicated and will required some processing at a higher level, much like converting gVCF or GFF files, so I think it's best to put this in another ticket and close out this ticket.
Updated by Jiayong Li about 6 years ago
$ git diff --name-only master cwl-version/.gitignore cwl-version/npy/cwl/cwl_steps/tiling_consol-npy.cwl cwl-version/npy/cwl/cwl_steps/tiling_create-npy.cwl cwl-version/npy/cwl/tiling_consol-npy.cwl cwl-version/npy/cwl/tiling_create-npy.cwl cwl-version/npy/cwl/tiling_npy-wf.cwl cwl-version/npy/run-cwl/submit_npy_workflow.sh cwl-version/npy/src/buildArvados/scripts/compilescript.sh cwl-version/npy/src/buildArvados/scripts/runcommand cwl-version/npy/yml/tiling_consol-npy.yml cwl-version/npy/yml/tiling_create-npy.yml cwl-version/npy/yml/tiling_npy-wf-allinputs.yml cwl-version/npy/yml/tiling_npy-wf.yml cwl-version/npy/yml/yml_steps/tiling_consol-npy.yml cwl-version/npy/yml/yml_steps/tiling_create-npy.yml tools/fjt/fjt.cpp tools/fjt/fjt_test.sh tools/fjt/testdata/035e.broken_endTag.fj tools/fjt/testdata/035e.broken_endTile.fj tools/fjt/testdata/035e.broken_md5sum.fj tools/fjt/testdata/035e.broken_n.fj tools/fjt/testdata/035e.broken_nocallCount.fj tools/fjt/testdata/035e.broken_seedTileLength.fj tools/fjt/testdata/035e.broken_startTag.fj tools/fjt/testdata/035e.broken_startTile.fj tools/fjt/testdata/035e.duplicate_tile.fj tools/fjt/testdata/035e.no_endTile.fj tools/fjt/testdata/035e.non_bool_startTile.fj tools/vcfbed2homref/Makefile tools/vcfbed2homref/README.md tools/vcfbed2homref/examples/run-pasta.sh tools/vcfbed2homref/run_tests.sh tools/vcfbed2homref/testdata/ref.fa.gz tools/vcfbed2homref/testdata/ref.fa.gz.fai tools/vcfbed2homref/testdata/ref.fa.gz.gzi tools/vcfbed2homref/testdata/small.bed tools/vcfbed2homref/testdata/small.vcf tools/vcfbed2homref/tests/check-vcf-bed.py tools/vcfbed2homref/vcfbed2homref.cpp
As far as I can tell the main code is in tools/vcfbed2homref, but could you explain the changes in cwl-version/npy a bit? Looks like it's some cwl changes, do you have any arvados runs links?
This is a minor thing, in tools/vcfbed2homref/README.md
Quick Start --- ``` make ./vcfbed2homref -r testdata/ref.fa.gz -b testdata/small.bed testdata/small.vcf | egrep -v '^#' | hea
I think you mean "head".
Updated by Sarah Zaranek about 6 years ago
I think this has to do this the fact that Abram's branch started before I merged code into the code base. All those cwl-version changes in the npy directory are mine.
Can we fix that somehow?
Updated by Jiayong Li about 6 years ago
Now that I think about it, Abram can probably do the following
git checkout master git pull git checkout 13336-vcf-bed-to-vcf-hom-ref git merge master
Updated by Abram Connelly about 6 years ago
The branch has been merged with master and the diff looks a lot cleaner.
The README has been updated.
The branch has been pushed and is ready for review again.
Updated by Abram Connelly about 6 years ago
- Status changed from Feedback to Resolved
Merged into master and pushed.