Project

General

Profile

Actions

Idea #13336

closed

Implement VCF with BED file conversion to FastJ

Added by Abram Connelly about 6 years ago. Updated almost 6 years ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Target version:
-
Story points:
-

Description

Write a program to convert VCF with a BED file containing homozygous reference regions to FastJ.

Use the Genome In A Bottle (GIAB) datasets to test.


Subtasks 1 (0 open1 closed)

Task #13378: review #13336ResolvedJiayong Li04/23/2018Actions
Actions #1

Updated by Abram Connelly about 6 years ago

Taking files from:

Specifically:

A quick spot check on coverage using the BED file:

1 204436417 / 249250621 ( 0.820204243343 )
3 184024995 / 198022430 ( 0.929313891361 )
2 219872344 / 243199373 ( 0.90408269268 )
5 164352129 / 180915260 ( 0.908448126488 )
4 175594424 / 191154276 ( 0.918600554873 )
7 134734010 / 159138663 ( 0.846645356069 )
6 154589912 / 171115067 ( 0.903426651494 )
9 101519611 / 141213431 ( 0.718909032102 )
8 134167881 / 146364022 ( 0.916672548121 )
11 119883302 / 135006516 ( 0.887981599347 )
10 117273155 / 135534747 ( 0.865262654749 )
13 89047037 / 115169878 ( 0.773179919492 )
12 122017004 / 133851895 ( 0.91158219314 )
15 70672978 / 102531392 ( 0.689281366628 )
14 82035103 / 107349540 ( 0.764186814401 )
17 66612606 / 81195210 ( 0.820400686198 )
16 67025077 / 90354753 ( 0.741799128154 )
19 46341023 / 59128983 ( 0.78372771945 )
18 70100247 / 78077248 ( 0.897831939466 )
20 54881000 / 63025520 ( 0.870774251446 )
21 30640180 / 48129895 ( 0.636614312165 )
22 28087336 / 51304566 ( 0.547462695621 )
X 137156694 / 155270560 ( 0.88333998409 )
Y 0 / 59373566 ( 0.0 )
MT 0 / 16571 ( 0.0 )

Y and MT aren't present in the VCF. Numerator is taken from adding the deltas from the BED file, denominator is taken from the cytoband.hg19 file we have laying around (Here's one possible source).

In case it's needed, the quick and dirty script, assuming file locations are correctly set up (the above output has been reordered by hand for readability):


#!/usr/bin/python

covcount = {}
poscount = {}
for c in range(1,23):
  poscount[str(c)] = 0
  covcount[str(c)] = 0
poscount["X"] = 0
poscount["Y"] = 0
poscount["MT"] = 0

covcount["X"] = 0
covcount["Y"] = 0
covcount["MT"] = 0

with open("../data/cytoband.hg19.custom.txt") as fp:
  for line in fp:
    if len(line)==0: continue
    if line[0]=='#': continue
    fields = line.split("\t")
    chrom = fields[0]
    val = fields[2]

    chrom_giab = chrom[3:]
    if chrom_giab == "M":
      chrom_giab = "MT" 

    poscount[chrom_giab] = int(val)

with open("../data/HG001_GRCh37_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X-SOLID_CHROM1-X_v.3.3.2_highconf_nosomaticdel.bed") as fp:
  for line in fp:
    fields = line.split("\t")
    chrom = fields[0]
    s = int(fields[1])
    e = int(fields[2])
    covcount[chrom] += (e - s)

for x in poscount:
  print x, covcount[x], "/", poscount[x], "(", float(covcount[x])/float(poscount[x]), ")" 

Note that the BED file is 0 based reference whereas the VCF is 1 based reference. The BED file's end position is non-inclusive.

From the BED FAQ:

The first three required BED fields are:
1. chrom - The name of the chromosome (e.g. chr3, chrY, chr2_random) or scaffold (e.g. scaffold10671).
2. chromStart - The starting position of the feature in the chromosome or scaffold. The first base in a chromosome is numbered 0.
3. chromEnd - The ending position of the feature in the chromosome or scaffold. The chromEnd base is not included in the display of the feature. For example, the first 100 bases of a chromosome are defined as chromStart=0, chromEnd=100, and span the bases numbered 0-99.

References

Actions #2

Updated by Abram Connelly about 6 years ago

  • Status changed from New to In Progress

From looking around and from talking to Brad Chapman, it looks like there's no "off-the-shelf" solution for creating a VCF with homozygous ref calls from a VCF and BED file. This would be the most expedient solution to create FastJs as we can convert VCF + BED to VCF with homozygous ref calls, then use the pasta programs to convert.

Since it looks like no such tools exist, I think it'll be the quickest to create one. Brad Chapman said:

VCF + callable BED into a gVCF? I don't know of one. It seems reasonable to do, although you'd have limited way of assigning probabilities for the reference blocks.

The downloaded BED file from GIAB doesn't have quality scores so I'm a little hesitant to stuff one in. Some VCFs with homozygous reference have a period stuffed in for the QUAL field (.) but I'm not sure this is actually compliant with the VCF specification (v4.2).

Also, in order to convert to VCF with homozygous reference, there needs at least to be a reference genome specified (hg19 in FASTA format, say) to report the anchor reference base at the start blocks of the homozygous reference calls.

So there are two possibilities at this point:

  • Convert VCF + BED to VCF with homozygous reference blocks in them
  • Feed in VCF + BED to pasta to do the conversion directly to FastJ

Pros for creating a stand-alone tool to go from VCF + BED to VCF with homozygous reference calls:

  • "more general" by creating VCF files that can be used elsewhere
  • might be less work than updating pasta
  • pasta with VCF and a BED file might be too overspecialized so it's better to do it stand alone

Cons:

  • might be just as much work to update pasta than to create a stand alone tool
  • might introduce more edge cases into the generated VCF

I'm leaning towards making a stand alone tool to do the conversion.

Here's an example usage:

vcf-bed-to-vcfhr -r hg19.fa.gz -b input.bed -Q 50 input.vcf.gz

Where hg19.fa.gz is indexed, input.vcf.gz is indexed and -Q gives the quality value. The BED file can be read into memory and sorted. The END value in the INFO field in the resulting VCF will hold the end of the homozygous reference region. Maybe there should be a default GT format with 0/0? Maybe the number of samples should be specified (or forced to 1)?

References

Actions #3

Updated by Abram Connelly about 6 years ago

A tutorial on parsing BCF/VCF:

Actions #4

Updated by Abram Connelly about 6 years ago

I'm not sure this is the best way but I'm reading in through htslib's bcf streaming reader, converting to a VCF text line and emitting it.

The output has the form

2       388825  .       C       <NON-REF>       .       PASS    END=389063      GT      0/0

I assume the BED file is uncompressed.

I assume the reference FASTA file is compressed, has an index and has the same chromosome naming scheme as the VCF and BED file. For the sample genome I was using, the Genome In A Bottle (GIAB) NA12878, the naming scheme was {{1,22},{X,Y,MT}}.

I was worried that accessing the indexed FASTA file might not be fast enough but it looks to be reasonable.

Here is a very rough timing test on accessing the FASTA file using htslib:

$ zcat vcf.gz | wc
3775182 37751720 1807059769
$ cat bed  | wc
 599232 1797696 12496110

So let's choose to access the FASTA file 4,500,000 times. Though not ideal, for simplicity, we'll just
access the first 4.5M entries in chromosome 1 of the FASTA file, reading one character from the access:


#include <stdio.h>
#include <stdlib.h>
#include <errno.h>

#include <htslib/faidx.h>

#include <string>

int faidx_seq(faidx_t *fai, const char *chrom, int s0, int n, std::string &res) {
  int len=0;
  char *s, reg[1024];
  sprintf(reg, "%s:%i-%i", chrom, s0+1, s0+n);
  s = fai_fetch(fai, reg, &len);
  if (!s) { return -1; }
  if (len<0) { free(s); return -2; }

  res = s;

  free(s);
  return 0;
}

int main(int argc, char **argv) {
  int i, n = 4500000;
  faidx_t *fai;
  std::string chrom = "1", res;

  if (argc<2) {
    printf("provide FASTA file\n");
    exit(-1);
  }

  fai = fai_load(argv[1]);
  if (!fai) { fprintf(stderr, "error\n"); exit(-1); }

  printf("starting\n");

  for (i=0; i<n; i++) {
    if ((i%100000)==0) { printf("# %i\n", i); fflush(stdout); }
    faidx_seq(fai, chrom.c_str(), i, 1, res);
  }

  fai_destroy(fai);

  printf("done\n");
}

On lightning-dev1 I needed to give explicit instructions on where the htslib library was but otherwise a straight compilation and run:

$ g++ -O3 -L/var/local/lib ref-timing.c -o ref-timing -lz -lhts
$ time ./ref-timing /path/to/human_g1k_v37.fa.gz
starting
...
done

real    17m38.042s
user    17m12.017s
sys     0m25.984s

(some of the verbose output has been omitted for clarity).

So in all, about 20mins to access 4.5M entries, at least under this "synthetic" condition. I don't think streaming the whole reference sequence will be any faster. The conversion can be easily be parallelized if wall time is a concern.

Actions #5

Updated by Abram Connelly almost 6 years ago

The scope has been limited to converting VCF + BED to VCF with homozygous reference lines in it. This can then be converted to PASTA using the pasta tool. There is example code on how to convert VCF + BED to PASTA provided in the example directory.

Doing a full conversion is a bit complicated and will required some processing at a higher level, much like converting gVCF or GFF files, so I think it's best to put this in another ticket and close out this ticket.

Actions #6

Updated by Jiayong Li almost 6 years ago

$ git diff --name-only master 
cwl-version/.gitignore
cwl-version/npy/cwl/cwl_steps/tiling_consol-npy.cwl
cwl-version/npy/cwl/cwl_steps/tiling_create-npy.cwl
cwl-version/npy/cwl/tiling_consol-npy.cwl
cwl-version/npy/cwl/tiling_create-npy.cwl
cwl-version/npy/cwl/tiling_npy-wf.cwl
cwl-version/npy/run-cwl/submit_npy_workflow.sh
cwl-version/npy/src/buildArvados/scripts/compilescript.sh
cwl-version/npy/src/buildArvados/scripts/runcommand
cwl-version/npy/yml/tiling_consol-npy.yml
cwl-version/npy/yml/tiling_create-npy.yml
cwl-version/npy/yml/tiling_npy-wf-allinputs.yml
cwl-version/npy/yml/tiling_npy-wf.yml
cwl-version/npy/yml/yml_steps/tiling_consol-npy.yml
cwl-version/npy/yml/yml_steps/tiling_create-npy.yml
tools/fjt/fjt.cpp
tools/fjt/fjt_test.sh
tools/fjt/testdata/035e.broken_endTag.fj
tools/fjt/testdata/035e.broken_endTile.fj
tools/fjt/testdata/035e.broken_md5sum.fj
tools/fjt/testdata/035e.broken_n.fj
tools/fjt/testdata/035e.broken_nocallCount.fj
tools/fjt/testdata/035e.broken_seedTileLength.fj
tools/fjt/testdata/035e.broken_startTag.fj
tools/fjt/testdata/035e.broken_startTile.fj
tools/fjt/testdata/035e.duplicate_tile.fj
tools/fjt/testdata/035e.no_endTile.fj
tools/fjt/testdata/035e.non_bool_startTile.fj
tools/vcfbed2homref/Makefile
tools/vcfbed2homref/README.md
tools/vcfbed2homref/examples/run-pasta.sh
tools/vcfbed2homref/run_tests.sh
tools/vcfbed2homref/testdata/ref.fa.gz
tools/vcfbed2homref/testdata/ref.fa.gz.fai
tools/vcfbed2homref/testdata/ref.fa.gz.gzi
tools/vcfbed2homref/testdata/small.bed
tools/vcfbed2homref/testdata/small.vcf
tools/vcfbed2homref/tests/check-vcf-bed.py
tools/vcfbed2homref/vcfbed2homref.cpp

As far as I can tell the main code is in tools/vcfbed2homref, but could you explain the changes in cwl-version/npy a bit? Looks like it's some cwl changes, do you have any arvados runs links?

This is a minor thing, in tools/vcfbed2homref/README.md

Quick Start
---

```
make
./vcfbed2homref -r testdata/ref.fa.gz -b testdata/small.bed testdata/small.vcf | egrep -v '^#' | hea

I think you mean "head".

Actions #7

Updated by Sarah Zaranek almost 6 years ago

I think this has to do this the fact that Abram's branch started before I merged code into the code base. All those cwl-version changes in the npy directory are mine.

Can we fix that somehow?

Actions #8

Updated by Jiayong Li almost 6 years ago

Now that I think about it, Abram can probably do the following

git checkout master
git pull
git checkout 13336-vcf-bed-to-vcf-hom-ref
git merge master

Actions #9

Updated by Abram Connelly almost 6 years ago

The branch has been merged with master and the diff looks a lot cleaner.

The README has been updated.

The branch has been pushed and is ready for review again.

Actions #10

Updated by Jiayong Li almost 6 years ago

  • Status changed from In Progress to Feedback

lgtm

Actions #11

Updated by Abram Connelly almost 6 years ago

  • Status changed from Feedback to Resolved

Merged into master and pushed.

Actions

Also available in: Atom PDF