Story #13425


Create CWL pipeline for VCF + BED file to gVCF conversion

Added by Abram Connelly over 4 years ago. Updated about 4 years ago.

Encapsulate the vcfbed2homref tool to create a CWL pipeline that converts an input VCF + BED file to a gVCF file (VCF 4.2+ with homozygous reference calls) as output.

There should be one VCF file output that is indexed.
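
As a rough illustration of the conversion described above, the sketch below derives hom-ref blocks by subtracting variant positions from BED intervals. It is not the vcfbed2homref implementation: the function names, the single-chromosome input, and the placeholder QUAL/FILTER/FORMAT values are all assumptions made for the example.

```python
def homref_blocks(bed_intervals, variants):
    """Return 1-based inclusive (start, end) hom-ref blocks.

    bed_intervals: sorted (start0, end0) pairs, 0-based half-open (BED convention)
    variants:      sorted (pos1, ref_len) pairs, 1-based (VCF convention)
    Both inputs are assumed to be on the same chromosome.
    """
    blocks = []
    for start0, end0 in bed_intervals:
        cursor = start0 + 1   # first 1-based position covered by the BED interval
        last = end0           # last 1-based position covered (inclusive)
        for pos, ref_len in variants:
            var_end = pos + ref_len - 1
            if var_end < cursor or pos > last:
                continue      # variant lies outside the remaining interval
            if pos > cursor:
                blocks.append((cursor, pos - 1))
            cursor = max(cursor, var_end + 1)
        if cursor <= last:
            blocks.append((cursor, last))
    return blocks


def format_block(chrom, start, end, ref_base, alt="<NON_REF>"):
    """Format one hom-ref block as a gVCF-style line (illustrative fields only)."""
    return "\t".join(
        [chrom, str(start), ".", ref_base, alt, ".", ".", f"END={end}", "GT", "0/0"]
    )


if __name__ == "__main__":
    # Hypothetical data: one BED interval and two variants (a SNV and a 2 bp REF).
    for start, end in homref_blocks([(0, 10)], [(3, 1), (6, 2)]):
        print(format_block("chr1", start, end, "N"))
```

The reference base is passed in explicitly here; a real conversion would look it up in the FASTA reference.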

Subtasks 1 (0 open, 1 closed)

Task #13470: review #13425 | Closed | Jiayong Li | 05/18/2018

Actions #1

Updated by Abram Connelly over 4 years ago

  • Status changed from New to In Progress
Actions #2

Updated by Abram Connelly over 4 years ago

Using the GIAB files to test. Taken from:


The GIAB VCF file is one big VCF containing all chromosomes (rather than one file per chromosome), so the CWL will be developed with this in mind. Once we start testing with a Veritas genome, we might need to change this to work on each chromosome independently, with a consolidation step afterwards.

Currently vcfbed2homref expects the BED file to be uncompressed and the VCF file to be compressed and indexed.

Actions #3

Updated by Abram Connelly over 4 years ago

Converting the GIAB file takes roughly 20 min (single CPU, minimal memory), and the resulting file is ~170Mb.

A cursory check with tabix seems to work fine: it correctly reports the lines whose range, from the start position to the END tag, overlaps the query.
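
To make the tabix check concrete, here is a minimal sketch of the overlap test it implies: a hom-ref block spans from POS to the END value in INFO, and it should be reported for any query region that overlaps that span. The function names are made up for this example, and this is not how tabix itself is implemented.

```python
def record_span(vcf_line):
    """Return (chrom, start, end) for a VCF data line, 1-based inclusive.

    If INFO carries an END= tag (as hom-ref blocks do), it sets the end;
    otherwise the end is derived from the length of REF.
    """
    fields = vcf_line.rstrip("\n").split("\t")
    chrom, pos, ref, info = fields[0], int(fields[1]), fields[3], fields[7]
    end = pos + len(ref) - 1
    for entry in info.split(";"):
        if entry.startswith("END="):
            end = int(entry[len("END="):])
    return chrom, pos, end


def overlaps(vcf_line, chrom, start, end):
    """True if the record's span overlaps the 1-based inclusive query region."""
    rec_chrom, rec_start, rec_end = record_span(vcf_line)
    return rec_chrom == chrom and rec_start <= end and rec_end >= start


if __name__ == "__main__":
    block = "chr1\t100\t.\tA\t<NON_REF>\t.\t.\tEND=200\tGT\t0/0"
    print(overlaps(block, "chr1", 150, 160))
    print(overlaps(block, "chr1", 201, 300))
```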

The next steps are:

  • Confirm that it works under the Docker image
  • Create the actual CWL pipeline and test submission
  • Push the resulting gVCF through a gVCF to FastJ CWL pipeline to make sure the resulting FastJ is correct
Actions #4

Updated by Abram Connelly over 4 years ago

For testing, I uploaded the 'release' directories available through the FTP site. I don't know why, but the "Chinese trio" only has a directory labelled "son".

I found some other references to GIAB data in Keep, but I didn't find the VCF plus BED files I needed.

The files were downloaded locally and then uploaded to Keep (~5Gb):

wget -r
arv-put --name GIAB-release --project-uuid su92l-j7d0g-vo06mmd9dg6f5bc

I've put the GIAB "release" data under a project called reference. The data collection itself is called GIAB-release.

I created the CWL along with a test YAML and submission script. The test YAML file includes the VCF and BED files in the keep collection as well as the FASTA reference file.

In keeping with our loose convention for CWL projects, the project is organized as:

  |--- cwl/
  |--- src/
  |--- yml/
  |--- cwl-run/

Here cwl contains the CWL file, src contains any scripts needed for this pipeline, yml contains a test YAML file, and cwl-run contains the script that submits the test CWL pipeline to Arvados.

The Docker image has been updated to include the vcfbed2homref executable.

The test data (NA12878 with its BED file) was run successfully and the output stored in su92l-4zz18-1bw0ykm3cia0x2z.

Requiring the FastJ conversion be successful might be out of scope for this ticket. I think it's best to close this ticket with what's available now and check the FastJ conversion works separately. If the FastJ conversion is unsuccessful, either because of quirks in the input data, a bad conversion or some other issue, it's probably best to open a new ticket.

Actions #5

Updated by Abram Connelly over 4 years ago

In addition, there should be a CWL workflow that scatters over multiple provided VCF files. The scatter workflow should be tested on NA12878 as well as one of the other datasets provided by GIAB.

Local testing shows that the gVCF can be converted to FastJ and passes some simple checks, so I'm happy to call this ticket closed once the scatter workflow has been created and tested.
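
The scatter behaviour described in this update can be pictured with the sketch below. It assumes the workflow pairs the i-th VCF with the i-th BED file, as CWL's dotproduct scatter method would; the ticket does not say which scatter method is used. `fake_convert` and the file names are hypothetical stand-ins, not the real conversion step.

```python
def scatter_dotproduct(vcf_files, bed_files, step):
    """Run `step` once per (VCF, BED) pair, pairing inputs by position."""
    if len(vcf_files) != len(bed_files):
        raise ValueError("dotproduct scatter needs input lists of equal length")
    return [step(vcf, bed) for vcf, bed in zip(vcf_files, bed_files)]


def fake_convert(vcf, bed):
    """Hypothetical stand-in for the conversion step: returns an output name."""
    sample = vcf.split(".")[0]
    return f"{sample}.g.vcf.gz"


if __name__ == "__main__":
    outputs = scatter_dotproduct(
        ["NA12878.vcf.gz", "HG004.vcf.gz"],
        ["NA12878.bed", "HG004.bed"],
        fake_convert,
    )
    print(outputs)
```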

Actions #6

Updated by Abram Connelly over 4 years ago

The gather step dumps all files flat into the output directory. If this is undesirable, we can put them in subdirectories labelled with their names.
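
The two layouts can be compared with a small sketch. The function name, the `by_sample` switch, and the file names are assumptions for illustration; the actual gather step is implemented in the CWL workflow, not in Python.

```python
from pathlib import PurePosixPath


def gather_paths(outputs, by_sample=False):
    """Compute destination paths for gathered outputs.

    outputs:   list of (sample_name, filename) pairs
    by_sample: False places every file flat in the output directory;
               True places each file in a subdirectory named after its sample.
    Raises ValueError if two outputs would land on the same path.
    """
    paths = []
    for sample, filename in outputs:
        path = str(PurePosixPath(sample) / filename) if by_sample else filename
        if path in paths:
            raise ValueError(f"output name collision: {path}")
        paths.append(path)
    return paths


if __name__ == "__main__":
    outs = [("NA12878", "NA12878.g.vcf.gz"), ("HG004", "HG004.g.vcf.gz")]
    print(gather_paths(outs))
    print(gather_paths(outs, by_sample=True))
```

The flat layout is simpler to browse, but it only works while output file names are unique across samples; subdirectories remove that constraint.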

Actions #7

Updated by Abram Connelly over 4 years ago

A collection containing two VCF + BED pairs converted to their respective gVCF files, made from the HG004 and NA12878 GIAB datasets, can be found here.

Actions #8

Updated by Jiayong Li about 4 years ago

Nitpicking: I noticed that in the gVCF you've created, the symbolic alt allele is "<NON-REF>" (i.e., this is a hom-ref block). In contrast, GATK uses "<NON_REF>" as the symbolic alt allele. Note that there are no standards regarding this in the VCF specs. In fact, freebayes uses "." as the symbolic alt allele. But it might be nice to conform to a well-known tool (e.g. GATK) for ease of downstream processing.
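
For files already written with the other spelling, a downstream consumer could normalize the ALT column itself. The sketch below shows one way to do that; `normalize_alt` is a hypothetical helper, not part of vcfbed2homref or GATK, and it only rewrites data lines, leaving any header declarations untouched.

```python
def normalize_alt(vcf_line, old="<NON-REF>", new="<NON_REF>"):
    """Rewrite the symbolic ALT allele on one VCF data line.

    Header lines (starting with '#') are returned unchanged, so a matching
    ##ALT header declaration, if present, would need separate handling.
    """
    if vcf_line.startswith("#"):
        return vcf_line
    newline = "\n" if vcf_line.endswith("\n") else ""
    fields = vcf_line.rstrip("\n").split("\t")
    # ALT is the fifth column and may hold several comma-separated alleles.
    alts = fields[4].split(",")
    fields[4] = ",".join(new if allele == old else allele for allele in alts)
    return "\t".join(fields) + newline


if __name__ == "__main__":
    line = "chr1\t1\t.\tA\t<NON-REF>\t.\t.\tEND=2\tGT\t0/0"
    print(normalize_alt(line))
```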

Actions #9

Updated by Abram Connelly about 4 years ago

vcfbed2homref now reports "<NON_REF>" by default, with a command-line option to override it if necessary. The arvados/l7g Docker container has been updated to reflect the change.

I ran the CWL pipeline with the new vcfbed2homref and a sample run of two GIAB datasets can be seen in the su92l-4zz18-co5isboy7rcxtwu collection.

Actions #10

Updated by Jiayong Li about 4 years ago

lgtm, please merge.

Actions #11

Updated by Abram Connelly about 4 years ago

  • Status changed from In Progress to Closed
