Story #13756

Convert 10 Harvard PGP GFF to CGF [CWL]

Added by Abram Connelly over 3 years ago. Updated over 3 years ago.

Assigned To:
Target version:
Start date:
Due date:
% Done:


Estimated time:
(Total: 0.00 h)
Story points:


Most of the components should be in place to do a full conversion from GFF to CGF as a common workflow language (CWL) pipeline. As an initial pass to make sure each of the components is working properly and to fix issues that come up, convert 10 Harvard PGP GFF files to CGF.

  • 10 Harvard GFF files have been taken from the website (build 37) and put into collection 6318c4d7099cb65b4577c5d0595dd412+2589
  • They have been pre-processed to be bgzip'd and indexed via tabix (the script that did the recompression and index is in the .scripts directory in the collection). This should eventually be a CWL pipeline but in the interest of expediency this was done as a pre-processing step to focus on the conversion workflow.
  • The 10 datasets will create a stand-alone SGLF tile library which will be used for the final conversion.
  • The workflow should consist of conversion with checks after each major step:
    - Convert GFF to FastJ
    - Check FastJ
    - Collect FastJ into SGLF
    - Check SGLF
    - Convert FastJ and SGLF to band and CGF
    - Check CGF against the original GFF

This will be a whole genome conversion for each dataset.

This ticket will be considered completed when the 10 datasets have been converted successfully (with tests passing) and with timing results reported.


Task #13915: Review #13756, branch 13756-test-10-harvard-pgp-gff-to-cgfClosedSarah Zaranek


#1 Updated by Abram Connelly over 3 years ago

The test 10 GFF were converted to FastJ and located in the collection 0e3e9a08517dbb61cc08838af755fb30+528752.

Test FastJ conversion workflow:

Command to submit:

arvados-cwl-runner \
  --submit \
  ./cwl/tiling_convert2fastj.cwl \


   class: Directory
   path: keep:6318c4d7099cb65b4577c5d0595dd412+2589

   class: File
   path: ./src/convert2fastjCWL


ref: "hg19" 

   class: File
   path: keep:ef70506d71ee761b1ec28d67290216a0+1252/hg19.fa.gz

   class: File
   path: keep:a2198905927b8d24e162abe42e541f37+173/assembly.00.hg19.fw.gz

   class: File
   path: keep:a2198905927b8d24e162abe42e541f37+173/assembly.00.hg19.fw.fwi


refM: "hg19" 

   class: File
   path: keep:ef70506d71ee761b1ec28d67290216a0+1252/hg19.fa.gz

   class: File
   path: keep:a2198905927b8d24e162abe42e541f37+173/assembly.00.hg19.fw.gz

   class: File
   path: keep:a2198905927b8d24e162abe42e541f37+173/assembly.00.hg19.fw.fwi

seqidM: "chrM" 


   class: File
   path: keep:6f0b48919ea97da91e3749f65969f03f+627/tagset.fa.gz

l7g: "l7g" 
pasta: "pasta" 
refstream: "refstream" 

#2 Updated by Abram Connelly over 3 years ago

The FastJ CWL check ran successfully with an output of 8e6bc439fd45e96de2404b328001796c+968

Command line submission:

arvados-cwl-runner \
  --submit \
  ./cwl/fastj-check_wf.cwl \


  class: File
  path: ./src/fastj-dir-check

  - class: Directory
    path: keep:0e3e9a08517dbb61cc08838af755fb30+528752/hu826751-GS03052-DNA_B01

  - class: Directory
    path: keep:0e3e9a08517dbb61cc08838af755fb30+528752/huC434ED-GS01669-DNA_D08

  - class: Directory
    path: keep:0e3e9a08517dbb61cc08838af755fb30+528752/hu34D5B9-GS01670-DNA_E02

  - class: Directory
    path: keep:0e3e9a08517dbb61cc08838af755fb30+528752/hu34D5B9-GS01173-DNA_C07

  - class: Directory
    path: keep:0e3e9a08517dbb61cc08838af755fb30+528752/hu119DFA-GS000022010-ASM

  - class: Directory
    path: keep:0e3e9a08517dbb61cc08838af755fb30+528752/huBFE0A8-GS000039670-ASM

  - class: Directory
    path: keep:0e3e9a08517dbb61cc08838af755fb30+528752/hu3C8DE9-GS000052488-DID

  - class: Directory
    path: keep:0e3e9a08517dbb61cc08838af755fb30+528752/hu1DD730-GS000037501-ASM

  - class: Directory
    path: keep:0e3e9a08517dbb61cc08838af755fb30+528752/hu2843C9-GS01669-DNA_D05

  - class: Directory
    path: keep:0e3e9a08517dbb61cc08838af755fb30+528752/huD10E53-GS01669-DNA_B04

outlogs: [ "hu119DFA-GS000022010-ASM-fjcheck.log", "hu1DD730-GS000037501-ASM-fjcheck.log", "hu2843C9-GS01669-DNA_D05-fjcheck.log", "hu34D5B9-GS01173-DNA_C07-fjcheck.log", "hu34D5B9-GS01670-DNA_E02-fjcheck.log", "hu3C8DE9-GS000052488-DID-fjcheck.log", "hu826751-GS03052-DNA_B01-fjcheck.log", "huBFE0A8-GS000039670-ASM-fjcheck.log", "huC434ED-GS01669-DNA_D08-fjcheck.log", "huD10E53-GS01669-DNA_B04-fjcheck.log" ]

#3 Updated by Abram Connelly over 3 years ago

SGLF creation workflow ran successfully with output collection 634b24da41fd95cd47760e3b4185cac9+34210

Submission script:

arvados-cwl-runner \
  --submit \
  ./cwl/tiling_createsglf-wf.cwl \


   class: File
   path: ./src/

pathmin: '0'

pathmax: '862'

   class: Directory
   path: keep:0e3e9a08517dbb61cc08838af755fb30+528752

   class: File
   path: keep:6f0b48919ea97da91e3749f65969f03f+627/tagset.fa.gz

nchunks: '5'

#4 Updated by Abram Connelly over 3 years ago

The SGLF check pipeline ran successfully with output collection 146d45a6406a763756e1e346a216f4c9+57.

Submission script:

arvados-cwl-runner \
  --submit \
  ./cwl/sglf-sanity-check.cwl \


  class: File
  location: ./src/sglf-sanity-check

  class: Directory
  location: keep:634b24da41fd95cd47760e3b4185cac9+34210

outFileName: "out.log" 

This leg of the step should really be run in parallel. I think putting an expression tool to batch and scatter the SGLF files (by 5, say) to feed into the actual check and then another expression tool to gather the results into a directory/single file would be appropriate.

#5 Updated by Abram Connelly over 3 years ago

When converting the FastJ to CGF, I noticed one of the datasets (hu34D5B9-GS01670-DNA_E02.cgf and hu34D5B9-GS01173-DNA_C07.cgf) had tilepath 862 as being all default with all nocall. Sometimes genomic data will not include the mitochondrial DNA section and this gets passed through as nocall but in this case, both of the original GFF datasets had chrM data. (CGF collection: b2e070c5b979351f7c560740319bd0e8+685)

Looking at the original GFF, it looks like there are spurious header lines (lines beginning with #) at the beginning of each chromosome. Not all datasets have this (such as hu826751-GS03052-DNA_B01) and is probably why I missed it before.

A new GFF collection has been created that has the "header" lines completely stripped out and can be found at c7965a47d3cc53fb342ee41168f9a680+2499. For completeness, here's the scrip that I used to strip the original and re-index:


rm -f ../.cleanup/*.tbi
rm -f ../.cleanup/*.gzi

for x in `find ../.cleanup -type f -name '*.gz' ` ; do
  bfn=`basename $x .gz`
  idir=`dirname $x`

  echo "$x ($bfn $idir)" 

  unpigz -c $x | egrep -v '^#' > $idir/$bfn
  rm -f $x
  bgzip -i $idir/$bfn
  tabix -f $x


Some things to note:

  • We should add this to the "preprocess" step to either check and convert or convert wholesale the GFF files to some standard format. This can be extended to other formats as well (I think we have some VCF preprocess pipelines around as well).
  • We should add some sort of sanity check for the FastJ to try and catch errors of this sort. Maybe adding the GFF to the "FastJ sanity check" so make sure that if a chromosome appears there's at least some non-trivial tiles.

The whole pipeline up to this point needs to be re-run.

#6 Updated by Abram Connelly over 3 years ago

Re-running with source input GFF collection c7965a47d3cc53fb342ee41168f9a680+2499 :

workflow total time amortized time (per dataset) _ output type output
GFF to FastJ ~55mins ~55mins FastJ a5713a7ea4c6ea265f1464d319287592+529231
FastJ check ~40mins ~40mins logs 0e42a97a0ff37d76dfed9bb5e950301c+968
Fastj to SGLF ~40mins ~70mins SGLF db712885dcf629e0f8367b1e9d50368d+34212
SGLF check ~220mins ~22mins logs 5d212b5d4a2c4cf036136bbc47e8a40f+57
FastJ + SGLF to CGF ~160mins ~16mins CGF 1aad87aef71bb06aba6b16c18a3ad1b2+684
CGF check ~40mins ~40mins logs 0aed6c07d6596ba34ac9342c21c8b4d4+1768

As a rough estimate, for these 10 datasets, it's about 4h run time for a full conversion.


  • stitch these together into a workflow
  • 'preprocess' the GFF to clean them up for this pipeline (take out spurious '^#' lines and index)

#7 Updated by Abram Connelly over 3 years ago

Using a modified "cgf to npy" pipeline, the runtime for the pipeline with the 10 test datasets was 8min. The requested machine was 16 cores with "ramMin: 100000 (100G+?), so a rough estimate of the amortized cost is ~15min.

The npy conversion will not scale as it requires all CGF and resulting npy arrays to fit into memory. I'm going to move on and we can come back to it later when we run into the scaling issues.

#8 Updated by Abram Connelly over 3 years ago

There was an error when trying to tie the different pieces of the pipeline together. While trying to do the SGLF creation by using a multistep pipeline that first joined a FastJ directory array into directory then 'chunked' paths as input to the SGLF creation, the pipeline failed with no discernible error message in the logs. Some portions of the 'scatter' pipeline would succeed and others would be 'cancelled' with no clear indication why.

Running each step of the pipeline individually would succeed but as soon as they were chained together, they would fail.

After talking to Jiayong, he suggested adding --submit-runner-ram 10000 to increase the runner ram which seemed to work.

Here are the relevant details:

arvados-cwl-runner --submit-runner-ram 10000 --api=containers --submit cwl/tiling_createsglf-wf-debug.cwl yml/test-fjdira-to-dir-to-sglf.yml
# Copyright (C) The Arvados Authors. All rights reserved.
# SPDX-License-Identifier: AGPL-3.0

  arv: "" 
  cwltool: "" 

cwlVersion: v1.0

class: Workflow

  - class: DockerRequirement
    dockerPull: arvados/l7g
  - class: InlineJavascriptRequirement
  - class: SubworkflowFeatureRequirement
  - class: ScatterFeatureRequirement

    keep_cache: 16384
    loadListing: shallow_listing

  bashscript: File?
  pathmin: string?
  pathmax: string?
#  datadir: Directory
  fjDirs: Directory[]
  tagset: File
  nchunks: string?
  fjcsv2sglf: ["null",File,string]
  fjt: ["null",File,string]

      type: array
        type: array
        items: File
    outputSource: step2/out1

    run: ../../cwl/expressiontool/dira-to-dir.cwl
      inDirs: fjDirs
    out: [out]

    run: ../../../../tilelib/cwl/cwl_steps/getpaths-et.cwl
      pathmin: pathmin
      pathmax: pathmax
      nchunks: nchunks
    out: [out1,out2]

     run: ../../../../tilelib/cwl/cwl_steps/createsglf-clt.cwl
     scatter: [tilepathmin, tilepathmax]
     scatterMethod: dotproduct
       bashscript: bashscript
       tilepathmin: step1/out1
       tilepathmax: step1/out2
       datadir: step0/out
       tagset: tagset
       fjcsv2sglf: fjcsv2sglf
       fjt: fjt
     out: [out1]

  - class: Directory
    location: keep:a5713a7ea4c6ea265f1464d319287592+529231/hu119DFA-GS000022010-ASM

  - class: Directory
    location: keep:a5713a7ea4c6ea265f1464d319287592+529231/hu1DD730-GS000037501-ASM

  - class: Directory
    location: keep:a5713a7ea4c6ea265f1464d319287592+529231/huC434ED-GS01669-DNA_D08

  - class: Directory
    location: keep:a5713a7ea4c6ea265f1464d319287592+529231/huD10E53-GS01669-DNA_B04

  - class: Directory
    location: keep:a5713a7ea4c6ea265f1464d319287592+529231/hu34D5B9-GS01173-DNA_C07

  - class: Directory
    location: keep:a5713a7ea4c6ea265f1464d319287592+529231/huBFE0A8-GS000039670-ASM

  - class: Directory
    location: keep:a5713a7ea4c6ea265f1464d319287592+529231/hu3C8DE9-GS000052488-DID

  - class: Directory
    location: keep:a5713a7ea4c6ea265f1464d319287592+529231/hu34D5B9-GS01670-DNA_E02

  - class: Directory
    location: keep:a5713a7ea4c6ea265f1464d319287592+529231/hu2843C9-GS01669-DNA_D05

  - class: Directory
    location: keep:a5713a7ea4c6ea265f1464d319287592+529231/hu826751-GS03052-DNA_B01

  class: File
  location: keep:6f0b48919ea97da91e3749f65969f03f+627/tagset.fa.gz

Hopefully the rest of the port should go better with this in mind.

#9 Updated by Abram Connelly over 3 years ago

A test pipeline for the 10 datasets was successfully run. The pipeline ran with output 59344db8316df1e900650eeaa487387e+787. Though there is a step that consolidates the CGF check logfiles into a single collection, I'm having trouble finding it in the Workbench display. Doing a manual check of the 25 output logs shows that the resulting CGF is consistent.

Pending review this ticket should be done.

#10 Updated by Abram Connelly over 3 years ago

I think the expressiontool CWL pipelines don't show up in workbench and that's why I was having trouble finding the log files file consolidated into a single collection.

After some refactoring and adding an explicit consolidation step to the CWL pipeline, I've run a test 10 that has all of the CGF along with it's log files into a single output collection.

The FastJ + SGLF to CGF conversion step is now batched and scattered to speed up wall time calculation.

This ticket should be in a state to close.

#11 Updated by Abram Connelly over 3 years ago

A test run of 100 was pushed through to test.

#12 Updated by Abram Connelly over 3 years ago

Changes have been pushed and merged

#13 Updated by Abram Connelly over 3 years ago

  • Status changed from In Progress to Closed

#14 Updated by Abram Connelly over 3 years ago

As a very very rough estimate, the cost should be in the ~$1.50 per genome processed.

The total runtime of the 100 datasets is 14d4h22m of node allocation time. As a rough estimate, this can be broken down into two major machine allocation types:

  • 16 vcpu with ~100Gb of RAM, using ~1h15m per batch of 16 tilepaths to create the tile library, for a total of ~68h of node allocation. At azure prices, the E16v3 instance most closely resembles those requirements at $1.064/hour
  • 1 vcpu at ~10Gb of RAM for everything else. The E2v3 instance most closely matches the constraints and costs $0.133/hour
(68 * 1.1064) + (14*24 + 4.5- 68)*0.133 = 111.4777

Since there are 100 processed, that works out to about $1.12 per genome.

Also available in: Atom PDF