Convert 10 Harvard PGP GFF to CGF [CWL]
Most of the components should be in place to do a full conversion from GFF to CGF as a Common Workflow Language (CWL) pipeline. As an initial pass, to make sure each of the components is working properly and to fix issues that come up, convert 10 Harvard PGP GFF files to CGF.
- 10 Harvard GFF files have been taken from the website (build 37) and put into collection 6318c4d7099cb65b4577c5d0595dd412+2589
- They have been pre-processed to be bgzip'd and indexed via tabix (the script that did the recompression and indexing is in the .scripts directory in the collection). This should eventually be a CWL pipeline, but in the interest of expediency this was done as a pre-processing step to focus on the conversion workflow.
- The 10 datasets will create a stand-alone SGLF tile library which will be used for the final conversion.
- The workflow should consist of conversion with checks after each major step:
- Convert GFF to FastJ
- Check FastJ
- Collect FastJ into SGLF
- Check SGLF
- Convert FastJ and SGLF to band and CGF
- Check CGF against the original GFF
This will be a whole genome conversion for each dataset.
This ticket will be considered completed when the 10 datasets have been converted successfully (with tests passing) and with timing results reported.
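The steps above could eventually be stitched into a single CWL Workflow. As a rough sketch only: the step tool file names match those used later in this ticket, but the input/output port names (`fjdir`, `sglfdir`, `cgfdir`, etc.) and the final conversion tool name are placeholders, not the actual ones.

```yaml
cwlVersion: v1.0
class: Workflow
inputs:
  gffdir: Directory        # pre-processed (bgzip'd + tabix'd) GFF collection
  tagset: File             # tagset.fa.gz
outputs:
  cgf:
    type: Directory
    outputSource: convert-cgf/cgfdir   # placeholder port name
steps:
  convert-fastj:
    run: tiling_convert2fastj.cwl
    in: { refdirectory: gffdir }
    out: [fjdir]                       # placeholder
  check-fastj:
    run: fastj-check_wf.cwl
    in: { fastjDirs: convert-fastj/fjdir }
    out: [fjlogs]                      # placeholder
  create-sglf:
    run: tiling_createsglf-wf.cwl
    in: { datadir: convert-fastj/fjdir, tagset: tagset }
    out: [sglfdir]                     # placeholder
  check-sglf:
    run: sglf-sanity-check.cwl
    in: { sglfDir: create-sglf/sglfdir }
    out: [sglflog]                     # placeholder
  convert-cgf:
    run: tiling_convert2cgf.cwl        # hypothetical file name
    in: { fjdir: convert-fastj/fjdir, sglf: create-sglf/sglfdir }
    out: [cgfdir]                      # placeholder
```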
#1 Updated by Abram Connelly over 3 years ago
The 10 test GFFs were converted to FastJ; the results are in collection 0e3e9a08517dbb61cc08838af755fb30+528752.
Test FastJ conversion workflow: https://workbench.su92l.arvadosapi.com/pipeline_instances/su92l-d1hrv-68n3xa6w7y83hgd
Command to submit:
arvados-cwl-runner \
  --submit \
  ./cwl/tiling_convert2fastj.cwl \
  ./test10.yml
refdirectory:
  class: Directory
  path: keep:6318c4d7099cb65b4577c5d0595dd412+2589
bashscript:
  class: File
  path: ./src/convert2fastjCWL
####
####
ref: "hg19"
reffa:
  class: File
  path: keep:ef70506d71ee761b1ec28d67290216a0+1252/hg19.fa.gz
afn:
  class: File
  path: keep:a2198905927b8d24e162abe42e541f37+173/assembly.00.hg19.fw.gz
aidx:
  class: File
  path: keep:a2198905927b8d24e162abe42e541f37+173/assembly.00.hg19.fw.fwi
####
####
refM: "hg19"
reffaM:
  class: File
  path: keep:ef70506d71ee761b1ec28d67290216a0+1252/hg19.fa.gz
afnM:
  class: File
  path: keep:a2198905927b8d24e162abe42e541f37+173/assembly.00.hg19.fw.gz
aidxM:
  class: File
  path: keep:a2198905927b8d24e162abe42e541f37+173/assembly.00.hg19.fw.fwi
seqidM: "chrM"
####
####
tagdir:
  class: File
  path: keep:6f0b48919ea97da91e3749f65969f03f+627/tagset.fa.gz
l7g: "l7g"
pasta: "pasta"
refstream: "refstream"
#2 Updated by Abram Connelly over 3 years ago
Command line submission:
arvados-cwl-runner \
  --submit \
  ./cwl/fastj-check_wf.cwl \
  ./checkfj10.yml
script:
  class: File
  path: ./src/fastj-dir-check
fastjDirs:
  - class: Directory
    path: keep:0e3e9a08517dbb61cc08838af755fb30+528752/hu826751-GS03052-DNA_B01
  - class: Directory
    path: keep:0e3e9a08517dbb61cc08838af755fb30+528752/huC434ED-GS01669-DNA_D08
  - class: Directory
    path: keep:0e3e9a08517dbb61cc08838af755fb30+528752/hu34D5B9-GS01670-DNA_E02
  - class: Directory
    path: keep:0e3e9a08517dbb61cc08838af755fb30+528752/hu34D5B9-GS01173-DNA_C07
  - class: Directory
    path: keep:0e3e9a08517dbb61cc08838af755fb30+528752/hu119DFA-GS000022010-ASM
  - class: Directory
    path: keep:0e3e9a08517dbb61cc08838af755fb30+528752/huBFE0A8-GS000039670-ASM
  - class: Directory
    path: keep:0e3e9a08517dbb61cc08838af755fb30+528752/hu3C8DE9-GS000052488-DID
  - class: Directory
    path: keep:0e3e9a08517dbb61cc08838af755fb30+528752/hu1DD730-GS000037501-ASM
  - class: Directory
    path: keep:0e3e9a08517dbb61cc08838af755fb30+528752/hu2843C9-GS01669-DNA_D05
  - class: Directory
    path: keep:0e3e9a08517dbb61cc08838af755fb30+528752/huD10E53-GS01669-DNA_B04
outlogs:
  - "hu119DFA-GS000022010-ASM-fjcheck.log"
  - "hu1DD730-GS000037501-ASM-fjcheck.log"
  - "hu2843C9-GS01669-DNA_D05-fjcheck.log"
  - "hu34D5B9-GS01173-DNA_C07-fjcheck.log"
  - "hu34D5B9-GS01670-DNA_E02-fjcheck.log"
  - "hu3C8DE9-GS000052488-DID-fjcheck.log"
  - "hu826751-GS03052-DNA_B01-fjcheck.log"
  - "huBFE0A8-GS000039670-ASM-fjcheck.log"
  - "huC434ED-GS01669-DNA_D08-fjcheck.log"
  - "huD10E53-GS01669-DNA_B04-fjcheck.log"
#3 Updated by Abram Connelly over 3 years ago
arvados-cwl-runner \
  --submit \
  ./cwl/tiling_createsglf-wf.cwl \
  create10.yml
bashscript:
  class: File
  path: ./src/create-tilelib.sh
pathmin: '0'
pathmax: '862'
datadir:
  class: Directory
  path: keep:0e3e9a08517dbb61cc08838af755fb30+528752
tagset:
  class: File
  path: keep:6f0b48919ea97da91e3749f65969f03f+627/tagset.fa.gz
nchunks: '5'
#4 Updated by Abram Connelly over 3 years ago
arvados-cwl-runner \
  --submit \
  ./cwl/sglf-sanity-check.cwl \
  ./check-sglf10.yml
script:
  class: File
  location: ./src/sglf-sanity-check
sglfDir:
  class: Directory
  location: keep:634b24da41fd95cd47760e3b4185cac9+34210
outFileName: "out.log"
This step should really be run in parallel. I think it would be appropriate to put an ExpressionTool in place to batch and scatter the SGLF files (in groups of 5, say) into the actual check, with another ExpressionTool to gather the results into a directory/single file.
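A generic sketch of the batch-by-5 idea (the file names below are dummies created for the demo, and `echo` stands in for the actual check tool):

```bash
#!/bin/bash
# Sketch only: batch files in groups of 5, one check invocation per batch.
set -e
d=/tmp/sglf-batch-demo
mkdir -p "$d" && cd "$d"
# stand-ins for real SGLF chunks
for i in $(seq -w 0 11); do touch "path.$i.sglf.gz"; done
# group by 5; in the real pipeline each output line would feed one
# scattered check job
ls *.sglf.gz | xargs -n 5 echo > batches.txt
cat batches.txt    # 12 files -> 3 batches (5, 5, 2)
```

In CWL terms, each line of `batches.txt` corresponds to one element of a scattered input array; the gather step then concatenates the per-batch logs.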
#5 Updated by Abram Connelly over 3 years ago
When converting the FastJ to CGF, I noticed that one of the datasets (hu34D5B9-GS01173-DNA_C07.cgf) had tilepath 862 as all default, with all nocall. Sometimes genomic data will not include the mitochondrial DNA section and this gets passed through as nocall, but in this case both of the original GFF datasets had chrM data. (CGF collection: b2e070c5b979351f7c560740319bd0e8+685)
Looking at the original GFF, it looks like there are spurious header lines (lines beginning with #) at the beginning of each chromosome. Not all datasets have this (such as hu826751-GS03052-DNA_B01), which is probably why I missed it before.
A new GFF collection has been created with the "header" lines completely stripped out; it can be found at c7965a47d3cc53fb342ee41168f9a680+2499. For completeness, here's the script that I used to strip the originals and re-index:
#!/bin/bash

rm -f ../.cleanup/*.tbi
rm -f ../.cleanup/*.gzi

for x in `find ../.cleanup -type f -name '*.gz'` ; do
  bfn=`basename $x .gz`
  idir=`dirname $x`
  echo "$x ($bfn $idir)"
  unpigz -c $x | egrep -v '^#' > $idir/$bfn
  rm -f $x
  bgzip -i $idir/$bfn
  tabix -f $x
done
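A quick way to verify the stripping worked (my addition, not part of the collection's script; plain gzip is used here for the demo, though bgzip'd files are also gzip-readable):

```bash
#!/bin/bash
# Demo: strip '^#' lines from a sample GFF, then confirm no header lines
# survive in any compressed file.
set -e
d=/tmp/gff-strip-demo
mkdir -p "$d"
printf '#spurious header\nchr1\tsrc\tfeat\t1\t10\t.\t+\t.\tx\n' > "$d/sample.gff"
grep -v '^#' "$d/sample.gff" | gzip > "$d/sample.gff.gz"
# count surviving header lines across the compressed files; should print 0
zcat "$d"/*.gff.gz | grep -c '^#' || true
```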
Some things to note:
- We should add a "preprocess" step that either checks and converts or wholesale converts the GFF files to some standard format. This can be extended to other formats as well (I think we have some VCF preprocess pipelines around, too).
- We should add some sort of sanity check for the FastJ to try to catch errors of this sort. Maybe adding the GFF to the "FastJ sanity check" to make sure that if a chromosome appears there are at least some non-trivial tiles.
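One cheap form such a check could take (a sketch, not the actual fastj-dir-check script): count non-header GFF records per chromosome, so a chromosome that is present but empty, which would surface downstream as an all-nocall tilepath, gets flagged before conversion.

```bash
#!/bin/bash
# Sketch: per-chromosome record counts from a GFF; zero records for an
# expected chromosome (e.g. chrM) would warrant a closer look.
set -e
d=/tmp/gff-chrom-demo
mkdir -p "$d"
printf '#header\nchrM\ts\tf\t1\t9\nchr1\ts\tf\t1\t9\nchr1\ts\tf\t10\t19\n' > "$d/sample.gff"
awk '!/^#/ { n[$1]++ } END { for (c in n) printf "%s\t%d\n", c, n[c] }' "$d/sample.gff" | sort
```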
The whole pipeline up to this point needs to be re-run.
#6 Updated by Abram Connelly over 3 years ago
Re-running with source input GFF collection c7965a47d3cc53fb342ee41168f9a680+2499 :
| workflow | total time | amortized time (per dataset) | output type | output |
|---|---|---|---|---|
| GFF to FastJ | ~55 mins | ~55 mins | FastJ | a5713a7ea4c6ea265f1464d319287592+529231 |
| FastJ to SGLF | ~40 mins | ~70 mins | SGLF | db712885dcf629e0f8367b1e9d50368d+34212 |
| FastJ + SGLF to CGF | ~160 mins | ~16 mins | CGF | 1aad87aef71bb06aba6b16c18a3ad1b2+684 |
As a rough estimate, for these 10 datasets it's about 4h of run time for a full conversion.
Still to do:
- stitch these steps together into a single workflow
- 'preprocess' the GFF to clean them up for this pipeline (take out spurious '^#' lines and index)
#7 Updated by Abram Connelly over 3 years ago
Using a modified "cgf to npy" pipeline, the runtime for the pipeline with the 10 test datasets was 8 min. The requested machine was 16 cores with ramMin: 100000 (100G+?), so a rough estimate of the amortized cost is ~15 min.
The npy conversion will not scale, as it requires all CGFs and the resulting npy arrays to fit into memory. I'm going to move on; we can come back to it later when we run into scaling issues.
#8 Updated by Abram Connelly over 3 years ago
There was an error when trying to tie the different pieces of the pipeline together. While doing the SGLF creation with a multistep pipeline that first joined a FastJ directory array into a single directory and then 'chunked' tilepaths as input to the SGLF creation, the pipeline failed with no discernible error message in the logs. Some portions of the 'scatter' pipeline would succeed and others would be 'cancelled' with no clear indication why.
Running each step of the pipeline individually would succeed but as soon as they were chained together, they would fail.
After talking to Jiayong, he suggested adding --submit-runner-ram 10000 to increase the runner RAM, which seemed to work.
Here are the relevant details:
arvados-cwl-runner \
  --submit-runner-ram 10000 \
  --api=containers \
  --submit \
  cwl/tiling_createsglf-wf-debug.cwl \
  yml/test-fjdira-to-dir-to-sglf.yml
fjDirs:
  - class: Directory
    location: keep:a5713a7ea4c6ea265f1464d319287592+529231/hu119DFA-GS000022010-ASM
  - class: Directory
    location: keep:a5713a7ea4c6ea265f1464d319287592+529231/hu1DD730-GS000037501-ASM
  - class: Directory
    location: keep:a5713a7ea4c6ea265f1464d319287592+529231/huC434ED-GS01669-DNA_D08
  - class: Directory
    location: keep:a5713a7ea4c6ea265f1464d319287592+529231/huD10E53-GS01669-DNA_B04
  - class: Directory
    location: keep:a5713a7ea4c6ea265f1464d319287592+529231/hu34D5B9-GS01173-DNA_C07
  - class: Directory
    location: keep:a5713a7ea4c6ea265f1464d319287592+529231/huBFE0A8-GS000039670-ASM
  - class: Directory
    location: keep:a5713a7ea4c6ea265f1464d319287592+529231/hu3C8DE9-GS000052488-DID
  - class: Directory
    location: keep:a5713a7ea4c6ea265f1464d319287592+529231/hu34D5B9-GS01670-DNA_E02
  - class: Directory
    location: keep:a5713a7ea4c6ea265f1464d319287592+529231/hu2843C9-GS01669-DNA_D05
  - class: Directory
    location: keep:a5713a7ea4c6ea265f1464d319287592+529231/hu826751-GS03052-DNA_B01
tagset:
  class: File
  location: keep:6f0b48919ea97da91e3749f65969f03f+627/tagset.fa.gz
Hopefully the rest of the port will go better with this in mind.
#9 Updated by Abram Connelly over 3 years ago
A test pipeline for the 10 datasets ran successfully, with output 59344db8316df1e900650eeaa487387e+787. There is a step that consolidates the CGF check logfiles into a single collection, but I'm having trouble finding it in the Workbench display. A manual check of the 25 output logs shows that the resulting CGF is consistent.
Pending review this ticket should be done.
#10 Updated by Abram Connelly over 3 years ago
I think ExpressionTool CWL steps don't show up in Workbench, and that's why I was having trouble finding the log files consolidated into a single collection.
After some refactoring and adding an explicit consolidation step to the CWL pipeline, I've run a test of the 10 datasets that puts all of the CGF, along with its log files, into a single output collection.
The FastJ + SGLF to CGF conversion step is now batched and scattered to reduce wall-clock time.
This ticket should be in a state to close.
#14 Updated by Abram Connelly over 3 years ago
As a very rough estimate, the cost should be around ~$1.50 per genome processed.
The total runtime of the 100 datasets is 14d4h22m of node allocation time. As a rough estimate, this can be broken down into two major machine allocation types:
- 16 vcpu with ~100 GB of RAM, using ~1h15m per batch of 16 tilepaths to create the tile library, for a total of ~68h of node allocation. At Azure prices, the E16v3 instance most closely resembles those requirements, at $1.064/hour
- 1 vcpu with ~10 GB of RAM for everything else. The E2v3 instance most closely matches those constraints and costs $0.133/hour
(68 * 1.064) + (14*24 + 4.5 - 68) * 0.133 = 72.35 + 36.24 = 108.59
Since there are 100 genomes processed, that works out to about $1.09 per genome.
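As a cross-check, the node-allocation arithmetic can be recomputed with awk, using the quoted $1.064/hr (E16v3) and $0.133/hr (E2v3) rates:

```bash
# 68 h at the E16v3 rate plus the remaining node time at the E2v3 rate;
# 14d4h22m of total allocation is approximated as 14*24 + 4.5 hours
awk 'BEGIN { printf "%.2f\n", 68*1.064 + (14*24 + 4.5 - 68)*0.133 }'
# -> 108.59, i.e. roughly $1.09 per genome over 100 datasets
```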