Task #13671
closed
FastJ and SGLF to CGF CWL pipeline
Description
Create/update/cleanup the CWL pipeline to go from FastJ and SGLF to CGF (v3).
Most likely the majority of the work for this has been done and it's mostly about cleaning up, running some small tests and documenting.
- Create/update the programs and scripts to create the CGF
- Run for some small test datasets (2-10) to make sure the CGF creation runs without issue
- Add some documentation on the pipeline
Updated by Abram Connelly almost 6 years ago
Tested on two Harvard PGP datasets, mitochondrial DNA only (tile path 0x35e):
Output:
To see the encoding:
$ cd ~/keep/by_id/b874d1958927fb78e518eb9923d67478+224
$ ls
cwl.output.json hu34D5B9-GS01173-DNA_C07.cgf hu826751-GS03052-DNA_B01.cgf
$ cgft -b 862 hu34D5B9-GS01173-DNA_C07.cgf
[ 28 17 -1 0 0 0 0 -1 0 0 0 186 16 -1 6 0 0 0 0 0 5 0 0 0 7 0 0 0 1 0 -1 0 2 58 13]
[ 28 17 -1 0 0 0 0 -1 0 0 0 186 16 -1 6 0 0 0 0 0 5 0 0 0 7 0 0 0 1 0 -1 0 2 58 13]
[[ 0 72 ][ ][ ][ ][ 346 1 ][ ][ 903 1 ][ ][ 16 1 ][ ][ ][ 763 1 1003 1 2593 1 2968 1 3262 1 4485 1 4807 1 5094 1 ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ 291 2 ]]
[[ 0 72 ][ ][ ][ ][ 346 1 ][ ][ 903 1 ][ ][ 16 1 ][ ][ ][ 763 1 1003 1 2593 1 2968 1 3262 1 4485 1 4807 1 5094 1 ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ 291 2 ]]
$ cgft -b 862 hu826751-GS03052-DNA_B01.cgf
[ 81 6 0 0 0 0 0 -1 0 0 0 411 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 -1 35 -1 194 1]
[ 81 2 0 0 0 0 0 -1 0 0 0 412 0 0 0 0 0 1 0 0 0 0 0 0 29 0 0 1 0 0 -1 35 -1 194 1]
[[ ][ ][ ][ ][ ][ ][ 903 1 ][ ][ 16 1 ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ 96 1 ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ 291 2 ]]
[[ ][ ][ ][ ][ ][ ][ 903 1 ][ ][ 16 1 ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ 96 1 ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ 291 2 ]]
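The 862 passed to cgft -b in the session above matches tile path 0x35e written in decimal, which suggests (as an assumption, not something the ticket states) that the option takes a decimal tile path. A minimal sketch of the conversion, using only the shell's printf; the variable names are illustrative:

```shell
# Tile path in hex, as quoted in the ticket (0x35e, mitochondrial).
tilepath_hex=35e

# printf interprets the 0x prefix as a hexadecimal integer constant.
tilepath_dec=$(printf '%d' "0x${tilepath_hex}")

echo "$tilepath_dec"   # prints 862
```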
This uses a version of the CGF creation scripts that loads the complete SGLF for each CGF conversion. For a whole genome, that is potentially 30 GB+ of data that needs to be loaded per dataset.
This step can be parallelized to batch convert the CGF on a tilepath basis, loading the SGLF for the tilepath once at the beginning. It gets a little clunky because the "band format" files need to be stored and then converted at the end, but the savings are in the range of an order of magnitude.
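To make the difference in loop order concrete, here is a rough sketch that only counts how many times a tilepath's SGLF would be loaded under each scheme. The tile path labels and the counters are hypothetical, and the conversion steps are placeholders rather than the actual pipeline commands:

```shell
# Hypothetical inputs: two datasets and three tile paths (illustrative only).
datasets="hu34D5B9-GS01173-DNA_C07 hu826751-GS03052-DNA_B01"
tilepaths="035c 035d 035e"

# Per-dataset scheme: each dataset conversion loads the SGLF for every tile path.
loads_single=0
for ds in $datasets; do
  for tp in $tilepaths; do
    loads_single=$((loads_single + 1))
  done
done

# Batch scheme: each tile path's SGLF is loaded once and reused for all datasets.
loads_batch=0
for tp in $tilepaths; do
  loads_batch=$((loads_batch + 1))   # load the SGLF for this tile path once
  for ds in $datasets; do
    :   # placeholder: write the intermediate band file for $ds and $tp
  done
done

echo "single: $loads_single loads, batch: $loads_batch loads"
```

With D datasets and T tile paths the per-dataset scheme performs D x T tilepath loads and the batch scheme performs T, so the saving grows with the number of datasets processed together.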
What's done:
- CGF creation CWL that scatters on a dataset basis, loading the SGLF completely for each dataset
TODO:
- Batch process the CGF, doing a CWL scatter on tilepath and then consolidating at the end.
Updated by Abram Connelly almost 6 years ago
A successful test has been run for the 'batch' CWL version that processes a tilepath with a single SGLF load and many input datasets (in the form of FastJ) at once.
Test input:
Test output:
As above, the tilepath has been restricted to 0x035e (mitochondrial) for testability.
The 'batch' version of this has a few changes from the 'single' CWL CGF:
- Empty tilepaths are filled in with the default tile (0) and the nocall portion is filled in with a special value of [0 0] to indicate the whole tilepath is not called.
- The intermediate "band" files are kept
As a note, the band2cgf CWL/script does a find in the input directory for the pattern *.band.gz and takes the parent directory as the source dataset directory. The source dataset directory is used to name the output cgf. For example:
$ find $HOME/keep/by_id/7622bf2b30e1e42f8cdd9ee1ca5a0d6d+41244 -type f -name '*.band.gz' | sed 's/[a-f0-9]*\.band\.gz$//' | sort -u
/home/XXX/keep/by_id/7622bf2b30e1e42f8cdd9ee1ca5a0d6d+41244/out/hu34D5B9-GS01173-DNA_C07/
/home/XXX/keep/by_id/7622bf2b30e1e42f8cdd9ee1ca5a0d6d+41244/out/hu826751-GS03052-DNA_B01/
The last two lines of the above would be the input directories for the band2cgf conversion and would name the resulting cgf files hu34D5B9-GS01173-DNA_C07.cgf and hu826751-GS03052-DNA_B01.cgf respectively.
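The naming rule described above can be tried out on a throwaway directory tree. This is only a sketch: the band file name 035e.band.gz and the use of basename to form the output name are assumptions for illustration, not necessarily how band2cgf does it internally.

```shell
# Build a temporary tree that mimics the out/<dataset>/ layout shown above.
# The band file name 035e.band.gz is a hypothetical placeholder.
indir=$(mktemp -d)
for ds in hu34D5B9-GS01173-DNA_C07 hu826751-GS03052-DNA_B01; do
  mkdir -p "$indir/out/$ds"
  : > "$indir/out/$ds/035e.band.gz"
done

# Same find/sed/sort pipeline as in the ticket, then take the last path
# component of each dataset directory as the name of the output cgf.
cgf_names=$(find "$indir" -type f -name '*.band.gz' \
  | sed 's/[a-f0-9]*\.band\.gz$//' \
  | sort -u \
  | while read -r dsdir; do
      echo "$(basename "$dsdir").cgf"
    done)

echo "$cgf_names"
rm -rf "$indir"
```

Because the sed expression only strips a run of lowercase hex characters before .band.gz, band files named any other way would leave part of the file name in the derived directory path.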
This ticket will be complete once the documentation for this pipeline is done.
TODO:
- Documentation
Updated by Abram Connelly almost 6 years ago
- Status changed from New to Closed
- Start date set to 06/26/2018
- Remaining (hours) set to 0.0
In the interest of expediency, I've pushed to master but I welcome any review or feedback.