Task #13671

closed

FastJ and SGLF to CGF CWL pipeline

Added by Abram Connelly almost 6 years ago. Updated almost 6 years ago.

Status: Closed
Priority: Normal
Assigned To:
Target version: -

Description

Create/update/cleanup the CWL pipeline to go from FastJ and SGLF to CGF (v3).

Most of the work for this has likely been done already, so what remains is mainly cleaning up, running some small tests and documenting.

  • Create/update the programs and scripts to create the CGF
  • Run for some small test datasets (2-10) to make sure the CGF creation runs without issue
  • Add some documentation on the pipeline
Actions #1

Updated by Abram Connelly almost 6 years ago

Tested on two Harvard PGP datasets, mitochondrial DNA only (tilepath 0x35e):

Output:

To see the encoding:

$ cd ~/keep/by_id/b874d1958927fb78e518eb9923d67478+224
$ ls
cwl.output.json  hu34D5B9-GS01173-DNA_C07.cgf  hu826751-GS03052-DNA_B01.cgf
$ cgft -b 862 hu34D5B9-GS01173-DNA_C07.cgf 
[ 28 17 -1 0 0 0 0 -1 0 0 0 186 16 -1 6 0 0 0 0 0 5 0 0 0 7 0 0 0 1 0 -1 0 2 58 13]
[ 28 17 -1 0 0 0 0 -1 0 0 0 186 16 -1 6 0 0 0 0 0 5 0 0 0 7 0 0 0 1 0 -1 0 2 58 13]
[[ 0 72 ][ ][ ][ ][ 346 1 ][ ][ 903 1 ][ ][ 16 1 ][ ][ ][ 763 1 1003 1 2593 1 2968 1 3262 1 4485 1 4807 1 5094 1 ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ 291 2 ]]
[[ 0 72 ][ ][ ][ ][ 346 1 ][ ][ 903 1 ][ ][ 16 1 ][ ][ ][ 763 1 1003 1 2593 1 2968 1 3262 1 4485 1 4807 1 5094 1 ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ 291 2 ]]
$ cgft -b 862 hu826751-GS03052-DNA_B01.cgf 
[ 81 6 0 0 0 0 0 -1 0 0 0 411 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 -1 35 -1 194 1]
[ 81 2 0 0 0 0 0 -1 0 0 0 412 0 0 0 0 0 1 0 0 0 0 0 0 29 0 0 1 0 0 -1 35 -1 194 1]
[[ ][ ][ ][ ][ ][ ][ 903 1 ][ ][ 16 1 ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ 96 1 ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ 291 2 ]]
[[ ][ ][ ][ ][ ][ ][ 903 1 ][ ][ 16 1 ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ 96 1 ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ 291 2 ]]
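
The cgft output above can be read into Python structures to compare the two alleles of a dataset. This is a minimal sketch, not part of the pipeline. It assumes that the first two printed lines are the per-allele tile variant vectors and that the last two are per-allele nocall lists of (start, length) pairs; that reading, and the function names, are assumptions rather than something stated in this ticket.

```python
import re


def parse_band_vector(line):
    """Parse a line such as '[ 81 6 0 -1]' into a list of ints."""
    return [int(tok) for tok in line.strip().strip("[]").split()]


def parse_band_nocall(line):
    """Parse '[[ 0 72 ][ ][ 346 1 ]]' into one list of pairs per tile position.

    Assumption: the numbers inside each inner bracket come in
    (start, length) pairs.
    """
    inner = line.strip()[1:-1]  # drop the outermost brackets
    groups = re.findall(r"\[([^\[\]]*)\]", inner)
    result = []
    for group in groups:
        nums = [int(tok) for tok in group.split()]
        result.append(list(zip(nums[0::2], nums[1::2])))
    return result


def differing_positions(a, b):
    """Return the indices at which two parsed vectors disagree."""
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]


# The value 862 passed to -b above is tilepath 0x35e written in decimal.
assert int("35e", 16) == 862
```

Applying differing_positions to the first two parsed lines of a dataset lists the positions where its two alleles differ.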

This uses a version of the CGF creation scripts that loads the complete SGLF for each CGF conversion. For a whole genome, that is potentially 30 GB+ of data that needs to be loaded per dataset.

This step can be parallelized to batch convert the CGF on a tilepath basis, loading the SGLF for the tilepath once at the beginning. It gets a little clunky because the "band format" files need to be stored and then converted at the end, but the savings are roughly an order of magnitude.
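
The difference between the two loop orders can be sketched as follows. This is an illustration only: load_sglf and to_band are hypothetical callables standing in for the real SGLF loader and FastJ-to-band conversion, and the per-dataset variant is modeled here as one SGLF load per tilepath per dataset.

```python
def convert_per_dataset(datasets, tilepaths, load_sglf, to_band):
    """Per-dataset layout: the SGLF is reloaded for every dataset."""
    bands = {}
    for ds in datasets:
        for tp in tilepaths:
            sglf = load_sglf(tp)  # repeated for each dataset
            bands[(ds, tp)] = to_band(ds, tp, sglf)
    return bands


def convert_per_tilepath(datasets, tilepaths, load_sglf, to_band):
    """Batch layout: each tilepath's SGLF is loaded once and reused."""
    bands = {}
    for tp in tilepaths:  # a CWL scatter would run over this loop
        sglf = load_sglf(tp)  # loaded once per tilepath
        for ds in datasets:
            bands[(ds, tp)] = to_band(ds, tp, sglf)
    return bands


def consolidate(bands):
    """Group the stored bands by dataset, ready for the final CGF conversion."""
    by_dataset = {}
    for (ds, tp), band in bands.items():
        by_dataset.setdefault(ds, {})[tp] = band
    return by_dataset
```

Both functions produce the same bands. The batch layout calls load_sglf once per tilepath instead of once per tilepath per dataset, at the cost of keeping the intermediate bands around until consolidate runs.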

What's done:

  • CGF creation CWL that scatters on a dataset basis, loading the SGLF completely for each dataset

TODO:

  • Batch process the CGF, doing a CWL scatter on tilepath and then consolidating at the end.
Actions #2

Updated by Abram Connelly almost 6 years ago

A successful test has been run for the 'batch' CWL version that processes a tilepath with a single SGLF load and many input datasets (in the form of FastJ) at once.

Test input:

Test output:

As above, the tilepath has been restricted to 0x035e (mitochondrial) for testability.

The 'batch' version of this has a few changes from the 'single' CWL CGF:

  • Empty tilepaths are filled in with the default tile (0) and the nocall portion is filled in with a special value of [0 0] to indicate the whole tilepath is not called.
  • The intermediate "band" files are kept

As a note, the band2cgf CWL/script does a find in the input directory for the pattern *.band.gz and takes the parent directory of each match as the source dataset directory. The name of the source dataset directory is used to name the output cgf. For example:

$ find $HOME/keep/by_id/7622bf2b30e1e42f8cdd9ee1ca5a0d6d+41244 -type f -name '*.band.gz' | sed 's/[a-f0-9]*\.band\.gz$//' | sort -u
/home/XXX/keep/by_id/7622bf2b30e1e42f8cdd9ee1ca5a0d6d+41244/out/hu34D5B9-GS01173-DNA_C07/
/home/XXX/keep/by_id/7622bf2b30e1e42f8cdd9ee1ca5a0d6d+41244/out/hu826751-GS03052-DNA_B01/

The last two lines of the above would be the input directories for the band2cgf conversion, and the resulting cgf files would be named hu34D5B9-GS01173-DNA_C07.cgf and hu826751-GS03052-DNA_B01.cgf respectively.
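
The same directory-to-name mapping can be sketched in Python. This mirrors the find/sed pipeline above rather than the actual band2cgf script; the function name cgf_names_from_bands is hypothetical.

```python
from pathlib import Path


def cgf_names_from_bands(input_dir):
    """Map each directory that holds *.band.gz files to its output CGF name."""
    # Collect the parent directory of every band file, without duplicates.
    parents = {p.parent for p in Path(input_dir).rglob("*.band.gz") if p.is_file()}
    # The directory's own name, plus '.cgf', becomes the output file name.
    return {str(d): d.name + ".cgf" for d in sorted(parents)}
```

Called on the collection directory shown above, this would map each of the two out/ subdirectories to a file name ending in .cgf.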

This ticket should now be complete once the documentation for this pipeline is done.

TODO:

  • Documentation
Actions #3

Updated by Abram Connelly almost 6 years ago

Branch 13671-cgf-cwl

Actions #4

Updated by Abram Connelly almost 6 years ago

  • Status changed from New to Closed
  • Start date set to 06/26/2018
  • Remaining (hours) set to 0.0

GitHub CWL CGF link

In the interest of expediency, I've pushed to master but I welcome any review or feedback.
