Task #13686
closed
Common Workflow Language (CWL) CGF check
Added by Abram Connelly almost 6 years ago.
Updated almost 6 years ago.
Description
This CWL pipeline should check the final CGF to make sure it's consistent with the input file (VCF, GFF, etc.).
The pipeline should:
- Take in gVCFs or GFFs to compare to a set of CGFs
- Take in SGLF
- Batch the conversion to minimize reloading the SGLF
A small test on two input datasets will do for testing initially.
Branch 13686-check-cgf-gff.
I'm restricting the scope to GFF. When we start processing more gVCF files in earnest, we can open another ticket to extend, update, and verify the gVCF to CGF checks.
The check assumes the scatter is on chromosome. The check loads the SGLF for a particular tilepath and then checks a batch of CGF (for that tilepath) at once.
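A minimal sketch of that batching order, to make the "load the SGLF once per tilepath" idea concrete. The names `check_batches`, `load_sglf` and `check_cgf` are hypothetical stand-ins, not functions from the branch; the real work is done by the CWL steps and shell script.

```python
from typing import Callable, Dict, Iterable, List


def check_batches(
    tilepaths: Iterable[str],
    cgf_files: List[str],
    load_sglf: Callable[[str], object],
    check_cgf: Callable[[object, str, str], bool],
) -> Dict[str, List[str]]:
    """Return, per tilepath, the CGF files that failed the check.

    load_sglf and check_cgf are hypothetical callables supplied by the
    caller; they stand in for whatever actually loads and compares data.
    """
    failures: Dict[str, List[str]] = {}
    for tilepath in tilepaths:
        # The expensive step: done once per tilepath, not once per CGF.
        sglf = load_sglf(tilepath)
        # Every CGF in the batch is checked against the same loaded SGLF.
        failed = [cgf for cgf in cgf_files if not check_cgf(sglf, tilepath, cgf)]
        if failed:
            failures[tilepath] = failed
    return failures
```

With this ordering the number of SGLF loads grows with the number of tilepaths rather than with tilepaths times CGF files.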
The GFF needs to be formatted properly, as I ran into issues when using tabix with the headers that came out of the Harvard PGP site. Specifically:
- The GFF files should have no headers (no beginning '#' lines).
- The GFF files should be indexed with tabix (the .tbi tabix index file should be present).
- The GFF files should have the same "basename", without the suffix .gff.gz, as the input CGF files.
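The three requirements above can be pre-checked before running the pipeline. The following is a rough sketch only: the helper names are hypothetical, the `.cgf` suffix for CGF files is an assumption, and the index is assumed to sit next to the GFF as `<file>.gff.gz.tbi`. Compressing and indexing are not done here; that is still a job for bgzip and tabix.

```python
import os
from typing import Iterable, Iterator


def strip_gff_headers(lines: Iterable[str]) -> Iterator[str]:
    """Drop the leading '#' lines; keep everything from the first data line on."""
    in_header = True
    for line in lines:
        if in_header and line.startswith("#"):
            continue
        in_header = False
        yield line


def has_tabix_index(gff_path: str) -> bool:
    """True if a .tbi file sits next to the GFF (assumed location)."""
    return os.path.exists(gff_path + ".tbi")


def basename_without_suffix(path: str, suffix: str) -> str:
    """File name with the directory and the given suffix removed."""
    name = os.path.basename(path)
    return name[: -len(suffix)] if name.endswith(suffix) else name


def names_match(gff_path: str, cgf_path: str) -> bool:
    """Compare basenames; the '.cgf' suffix is an assumption."""
    gff_name = basename_without_suffix(gff_path, ".gff.gz")
    cgf_name = basename_without_suffix(cgf_path, ".cgf")
    return gff_name == cgf_name
```

Note that `strip_gff_headers` works on plain-text lines, so it would be applied before the file is compressed and indexed.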
As a technical note, the script (verify-cgf-gff.sh) does a find for the appropriate file, so the directory structure for the GFF files isn't so important.
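For illustration, the same lookup idea can be sketched in Python. This is an analogue of the find-based search, not the script's actual code; `find_gff` is a hypothetical name, and returning the first match is an assumption about how duplicates would be handled.

```python
import os
from typing import Optional


def find_gff(root: str, basename: str) -> Optional[str]:
    """Search root recursively for '<basename>.gff.gz'.

    Returns the first match found, or None if there is no such file.
    """
    target = basename + ".gff.gz"
    for dirpath, _dirnames, filenames in os.walk(root):
        if target in filenames:
            return os.path.join(dirpath, target)
    return None
```

Because the search is recursive, the GFF files can be nested at any depth under the root directory, as long as the file name matches.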
- Status changed from In Progress to Closed
- Remaining (hours) set to 0.0