Task #13686
closed
Common Workflow Language (CWL) CGF check
Description
This CWL pipeline should check the final CGF to make sure it's consistent with the input file (VCF, GFF, etc.).
The pipeline should:
- Take in gVCFs or GFFs to compare to a set of CGFs
- Take in SGLF
- Batch the conversion to minimize reloading the SGLF
A small test on two input datasets will suffice initially.
Updated by Abram Connelly over 5 years ago
Branch 13686-check-cgf-gff.
I'm restricting the scope to GFF. When we start processing more gVCF files in earnest, we can open another ticket to extend, update, and verify that the gVCF to CGF checks are working.
The check assumes the scatter is on chromosome. The check loads the SGLF for a particular tilepath and then checks a batch of CGF files (for that tilepath) at once.
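The batching idea can be sketched as follows. This is only an illustration, not the code on the branch: `check_in_batches`, `load_sglf`, and `check_cgf` are hypothetical names, and the two callbacks stand in for whatever actually loads the SGLF and compares a CGF against it.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple


def check_in_batches(
    jobs: Iterable[Tuple[str, str]],
    load_sglf: Callable[[str], object],
    check_cgf: Callable[[object, str], bool],
) -> Dict[Tuple[str, str], bool]:
    """Run every (tilepath, cgf_path) check, loading each SGLF only once.

    load_sglf and check_cgf are hypothetical callbacks (assumptions made
    for this sketch), not functions from the actual pipeline.
    """
    # Group the CGF files by tilepath so they can share one SGLF load.
    by_tilepath: Dict[str, List[str]] = defaultdict(list)
    for tilepath, cgf_path in jobs:
        by_tilepath[tilepath].append(cgf_path)

    results: Dict[Tuple[str, str], bool] = {}
    for tilepath, cgf_paths in by_tilepath.items():
        sglf = load_sglf(tilepath)  # the expensive step, once per tilepath
        for cgf_path in cgf_paths:
            results[(tilepath, cgf_path)] = check_cgf(sglf, cgf_path)
    return results
```

Grouping first means the number of SGLF loads equals the number of distinct tilepaths, regardless of how many CGF files are in the batch.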
The GFF needs to be formatted properly, as I ran into issues when using tabix with the headers that came out of the Harvard PGP site. Specifically:
- The GFF files should have no headers (no beginning '#' lines)
- The GFF files should be indexed with tabix (the .tbi tabix index file should be present)
- The GFF files should have the same "basename", without the suffix .gff.gz, as the input CGF files
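A quick pre-flight check of the three requirements above might look like the sketch below. It is not part of the branch: `gff_problems` is a hypothetical helper, the CGF basenames are assumed to be passed in already stripped of their own suffix, and the index is assumed to sit next to the GFF file under tabix's default name (the GFF file name plus `.tbi`).

```python
import gzip
from pathlib import Path
from typing import List, Set

GFF_SUFFIX = ".gff.gz"


def gff_problems(gff_path: Path, cgf_basenames: Set[str]) -> List[str]:
    """Return a list of formatting problems for one GFF file (empty if none).

    Hypothetical helper; cgf_basenames is assumed to hold the CGF file
    names with their suffix already removed.
    """
    name = gff_path.name
    if not name.endswith(GFF_SUFFIX):
        return [f"{name}: expected a {GFF_SUFFIX} file"]

    problems = []

    # Requirement 1: no header, i.e. the file must not begin with a '#' line.
    # bgzip output is gzip-compatible, so the gzip module can read it.
    with gzip.open(gff_path, "rt") as handle:
        first_line = handle.readline()
    if first_line.startswith("#"):
        problems.append(f"{name}: starts with a '#' header line")

    # Requirement 2: the tabix index must be present next to the file.
    if not Path(str(gff_path) + ".tbi").exists():
        problems.append(f"{name}: missing .tbi tabix index")

    # Requirement 3: the basename must match one of the input CGF files.
    basename = name[: -len(GFF_SUFFIX)]
    if basename not in cgf_basenames:
        problems.append(f"{name}: no CGF with basename {basename!r}")

    return problems
```

Running this over every GFF before starting the workflow surfaces formatting issues early instead of partway through a tabix query.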
As a technical note, the script (verify-cgf-gff.sh) does a find for the appropriate file, so the directory structure for the GFF files isn't so important.
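For illustration, a lookup by basename that ignores directory layout could be written as below. This is a Python analogue of the `find` approach, not the contents of verify-cgf-gff.sh; `find_gff` and its parameters are hypothetical.

```python
import os
from typing import Optional


def find_gff(root: str, basename: str) -> Optional[str]:
    """Search root recursively for <basename>.gff.gz.

    Hypothetical helper: returns the first matching path, or None when
    no such file exists anywhere under root.
    """
    target = basename + ".gff.gz"
    for dirpath, _dirnames, filenames in os.walk(root):
        if target in filenames:
            return os.path.join(dirpath, target)
    return None
```

Because the search is recursive, the GFF files can be nested at any depth under the input directory, as long as each file name matches its CGF basename.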
Updated by Abram Connelly over 5 years ago
- Status changed from In Progress to Closed
- Remaining (hours) set to 0.0