Task #13380
Closed
fastj checks wf in cwl
Description
Create and register a pipeline that lets a user input a collection of FastJ files and get a plaintext report on which files failed, including information about why each one failed. The user should be able to specify whether they want both the validity checks and the tileid consistency checks (fjt -T -t), or just the validity checks (fjt -T).
Updated by Keldin Sergheyev almost 6 years ago
- Status changed from New to In Progress
- Dockerfile mostly written, but it doesn't yet successfully create an image
- Step of compiling fjt.cpp fails because openssl/md5.h can't be found
- Cannot COPY .vimrc from host machine. The most likely solution is just to forgo this step, since it's only for my convenience.
- While in docker container: l7g can't be pulled from git.curoverse.com, but it can be pulled from github.com
- This is a problem with permissions.
- Also contemplated using COPY to get l7g from host machine, but it wasn't necessary.
- How to identify a particular version of arvados/jobs that isn't just 'latest'?
- Solution: use digest in alternative form of FROM command in Dockerfile:
- FROM arvados/jobs@sha256:2dd51d88d7a34e246b523ef95c35dce2b875cd1e7694ace7c2a6cbda6de0ffe5
- The digest can be found by running docker inspect IMAGENAME and reading the digest from the returned JSON object
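As a rough illustration of the digest lookup described above, the sketch below reads the JSON printed by docker inspect and turns each entry of its RepoDigests field into a pinned FROM line. The function name pinned_from_lines is made up for this example, and the JSON is passed in as a string so the snippet runs without Docker installed.

```python
import json


def pinned_from_lines(inspect_json):
    """Return one 'FROM repo@sha256:...' line per digest found.

    `inspect_json` is the text printed by `docker inspect IMAGENAME`,
    which is a JSON array with one object per inspected image.
    """
    lines = []
    for image in json.loads(inspect_json):
        # RepoDigests can be missing or null for images that were only
        # built locally, so fall back to an empty list.
        for ref in image.get("RepoDigests") or []:
            lines.append("FROM " + ref)
    return lines


if __name__ == "__main__":
    # Sample input using the digest quoted in this ticket.
    sample = json.dumps([{
        "RepoDigests": [
            "arvados/jobs@sha256:2dd51d88d7a34e246b523ef95c35dce2"
            "b875cd1e7694ace7c2a6cbda6de0ffe5"
        ]
    }])
    for line in pinned_from_lines(sample):
        print(line)
```

In practice the input would come from something like docker inspect arvados/jobs piped into the script.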
Updated by Abram Connelly almost 6 years ago
- Project changed from 46 to Lightning
- Assigned To changed from Keldin Sergheyev to Abram Connelly
This will use the arvados/l7g Docker image whose Dockerfile can be found on GitHub.
The CWL should take in a directory that contains the FastJ files and will process all .fj or .fj.gz files in that directory. An additional workflow should be created that can scatter on a list of directories given.
An option should be given to do additional tests for tile path integrity in the FastJ file (the fjt -t option).
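A minimal sketch of the file selection and option handling described above, not the actual fastj-dir-check script. The helper names fastj_files and fjt_flags are hypothetical. Only the -T and -t flags come from this ticket, and how fjt receives its input (file argument or stdin) is deliberately left out because it is not stated here.

```python
from pathlib import Path


def fastj_files(directory):
    """Return the .fj and .fj.gz files directly inside `directory`, sorted."""
    return sorted(
        p for p in Path(directory).iterdir()
        if p.is_file() and p.name.endswith((".fj", ".fj.gz"))
    )


def fjt_flags(check_tileid):
    """Flags for fjt: validity check only, or validity plus tileid check."""
    return ["-T", "-t"] if check_tileid else ["-T"]


if __name__ == "__main__":
    # List what would be checked in the current directory.
    for path in fastj_files("."):
        print("fjt", " ".join(fjt_flags(check_tileid=True)), "<-", path.name)
```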
Updated by Abram Connelly almost 6 years ago
In some small timing tests on local (compressed) FastJ files, it looks to take around 13 min (47 s on a 16-core machine) with minimal memory.
Updated by Abram Connelly almost 6 years ago
There are two CWL pipelines: one that checks a single dataset's worth of files (for example, an individual's genome in FastJ format) and another that does it for multiple datasets.
The single check is called fastj-check.cwl, whereas the multiple check is fastj-check_wf.cwl. The workflow fastj-check_wf.cwl also takes in a list of output log file names to use, which can be checked after the fact.
The single check CWL, fastj-check.cwl, will fail if it finds an error, so I assume the scatter will also fail when an individual component fails in fastj-check_wf.cwl. The failed return is saved until the end so all tile paths can be checked in fastj-check.cwl, and I'm hoping the output will be available for review on failure.
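The "save the failed return until the end" behaviour can be sketched as follows. This is an illustration of the pattern only, not the code in fastj-check.cwl. The check callable stands in for the real fjt invocation, and the "error" wording for a failed line is an assumption; only the "OK" lines appear in this ticket.

```python
def check_all(paths, check, log):
    """Run `check` on every path, log each result, and fail only at the end.

    `check` is a stand-in for the real fjt call and returns True on
    success. The return value is a process-style exit code: 0 if every
    path passed, 1 if any failed.
    """
    exit_code = 0
    for path in paths:
        ok = check(path)
        log.append("# " + path)
        log.append("fastj check: " + ("OK" if ok else "error"))
        if not ok:
            # Remember the failure but keep going so every tile path
            # is still checked and logged.
            exit_code = 1
    return exit_code


if __name__ == "__main__":
    log = []
    # Hypothetical checker that rejects one file, for demonstration.
    code = check_all(["00ce.fj.gz", "001e.fj.gz"],
                     lambda p: p != "001e.fj.gz", log)
    print("\n".join(log))
    raise SystemExit(code)
```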
There is a test run with two successfully checked FastJ directories under the collection cc6742d8a05a5e15c35e2e504941ccab+203. This is the result of running fastj-check_wf.cwl with the following YAML file:
    script:
      class: File
      path: ../src/fastj-dir-check
    fastjDirs:
      - class: Directory
        path: keep:c6e202a426db2120b1f806a71e9ab876+52891
      - class: Directory
        path: keep:0b76a09e54ccbb5425034aba8b63e8c0+52893
    outlogs: [ "hu826751.log", "hu34D5B9-GS01670-DNA_E02.log" ]
Available under yml/test-fastj-check_wf.yml.
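For larger batches, an input file of this shape could be generated rather than written by hand. The sketch below builds the same three keys (script, fastjDirs, outlogs) and emits JSON, which CWL runners accept as an input object. The function name build_inputs and the rule that there must be exactly one log name per directory are assumptions made for this example.

```python
import json


def build_inputs(script_path, keep_locators, outlogs):
    """Build an input object for fastj-check_wf.cwl.

    Assumes one output log name per FastJ directory, in the same order.
    """
    if len(keep_locators) != len(outlogs):
        raise ValueError("need exactly one log name per directory")
    return {
        "script": {"class": "File", "path": script_path},
        "fastjDirs": [
            {"class": "Directory", "path": "keep:" + locator}
            for locator in keep_locators
        ],
        "outlogs": list(outlogs),
    }


if __name__ == "__main__":
    inputs = build_inputs(
        "../src/fastj-dir-check",
        ["c6e202a426db2120b1f806a71e9ab876+52891",
         "0b76a09e54ccbb5425034aba8b63e8c0+52893"],
        ["hu826751.log", "hu34D5B9-GS01670-DNA_E02.log"],
    )
    print(json.dumps(inputs, indent=2))
```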
Here is a snippet of the log file created:
    # /keep/c6e202a426db2120b1f806a71e9ab876+52891/00ce.fj.gz
    fastj check: OK
    fastj tileid check: OK
    # /keep/c6e202a426db2120b1f806a71e9ab876+52891/001e.fj.gz
    fastj check: OK
    fastj tileid check: OK
    # /keep/c6e202a426db2120b1f806a71e9ab876+52891/00c9.fj.gz
    fastj check: OK
    fastj tileid check: OK
    # /keep/c6e202a426db2120b1f806a71e9ab876+52891/0284.fj.gz
Doing a grep for the string "error" should provide places where the FastJ check failed (from the fjt checks run on it).
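A plain grep reports the matching lines but not which file they belong to. The sketch below pairs each line containing "error" with the most recent file header. It assumes, based on the snippet above, that each file's results follow a header line beginning with "# "; the exact text of a failing line is not shown in this ticket, so the sample failure line is invented.

```python
def find_errors(log_text):
    """Return (file_path, line) pairs for every log line containing 'error'.

    Header lines start with '# ' followed by the file path. The match is
    case-sensitive, like a plain grep for the string.
    """
    current_file = None
    hits = []
    for line in log_text.splitlines():
        if line.startswith("# "):
            current_file = line[2:].strip()
        elif "error" in line:
            hits.append((current_file, line.strip()))
    return hits


if __name__ == "__main__":
    sample = "\n".join([
        "# /keep/example/00ce.fj.gz",
        "fastj check: OK",
        "fastj tileid check: OK",
        "# /keep/example/001e.fj.gz",
        "fastj check: error (made-up failure line)",
    ])
    for path, line in find_errors(sample):
        print(path, "->", line)
```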
Updated by Sarah Zaranek almost 6 years ago
Notes on review:
in: cwl-run/submit-check-fastj-wf.sh
--disable-reuse is probably not necessary for running the CWL. Also, it should use the default of --api containers or mention it.
in: test-fastj-check_wf.yml
I assume this is an OK file to list publicly?
+ class: Directory
+ path: keep:c6e202a426db2120b1f806a71e9ab876+52891
Updated by Abram Connelly almost 6 years ago
The --disable-reuse flag is really for testing purposes. The cwl-run/submit-check-fastj-wf.sh script is only there as an example submission script and is not really meant for "production". When testing, I like to force execution to make sure it's been run with the latest version. Maybe a comment in the code to that effect?
c6e202a426db2120b1f806a71e9ab876+52891 and 0b76a09e54ccbb5425034aba8b63e8c0+52893 (the other dataset in yml/test-fastj-check_wf.yml) are hu826751-GS03052-DNA_B01 and hu34D5B9-GS01670-DNA_E02 respectively, which are the FastJ collections derived from the publicly available Harvard PGP GFF files for those two participants.
Updated by Abram Connelly almost 6 years ago
- Status changed from In Progress to Closed
Added license headers to source files.
I had trouble untangling the commit history because of errors in the commit messages, so I created a new branch, 13380-cwl-fastj-checks-v2, which I then merged into master and pushed.
I also added a .licenseignore file, as the git checks were not correctly detecting the license information in the source files.