Task #13380

closed

fastj checks wf in cwl

Added by Keldin Sergheyev about 6 years ago. Updated almost 6 years ago.

Status: Closed
Priority: Normal
Assigned To:
Target version: -

Description

Create and register a pipeline in which a user can input a collection of FastJ files and get a plaintext report on which files failed, including why each one failed. The user should be able to choose between running both the validity checks and the tileid consistency checks (fjt -T -t), or just the validity checks (fjt -T).


Subtasks 1 (0 open, 1 closed)

Task #13498: Review #13380, CWL FastJ check | Resolved | Sarah Zaranek | 05/23/2018
Actions #1

Updated by Keldin Sergheyev almost 6 years ago

  • Status changed from New to In Progress
  • Dockerfile mostly written, but it doesn't yet successfully create an image
    • Step of compiling fjt.cpp fails because openssl/md5.h can't be found
    • Cannot COPY .vimrc from the host machine. The most likely solution is simply to forgo this step, since it's only for my convenience.
  • While in docker container: l7g can't be pulled from git.curoverse.com, but it can be pulled from github.com
    • This is a problem with permissions.
    • Also contemplated using COPY to get l7g from host machine, but it wasn't necessary.
  • How to identify a particular version of arvados/jobs that isn't just 'latest'?
    • Solution: use digest in alternative form of FROM command in Dockerfile:
    • FROM arvados/jobs@sha256:2dd51d88d7a34e246b523ef95c35dce2b875cd1e7694ace7c2a6cbda6de0ffe5
    • The digest can be found by running docker inspect IMAGENAME and reading the digest from the returned JSON object
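
A minimal sketch of the digest lookup described above, in Python. It assumes the output of docker inspect IMAGENAME has been captured as a string and that the digest sits under the RepoDigests key of each returned object; the helper names repo_digests and from_line are hypothetical and only for illustration.

```python
import json
from typing import List


def repo_digests(inspect_output: str) -> List[str]:
    """Collect every RepoDigests entry from `docker inspect` JSON output.

    Assumption: the output is a JSON array of objects, and each image
    object may carry a "RepoDigests" list of "name@sha256:..." strings.
    """
    records = json.loads(inspect_output)
    digests: List[str] = []
    for record in records:
        # The key can be absent or null for images that were never pulled/pushed.
        digests.extend(record.get("RepoDigests") or [])
    return digests


def from_line(digest_ref: str) -> str:
    """Build the digest-pinned FROM line for a Dockerfile."""
    return "FROM " + digest_ref
```

Given the captured JSON, from_line(repo_digests(output)[0]) would produce a FROM line of the same form as the one quoted in the list above.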
Actions #2

Updated by Abram Connelly almost 6 years ago

  • Project changed from 46 to Lightning
  • Assigned To changed from Keldin Sergheyev to Abram Connelly

This will use the arvados/l7g Docker image whose Dockerfile can be found on GitHub.

The CWL should take in a directory that contains the FastJ files and process all .fj or .fj.gz files in that directory. An additional workflow should be created that can scatter over a given list of directories.

An option should be given to do additional tests for tile path integrity in the FastJ file (the fjt -t option).
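
The file selection described above can be sketched as follows. This is an illustrative Python helper, not the script used by the pipeline; the function name fastj_files is hypothetical, and it assumes only files directly inside the directory (no recursion) are of interest.

```python
from pathlib import Path


def fastj_files(directory):
    """Return the .fj and .fj.gz files directly inside `directory`, sorted.

    Assumption: subdirectories are not searched, and anything without one
    of the two suffixes is ignored.
    """
    return sorted(
        path
        for path in Path(directory).iterdir()
        if path.is_file() and path.name.endswith((".fj", ".fj.gz"))
    )
```

Each returned path could then be handed to fjt -T, with -t added when the optional tile path integrity test is requested.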

Actions #3

Updated by Abram Connelly almost 6 years ago

Some small timing tests on local (compressed) FastJ files suggest it takes about 13 minutes (47 s on a 16-core machine) with minimal memory.

Actions #4

Updated by Abram Connelly almost 6 years ago

There are two CWL pipelines: one that checks a single dataset's worth of files (for example, an individual's genome in FastJ format) and another that does so for multiple datasets.

The single check is called fastj-check.cwl whereas the multiple check is fastj-check_wf.cwl. The workflow fastj-check_wf.cwl also takes in a list of output log file names to use that can be checked after the fact.

The single check CWL, fastj-check.cwl, will fail if it finds an error, so I assume the scatter will also fail when an individual component fails in fastj-check_wf.cwl. The failed return is saved until the end so all tile paths can be checked in fastj-check.cwl and I'm hoping the output will be available for review on failure.
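
The "save the failed return until the end" behaviour can be sketched like this. It is a minimal Python illustration under stated assumptions: check_all and the check callback are hypothetical stand-ins, and the real fastj-check.cwl step may be structured differently.

```python
def check_all(paths, check):
    """Run `check` on every path and defer the failure status to the end.

    `check` is a hypothetical callback returning (ok, message) for one path.
    Returns (exit_code, report_lines); exit_code is 1 if any check failed.
    """
    failed = False
    report = []
    for path in paths:
        report.append("# " + path)
        ok, message = check(path)
        report.append(message)
        if not ok:
            # Remember the failure but keep going, so that every
            # remaining file still appears in the report.
            failed = True
    return (1 if failed else 0), report
```

A caller would write the report lines to the log first and only then exit with the returned code, so the log is complete even when the step fails.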

There is a test run with two successfully checked FastJ directories under the collection cc6742d8a05a5e15c35e2e504941ccab+203. This is the result of running fastj-check_wf.cwl with the following YAML file:

script:
  class: File
  path: ../src/fastj-dir-check

fastjDirs:
  - class: Directory
    path: keep:c6e202a426db2120b1f806a71e9ab876+52891

  - class: Directory
    path: keep:0b76a09e54ccbb5425034aba8b63e8c0+52893

outlogs: [ "hu826751.log", "hu34D5B9-GS01670-DNA_E02.log" ]

Available under yml/test-fastj-check_wf.yml.

Here is a snippet of the log file created:

# /keep/c6e202a426db2120b1f806a71e9ab876+52891/00ce.fj.gz
fastj check: OK
fastj tileid check: OK
# /keep/c6e202a426db2120b1f806a71e9ab876+52891/001e.fj.gz
fastj check: OK
fastj tileid check: OK
# /keep/c6e202a426db2120b1f806a71e9ab876+52891/00c9.fj.gz
fastj check: OK
fastj tileid check: OK
# /keep/c6e202a426db2120b1f806a71e9ab876+52891/0284.fj.gz

A grep for the string "error" should show where the FastJ check failed (based on the fjt checks run on each file).
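
As a rough alternative to a plain grep, the sketch below maps each "error" line back to the file it belongs to. It is written in Python and assumes the log layout shown in the snippet above (a "# path" line followed by that file's check lines). The exact wording of a failing line is not shown in this thread, so the only assumption made is that it contains the string "error"; the function name files_with_errors is hypothetical.

```python
def files_with_errors(log_text):
    """Return the paths whose check lines contain the string "error".

    Assumption: each file's section starts with a line of the form
    "# <path>", and failures are reported on lines containing "error"
    (case-sensitive, like a default grep).
    """
    current = None
    flagged = []
    for line in log_text.splitlines():
        if line.startswith("# "):
            current = line[2:].strip()
        elif "error" in line and current is not None:
            if current not in flagged:
                flagged.append(current)
    return flagged
```

Reading a log such as hu826751.log and passing its text to files_with_errors would give the list of FastJ files to inspect further.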

Actions #5

Updated by Sarah Zaranek almost 6 years ago

Notes on review:

in: cwl-run/submit-check-fastj-wf.sh
--disable-reuse is probably not necessary for running the CWL. The script should also use the default of --api containers, or mention it.

in: test-fastj-check_wf.yml
I assume this is an OK file to list publicly?
+ class: Directory
+ path: keep:c6e202a426db2120b1f806a71e9ab876+52891

Actions #6

Updated by Abram Connelly almost 6 years ago

The --disable-reuse is really for testing purposes. The cwl-run/submit-check-fastj-wf.sh is only there as an example submission script and not really meant for "production". When testing, I like to force execution to make sure it's been run with the latest. Maybe a comment in the code to that effect?

c6e202a426db2120b1f806a71e9ab876+52891 and 0b76a09e54ccbb5425034aba8b63e8c0+52893 (the other dataset in yml/test-fastj-check_wf.yml) are hu826751-GS03052-DNA_B01 and hu34D5B9-GS01670-DNA_E02 respectively, which are the FastJ collections derived from the publicly available Harvard PGP GFF files for those two participants.

Actions #7

Updated by Abram Connelly almost 6 years ago

  • Status changed from In Progress to Closed

Added license headers to source files.

I had trouble untangling the commit history because of errors in the commit messages, so I created a new branch, 13380-cwl-fastj-checks-v2, which I then merged into master and pushed.

I also added a .licenseignore file because the git checks were not correctly detecting the license information in the source files.
