Bug #9659 (closed)
[CWL] Jobs crash when arvados/jobs:latest is not suitable to run CWL jobs
Description
It appears that the CWL documentation may be out of sync with the latest arvados-cwl-runner code updates.
$ arvados-cwl-runner --debug bwa-mem.cwl bwa-mem-input.yml
/data/scratch/brett/cwl/bin/arvados-cwl-runner 1.0.20160717133709, arvados-python-client 0.1.20160721023501, cwltool 1.0.20160714182449
2016-07-26 19:02:49 arvados.arv-run[43558] INFO: Upload local files: "bwa-mem.cwl"
2016-07-26 19:02:50 arvados.arv-run[43558] INFO: Uploaded to qr1hi-4zz18-yyrz0vbtn8iuuif
2016-07-26 19:02:59 arvados.cwl-runner[43558] INFO: Submitted job qr1hi-8i9sb-vvdfe70pw0oucch
2016-07-26 19:03:06 arvados.cwl-runner[43558] INFO: Job bwa-mem.cwl (qr1hi-8i9sb-vvdfe70pw0oucch) is Running
2016-07-26 19:03:21 arvados.cwl-runner[43558] INFO: Job bwa-mem.cwl (qr1hi-8i9sb-vvdfe70pw0oucch) is Failed
2016-07-26 19:03:21 arvados.cwl-runner[43558] ERROR: While getting final output object: [Errno 2] File not found
2016-07-26 19:03:21 arvados.cwl-runner[43558] WARNING: Overall process status is permanentFail
Workflow error, try again with --debug for more information:
  Workflow failed.
Traceback (most recent call last):
  File "/data/scratch/brett/cwl/local/lib/python2.7/site-packages/cwltool/main.py", line 707, in main
    **vars(args))
  File "/data/scratch/brett/cwl/local/lib/python2.7/site-packages/arvados_cwl/__init__.py", line 222, in arvExecutor
    raise WorkflowException("Workflow failed.")
WorkflowException: Workflow failed.
This error occurs consistently whether or not I use --debug.
Updated by Brett Smith over 8 years ago
I'm getting a different error with current versions:
brett@shell.qr1hi:/data/scratch/brett/arvados/doc/user/cwl/bwa-mem$ arvados-cwl-runner --debug bwa-mem.cwl bwa-mem-input.yml
/data/scratch/brett/cwl/bin/arvados-cwl-runner 1.0.20160717133709, arvados-python-client 0.1.20160721023501, cwltool 1.0.20160714182449
2016-07-26 19:02:49 arvados.arv-run[43558] INFO: Upload local files: "bwa-mem.cwl"
2016-07-26 19:02:50 arvados.arv-run[43558] INFO: Uploaded to qr1hi-4zz18-yyrz0vbtn8iuuif
2016-07-26 19:02:59 arvados.cwl-runner[43558] INFO: Submitted job qr1hi-8i9sb-vvdfe70pw0oucch
2016-07-26 19:03:06 arvados.cwl-runner[43558] INFO: Job bwa-mem.cwl (qr1hi-8i9sb-vvdfe70pw0oucch) is Running
2016-07-26 19:03:21 arvados.cwl-runner[43558] INFO: Job bwa-mem.cwl (qr1hi-8i9sb-vvdfe70pw0oucch) is Failed
2016-07-26 19:03:21 arvados.cwl-runner[43558] ERROR: While getting final output object: [Errno 2] File not found
2016-07-26 19:03:21 arvados.cwl-runner[43558] WARNING: Overall process status is permanentFail
Workflow error, try again with --debug for more information:
  Workflow failed.
Traceback (most recent call last):
  File "/data/scratch/brett/cwl/local/lib/python2.7/site-packages/cwltool/main.py", line 707, in main
    **vars(args))
  File "/data/scratch/brett/cwl/local/lib/python2.7/site-packages/arvados_cwl/__init__.py", line 222, in arvExecutor
    raise WorkflowException("Workflow failed.")
WorkflowException: Workflow failed.
This is consistent whether or not I use --debug.
Updated by Brett Smith over 8 years ago
- Subject changed from [CWL] [Documentation] Error when running the workflow using arvados-cwl-runner bwa-mem.cwl bwa-mem-input.yml giving an error to [CWL] [Documentation] Running `bwa-mem.cwl` from the tutorial crashes with an error
- Description updated (diff)
The fact that the Arvados SDK was the only version difference between my report and Radhika's told me what I needed to know to diagnose the issue. I pushed a fix for Radhika's issue in 01bbf6c. I've updated the description to reflect the current problem.
Updated by Brett Smith over 8 years ago
At this point, the bug is sort of a deployment issue. The report is using the very latest version of arvados-cwl-runner, but the cluster hasn't been upgraded to support that yet. The arvados/jobs:latest Docker image on qr1hi is still using an older Python SDK that doesn't include features that were added to support CWL 1.0.
Another way to think of this is: in order to run CWL 1.0 on Arvados, the Python SDK, CWL SDK, and arvados/jobs:latest Docker image all have to be upgraded in lockstep across the cluster. If you try to upgrade one of those components without upgrading the rest, everywhere, things will fail mysteriously. Radhika upgraded the CWL SDK without upgrading anything else. I upgraded both SDKs without upgrading arvados/jobs:latest.
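To make the lockstep requirement concrete, here's a rough sketch of a check that reports which components lag behind. It is not part of any Arvados SDK: `parse_version` and `lagging_components` are hypothetical helpers. The two SDK versions come from the log above, while the jobs image version and all of the minimums are made-up placeholders.

```python
from typing import Dict, List, Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn a dotted version such as '1.0.20160717133709' into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def lagging_components(installed: Dict[str, str], required: Dict[str, str]) -> List[str]:
    """Return the names of components that are missing or older than required."""
    lagging = []
    for name, minimum in sorted(required.items()):
        current = installed.get(name)
        if current is None or parse_version(current) < parse_version(minimum):
            lagging.append(name)
    return lagging


# SDK versions are taken from the log above. The jobs image version and
# every minimum below are hypothetical placeholders for illustration only.
installed = {
    "arvados-cwl-runner": "1.0.20160717133709",
    "arvados-python-client": "0.1.20160721023501",
    "arvados/jobs image": "0.1.20160601000000",
}
required = {
    "arvados-cwl-runner": "1.0.20160717133709",
    "arvados-python-client": "0.1.20160721023501",
    "arvados/jobs image": "0.1.20160721023501",
}

print(lagging_components(installed, required))
```

With these placeholder numbers, only the jobs image is reported, which mirrors the situation where both SDKs were upgraded but the image was not.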
Updated by Brett Smith over 8 years ago
- Subject changed from [CWL] [Documentation] Running `bwa-mem.cwl` from the tutorial crashes with an error to [CWL] [OPS?] Running `bwa-mem.cwl` from the tutorial crashes with an error
Updated by Brett Smith over 8 years ago
- Subject changed from [CWL] [OPS?] Running `bwa-mem.cwl` from the tutorial crashes with an error to [CWL] Jobs crash when arvados/jobs:latest is not suitable to run CWL jobs
- Category set to SDKs
We discussed this extensively at backlog grooming.
The fundamental issue is that the CWL SDK depends on various API objects existing (the arvados repository, the arvados/jobs:latest Docker image) and being recent enough, but it doesn't do anything to check for the existence of those objects, leading to strange errors when they're not sufficient.
Today we tend to treat this as an ops problem: the cluster needs to be deployed with the right dependencies. (See, e.g., #9677.) But as Arvados matures, that becomes less tenable. People are going to try using all kinds of client software with all kinds of clusters, and we're not going to be able to make sure everything stays in lockstep. That actually happened here: at Peter's suggestion, one user tried to use the very latest CWL clients on a cluster that wasn't ready for it, and the resulting errors were difficult to diagnose.
We talked about specific fixes for the CWL SDK. The strongest idea was that it could check for an arvados/jobs image tagged with its own Git hash, and upload that with arv keep docker to make sure it's available on the cluster before submitting work. But even this isn't perfect: the user has to have Docker privileges for it to work, and it can end up uploading more Docker images to the cluster than are actually needed (which costs money). There might be Docker version compatibility issues too; that's not clear.
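A minimal sketch of that idea, assuming the cluster lookup and the upload are passed in as callables. `ensure_jobs_image`, `ImageUnavailable`, `image_exists`, and `upload_image` are hypothetical names, not SDK functions; in practice the upload step might wrap arv keep docker, and the tag could be a Git hash or a version string.

```python
from typing import Callable


class ImageUnavailable(Exception):
    """Raised when the required jobs image is absent and cannot be uploaded."""


def ensure_jobs_image(
    client_tag: str,
    image_exists: Callable[[str], bool],
    upload_image: Callable[[str], None],
    repository: str = "arvados/jobs",
) -> str:
    """Make sure the image matching this client is on the cluster.

    Returns the image name that work should be submitted with.
    """
    image = "{}:{}".format(repository, client_tag)
    if image_exists(image):
        return image  # Already present, so skip a redundant upload.
    try:
        upload_image(image)
    except Exception as error:
        # For example, the user may lack Docker privileges.
        raise ImageUnavailable(
            "{} is not on the cluster and could not be uploaded: {}".format(image, error)
        ) from error
    return image


# In-memory stand-ins for the cluster, for illustration only.
cluster_images = set()
image = ensure_jobs_image(
    "1.0.20160717133709",
    image_exists=lambda name: name in cluster_images,
    upload_image=cluster_images.add,
)
print(image)
```

Checking before uploading addresses the cost concern only partly: each distinct client tag would still get its own image on the cluster.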
It would be good to devise a general strategy for clients to verify the presence of necessary API objects as part of their functioning. Even if we just added helper functions to the SDKs to do this, and made sure clients used them consistently before starting work, that would be a noticeable improvement. Plan to discuss this more at an ideas meeting or similar.
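One possible shape for such a helper, as a sketch only: each prerequisite is a description plus a check, and the client runs all of them before starting work so the user sees every missing object at once. `Requirement`, `MissingRequirements`, and `preflight` are hypothetical names, and the lambdas stand in for real API lookups.

```python
from typing import Callable, Iterable, List, NamedTuple


class Requirement(NamedTuple):
    description: str
    is_satisfied: Callable[[], bool]


class MissingRequirements(Exception):
    """Raised when the cluster lacks objects the client depends on."""

    def __init__(self, missing: List[str]):
        self.missing = missing
        super().__init__("cluster is missing: " + ", ".join(missing))


def preflight(requirements: Iterable[Requirement]) -> None:
    """Run every check and fail with one clear error before any work starts."""
    missing = [req.description for req in requirements if not req.is_satisfied()]
    if missing:
        raise MissingRequirements(missing)


# Stub checks for illustration; real ones would query the cluster API.
checks = [
    Requirement("arvados repository", lambda: True),
    Requirement("arvados/jobs:latest Docker image", lambda: False),
]

try:
    preflight(checks)
except MissingRequirements as error:
    print(error)
```

Collecting all failures rather than stopping at the first one is a design choice here; it trades a little extra API traffic for a more useful error message.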
Updated by Ward Vandewege over 3 years ago
- Target version deleted (Arvados Future Sprints)
Updated by Peter Amstutz 5 months ago
- Release deleted (60)
- Target version deleted (Future)
- Status changed from New to Resolved
arvados-cwl-runner has, for a long time now, ensured that the jobs container image version matches the client version and that the right jobs image version is present on the cluster (uploading it if necessary).