Bug #9659

[CWL] Jobs crash when arvados/jobs:latest is not suitable to run CWL jobs

Added by Radhika Chippada almost 4 years ago. Updated almost 4 years ago.

Status:
New
Priority:
Normal
Assigned To:
-
Category:
SDKs
Target version:
Start date:
Due date:
% Done:

0%

Estimated time:
Story points:
-

Description

It appears that the CWL documentation may be out of sync with the latest arvados-cwl-runner code updates.

$ arvados-cwl-runner --debug bwa-mem.cwl bwa-mem-input.yml
/data/scratch/brett/cwl/bin/arvados-cwl-runner 1.0.20160717133709, arvados-python-client 0.1.20160721023501, cwltool 1.0.20160714182449
2016-07-26 19:02:49 arvados.arv-run[43558] INFO: Upload local files: "bwa-mem.cwl" 
2016-07-26 19:02:50 arvados.arv-run[43558] INFO: Uploaded to qr1hi-4zz18-yyrz0vbtn8iuuif
2016-07-26 19:02:59 arvados.cwl-runner[43558] INFO: Submitted job qr1hi-8i9sb-vvdfe70pw0oucch
2016-07-26 19:03:06 arvados.cwl-runner[43558] INFO: Job bwa-mem.cwl (qr1hi-8i9sb-vvdfe70pw0oucch) is Running
2016-07-26 19:03:21 arvados.cwl-runner[43558] INFO: Job bwa-mem.cwl (qr1hi-8i9sb-vvdfe70pw0oucch) is Failed
2016-07-26 19:03:21 arvados.cwl-runner[43558] ERROR: While getting final output object: [Errno 2] File not found
2016-07-26 19:03:21 arvados.cwl-runner[43558] WARNING: Overall process status is permanentFail
Workflow error, try again with --debug for more information:
  Workflow failed.
Traceback (most recent call last):
  File "/data/scratch/brett/cwl/local/lib/python2.7/site-packages/cwltool/main.py", line 707, in main
    **vars(args))
  File "/data/scratch/brett/cwl/local/lib/python2.7/site-packages/arvados_cwl/__init__.py", line 222, in arvExecutor
    raise WorkflowException("Workflow failed.")
WorkflowException: Workflow failed.

This error occurs consistently whether or not I use --debug.


Related issues

Related to Arvados - Story #9677: [Docs] Install guide needs to create arvados/jobs:latest Docker image as a standard object (Resolved, 07/29/2016)

Associated revisions

Revision 01bbf6c2 (diff)
Added by Brett Smith almost 4 years ago

9659: Bump CWL SDK's versioned dependency on PySDK.

The CWL SDK depends on the change made to arv-run in
27816b602e9da83a2565e6fe8f87f250555b1ba5. Update the version
dependency in setup.py to reflect this. Refs #9570, #9659.

History

#1 Updated by Radhika Chippada almost 4 years ago

  • Description updated (diff)

#2 Updated by Brett Smith almost 4 years ago

I'm getting a different error with current versions:

brett@shell.qr1hi:/data/scratch/brett/arvados/doc/user/cwl/bwa-mem$ arvados-cwl-runner --debug bwa-mem.cwl bwa-mem-input.yml
/data/scratch/brett/cwl/bin/arvados-cwl-runner 1.0.20160717133709, arvados-python-client 0.1.20160721023501, cwltool 1.0.20160714182449
2016-07-26 19:02:49 arvados.arv-run[43558] INFO: Upload local files: "bwa-mem.cwl" 
2016-07-26 19:02:50 arvados.arv-run[43558] INFO: Uploaded to qr1hi-4zz18-yyrz0vbtn8iuuif
2016-07-26 19:02:59 arvados.cwl-runner[43558] INFO: Submitted job qr1hi-8i9sb-vvdfe70pw0oucch
2016-07-26 19:03:06 arvados.cwl-runner[43558] INFO: Job bwa-mem.cwl (qr1hi-8i9sb-vvdfe70pw0oucch) is Running
2016-07-26 19:03:21 arvados.cwl-runner[43558] INFO: Job bwa-mem.cwl (qr1hi-8i9sb-vvdfe70pw0oucch) is Failed
2016-07-26 19:03:21 arvados.cwl-runner[43558] ERROR: While getting final output object: [Errno 2] File not found
2016-07-26 19:03:21 arvados.cwl-runner[43558] WARNING: Overall process status is permanentFail
Workflow error, try again with --debug for more information:
  Workflow failed.
Traceback (most recent call last):
  File "/data/scratch/brett/cwl/local/lib/python2.7/site-packages/cwltool/main.py", line 707, in main
    **vars(args))
  File "/data/scratch/brett/cwl/local/lib/python2.7/site-packages/arvados_cwl/__init__.py", line 222, in arvExecutor
    raise WorkflowException("Workflow failed.")
WorkflowException: Workflow failed.

This is consistent whether or not I use --debug.

#3 Updated by Brett Smith almost 4 years ago

  • Description updated (diff)

#4 Updated by Brett Smith almost 4 years ago

  • Subject changed from [CWL] [Documentation] Error when running the workflow using arvados-cwl-runner bwa-mem.cwl bwa-mem-input.yml giving an error to [CWL] [Documentation] Running `bwa-mem.cwl` from the tutorial crashes with an error
  • Description updated (diff)

The fact that the Arvados SDK was the only version difference between my report and Radhika's told me what I needed to know to diagnose the issue. I pushed a fix for Radhika's issue in 01bbf6c. I've updated the description to reflect the current problem.

#5 Updated by Brett Smith almost 4 years ago

At this point, the bug is sort of a deployment issue. The reproduction uses the very latest version of arvados-cwl-runner, but the cluster hasn't been upgraded to support that yet. The arvados/jobs:latest Docker image on qr1hi is still using an older Python SDK that doesn't include features that were added to support CWL 1.0.

Another way to think of this is: in order to run CWL 1.0 on Arvados, the Python SDK, CWL SDK, and arvados/jobs:latest Docker image all have to be upgraded in lockstep across the cluster. If you try to upgrade one of those components without upgrading the rest, everywhere, things will fail mysteriously. Radhika upgraded the CWL SDK without upgrading anything else. I upgraded both SDKs without upgrading arvados/jobs:latest.

#6 Updated by Brett Smith almost 4 years ago

  • Subject changed from [CWL] [Documentation] Running `bwa-mem.cwl` from the tutorial crashes with an error to [CWL] [OPS?] Running `bwa-mem.cwl` from the tutorial crashes with an error

#7 Updated by Brett Smith almost 4 years ago

  • Subject changed from [CWL] [OPS?] Running `bwa-mem.cwl` from the tutorial crashes with an error to [CWL] Jobs crash when arvados/jobs:latest is not suitable to run CWL jobs
  • Category set to SDKs

We discussed this extensively at backlog grooming.

The fundamental issue is that the CWL SDK depends on various API objects existing (the arvados repository, the arvados/jobs:latest Docker image) and being recent enough, but it doesn't do anything to check for the existence of those objects, leading to strange errors when they're not sufficient.

Today we tend to treat this as an ops problem: the cluster needs to be deployed with the right dependencies. (See, e.g., #9677.) But as Arvados matures, that becomes less tenable. People are going to try using all kinds of client software with all kinds of clusters, and we're not going to be able to make sure everything stays in lockstep. That actually happened here: at Peter's suggestion, one user tried to use the very latest CWL clients on a cluster that wasn't ready for it, and the resulting errors were difficult to diagnose.

We talked about specific fixes for the CWL SDK. The strongest idea was that it could check for an arvados/jobs image tagged with its own Git hash, and run arv keep docker on that image to make sure it's available on the cluster before submitting work. But even this isn't perfect: the user has to have Docker privileges for it to work, and it can end up uploading more Docker images to the cluster than are actually needed (which costs money). There might be Docker version compatibility issues too; that's not clear.

It would be good to devise a general strategy for clients to verify the presence of necessary API objects as part of their functioning. Even if we just added helper functions to the SDKs to do this, and made sure clients used them consistently before starting work, that would be a noticeable improvement. Plan to discuss this more at an ideas meeting or similar.
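
As a rough illustration of the helper-function idea, here is a minimal sketch of a preflight check a client could run before submitting work. Everything in it is hypothetical: Requirement, MissingPrerequisites, parse_version, and check_prerequisites are names made up for this example and are not part of any Arvados SDK. The lookup callable stands in for whatever API query a real client would use to find an object on the cluster, and the sketch assumes versions are dotted numeric strings like the ones in the log above.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class Requirement:
    """One object the client needs on the cluster (hypothetical structure)."""
    kind: str                          # e.g. "docker_image" or "repository"
    name: str                          # e.g. "arvados/jobs:latest"
    min_version: Optional[str] = None  # dotted numeric string, or None for "any"


class MissingPrerequisites(Exception):
    """Raised before any work is submitted, listing every unmet requirement."""


def parse_version(version: str) -> Tuple[int, ...]:
    # Compare numerically so that "0.10" sorts after "0.9".
    return tuple(int(part) for part in version.split("."))


def check_prerequisites(
    requirements: Iterable[Requirement],
    lookup: Callable[[Requirement], Optional[str]],
) -> None:
    """Raise MissingPrerequisites unless every requirement is satisfied.

    `lookup` is an assumption of this sketch: it returns the version string
    found on the cluster, or None if the object does not exist there.
    """
    problems: List[str] = []
    for req in requirements:
        found = lookup(req)
        if found is None:
            problems.append(f"{req.kind} {req.name!r} not found on the cluster")
        elif req.min_version and parse_version(found) < parse_version(req.min_version):
            problems.append(
                f"{req.kind} {req.name!r} is version {found}; "
                f"need at least {req.min_version}"
            )
    if problems:
        # Report everything at once so the user sees the full picture.
        raise MissingPrerequisites("; ".join(problems))
```

A client would build its list of requirements once, call check_prerequisites with a lookup backed by the cluster's API, and refuse to submit if the exception is raised. The point of collecting all problems before raising is that the user gets one clear message instead of a late "File not found" from a failed job.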
