Bug #9659

closed

[CWL] Jobs crash when arvados/jobs:latest is not suitable to run CWL jobs

Added by Radhika Chippada over 8 years ago. Updated 4 months ago.

Status:
Resolved
Priority:
Normal
Assigned To:
-
Category:
SDKs
Target version:
-
Story points:
-

Description

It appears that the CWL documentation may be out of sync with the latest arvados-cwl-runner code updates.

$ arvados-cwl-runner --debug bwa-mem.cwl bwa-mem-input.yml
/data/scratch/brett/cwl/bin/arvados-cwl-runner 1.0.20160717133709, arvados-python-client 0.1.20160721023501, cwltool 1.0.20160714182449
2016-07-26 19:02:49 arvados.arv-run[43558] INFO: Upload local files: "bwa-mem.cwl" 
2016-07-26 19:02:50 arvados.arv-run[43558] INFO: Uploaded to qr1hi-4zz18-yyrz0vbtn8iuuif
2016-07-26 19:02:59 arvados.cwl-runner[43558] INFO: Submitted job qr1hi-8i9sb-vvdfe70pw0oucch
2016-07-26 19:03:06 arvados.cwl-runner[43558] INFO: Job bwa-mem.cwl (qr1hi-8i9sb-vvdfe70pw0oucch) is Running
2016-07-26 19:03:21 arvados.cwl-runner[43558] INFO: Job bwa-mem.cwl (qr1hi-8i9sb-vvdfe70pw0oucch) is Failed
2016-07-26 19:03:21 arvados.cwl-runner[43558] ERROR: While getting final output object: [Errno 2] File not found
2016-07-26 19:03:21 arvados.cwl-runner[43558] WARNING: Overall process status is permanentFail
Workflow error, try again with --debug for more information:
  Workflow failed.
Traceback (most recent call last):
  File "/data/scratch/brett/cwl/local/lib/python2.7/site-packages/cwltool/main.py", line 707, in main
    **vars(args))
  File "/data/scratch/brett/cwl/local/lib/python2.7/site-packages/arvados_cwl/__init__.py", line 222, in arvExecutor
    raise WorkflowException("Workflow failed.")
WorkflowException: Workflow failed.

This error occurs consistently whether or not I use --debug.


Related issues 1 (0 open, 1 closed)

Related to Arvados - Idea #9677: [Docs] Install guide needs to create arvados/jobs:latest Docker image as a standard object (Resolved, Ward Vandewege, 07/29/2016)
Actions #1

Updated by Radhika Chippada over 8 years ago

  • Description updated (diff)
Actions #2

Updated by Brett Smith over 8 years ago

I'm getting a different error with current versions:

brett@shell.qr1hi:/data/scratch/brett/arvados/doc/user/cwl/bwa-mem$ arvados-cwl-runner --debug bwa-mem.cwl bwa-mem-input.yml
/data/scratch/brett/cwl/bin/arvados-cwl-runner 1.0.20160717133709, arvados-python-client 0.1.20160721023501, cwltool 1.0.20160714182449
2016-07-26 19:02:49 arvados.arv-run[43558] INFO: Upload local files: "bwa-mem.cwl" 
2016-07-26 19:02:50 arvados.arv-run[43558] INFO: Uploaded to qr1hi-4zz18-yyrz0vbtn8iuuif
2016-07-26 19:02:59 arvados.cwl-runner[43558] INFO: Submitted job qr1hi-8i9sb-vvdfe70pw0oucch
2016-07-26 19:03:06 arvados.cwl-runner[43558] INFO: Job bwa-mem.cwl (qr1hi-8i9sb-vvdfe70pw0oucch) is Running
2016-07-26 19:03:21 arvados.cwl-runner[43558] INFO: Job bwa-mem.cwl (qr1hi-8i9sb-vvdfe70pw0oucch) is Failed
2016-07-26 19:03:21 arvados.cwl-runner[43558] ERROR: While getting final output object: [Errno 2] File not found
2016-07-26 19:03:21 arvados.cwl-runner[43558] WARNING: Overall process status is permanentFail
Workflow error, try again with --debug for more information:
  Workflow failed.
Traceback (most recent call last):
  File "/data/scratch/brett/cwl/local/lib/python2.7/site-packages/cwltool/main.py", line 707, in main
    **vars(args))
  File "/data/scratch/brett/cwl/local/lib/python2.7/site-packages/arvados_cwl/__init__.py", line 222, in arvExecutor
    raise WorkflowException("Workflow failed.")
WorkflowException: Workflow failed.

This is consistent whether or not I use --debug.

Actions #3

Updated by Brett Smith over 8 years ago

  • Description updated (diff)
Actions #4

Updated by Brett Smith over 8 years ago

  • Subject changed from [CWL] [Documentation] Error when running the workflow using arvados-cwl-runner bwa-mem.cwl bwa-mem-input.yml giving an error to [CWL] [Documentation] Running `bwa-mem.cwl` from the tutorial crashes with an error
  • Description updated (diff)

The fact that the Arvados SDK was the only version difference between my report and Radhika's told me what I needed to know to diagnose the issue. I pushed a fix for Radhika's issue in 01bbf6c. I've updated the description to reflect the current problem.

Actions #5

Updated by Brett Smith over 8 years ago

At this point, the bug is sort of a deployment issue. The report uses the very latest version of arvados-cwl-runner, but the cluster hasn't been upgraded to support that yet. The arvados/jobs:latest Docker image on qr1hi is still using an older Python SDK that doesn't include features that were added to support CWL 1.0.

Another way to think of this is: in order to run CWL 1.0 on Arvados, the Python SDK, CWL SDK, and arvados/jobs:latest Docker image all have to be upgraded in lockstep across the cluster. If you try to upgrade one of those components without upgrading the rest, everywhere, things will fail mysteriously. Radhika upgraded the CWL SDK without upgrading anything else. I upgraded both SDKs without upgrading arvados/jobs:latest.
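The lockstep requirement could be made explicit with a small pre-check. What follows is only a sketch: the component names, the `MINIMUM_FOR_CWL_1_0` table, and the minimum and jobs-image version numbers in it are hypothetical placeholders, not values from any Arvados release. It assumes dotted numeric version strings like the ones printed in the log above.

```python
def parse_version(text):
    """Turn a dotted numeric version such as '1.0.20160717133709' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def stale_components(installed, minimum):
    """Return the sorted names of components that are missing or older than required."""
    stale = []
    for name, required in minimum.items():
        have = installed.get(name)
        if have is None or parse_version(have) < parse_version(required):
            stale.append(name)
    return sorted(stale)


# Hypothetical minimum versions, for illustration only.
MINIMUM_FOR_CWL_1_0 = {
    "cwl-sdk": "1.0.20160701000000",
    "python-sdk": "0.1.20160701000000",
    "jobs-image": "0.1.20160701000000",
}

# A cluster where both SDKs were upgraded but the jobs image was not.
# The jobs-image version here is made up for the example.
installed = {
    "cwl-sdk": "1.0.20160717133709",
    "python-sdk": "0.1.20160721023501",
    "jobs-image": "0.1.20160601000000",
}

print(stale_components(installed, MINIMUM_FOR_CWL_1_0))
```

With these placeholder numbers, only the jobs image is reported as stale, which mirrors the second of the two partial upgrades described above.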

Actions #6

Updated by Brett Smith over 8 years ago

  • Subject changed from [CWL] [Documentation] Running `bwa-mem.cwl` from the tutorial crashes with an error to [CWL] [OPS?] Running `bwa-mem.cwl` from the tutorial crashes with an error
Actions #7

Updated by Brett Smith over 8 years ago

  • Subject changed from [CWL] [OPS?] Running `bwa-mem.cwl` from the tutorial crashes with an error to [CWL] Jobs crash when arvados/jobs:latest is not suitable to run CWL jobs
  • Category set to SDKs

We discussed this extensively at backlog grooming.

The fundamental issue is that the CWL SDK depends on various API objects existing (the arvados repository, the arvados/jobs:latest Docker image) and being recent enough, but it doesn't do anything to check for the existence of those objects, leading to strange errors when they're not sufficient.

Today we tend to treat this as an ops problem: the cluster needs to be deployed with the right dependencies. (See, e.g., #9677.) But as Arvados matures, that becomes less tenable. People are going to try using all kinds of client software with all kinds of clusters, and we're not going to be able to make sure everything stays in lockstep. That actually happened here: at Peter's suggestion, one user tried to use the very latest CWL clients on a cluster that wasn't ready for it, and the resulting errors were difficult to diagnose.

We talked about specific fixes for the CWL SDK. The strongest idea was that it could check for an arvados/jobs image tagged with its own Git hash, and upload that image with `arv keep docker` to make sure it's available on the cluster before submitting work. But even this isn't perfect: the user has to have Docker privileges for it to work, and it can end up uploading more Docker images to the cluster than are actually needed (which costs money). There might be Docker version compatibility issues too; that's not clear.

It would be good to devise a general strategy for clients to verify the presence of necessary API objects as part of their functioning. Even if we just added helper functions to the SDKs to do this, and made sure clients used them consistently before starting work, that would be a noticeable improvement. Plan to discuss this more at an ideas meeting or similar.
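As a rough illustration of what such a helper might look like, here is a minimal sketch. Everything in it is hypothetical: `verify_required_objects`, `MissingDependency`, and the `exists` callback are invented for this example and are not part of any Arvados SDK. The lookup is passed in as a function so the sketch makes no claims about the real API calls a client would use.

```python
class MissingDependency(Exception):
    """Raised when the cluster lacks an object the client needs."""


def verify_required_objects(required, exists):
    """Check each (kind, name) pair with exists(kind, name); raise if any are absent."""
    missing = [(kind, name) for kind, name in required if not exists(kind, name)]
    if missing:
        details = ", ".join(f"{kind} '{name}'" for kind, name in missing)
        raise MissingDependency(f"cluster is missing: {details}")


# Stand-in for a real API lookup: a cluster that has the repository
# but no jobs image.
cluster_objects = {("repository", "arvados")}


def fake_exists(kind, name):
    return (kind, name) in cluster_objects


required = [
    ("repository", "arvados"),
    ("docker_image", "arvados/jobs:latest"),
]

try:
    verify_required_objects(required, fake_exists)
except MissingDependency as error:
    print(error)
```

A client that ran a check like this before submitting work would fail early with a message naming the missing object, instead of the "File not found" error seen in the log.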

Actions #8

Updated by Ward Vandewege over 3 years ago

  • Target version deleted (Arvados Future Sprints)
Actions #9

Updated by Peter Amstutz almost 2 years ago

  • Release set to 60
Actions #10

Updated by Peter Amstutz 10 months ago

  • Target version set to Future
Actions #11

Updated by Peter Amstutz 4 months ago

  • Release deleted (60)
  • Target version deleted (Future)
  • Status changed from New to Resolved

arvados-cwl-runner (a-c-r) has, for a long time now, ensured that the jobs container image version matches the client version, and that the right jobs image version is present on the cluster (uploading it if necessary).
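The behavior described here can be pictured roughly as follows. This is a sketch under stated assumptions, not the actual arvados-cwl-runner code: the function name `ensure_jobs_image`, the `arvados/jobs:<client version>` tag format, and the `upload` callback are all assumptions made for illustration.

```python
def ensure_jobs_image(client_version, available_tags, upload):
    """Return the image name matching client_version, uploading it first if absent.

    available_tags is the set of arvados/jobs tags already on the cluster;
    upload(image) is called only when the matching tag is missing.
    """
    image = f"arvados/jobs:{client_version}"  # assumed tag format
    if client_version not in available_tags:
        upload(image)
    return image


# Record uploads in a list instead of contacting a real cluster.
uploaded = []
image = ensure_jobs_image("1.0.20160717133709", {"latest"}, uploaded.append)
print(image, uploaded)
```

Pinning the image to the client's own version, rather than relying on `latest`, is what removes the need to upgrade the client and the cluster image in lockstep.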
