Project

General

Profile

Actions

Feature #20825

closed

Extension to run subworkflow in its own runner

Added by Peter Amstutz 7 months ago. Updated 3 months ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
CWL
Story points:
-
Release relationship:
Auto

Description

Sort of like arv:RunInSingleContainer, we would like an option to take an invocation of a subworkflow and have it run with a separate workflow runner, e.g. arv:SeparateRunner.

The goal is for each subworkflow to show up as a separate step under the main workflow in workbench for ease of monitoring and debugging.

  1. Need to be able to submit each runner separately
  2. Need to be able to have custom names for the steps
  3. Support reuse of the subworkflow as a unit (without re-executing the workflow runner) if the subworkflow inputs/code haven't changed

Nice to haves:

  1. Feature to trap a failed step and produce null output, so that downstream steps can run even when some scattered subworkflows have failed.

Note: this is a way of incrementally achieving some of the goals of #20532, but doesn't do anything about the overhead of having lots of separate arvados-cwl-runner process.


Subtasks 1 (0 open1 closed)

Task #20946: Review 20825-cwl-separate-runnerResolvedPeter Amstutz10/13/2023Actions

Related issues

Related to Arvados - Feature #20532: Grouping steps by sampleNewActions
Related to Arvados - Feature #21050: feature to produce null outputs from failed steps and continue runningNewActions
Related to Arvados - Feature #21051: a-c-r feature to automatically generate scatter worfklowsNewActions
Related to Arvados - Idea #21259: Document SeparateRunner CWL extensionNewActions
Actions #1

Updated by Peter Amstutz 7 months ago

  • Description updated (diff)
Actions #2

Updated by Peter Amstutz 7 months ago

Actions #3

Updated by Peter Amstutz 7 months ago

  • Description updated (diff)
Actions #4

Updated by Peter Amstutz 7 months ago

  • Description updated (diff)
Actions #5

Updated by Peter Amstutz 6 months ago

  • Release set to 66
  • Target version changed from Future to Development 2023-09-13 sprint
Actions #7

Updated by Peter Amstutz 6 months ago

  • Release changed from 66 to 67
Actions #8

Updated by Peter Amstutz 6 months ago

  • Assigned To set to Peter Amstutz
  • Status changed from New to In Progress
Actions #9

Updated by Peter Amstutz 6 months ago

  • Target version changed from Development 2023-09-13 sprint to Development 2023-09-27 sprint
Actions #10

Updated by Peter Amstutz 5 months ago

  • Target version changed from Development 2023-09-27 sprint to Development 2023-10-11 sprint
Actions #11

Updated by Peter Amstutz 5 months ago

20825-cwl-separate-runner @ df566fda61e84d8d99ac133ecc3dbc8726dff40f

developer-run-tests: #3854

  • All agreed upon points are implemented / addressed.
    1. launches subworkflows in separate runners -- done
    2. set custom names for the sub-runners -- done
    3. can reuse whole workflows -- done
  • Anything not implemented (discovered or discussed during work) has a follow-up story.
    • the "nice to have" error handling was not implemented, added as #21050
    • also related to streamlining launching large scatters is #21051
  • Code is tested and passing, both automated and manual, what manual testing was done is described
    • added an integration test that launches a subworkflow with this feature. the test verifies it doesn't crash, but currently doesn't include a check that the subworkflow was launched with a separate runner.
    • was used to launch a test job of 500 samples, it worked very well
  • Documentation has been updated.
    • TBD
  • Behaves appropriately at the intended scale (describe intended scale).
    • This significantly improves scaling. Previously, launching a batch of (say) 500 samples would require a separate invocation of arvados-cwl-runner each time, taking 6-10 seconds per sample. With this new feature, all of the overhead of starting arvados-cwl-runner is eliminated, so launching each workflow takes about 0.2s per sample. This is a 30-50x improvement.
  • Considered backwards and forwards compatibility issues between client and server.
    • The subworkflow runners should behave exactly as if they had been launched at the command line, respecting runner level hints like ProcessProperties and WorkflowRunnerResources
  • Follows our coding standards and GUI style guidelines.
    • yes

The motivation for this branch is to be able to launch large number of per-sample subworkflows as quickly as possible, by eliminating the overhead of re-parsing the workflow each time. In a test, this allowed me to launch 500 samples in roughly 2 minutes, where previously this could take upwards of an hour.

Actions #12

Updated by Peter Amstutz 5 months ago

  • Related to Feature #21050: feature to produce null outputs from failed steps and continue running added
Actions #13

Updated by Peter Amstutz 5 months ago

  • Related to Feature #21051: a-c-r feature to automatically generate scatter worfklows added
Actions #14

Updated by Peter Amstutz 5 months ago

  • Target version changed from Development 2023-10-11 sprint to Development 2023-10-25 sprint
Actions #15

Updated by Alex Coleman 4 months ago

Overall I think it LGTM, I just have one nitpicky comment.

In commit 927e6781a1bb21e5ef1d887b89916685601b8fd4 in arvcontainer.py, you raise a generic Exception.

I learned it was bad to raise a generic exception, here's a quote from the python docs (https://docs.python.org/3/tutorial/errors.html)

"Exception can be used as a wildcard that catches (almost) everything. However, it is good practice to be as specific as possible with the types of exceptions that we intend to handle, and to allow any unexpected exceptions to propagate on."

I looked in our coding standards and didn't see anything, and was just wondering if we had a standard around that. If we are fine with that, I think it LGTM.

Actions #16

Updated by Peter Amstutz 4 months ago

Alex Coleman wrote in #note-15:

Overall I think it LGTM, I just have one nitpicky comment.

In commit 927e6781a1bb21e5ef1d887b89916685601b8fd4 in arvcontainer.py, you raise a generic Exception.

I learned it was bad to raise a generic exception, here's a quote from the python docs (https://docs.python.org/3/tutorial/errors.html)

"Exception can be used as a wildcard that catches (almost) everything. However, it is good practice to be as specific as possible with the types of exceptions that we intend to handle, and to allow any unexpected exceptions to propagate on."

I looked in our coding standards and didn't see anything, and was just wondering if we had a standard around that. If we are fine with that, I think it LGTM.

You're right, there's no reason that couldn't be a WorkflowException which is used generically for workflow logic errors but is an improvement over the Python base Exception. Done and merged. Thanks!

Actions #17

Updated by Peter Amstutz 4 months ago

  • Status changed from In Progress to Resolved
Actions #18

Updated by Brett Smith 3 months ago

Peter Amstutz wrote in #note-11:

  • Documentation has been updated.
    • TBD

What is even the point of having a branch review checklist if you can just fob off any point as "TBD"?

Actions #19

Updated by Brett Smith 3 months ago

  • Related to Idea #21259: Document SeparateRunner CWL extension added
Actions

Also available in: Atom PDF