Feature #20825
closed
Extension to run subworkflow in its own runner
Added by Peter Amstutz about 1 year ago.
Updated 9 months ago.
Release relationship:
Auto
Description
Sort of like arv:RunInSingleContainer, we would like an option to take an invocation of a subworkflow and have it run with a separate workflow runner, e.g. arv:SeparateRunner.
The goal is for each subworkflow to show up as a separate step under the main workflow in workbench for ease of monitoring and debugging.
- Need to be able to submit each runner separately
- Need to be able to have custom names for the steps
- Support reuse of the subworkflow as a unit (without re-executing the workflow runner) if the subworkflow inputs/code haven't changed
Nice to haves:
- Feature to trap a failed step and produce null output, so that downstream steps can run even when some scattered subworkflows have failed.
Note: this is a way of incrementally achieving some of the goals of #20532, but doesn't do anything about the overhead of having lots of separate arvados-cwl-runner process.
- Description updated (diff)
- Description updated (diff)
- Description updated (diff)
- Release set to 66
- Target version changed from Future to Development 2023-09-13 sprint
- Release changed from 66 to 67
- Assigned To set to Peter Amstutz
- Status changed from New to In Progress
- Target version changed from Development 2023-09-13 sprint to Development 2023-09-27 sprint
- Target version changed from Development 2023-09-27 sprint to Development 2023-10-11 sprint
20825-cwl-separate-runner @ df566fda61e84d8d99ac133ecc3dbc8726dff40f
developer-run-tests: #3854
- All agreed upon points are implemented / addressed.
- launches subworkflows in separate runners -- done
- set custom names for the sub-runners -- done
- can reuse whole workflows -- done
- Anything not implemented (discovered or discussed during work) has a follow-up story.
- the "nice to have" error handling was not implemented, added as #21050
- also related to streamlining launching large scatters is #21051
- Code is tested and passing, both automated and manual, what manual testing was done is described
- added an integration test that launches a subworkflow with this feature. the test verifies it doesn't crash, but currently doesn't include a check that the subworkflow was launched with a separate runner.
- was used to launch a test job of 500 samples, it worked very well
- Documentation has been updated.
- Behaves appropriately at the intended scale (describe intended scale).
- This significantly improves scaling. Previously, launching a batch of (say) 500 samples would require a separate invocation of
arvados-cwl-runner
each time, taking 6-10 seconds per sample. With this new feature, all of the overhead of starting arvados-cwl-runner
is eliminated, so launching each workflow takes about 0.2s per sample. This is a 30-50x improvement.
- Considered backwards and forwards compatibility issues between client and server.
- The subworkflow runners should behave exactly as if they had been launched at the command line, respecting runner level hints like ProcessProperties and WorkflowRunnerResources
- Follows our coding standards and GUI style guidelines.
The motivation for this branch is to be able to launch large number of per-sample subworkflows as quickly as possible, by eliminating the overhead of re-parsing the workflow each time. In a test, this allowed me to launch 500 samples in roughly 2 minutes, where previously this could take upwards of an hour.
- Related to Feature #21050: feature to produce null outputs from failed steps and continue running added
- Related to Feature #21051: a-c-r feature to automatically generate scatter worfklows added
- Target version changed from Development 2023-10-11 sprint to Development 2023-10-25 sprint
Overall I think it LGTM, I just have one nitpicky comment.
In commit 927e6781a1bb21e5ef1d887b89916685601b8fd4 in arvcontainer.py, you raise a generic Exception.
I learned it was bad to raise a generic exception, here's a quote from the python docs (https://docs.python.org/3/tutorial/errors.html)
"Exception can be used as a wildcard that catches (almost) everything. However, it is good practice to be as specific as possible with the types of exceptions that we intend to handle, and to allow any unexpected exceptions to propagate on."
I looked in our coding standards and didn't see anything, and was just wondering if we had a standard around that. If we are fine with that, I think it LGTM.
Alex Coleman wrote in #note-15:
Overall I think it LGTM, I just have one nitpicky comment.
In commit 927e6781a1bb21e5ef1d887b89916685601b8fd4 in arvcontainer.py, you raise a generic Exception.
I learned it was bad to raise a generic exception, here's a quote from the python docs (https://docs.python.org/3/tutorial/errors.html)
"Exception can be used as a wildcard that catches (almost) everything. However, it is good practice to be as specific as possible with the types of exceptions that we intend to handle, and to allow any unexpected exceptions to propagate on."
I looked in our coding standards and didn't see anything, and was just wondering if we had a standard around that. If we are fine with that, I think it LGTM.
You're right, there's no reason that couldn't be a WorkflowException which is used generically for workflow logic errors but is an improvement over the Python base Exception. Done and merged. Thanks!
- Status changed from In Progress to Resolved
Peter Amstutz wrote in #note-11:
- Documentation has been updated.
What is even the point of having a branch review checklist if you can just fob off any point as "TBD"?
- Related to Idea #21259: Document SeparateRunner CWL extension added
Also available in: Atom
PDF