Feature #20825
closedExtension to run subworkflow in its own runner
Description
Sort of like arv:RunInSingleContainer, we would like an option to take an invocation of a subworkflow and have it run with a separate workflow runner, e.g. arv:SeparateRunner.
The goal is for each subworkflow to show up as a separate step under the main workflow in workbench for ease of monitoring and debugging.
- Need to be able to submit each runner separately
- Need to be able to have custom names for the steps
- Support reuse of the subworkflow as a unit (without re-executing the workflow runner) if the subworkflow inputs/code haven't changed
Nice to haves:
- Feature to trap a failed step and produce null output, so that downstream steps can run even when some scattered subworkflows have failed.
Note: this is a way of incrementally achieving some of the goals of #20532, but doesn't do anything about the overhead of having lots of separate arvados-cwl-runner process.
Updated by Peter Amstutz over 1 year ago
- Related to Feature #20532: Grouping steps by sample added
Updated by Peter Amstutz over 1 year ago
- Release set to 66
- Target version changed from Future to Development 2023-09-13 sprint
Updated by Peter Amstutz over 1 year ago
20825-cwl-separate-runner @ f165ffe5ba28815b965fa4f92ceed99fc53e0776
Updated by Peter Amstutz over 1 year ago
- Assigned To set to Peter Amstutz
- Status changed from New to In Progress
Updated by Peter Amstutz over 1 year ago
- Target version changed from Development 2023-09-13 sprint to Development 2023-09-27 sprint
Updated by Peter Amstutz over 1 year ago
- Target version changed from Development 2023-09-27 sprint to Development 2023-10-11 sprint
Updated by Peter Amstutz over 1 year ago
20825-cwl-separate-runner @ df566fda61e84d8d99ac133ecc3dbc8726dff40f
- All agreed upon points are implemented / addressed.
- launches subworkflows in separate runners -- done
- set custom names for the sub-runners -- done
- can reuse whole workflows -- done
- Anything not implemented (discovered or discussed during work) has a follow-up story.
- Code is tested and passing, both automated and manual, what manual testing was done is described
- added an integration test that launches a subworkflow with this feature. the test verifies it doesn't crash, but currently doesn't include a check that the subworkflow was launched with a separate runner.
- was used to launch a test job of 500 samples, it worked very well
- Documentation has been updated.
- TBD
- Behaves appropriately at the intended scale (describe intended scale).
- This significantly improves scaling. Previously, launching a batch of (say) 500 samples would require a separate invocation of
arvados-cwl-runner
each time, taking 6-10 seconds per sample. With this new feature, all of the overhead of startingarvados-cwl-runner
is eliminated, so launching each workflow takes about 0.2s per sample. This is a 30-50x improvement.
- This significantly improves scaling. Previously, launching a batch of (say) 500 samples would require a separate invocation of
- Considered backwards and forwards compatibility issues between client and server.
- The subworkflow runners should behave exactly as if they had been launched at the command line, respecting runner level hints like ProcessProperties and WorkflowRunnerResources
- Follows our coding standards and GUI style guidelines.
- yes
The motivation for this branch is to be able to launch large number of per-sample subworkflows as quickly as possible, by eliminating the overhead of re-parsing the workflow each time. In a test, this allowed me to launch 500 samples in roughly 2 minutes, where previously this could take upwards of an hour.
Updated by Peter Amstutz over 1 year ago
- Related to Feature #21050: feature to produce null outputs from failed steps and continue running added
Updated by Peter Amstutz over 1 year ago
- Related to Feature #21051: a-c-r feature to automatically generate scatter worfklows added
Updated by Peter Amstutz over 1 year ago
- Target version changed from Development 2023-10-11 sprint to Development 2023-10-25 sprint
Updated by Alex Coleman over 1 year ago
Overall I think it LGTM, I just have one nitpicky comment.
In commit 927e6781a1bb21e5ef1d887b89916685601b8fd4 in arvcontainer.py, you raise a generic Exception.
I learned it was bad to raise a generic exception, here's a quote from the python docs (https://docs.python.org/3/tutorial/errors.html)
"Exception can be used as a wildcard that catches (almost) everything. However, it is good practice to be as specific as possible with the types of exceptions that we intend to handle, and to allow any unexpected exceptions to propagate on."
I looked in our coding standards and didn't see anything, and was just wondering if we had a standard around that. If we are fine with that, I think it LGTM.
Updated by Peter Amstutz over 1 year ago
Alex Coleman wrote in #note-15:
Overall I think it LGTM, I just have one nitpicky comment.
In commit 927e6781a1bb21e5ef1d887b89916685601b8fd4 in arvcontainer.py, you raise a generic Exception.
I learned it was bad to raise a generic exception, here's a quote from the python docs (https://docs.python.org/3/tutorial/errors.html)
"Exception can be used as a wildcard that catches (almost) everything. However, it is good practice to be as specific as possible with the types of exceptions that we intend to handle, and to allow any unexpected exceptions to propagate on."
I looked in our coding standards and didn't see anything, and was just wondering if we had a standard around that. If we are fine with that, I think it LGTM.
You're right, there's no reason that couldn't be a WorkflowException which is used generically for workflow logic errors but is an improvement over the Python base Exception. Done and merged. Thanks!
Updated by Peter Amstutz about 1 year ago
- Status changed from In Progress to Resolved
Updated by Brett Smith about 1 year ago
Peter Amstutz wrote in #note-11:
- Documentation has been updated.
- TBD
What is even the point of having a branch review checklist if you can just fob off any point as "TBD"?
Updated by Brett Smith about 1 year ago
- Related to Idea #21259: Document SeparateRunner CWL extension added