Feature #19975
Option to re-submit container with higher memory request if previous job was killed and crunchstat shows >90% memory usage
Status: Resolved
Priority: Normal
Assigned To:
Category: CWL
Target version:
Story points: 2.0
Release:
Release relationship: Auto
Updated by Peter Amstutz almost 2 years ago
- Blocked by Feature #19986: crunch-run tracks maximum usage of each crunchstat metric added
Updated by Peter Amstutz almost 2 years ago
- Target version changed from Future to To be scheduled
Updated by Peter Amstutz almost 2 years ago
- Related to Feature #19982: Ability to know when a container died because of spot instance reclamation and option to resubmit added
Updated by Peter Amstutz almost 2 years ago
- Target version changed from To be scheduled to Development 2023-03-15 sprint
Updated by Peter Amstutz almost 2 years ago
- Target version changed from Development 2023-03-15 sprint to Development 2023-03-29 Sprint
Updated by Peter Amstutz almost 2 years ago
- Target version changed from Development 2023-03-29 Sprint to Development 2023-03-15 sprint
- Assigned To set to Peter Amstutz
Updated by Peter Amstutz almost 2 years ago
- Status changed from New to In Progress
Updated by Peter Amstutz almost 2 years ago
19975-oom-resubmit @ 9e76a12ff0b25322f86caf6d5ea70c09cbfd8829
Updated by Peter Amstutz almost 2 years ago
Re-run failing test developer-run-tests-apps-workbench-integration: #3803
Updated by Peter Amstutz almost 2 years ago
Summary:
- When enabled, always retry on exit code 137 (I originally thought we should also check for a low memory condition, but a bit of experimentation showed the process often dies without any warning)
- It only retries once, with the supplied multiplier. I had originally thought of allowing some arbitrary number of retries but that seemed overcomplicated.
- It can find retried containers and reuse them, provided the RAM request and multiplier haven't changed.
- Also scans for several different substrings to infer that an out-of-memory condition occurred
- Option to provide your own regex if your tool produces some other error message
- Integration test fakes these conditions because reliably triggering them for real is tricky
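The retry rules summarized above can be sketched roughly as follows. This is a minimal illustration, not the code on the branch: the function names, the default pattern, and the parameters are all hypothetical, and the substrings the real implementation scans for are not listed in this ticket.

```python
import re

# Exit code 137 is 128 + 9, i.e. the process was killed with SIGKILL,
# which is what the kernel OOM killer sends.
OOM_KILL_EXIT_CODE = 137

# Assumption: illustrative substrings only, not the actual list.
DEFAULT_OOM_PATTERN = r"(bad_alloc|out ?of ?memory|memory ?error)"


def looks_like_oom(exit_code, log_text, custom_regex=None):
    """Return True if the failure should be treated as out-of-memory."""
    if exit_code == OOM_KILL_EXIT_CODE:
        # Always retry on 137, even with no warning in the logs.
        return True
    # A user-supplied regex replaces the default pattern in this sketch.
    pattern = custom_regex or DEFAULT_OOM_PATTERN
    return re.search(pattern, log_text, re.IGNORECASE) is not None


def next_ram_request(ram_bytes, multiplier, already_retried):
    """Return the RAM request for the retry, or None if no retry is allowed."""
    if already_retried:
        # Only one retry is attempted.
        return None
    return int(ram_bytes * multiplier)
```

For example, a container that requested 4 GiB and died with exit code 137 would be resubmitted once with `next_ram_request(4 * 2**30, 2, False)`, which is 8 GiB; a second failure is not retried.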
Updated by Lucas Di Pentima almost 2 years ago
Some comments:
- File arvcontainer.py
  - Line 370: Just a code style suggestion: keep the var naming snake_cased like the rest?
  - Line 489: The log warning message could be updated to offer this new feature to the user
- Do you think it would be useful to notify when a retry has been made because of an OOM situation? I suspect the workflow developer could be particularly interested in these kinds of events for optimization and cost estimation purposes.
- I think this needs to be added to the doc page user/cwl/cwl-extensions.html
Other than that, it LGTM.
Updated by Peter Amstutz almost 2 years ago
- Status changed from In Progress to Resolved