Feature #19975

closed

Option to re-submit container with higher memory request if previous job was killed and crunchstat shows >90% memory usage

Added by Peter Amstutz about 1 year ago. Updated about 1 year ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: CWL
Story points: 2.0
Release relationship: Auto

Subtasks: 1 (0 open, 1 closed)

Task #20206: Review 19975-oom-resubmit (Resolved, Peter Amstutz, 03/06/2023)

Related issues

Related to Arvados - Feature #19982: Ability to know when a container died because of spot instance reclamation and option to resubmit (In Progress, Alex Coleman)
Blocked by Arvados - Feature #19986: crunch-run tracks maximum usage of each crunchstat metric (Resolved, Brett Smith, 02/13/2023)
Actions #1

Updated by Peter Amstutz about 1 year ago

  • Category changed from Crunch to CWL
Actions #2

Updated by Peter Amstutz about 1 year ago

  • Blocked by Feature #19986: crunch-run tracks maximum usage of each crunchstat metric added
Actions #3

Updated by Peter Amstutz about 1 year ago

  • Story points set to 2.0
Actions #4

Updated by Peter Amstutz about 1 year ago

  • Target version changed from Future to To be scheduled
Actions #5

Updated by Peter Amstutz about 1 year ago

  • Related to Feature #19982: Ability to know when a container died because of spot instance reclamation and option to resubmit added
Actions #6

Updated by Peter Amstutz about 1 year ago

  • Target version changed from To be scheduled to Development 2023-03-15 sprint
Actions #7

Updated by Peter Amstutz about 1 year ago

  • Target version changed from Development 2023-03-15 sprint to Development 2023-03-29 Sprint
Actions #8

Updated by Peter Amstutz about 1 year ago

  • Target version changed from Development 2023-03-29 Sprint to Development 2023-03-15 sprint
  • Assigned To set to Peter Amstutz
Actions #9

Updated by Peter Amstutz about 1 year ago

  • Status changed from New to In Progress
Actions #12

Updated by Peter Amstutz about 1 year ago

Summary:

  • When enabled, always retry on exit code 137. (I originally thought we should also check for a low-memory condition, but a bit of experimentation showed that the process often dies without any warning.)
  • It only retries once, with the supplied multiplier. I had originally thought of allowing an arbitrary number of retries, but that seemed overcomplicated.
  • It can find retried containers and reuse them, provided the RAM request and multiplier haven't changed.
  • It also scans for several different substrings to infer that an out-of-memory condition occurred.
  • There is an option to provide your own regex if your tool produces some other error message.
  • The integration test fakes these conditions because reliably triggering them for real is tricky.
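
For readers skimming the ticket, the retry decision summarized above can be sketched roughly as follows. This is only an illustration of the logic described in the bullets, not the code from the branch: the function names, the substring list, and the RAM units are all hypothetical.

```python
import re

# Hypothetical substrings for illustration only; the real implementation
# may scan for a different set.
DEFAULT_OOM_SUBSTRINGS = ("out of memory", "bad_alloc", "memoryerror")


def looks_like_oom(exit_code, log_text, custom_regex=None):
    """Return True if a failed container looks like it ran out of memory."""
    # Exit code 137 is 128 + 9 (SIGKILL), which is what a process killed
    # by the kernel OOM killer typically reports.
    if exit_code == 137:
        return True
    # Otherwise, infer OOM from known error substrings (case-insensitive).
    lowered = log_text.lower()
    if any(s in lowered for s in DEFAULT_OOM_SUBSTRINGS):
        return True
    # Finally, try the user-supplied regex, if any.
    if custom_regex and re.search(custom_regex, log_text):
        return True
    return False


def next_ram_request(ram_mib, multiplier, already_retried):
    """Return the RAM request for the single retry, or None if already retried."""
    if already_retried:
        return None  # only one retry is attempted
    return int(ram_mib * multiplier)
```

For example, a container that requested 1024 MiB and exited with code 137 would be resubmitted once with 2048 MiB under a multiplier of 2, and would not be retried again if the second attempt also failed.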
Actions #13

Updated by Lucas Di Pentima about 1 year ago

Some comments:

  • File arvcontainer.py
    • Line 370: Just a code style suggestion: keep the var naming snake_cased like the rest?
    • Line 489: The log warning message could be updated to offer this new feature to the user
  • Do you think it would be useful to notify the user when a retry has been made because of an OOM situation? I suspect the workflow developer could be particularly interested in these kinds of events for optimization and cost-estimation purposes.
  • I think this needs to be added to the doc page user/cwl/cwl-extensions.html

Other than that, it LGTM.

Actions #14

Updated by Peter Amstutz about 1 year ago

  • Status changed from In Progress to Resolved
Actions #15

Updated by Peter Amstutz about 1 year ago

  • Release set to 57