Feature #19982
Closed
Ability to know when a container died because of spot instance reclamation and option to resubmit
Description
New arvados-cwl-runner behavior when spot instances are enabled
- When submitting to a spot instance, don't retry
- Ability to detect when a container failed due to reclaimed spot instance (#19961)
- Exit code to indicate workflow failed due to spot instance
- Option to automatically re-submit as on-demand instance
Updated by Peter Amstutz almost 2 years ago
- Blocked by Feature #19961: Detect and log spot instance interruption notices added
Updated by Peter Amstutz almost 2 years ago
- Category changed from CWL to Crunch
- Description updated (diff)
Updated by Peter Amstutz almost 2 years ago
- Story points changed from 2.0 to 3.0
Updated by Peter Amstutz almost 2 years ago
- Target version changed from Future to To be scheduled
Updated by Peter Amstutz almost 2 years ago
- Related to Feature #19975: Option to re-submit container with higher memory request if previous job was killed and crunchstat shows >90% memory usage added
Updated by Peter Amstutz almost 2 years ago
- Related to Feature #19974: Option to re-submit preemptible jobs to reserved nodes when previous attempt was interrupted added
Updated by Peter Amstutz almost 2 years ago
- Related to Idea #18179: Better spot instance support added
Updated by Peter Amstutz over 1 year ago
- Target version changed from To be scheduled to Development 2023-08-02 sprint
Updated by Peter Amstutz over 1 year ago
- Target version changed from Development 2023-08-02 sprint to Development 2023-08-16
Updated by Peter Amstutz over 1 year ago
- Target version changed from Development 2023-08-16 to Development 2023-08-30
Updated by Peter Amstutz over 1 year ago
- Target version changed from Development 2023-08-30 to Development 2023-09-13 sprint
Updated by Brett Smith over 1 year ago
- Related to Bug #20606: Unstartable preemptible:true containers should not be reused by non-retryable preemptible:false requests added
Updated by Brett Smith over 1 year ago
We should consider undoing or narrowing the reuse changes we made in #20606 after we implement this. If Arvados gets better about retrying, the reuse narrowing becomes more likely to be wasteful than helpful.
Updated by Peter Amstutz over 1 year ago
- Target version changed from Development 2023-09-13 sprint to Development 2023-09-27 sprint
Updated by Peter Amstutz over 1 year ago
- Status changed from New to In Progress
Updated by Peter Amstutz over 1 year ago
- Target version changed from Development 2023-09-27 sprint to Development 2023-10-11 sprint
Updated by Peter Amstutz about 1 year ago
- Target version changed from Development 2023-10-11 sprint to Development 2023-10-25 sprint
Updated by Peter Amstutz about 1 year ago
- Target version changed from Development 2023-10-25 sprint to Development 2023-11-08 sprint
Updated by Peter Amstutz about 1 year ago
- Target version changed from Development 2023-11-08 sprint to Development 2023-11-29 sprint
Updated by Peter Amstutz about 1 year ago
- Target version changed from Development 2023-11-29 sprint to Development 2024-01-03 sprint
Updated by Peter Amstutz about 1 year ago
- Target version changed from Development 2024-01-03 sprint to Development 2024-01-17 sprint
Updated by Peter Amstutz 12 months ago
- Target version changed from Development 2024-01-17 sprint to Development 2024-01-31 sprint
Updated by Peter Amstutz 11 months ago
- Target version changed from Development 2024-01-31 sprint to Development 2024-02-14 sprint
Updated by Peter Amstutz 11 months ago
- Target version changed from Development 2024-02-14 sprint to Development 2024-02-28 sprint
Updated by Peter Amstutz 10 months ago
- Target version changed from Development 2024-02-28 sprint to Development 2024-03-13 sprint
Updated by Peter Amstutz 10 months ago
- Target version changed from Development 2024-03-13 sprint to Development 2024-03-27 sprint
Updated by Peter Amstutz 10 months ago
- Target version changed from Development 2024-03-27 sprint to Development 2024-04-10 sprint
Updated by Peter Amstutz 9 months ago
- Target version changed from Development 2024-04-10 sprint to Development 2024-04-24 sprint
Updated by Peter Amstutz 9 months ago
- Target version changed from Development 2024-04-24 sprint to Development 2024-05-08 sprint
Updated by Peter Amstutz 8 months ago
- Target version changed from Development 2024-05-08 sprint to Development 2024-05-22 sprint
Updated by Peter Amstutz 8 months ago
- Target version changed from Development 2024-05-22 sprint to Development 2024-06-05 sprint
Updated by Peter Amstutz 7 months ago
- Target version changed from Development 2024-06-05 sprint to Development 2024-06-19 sprint
Updated by Alex Coleman 7 months ago
- File diff_spot_instance.txt added
Diff from main..HEAD:
Updated by Peter Amstutz 7 months ago
- Target version changed from Development 2024-06-19 sprint to Development 2024-07-03 sprint
Updated by Peter Amstutz 6 months ago
- Target version changed from Development 2024-07-03 sprint to Development 2024-07-24 sprint
- Assigned To set to Peter Amstutz
Updated by Peter Amstutz 6 months ago
- Target version changed from Development 2024-07-24 sprint to Development 2024-08-07 sprint
Updated by Peter Amstutz 6 months ago
- Target version changed from Development 2024-08-07 sprint to Development 2024-08-28 sprint
Updated by Peter Amstutz 6 months ago
- Target version changed from Development 2024-08-28 sprint to Development 2024-08-07 sprint
Updated by Peter Amstutz 5 months ago
19982-spot-instance @ 4370bab5da1c4ed4d7100d3d344749166a2982f4
- All agreed upon points are implemented / addressed.
- yes
- Anything not implemented (discovered or discussed during work) has a follow-up story.
- n/a
- Code is tested and passing, both automated and manual, what manual testing was done is described
- added new unit tests.
- Documentation has been updated.
- added to list of CWL extensions
- Behaves appropriately at the intended scale (describe intended scale).
- no effect on scale
- Considered backwards and forwards compatibility issues between client and server.
- if preemptionNotice is not set, has no effect.
- Follows our coding standards and GUI style guidelines.
- yes
Checks for the presence of a non-empty container["runtime_status"]["preemptionNotice"]. If so, and feature is enabled, resubmits with preemptible = false to increase the likelihood of making progress when the spot market has limited capacity.
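The check described above can be sketched roughly as follows. This is a minimal illustration, not the code on the branch: the function names `was_preempted` and `plan_resubmit` and the `resubmit_enabled` argument are hypothetical, and it assumes the preemptible setting lives under `scheduling_parameters` on the container request.

```python
from typing import Any, Mapping, Optional


def was_preempted(container: Mapping[str, Any]) -> bool:
    """Return True if the container record has a non-empty preemptionNotice."""
    runtime_status = container.get("runtime_status") or {}
    return bool(runtime_status.get("preemptionNotice"))


def plan_resubmit(
    container: Mapping[str, Any],
    request: Mapping[str, Any],
    resubmit_enabled: bool,
) -> Optional[dict]:
    """Return a copy of the request with preemptible turned off, or None.

    Hypothetical helper: None means "do not resubmit", either because the
    feature is disabled or because no preemption notice was recorded.
    """
    if not (resubmit_enabled and was_preempted(container)):
        return None
    # Assumption: preemptible is a key inside scheduling_parameters.
    scheduling = dict(request.get("scheduling_parameters") or {})
    scheduling["preemptible"] = False
    return {**request, "scheduling_parameters": scheduling}
```

Returning a modified copy rather than mutating the original request keeps the failed attempt's record intact, which matters for the debugging and accounting concerns discussed later in this ticket.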
Because I was already changing code related to re-submitting containers, I also made the changes noted in #21413 for out-of-memory retry. The most significant change is deleting the failed attempts. Users were confused by having failed steps listed when the workflow ran successfully. On the other hand, it could complicate debugging -- although it (a) logs the container request UUID and (b) does not delete the collection with the logs of the failed step, so that might be sufficient.
Updated by Peter Amstutz 5 months ago
- Related to Bug #21413: OOM retry is confusing added
Updated by Peter Amstutz 5 months ago
19982-spot-instance @ 1d756458d601d70b56d8eac0d8d4387569092d68
Doc updates
Updated by Peter Amstutz 5 months ago
- Target version changed from Development 2024-08-07 sprint to Development 2024-08-28 sprint
Updated by Lucas Di Pentima 5 months ago
Some comments:
- File sdk/cwl/tests/test_container.py
- Line 1342: I think it would make the test cases a lot easier to follow if the expected result were included as a parameter instead of being calculated on the fly. I know repeating text is not tidy, but in this case I think it improves readability and makes it clear what the test author's intentions were.
- Line 1465: There's some commented-out code.
- File sdk/cwl/arvados_cwl/arvcontainer.py
- Line 8: I believe this is an unused import.
- Re: deleting failed-and-retried containers: I think users' opinion that it's confusing is valid, but it seems to me that that's just a UI problem, not a backend problem. For example, I'm thinking that keeping the failed preempted attempts is useful for cost accounting and also as a data point for deciding on the suitability of preemptible instances over time.
- I think we also need the new a-c-r flags to be documented in doc/user/cwl/cwl-run-options.html.textile.liquid
Updated by Peter Amstutz 5 months ago
Lucas Di Pentima wrote in #note-50:
Some comments:
- File sdk/cwl/tests/test_container.py
- Line 1342: I think it would make the test cases a lot easier to follow if the expected result were included as a parameter instead of being calculated on the fly. I know repeating text is not tidy, but in this case I think it improves readability and makes it clear what the test author's intentions were.
Done.
- Line 1465: There's some commented-out code.
I removed all the commented-out arv_docker_clear_cache
- File sdk/cwl/arvados_cwl/arvcontainer.py
- Line 8: I believe this is an unused import.
Yeah, I don't know where that came from.
- I think we also need the new a-c-r flags to be documented in doc/user/cwl/cwl-run-options.html.textile.liquid
Done.
- Re: deleting failed-and-retried containers: I think users' opinion that it's confusing is valid, but it seems to me that that's just a UI problem, not a backend problem. For example, I'm thinking that keeping the failed preempted attempts is useful for cost accounting and also as a data point for deciding on the suitability of preemptible instances over time.
Let's talk about this at standup
19982-spot-instance @ 16a847f510a6b81520a942622ad8ffd38a9cc68f
Updated by Peter Amstutz 5 months ago
19982-spot-instance @ 16cf7bf16464fbf01360dea4e07859d1256a8990
Now sets "arv:failed_container_resubmitted" on the old container request.
Updated test & documentation.
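As a rough illustration of marking the old container request instead of deleting it, the sketch below tags a request record with "arv:failed_container_resubmitted". It rests on assumptions this ticket does not spell out: that the marker is stored in the request's `properties` and that its value is the UUID of the replacement request. The helper name `mark_resubmitted` is hypothetical.

```python
def mark_resubmitted(old_request: dict, new_request_uuid: str) -> dict:
    """Return a copy of old_request tagged as resubmitted.

    Assumptions (not confirmed by the ticket): the marker is a key in
    `properties`, and its value is the UUID of the new container request.
    """
    properties = dict(old_request.get("properties") or {})
    properties["arv:failed_container_resubmitted"] = new_request_uuid
    return {**old_request, "properties": properties}
```

A marker like this would let a UI hide or group superseded attempts while the records stay available for cost accounting.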
Updated by Peter Amstutz 5 months ago
- Status changed from In Progress to Resolved