Feature #19982

closed

Ability to know when a container died because of spot instance reclamation and option to resubmit

Added by Peter Amstutz almost 2 years ago. Updated 5 months ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
CWL
Story points:
3.0
Release:
Release relationship:
Auto

Description

New arvados-cwl-runner behavior when spot instances are enabled

  • When submitting to a spot instance, don't retry
  • Ability to detect when a container failed due to a reclaimed spot instance (#19961)
  • Exit code to indicate the workflow failed due to spot instance reclamation
  • Option to automatically re-submit as an on-demand instance

Files

diff_spot_instance.txt (6.37 KB), Alex Coleman, 06/11/2024 05:54 PM

Subtasks 1 (0 open, 1 closed)

Task #20761: Review 19982-spot-instance (Resolved, Peter Amstutz, 08/16/2024)

Related issues 6 (2 open, 4 closed)

Related to Arvados - Feature #19975: Option to re-submit container with higher memory request if previous job was killed and crunchstat shows >90% memory usage (Resolved, Peter Amstutz, 03/06/2023)
Related to Arvados - Feature #19974: Option to re-submit preemptible jobs to reserved nodes when previous attempt was interrupted (New)
Related to Arvados Epics - Idea #18179: Better spot instance support (In Progress, 03/01/2022 to 06/30/2024)
Related to Arvados - Bug #20606: Unstartable preemptible:true containers should not be reused by non-retryable preemptible:false requests (Resolved, Tom Clegg, 06/27/2023)
Related to Arvados - Bug #21413: OOM retry is confusing (Resolved, Peter Amstutz)
Blocked by Arvados - Feature #19961: Detect and log spot instance interruption notices (Resolved, Tom Clegg, 02/16/2023)
Actions #1

Updated by Peter Amstutz almost 2 years ago

  • Blocked by Feature #19961: Detect and log spot instance interruption notices added
Actions #2

Updated by Peter Amstutz almost 2 years ago

  • Description updated (diff)
Actions #3

Updated by Peter Amstutz almost 2 years ago

  • Category changed from CWL to Crunch
  • Description updated (diff)
Actions #4

Updated by Peter Amstutz almost 2 years ago

  • Description updated (diff)
Actions #5

Updated by Peter Amstutz almost 2 years ago

  • Category changed from Crunch to CWL
Actions #6

Updated by Peter Amstutz almost 2 years ago

  • Story points set to 2.0
Actions #7

Updated by Peter Amstutz almost 2 years ago

  • Story points changed from 2.0 to 3.0
Actions #8

Updated by Peter Amstutz almost 2 years ago

  • Target version changed from Future to To be scheduled
Actions #9

Updated by Peter Amstutz almost 2 years ago

  • Related to Feature #19975: Option to re-submit container with higher memory request if previous job was killed and crunchstat shows >90% memory usage added
Actions #10

Updated by Peter Amstutz almost 2 years ago

  • Related to Feature #19974: Option to re-submit preemptible jobs to reserved nodes when previous attempt was interrupted added
Actions #11

Updated by Peter Amstutz almost 2 years ago

  • Related to Idea #18179: Better spot instance support added
Actions #12

Updated by Peter Amstutz over 1 year ago

  • Target version changed from To be scheduled to Development 2023-08-02 sprint
Actions #13

Updated by Peter Amstutz over 1 year ago

  • Assigned To set to Alex Coleman
Actions #14

Updated by Peter Amstutz over 1 year ago

  • Target version changed from Development 2023-08-02 sprint to Development 2023-08-16
Actions #15

Updated by Peter Amstutz over 1 year ago

  • Target version changed from Development 2023-08-16 to Development 2023-08-30
Actions #16

Updated by Peter Amstutz over 1 year ago

  • Target version changed from Development 2023-08-30 to Development 2023-09-13 sprint
Actions #17

Updated by Brett Smith over 1 year ago

  • Related to Bug #20606: Unstartable preemptible:true containers should not be reused by non-retryable preemptible:false requests added
Actions #18

Updated by Brett Smith over 1 year ago

We should consider undoing or narrowing the reuse changes we made in #20606 after we implement this. If Arvados gets better about retrying, then the odds go up that the reuse narrowing is more wasteful than helpful.

Actions #19

Updated by Peter Amstutz over 1 year ago

  • Target version changed from Development 2023-09-13 sprint to Development 2023-09-27 sprint
Actions #20

Updated by Peter Amstutz over 1 year ago

  • Status changed from New to In Progress
Actions #21

Updated by Peter Amstutz over 1 year ago

  • Target version changed from Development 2023-09-27 sprint to Development 2023-10-11 sprint
Actions #22

Updated by Peter Amstutz about 1 year ago

  • Target version changed from Development 2023-10-11 sprint to Development 2023-10-25 sprint
Actions #23

Updated by Peter Amstutz about 1 year ago

  • Target version changed from Development 2023-10-25 sprint to Development 2023-11-08 sprint
Actions #24

Updated by Peter Amstutz about 1 year ago

  • Target version changed from Development 2023-11-08 sprint to Development 2023-11-29 sprint
Actions #25

Updated by Peter Amstutz about 1 year ago

  • Target version changed from Development 2023-11-29 sprint to Development 2024-01-03 sprint
Actions #26

Updated by Peter Amstutz about 1 year ago

  • Target version changed from Development 2024-01-03 sprint to Development 2024-01-17 sprint
Actions #27

Updated by Peter Amstutz 12 months ago

  • Target version changed from Development 2024-01-17 sprint to Development 2024-01-31 sprint
Actions #28

Updated by Peter Amstutz 11 months ago

  • Target version changed from Development 2024-01-31 sprint to Development 2024-02-14 sprint
Actions #29

Updated by Peter Amstutz 11 months ago

  • Target version changed from Development 2024-02-14 sprint to Development 2024-02-28 sprint
Actions #30

Updated by Peter Amstutz 10 months ago

  • Target version changed from Development 2024-02-28 sprint to Development 2024-03-13 sprint
Actions #31

Updated by Peter Amstutz 10 months ago

  • Target version changed from Development 2024-03-13 sprint to Development 2024-03-27 sprint
Actions #32

Updated by Peter Amstutz 10 months ago

  • Target version changed from Development 2024-03-27 sprint to Development 2024-04-10 sprint
Actions #33

Updated by Peter Amstutz 9 months ago

  • Target version changed from Development 2024-04-10 sprint to Development 2024-04-24 sprint
Actions #34

Updated by Peter Amstutz 9 months ago

  • Target version changed from Development 2024-04-24 sprint to Development 2024-05-08 sprint
Actions #35

Updated by Peter Amstutz 8 months ago

  • Target version changed from Development 2024-05-08 sprint to Development 2024-05-22 sprint
Actions #36

Updated by Peter Amstutz 8 months ago

  • Target version changed from Development 2024-05-22 sprint to Development 2024-06-05 sprint
Actions #37

Updated by Peter Amstutz 7 months ago

  • Target version changed from Development 2024-06-05 sprint to Development 2024-06-19 sprint
Actions #38

Updated by Alex Coleman 7 months ago

Diff from main..HEAD:

Actions #39

Updated by Peter Amstutz 7 months ago

  • Target version changed from Development 2024-06-19 sprint to Development 2024-07-03 sprint
Actions #40

Updated by Peter Amstutz 6 months ago

  • Assigned To deleted (Alex Coleman)
Actions #41

Updated by Peter Amstutz 6 months ago

  • Target version changed from Development 2024-07-03 sprint to Development 2024-07-24 sprint
  • Assigned To set to Peter Amstutz
Actions #42

Updated by Peter Amstutz 6 months ago

  • Target version changed from Development 2024-07-24 sprint to Development 2024-08-07 sprint
Actions #43

Updated by Peter Amstutz 6 months ago

  • Target version changed from Development 2024-08-07 sprint to Development 2024-08-28 sprint
Actions #44

Updated by Peter Amstutz 6 months ago

  • Target version changed from Development 2024-08-28 sprint to Development 2024-08-07 sprint
Actions #45

Updated by Peter Amstutz 6 months ago

  • Description updated (diff)
Actions #46

Updated by Peter Amstutz 5 months ago

19982-spot-instance @ 4370bab5da1c4ed4d7100d3d344749166a2982f4

developer-run-tests: #4365

  • All agreed upon points are implemented / addressed.
    • yes
  • Anything not implemented (discovered or discussed during work) has a follow-up story.
    • n/a
  • Code is tested and passing, both automated and manual, what manual testing was done is described
    • added new unit tests.
  • Documentation has been updated.
    • added to list of CWL extensions
  • Behaves appropriately at the intended scale (describe intended scale).
    • no effect on scale
  • Considered backwards and forwards compatibility issues between client and server.
    • if preemptionNotice is not set, has no effect.
  • Follows our coding standards and GUI style guidelines.
    • yes

Checks for the presence of a non-empty container["runtime_status"]["preemptionNotice"]. If it is present and the feature is enabled, resubmits the container with preemptible = false to increase the likelihood of making progress when the spot market has limited capacity.
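
As a rough illustration of the check described above, here is a minimal sketch. It is not the code from the branch: the function names and the `enabled` argument are hypothetical, and it assumes the container and container request are plain dictionaries with the preemptible setting stored under `scheduling_parameters`.

```python
from copy import deepcopy
from typing import Any, Dict, Optional


def preemption_notice(container: Dict[str, Any]) -> str:
    """Return the preemption notice text, or "" when none was recorded."""
    runtime_status = container.get("runtime_status") or {}
    return runtime_status.get("preemptionNotice") or ""


def build_resubmit_request(
    container: Dict[str, Any],
    container_request: Dict[str, Any],
    enabled: bool,
) -> Optional[Dict[str, Any]]:
    """Return a copy of the request with preemptible turned off.

    Returns None when the feature is disabled (hypothetical `enabled`
    flag) or when the container has no non-empty preemptionNotice.
    """
    if not enabled or not preemption_notice(container):
        return None
    retry = deepcopy(container_request)  # leave the original request untouched
    params = dict(retry.get("scheduling_parameters") or {})
    params["preemptible"] = False  # ask for an on-demand instance this time
    retry["scheduling_parameters"] = params
    return retry
```

If preemptionNotice is missing or empty, the helper returns None and nothing is resubmitted, which matches the compatibility note in the checklist above.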

Because I was already changing code related to re-submitting containers, I also made the changes noted in #21413 for out-of-memory retry. The most significant change is deleting the failed attempts. Users were confused by having failed steps listed when the workflow ran successfully. On the other hand, it could complicate debugging, although it (a) logs the container request uuid and (b) doesn't delete the collection with the logs of the failed step, so that might be sufficient.

Actions #47

Updated by Peter Amstutz 5 months ago

  • Related to Bug #21413: OOM retry is confusing added
Actions #49

Updated by Peter Amstutz 5 months ago

  • Target version changed from Development 2024-08-07 sprint to Development 2024-08-28 sprint
Actions #50

Updated by Lucas Di Pentima 5 months ago

Some comments:

  • File sdk/cwl/tests/test_container.py
    • Line 1342: I think it would make test cases a lot easier to follow if the expected result is included as a parameter instead of being calculated on the fly. I know repeating text is not tidy but in this case I think it improves readability and makes it clear what the test author's intentions were.
    • Line 1465: There's some commented out code
  • File sdk/cwl/arvados_cwl/arvcontainer.py Line 8: I believe this is an unused import
  • Re: deleting failed-and-retried containers: I think the users' opinion that it's confusing is valid, but it seems to me that that's just a UI problem, not a backend problem. I'm thinking for example that keeping the failed preempted attempts is useful for cost accounting and also as a data point to make decisions on suitability of preemptible instances over time.
  • I think we also need the new a-c-r flags to be documented in doc/user/cwl/cwl-run-options.html.textile.liquid
Actions #51

Updated by Peter Amstutz 5 months ago

Lucas Di Pentima wrote in #note-50:

Some comments:

  • File sdk/cwl/tests/test_container.py
    • Line 1342: I think it would make test cases a lot easier to follow if the expected result is included as a parameter instead of being calculated on the fly. I know repeating text is not tidy but in this case I think it improves readability and makes it clear what the test author's intentions were.

Done.

  • Line 1465: There's some commented out code

I removed all the commented-out arv_docker_clear_cache

  • File sdk/cwl/arvados_cwl/arvcontainer.py Line 8: I believe this is an unused import

Yea, I don't know where that came from.

  • I think we also need the new a-c-r flags to be documented in doc/user/cwl/cwl-run-options.html.textile.liquid

Done.

  • Re: deleting failed-and-retried containers: I think the users' opinion that it's confusing is valid, but it seems to me that that's just a UI problem, not a backend problem. I'm thinking for example that keeping the failed preempted attempts is useful for cost accounting and also as a data point to make decisions on suitability of preemptible instances over time.

Let's talk about this at standup

19982-spot-instance @ 16a847f510a6b81520a942622ad8ffd38a9cc68f

developer-run-tests: #4388

Actions #52

Updated by Peter Amstutz 5 months ago

19982-spot-instance @ 16cf7bf16464fbf01360dea4e07859d1256a8990

developer-run-tests: #4394

Now sets "arv:failed_container_resubmitted" on the old container request.

Updated test & documentation.
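
For illustration only, marking the old container request instead of deleting it could look roughly like the sketch below. The helper names are hypothetical, and two details are assumptions not stated in this ticket: that the marker lives in the request's `properties`, and that its value is the UUID of the new request.

```python
from typing import Any, Dict

RESUBMITTED_KEY = "arv:failed_container_resubmitted"


def resubmitted_update(old_request: Dict[str, Any], new_request_uuid: str) -> Dict[str, Any]:
    """Build an update body that marks the old request as resubmitted.

    Assumption: the marker is stored in `properties` and its value is
    the UUID of the replacement request.
    """
    props = dict(old_request.get("properties") or {})  # copy, keep existing properties
    props[RESUBMITTED_KEY] = new_request_uuid
    return {"properties": props}


def was_resubmitted(request: Dict[str, Any]) -> bool:
    """True when a request carries the resubmitted marker."""
    return bool((request.get("properties") or {}).get(RESUBMITTED_KEY))
```

A user interface could use a check like `was_resubmitted` to hide or group failed attempts while keeping them available for cost accounting, which is the concern raised in note 50.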

Actions #53

Updated by Lucas Di Pentima 5 months ago

This LGTM, thanks!

Actions #54

Updated by Peter Amstutz 5 months ago

  • Status changed from In Progress to Resolved
Actions #55

Updated by Peter Amstutz 5 months ago

  • Release set to 70