Project

General

Profile

Actions

Bug #22434

closed

"Error checking spot interruptions:" is actually a warning, text should reflect that

Added by Peter Amstutz 3 months ago. Updated 2 months ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
Crunch
Target version:
Story points:
-
Release relationship:
Auto

Description

User saw "Error checking spot interruptions" and thought this was an error that was causing them problems, but this is a warning. The text should be softened a bit. E.g. "Warning: possible problem checking spot interruptions"


Subtasks 1 (0 open1 closed)

Task #22451: Review 22434-spot-check-warningResolvedTom Clegg01/14/2025Actions

Related issues 1 (1 open0 closed)

Related to Arvados - Bug #22454: Spot instance termination not being detected?NewTom CleggActions
Actions #1

Updated by Peter Amstutz 3 months ago

  • Position changed from -933178 to -933177
Actions #2

Updated by Peter Amstutz 3 months ago

  • Category set to Crunch
  • Description updated (diff)
Actions #3

Updated by Peter Amstutz 3 months ago

  • Assigned To set to Tom Clegg
Actions #4

Updated by Peter Amstutz 3 months ago

  • Subtask #22451 added
Actions #5

Updated by Tom Clegg 3 months ago

  • Related to Bug #22454: Spot instance termination not being detected? added
Actions #6

Updated by Tom Clegg 3 months ago

  • Status changed from New to In Progress

22434-spot-check-warning @ c22b3ee97ef626e800ff0f72f3af65ea47838542 -- developer-run-tests: #4597
retry wb2 units run-tests-remainder: #4871
retry wb2 units run-tests-remainder: #4873

  • All agreed upon points are implemented / addressed. Describe changes from pre-implementation design.
    • ✅ new log message is: Temporarily unable to check spot instance interruptions: 503 Service Unavailable (will retry in 5s)
  • Anything not implemented (discovered or discussed during work) has a follow-up story.
    • N/A
  • Code is tested and passing, both automated and manual, what manual testing was done is described.
    • ✅ existing tests are updated
  • New or changed UX/UX and has gotten feedback from stakeholders.
    • N/A
  • Documentation has been updated.
    • N/A
  • Behaves appropriately at the intended scale (describe intended scale).
    • N/A
  • Considered backwards and forwards compatibility issues between client and server.
    • N/A - messages in runtime_status have not changed
  • Follows our coding standards and GUI style guidelines.
Actions #7

Updated by Brett Smith 3 months ago

Tom Clegg wrote in #note-6:

22434-spot-check-warning @ c22b3ee97ef626e800ff0f72f3af65ea47838542 -- developer-run-tests: #4597

This is all good. Since we're thinking about the usability of these log messages, I'm looking at this:

if err != nil {
    runner.CrunchLog.Printf("Temporarily unable to check spot instance interruptions: %s (will retry in %v)", err, spotInterruptionCheckInterval)
    failures++
    if failures > 5 {
        runner.CrunchLog.Printf("Giving up on checking spot instance interruptions after too many consecutive errors")
        return
    }
    continue
}

It's a little unfortunate that the logs can say "(will retry)" and then immediately after that "nm I lied I'm giving up." What if this flowed like failures++; if failures <= 5 { log the first message; continue; } else { log the give up message; return; }?

Actions #8

Updated by Tom Clegg 3 months ago

Right, "will retry" followed immediately by "giving up" is unfortunate.

Rearranged messages, now it looks like this:

zzzzz-zzzzz-zzzzzzzzzzzzzzz 2025-01-10T20:09:26.721233648Z Unable to check spot instance interruptions: 503 Service Unavailable -- will retry in 125ms
zzzzz-zzzzz-zzzzzzzzzzzzzzz 2025-01-10T20:09:26.845810061Z Unable to check spot instance interruptions: 503 Service Unavailable -- will retry in 125ms
zzzzz-zzzzz-zzzzzzzzzzzzzzz 2025-01-10T20:09:26.971103911Z Unable to check spot instance interruptions: 503 Service Unavailable -- will retry in 125ms
zzzzz-zzzzz-zzzzzzzzzzzzzzz 2025-01-10T20:09:27.096125533Z Unable to check spot instance interruptions: 503 Service Unavailable -- will retry in 125ms
zzzzz-zzzzz-zzzzzzzzzzzzzzz 2025-01-10T20:09:27.220723738Z Unable to check spot instance interruptions: 503 Service Unavailable -- now giving up after too many consecutive errors

22434-spot-check-warning @ 6ac713a1fc7805ffe47d733fca1495e856be54be -- developer-run-tests: #4599

Actions #9

Updated by Brett Smith 3 months ago

Tom Clegg wrote in #note-8:

22434-spot-check-warning @ 6ac713a1fc7805ffe47d733fca1495e856be54be -- developer-run-tests: #4599

This flows even better, thanks. My one last nit here is now with this change, we lost the word "Temporarily" so now "Unable" sounds permanent and a little bit spooky. I get that the later retry text should clarify that, but I wonder if we can do better. What about "Spot instance interruption check did not succeed (%s)"? I think putting the subject before the verb can help contextualize this.

If that's too big or too off-style, maybe "Could not check for spot instance interruption: %s"? I feel like this is just vaguer but I feel like that kind of is what we're after.

Happy to workshop it in chat, or not. I don't feel super strongly about this, just my personal take given the tone we're trying to strike in this ticket. Thanks.

Actions #10

Updated by Tom Clegg 3 months ago

22434-spot-check-warning @ fd55bde2aef8288b63c7f084982041060006d711 -- developer-run-tests: #4602

"Spot instance interruption check was inconclusive: %s"

Actions #11

Updated by Tom Clegg 3 months ago

  • Status changed from In Progress to Resolved
Actions #12

Updated by Peter Amstutz 2 months ago

  • Release set to 75
Actions

Also available in: Atom PDF