Project

General

Profile

Actions

Bug #19967

closed

crunch-run periodically updates container cost

Added by Tom Clegg about 1 year ago. Updated about 1 year ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
Crunch
Story points:
1.0
Release relationship:
Auto

Description

Currently the container cost is recorded as 0 in this case.

This should ensure the cost of a cancel-retry-cancel-retry-cancel-fail cycle is reflected properly.

Have crunch-run periodically update the price while the container is running, perhaps each time it gets a SIGUSR2 after loadPrices (i.e., on each dispatcher probe).


Subtasks 1 (0 open1 closed)

Task #20189: Review 19967-crunch-run-cost-updatesResolvedBrett Smith03/20/2023Actions

Related issues

Precedes Arvados - Idea #20239: Dispatcher calculates final container cost after runner/instance disappearsNew03/21/202303/21/2023Actions
Actions #1

Updated by Peter Amstutz about 1 year ago

  • Category set to Crunch
Actions #2

Updated by Tom Clegg about 1 year ago

  • Story points set to 1.0
Actions #3

Updated by Peter Amstutz about 1 year ago

  • Target version changed from Future to To be scheduled
Actions #4

Updated by Peter Amstutz about 1 year ago

  • Target version changed from To be scheduled to 2023-03-01 sprint
Actions #5

Updated by Peter Amstutz about 1 year ago

  • Target version changed from 2023-03-01 sprint to Development 2023-03-15 sprint
Actions #6

Updated by Peter Amstutz about 1 year ago

  • Release set to 57
Actions #7

Updated by Peter Amstutz about 1 year ago

  • Assigned To set to Brett Smith
Actions #8

Updated by Tom Clegg about 1 year ago

A proposed approach:

Part 1 is to have the dispatcher calculate cost when cancelling a container whose process (or entire instance) has disappeared. Normally the "container cleanup" stuff is done by lib/dispatchcloud/scheduler, which hits the cancel button here:

                        } else if !exited.IsZero() && qUpdated.After(exited) {
                                go sch.cancel(uuid, "state=Running after crunch-run exited")

This could stash cost-so-far while updating state to Cancelled. The only hitch is that scheduler doesn't know anything about prices.

(An alternative would be for the worker pool update the cost itself, in closeRunner() -- but I think it gets complicated quickly when we consider retrying failures, and races between "update cost field" (in worker pool) and "set state to cancelled" (scheduler).)

I'm thinking we need to update the Running() call in scheduler/interfaces.go so that it returns a map[string]RunStatus, where RunStatus is something like:

type RunStatus struct {
        Exited   bool
        ExitTime time.Time
        Cost     float64
}

...where Cost>0 means the scheduler should use that as the final cost when it sets state=Cancelled.

(I'm sneaking in the separate Exited bool here to be more clear/explicit than the current "zero time means not exited" approach. Not strictly necessary, but might as well, if we're going to make it a struct, right?)

Worker pool can populate that with calculated cost if we
  • move most of the calculateCost() code from lib/crunchrun to lib/cloud, and
  • in lib/worker, use NormalizePriceHistory to combine/remember the entire cloud-provided price history of each worker instead of just sending the latest data points to crunch-run and forgetting them.

Part 2 is to have crunch-run periodically update the price while the container is running, perhaps each time it gets a SIGUSR2 after loadPrices (i.e., on each dispatcher probe). This would be an easier way to accomplish nearly the same thing, and has the additional benefit of showing the cost-so-far for long running containers instead of showing $0 until the end.

It would be ideal to have both parts, but I wonder if we should do just Part 2, and get 90% benefit for 10% work.

Actions #9

Updated by Tom Clegg about 1 year ago

Tom Clegg wrote in #note-8:

I'm thinking we need to update the Running() call in scheduler/interfaces.go so that it returns a map[string]RunStatus, where RunStatus is something like:

On second thought:

type RunStatus interface {
        Exited() bool
        ExitTime() time.Time
        Cost() float64
}

That way, worker pool can implement a concrete runStatus type with a *worker field and a Cost method that does the calculation, instead of computing all of the costs on every call to Running() only to have them unused/ignored nearly all of the time.

Actions #10

Updated by Brett Smith about 1 year ago

Tom Clegg wrote in #note-8:

It would be ideal to have both parts, but I wonder if we should do just Part 2, and get 90% benefit for 10% work.

Discussed at standup. We will start with option 2, both because it's less work, and it also has other benefits like showing providing cost information for long-running containers. We can lose 10ish minutes of cost reporting in the default configuration, but we're okay with that now. Creating a follow-up ticket for option 1.

Actions #11

Updated by Brett Smith about 1 year ago

  • Description updated (diff)
Actions #12

Updated by Brett Smith about 1 year ago

  • Precedes Idea #20239: Dispatcher calculates final container cost after runner/instance disappears added
Actions #13

Updated by Brett Smith about 1 year ago

  • Subject changed from Record cost when dispatcher cancels a running container due to instance disappearance to crunch-run periodically updates container cost
Actions #14

Updated by Peter Amstutz about 1 year ago

  • Target version changed from Development 2023-03-15 sprint to Development 2023-03-29 Sprint
Actions #15

Updated by Brett Smith about 1 year ago

  • Status changed from New to In Progress

Tom Clegg wrote in #note-8:

Part 2 is to have crunch-run periodically update the price while the container is running, perhaps each time it gets a SIGUSR2 after loadPrices (i.e., on each dispatcher probe).

Implemented in 19967-crunch-run-cost-updates @ d1dcb64a9174c2da64306a6598e1e52a96344e6b - developer-run-tests: #3554

Actions #16

Updated by Lucas Di Pentima about 1 year ago

Sorry for the delay, I wasn't familiarized with that part of the code. This LGTM, thanks!

Actions #17

Updated by Brett Smith about 1 year ago

  • % Done changed from 0 to 100
  • Status changed from In Progress to Resolved
Actions

Also available in: Atom PDF