Project

General

Profile

Actions

Feature #16636

closed

[arvados-dispatch-cloud] Add instance metrics

Added by Tom Clegg over 3 years ago. Updated over 3 years ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
-
Target version:
Story points:
1.0
Release relationship:
Auto

Description

From Dispatching containers to cloud VMs

* †(gauge) number of containers allocated to VMs but not started yet (because VMs are pending/booting) (implemented)
* †(gauge) number of containers not allocated to VMs (because provider quota is reached) (implemented)
* †(summary) time elapsed between VM creation and first successful SSH connection to that VM (implemented)
* †(summary) time elapsed between first successful SSH connection on a VM and ready to run a container on that VM (implemented)
* †(counter) VMs that have either become ready or reached boot timeout, partitioned by success/timeout (implemented)
* †(summary) time elapsed between first shutdown attempt on a VM and its disappearance from the provider listing (implemented)
* †(summary) wait times (between seeing a container in the queue or requeueing, and starting its crunch-run process on a worker) across previous starts (implemented)
* †(gauge) longest wait time of any unstarted container (implemented)

(wiki has one more unimplemented metric which has its own issue, #15865)


Subtasks 4 (0 open4 closed)

Task #16661: review 16636-boot-outcome-metricsResolvedTom Clegg08/03/2020Actions
Task #16806: review 16636-add-time-to-ssh-metricResolvedWard Vandewege09/03/2020Actions
Task #16829: review 16636-container-allocation-metricsResolvedWard Vandewege08/03/2020Actions
Task #16837: review 16636-more-metricsResolvedWard Vandewege09/14/2020Actions

Related issues

Related to Arvados - Idea #15865: [arvados-dispatch-cloud] Cumulative instance time and cost metricsNewActions
Related to Arvados - Idea #13908: [Epic] Replace SLURM for cloud job scheduling/dispatchingResolvedActions
Related to Arvados - Feature #16838: [a-d-c] probe metricsResolvedWard Vandewege09/18/2020Actions
Related to Arvados - Feature #17185: [adc] add broken node metricsNewTom CleggActions
Actions #1

Updated by Tom Clegg over 3 years ago

  • Related to Idea #15865: [arvados-dispatch-cloud] Cumulative instance time and cost metrics added
Actions #2

Updated by Tom Clegg over 3 years ago

  • Related to Idea #13908: [Epic] Replace SLURM for cloud job scheduling/dispatching added
Actions #4

Updated by Tom Clegg over 3 years ago

  • Description updated (diff)
Actions #5

Updated by Ward Vandewege over 3 years ago

  • †(counter) VMs that have either become ready or reached boot timeout, partitioned by success/timeout

6036c55e1239281746152e85dfabbc9ed3cb6864 on branch 16636-boot-outcome-metrics implements this metric and is ready for review

Actions #6

Updated by Ward Vandewege over 3 years ago

  • Target version changed from Arvados Future Sprints to 2020-08-12 Sprint
  • Assigned To set to Ward Vandewege
  • Status changed from New to In Progress
Actions #7

Updated by Tom Clegg over 3 years ago

(*Pool)sync() should probably call reportBootOutcome(BootOutcomeDisappeared) near the "instance disappeared in cloud" log entry.

There are a couple of (*worker)shutdown() calls that should also have a reportBootOutcome() so not-yet-booted VMs don't get counted as "disappeared":
  • (*Pool)Shutdown() is used by the scheduler to do early-shutdown when reaching quota -- if they're not yet booted, I think BootOutcomeIdleShutdown would be appropriate
  • (*Pool)KillInstance() is used by the management API, and sometimes kills VMs that are booting/unknown and therefore haven't reported a boot outcome yet -- I think BootOutcomeIdleShutdown would make sense here too.

The BootOutcome stuff should move down a bit, so the State types/vars/methods can stay together.

reportBootOutcome should have the "caller must have lock" comment. Could move further down too, next the other worker methods.

In registerMetrics I think the label should be "outcome" rather than "state"?

The label "idle shutdown" feels a little weird because elsewhere "idle" implies "successfully booted and ready". Would "early shutdown" or "aborted" be better?

Actions #8

Updated by Ward Vandewege over 3 years ago

Tom Clegg wrote:

(*Pool)sync() should probably call reportBootOutcome(BootOutcomeDisappeared) near the "instance disappeared in cloud" log entry.

Yeah, it does, on the next line?

There are a couple of (*worker)shutdown() calls that should also have a reportBootOutcome() so not-yet-booted VMs don't get counted as "disappeared":
  • (*Pool)Shutdown() is used by the scheduler to do early-shutdown when reaching quota -- if they're not yet booted, I think BootOutcomeIdleShutdown would be appropriate

Done.

  • (*Pool)KillInstance() is used by the management API, and sometimes kills VMs that are booting/unknown and therefore haven't reported a boot outcome yet -- I think BootOutcomeIdleShutdown would make sense here too.

Done.

The BootOutcome stuff should move down a bit, so the State types/vars/methods can stay together.

Ah, yes, done.

reportBootOutcome should have the "caller must have lock" comment. Could move further down too, next the other worker methods.

Done and done.

In registerMetrics I think the label should be "outcome" rather than "state"?

Yes, fixed.

The label "idle shutdown" feels a little weird because elsewhere "idle" implies "successfully booted and ready". Would "early shutdown" or "aborted" be better?

Yes, I've made that change, and I've also renamed the const to be consistent.

All changes made at 3e269516a7500d49ad90f8590d7dd9e81f135ad6 on branch 16636-boot-outcome-metrics

Actions #9

Updated by Tom Clegg over 3 years ago

Yeah, it does, on the next line?

So it does, somehow I missed it. Sorry, my bad.

3e269516a7500d49ad90f8590d7dd9e81f135ad6

LGTM, thanks.

Actions #10

Updated by Ward Vandewege over 3 years ago

  • Description updated (diff)
Actions #11

Updated by Ward Vandewege over 3 years ago

  • Target version changed from 2020-08-12 Sprint to 2020-08-26 Sprint
Actions #12

Updated by Ward Vandewege over 3 years ago

  • Target version changed from 2020-08-26 Sprint to 2020-09-09 Sprint
Actions #13

Updated by Ward Vandewege over 3 years ago

†(summary) time elapsed between VM creation and first successful SSH connection to that VM
†(summary) time elapsed between first successful SSH connection on a VM and ready to run a container on that VM

Ready for review at 5c5ac412b722025d1af37f81bea60a4b503ce6aa on branch 16636-add-time-to-ssh-metric

Actions #14

Updated by Tom Clegg over 3 years ago

reportSSHConnected() needs to call wp.mtx.Lock() before doing the wp.workers[inst.ID()] lookup (the workers and the pool all share a single mtx, so this takes care of locking the wkr too).

+       wp.mtx.Lock()
+       defer wp.mtx.Unlock()
        wkr := wp.workers[inst.ID()]
-       wkr.mtx.Lock()
-       defer wkr.mtx.Unlock()

TagVerifier should skip the call to ReportVerified() if that field is nil.

source:lib/cloud/cloudtest/tester.go needs an update to accommodate the new field, maybe it's a good time to use named fields there too instead of foo{bar, baz}

This is very nitpicky but I would only call time.Now() once in reportSSHConnected, like

       wkr.firstSSHConnection = time.Now()
       if wp.mTimeToSSH != nil {
               wp.mTimeToSSH.Observe(wkr.firstSSHConnection.Sub(wkr.appeared).Seconds())
       }

(in general it saves a "get current time" syscall and avoids accumulating errors when doing time arithmetic... in this case neither is going to add up to anything so I'm only mentioning it for the habit-forming)

Everything else LGTM, thanks!

Actions #15

Updated by Ward Vandewege over 3 years ago

Tom Clegg wrote:

reportSSHConnected() needs to call wp.mtx.Lock() before doing the wp.workers[inst.ID()] lookup (the workers and the pool all share a single mtx, so this takes care of locking the wkr too).
[...]

Ah, that's good to know, fixed.

TagVerifier should skip the call to ReportVerified() if that field is nil.

Right! Done.

source:lib/cloud/cloudtest/tester.go needs an update to accommodate the new field, maybe it's a good time to use named fields there too instead of foo{bar, baz}

Done and updated to use named fields.

This is very nitpicky but I would only call time.Now() once in reportSSHConnected, like

[...]

(in general it saves a "get current time" syscall and avoids accumulating errors when doing time arithmetic... in this case neither is going to add up to anything so I'm only mentioning it for the habit-forming)

I was a bit uncomfortable with that potential drift (but didn't do anything about it), so not nitpicky at all, fixed now, thanks!

updates in 36e3b4021e376e74806df16816bd3f207ff37ecb on branch 16636-add-time-to-ssh-metric

Actions #16

Updated by Tom Clegg over 3 years ago

LGTM, thanks!

Actions #17

Updated by Ward Vandewege over 3 years ago

  • Description updated (diff)
Actions #18

Updated by Ward Vandewege over 3 years ago

  • Description updated (diff)

Tom Clegg wrote:

LGTM, thanks!

Thanks, merged.

Actions #19

Updated by Ward Vandewege over 3 years ago

  • Target version changed from 2020-09-09 Sprint to 2020-09-23 Sprint
Actions #20

Updated by Ward Vandewege over 3 years ago

7f4da9388e3d5ec8f38f6d6408916d1d46dfb10f on branch 16636-container-allocation-metrics adds these 2 metrics:

  • †(gauge) number of containers allocated to VMs but not started yet (because VMs are pending/booting)
  • †(gauge) number of containers not allocated to VMs (because provider quota is reached)

and is ready for review. Tests are at developer-run-tests: #2079

Actions #21

Updated by Tom Clegg over 3 years ago

LGTM, thanks!

Actions #22

Updated by Ward Vandewege over 3 years ago

  • Description updated (diff)

Tom Clegg wrote:

LGTM, thanks!

Merged, thanks!

Actions #23

Updated by Ward Vandewege over 3 years ago

The last three metrics are ready for review at a5dbdd1dbcdfcb835e7c1fe741e4c00927177404 on branch 16636-more-metrics

Actions #24

Updated by Tom Clegg over 3 years ago

in (*Pool)sync(), this should be skipped when wkr.destroyed.IsZero() (i.e., instance disappears but a-d-c wasn't trying to shut it down):

wp.mTimeFromShutdownToGone.Observe(time.Now().Sub(wkr.destroyed).Seconds())

In order for ctr.CreatedAt to be non-zero, we need to add "created_at" to selectParam in (*Queue)poll() in source:lib/dispatchcloud/container/queue.go

For mLongestWaitTimeSinceQueue in (*Scheduler)runQueue(), I don't think previousLongestWaitTimeCandidate is really needed: any container with state∈{locked,queued} could be the longest wait time, even if we are attempting to start it on this iteration. Also, we should consider ctrs that end up in the overquota slice when breaking the loop early. Come to think of it, this doesn't even need to be done in the runQueue func at all -- maybe better to put it in a separate func (*Scheduler)updateMetrics() that calls sch.queue.Entries() and sch.pool.Running() and finds the earliest CreatedAt with Priority>0 and State∈{locked,queued}...? Could call it from the loop at the end of (*Scheduler)run().

Actions #25

Updated by Ward Vandewege over 3 years ago

Tom Clegg wrote:

in (*Pool)sync(), this should be skipped when wkr.destroyed.IsZero() (i.e., instance disappears but a-d-c wasn't trying to shut it down):

wp.mTimeFromShutdownToGone.Observe(time.Now().Sub(wkr.destroyed).Seconds())

Ah, good catch. Fixed.

In order for ctr.CreatedAt to be non-zero, we need to add "created_at" to selectParam in (*Queue)poll() in source:lib/dispatchcloud/container/queue.go

Ah, interesting. I hadn't noticed that because we just stub the queue in the tests. Fixed.

For mLongestWaitTimeSinceQueue in (*Scheduler)runQueue(), I don't think previousLongestWaitTimeCandidate is really needed: any container with state∈{locked,queued} could be the longest wait time, even if we are attempting to start it on this iteration. Also, we should consider ctrs that end up in the overquota slice when breaking the loop early. Come to think of it, this doesn't even need to be done in the runQueue func at all -- maybe better to put it in a separate func (*Scheduler)updateMetrics() that calls sch.queue.Entries() and sch.pool.Running() and finds the earliest CreatedAt with Priority>0 and State∈{locked,queued}...? Could call it from the loop at the end of (*Scheduler)run().

Ah, yes that's better, I've done it that way now.

Ready for another look at 407fc461d20ece8b11b7b56f29a3caff3083ff8d on branch 16636-more-metrics

Actions #26

Updated by Tom Clegg over 3 years ago

In updateMetrics, it looks like earliest.IsZero() won't work as expected -- I'm thinking earliest := time.Now() should probably be earliest := time.Time{} and the Before(earliest) condition should have || earliest.IsZero()

Rest LGTM

Actions #27

Updated by Ward Vandewege over 3 years ago

Tom Clegg wrote:

In updateMetrics, it looks like earliest.IsZero() won't work as expected -- I'm thinking earliest := time.Now() should probably be earliest := time.Time{} and the Before(earliest) condition should have || earliest.IsZero()

Rest LGTM

Thanks, merged with that fix.

Actions #28

Updated by Ward Vandewege over 3 years ago

  • Description updated (diff)
Actions #29

Updated by Ward Vandewege over 3 years ago

  • % Done changed from 75 to 100
  • Status changed from In Progress to Resolved
Actions #30

Updated by Peter Amstutz over 3 years ago

  • Release set to 25
Actions #31

Updated by Ward Vandewege over 3 years ago

Actions #32

Updated by Ward Vandewege over 3 years ago

Actions

Also available in: Atom PDF