Feature #16636

[arvados-dispatch-cloud] Add instance metrics

Added by Tom Clegg 4 months ago. Updated about 2 months ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
-
Target version:
Start date:
08/03/2020
Due date:
% Done:

100%

Estimated time:
(Total: 0.00 h)
Story points:
1.0
Release relationship:
Auto

Description

From Dispatching containers to cloud VMs

* †(gauge) number of containers allocated to VMs but not started yet (because VMs are pending/booting) (implemented)
* †(gauge) number of containers not allocated to VMs (because provider quota is reached) (implemented)
* †(summary) time elapsed between VM creation and first successful SSH connection to that VM (implemented)
* †(summary) time elapsed between first successful SSH connection on a VM and ready to run a container on that VM (implemented)
* †(counter) VMs that have either become ready or reached boot timeout, partitioned by success/timeout (implemented)
* †(summary) time elapsed between first shutdown attempt on a VM and its disappearance from the provider listing (implemented)
* †(summary) wait times (between seeing a container in the queue or requeueing, and starting its crunch-run process on a worker) across previous starts (implemented)
* †(gauge) longest wait time of any unstarted container (implemented)

(wiki has one more unimplemented metric which has its own issue, #15865)


Subtasks

Task #16661: review 16636-boot-outcome-metricsResolvedTom Clegg

Task #16806: review 16636-add-time-to-ssh-metricResolvedWard Vandewege

Task #16829: review 16636-container-allocation-metricsResolvedWard Vandewege

Task #16837: review 16636-more-metricsResolvedWard Vandewege


Related issues

Related to Arvados - Story #15865: [arvados-dispatch-cloud] Cumulative instance time and cost metricsNew

Related to Arvados - Story #13908: [Epic] Replace SLURM for cloud job scheduling/dispatchingResolved

Related to Arvados - Feature #16838: [a-d-c] probe metricsResolved09/18/2020

Related to Arvados - Feature #17185: [adc] add broken node metricsNew

Associated revisions

Revision 3757bbc7
Added by Ward Vandewege 4 months ago

Merge branch '16636-boot-outcome-metrics'

refs #16636

Arvados-DCO-1.1-Signed-off-by: Ward Vandewege <>

Revision 4d3b8b29 (diff)
Added by Ward Vandewege 4 months ago

boot outcomes metric: fix labels and add some checks for the new metric
in the tests.

refs #16636

Arvados-DCO-1.1-Signed-off-by: Ward Vandewege <>

Revision 8bad7194
Added by Ward Vandewege 3 months ago

Merge branch '16636-add-time-to-ssh-metric'

refs #16636

Arvados-DCO-1.1-Signed-off-by: Ward Vandewege <>

Revision 04c8cb55
Added by Ward Vandewege 3 months ago

Merge branch '16636-container-allocation-metrics'

refs #16636

Arvados-DCO-1.1-Signed-off-by: Ward Vandewege <>

Revision aabff656
Added by Ward Vandewege 3 months ago

Merge branch '16636-more-metrics' into master

closes #16636

Arvados-DCO-1.1-Signed-off-by: Ward Vandewege <>

History

#1 Updated by Tom Clegg 4 months ago

  • Related to Story #15865: [arvados-dispatch-cloud] Cumulative instance time and cost metrics added

#2 Updated by Tom Clegg 4 months ago

  • Related to Story #13908: [Epic] Replace SLURM for cloud job scheduling/dispatching added

#4 Updated by Tom Clegg 4 months ago

  • Description updated (diff)

#5 Updated by Ward Vandewege 4 months ago

  • †(counter) VMs that have either become ready or reached boot timeout, partitioned by success/timeout

6036c55e1239281746152e85dfabbc9ed3cb6864 on branch 16636-boot-outcome-metrics implements this metric and is ready for review

#6 Updated by Ward Vandewege 4 months ago

  • Target version changed from Arvados Future Sprints to 2020-08-12 Sprint
  • Assigned To set to Ward Vandewege
  • Status changed from New to In Progress

#7 Updated by Tom Clegg 4 months ago

(*Pool)sync() should probably call reportBootOutcome(BootOutcomeDisappeared) near the "instance disappeared in cloud" log entry.

There are a couple of (*worker)shutdown() calls that should also have a reportBootOutcome() so not-yet-booted VMs don't get counted as "disappeared":
  • (*Pool)Shutdown() is used by the scheduler to do early-shutdown when reaching quota -- if they're not yet booted, I think BootOutcomeIdleShutdown would be appropriate
  • (*Pool)KillInstance() is used by the management API, and sometimes kills VMs that are booting/unknown and therefore haven't reported a boot outcome yet -- I think BootOutcomeIdleShutdown would make sense here too.

The BootOutcome stuff should move down a bit, so the State types/vars/methods can stay together.

reportBootOutcome should have the "caller must have lock" comment. Could move further down too, next the other worker methods.

In registerMetrics I think the label should be "outcome" rather than "state"?

The label "idle shutdown" feels a little weird because elsewhere "idle" implies "successfully booted and ready". Would "early shutdown" or "aborted" be better?

#8 Updated by Ward Vandewege 4 months ago

Tom Clegg wrote:

(*Pool)sync() should probably call reportBootOutcome(BootOutcomeDisappeared) near the "instance disappeared in cloud" log entry.

Yeah, it does, on the next line?

There are a couple of (*worker)shutdown() calls that should also have a reportBootOutcome() so not-yet-booted VMs don't get counted as "disappeared":
  • (*Pool)Shutdown() is used by the scheduler to do early-shutdown when reaching quota -- if they're not yet booted, I think BootOutcomeIdleShutdown would be appropriate

Done.

  • (*Pool)KillInstance() is used by the management API, and sometimes kills VMs that are booting/unknown and therefore haven't reported a boot outcome yet -- I think BootOutcomeIdleShutdown would make sense here too.

Done.

The BootOutcome stuff should move down a bit, so the State types/vars/methods can stay together.

Ah, yes, done.

reportBootOutcome should have the "caller must have lock" comment. Could move further down too, next the other worker methods.

Done and done.

In registerMetrics I think the label should be "outcome" rather than "state"?

Yes, fixed.

The label "idle shutdown" feels a little weird because elsewhere "idle" implies "successfully booted and ready". Would "early shutdown" or "aborted" be better?

Yes, I've made that change, and I've also renamed the const to be consistent.

All changes made at 3e269516a7500d49ad90f8590d7dd9e81f135ad6 on branch 16636-boot-outcome-metrics

#9 Updated by Tom Clegg 4 months ago

Yeah, it does, on the next line?

So it does, somehow I missed it. Sorry, my bad.

3e269516a7500d49ad90f8590d7dd9e81f135ad6

LGTM, thanks.

#10 Updated by Ward Vandewege 4 months ago

  • Description updated (diff)

#11 Updated by Ward Vandewege 4 months ago

  • Target version changed from 2020-08-12 Sprint to 2020-08-26 Sprint

#12 Updated by Ward Vandewege 3 months ago

  • Target version changed from 2020-08-26 Sprint to 2020-09-09 Sprint

#13 Updated by Ward Vandewege 3 months ago

†(summary) time elapsed between VM creation and first successful SSH connection to that VM
†(summary) time elapsed between first successful SSH connection on a VM and ready to run a container on that VM

Ready for review at 5c5ac412b722025d1af37f81bea60a4b503ce6aa on branch 16636-add-time-to-ssh-metric

#14 Updated by Tom Clegg 3 months ago

reportSSHConnected() needs to call wp.mtx.Lock() before doing the wp.workers[inst.ID()] lookup (the workers and the pool all share a single mtx, so this takes care of locking the wkr too).

+       wp.mtx.Lock()
+       defer wp.mtx.Unlock()
        wkr := wp.workers[inst.ID()]
-       wkr.mtx.Lock()
-       defer wkr.mtx.Unlock()

TagVerifier should skip the call to ReportVerified() if that field is nil.

source:lib/cloud/cloudtest/tester.go needs an update to accommodate the new field, maybe it's a good time to use named fields there too instead of foo{bar, baz}

This is very nitpicky but I would only call time.Now() once in reportSSHConnected, like

       wkr.firstSSHConnection = time.Now()
       if wp.mTimeToSSH != nil {
               wp.mTimeToSSH.Observe(wkr.firstSSHConnection.Sub(wkr.appeared).Seconds())
       }

(in general it saves a "get current time" syscall and avoids accumulating errors when doing time arithmetic... in this case neither is going to add up to anything so I'm only mentioning it for the habit-forming)

Everything else LGTM, thanks!

#15 Updated by Ward Vandewege 3 months ago

Tom Clegg wrote:

reportSSHConnected() needs to call wp.mtx.Lock() before doing the wp.workers[inst.ID()] lookup (the workers and the pool all share a single mtx, so this takes care of locking the wkr too).
[...]

Ah, that's good to know, fixed.

TagVerifier should skip the call to ReportVerified() if that field is nil.

Right! Done.

source:lib/cloud/cloudtest/tester.go needs an update to accommodate the new field, maybe it's a good time to use named fields there too instead of foo{bar, baz}

Done and updated to use named fields.

This is very nitpicky but I would only call time.Now() once in reportSSHConnected, like

[...]

(in general it saves a "get current time" syscall and avoids accumulating errors when doing time arithmetic... in this case neither is going to add up to anything so I'm only mentioning it for the habit-forming)

I was a bit uncomfortable with that potential drift (but didn't do anything about it), so not nitpicky at all, fixed now, thanks!

updates in 36e3b4021e376e74806df16816bd3f207ff37ecb on branch 16636-add-time-to-ssh-metric

#16 Updated by Tom Clegg 3 months ago

LGTM, thanks!

#17 Updated by Ward Vandewege 3 months ago

  • Description updated (diff)

#18 Updated by Ward Vandewege 3 months ago

  • Description updated (diff)

Tom Clegg wrote:

LGTM, thanks!

Thanks, merged.

#19 Updated by Ward Vandewege 3 months ago

  • Target version changed from 2020-09-09 Sprint to 2020-09-23 Sprint

#20 Updated by Ward Vandewege 3 months ago

7f4da9388e3d5ec8f38f6d6408916d1d46dfb10f on branch 16636-container-allocation-metrics adds these 2 metrics:

  • †(gauge) number of containers allocated to VMs but not started yet (because VMs are pending/booting)
  • †(gauge) number of containers not allocated to VMs (because provider quota is reached)

and is ready for review. Tests are at https://ci.arvados.org/view/Developer/job/developer-run-tests/2079/

#21 Updated by Tom Clegg 3 months ago

LGTM, thanks!

#22 Updated by Ward Vandewege 3 months ago

  • Description updated (diff)

Tom Clegg wrote:

LGTM, thanks!

Merged, thanks!

#23 Updated by Ward Vandewege 3 months ago

The last three metrics are ready for review at a5dbdd1dbcdfcb835e7c1fe741e4c00927177404 on branch 16636-more-metrics

#24 Updated by Tom Clegg 3 months ago

in (*Pool)sync(), this should be skipped when wkr.destroyed.IsZero() (i.e., instance disappears but a-d-c wasn't trying to shut it down):

wp.mTimeFromShutdownToGone.Observe(time.Now().Sub(wkr.destroyed).Seconds())

In order for ctr.CreatedAt to be non-zero, we need to add "created_at" to selectParam in (*Queue)poll() in source:lib/dispatchcloud/container/queue.go

For mLongestWaitTimeSinceQueue in (*Scheduler)runQueue(), I don't think previousLongestWaitTimeCandidate is really needed: any container with state∈{locked,queued} could be the longest wait time, even if we are attempting to start it on this iteration. Also, we should consider ctrs that end up in the overquota slice when breaking the loop early. Come to think of it, this doesn't even need to be done in the runQueue func at all -- maybe better to put it in a separate func (*Scheduler)updateMetrics() that calls sch.queue.Entries() and sch.pool.Running() and finds the earliest CreatedAt with Priority>0 and State∈{locked,queued}...? Could call it from the loop at the end of (*Scheduler)run().

#25 Updated by Ward Vandewege 3 months ago

Tom Clegg wrote:

in (*Pool)sync(), this should be skipped when wkr.destroyed.IsZero() (i.e., instance disappears but a-d-c wasn't trying to shut it down):

wp.mTimeFromShutdownToGone.Observe(time.Now().Sub(wkr.destroyed).Seconds())

Ah, good catch. Fixed.

In order for ctr.CreatedAt to be non-zero, we need to add "created_at" to selectParam in (*Queue)poll() in source:lib/dispatchcloud/container/queue.go

Ah, interesting. I hadn't noticed that because we just stub the queue in the tests. Fixed.

For mLongestWaitTimeSinceQueue in (*Scheduler)runQueue(), I don't think previousLongestWaitTimeCandidate is really needed: any container with state∈{locked,queued} could be the longest wait time, even if we are attempting to start it on this iteration. Also, we should consider ctrs that end up in the overquota slice when breaking the loop early. Come to think of it, this doesn't even need to be done in the runQueue func at all -- maybe better to put it in a separate func (*Scheduler)updateMetrics() that calls sch.queue.Entries() and sch.pool.Running() and finds the earliest CreatedAt with Priority>0 and State∈{locked,queued}...? Could call it from the loop at the end of (*Scheduler)run().

Ah, yes that's better, I've done it that way now.

Ready for another look at 407fc461d20ece8b11b7b56f29a3caff3083ff8d on branch 16636-more-metrics

#26 Updated by Tom Clegg 3 months ago

In updateMetrics, it looks like earliest.IsZero() won't work as expected -- I'm thinking earliest := time.Now() should probably be earliest := time.Time{} and the Before(earliest) condition should have || earliest.IsZero()

Rest LGTM

#27 Updated by Ward Vandewege 3 months ago

Tom Clegg wrote:

In updateMetrics, it looks like earliest.IsZero() won't work as expected -- I'm thinking earliest := time.Now() should probably be earliest := time.Time{} and the Before(earliest) condition should have || earliest.IsZero()

Rest LGTM

Thanks, merged with that fix.

#28 Updated by Ward Vandewege 3 months ago

  • Description updated (diff)

#29 Updated by Ward Vandewege 3 months ago

  • % Done changed from 75 to 100
  • Status changed from In Progress to Resolved

#30 Updated by Peter Amstutz about 2 months ago

  • Release set to 25

#31 Updated by Ward Vandewege about 8 hours ago

#32 Updated by Ward Vandewege about 8 hours ago

Also available in: Atom PDF