Feature #14325

[crunch-dispatch-cloud] Dispatch containers to cloud VMs directly, without slurm or nodemanager

Added by Tom Clegg about 1 year ago. Updated 9 months ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
-
Target version:
Start date:
01/28/2019
Due date:
% Done:

100%

Estimated time:
(Total: 0.00 h)
Story points:
1.0
Release relationship:
Auto

Description

This issue covers the smallest version of Dispatching containers to cloud VMs that can be deployed on a dev cluster.

Background -- already done in #14360:
  • Bring up nodes and run containers on them
  • Structured logs for diagnostics+statistics: cloud API errors, node lifecycle, container lifecycle
  • HTTP status report with current set of containers (queued/running) and VMs (busy/idle) -- see Dispatching containers to cloud VMs "Operator view"
  • Shutdown idle nodes automatically
  • Handle cloud API quota errors
  • Package-building changes are in place, but commented out
Requirements covered here:
  • Ops mechanism for draining a node (e.g., curl command using a management token) -- see Dispatching containers to cloud VMs "Management API"
  • Resource consumption metrics (number of instances, number of containers running, total hourly price of all existing VMs) -- see Dispatching containers to cloud VMs "Metrics"
  • Drain (not kill) instances that exist at startup, fail boot probe, but are already running containers -- see Dispatching containers to cloud VMs "Special cases / synchronizing state"
  • Configurable port number for connecting to VM SSH servers
  • Pass API host and dispatcher's token to crunch-run command via ARVADOS_API_* environment variables
  • Test SSH host key verification (dispatcher's token is not sent to a remote host unless the host's SSH key passes the VerifyHostKey() method provided by the cloud driver)
  • Test container.Queue using real railsAPI/controller
  • Test resuming state after restart (some instances are booting, some idle, some running containers, some draining, some on admin-hold)
  • Cancel container after some number of start/requeue cycles (i.e., crunch-run --detach succeeded, but child exited without moving container past Locked state)
  • Cancel container with no suitable instance type
  • Enable package build
  • Handle cloud API ratelimit errors (obey holdoff time returned by driver... incl. test)
  • Update management API response format (lowercase keys)
  • Confirm all probe failures are logged once instance is booted (see #14360#note-38, fixed in 7a047d8b6)
Requirements covered elsewhere:
  • One cloud vendor driver (Azure = #14324)
  • Production-readiness (#14807)
Refs

Subtasks

Task #14664: Review 14325-dispatch-cloudResolvedPeter Amstutz


Related issues

Related to Arvados - Feature #14324: [crunch-dispatch-cloud] Azure driverResolved01/09/2019

Related to Arvados - Bug #13964: crunch-dispatch-cloud spikeResolved

Related to Arvados - Story #13908: [Epic] Replace SLURM for cloud job scheduling/dispatchingNew

Related to Arvados - Story #14360: [crunch-dispatch-cloud] Merge incomplete implementationResolved10/26/2018

Precedes Arvados - Story #14796: [crunch-dispatch-cloud] Document installation / migration from c-d-slurm + node managerResolved01/29/2019

Precedes Arvados - Story #14807: [arvados-dispatch-cloud] Features/fixes needed before first production deployResolved01/29/2019

Associated revisions

Revision 800139c8
Added by Tom Clegg 9 months ago

Merge branch '14325-dispatch-cloud'

closes #14325

Arvados-DCO-1.1-Signed-off-by: Tom Clegg <>

Revision 4f2aab84
Added by Tom Clegg 9 months ago

Merge branch '14325-dispatch-cloud'

refs #14325

Arvados-DCO-1.1-Signed-off-by: Tom Clegg <>

Revision b56e2857 (diff)
Added by Tom Clegg 9 months ago

14325: Fix PrivateKey config type.

JSON decoder expects []byte fields to be base64-encoded, which we
don't want here.

refs #14325

Arvados-DCO-1.1-Signed-off-by: Tom Clegg <>

Revision a4396e18 (diff)
Added by Tom Clegg 9 months ago

14325: Start up immediately if there are no stale locks.

...instead of waiting for the pool to send a notification to trigger
the first loop iteration.

refs #14325

Arvados-DCO-1.1-Signed-off-by: Tom Clegg <>

Revision d06417e5 (diff)
Added by Tom Clegg 9 months ago

14325: Start up immediately if there are no stale locks.

...instead of waiting for the pool to send a notification to trigger
the first loop iteration.

refs #14325

Arvados-DCO-1.1-Signed-off-by: Tom Clegg <>

Revision e6cb52f3 (diff)
Added by Tom Clegg 9 months ago

14325: Start up immediately if there are no stale locks.

...instead of waiting for the pool to send a notification to trigger
the first loop iteration.

refs #14325

Arvados-DCO-1.1-Signed-off-by: Tom Clegg <>

History

#1 Updated by Tom Clegg about 1 year ago

  • Related to Feature #14324: [crunch-dispatch-cloud] Azure driver added

#2 Updated by Tom Clegg about 1 year ago

  • Related to Bug #13964: crunch-dispatch-cloud spike added

#3 Updated by Tom Clegg about 1 year ago

  • Related to Story #13908: [Epic] Replace SLURM for cloud job scheduling/dispatching added

#4 Updated by Tom Clegg about 1 year ago

  • Description updated (diff)

#5 Updated by Tom Clegg about 1 year ago

  • Description updated (diff)

#6 Updated by Tom Clegg about 1 year ago

  • Description updated (diff)

#7 Updated by Tom Clegg about 1 year ago

  • Description updated (diff)

#8 Updated by Tom Clegg about 1 year ago

  • Related to Story #14360: [crunch-dispatch-cloud] Merge incomplete implementation added

#9 Updated by Tom Clegg 12 months ago

  • Description updated (diff)

#10 Updated by Tom Clegg 12 months ago

  • Description updated (diff)

#11 Updated by Tom Clegg 12 months ago

  • Description updated (diff)

#12 Updated by Tom Clegg 12 months ago

  • Description updated (diff)

#13 Updated by Tom Clegg 11 months ago

  • Description updated (diff)

#14 Updated by Tom Morris 11 months ago

  • Target version set to To Be Groomed

#15 Updated by Tom Clegg 11 months ago

  • Description updated (diff)

#16 Updated by Tom Clegg 11 months ago

  • Description updated (diff)

#17 Updated by Tom Clegg 11 months ago

  • Description updated (diff)
  • Target version deleted (To Be Groomed)

#18 Updated by Tom Clegg 11 months ago

  • Description updated (diff)

#19 Updated by Tom Clegg 11 months ago

  • Description updated (diff)

#20 Updated by Tom Clegg 11 months ago

  • Description updated (diff)

#21 Updated by Tom Clegg 11 months ago

  • Description updated (diff)

#22 Updated by Tom Clegg 11 months ago

  • Description updated (diff)

#23 Updated by Tom Clegg 11 months ago

  • Target version set to Arvados Future Sprints
  • Story points set to 4.0

#24 Updated by Peter Amstutz 11 months ago

  • Description updated (diff)

#25 Updated by Peter Amstutz 11 months ago

Management APIs should return {"items": [...]} not {"Items": [...]} for consistency with the Arvados API.

#26 Updated by Peter Amstutz 11 months ago

  • Description updated (diff)

#27 Updated by Tom Clegg 11 months ago

  • Description updated (diff)

#28 Updated by Tom Clegg 11 months ago

  • Description updated (diff)

#29 Updated by Tom Clegg 11 months ago

  • Description updated (diff)

#30 Updated by Tom Clegg 11 months ago

  • Status changed from New to In Progress
  • Assigned To set to Tom Clegg

#31 Updated by Tom Morris 10 months ago

  • Target version changed from Arvados Future Sprints to 2019-01-16 Sprint

#32 Updated by Tom Clegg 10 months ago

  • Target version changed from 2019-01-16 Sprint to 2019-01-30 Sprint

#33 Updated by Tom Clegg 10 months ago

Added "hold" and "drain". (Wiki also mentions a "kill" API -- not included here.)

  • Resource consumption metrics (number of instances, number of containers running, total hourly price of all existing VMs) -- see Dispatching containers to cloud VMs "Metrics"

Added total hourly price. The others were already in place.

  • Drain (not kill) instances that exist at startup, fail boot probe, but are already running containers -- see Dispatching containers to cloud VMs "Special cases / synchronizing state"

Added what the wiki says, which is a little different:

"...instances are left alive at least until the containers finish. After that, the usual rules apply: if boot probe succeeds before boot timeout, start scheduling containers; otherwise, shut down."

This is a bit more consistent since it's more consistent with the "inherited node is not running a container and fails boot probe" case: we allow the boot timeout to run out before killing it, rather than expecting its boot probe to succeed before the existing container finishes.

  • Configurable port number for connecting to VM SSH servers

CloudVMs→SSHPort can be given as a name ("ssh") or number ("22").

  • Pass API host and dispatcher's token to crunch-run command via ARVADOS_API_* environment variables

Added.

  • Test SSH host key verification (dispatcher's token is not sent to a remote host unless the host's SSH key passes the VerifyHostKey() method provided by the cloud driver)

Added.

  • Test container.Queue using real railsAPI/controller

Added. Revealed & fixed SDK bug, see f696f142e.

  • Test resuming state after restart (some instances are booting, some idle, some running containers, some draining, some on admin-hold)

Added restart/resume test to confirm "hold" and instance-type labels are saved/loaded effectively.

Added a slew of worker tests to confirm proper state changes in probeAndUpdate.

  • Cancel container after some number of start/requeue cycles (i.e., crunch-run --detach succeeded, but child exited without moving container past Locked state)

Didn't do this. (We've already implemented it on the API side.)

  • Cancel container with no suitable instance type

Added.

  • Enable package build

Uncommented.

  • Handle cloud API ratelimit errors (obey holdoff time returned by driver... incl. test)

Added.

  • Update management API response format (lowercase keys)

Updated.

Confirmed.

14325-dispatch-cloud @ b105602902e38f18a48505e2091ffea77b2c7c89 https://ci.curoverse.com/view/Developer/job/developer-run-tests/1040/

#34 Updated by Tom Clegg 10 months ago

Now at a27b2bf3e with some test cleanup (move LameInstanceSet's one remaining useful feature to StubDriver and retire LameInstanceSet).

#35 Updated by Tom Clegg 10 months ago

  • Target version changed from 2019-01-30 Sprint to 2019-02-13 Sprint

#36 Updated by Tom Clegg 10 months ago

  • Story points changed from 4.0 to 1.0

#37 Updated by Tom Clegg 10 months ago

  • Precedes Story #14796: [crunch-dispatch-cloud] Document installation / migration from c-d-slurm + node manager added

#38 Updated by Peter Amstutz 9 months ago

worker.shutdownIfIdle():

    if !(wkr.state == StateIdle || (wkr.state == StateBooting && wkr.idleBehavior == IdleBehaviorDrain)) {
        return false
    }

The double-negative logic (do nothing if these things are NOT true...) makes this expression hard to read. Please add comments clarifying the intention that we want to shut down when certain things are true.

    if wkr.idleBehavior != IdleBehaviorDrain && age < wkr.wp.timeoutIdle {
        return false
    }

Same comment about confusing expression. If I'm understanding the intended behavior, it would be clearer to write wkr.idleBehavior == IdleBehaviorRun && age < wkr.wp.timeoutIdle because the IdleBehaviorHold case has already been eliminated, and IdleBehaviorDrain ignores the timeout (but having IdleBehaviorDrain and timeoutIdle appear on the same line implies they are related).

Queue.Update():

        if _, keep := cq.dontupdate[uuid]; keep {
            continue
        }
...
        if _, keep := cq.dontupdate[uuid]; keep {
            continue
        } else if _, keep = next[uuid]; keep {
            continue
        } else {
            delete(cq.current, uuid)
        }

Comment from last time that "keeplocal" was confusing and was renamed to "dontupdate" but there's still a few local variables called "keep" and I don't know how to read it. Should those also be called "dontupdate"? Maybe add some comments?

In Queue.addEnt() there's an embedded assumption that if the current dispatcher can't find a instance type for a container, nobody can, so it should always cancel the container (even if it has to lock it first). I think that's fine (heterogeneous dispatchers has complexity we don't want to get into yet, if ever) but should probably be mentioned in a comment.

worker.probeAndUpdate():

    for _, uuid := range ctrUUIDs {
        running[uuid] = struct{}{}
        if _, ok := wkr.running[uuid]; !ok {
            changed = true
        }
    }

Another place that would benefit from some more comments expressing the intent / context of the code. I think what this is doing is determining if there a container UUID was found on the node which isn't present in wkr.running. The next block looks like it checks the opposite case where a container is known to wkr.running but not present on the instance.

    if wkr.state == StateUnknown || wkr.state == StateBooting {
        wkr.state = StateIdle
        }

It is implied by getting to this point in the code that probeBooted() and probeRunning() both passed successfully, could use a comment making that assumption explicit.

... to be continued ....

#39 Updated by Peter Amstutz 9 months ago

Pool.Unallocated():

        if !(wkr.state == StateIdle || wkr.state == StateBooting || wkr.state == StateUnknown) || wkr.idleBehavior != IdleBehaviorRun || len(wkr.running) > 0 {
            continue
        }

This line is way too long.

Similar comment to earlier about hard-to-follow double-negative logic. Here !(wkr.state StateIdle || wkr.state StateBooting || wkr.state == StateUnknown) is much clearer written as (wkr.state != StateIdle && wkr.state != StateBooting && wkr.state != StateUnknown)

#40 Updated by Peter Amstutz 9 months ago

Cancel container after some number of start/requeue cycles (i.e., crunch-run --detach succeeded, but child exited without moving container past Locked state)

Didn't do this. (We've already implemented it on the API side.)

We've agreed to do so, but haven't actually done it yet (#11561)

# git.curoverse.com/arvados.git/lib/dispatchcloud/container
./queue_test.go:38:17: undefined: test
./queue_test.go:95:17: undefined: test
FAIL    git.curoverse.com/arvados.git/lib/dispatchcloud/container [build failed]
import (
    "github.com/julienschmidt/httprouter" 
)

What's the goal of introducing yet another routing framework here? We already use both http.ServeMux and gorilla/mux.

# Layouter fails if we add these

Maybe use graphviz instead? (Requires slightly different notation).

#41 Updated by Tom Clegg 9 months ago

Indeed, those are some confusing conditional expressions, thanks. Clarified and added comments.

Peter Amstutz wrote:

Didn't do this. (We've already implemented it on the API side.)

We've agreed to do so, but haven't actually done it yet (#11561)

Ah, indeed.

What's the goal of introducing yet another routing framework here? We already use both http.ServeMux and gorilla/mux.

Cheap, easy to use, does what we need (filter on methods + extract path params), didn't think of a reason not to use it. (It happens to be much more efficient with time and memory than gorilla, not that that's a big concern here.)

Maybe use graphviz instead? (Requires slightly different notation).

Maybe. I didn't find this ascii art exercise particularly rewarding. If I were to sink more time into different ways of doing this, I'd probably just give up and make a drawing in Google Drive.

lib/dispatchcloud/container tests are fixed.

14325-dispatch-cloud @ 71fd4da18b22100682ae7e2079aadfd66360d310 https://ci.curoverse.com/view/Developer/job/developer-run-tests/1051/

#42 Updated by Tom Clegg 9 months ago

  • Precedes Story #14807: [arvados-dispatch-cloud] Features/fixes needed before first production deploy added

#43 Updated by Tom Clegg 9 months ago

  • Description updated (diff)

#44 Updated by Tom Clegg 9 months ago

Added one more fix that wasn't mentioned here: Log stderr from last boot-probe when giving up on boot.

14325-dispatch-cloud @ ee53a267ded17bc50eaf4dfebba5ff4a3273753c https://ci.curoverse.com/view/Developer/job/developer-run-tests/1053/

#45 Updated by Peter Amstutz 9 months ago

Tom Clegg wrote:

Added one more fix that wasn't mentioned here: Log stderr from last boot-probe when giving up on boot.

14325-dispatch-cloud @ ee53a267ded17bc50eaf4dfebba5ff4a3273753c https://ci.curoverse.com/view/Developer/job/developer-run-tests/1053/

This LGTM, thanks.

#46 Updated by Tom Clegg 9 months ago

  • Status changed from In Progress to Resolved

#47 Updated by Tom Morris 9 months ago

  • Release set to 15

Also available in: Atom PDF