Idea #14360

[crunch-dispatch-cloud] Merge incomplete implementation

Added by Tom Clegg over 5 years ago. Updated about 5 years ago.

Status: Resolved
Priority: Normal
Assigned To: Tom Clegg
Category: -
Target version: 2018-12-21 Sprint
Story points: 1.0
Release relationship: Auto

Description

(The intent is to avoid waiting for a complete implementation and reviewing the whole thing at once.)

Acceptance criteria:
  • code quality (readable, maintainable)
  • package is not built/published
  • bugs/shortcomings relative to #14325 are listed explicitly on #14325 or on linked issues

Subtasks 1 (0 open, 1 closed)

Task #14368: Review 14360-dispatch-cloud (Resolved, Peter Amstutz, 10/26/2018)

Related issues

Related to Arvados - Feature #14325: [crunch-dispatch-cloud] Dispatch containers to cloud VMs directly, without slurm or nodemanager (Resolved, Tom Clegg, 01/28/2019)
Related to Arvados - Idea #13908: [Epic] Replace SLURM for cloud job scheduling/dispatching (Resolved)
Actions #1

Updated by Tom Clegg over 5 years ago

  • Related to Feature #14325: [crunch-dispatch-cloud] Dispatch containers to cloud VMs directly, without slurm or nodemanager added
Actions #2

Updated by Tom Clegg over 5 years ago

  • Assigned To set to Tom Clegg
Actions #3

Updated by Tom Clegg over 5 years ago

  • Description updated (diff)
Actions #4

Updated by Tom Clegg over 5 years ago

  • Subject changed from [crunch-dispatch-cloud] Publish packages for incomplete implementation to [crunch-dispatch-cloud] Merge incomplete implementation
Actions #5

Updated by Tom Clegg over 5 years ago

  • Target version changed from To Be Groomed to 2018-10-31 sprint
  • Story points set to 2.0
Actions #6

Updated by Tom Clegg over 5 years ago

  • Status changed from New to In Progress

14360-dispatch-cloud @ c3c538444c15e68e96780f157935f2baa4ba0bc5

To do:
  • Pass Arvados token to crunch-run
  • Use runtime_status when cancelling containers
  • "Hold" and "unhold" APIs
    • add tags so hold state survives dispatcher restart
  • Test crunch-run --list, --detach, --kill
  • Test SSH host key verification
  • Test activity/resource usage metrics
  • Test container.Queue using real apiserver/controller
  • Test resuming state after restart (some instances are booting, some idle, some running containers, some on admin-hold)
  • Cancel container after some number of start/requeue cycles
  • Cancel container with no suitable instance type
  • Lock containers asynchronously (a slow Lock API call should not delay other containers' progress)
  • (?) "broken node" hook
  • (?) crunch-run --detach: retrieve stdout/stderr during probe, and show it in dispatcher logs
  • (?) crunch-run --detach: cleanup old stdout/stderr
  • Check with ops whether dispatch & crunch-run are configurable enough
    • crunch-run -- "/var/run/crunch-run-$uuid.lock" always OK?
    • SSH server always on port 22 on worker nodes?
  • Azure driver
  • Enable package build (currently commented out)
  • Remove LameInstanceSet, move any features worth saving to StubDriver
  • (?) Move stub VM command implementation from dispatchcloud pkg to test pkg
To not do yet:
  • Generic driver test suite
Actions #7

Updated by Peter Amstutz over 5 years ago

Seems like dispatcher.setup() doesn't get called until the first call to ServeHTTP() but that doesn't make sense since the service should proactively start doing stuff as soon as it starts.

Seems like InstanceSetID is intended to be read from configuration but it is just left dangling, always empty.

readme.go uses term "Provider" but cloud package uses "InstanceSet"

Still don't like "InstanceSet". (how about "InstancePool" ?)

What do you think about associating the VM image with the instance type (along with a default VM image)? Use cases: nodes with specialized hardware (like GPUs), giving ops a way to test new images (by requesting a node of a special "testing" instance type).

worker.Pool.run is rather barren of comments -- I don't really know what it is supposed to do (but I'll figure it out).

worker.Pool.run is doing something slightly obfuscated by stopping, consuming, and resetting the timer; it needs a comment explaining what it is intended to accomplish.

worker.Pool.run declares a large anonymous function and spins it off into a goroutine, would be more readable if it was moved to a private method of *Pool.

Tag names like "InstanceType" and "hold" should be declared as constants.

In worker.Pool.updateWorker, calling time.Now() four times will result in four sliiiiightly different timestamps. I don't know if these are ever directly/indirectly compared.

In worker.Pool.sync, it looks like if a whole bunch of instances are discovered at once (for example, when restarting the service) it will produce a storm of wp.notify() calls.

In worker.Pool.sync, workers that are not in the instance list are removed from wp.workers without any cleanup. Does this leave a dangling executor, or 'probing' channel? What about containers in 'running' or 'starting' ?

"StateRunning" means "the machine is up" but "wkr.running" means "containers are running" so the term "running" is a little overloaded. Instead of compound checks like len(wkr.running)+len(wkr.starting) > 0 || wkr.state != StateRunning how about distinct "StateIdle" and "StateAllocated" ?

Why is "booted" a separate flag and not indicated through State?

What happens if the "bootProbeCommand" returns error but there are containers running? (Looks like it will never discover them).

I don't think probeAndUpdate should be making shutdown decisions. It should only be responsible for updating the state of the worker. autoShutdown() should be responsible for deciding if a worker is viable or not.

probeAndUpdate uses a change in the 'updated' timestamp to decide whether to update the record, but this is only checked after several fields have already been assigned (wkr.busy, wkr.lastUUID, wkr.state). Needs explanation about why it is ok to update those but not others.

Instead of separate "busy" and "unallocated" timestamps, a single "timestamp of last state change" would be simpler (→ assumes separate allocated/unallocated states).

To be continued...

Actions #8

Updated by Peter Amstutz over 5 years ago

container.Queue has a field called "keeplocal" and while I don't think it has anything to do with Keep it isn't immediately obvious what it is for.

container.Queue has overlapping functionality with sdk/go/dispatch, it would be too bad if the helper library for building Arvados container dispatchers can't be of any help here (realize this might involve refactoring but would still be better than duplication?)

The endpoint is called "arvados/v1/dispatch/instances" but it is actually listing "workers"

Actions #9

Updated by Peter Amstutz over 5 years ago

"scheduler.Map" not keen on the name, how about "scheduler.Allocate" or "scheduler.Schedule"

"Map should not be called without first calling FixStaleLocks." → Then maybe it should just call it itself?

scheduler.Map calls pool.Create() and checks for a cloud.QuotaError, but the implementation of worker.pool.Create() is asynchronous and never returns an error.

scheduler.Map does the Lock-and-Start for each container sequentially, so if a call gets hung up (for whatever reason) the whole scheduling process is blocked.

Is there a race between observing arvados.ContainerStateComplete in the queue and observing that crunch-run has exited on the node? It looks like it would call kill(). Should it check that the pool.Running status was refreshed more recently than the queue? → on further inspection it looks like queue.Update() is always called before scheduler.Sync() (except for the first time the loop is entered, hmm) so this race presumably doesn't happen, but the assumption should be documented in the code.

scheduler.Sync does cancel-or-kill for each container sequentially, so if a call gets hung up (for whatever reason) the whole scheduling process is blocked.

It is kind of weird that SlurmNodeTypeFeatureKludge is part of dispatchcloud now that dispatchcloud package is almost entirely about crunch-dispatch-cloud.

The architectural issue I'd most like to see addressed is to use finer-grained worker states. Recommended states:

  • Unknown - instance has not yet been contacted, because either the instance or the dispatcher have just started/restarted
  • Booting - instance has been contacted, but the "ready" script returns false
  • Unallocated - instance has been contacted, "ready" returned success, "crunch-run --list" does not return anything but exits success, may schedule a container
  • Allocated - instance has been contacted, "ready" returned success, "crunch-run --list" returns a running uuid and exits success, don't schedule additional containers
  • Hold - no action will be taken on the instance (don't start new containers, don't cancel running containers, don't start a shutdown)
  • Fail - do not schedule containers, want shutdown
  • Shutdown - do not schedule containers, destroy request is committed

The "Fail" state should be indicated by "crunch-run" exiting failure, and/or by the "broken node script" creating a "node is broken" file to indicate something has gone wrong. Admins should be able to put a node in "Fail" state to force shutdown.

Actions #10

Updated by Peter Amstutz over 5 years ago

SSH server always on port 22 on worker nodes?

Pretty sure our current compute images have ssh on :2222

Actions #11

Updated by Peter Amstutz over 5 years ago

The crunch-run Detach() function calls itself with a "-nodetach" option, but that isn't declared in the flags of main() (looks like that was supposed to be "-detached", or vice versa)

The detached process should get a new process group so it won't get SIGHUP from ssh

(https://stackoverflow.com/questions/35433741/in-golang-prevent-child-processes-to-receive-signals-from-calling-process)

Setpgid: true

https://golang.org/pkg/syscall/#SysProcAttr
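
A minimal, self-contained sketch of detaching a child into its own process group in Go (the command path and flag below are placeholders, not the real crunch-run invocation):

    package main

    import (
        "log"
        "os"
        "os/exec"
        "syscall"
    )

    func main() {
        // Placeholder command and flag; crunch-run would re-exec itself here.
        cmd := exec.Command("/path/to/child", "--some-flag")
        cmd.Stdout = os.Stdout
        cmd.Stderr = os.Stderr
        // Put the child in its own process group so it does not receive the
        // SIGHUP sent to the parent's group when the ssh session ends.
        cmd.SysProcAttr = &syscall.SysProcAttr{Setpgid: true}
        if err := cmd.Start(); err != nil {
            log.Fatal(err)
        }
        // Parent returns immediately; the detached child keeps running.
    }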

Out of curiosity, for ListProcesses did you consider looking at the process table directly (looking for crunch-run and then getting the container uuid from the command line) instead of wrangling pidfiles? What's the benefit of pidfiles?

Actions #12

Updated by Tom Clegg over 5 years ago

Peter Amstutz wrote:

Seems like dispatcher.setup() doesn't get called until the first call to ServeHTTP() but that doesn't make sense since the service should proactively start doing stuff as soon as it starts.

This was relying on service.Command, which calls CheckHealth at startup. But yes, it's better not to rely on that. Added an explicit Start method that gets called from newHandler.

Seems like InstanceSetID is intended to be read from configuration but it is just left dangling, always empty.

Currently we rely on "use API token if empty" in initialize() since this is already the way we avoid dispatcher collisions. But yes, the idea is to separate the ID from the credentials in the future.

readme.go uses term "Provider" but cloud package uses "InstanceSet"

Fixed

Still don't like "InstanceSet". (how about "InstancePool" ?)

(from #13964#note-17)

Hm. I was thinking of a set in the more abstract sense of an unordered collection of distinct items, rather than in terms of the efficiency of a "test for membership" operation.

I'm not keen on "pool" because it connotes a set of initialized objects which can be checked out, used, and then returned to the pool to minimize create/destroy cost. But that behavior is implemented in workerPool, not in the driver. The InstanceSet's add/remove/list operations correspond more directly to the real (and slow) create/destroy/list operations provided by the cloud provider.

What do you think about associating the VM image with the instance type (along with a default VM image)? Use cases: nodes with specialized hardware (like GPUs), giving ops a way to test new images (by requesting a node of a special "testing" instance type).

Sounds interesting, although I think "selectable image" and "per-instance-type image" might be distinct features. Added to Dispatching containers to cloud VMs.

worker.Pool.run is rather barren of comments -- I don't really know what it is supposed to do (but I'll figure it out).

Split into runProbes, runSync, runMetrics.

worker.Pool.run is doing something slightly obfuscated by stopping, consuming, and resetting the timer; it needs a comment explaining what it is intended to accomplish.

Simplified this. Now Reset() only gets called after the timer fires, so there are no races to handle.
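
For reference, the reset-only-after-fire pattern looks roughly like this (a sketch, not the actual worker.Pool code):

    package worker

    import "time"

    // runProbes sketches a periodic loop that only calls Reset after the timer
    // has fired and been consumed, so there is no drained-vs-pending race.
    func runProbes(interval time.Duration, probe func(), stop <-chan struct{}) {
        timer := time.NewTimer(interval)
        defer timer.Stop()
        for {
            select {
            case <-timer.C:
                probe()
                timer.Reset(interval) // safe: the timer already fired above
            case <-stop:
                return
            }
        }
    }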

worker.Pool.run declares a large anonymous function and spins it off into a goroutine, would be more readable if it was moved to a private method of *Pool.

Split into runProbes, runSync, runMetrics.

Tag names like "InstanceType" and "hold" should be declared as constants.

Fixed

In worker.Pool.updateWorker, calling time.Now() four times will result in four sliiiiightly different timestamps. I don't know if these are ever directly/indirectly compared.

They aren't, but yes, better to use a single timestamp anyway. Fixed.

In worker.Pool.sync, it looks like if a whole bunch of instances are discovered at once (for example, when restarting the service) it will produce a storm of wp.notify() calls.

Notifications get batched rather than piling up, so it's OK to have extras, but I fixed this up anyway: now sync does all its updates, then calls notify once if needed.
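
A minimal sketch of how such batching can work with a one-slot buffered channel (illustrative names; not necessarily the actual Pool implementation):

    package worker

    // notifier sketches the batching behavior: senders never block, and any
    // number of notify() calls made while the consumer is busy collapse into
    // a single pending wakeup.
    type notifier struct {
        ch chan struct{}
    }

    func newNotifier() *notifier {
        return &notifier{ch: make(chan struct{}, 1)}
    }

    // notify queues at most one wakeup.
    func (n *notifier) notify() {
        select {
        case n.ch <- struct{}{}:
        default: // a wakeup is already pending; coalesce
        }
    }

    // Subscribe returns the channel a consumer selects on.
    func (n *notifier) Subscribe() <-chan struct{} {
        return n.ch
    }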

In worker.Pool.sync, workers that are not in the instance list are removed from wp.workers without any cleanup. Does this leave a dangling executor, or 'probing' channel? What about containers in 'running' or 'starting' ?

Added a Close() method to ssh_executor, and a call to it when deleting a worker.

Any containers left in running/starting stop showing up in the list, so the scheduler will unlock/cancel them the same way as if the processes had ended while the worker was still alive.

"StateRunning" means "the machine is up" but "wkr.running" means "containers are running" so the term "running" is a little overloaded. Instead of compound checks like len(wkr.running)+len(wkr.starting) > 0 || wkr.state != StateRunning how about distinct "StateIdle" and "StateAllocated" ?

Split Running to Idle/Running.

Why is "booted" a separate flag and not indicated through State?

It is mostly redundant -- it's just "state==unknown || state==booting", except between the first successful boot probe and the first successful run probe. Updated to look at state instead, and just stay in "booting" state until the first run probe succeeds.

What happens if the "bootProbeCommand" returns error but there are containers running? (Looks like it will never discover them).

That's right. We don't try to do anything else with a worker/instance until boot probe succeeds.

(If the operator changes the VM image and boot probe command in the config, and restarts the dispatcher, and the new boot probe fails on the old VMs even though they are still running containers, it seems desirable to treat the old VMs as "draining": shutdown after the current containers finish. I think the logic can be "if boot probe fails, try a run-probe anyway; if exit=0 and uuid list is non-empty, go to drain state". Noted this on #14325.)

I don't think probeAndUpdate should be making shutdown decisions. It should only be responsible for updating the state of the worker. autoShutdown() should be responsible for deciding if a worker is viable or not.

Added shutdownIfBroken and renamed autoShutdown→shutdownIfIdle.

probeAndUpdate uses a change in the 'updated' timestamp to decide whether to update the record, but this is only checked after several fields have already been assigned (wkr.busy, wkr.lastUUID, wkr.state). Needs explanation about why it is ok to update those but not others.

Updated comment. The "running" container UUID list is the thing that needs to be left alone after losing a race.

Instead of separate "busy" and "unallocated" timestamps, a single "timestamp of last state change" would be simpler (→ assumes separate allocated/unallocated states).

Yes, these are nearly identical. Deleted "unallocated", and tweaked "busy" to mean the non-busy end of the busy→non-busy transition period (if the last container ended between probe 2 and probe 3, busy time is probe 3 time). Reported as "LastBusy".

container.Queue has a field called "keeplocal" and while I don't think it has anything to do with Keep it isn't immediately obvious what it is for.

Renamed to "dontupdate" and added comment.

container.Queue has overlapping functionality with sdk/go/dispatch, it would be too bad if the helper library for building Arvados container dispatchers can't be of any help here (realize this might involve refactoring but would still be better than duplication?)

It seems like we should be able to dedup the "fetch all pages of containers-of-interest" part, although in sdk/go/dispatch that's tangled up with the "send updates" code, and the shared code really just amounts to a few API calls via the arvados SDK, so it might not be a big deal.

The endpoint is called "arvados/v1/dispatch/instances" but it is actually listing "workers"

The data structure being traversed is a set of workers, each with a pointer to an instance. But I think the API is still "get a list of instances, with some details attached."

"scheduler.Map" not keen on the name, how about "scheduler.Allocate" or "scheduler.Schedule"

Rearranged the scheduler package to use a Scheduler type. Map() is now runQueue().

"Map should not be called without first calling FixStaleLocks." → Then maybe it should just call it itself?

This was supposed to explain FixStaleLocks(); for { Map(); Sync(); }

But now the public method is sch.Start(), and that pattern is private, so it doesn't have to be explained.

scheduler.Map calls pool.Create() and checks for a cloud.QuotaError, but the implementation of worker.pool.Create() is asynchronous and never returns an error.

Fixed. pool.Create returns the quota error immediately if called while AtQuota() is true.

scheduler.Map does the Lock-and-Start for each container sequentially, so if a call gets hung up (for whatever reason) the whole scheduling process is blocked.

Starting is asynchronous but locking does happen serially. Added to note-6 above. (Edit: locking is now async, ff8c3cfde)

Is there a race between observing arvados.ContainerStateComplete in the queue and observing that crunch-run has exited on the node? It looks like it would call kill(). Should it check that the pool.Running status was refreshed more recently than the queue? → on further inspection it looks like queue.Update() is always called before scheduler.Sync() (except for the first time the loop is entered, hmm) so this race presumably doesn't happen, but the assumption should be documented in the code.

Yes, this is a race opportunity, but it's harmless: killing a dead crunch-run is OK (merely frees up the worker sooner), and killing a live crunch-run after the container is finalized is also OK. Added comment.

scheduler.Sync does cancel-or-kill for each container sequentially, so if a call gets hung up (for whatever reason) the whole scheduling process is blocked.

Yes. Changed both to goroutines. (pool.KillContainer() is async already, but doesn't hurt to make it clearer.)

It is kind of weird that SlurmNodeTypeFeatureKludge is part of dispatchcloud now that dispatchcloud package is almost entirely about crunch-dispatch-cloud.

True. Moved to crunch-dispatch-slurm.

The architectural issue I'd most like to see addressed is to use finer-grained worker states. Recommended states:

  • Unknown - instance has not yet been contacted, because either the instance or the dispatcher have just started/restarted
  • Booting - instance has been contacted, but the "ready" script returns false
  • Unallocated - instance has been contacted, "ready" returned success, "crunch-run --list" does not return anything but exits success, may schedule a container
  • Allocated - instance has been contacted, "ready" returned success, "crunch-run --list" returns a running uuid and exits success, don't schedule additional containers
  • Hold - no action will be taken on the instance (don't start new containers, don't cancel running containers, don't start a shutdown)
  • Fail - do not schedule containers, want shutdown
  • Shutdown - do not schedule containers, destroy request is committed

The "Fail" state should be indicated by "crunch-run" exiting failure, and/or by the "broken node script" creating a "node is broken" file to indicate something has gone wrong. Admins should be able to put a node in "Fail" state to force shutdown.

I think in both cases wouldn't we just go directly to Shutdown state? It doesn't seem like "destroy request is committed / not" will affect anything: either way, we should keep trying to kill the instance until it goes away.

The crunch-run Detach() function calls itself with a "-nodetach" option, but that isn't declared in the flags of main() (looks like that was supposed to be "-detached", or vice versa)

Indeed, fixed

The detached process should get a new process group so it won't get SIGHUP from ssh

Fixed -- no more race window between child starting and parent exiting, thanks.

Out of curiosity, for ListProcesses did you consider looking at the process table directly (looking for crunch-run and then getting the container uuid from the command line) instead of wrangling pidfiles? What's the benefit of pidfiles?

I did. But pidfiles support flock() (ensuring we don't run two of the same container UUID at once due to networking mishap), provide a place to indicate where stdout/stderr are being written, and are easy for humans to inspect when troubleshooting. (It also avoids stuff like figuring out which processes belong to containers vs. the host.)

I think a socket with a "tell me about yourself" service might have better "test for life signs" characteristics than a pidfile, but I thought the pidfile was a reasonable start.

Actions #13

Updated by Tom Clegg over 5 years ago

  • Story points changed from 2.0 to 1.0
Actions #14

Updated by Tom Clegg over 5 years ago

  • Target version changed from 2018-10-31 sprint to 2018-11-14 Sprint
Actions #15

Updated by Tom Morris over 5 years ago

  • Target version changed from 2018-11-14 Sprint to 2018-11-28 Sprint
Actions #16

Updated by Tom Clegg over 5 years ago

Peter Amstutz wrote:

  • Booting - instance has been contacted, but the "ready" script returns false
  • Unallocated - instance has been contacted, "ready" returned success, "crunch-run --list" does not return anything but exits success, may schedule a container
  • Allocated - instance has been contacted, "ready" returned success, "crunch-run --list" returns a running uuid and exits success, don't schedule additional containers
What do you think the state should be when
  • ...the boot probe has succeeded but "crunch-run --list" has not succeeded yet? "Probing"? (I might have introduced a bug here by removing the "booted" flag: currently, state==Running during this interval. FixStaleLocks should normally prevent the scheduler from starting additional work at this point, but even so I think we're better off treating/reporting a node as one of the "not ready" states until crunch-run --list succeeds.) (Edit: this is "booting" state. If the boot probe succeeds but the first run probe fails, the worker stays in "booting" state, because we still don't have enough information to say whether it's Running or Idle.)
  • ...there is a container running, but there is room for another container? (We don't do this yet, but I'm avoiding creating obstacles.) (Edit: This is "running" state. The only thing that cares whether "running" means "can't run more" is StartContainer, so this seems OK.)

With this many states it might be worth using the state pattern, with each worker in a state-specific wrapper that proxies/rejects various operations, rather than having "if state==..." conditions scattered throughout. That would probably be less fragile. (Edit: Seems not worth the extra code. We're using worker state to control the pool's behavior as much as the worker's behavior.)

Actions #17

Updated by Tom Clegg over 5 years ago

#14325 is updated to reflect things done/deferred here.

Actions #18

Updated by Tom Clegg over 5 years ago

  • Target version changed from 2018-11-28 Sprint to 2018-12-12 Sprint
Actions #19

Updated by Peter Amstutz over 5 years ago

dispatcher.apiContainers and dispatcher.apiInstances don't check any read permissions.

We should think about having a way to run crunch-dispatch-cloud in a dry run mode, where it pretends to create instances but doesn't actually require a cloud (maybe it creates VMs or Docker containers on the local host to represent compute nodes).

I would expect Pool.Workers() to return a collection of worker objects. Maybe Pool.WorkerStates would be a better name?

worker.shutdown() results in change in state of a worker but does not call Pool.notify().

I think in both cases wouldn't we just go directly to Shutdown state? It doesn't seem like "destroy request is committed / not" will affect anything: either way, we should keep trying to kill the instance until it goes away.

I mentioned this because in node manager (a) there's a race between deciding to shut down an instance, slurm scheduling a new job on it, and telling the cloud to shut it down and (b) it is possible to cancel a not-yet-committed shutdown to reclaim a node, but only if the node is being shut down for utilization reasons and not because it was deemed "broken".

In c-d-c, when an instance is in shutdown state, it is unconditionally shut down. So if something is running on the instance, it will be rudely interrupted.

However, c-d-c shouldn't have a scheduling race. There is a race between an operator manually sending a request "shut down this instance" on a node and work being scheduled on it. So a "drain" and/or "drain+shutdown" state seems desirable, for shutting down an instance safely. (Although, this prototype does not support any command API for operators, so maybe this is getting ahead of things).

Actions #20

Updated by Peter Amstutz over 5 years ago

There's a bunch of places with the comment "caller must have lock". Would it make sense to instead use a recursive lock so that each method that needs the lock can take it? (Comments about assumptions of the caller environment that are not enforced by the logic of method itself are a code smell...)

In crunch-run:

The similarly named flags "-detach" and "-detached" are confusing. Instead of prepending a "-detached" flag which apparently overrides the "--detach" flag (see how this is confusing?) couldn't you just remove "-detach" from the command line of the subprocess?

To avoid a race between detach() and ListProcesses(), how about creating the lockfile under a different name, locking it, and then renaming it to one that ListProcesses() will recognize?

syscall.Flock(int(f.Fd()), syscall.LOCK_SH)

Won't this block when the lock is held? Does it need |syscall.LOCK_NB ?

ListProcesses is looking for UUID to be in the procinfo read from lockfile, but detach() doesn't write UUID.

KillProcess() waits one second and then returns success or failure. Cancelling a crunch-run process is very likely to take more than 1 second to gracefully shut down. It doesn't appear to handle the case where the signal was received but the process takes a while to shut down, or the case where the process is well and truly stuck and needs a KILL signal instead of a TERM signal (from experience you don't really know what is wrong and from a failure recovery standpoint it might be better to declare the worker as "failed" and shut it down.)

The "--kill" option takes a number of the signal to send that is missing in c-d-c Pool.kill().

I believe Scheduler.sync() should be checking if a running container has priority==0 and cancel it if that's the case. (crunch-run does not check the priority itself, it relies on the dispatcher to cancel it when priority==0).

Actions #21

Updated by Tom Clegg over 5 years ago

Peter Amstutz wrote:

dispatcher.apiContainers and dispatcher.apiInstances don't check any read permissions.

They are behind auth.RequireLiteralToken(disp.Cluster.ManagementToken, ...) so they don't need to check permission.

Added tests for this, and changed the "no/empty management token configured" behavior from "no auth needed" to 401.

We should think about having a way to run crunch-dispatch-cloud in a dry run mode, where it pretends to create instances but doesn't actually require a cloud (maybe it creates VMs or Docker containers on the local host to represent compute nodes).

This is how the lib/dispatchcloud test suite works. test.StubDriver brings up "instances" that are really just SSH servers on localhost with fake responses to the probe/crunch-run commands. It doesn't yet have a mode for executing the probe/crunch-run commands so the containers actually run on localhost, though.

I would expect Pool.Workers() to return a collection of worker objects. Maybe Pool.WorkerStates would be a better name?

Renamed to CountWorkers.

worker.shutdown() results in change in state of a worker but does not call Pool.notify().

Fixed.

(Although, this prototype does not support any command API for operators, so maybe this is getting ahead of things).

Ah. Yes, "ops mechanism for draining a node" is on the todo list on #14325.

So far, there's only "haven't decided to shutdown" and "decided to shutdown; keep trying shutdown until the node goes away", and there's no going back.

There's a bunch of places with the comment "caller must have lock". Would it make sense to instead use a recursive lock so that each method that needs the lock can take it? (Comments about assumptions of the caller environment that are not enforced by the logic of method itself are a code smell...)

I'm not convinced. These are private methods, only to be used by callers who already have the lock. Documenting that surely isn't a code smell.

As an example, a caller must lock before calling shutdown() to prevent startContainer() from starting a container after deciding to shutdown but before the node's state changes to StateShutdown.

"Recursive mutexes are just a mistake, nothing more than a comfortable home for bugs." (Russ Cox)

"Check whether the lock is actually held" could potentially expose buggy callers -- but running high-concurrency tests with the race detector enabled seems like a more effective and less intrusive way to do that.

The similarly named flags "-detach" and "-detached" are confusing. Instead of prepending a "-detached" flag which apparently overrides the "--detach" flag (see how this is confusing?) couldn't you just remove "-detach" from the command line of the subprocess?

This just seemed simpler than dealing with --detach, -detach=true, -detach -detach, etc.

Renamed to "-no-detach" and changed var in main() to ignoreDetachFlag to make it clearer, hopefully.

To avoid a race between detach() and ListProcesses(), how about creating the lockfile under a different name, locking it, and then renaming it to one that ListProcesses() will recognize?

That would also allow 2+ processes to successfully create/exclusive-lock/rename without noticing that they're all using the same lockfile name.

It could be done with two lockfiles, though:
  1. (run) Lock lockfile1 with LOCK_EX|LOCK_NB
  2. (run) Lock lockfile2 with LOCK_EX (if paranoid, use a timer in case a rogue crunch-run locked lockfile2 without locking lockfile1 first)
  3. (list) Lock lockfile2 with LOCK_SH|LOCK_NB

[...]

Won't this block when the lock is held? Does it need |syscall.LOCK_NB ?

Yes! Fixed.
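
The non-blocking variant presumably looks something like this (a sketch; the helper name and error handling are illustrative, not the actual crunch-run code):

    package lockfile

    import (
        "fmt"
        "os"
        "syscall"
    )

    // trySharedLock attempts a shared, non-blocking flock on path. If another
    // process holds an exclusive lock, it returns an error instead of
    // blocking the whole listing.
    func trySharedLock(path string) (*os.File, error) {
        f, err := os.Open(path)
        if err != nil {
            return nil, err
        }
        if err := syscall.Flock(int(f.Fd()), syscall.LOCK_SH|syscall.LOCK_NB); err != nil {
            f.Close()
            return nil, fmt.Errorf("%s is locked by another process: %w", path, err)
        }
        return f, nil
    }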

ListProcesses is looking for UUID to be in the procinfo read from lockfile, but detach() doesn't write UUID.

Indeed. Fixed.

KillProcess() waits one second and then returns success or failure. Cancelling a crunch-run process is very likely to take more than 1 second to gracefully shut down. It doesn't appear to handle the case where the signal was received but the process takes a while to shut down, or the case where the process is well and truly stuck and needs a KILL signal instead of a TERM signal (from experience you don't really know what is wrong and from a failure recovery standpoint it might be better to declare the worker as "failed" and shut it down.)

Waiting for the process to shut down, trying different signals, etc. is up to the dispatcher. All crunch-run does is send the specified signal to the appropriate process and report whether the process is still alive.

Added the signal escalation stuff to #14325.

The "--kill" option takes a number of the signal to send that is missing in c-d-c Pool.kill().

Fixed.

I believe Scheduler.sync() should be checking if a running container has priority==0 and cancel it if that's the case. (crunch-run does not check the priority itself, it relies on the dispatcher to cancel it when priority==0).

Yes, that was mentioned in readme.go but didn't get implemented/noted. Easy, so just did it here.

14360-dispatch-cloud @ b205525d0b7c7b9042513fe77d2e8061534208ae
https://ci.curoverse.com/view/Developer/job/developer-run-tests/1007/

Actions #22

Updated by Tom Clegg over 5 years ago

  • Target version changed from 2018-12-12 Sprint to 2018-12-21 Sprint
Actions #23

Updated by Tom Clegg over 5 years ago

To avoid a race between detach() and ListProcesses(), how about creating the lockfile under a different name, locking it, and then renaming it to one that ListProcesses() will recognize?

That would also allow 2+ processes to successfully create/exclusive-lock/rename without noticing that they're all using the same lockfile name.

It could be done with two lockfiles, though:
  1. (run) Lock lockfile1 with LOCK_EX|LOCK_NB
  2. (run) Lock lockfile2 with LOCK_EX (if paranoid, use a timer in case a rogue crunch-run locked lockfile2 without locking lockfile1 first)
  3. (list) Lock lockfile2 with LOCK_SH|LOCK_NB

Nope, this still doesn't address the garbage-collection-between-create-and-lock problem.

Fixed with a dir-level lock. Only one process at a time can do either create-then-lock or testForStale-then-delete.
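
A rough sketch of such a directory-level lock (the helper name and lockfile path are assumptions, not the actual crunch-run code):

    package lockfile

    import (
        "os"
        "path/filepath"
        "syscall"
    )

    // withDirLock holds an exclusive lock on a directory-wide lockfile while
    // fn runs, so create-then-lock and testForStale-then-delete can never
    // interleave.
    func withDirLock(dir string, fn func() error) error {
        dl, err := os.OpenFile(filepath.Join(dir, ".dirlock"), os.O_CREATE|os.O_RDWR, 0700)
        if err != nil {
            return err
        }
        defer dl.Close()
        if err := syscall.Flock(int(dl.Fd()), syscall.LOCK_EX); err != nil {
            return err
        }
        defer syscall.Flock(int(dl.Fd()), syscall.LOCK_UN)
        return fn()
    }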

14360-dispatch-cloud @ 076d8a0d4885f7bfa22962a745a5aacb48fa9be1

Actions #24

Updated by Peter Amstutz over 5 years ago

I'm looking at TestDispatchToStubDriver

This is an interesting test, since it seems intended to exercise all the core components.

However, I don't see checks that confirm that it is doing anything, aside from "didn't panic" and "didn't hang". Am I missing something?

Actions #25

Updated by Tom Clegg over 5 years ago

Confirming the dispatcher brought up instances and ran all 200 containers on them within 10 seconds, despite various simulated failures:

        select {
        case <-done:
                c.Logf("containers finished (%s), waiting for instances to shutdown and queue to clear", time.Since(start))
        case <-time.After(10 * time.Second):
                c.Fatalf("timed out; still waiting for %d containers: %q", len(waiting), waiting)
        }

Confirming all of the instances are shut down within 1 second once they're idle:

        deadline := time.Now().Add(time.Second)
        for range time.NewTicker(10 * time.Millisecond).C {
                // ...
                if len(ents) == 0 && len(insts) == 0 {
                        break
                }
                if time.Now().After(deadline) {
                        c.Fatalf("timed out with %d containers (%v), %d instances (%+v)", len(ents), ents, len(insts), insts)
                }
        }

If you run run-tests.sh with lib/dispatchcloud_test=-check.vv you get log messages:

...
INFO[0001] [test] starting crunch-run stub               ContainerUUID=zzzzz-dz642-000000000000181
INFO[0001] crunch-run process started                    ContainerUUID=zzzzz-dz642-000000000000181 Instance=stub-providertype6-42397fcf22cadd11 InstanceType=type6 Priority=1
INFO[0001] [test] exiting crunch-run stub                ContainerUUID=zzzzz-dz642-000000000000021
WARN[0001] new instance unresponsive, shutting down      Duration=31.024162ms Instance=stub-providertype6-2fefe3d0fe612994 InstanceType=type6 Since="2018-12-14 13:28:07.582699391 -0500 EST m=+1.291183640" State=booting
INFO[0001] container finished                            ContainerUUID=zzzzz-dz642-000000000000021 State=Complete
INFO[0001] instance disappeared in cloud                 Instance=stub-providertype6-2fefe3d0fe612994 WorkerState=shutdown
INFO[0001] [test] exiting crunch-run stub                ContainerUUID=zzzzz-dz642-000000000000101
INFO[0001] [test] exiting crunch-run stub                ContainerUUID=zzzzz-dz642-000000000000045
INFO[0001] [test] exiting crunch-run stub                ContainerUUID=zzzzz-dz642-000000000000181
INFO[0001] container finished                            ContainerUUID=zzzzz-dz642-000000000000101 State=Complete
INFO[0001] container finished                            ContainerUUID=zzzzz-dz642-000000000000181 State=Complete
INFO[0001] container finished                            ContainerUUID=zzzzz-dz642-000000000000045 State=Complete
INFO[0001] [test] exiting crunch-run stub                ContainerUUID=zzzzz-dz642-000000000000061
INFO[0001] shutdown idle worker                          Age=31.482455ms Instance=stub-providertype6-7472585690a3dc1f InstanceType=type6
INFO[0001] container finished                            ContainerUUID=zzzzz-dz642-000000000000061 State=Complete
INFO[0001] instance disappeared in cloud                 Instance=stub-providertype6-7472585690a3dc1f WorkerState=shutdown
INFO[0001] [test] exiting crunch-run stub                ContainerUUID=zzzzz-dz642-000000000000141
containers finished (1.335522654s), waiting for instances to shutdown and queue to clear
INFO[0001] shutdown idle worker                          Age=34.6986ms Instance=stub-providertype6-724c26e048670159 InstanceType=type6
INFO[0001] shutdown idle worker                          Age=33.439178ms Instance=stub-providertype6-123bff23948ab975 InstanceType=type6
INFO[0001] shutdown idle worker                          Age=32.353059ms Instance=stub-providertype6-4279e0924b15f159 InstanceType=type6
INFO[0001] crunch-run process ended                      ContainerUUID=zzzzz-dz642-000000000000141 Instance=stub-providertype6-59d9b10cec10c688 InstanceType=type6 stderr=
INFO[0001] instance disappeared in cloud                 Instance=stub-providertype6-724c26e048670159 WorkerState=shutdown
INFO[0001] instance disappeared in cloud                 Instance=stub-providertype6-123bff23948ab975 WorkerState=shutdown
INFO[0001] instance disappeared in cloud                 Instance=stub-providertype6-4279e0924b15f159 WorkerState=shutdown
INFO[0001] container finished                            ContainerUUID=zzzzz-dz642-000000000000141 State=Complete
INFO[0001] shutdown idle worker                          Age=30.823974ms Instance=stub-providertype6-2f3a45cb3342fa6d InstanceType=type6
INFO[0001] instance disappeared in cloud                 Instance=stub-providertype6-2f3a45cb3342fa6d WorkerState=shutdown
INFO[0001] shutdown idle worker                          Age=31.734729ms Instance=stub-providertype6-42397fcf22cadd11 InstanceType=type6
INFO[0001] shutdown idle worker                          Age=32.740215ms Instance=stub-providertype6-2e3f039a09ec3ff2 InstanceType=type6
INFO[0001] shutdown idle worker                          Age=34.755251ms Instance=stub-providertype6-78ebdb3eb6c5d921 InstanceType=type6
INFO[0001] instance disappeared in cloud                 Instance=stub-providertype6-78ebdb3eb6c5d921 WorkerState=shutdown
INFO[0001] instance disappeared in cloud                 Instance=stub-providertype6-2e3f039a09ec3ff2 WorkerState=shutdown
INFO[0001] instance disappeared in cloud                 Instance=stub-providertype6-42397fcf22cadd11 WorkerState=shutdown
INFO[0001] shutdown idle worker                          Age=31.598095ms Instance=stub-providertype6-59d9b10cec10c688 InstanceType=type6
INFO[0001] instance disappeared in cloud                 Instance=stub-providertype6-59d9b10cec10c688 WorkerState=shutdown
Actions #26

Updated by Peter Amstutz over 5 years ago

What is a stubServer exactly? There's no "Server" interface. It seems to correspond to an Instance, so why is it a separate type from stubInstance?

I like the idea of queuing up a bunch of containers with random failures as a stress test, but I would like to see more systematic tests of individual success/fail cases, where the test verifies that an expected sequence of actions did actually happen. (possibly the answer is more tests like TestCreateUnallocShutdown).

    // Returned counts can be temporarily higher than 1 and 2 if
    // poll ran before Create() returned.
    c.Check(pool.Unallocated()[type1], check.Not(less), 1)
    c.Check(pool.Unallocated()[type2], check.Not(less), 2)

I think this comment is referring to the race in updating the creating field? Are there any situations where overcounting the number of unallocated nodes would lead to bad behavior, or is it mainly an issue of wasted effort if it tries to allocate a container and fails? This should be noted in comments.

In TestExecute, am I correct in thinking that what it does here is create an SSH service in the same process, and then talks to itself? (A comment would be useful here.)

Actions #27

Updated by Peter Amstutz over 5 years ago

Tom Clegg wrote:

Confirming the dispatcher brought up instances and ran all 200 containers on them within 10 seconds, despite various simulated failures:

[...]

That confirms that the "done" channel is closed. Which happens when onComplete() has been called 200 times. onComplete is part of fakeCloud which is used to hook into the behavior of stubDriver but that's when I start to get lost...

Looks like onComplete is called by fakeVM.exec when it has decided that a container is completed.

I guess the argument is that even though it doesn't directly confirm that certain events happened, it is reasonable to infer that it worked if it didn't blow up?

Confirming all of the instances are shut down within 1 second once they're idle:

[...]

If you run run-tests.sh with lib/dispatchcloud_test=-check.vv you get log messages:

[...]

Yes, I saw the log messages. That would be one way to establish that certain events happened. My concern is that the test doesn't examine the content of those log messages, so if something weird started happening (for example, a bug caused it to double-schedule on a node, or schedule on the wrong node) it wouldn't be noticed. You could collect the log messages and confirm that containers/instances went through their expected lifecycle.

Actions #28

Updated by Peter Amstutz over 5 years ago

I get a bunch of "changeState failed":

INFO[0000] creating new instance                         ContainerUUID=zzzzz-dz642-000000000000167 InstanceType=type8
INFO[0000] creating new instance                         ContainerUUID=zzzzz-dz642-000000000000127 InstanceType=type8
INFO[0000] creating new instance                         ContainerUUID=zzzzz-dz642-000000000000147 InstanceType=type4
WARN[0000] error locking container                       ContainerUUID=zzzzz-dz642-000000000000149 InstanceType=type6 error="changeState failed: state=\"Locked\"" 
INFO[0000] creating new instance                         ContainerUUID=zzzzz-dz642-000000000000026 InstanceType=type3
WARN[0000] error locking container                       ContainerUUID=zzzzz-dz642-000000000000049 InstanceType=type2 error="changeState failed: state=\"Locked\"" 
WARN[0000] error locking container                       ContainerUUID=zzzzz-dz642-000000000000105 InstanceType=type4 error="changeState failed: state=\"Locked\"" 
INFO[0000] instance appeared in cloud                    Instance=stub-providertype4-4d65822107fcfd52 InstanceType=type4 State=booting
WARN[0000] error locking container                       ContainerUUID=zzzzz-dz642-000000000000170 InstanceType=type3 error="changeState failed: state=\"Locked\"" 

However, it doesn't look like test.Queue.Lock() is supposed to randomly fail, and all the containers start out as Queued, so it isn't obvious whether this is part of the test or the code is actually stepping on itself.

This test is also overaggressive at shutting down fake instances:

INFO[0000] instance disappeared in cloud                 Instance=stub-providertype3-1b6cffa2ba517936 WorkerState=shutdown

It's doing this to instances that were "booted" less than a second earlier and haven't had an opportunity to do any work.

It's great that the test runs in 5 seconds, but the log seems chaotic and I don't know what it tells us about how it will behave on a real cloud where everything happens much much much slower.

Actions #29

Updated by Peter Amstutz over 5 years ago

As far as being ready to merge, we never really defined acceptance criteria beyond "has been thoroughly reviewed". I think I'm at the point where I've had a chance to look everything over. Although I think there's room for improvement, I see that TestDispatchToStubDriver does demonstrate successful interaction between the core dispatcher and scheduler components. The next step, adding an actual cloud driver, will make it technically possible to start testing on a dev cluster.

So looks... good... to... me?

Actions #30

Updated by Tom Clegg over 5 years ago

Peter Amstutz wrote:

I get a bunch of "changeState failed":

Looks like this happens when scheduler iteration N's "lock" API call is still running when scheduler iteration N+1 starts, but finishes before iteration N+1 calls bgLock(). It's normal/harmless, so it shouldn't be making noise in the logs. Will fix.

This test is also overaggressive at shutting down fake instances:

[...]

It's doing this to instances that were "booted" less than a second earlier and haven't had an opportunity to do any work.

This sounds like the desired behavior, keeping in mind timeouts here are about 1000x shorter than in real life (TimeoutBooting and TimeoutIdle are both 30ms).

I'm not sure what you mean by "haven't had an opportunity" though. Do the logs indicate the instances were booted and ready, and containers were waiting for that instance size?

That confirms that the "done" channel is closed. Which happens when onComplete() has been called 200 times. onComplete is part of fakeCloud which is used to hook into the behavior of stubDriver but that's when I start to get lost...

Sounds like the fakeCloud/fakeVM stuff is too hard to follow and needs to be cleaned up and explained better. Will fix.

The idea is that (fakeVM)exec() gets called when the dispatcher executes commands on a worker node. If the command is "crunch-run --detach {uuid}" the test starts a "fake detached crunch-run" goroutine which (if it doesn't decide to fail/crash) eventually calls onComplete, which checks the given uuid off its todo list, and closes "done" when all have been fake-run.

                delete(waiting, uuid)
                if len(waiting) == 0 {
                        close(done)
                }

I guess the argument is that even though it doesn't directly confirm that certain events happened, it is reasonable to infer that it worked if it didn't blow up?

It confirms that every container gets executed (and retried if needed) within a reasonable time, which is the thing the program is trying to achieve. I'm not sure what you mean by "directly" or "certain events"...?

Yes, I saw the log messages. That would be one way to establish that certain events happened. My concern is that the test doesn't examine the content of those log messages, so if something weird started happening (for example, a bug caused it to double-schedule on a node, or schedule on the wrong node) it wouldn't be noticed. You could collect the log messages and confirm that containers/instances went through their expected lifecycle.

There are more opportunities here to check for various things, without resorting to examining logs. The idea isn't to reach in and unit-test all of the components, though. It tests that the dispatcher is wiring all the components together effectively and the result actually runs containers.

we never really defined acceptance criteria beyond "has been thoroughly reviewed".

Well, there's the description:

(The intent is to avoid waiting for a complete implementation and reviewing the whole thing at once.)

Acceptance criteria:
  • code quality (readable, maintainable)
  • package is not built/published
  • bugs/shortcomings relative to #14325 are listed explicitly on #14325 or on linked issues
Actions #31

Updated by Tom Clegg about 5 years ago

14360-dispatch-cloud @ b3dbac2705f2bcffbf346d0b7efb85e2566b11f8
  • Consolidate test.StubServer+fakeVM+fakeCloud → test.StubVM
  • Don't spam logs about expected race condition when Lock attempts overlap
  • Don't over-count new instances that show up in Instances while waiting for Create to return
Actions #32

Updated by Tom Clegg about 5 years ago

  • Related to Idea #13908: [Epic] Replace SLURM for cloud job scheduling/dispatching added
Actions #33

Updated by Peter Amstutz about 5 years ago

Is there something throttling the number of nodes in TestDispatchToStubDriver? I'm wondering why it doesn't create 200 nodes right up front.

Actions #34

Updated by Tom Clegg about 5 years ago

Peter Amstutz wrote:

Is there something throttling the number of nodes in TestDispatchToStubDriver? I'm wondering why it doesn't create 200 nodes right up front.

Ooh, good one. This turned out to be a scheduler bug: it was only creating one instance per type at a time. Fixed in ee1161113168d709f0cb0513f192631c1f35b3ab.

14360-dispatch-cloud @ c6b8556dfb4f71c5ccc6041b48061fdb1c00c3b2

Actions #35

Updated by Peter Amstutz about 5 years ago

Peter Amstutz wrote:

Is there something throttling the number of nodes in TestDispatchToStubDriver? I'm wondering why it doesn't create 200 nodes right up front.

Having the stub cloud driver return a quota limit error at 50 or 100 nodes would also be interesting.

Actions #36

Updated by Peter Amstutz about 5 years ago

This is looking really good, but I have a comment about the atQuota logic.

It appears that when a quota error is returned, it starts a one-minute quiet period, during which no more Create requests are sent. After that quiet period is over, it will try again, which could lead to a stampede of create requests, all of which fail.

That's different from the current logic in node manager, which is (see the sketch after this list):

  1. if a quota error is returned, set node_quota = current number of nodes
  2. don't try to create a new node unless current number of nodes < node_quota
  3. the current number of nodes goes down from the usual behavior of shutting down idle nodes (a quota limit situation might also mean proactively shutting down idle nodes that would otherwise be allocated to lower priority jobs to make room for higher priority jobs)
  4. shutting down a node unblocks a slot and allows another create request to be issued (but only one at a time, which prevents a stampede of create requests)
  5. if a create request succeeds, increment node_quota, which allows another create request to be issued, etc., slowly probing until hitting the quota limit again.
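
A rough Go sketch of that probing logic (node manager itself is a separate codebase; every name here is hypothetical):

    package quota

    import "sync"

    // tracker sketches the behavior described above: cap creates at the last
    // observed quota, pin the cap on a quota error, free one slot per node
    // shutdown, and grow the cap after each successful create.
    type tracker struct {
        mtx       sync.Mutex
        nodeQuota int // current best guess at the provider's instance quota
        nodes     int // instances believed to exist right now
    }

    func newTracker(initialQuota int) *tracker {
        return &tracker{nodeQuota: initialQuota}
    }

    // canCreate reports whether another create request may be issued.
    func (t *tracker) canCreate() bool {
        t.mtx.Lock()
        defer t.mtx.Unlock()
        return t.nodes < t.nodeQuota
    }

    // createSucceeded records the new node and probes one slot past the cap.
    func (t *tracker) createSucceeded() {
        t.mtx.Lock()
        defer t.mtx.Unlock()
        t.nodes++
        if t.nodeQuota <= t.nodes {
            t.nodeQuota = t.nodes + 1
        }
    }

    // quotaError pins the cap at the current node count; no more creates
    // until nodeShutdown frees a slot.
    func (t *tracker) quotaError() {
        t.mtx.Lock()
        defer t.mtx.Unlock()
        t.nodeQuota = t.nodes
    }

    // nodeShutdown frees one slot, allowing a single new create request.
    func (t *tracker) nodeShutdown() {
        t.mtx.Lock()
        defer t.mtx.Unlock()
        t.nodes--
    }
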
Actions #37

Updated by Peter Amstutz about 5 years ago

From chat: discussion of the desired logic for quota is moved to #14325

Actions #38

Updated by Peter Amstutz about 5 years ago

I don't understand what's going on here:

INFO[0000] instance appeared in cloud                    Instance=stub-providertype4-3e6306563b26de InstanceType=type4 State=booting
...
INFO[0000] boot probe succeeded                          Command=true Instance=stub-providertype4-3e6306563b26de InstanceType=type4 stderr= stdout=
INFO[0000] instance booted; will try probeRunning        Instance=stub-providertype4-3e6306563b26de InstanceType=type4
...
INFO[0000] boot probe succeeded                          Command=true Instance=stub-providertype4-3e6306563b26de InstanceType=type4 stderr= stdout=
INFO[0000] instance booted; will try probeRunning        Instance=stub-providertype4-3e6306563b26de InstanceType=type4
...
INFO[0001] boot probe succeeded                          Command=true Instance=stub-providertype4-3e6306563b26de InstanceType=type4 stderr= stdout=
INFO[0001] instance booted; will try probeRunning        Instance=stub-providertype4-3e6306563b26de InstanceType=type4
...
INFO[0002] boot probe succeeded                          Command=true Instance=stub-providertype4-3e6306563b26de InstanceType=type4 stderr= stdout=
INFO[0002] instance booted; will try probeRunning        Instance=stub-providertype4-3e6306563b26de InstanceType=type4
...
INFO[0002] boot probe succeeded                          Command=true Instance=stub-providertype4-3e6306563b26de InstanceType=type4 stderr= stdout=
INFO[0002] instance booted; will try probeRunning        Instance=stub-providertype4-3e6306563b26de InstanceType=type4
...
INFO[0002] shutdown idle worker                          Age=2.376113504s Instance=stub-providertype4-3e6306563b26de InstanceType=type4
...
INFO[0002] instance disappeared in cloud                 Instance=stub-providertype4-3e6306563b26de WorkerState=shutdown

Crucially, there's no "probe failed" log message with this instance id, so either probeAndUpdate is blocked and being called again (seems unlikely, ProbeAndUpdate is supposed to prevent overlapping invocations) or it is failing in a way that neither produces an error message nor puts it in "Idle" state.

Actions #39

Updated by Peter Amstutz about 5 years ago

Also moved over to #14325

Please go ahead and merge to master.

Actions #40

Updated by Tom Clegg about 5 years ago

  • Status changed from In Progress to Resolved
  • % Done changed from 0 to 100
Actions #41

Updated by Tom Morris about 5 years ago

  • Release set to 15