Feature #14325

[crunch-dispatch-cloud] Dispatch containers to cloud VMs directly, without slurm or nodemanager

Added by Tom Clegg 3 months ago. Updated 7 days ago.

Status: In Progress
Priority: Normal
Assigned To: Tom Clegg
Category: -
Target version: 2019-01-30 Sprint
Start date:
Due date:
% Done: 0%
Estimated time: (Total: 0.00 h)
Story points: 4.0

Description

This issue covers the smallest version of Dispatching containers to cloud VMs that can be deployed on a dev cluster.

Background -- already done in #14360:
  • Bring up nodes and run containers on them
  • Structured logs for diagnostics+statistics: cloud API errors, node lifecycle, container lifecycle
  • HTTP status report with current set of containers (queued/running) and VMs (busy/idle) -- see Dispatching containers to cloud VMs "Operator view"
  • Shut down idle nodes automatically
  • Handle cloud API quota errors
  • Package-building changes are in place, but commented out
Requirements covered here:
  • Ops mechanism for draining a node (e.g., curl command using a management token) -- see Dispatching containers to cloud VMs "Management API"
  • Resource consumption metrics (number of instances, number of containers running, total hourly price of all existing VMs) -- see Dispatching containers to cloud VMs "Metrics"
  • Drain (not kill) instances that exist at startup, fail boot probe, but are already running containers -- see Dispatching containers to cloud VMs "Special cases / synchronizing state"
  • Configurable port number for connecting to VM SSH servers
  • Pass API host and dispatcher's token to crunch-run command via ARVADOS_API_* environment variables
  • Test SSH host key verification (dispatcher's token is not sent to a remote host unless the host's SSH key passes the VerifyHostKey() method provided by the cloud driver)
  • Test container.Queue using real railsAPI/controller
  • Test resuming state after restart (some instances are booting, some idle, some running containers, some draining, some on admin-hold)
  • Cancel container after some number of start/requeue cycles (i.e., crunch-run --detach succeeded, but child exited without moving container past Locked state)
  • Cancel container with no suitable instance type
  • Enable package build
  • Handle cloud API ratelimit errors (obey holdoff time returned by driver... incl. test)
  • Update management API response format (lowercase keys)
  • Confirm all probe failures are logged once instance is booted (see #14360#note-38, fixed in 7a047d8b6)
Requirements covered elsewhere:
  • One cloud vendor driver (Azure = #14324)
Non-requirements (can wait until after first dev deploy):
  • Update runtime_status field when cancelling containers
  • Ops mechanism for hold/release (add tags so hold state survives dispatcher restart)
  • Test activity/resource usage metrics
  • crunch-run --detach: retrieve stdout/stderr during probe, and show it in dispatcher logs
  • crunch-run --detach: cleanup old stdout/stderr
  • Clean up testing code -- eliminate LameInstanceSet in favor of test.StubDriver
  • Send SIGKILL if container process still running after several SIGTERM attempts / N seconds after first SIGTERM
  • Shut down node if container process still running after several SIGKILL attempts
  • "Broken node" hook
  • Multiple cloud drivers
  • Test suite that uses a real cloud provider
  • Prometheus metrics (containers in queue, time container queued before starting, workers in each state, etc.)
  • Periodic status reports in logs
  • Optimize worker VM deployment (for now, we still expect the operator to provide an image with a suitable version of crunch-run)
  • Configurable spending limits
  • Generic driver test suite
  • Cancel containers that don't get scheduled after some time limit (no nodes ever come up?)
  • Rate-limit or serialize creation of new instance types (when multiple instance types are created concurrently, the cloud might create the lower-priority types but then reach quota before creating the higher-priority types; see #14360#note-36)
Refs

Subtasks

Task #14664: Review (New, Peter Amstutz)


Related issues

Related to Arvados - Feature #14324: [crunch-dispatch-cloud] Azure driver (Resolved, 2019-01-09)

Related to Arvados - Bug #13964: crunch-dispatch-cloud spike (Resolved)

Related to Arvados - Story #13908: Replace SLURM for cloud job scheduling/dispatching (New)

Related to Arvados - Story #14360: [crunch-dispatch-cloud] Merge incomplete implementation (Resolved, 2018-10-26)

History

#1 Updated by Tom Clegg 3 months ago

  • Related to Feature #14324: [crunch-dispatch-cloud] Azure driver added

#2 Updated by Tom Clegg 3 months ago

  • Related to Bug #13964: crunch-dispatch-cloud spike added

#3 Updated by Tom Clegg 3 months ago

  • Related to Story #13908: Replace SLURM for cloud job scheduling/dispatching added

#4 Updated by Tom Clegg 3 months ago

  • Description updated (diff)

#5 Updated by Tom Clegg 3 months ago

  • Description updated (diff)

#6 Updated by Tom Clegg 3 months ago

  • Description updated (diff)

#7 Updated by Tom Clegg 3 months ago

  • Description updated (diff)

#8 Updated by Tom Clegg 3 months ago

  • Related to Story #14360: [crunch-dispatch-cloud] Merge incomplete implementation added

#9 Updated by Tom Clegg 2 months ago

  • Description updated (diff)

#10 Updated by Tom Clegg 2 months ago

  • Description updated (diff)

#11 Updated by Tom Clegg 2 months ago

  • Description updated (diff)

#12 Updated by Tom Clegg 2 months ago

  • Description updated (diff)

#13 Updated by Tom Clegg about 2 months ago

  • Description updated (diff)

#14 Updated by Tom Morris about 1 month ago

  • Target version set to To Be Groomed

#15 Updated by Tom Clegg about 1 month ago

  • Description updated (diff)

#16 Updated by Tom Clegg about 1 month ago

  • Description updated (diff)

#17 Updated by Tom Clegg about 1 month ago

  • Description updated (diff)
  • Target version deleted (To Be Groomed)

#18 Updated by Tom Clegg about 1 month ago

  • Description updated (diff)

#19 Updated by Tom Clegg about 1 month ago

  • Description updated (diff)

#20 Updated by Tom Clegg about 1 month ago

  • Description updated (diff)

#21 Updated by Tom Clegg about 1 month ago

  • Description updated (diff)

#22 Updated by Tom Clegg about 1 month ago

  • Description updated (diff)

#23 Updated by Tom Clegg about 1 month ago

  • Target version set to Arvados Future Sprints
  • Story points set to 4.0

#24 Updated by Peter Amstutz about 1 month ago

  • Description updated (diff)

#25 Updated by Peter Amstutz about 1 month ago

Management APIs should return {"items": [...]} not {"Items": [...]} for consistency with the Arvados API.
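
As an illustration of the key-casing point, Go's encoding/json uses the exported field name ("Items") as the key unless a struct tag overrides it. The sketch below assumes hypothetical response types (instanceView, listResponse) and field names; only the "items" key comes from the note above.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// instanceView is a hypothetical entry in a management API response.
type instanceView struct {
	Instance    string `json:"instance"`
	WorkerState string `json:"worker_state"`
}

// listResponse wraps the entries. Without the struct tag, the key would
// be encoded as "Items" rather than "items".
type listResponse struct {
	Items []instanceView `json:"items"`
}

// encodeList returns the JSON encoding of a list response.
func encodeList(items []instanceView) (string, error) {
	buf, err := json.Marshal(listResponse{Items: items})
	return string(buf), err
}

func main() {
	out, err := encodeList([]instanceView{{Instance: "example-instance-id", WorkerState: "idle"}})
	if err != nil {
		panic(err)
	}
	fmt.Println(out)
}
```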

#26 Updated by Peter Amstutz about 1 month ago

  • Description updated (diff)

#27 Updated by Tom Clegg about 1 month ago

  • Description updated (diff)

#28 Updated by Tom Clegg about 1 month ago

  • Description updated (diff)

#29 Updated by Tom Clegg about 1 month ago

  • Description updated (diff)

#30 Updated by Tom Clegg about 1 month ago

  • Status changed from New to In Progress
  • Assigned To set to Tom Clegg

#31 Updated by Tom Morris 21 days ago

  • Target version changed from Arvados Future Sprints to 2019-01-16 Sprint

#32 Updated by Tom Clegg 7 days ago

  • Target version changed from 2019-01-16 Sprint to 2019-01-30 Sprint
