Actions
Feature #14325
closed[crunch-dispatch-cloud] Dispatch containers to cloud VMs directly, without slurm or nodemanager
Story points:
1.0
Release:
Release relationship:
Auto
Description
This issue covers the smallest version of Dispatching containers to cloud VMs that can be deployed on a dev cluster.
Background -- already done in #14360:- Bring up nodes and run containers on them
- Structured logs for diagnostics+statistics: cloud API errors, node lifecycle, container lifecycle
- HTTP status report with current set of containers (queued/running) and VMs (busy/idle) -- see Dispatching containers to cloud VMs "Operator view"
- Shutdown idle nodes automatically
- Handle cloud API quota errors
- Package-building changes are in place, but commented out
- Ops mechanism for draining a node (e.g., curl command using a management token) -- see Dispatching containers to cloud VMs "Management API"
- Resource consumption metrics (number of instances, number of containers running, total hourly price of all existing VMs) -- see Dispatching containers to cloud VMs "Metrics"
- Drain (not kill) instances that exist at startup, fail boot probe, but are already running containers -- see Dispatching containers to cloud VMs "Special cases / synchronizing state"
- Configurable port number for connecting to VM SSH servers
- Pass API host and dispatcher's token to crunch-run command via
ARVADOS_API_*
environment variables - Test SSH host key verification (dispatcher's token is not sent to a remote host unless the host's SSH key passes the VerifyHostKey() method provided by the cloud driver)
- Test container.Queue using real railsAPI/controller
- Test resuming state after restart (some instances are booting, some idle, some running containers, some draining, some on admin-hold)
- Cancel container after some number of start/requeue cycles (i.e.,
crunch-run --detach
succeeded, but child exited without moving container past Locked state) - Cancel container with no suitable instance type
- Enable package build
- Handle cloud API ratelimit errors (obey holdoff time returned by driver... incl. test)
- Update management API response format (lowercase keys)
- Confirm all probe failures are logged once instance is booted (see #14360#note-38, fixed in 7a047d8b6)
Actions