Story #14807

Updated by Tom Clegg over 2 years ago

Issues encountered & fixed/worked around during dev deploy:
* Include instance address (host or IP) in logs and management API responses
* Ensure @crunch-run --list@ works even if /var/lock is a symlink
* Log full instance ID, not (Instance)String(), which might be an abbreviated name
* Fix management API endpoints to allow specifying instance IDs that have slashes
* Pass SSH public key to Azure so it doesn't crash (Azure refuses to create a node without adding an admin account)
* Fix host part of SSH target address being dropped
* Allow driver to specify a login username
* Send ARVADOS_API_* values on stdin instead of environment vars (typical SSH server is configured to refuse these env vars)
* If ProviderType is not given in an instance type in the cluster config, default to the type name (not the empty string)
* Pass a random string to Azure driver as "node-token" (or fix Azure driver so it doesn't expect that)
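
The ProviderType default described above can be sketched as follows. This is a minimal illustration, not the dispatcher's actual config loader: @InstanceType@ here is a hypothetical, trimmed-down struct and @applyProviderTypeDefaults@ is an assumed helper name.

```go
package main

// InstanceType is a hypothetical, trimmed-down stand-in for an
// instance type entry in the cluster config. Only the two fields
// relevant to the defaulting rule are shown.
type InstanceType struct {
	Name         string
	ProviderType string
}

// applyProviderTypeDefaults returns a copy of the given instance
// types, keyed by type name, in which any entry with an empty
// ProviderType uses its own type name instead of the empty string.
func applyProviderTypeDefaults(types map[string]InstanceType) map[string]InstanceType {
	out := make(map[string]InstanceType, len(types))
	for name, it := range types {
		it.Name = name
		if it.ProviderType == "" {
			// Default to the type name, not "".
			it.ProviderType = name
		}
		out[name] = it
	}
	return out
}
```

An explicitly configured ProviderType is left untouched; only the empty case is filled in.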

Further improvements necessary to run in production:
* Send SIGKILL if container process still running after several SIGTERM attempts / N seconds after first SIGTERM
* Shut down node if container process still running after several SIGKILL attempts
* Propagate configured "check for broken node" script name to crunch-run
* Send detached crunch-run stdout+stderr to systemd journal so sysadmin can make subsequent arrangements if needed
* Configurable rate limit for Create and Destroy calls to cloud API (background: reaching API call rate limits can cause penalties; also, when multiple instance types are created concurrently, the cloud might create the lower-priority types but then reach quota before creating the higher-priority types; see #14360#note-36)
* Metrics: total cost of nodes in idle or booting state
* Metrics: total cost of nodes with admin-hold flag set
* Metrics: number of containers, split by state (and instance type?)
* Log when an instance goes down unexpectedly (i.e., state != Shutdown when deleted from list)
* Log when a container is added to or dropped from the queue
* Obey logging format in cluster config file (as of #14325, HTTP request logs were JSON, operational logs were text)
* Load API host & token from cluster config file instead of env vars
* Ensure crunch-run exits instead of hanging if ARVADOS_API_HOST/TOKEN is empty or broken

Improvements that are desired, but not necessary to run in production (noted here for clarity until they move to their own tickets):
* crunch-run --detach: retrieve stdout/stderr during probe, and show it in dispatcher logs
* crunch-run --detach: clean up old stdout/stderr
* Metrics that indicate cloud failure (time we’ve spent trying but failing to create a new instance)
* Test suite that uses a real cloud provider
* Test activity/resource usage metrics
* Multiple cloud drivers
* Generic driver test suite
* Performance metrics for dispatching (e.g., time between seeing a container in the queue and starting its crunch-run process on a worker)
* Optimize worker VM deployment (e.g., automatically install a matching version of crunch-run on each worker)
* Configurable spending limits
* Update runtime_status field when cancelling containers after crunch-run crashes or the cloud VM dies without finalizing the container (already done for the “no suitable instance type” case)
* If present, use VM image ID given in runtime_constraints instead of image ID from cluster config file
* (API) Allow admin users to specify image ID in runtime_constraints
* Metrics: count unexpected shutdowns, split by instance type
* Atomically install correct version of crunch-run (perhaps /proc/self/exe) to worker VM as part of boot probe
* Move "cat .../node-token" host key verification mechanism out of Azure driver (instead, have the dispatcher do this itself if the driver returns cloud.ErrNotImplemented)

Improvements that might never be implemented at all (noted here for clarity):
* Periodic status reports in logs. This kind of logging should normally (always?) be handled by an external monitoring system that connects to the existing metrics endpoint.
* Cancel containers that take longer than a configurable time limit to schedule (e.g., no nodes ever come up). Unsure whether this is useful: maybe containers should just stay queued until the problem is fixed.
