Idea #14807

closed

[arvados-dispatch-cloud] Features/fixes needed before first production deploy

Added by Tom Clegg about 5 years ago. Updated almost 5 years ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: Crunch
Target version:
Start date: 01/29/2019
Due date:
Story points: -
Release relationship: Auto

Description

Issues encountered & fixed/worked around during dev deploy:
  • Include instance address (host or IP) in logs and management API responses
  • Ensure crunch-run --list works even if /var/lock is a symlink
  • Log full instance ID, not (Instance)String(), which might be an abbreviated name
  • Fix management API endpoints to allow specifying instance IDs that have slashes
  • Pass SSH public key to Azure so it doesn't crash (Azure refuses to create a node without adding an admin account)
  • Fix host part of SSH target address being dropped
  • Allow driver to specify a login username
  • Send ARVADOS_API_* values on stdin instead of environment vars (typical SSH server is configured to refuse these env vars)
  • If ProviderType is not given in an instance type in the cluster config, default to the type name (not the empty string)
  • Pass a random string to Azure driver as "node-token" (or fix Azure driver so it doesn't expect that)
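The ProviderType defaulting rule in the list above can be sketched roughly as follows. This is a minimal illustration, not the dispatcher's actual config loader: the `InstanceType` struct and the `applyDefaults` helper are hypothetical names, and the instance type names in `main` are only examples.

```go
package main

import "fmt"

// InstanceType is a hypothetical, simplified stand-in for an instance
// type entry in the cluster config.
type InstanceType struct {
	Name         string
	ProviderType string
}

// applyDefaults returns a copy of the configured instance types in which
// every entry with an empty ProviderType uses its own type name instead.
func applyDefaults(types map[string]InstanceType) map[string]InstanceType {
	out := make(map[string]InstanceType, len(types))
	for name, it := range types {
		it.Name = name
		if it.ProviderType == "" {
			// Default to the type name, not the empty string.
			it.ProviderType = name
		}
		out[name] = it
	}
	return out
}

func main() {
	types := applyDefaults(map[string]InstanceType{
		"Standard_D2s_v3": {},                               // no ProviderType given
		"small":           {ProviderType: "Standard_A1_v2"}, // explicit ProviderType
	})
	fmt.Println(types["Standard_D2s_v3"].ProviderType)
	fmt.Println(types["small"].ProviderType)
}
```

An explicit ProviderType is left untouched, so the default only applies to entries that omit the field.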
Further improvements necessary to run in production:
  • Send detached crunch-run stdout+stderr to systemd journal so sysadmin can make subsequent arrangements if needed
  • Metrics: total cost of nodes in idle or booting state
  • Metrics: total cost of nodes with admin-hold flag set
  • Log when an instance goes down unexpectedly (i.e., state != Shutdown when deleted from list)
  • Log when a container is added to or dropped from the queue
  • Obey logging format in cluster config file (as of #14325, HTTP request logs were JSON, operational logs were text)
  • Drain node if container process still running after several SIGTERM attempts
  • Provide a "mark node as broken" callback mechanism for crunch-run (drain node, unless it's already marked "hold" -- see #14807#note-20)
  • Configurable rate limit for Create and Destroy calls to cloud API (background: reaching API call rate limits can cause penalties; also, when multiple instance types are created concurrently, the cloud might create the lower-priority types but then reach quota before creating the higher-priority types; see #14360#note-36)
  • Metrics: number of containers, split by state and instance type
  • Load API host & token from cluster config file instead of env vars
  • Ensure crunch-run exits instead of hanging if ARVADOS_API_HOST/TOKEN is empty or broken
  • Kill containers (or at least log a warning) if a worker is kept busy by a container whose UUID does not exist according to the API server's queue (e.g., container deleted from database) #14977
  • "Kill instance now" management API
  • (Azure) error out if AddedScratch>0 because that isn't implemented yet
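The "instance goes down unexpectedly" check in the list above (state != Shutdown when deleted from list) might look something like the sketch below. The `State` type, its values, and the `unexpectedlyGone` helper are assumptions made for illustration and are not taken from the dispatcher's worker pool code.

```go
package main

import (
	"fmt"
	"sort"
)

// State is a hypothetical, simplified instance state.
type State string

const (
	StateBooting  State = "booting"
	StateIdle     State = "idle"
	StateRunning  State = "running"
	StateShutdown State = "shutdown"
)

// unexpectedlyGone returns the IDs of instances that were known to the
// dispatcher, are missing from the latest cloud listing, and were not in
// the shutdown state. These are the disappearances worth logging.
func unexpectedlyGone(known map[string]State, listed []string) []string {
	present := make(map[string]bool, len(listed))
	for _, id := range listed {
		present[id] = true
	}
	var gone []string
	for id, st := range known {
		if !present[id] && st != StateShutdown {
			gone = append(gone, id)
		}
	}
	sort.Strings(gone) // stable order for logging
	return gone
}

func main() {
	known := map[string]State{
		"i-1": StateRunning,  // vanished while running: unexpected
		"i-2": StateShutdown, // vanished after shutdown: expected
		"i-3": StateIdle,     // still listed
	}
	for _, id := range unexpectedlyGone(known, []string{"i-3"}) {
		fmt.Printf("instance %s disappeared unexpectedly\n", id)
	}
}
```

Comparing the previous known set against each fresh listing keeps the check independent of how the cloud driver reports deletions.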
Other improvements that were made here even though not necessary to run in production:
  • crunch-run --detach: send logs to journal
  • Move "cat .../node-token" host key verification mechanism out of Azure driver (instead, have the dispatcher do this itself if the driver returns cloud.ErrNotImplemented)

Dispatching containers to cloud VMs


Files

14807-fail-log.txt (3.17 MB), Peter Amstutz, 03/21/2019 06:02 PM

Subtasks 4 (0 open, 4 closed)

Task #14868: Review 14807-dispatch-cloud-fixes (Resolved, Peter Amstutz, 01/29/2019)
Task #14962: Review 14807-escalate-sigterm (Resolved, Peter Amstutz, 01/29/2019)
Task #15006: Review 14807-prod-blockers (Resolved, Peter Amstutz, 01/29/2019)
Task #15016: Review 14807-prod-blockers (last 2 items) (Resolved, Peter Amstutz, 01/29/2019)

Related issues

Related to Arvados - Idea #13908: [Epic] Replace SLURM for cloud job scheduling/dispatching (Resolved)
Related to Arvados - Bug #14844: [dispatch-cloud] Azure driver bugs discovered in trial run (Resolved, Peter Amstutz, 02/28/2019)
Related to Arvados - Bug #15045: [arvados-cloud-dispatch] commit 115cbd6482632c47fdcbbbe4abc9543e7e8e30ec breaks API host loading (Resolved)
Blocked by Arvados - Bug #14977: [arvados-dispatch-cloud] kill crunch-run procs for containers that are deleted or have state=Cancelled when dispatcher starts up (Resolved, Tom Clegg, 03/18/2019)
Follows Arvados - Feature #14325: [crunch-dispatch-cloud] Dispatch containers to cloud VMs directly, without slurm or nodemanager (Resolved, Tom Clegg, 01/28/2019)