Story #14807

[crunch-dispatch-cloud] Features/fixes needed before first production deploy

Added by Tom Clegg 16 days ago. Updated about 2 hours ago.

Status: In Progress
Priority: Normal
Assigned To:
Category: Crunch
Target version:
Start date: 01/29/2019
Due date:
% Done: 0%
Estimated time: (Total: 0.00 h)
Story points: -

Description

Issues encountered & fixed/worked around during dev deploy:
  • Include instance address (host or IP) in logs and management API responses
  • Ensure crunch-run --list works even if /var/lock is a symlink
  • Log full instance ID, not (Instance)String(), which might be an abbreviated name
  • Fix management API endpoints to allow specifying instance IDs that have slashes
  • Pass SSH public key to Azure so it doesn't crash (Azure refuses to create a node without adding an admin account)
  • Fix host part of SSH target address being dropped
  • Allow driver to specify a login username
  • Send ARVADOS_API_* values on stdin instead of environment vars (typical SSH server is configured to refuse these env vars)
  • If ProviderType is not given in an instance type in the cluster config, default to the type name (not the empty string)
  • Pass a random string to Azure driver as "node-token" (or fix Azure driver so it doesn't expect that)
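The ProviderType fallback in the list above can be sketched as follows. This is only an illustration: the `InstanceType` struct and `applyDefaults` helper are hypothetical names, not the actual cluster config types.

```go
package main

import "fmt"

// InstanceType is a simplified, hypothetical stand-in for an instance type
// entry in the cluster config.
type InstanceType struct {
	Name         string
	ProviderType string
}

// applyDefaults returns a copy of types in which every entry has its Name set
// to its map key, and an empty ProviderType replaced by that same name
// (instead of being left as the empty string).
func applyDefaults(types map[string]InstanceType) map[string]InstanceType {
	out := make(map[string]InstanceType, len(types))
	for name, it := range types {
		it.Name = name
		if it.ProviderType == "" {
			it.ProviderType = name
		}
		out[name] = it
	}
	return out
}

func main() {
	types := applyDefaults(map[string]InstanceType{
		"Standard_D2s_v3": {},                            // no ProviderType given
		"small":           {ProviderType: "Standard_A1"}, // explicit ProviderType
	})
	fmt.Println(types["Standard_D2s_v3"].ProviderType)
	fmt.Println(types["small"].ProviderType)
}
```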
Further improvements necessary to run in production:
  • Send SIGKILL if container process still running after several SIGTERM attempts / N seconds after first SIGTERM
  • Shut down node if container process still running after several SIGKILL attempts
  • Propagate configured "check for broken node" script name to crunch-run
  • Send detached crunch-run stdout+stderr to systemd journal so sysadmin can make subsequent arrangements if needed
  • Configurable rate limit for Create and Destroy calls to cloud API (background: reaching API call rate limits can cause penalties; also, when multiple instance types are created concurrently, the cloud might create the lower-priority types but then reach quota before creating the higher-priority types; see #14360#note-36)
  • Metrics: total cost of nodes in idle or booting state
  • Metrics: total cost of nodes with admin-hold flag set
  • Metrics: number of containers, split by state and instance type
  • Log when an instance goes down unexpectedly (i.e., state != Shutdown when deleted from list)
  • Log when a container is added to or dropped from the queue
  • Obey logging format in cluster config file (as of #14325, HTTP request logs were JSON, operational logs were text)
  • Load API host & token from cluster config file instead of env vars
  • Ensure crunch-run exits instead of hanging if ARVADOS_API_HOST/TOKEN is empty or broken
  • Kill containers (or at least log a warning) if a worker is kept busy by a container whose UUID does not exist according to the API server's queue (e.g., container deleted from database)
Improvements that are desired, but not necessary to run in production (noted here for clarity until they move to their own tickets):
  • crunch-run --detach: retrieve stdout/stderr during probe, and show it in dispatcher logs (logs go to journal instead)
  • crunch-run --detach: cleanup old stdout/stderr (logs go to journal instead)
  • Metrics that indicate cloud failure (time we’ve spent trying but failing to create a new instance)
  • Test suite that uses a real cloud provider
  • Test activity/resource usage metrics
  • Multiple cloud drivers
  • Generic driver test suite
  • Performance metrics for dispatching (e.g., time between seeing a container in the queue and starting its crunch-run process on a worker)
  • Optimize worker VM deployment (e.g., automatically install a matching version of crunch-run on each worker)
  • Configurable spending limits
  • Update runtime_status field when cancelling containers after crunch-run crashes or the cloud VM dies without finalizing the container (already done for the “no suitable instance type” case)
  • If present, use VM image ID given in runtime_constraints instead of image ID from cluster config file
  • (API) Allow admin users to specify image ID in runtime_constraints
  • Metrics: count unexpected shutdowns, split by instance type
  • Atomically install correct version of crunch-run (perhaps /proc/self/exe) to worker VM as part of boot probe
  • Move "cat .../node-token" host key verification mechanism out of Azure driver (instead, have the dispatcher do this itself if the driver returns cloud.ErrNotImplemented)
Improvements that might never be implemented at all (noted here for clarity):
  • Periodic status reports in logs. This kind of logging should normally (always?) be handled by an external monitoring system that connects to the existing metrics endpoint.
  • Cancel containers that take longer than a configurable time limit to schedule (e.g., no nodes ever come up). Unsure whether this is useful: maybe containers should just stay queued until the problem is fixed.

Dispatching containers to cloud VMs


Subtasks

Task #14868: Review 14807-dispatch-cloud-fixes (In Progress, Peter Amstutz)


Related issues

Related to Arvados - Story #13908: Replace SLURM for cloud job scheduling/dispatching (New)

Related to Arvados - Bug #14844: [dispatch-cloud] Azure driver bugs discovered in trial run (New)

Follows Arvados - Feature #14325: [crunch-dispatch-cloud] Dispatch containers to cloud VMs directly, without slurm or nodemanager (Resolved, 2019-01-28)

History

#1 Updated by Tom Clegg 16 days ago

  • Related to Story #13908: Replace SLURM for cloud job scheduling/dispatching added

#2 Updated by Tom Clegg 16 days ago

  • Due date set to 01/29/2019
  • Start date set to 01/29/2019
  • Follows Feature #14325: [crunch-dispatch-cloud] Dispatch containers to cloud VMs directly, without slurm or nodemanager added

#3 Updated by Tom Clegg 14 days ago

  • Description updated (diff)

#4 Updated by Tom Clegg 14 days ago

  • Description updated (diff)

#5 Updated by Tom Clegg 8 days ago

  • Description updated (diff)

#6 Updated by Tom Clegg 6 days ago

  • Description updated (diff)

#7 Updated by Tom Clegg 6 days ago

14807-dispatch-cloud-fixes @ 16589cd93e6780db6a07d7cc110724dea4c19e3e fixes a number of issues we ran into while deploying #14325 on a dev cluster.
  • 16589cd93 14807: Include more detail in errors.
  • 2873d55ea 14807: Fix crunch-run --list output when /var/lock is a symlink.
  • 3a1c03950 14807: Always set node-token tag.
  • 6e4237de7 14807: Log full instance ID.
  • d63c18fb8 14807: Expose instance IP addresses in logs and management API.
  • 55c07b5b9 14807: Fix SSH target address.
  • 286e41383 14807: Accept .../instances/_/drain?instance_id=X.
  • 554d1808c 14807: Pass SSH public key to driver.
  • e57e3e19b 14807: Allow driver to specify SSH username.
  • 45113a215 14807: When ProviderType is unspecified, default to Arvados type.
  • ddcb0fb32 14807: Pass env vars on stdin instead of using SSH feature.
  • ae31f1897 14807: Match systemd description to component name.

This will need to be rebased after #14745 merges, though.

#8 Updated by Tom Clegg 6 days ago

  • Description updated (diff)

#9 Updated by Tom Clegg 5 days ago

  • Status changed from New to In Progress
Rebased.
  • 36e1f63fd 14807: Send detached crunch-run logs to journal via systemd-cat.
  • 970af93af 14807: Remove errant rm.
  • 79693e508 14807: Update API endpoints: instance_id is always a query param.
  • e1e0f6789 14807: Log idle time in seconds instead of nanoseconds.
  • 87bf45c8c 14807: Cancel or requeue container when priority drops to zero.
  • 91b39ff3f 14807: Use context to pass a suitable logger to all service commands.
  • 80c48b78f 14807: Log when a container is added/removed from the queue.
  • abd21f165 14807: Split instance count/size/cost metrics by idle/hold status.
  • d6fbaeba4 14807: Fix up azure log message.
  • 30ca2a11c 14807: Move secret-tag host key verify mechanism out of Azure driver.
  • 3d662ef38 14807: Don't delete existing tags when updating.
  • f6d551a68 14807: Load API host/token directly from stdin without shell hack.
  • 3d7b91541 14807: Wait at least 1 second between retries on initial queue poll.
  • d3cef2f89 14807: Include more detail in errors.
  • de9a5e270 14807: Fix crunch-run --list output when /var/lock is a symlink.
  • 0de109fe6 14807: Always set node-token tag.
  • e96f8774c 14807: Log full instance ID.
  • 832235d35 14807: Expose instance IP addresses in logs and management API.
  • 601eeec89 14807: Fix SSH target address.
  • 97a1babd7 14807: Accept .../instances/_/drain?instance_id=X.
  • c4c77dc1e 14807: Pass SSH public key to driver.
  • efe3cb087 14807: Allow driver to specify SSH username.
  • ed317e6b2 14807: When ProviderType is unspecified, default to Arvados type.
  • d2bdc5af9 14807: Pass env vars on stdin instead of using SSH feature.
  • bcabab96d 14807: Match systemd description to component name.

#10 Updated by Tom Clegg 1 day ago

  • Description updated (diff)

#11 Updated by Tom Clegg about 24 hours ago

  • Description updated (diff)

Addressed in this branch:

  • Include instance address (host or IP) in logs and management API responses
  • Ensure crunch-run --list works even if /var/lock is a symlink
  • Log full instance ID, not (Instance)String(), which might be an abbreviated name
  • Fix management API endpoints to allow specifying instance IDs that have slashes
  • Pass SSH public key to Azure so it doesn't crash (Azure refuses to create a node without adding an admin account)
  • Fix host part of SSH target address being dropped
  • Allow driver to specify a login username
  • Send ARVADOS_API_* values on stdin instead of environment vars (typical SSH server is configured to refuse these env vars)
  • If ProviderType is not given in an instance type in the cluster config, default to the type name (not the empty string)
  • Pass a random string to Azure driver as "node-token" (or fix Azure driver so it doesn't expect that)

(The "node-token" mechanism has been moved out of the Azure driver entirely; see below.)

  • Send detached crunch-run stdout+stderr to systemd journal so sysadmin can make subsequent arrangements if needed
  • Metrics: total cost of nodes in idle or booting state
  • Metrics: total cost of nodes with admin-hold flag set
  • Log when an instance goes down unexpectedly (i.e., state != Shutdown when deleted from list)
  • Log when a container is added to or dropped from the queue
  • Obey logging format in cluster config file (as of #14325, HTTP request logs were JSON, operational logs were text)
  • Move "cat .../node-token" host key verification mechanism out of Azure driver (instead, have the dispatcher do this itself if the driver returns cloud.ErrNotImplemented)

Still left to do:

  • Load API host & token from cluster config file instead of env vars
  • Ensure crunch-run exits instead of hanging if ARVADOS_API_HOST/TOKEN is empty or broken
  • Configurable rate limit for Create and Destroy calls to cloud API (background: reaching API call rate limits can cause penalties; also, when multiple instance types are created concurrently, the cloud might create the lower-priority types but then reach quota before creating the higher-priority types; see #14360#note-36)
  • Send SIGKILL if container process still running after several SIGTERM attempts / N seconds after first SIGTERM
  • Shut down node if container process still running after several SIGKILL attempts
  • Propagate configured "check for broken node" script name to crunch-run
  • Metrics: number of containers, split by state (and instance type?)
Desired, but not necessary to run in production:
  • Metrics that indicate cloud failure (time we’ve spent trying but failing to create a new instance)
  • Test suite that uses a real cloud provider
  • Test activity/resource usage metrics
  • Multiple cloud drivers
  • Generic driver test suite
  • Performance metrics for dispatching (e.g., time between seeing a container in the queue and starting its crunch-run process on a worker)
  • Optimize worker VM deployment (e.g., automatically install a matching version of crunch-run on each worker)
  • Configurable spending limits
  • Update runtime_status field when cancelling containers after crunch-run crashes or the cloud VM dies without finalizing the container (already done for the “no suitable instance type” case)
  • If present, use VM image ID given in runtime_constraints instead of image ID from cluster config file
  • (API) Allow admin users to specify image ID in runtime_constraints
  • Metrics: count unexpected shutdowns, split by instance type
  • Atomically install correct version of crunch-run (perhaps /proc/self/exe) to worker VM as part of boot probe

#12 Updated by Tom Clegg about 22 hours ago

  • Description updated (diff)

#13 Updated by Tom Clegg about 14 hours ago

  • Assigned To set to Tom Clegg
  • Target version changed from To Be Groomed to 2019-02-27 Sprint

#15 Updated by Tom Clegg about 5 hours ago

  • Description updated (diff)

#16 Updated by Tom Clegg about 2 hours ago

  • Related to Bug #14844: [dispatch-cloud] Azure driver bugs discovered in trial run added
