Idea #15759

closed

[arvados-dispatch-cloud] deploy/run correct version of crunch-run binary on worker nodes

Added by Tom Clegg about 5 years ago. Updated almost 5 years ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: -
Target version:
Start date: 12/30/2019
Due date:
Story points: 3.0
Release relationship: Auto

Description

arvados-dispatch-cloud should automatically deploy a suitable crunch-run binary to each worker node, instead of expecting someone else to install it as part of the worker's OS image or boot script.

Currently, arvados-dispatch-cloud assumes the configured worker image includes a compatible version of crunch-run. This means the sysadmin typically builds/updates a custom worker image and updates the cluster configuration each time arvados-dispatch-cloud is installed/upgraded. Even if this is done correctly, results may be unpredictable when worker nodes are still alive and running the old image after an upgrade.

To avoid version mismatches and (in some cases) eliminate the need for custom worker images entirely, arvados-dispatch-cloud should:
  • have the ability to run as "crunch-run" (refactor crunch-run as a library so arvados-server can import it)
  • load its own executable (perhaps via /proc/self/exe)
  • copy itself to each worker node as part of the booting/readiness process
  • use the copied version instead of relying on the worker to have a matching version
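
The steps in the list above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual arvados-dispatch-cloud code: the names WriteFunc, loadSelf, runnerPath and deployRunner, the "crunch-run~<hash>" file naming, and the choice of SHA-256 are all hypothetical. WriteFunc stands in for whatever transport the dispatcher uses to reach a worker.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"os"
	"path"
)

// WriteFunc is a hypothetical stand-in for the dispatcher's transport to a
// worker node (for example, a command run over an SSH session).
type WriteFunc func(remotePath string, data []byte) error

// loadSelf reads the bytes of the currently running executable.
// /proc/self/exe is Linux-specific, as suggested in the description.
func loadSelf() ([]byte, error) {
	return os.ReadFile("/proc/self/exe")
}

// runnerPath derives a remote path that embeds the SHA-256 of the binary, so
// different versions never overwrite each other (assumed naming scheme).
func runnerPath(dir string, exe []byte) string {
	sum := sha256.Sum256(exe)
	return path.Join(dir, "crunch-run~"+hex.EncodeToString(sum[:]))
}

// deployRunner copies exe to the worker and returns the path the dispatcher
// should invoke instead of a crunch-run preinstalled in the worker image.
func deployRunner(write WriteFunc, dir string, exe []byte) (string, error) {
	p := runnerPath(dir, exe)
	if err := write(p, exe); err != nil {
		return "", fmt.Errorf("deploy to %s: %w", p, err)
	}
	return p, nil
}
```

In this sketch the dispatcher would call loadSelf once at startup and then call deployRunner for each worker during its boot/readiness probe, keeping the returned path for later crunch-run invocations on that worker.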

The "instance set ID" already ensures that a given worker is only accessed by a single dispatch process, so it shouldn't be necessary to accommodate races between dispatchers. However, for some extra insurance, crunch-run should accept an "expected version" hash on the command line, and error out if that doesn't match the hash of its own executable.
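
A minimal sketch of the "expected version" check on the crunch-run side might look like the following. The flag name -expected-version, the use of SHA-256, and the helper names hashHex, checkVersion and run are assumptions made for illustration; the description does not specify them.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"errors"
	"flag"
	"fmt"
	"os"
)

// errVersionMismatch is returned when the executable's hash differs from the
// hash the dispatcher expected.
var errVersionMismatch = errors.New("executable hash does not match expected version")

// hashHex returns the hex-encoded SHA-256 of data.
func hashHex(data []byte) string {
	sum := sha256.Sum256(data)
	return hex.EncodeToString(sum[:])
}

// checkVersion compares the hash of exe with expected.
// An empty expected value skips the check.
func checkVersion(exe []byte, expected string) error {
	if expected == "" {
		return nil
	}
	if got := hashHex(exe); got != expected {
		return fmt.Errorf("%w: have %s, want %s", errVersionMismatch, got, expected)
	}
	return nil
}

// run parses the hypothetical -expected-version flag and verifies the
// executable found at exePath (for example "/proc/self/exe" on Linux).
func run(args []string, exePath string) error {
	fs := flag.NewFlagSet("crunch-run", flag.ContinueOnError)
	expected := fs.String("expected-version", "", "expected hash of this executable")
	if err := fs.Parse(args); err != nil {
		return err
	}
	exe, err := os.ReadFile(exePath)
	if err != nil {
		return err
	}
	return checkVersion(exe, *expected)
}
```

With something like this in place, the dispatcher would pass the hash of the binary it deployed, and crunch-run would exit with an error before starting a container if it were somehow a different build.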


Subtasks 1 (0 open, 1 closed)

Task #15950: Review 15759-deploy-crunch-run (Resolved, Peter Amstutz, 12/30/2019)

Related issues

Related to Arvados - Idea #13908: [Epic] Replace SLURM for cloud job scheduling/dispatching (Resolved)
Related to Arvados - Bug #15734: [a-d-c] needs to populate node.json in the container log collection (Resolved, Tom Clegg, 10/22/2019)
Related to Arvados - Feature #12900: [Crunch2] [crunch-run] Prune old images before installing image for current container (New)
