Idea #15759

closed

[arvados-dispatch-cloud] deploy/run correct version of crunch-run binary on worker nodes

Added by Tom Clegg over 4 years ago. Updated about 4 years ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: -
Target version:
Story points: 3.0
Release relationship: Auto

Description

arvados-dispatch-cloud should automatically deploy a suitable crunch-run binary to each worker node, instead of expecting someone else to install it as part of the worker's OS image or boot script.

Currently, arvados-dispatch-cloud assumes the configured worker image includes a compatible version of crunch-run. This means the sysadmin typically builds/updates a custom worker image and updates the cluster configuration each time arvados-dispatch-cloud is installed/upgraded. Even if this is done correctly, results may be unpredictable when worker nodes are still alive and running the old image after an upgrade.

To avoid version mismatches and (in some cases) eliminate the need for custom worker images entirely, arvados-dispatch-cloud should
  • have the ability to run as "crunch-run" (refactor crunch-run as a library so arvados-server can import it)
  • load its own executable (perhaps via /proc/self/exe)
  • copy itself to each worker node as part of the booting/readiness process
  • use the copied version instead of relying on the worker to have a matching version

The "instance set ID" already ensures that a given worker is only accessed by a single dispatch process, so it shouldn't be necessary to accommodate races between dispatchers. However, for some extra insurance, crunch-run should accept an "expected version" hash on the command line, and error out if that doesn't match the hash of its own executable.
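
As an illustration of the proposed "expected version" check, here is a minimal sketch under stated assumptions: the flag name `-expected-version`, the helper `verifyHash`, and the choice of MD5 are all hypothetical and are not taken from crunch-run itself.

```go
package main

import (
	"crypto/md5"
	"flag"
	"fmt"
	"os"
)

// verifyHash (hypothetical helper) returns an error if the MD5 digest of
// data differs from expected. An empty expected value disables the check.
func verifyHash(data []byte, expected string) error {
	if expected == "" {
		return nil
	}
	actual := fmt.Sprintf("%x", md5.Sum(data))
	if actual != expected {
		return fmt.Errorf("executable hash %s does not match expected %s", actual, expected)
	}
	return nil
}

func main() {
	// Hypothetical flag name, used only for this sketch.
	expected := flag.String("expected-version", "", "hex MD5 the caller expects this binary to have")
	flag.Parse()

	exe, err := os.Executable()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	data, err := os.ReadFile(exe)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	if err := verifyHash(data, *expected); err != nil {
		// Refuse to run when the dispatcher expected a different build.
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("version check passed")
}
```

Running the binary without the flag skips the check, so callers that do not pass an expected hash would be unaffected.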


Subtasks 1 (0 open, 1 closed)

Task #15950: Review 15759-deploy-crunch-run | Resolved | Peter Amstutz | 12/30/2019

Related issues

Related to Arvados - Idea #13908: [Epic] Replace SLURM for cloud job scheduling/dispatching | Resolved
Related to Arvados - Bug #15734: [a-d-c] needs to populate node.json in the container log collection | Resolved | Tom Clegg | 10/22/2019
Related to Arvados - Feature #12900: [Crunch2] [crunch-run] Prune old images before installing image for current container | New
Actions #1

Updated by Tom Clegg over 4 years ago

  • Related to Idea #13908: [Epic] Replace SLURM for cloud job scheduling/dispatching added
Actions #2

Updated by Tom Clegg over 4 years ago

  • Related to Bug #15734: [a-d-c] needs to populate node.json in the container log collection added
Actions #3

Updated by Tom Clegg over 4 years ago

  • Target version changed from Arvados Future Sprints to To Be Groomed
Actions #4

Updated by Tom Clegg over 4 years ago

  • Related to Feature #12900: [Crunch2] [crunch-run] Prune old images before installing image for current container added
Actions #5

Updated by Tom Morris over 4 years ago

  • Target version changed from To Be Groomed to Arvados Future Sprints
  • Story points set to 3.0
Actions #6

Updated by Tom Clegg over 4 years ago

  • Target version changed from Arvados Future Sprints to 2020-01-15 Sprint
  • Assigned To set to Tom Clegg
Actions #7

Updated by Tom Clegg over 4 years ago

  • Status changed from New to In Progress
Actions #8

Updated by Tom Clegg over 4 years ago

dd9367afefff5d0cd38d1549e32e2794e4614fb4-dev on su92l:

Started arvados-dispatch-cloud.
{"N":0,"PID":85591,"level":"info","msg":"loaded initial instance list","time":"2019-12-30T16:25:51.919157692Z"}
{"PID":85591,"level":"info","msg":"FixStaleLocks finished (218.951827ms), starting scheduling.","time":"2019-12-30T16:25:51.919257195Z"}
{"ContainerUUID":"su92l-dz642-c770f80we1flli6","InstanceType":"Standard_DS1_v2","PID":85591,"Priority":1124322183683972,"State":"Queued","level":"info","msg":"adding container to queue","time":"2019-12-30T16:25:58.908458085Z"}
{"ContainerUUID":"su92l-dz642-c770f80we1flli6","InstanceType":"Standard_DS1_v2","PID":85591,"level":"info","msg":"creating new instance","time":"2019-12-30T16:25:59.009299767Z"}
{"Address":"10.28.64.17","IdleBehavior":"run","Instance":"/subscriptions/3fa048dc-aa38-4820-85ba-68498da5f26b/resourceGroups/su92l/providers/Microsoft.Compute/virtualMachines/compute-f51710e302afe4aef4a97c634a7c2ed3-u0ng37y6mfppwlu","InstanceType":"Standard_DS1_v2","PID":85591,"State":"booting","level":"info","msg":"instance appeared in cloud","time":"2019-12-30T16:26:52.060385944Z"}
{"Address":"10.28.64.17","Command":"/bin/ls /arvados-compute-node-boot.complete  \u003e/dev/null 2\u003e\u00261","Instance":"/subscriptions/3fa048dc-aa38-4820-85ba-68498da5f26b/resourceGroups/su92l/providers/Microsoft.Compute/virtualMachines/compute-f51710e302afe4aef4a97c634a7c2ed3-u0ng37y6mfppwlu","InstanceType":"Standard_DS1_v2","PID":85591,"level":"info","msg":"boot probe succeeded","stderr":"","stdout":"","time":"2019-12-30T16:27:04.145503815Z"}
{"Address":"10.28.64.17","Instance":"/subscriptions/3fa048dc-aa38-4820-85ba-68498da5f26b/resourceGroups/su92l/providers/Microsoft.Compute/virtualMachines/compute-f51710e302afe4aef4a97c634a7c2ed3-u0ng37y6mfppwlu","InstanceType":"Standard_DS1_v2","PID":85591,"cmd":"sudo sh -c 'set -e; dstdir=\"/var/lib/arvados/\"; dstfile=\"/var/lib/arvados/crunch-run~70761fb034f6b8633803f649e6da8acc\"; mkdir -p \"$dstdir\"; touch \"$dstfile\"; chmod 0755 \"$dstdir\" \"$dstfile\"; cat \u003e\"$dstfile\"'","hash":"70761fb034f6b8633803f649e6da8acc","level":"info","msg":"installing runner binary on worker","path":"/var/lib/arvados/crunch-run~70761fb034f6b8633803f649e6da8acc","time":"2019-12-30T16:27:04.149045430Z"}
{"Address":"10.28.64.17","Instance":"/subscriptions/3fa048dc-aa38-4820-85ba-68498da5f26b/resourceGroups/su92l/providers/Microsoft.Compute/virtualMachines/compute-f51710e302afe4aef4a97c634a7c2ed3-u0ng37y6mfppwlu","InstanceType":"Standard_DS1_v2","PID":85591,"ProbeStart":"2019-12-30T16:27:01.751685496Z","level":"info","msg":"instance booted; will try probeRunning","time":"2019-12-30T16:27:04.610927564Z"}
{"Address":"10.28.64.17","Instance":"/subscriptions/3fa048dc-aa38-4820-85ba-68498da5f26b/resourceGroups/su92l/providers/Microsoft.Compute/virtualMachines/compute-f51710e302afe4aef4a97c634a7c2ed3-u0ng37y6mfppwlu","InstanceType":"Standard_DS1_v2","PID":85591,"ProbeStart":"2019-12-30T16:27:01.751685496Z","RunningContainers":0,"State":"idle","level":"info","msg":"probes succeeded, instance is in service","time":"2019-12-30T16:27:04.632272459Z"}
{"Address":"10.28.64.17","ContainerUUID":"su92l-dz642-c770f80we1flli6","Instance":"/subscriptions/3fa048dc-aa38-4820-85ba-68498da5f26b/resourceGroups/su92l/providers/Microsoft.Compute/virtualMachines/compute-f51710e302afe4aef4a97c634a7c2ed3-u0ng37y6mfppwlu","InstanceType":"Standard_DS1_v2","PID":85591,"level":"info","msg":"crunch-run process started","time":"2019-12-30T16:27:05.634698188Z"}
{"Address":"10.28.64.17","ContainerUUID":"su92l-dz642-c770f80we1flli6","Instance":"/subscriptions/3fa048dc-aa38-4820-85ba-68498da5f26b/resourceGroups/su92l/providers/Microsoft.Compute/virtualMachines/compute-f51710e302afe4aef4a97c634a7c2ed3-u0ng37y6mfppwlu","InstanceType":"Standard_DS1_v2","PID":85591,"Reason":"state=Complete","level":"info","msg":"killing crunch-run process","time":"2019-12-30T16:27:38.769700849Z"}
{"ContainerUUID":"su92l-dz642-c770f80we1flli6","PID":85591,"State":"Complete","level":"info","msg":"dropping container from queue","time":"2019-12-30T16:27:39.746299838Z"}
{"Address":"10.28.64.17","ContainerUUID":"su92l-dz642-c770f80we1flli6","Instance":"/subscriptions/3fa048dc-aa38-4820-85ba-68498da5f26b/resourceGroups/su92l/providers/Microsoft.Compute/virtualMachines/compute-f51710e302afe4aef4a97c634a7c2ed3-u0ng37y6mfppwlu","InstanceType":"Standard_DS1_v2","PID":85591,"level":"info","msg":"crunch-run process ended","time":"2019-12-30T16:27:41.774051843Z"}
{"Address":"10.28.64.17","IdleBehavior":"run","IdleDuration":129.977703,"Instance":"/subscriptions/3fa048dc-aa38-4820-85ba-68498da5f26b/resourceGroups/su92l/providers/Microsoft.Compute/virtualMachines/compute-f51710e302afe4aef4a97c634a7c2ed3-u0ng37y6mfppwlu","InstanceType":"Standard_DS1_v2","PID":85591,"State":"idle","level":"info","msg":"shutdown worker","time":"2019-12-30T16:29:51.751756102Z"}
{"PID":85591,"level":"info","msg":"Will delete compute-f51710e302afe4aef4a97c634a7c2ed3-u0ng37y6mfppwlu-nic because it is older than 20s","time":"2019-12-30T16:30:52.596329941Z"}
{"Instance":"/subscriptions/3fa048dc-aa38-4820-85ba-68498da5f26b/resourceGroups/su92l/providers/Microsoft.Compute/virtualMachines/compute-f51710e302afe4aef4a97c634a7c2ed3-u0ng37y6mfppwlu","PID":85591,"WorkerState":"shutdown","level":"info","msg":"instance disappeared in cloud","time":"2019-12-30T16:30:52.673688560Z"}
{"PID":85591,"level":"info","msg":"Deleted NIC compute-f51710e302afe4aef4a97c634a7c2ed3-u0ng37y6mfppwlu-nic","time":"2019-12-30T16:31:02.780823195Z"}
{"PID":85591,"level":"info","msg":"Blob compute-f51710e302afe4aef4a97c634a7c2ed3-u0ng37y6mfppwlu-os.vhd is unlocked and not modified for 319.631156866 seconds, will delete","time":"2019-12-30T16:35:51.674151866Z"}
{"PID":85591,"level":"info","msg":"Deleted blob compute-f51710e302afe4aef4a97c634a7c2ed3-u0ng37y6mfppwlu-os.vhd","time":"2019-12-30T16:35:51.849445174Z"}

container log:

2019-12-30T16:27:06.292289754Z crunch-run dd9367afefff5d0cd38d1549e32e2794e4614fb4-dev (go1.13.4) started
2019-12-30T16:27:06.292985680Z Executing container 'su92l-dz642-c770f80we1flli6'
2019-12-30T16:27:06.293197858Z Executing on host 'compute-f51710e302afe4aef4a97c634a7c2ed3-u0ng37y6mfppwlu'
...

15759-deploy-crunch-run @ dd9367afefff5d0cd38d1549e32e2794e4614fb4 -- developer-run-tests: #1701

Actions #9

Updated by Tom Clegg over 4 years ago

> crunch-run should accept an "expected version" hash on the command line, and error out if that doesn't match the hash of its own executable.

As implemented, the dispatcher writes the binary to "/var/lib/arvados/crunch-run~${md5}". Given that, having crunch-run check its own md5sum seems superfluous, so I didn't bother adding that.
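
The naming scheme can be sketched as follows. This is reconstructed from the "installing runner binary on worker" log entry earlier in this ticket, not from the dispatcher source; `installPlan` is a hypothetical function name, and the shell command simply mirrors the `cmd` field in that log entry.

```go
package main

import (
	"fmt"
	"path"
)

// installPlan (hypothetical helper) returns the destination path for a
// runner binary with the given hash, plus a shell command that reads the
// binary from stdin and writes it to that path. The command text mirrors
// the "cmd" field of the log entry above. Inputs are not shell-escaped,
// so this sketch is only suitable for trusted directory and hash values.
func installPlan(dstdir, hash string) (dstfile, cmd string) {
	dstfile = path.Join(dstdir, "crunch-run~"+hash)
	cmd = fmt.Sprintf(
		`sudo sh -c 'set -e; dstdir="%s"; dstfile="%s"; mkdir -p "$dstdir"; touch "$dstfile"; chmod 0755 "$dstdir" "$dstfile"; cat >"$dstfile"'`,
		dstdir, dstfile)
	return dstfile, cmd
}

func main() {
	// Directory and hash values taken from the log entry above.
	dstfile, cmd := installPlan("/var/lib/arvados/", "70761fb034f6b8633803f649e6da8acc")
	fmt.Println(dstfile)
	fmt.Println(cmd)
}
```

Because the hash is part of the file name, a dispatcher running a different build would write to and invoke a different path, which is why a separate self-check in crunch-run adds little.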

Actions #10

Updated by Peter Amstutz over 4 years ago

This needs a documentation update. The branch should be merged with or rebased onto master, since the new documentation has now been merged.

Actions #11

Updated by Anonymous over 4 years ago

  • % Done changed from 0 to 100
  • Status changed from In Progress to Resolved
Actions #12

Updated by Peter Amstutz over 4 years ago

As discussed on Gitter, we don't want to complicate the "set up a compute node image" documentation.

LGTM.

Actions #13

Updated by Peter Amstutz over 4 years ago

  • Target version changed from 2020-01-15 Sprint to 2020-01-02 Sprint
Actions #14

Updated by Peter Amstutz about 4 years ago

  • Release set to 22