Story #13908

[Epic] Replace SLURM for cloud job scheduling/dispatching

Added by Tom Morris about 1 year ago. Updated 2 months ago.

Status: New
Priority: Normal
Assigned To: -
Category: -
Target version:
Start date:
Due date:
% Done: 0%
Estimated time:
Story points: -
Release relationship: Auto

Description

See Dispatching containers to cloud VMs

Outstanding TODOs not covered by a linked ticket:
  • Integration test that uses a loopback driver to execute crunch-run on localhost (this verifies the interface between dispatcher and crunch-run)
  • Add tests for activity/resource usage metrics
  • Performance metrics for dispatching (e.g., time between seeing a container in the queue and starting its crunch-run process on a worker), see Dispatching containers to cloud VMs
  • Cloud behavior metrics: count unexpected shutdowns, split by instance type
  • Configurable spending limits
  • Update runtime_status field when cancelling containers after crunch-run crashes or the cloud VM dies without finalizing the container (already done for the “no suitable instance type” case)
  • (API) Allow admin users to specify image ID in runtime_constraints; (dispatcher) if present, use runtime_constraints image ID instead of image ID from cluster config file
  • Atomically install correct version of crunch-run (perhaps /proc/self/exe) to worker VM as part of boot probe
  • Run crunch-run as a non-root user
  • Don't require root at all on the cloud instance
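
The two metrics items above (dispatch latency and unexpected shutdowns per instance type) could be prototyped roughly as follows. This is only a sketch under stated assumptions: `DispatchMetrics` and its method names are hypothetical, timestamps are passed in explicitly, and nothing here reflects the dispatcher's actual metrics code or endpoint.

```python
from collections import Counter
from typing import Dict, List, Optional


class DispatchMetrics:
    """Hypothetical in-memory recorder (illustration only, not the real dispatcher API)."""

    def __init__(self) -> None:
        self._first_seen: Dict[str, float] = {}
        self.start_latencies: List[float] = []
        self.unexpected_shutdowns: Counter = Counter()

    def container_seen(self, uuid: str, now: float) -> None:
        # Keep only the first sighting so repeated queue polls do not reset the clock.
        self._first_seen.setdefault(uuid, now)

    def crunch_run_started(self, uuid: str, now: float) -> Optional[float]:
        # Latency = time from first seeing the container in the queue
        # to starting its crunch-run process on a worker.
        seen = self._first_seen.pop(uuid, None)
        if seen is None:
            return None  # container was never observed in the queue
        latency = now - seen
        self.start_latencies.append(latency)
        return latency

    def instance_shutdown(self, instance_type: str, expected: bool) -> None:
        # Count only shutdowns the dispatcher did not request, split by instance type.
        if not expected:
            self.unexpected_shutdowns[instance_type] += 1
```

A caller would invoke `container_seen` on each queue poll, `crunch_run_started` once the process is confirmed running, and `instance_shutdown` whenever an instance disappears; the collected values could then be exported through whatever metrics endpoint is already in place.
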

Outstanding TODO-or-maybe-not-TODOs not covered by a linked ticket:
  • Periodic status reports in logs. This kind of logging should normally (always?) be handled by an external monitoring system that connects to the existing metrics endpoint.
  • Cancel containers that take longer than a configurable time limit to schedule (e.g., no nodes ever come up). Unsure whether this is useful: maybe containers should just stay queued until the problem is fixed.
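
If the scheduling time limit mentioned above were pursued, the selection logic might look something like this. It is a minimal sketch: the function name, the `max_queue_seconds` setting, and the idea of tracking a per-container "queued since" timestamp are all assumptions for illustration, and a limit of `None` models the alternative of leaving containers queued indefinitely.

```python
from typing import Dict, List, Optional


def containers_to_cancel(
    queued_since: Dict[str, float],
    now: float,
    max_queue_seconds: Optional[float],
) -> List[str]:
    """Return UUIDs of containers queued longer than the limit (hypothetical helper).

    queued_since maps container UUID -> time it was first seen in the queue.
    max_queue_seconds is an assumed config value; None disables cancellation.
    """
    if max_queue_seconds is None:
        return []
    return sorted(
        uuid
        for uuid, since in queued_since.items()
        if now - since > max_queue_seconds
    )
```

Making the limit optional keeps the "stay queued until the problem is fixed" behaviour as the default, so the open question in the bullet does not need to be settled before the check exists.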

Related issues

Related to Arvados - Bug #13964: crunch-dispatch-cloud spike (Resolved)

Related to Arvados - Feature #14325: [crunch-dispatch-cloud] Dispatch containers to cloud VMs directly, without slurm or nodemanager (Resolved, 01/28/2019)

Related to Arvados - Story #14360: [crunch-dispatch-cloud] Merge incomplete implementation (Resolved, 10/26/2018)

Related to Arvados - Feature #14324: [crunch-dispatch-cloud] Azure driver (Resolved, 01/09/2019)

Related to Arvados - Story #14807: [arvados-dispatch-cloud] Features/fixes needed before first production deploy (Resolved, 01/29/2019)

Related to Arvados - Bug #14745: [crunch-dispatch-cloud] Azure cloud driver fixups (Resolved, 02/13/2019)

Related to Arvados - Bug #14844: [dispatch-cloud] Azure driver bugs discovered in trial run (Resolved, 02/28/2019)

Related to Arvados - Feature #14291: [crunch-dispatch-cloud] AWS driver (Resolved, 02/28/2019)

Related to Arvados - Story #14931: [arvados-dispatch-cloud] Configurable instance tags (Resolved, 05/31/2019)

Related to Arvados - Feature #14912: [Crunch2] Azure driver supports attaching extra storage (New)

Related to Arvados - Feature #15025: [arvados-dispatch-cloud] GCE driver (Google Compute Engine) (New)

Related to Arvados - Feature #15051: [a-d-c] EC2 driver supports AssumeRole (New)

Related to Arvados - Feature #15063: [a-d-c] Assign names to EC2 instances (Duplicate)

Related to Arvados - Feature #15370: [arvados-dispatch-cloud] loopback driver (New)

Blocked by Arvados - Story #14796: [crunch-dispatch-cloud] Document installation / migration from c-d-slurm + node manager (Resolved, 01/29/2019)

Blocks Arvados - Story #13484: [API] Support multiple load-balanced API server nodes (New, 03/18/2019)

Blocked by Arvados - Feature #15340: [arvados-dispatch-cloud] Error-counting metrics (Resolved)

Blocked by Arvados - Story #15026: [arvados-dispatch-cloud] Cloud driver/config testing tool (Resolved, 06/21/2019)

Blocked by Arvados - Feature #15345: [arvados-dispatch-cloud] kill container (management API) (Resolved, 06/19/2019)

History

#1 Updated by Peter Amstutz about 1 year ago

  • Description updated

#2 Updated by Peter Amstutz about 1 year ago

  • Description updated

#3 Updated by Tom Morris about 1 year ago

  • Target version changed from Arvados Future Sprints to To Be Groomed

#4 Updated by Tom Morris 10 months ago

  • Related to Bug #13964: crunch-dispatch-cloud spike added

#5 Updated by Tom Clegg 10 months ago

  • Related to Feature #14325: [crunch-dispatch-cloud] Dispatch containers to cloud VMs directly, without slurm or nodemanager added

#6 Updated by Tom Clegg 8 months ago

  • Related to Story #14360: [crunch-dispatch-cloud] Merge incomplete implementation added

#7 Updated by Tom Clegg 8 months ago

  • Related to Feature #14324: [crunch-dispatch-cloud] Azure driver added

#8 Updated by Tom Clegg 7 months ago

  • Blocked by Story #14796: [crunch-dispatch-cloud] Document installation / migration from c-d-slurm + node manager added

#9 Updated by Tom Clegg 7 months ago

  • Related to Story #14807: [arvados-dispatch-cloud] Features/fixes needed before first production deploy added

#10 Updated by Tom Clegg 6 months ago

  • Description updated

#12 Updated by Tom Clegg 6 months ago

  • Related to Bug #14745: [crunch-dispatch-cloud] Azure cloud driver fixups added

#13 Updated by Tom Clegg 6 months ago

  • Related to Bug #14844: [dispatch-cloud] Azure driver bugs discovered in trial run added

#14 Updated by Tom Clegg 5 months ago

#15 Updated by Tom Clegg 5 months ago

  • Related to Story #14931: [arvados-dispatch-cloud] Configurable instance tags added

#16 Updated by Tom Clegg 5 months ago

  • Related to Feature #14912: [Crunch2] Azure driver supports attaching extra storage added

#17 Updated by Tom Clegg 5 months ago

  • Related to Feature #15025: [arvados-dispatch-cloud] GCE driver (Google Compute Engine) added

#18 Updated by Tom Clegg 5 months ago

  • Related to Story #15026: [arvados-dispatch-cloud] Cloud driver/config testing tool added

#19 Updated by Tom Clegg 5 months ago

  • Description updated

#20 Updated by Tom Clegg 5 months ago

  • Related to Feature #15051: [a-d-c] EC2 driver supports AssumeRole added

#21 Updated by Tom Clegg 4 months ago

  • Related to Feature #15063: [a-d-c] Assign names to EC2 instances added

#23 Updated by Ward Vandewege 3 months ago

  • Blocks Story #13484: [API] Support multiple load-balanced API server nodes added

#24 Updated by Ward Vandewege 3 months ago

  • Subject changed from Replace SLURM for cloud job scheduling/dispatching to [Epic] Replace SLURM for cloud job scheduling/dispatching

#25 Updated by Ward Vandewege 3 months ago

  • Release set to 22

#26 Updated by Tom Clegg 2 months ago

  • Blocked by Feature #15340: [arvados-dispatch-cloud] Error-counting metrics added

#27 Updated by Tom Clegg 2 months ago

  • Deleted relation: Related to Story #15026: [arvados-dispatch-cloud] Cloud driver/config testing tool

#28 Updated by Tom Clegg 2 months ago

  • Blocked by Story #15026: [arvados-dispatch-cloud] Cloud driver/config testing tool added

#29 Updated by Tom Clegg 2 months ago

  • Blocked by Feature #15345: [arvados-dispatch-cloud] kill container (management API) added

#30 Updated by Tom Clegg 2 months ago

  • Description updated

#31 Updated by Tom Clegg 2 months ago

  • Related to Feature #15370: [arvados-dispatch-cloud] loopback driver added
