Story #13908

closed

[Epic] Replace SLURM for cloud job scheduling/dispatching

Added by Tom Morris over 4 years ago. Updated about 2 years ago.

Status: Resolved
Priority: Normal
Assigned To: -
Category: -
Target version: -
Start date:
Due date:
% Done: 0%
Estimated time:
Story points: -

Description

See "Dispatching containers to cloud VMs".

Outstanding TODOs not covered by a linked ticket:
  • Integration test that uses a loopback driver to execute crunch-run on localhost (this verifies the interface between dispatcher and crunch-run)
  • Add tests for activity/resource usage metrics
  • Performance metrics for dispatching (e.g., time between seeing a container in the queue and starting its crunch-run process on a worker); see "Dispatching containers to cloud VMs"
  • Cloud behavior metrics: count unexpected shutdowns, split by instance type
  • Configurable spending limits
  • Update runtime_status field when cancelling containers after crunch-run crashes or the cloud VM dies without finalizing the container (already done for the “no suitable instance type” case)
  • (API) Allow admin users to specify image ID in runtime_constraints; (dispatcher) if present, use runtime_constraints image ID instead of image ID from cluster config file
  • Run crunch-run as a non-root user
  • Don't require root at all on the cloud instance
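
The two metrics items above (dispatch latency, and unexpected shutdowns split by instance type) could be tracked along these lines. This is only a sketch: the `DispatchMetrics` class, its method names, and the in-memory storage are hypothetical and are not the dispatcher's actual interface; a real implementation would presumably export these values through the existing metrics endpoint.

```python
import time
from collections import Counter


class DispatchMetrics:
    """Hypothetical in-memory metrics holder (illustration only)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._first_seen = {}                   # container UUID -> time first seen in queue
        self.dispatch_latencies = []            # seconds from first seen to crunch-run start
        self.unexpected_shutdowns = Counter()   # instance type -> count

    def container_seen(self, uuid):
        # Keep only the first sighting so repeated queue polls don't reset the timer.
        self._first_seen.setdefault(uuid, self._clock())

    def crunch_run_started(self, uuid):
        # Return the latency in seconds, or None if the container was never seen queued.
        seen = self._first_seen.pop(uuid, None)
        if seen is None:
            return None
        latency = self._clock() - seen
        self.dispatch_latencies.append(latency)
        return latency

    def instance_shutdown(self, instance_type, expected):
        # Count only shutdowns the dispatcher did not ask for.
        if not expected:
            self.unexpected_shutdowns[instance_type] += 1
```

Passing the clock in as a parameter is a design choice that makes the latency measurement testable without sleeping.
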
Outstanding TODO-or-maybe-not-TODOs not covered by a linked ticket:
  • Periodic status reports in logs. This kind of logging should normally (always?) be handled by an external monitoring system that connects to the existing metrics endpoint.
  • Cancel containers that take longer than a configurable time limit to schedule (e.g., no nodes ever come up). Unsure whether this is useful: maybe containers should just stay queued until the problem is fixed.
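
If the scheduling time limit mentioned above were implemented, the selection step might look roughly like the following. The function name, the `queued` mapping, and the convention that `None` disables the limit are all assumptions made for illustration; nothing here reflects an existing configuration option.

```python
from datetime import datetime, timedelta, timezone


def containers_over_queue_limit(queued, now, max_queue_time):
    """Return sorted UUIDs of containers queued longer than max_queue_time.

    queued maps container UUID -> datetime it was first seen in the queue.
    max_queue_time is a timedelta, or None to disable the limit entirely
    (containers then stay queued until the underlying problem is fixed).
    """
    if max_queue_time is None:
        return []
    return sorted(
        uuid for uuid, since in queued.items() if now - since > max_queue_time
    )


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    queued = {"a": now - timedelta(hours=3), "b": now - timedelta(minutes=5)}
    print(containers_over_queue_limit(queued, now, timedelta(hours=1)))
```

Treating `None` as "no limit" keeps the behaviour the item itself suggests might be preferable as the default.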

Related issues

Related to Arvados - Bug #13964: crunch-dispatch-cloud spike | Resolved | Tom Clegg
Related to Arvados - Feature #14325: [crunch-dispatch-cloud] Dispatch containers to cloud VMs directly, without slurm or nodemanager | Resolved | Tom Clegg | 01/28/2019
Related to Arvados - Story #14360: [crunch-dispatch-cloud] Merge incomplete implementation | Resolved | Tom Clegg | 10/26/2018
Related to Arvados - Feature #14324: [crunch-dispatch-cloud] Azure driver | Resolved | Peter Amstutz | 01/09/2019
Related to Arvados - Story #14807: [arvados-dispatch-cloud] Features/fixes needed before first production deploy | Resolved | Tom Clegg | 01/29/2019
Related to Arvados - Bug #14745: [crunch-dispatch-cloud] Azure cloud driver fixups | Resolved | Eric Biagiotti | 02/13/2019
Related to Arvados - Bug #14844: [dispatch-cloud] Azure driver bugs discovered in trial run | Resolved | Peter Amstutz | 02/28/2019
Related to Arvados - Feature #14291: [crunch-dispatch-cloud] AWS driver | Resolved | Peter Amstutz | 02/28/2019
Related to Arvados - Story #14931: [arvados-dispatch-cloud] Configurable instance tags | Resolved | Tom Clegg | 05/31/2019
Related to Arvados - Feature #14912: [Crunch2] Azure driver supports attaching extra storage | New
Related to Arvados - Feature #15025: [arvados-dispatch-cloud] GCE driver (Google Compute Engine) | New
Related to Arvados - Feature #15051: [a-d-c] EC2 driver supports AssumeRole | New
Related to Arvados - Feature #15063: [a-d-c] Assign names to EC2 instances | Duplicate | Tom Clegg
Related to Arvados - Feature #12900: [Crunch2] [crunch-run] Prune old images before installing image for current container | New
Related to Arvados - Feature #15370: [arvados-dispatch-cloud] loopback driver | Resolved | Tom Clegg | 05/17/2022
Related to Arvados - Story #15759: [arvados-dispatch-cloud] deploy/run correct version of crunch-run binary on worker nodes | Resolved | Tom Clegg | 12/30/2019
Related to Arvados - Story #15823: [arvados-dispatch-cloud] Add arvados-dispatch-cloud management APIs to doc site | Resolved | Tom Clegg | 01/13/2020
Related to Arvados - Story #15865: [arvados-dispatch-cloud] Cumulative instance time and cost metrics | New
Related to Arvados - Feature #16106: [arvados-dispatch-cloud] Azure driver support for preemptible instances | Resolved | Ward Vandewege | 01/07/2021
Related to Arvados - Feature #16636: [arvados-dispatch-cloud] Add instance metrics | Resolved | Ward Vandewege | 08/03/2020
Blocked by Arvados - Story #14796: [crunch-dispatch-cloud] Document installation / migration from c-d-slurm + node manager | Resolved | Tom Clegg | 01/29/2019
Blocks Arvados - Story #13484: Support multiple load-balanced API server nodes | New | Ward Vandewege | 03/18/2019
Blocked by Arvados - Feature #15340: [arvados-dispatch-cloud] Error-counting metrics | Resolved | Tom Clegg
Blocked by Arvados - Story #15026: [arvados-dispatch-cloud] Cloud driver/config testing tool | Resolved | Tom Clegg | 06/21/2019
Blocked by Arvados - Feature #15345: [arvados-dispatch-cloud] kill container (management API) | Resolved | Tom Clegg | 06/19/2019
Blocked by Arvados - Story #15775: [arvados-dispatch-cloud] Promote a-d-c from "experimental" to default in install docs | Resolved | Peter Amstutz
Blocked by Arvados - Story #15776: [arvados-dispatch-cloud] Add nodemanager->a-d-c instructions to upgrade notes | Resolved | Peter Amstutz

#1 Updated by Peter Amstutz over 4 years ago
  • Description updated (diff)

#2 Updated by Peter Amstutz over 4 years ago
  • Description updated (diff)

#3 Updated by Tom Morris over 4 years ago
  • Target version changed from Arvados Future Sprints to To Be Groomed

#4 Updated by Tom Morris about 4 years ago
  • Related to Bug #13964: crunch-dispatch-cloud spike added

#5 Updated by Tom Clegg about 4 years ago
  • Related to Feature #14325: [crunch-dispatch-cloud] Dispatch containers to cloud VMs directly, without slurm or nodemanager added

#6 Updated by Tom Clegg almost 4 years ago
  • Related to Story #14360: [crunch-dispatch-cloud] Merge incomplete implementation added

#7 Updated by Tom Clegg almost 4 years ago
  • Related to Feature #14324: [crunch-dispatch-cloud] Azure driver added

#8 Updated by Tom Clegg almost 4 years ago
  • Blocked by Story #14796: [crunch-dispatch-cloud] Document installation / migration from c-d-slurm + node manager added

#9 Updated by Tom Clegg almost 4 years ago
  • Related to Story #14807: [arvados-dispatch-cloud] Features/fixes needed before first production deploy added

#10 Updated by Tom Clegg almost 4 years ago
  • Description updated (diff)

#12 Updated by Tom Clegg almost 4 years ago
  • Related to Bug #14745: [crunch-dispatch-cloud] Azure cloud driver fixups added

#13 Updated by Tom Clegg almost 4 years ago
  • Related to Bug #14844: [dispatch-cloud] Azure driver bugs discovered in trial run added

#14 Updated by Tom Clegg almost 4 years ago

#15 Updated by Tom Clegg over 3 years ago
  • Related to Story #14931: [arvados-dispatch-cloud] Configurable instance tags added

#16 Updated by Tom Clegg over 3 years ago
  • Related to Feature #14912: [Crunch2] Azure driver supports attaching extra storage added

#17 Updated by Tom Clegg over 3 years ago
  • Related to Feature #15025: [arvados-dispatch-cloud] GCE driver (Google Compute Engine) added

#18 Updated by Tom Clegg over 3 years ago
  • Related to Story #15026: [arvados-dispatch-cloud] Cloud driver/config testing tool added

#19 Updated by Tom Clegg over 3 years ago
  • Description updated (diff)

#20 Updated by Tom Clegg over 3 years ago
  • Related to Feature #15051: [a-d-c] EC2 driver supports AssumeRole added

#21 Updated by Tom Clegg over 3 years ago
  • Related to Feature #15063: [a-d-c] Assign names to EC2 instances added

#22 Updated by Tom Clegg over 3 years ago
  • Related to Feature #12900: [Crunch2] [crunch-run] Prune old images before installing image for current container added

#23 Updated by Ward Vandewege over 3 years ago
  • Blocks Story #13484: Support multiple load-balanced API server nodes added

#24 Updated by Ward Vandewege over 3 years ago
  • Subject changed from Replace SLURM for cloud job scheduling/dispatching to [Epic] Replace SLURM for cloud job scheduling/dispatching

#25 Updated by Ward Vandewege over 3 years ago
  • Release set to 22

#26 Updated by Tom Clegg over 3 years ago
  • Blocked by Feature #15340: [arvados-dispatch-cloud] Error-counting metrics added

#27 Updated by Tom Clegg over 3 years ago
  • Related to deleted (Story #15026: [arvados-dispatch-cloud] Cloud driver/config testing tool)

#28 Updated by Tom Clegg over 3 years ago
  • Blocked by Story #15026: [arvados-dispatch-cloud] Cloud driver/config testing tool added

#29 Updated by Tom Clegg over 3 years ago
  • Blocked by Feature #15345: [arvados-dispatch-cloud] kill container (management API) added

#30 Updated by Tom Clegg over 3 years ago
  • Description updated (diff)

#31 Updated by Tom Clegg over 3 years ago
  • Related to Feature #15370: [arvados-dispatch-cloud] loopback driver added

#32 Updated by Tom Clegg about 3 years ago
  • Related to Story #15759: [arvados-dispatch-cloud] deploy/run correct version of crunch-run binary on worker nodes added

#33 Updated by Tom Clegg about 3 years ago
  • Description updated (diff)

#34 Updated by Tom Clegg about 3 years ago
  • Blocked by Story #15775: [arvados-dispatch-cloud] Promote a-d-c from "experimental" to default in install docs added

#35 Updated by Tom Clegg about 3 years ago
  • Blocked by Story #15776: [arvados-dispatch-cloud] Add nodemanager->a-d-c instructions to upgrade notes added

#36 Updated by Tom Morris about 3 years ago
  • Target version changed from To Be Groomed to 2019-11-20 Sprint

#37 Updated by Tom Morris about 3 years ago
  • Target version deleted (2019-11-20 Sprint)

#38 Updated by Tom Morris about 3 years ago
  • Target version set to Arvados Future Sprints

#39 Updated by Tom Clegg about 3 years ago
  • Related to Story #15823: [arvados-dispatch-cloud] Add arvados-dispatch-cloud management APIs to doc site added

#41 Updated by Tom Clegg about 3 years ago
  • Related to Story #15865: [arvados-dispatch-cloud] Cumulative instance time and cost metrics added

#42 Updated by Tom Clegg almost 3 years ago
  • Release deleted (22)

#43 Updated by Tom Clegg almost 3 years ago
  • Related to Feature #16106: [arvados-dispatch-cloud] Azure driver support for preemptible instances added

#44 Updated by Tom Clegg over 2 years ago
  • Related to Feature #16636: [arvados-dispatch-cloud] Add instance metrics added

#45 Updated by Peter Amstutz about 2 years ago
  • Status changed from New to Resolved

#46 Updated by Ward Vandewege about 2 years ago
  • Target version deleted (Arvados Future Sprints)