Story #13908
[Epic] Replace SLURM for cloud job scheduling/dispatching
Status:
Resolved
Priority:
Normal
Assigned To:
-
Category:
-
Target version:
-
Start date:
Due date:
% Done:
0%
Estimated time:
Story points:
-
Description
See Dispatching containers to cloud VMs
Outstanding TODOs not covered by a linked ticket:
- Integration test that uses a loopback driver to execute crunch-run on localhost (this verifies the interface between dispatcher and crunch-run)
- Add tests for activity/resource usage metrics
- Performance metrics for dispatching (e.g., time between seeing a container in the queue and starting its crunch-run process on a worker), see Dispatching containers to cloud VMs
- Cloud behavior metrics: count unexpected shutdowns, split by instance type
- Configurable spending limits
- Update runtime_status field when cancelling containers after crunch-run crashes or the cloud VM dies without finalizing the container (already done for the “no suitable instance type” case)
- (API) Allow admin users to specify image ID in runtime_constraints; (dispatcher) if present, use runtime_constraints image ID instead of image ID from cluster config file
- Run crunch-run as a non-root user
- Don't require root at all on the cloud instance
- Periodic status reports in logs. This kind of logging should normally (always?) be handled by an external monitoring system that connects to the existing metrics endpoint.
- Cancel containers that take longer than a configurable time limit to schedule (e.g., no nodes ever come up). Unsure whether this is useful: maybe containers should just stay queued until the problem is fixed.
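Two of the TODOs above (dispatch latency metrics and the optional scheduling time limit) both depend on remembering when a container was first seen in the queue. The following is only a minimal sketch of that bookkeeping, not the dispatcher's actual code: the class `DispatchLatencyTracker`, its method names, and the idea of passing timestamps in explicitly are all assumptions made for illustration.

```python
from typing import Dict, List, Optional


class DispatchLatencyTracker:
    """Hypothetical helper, not part of arvados-dispatch-cloud.

    Tracks the time between first seeing a container in the queue and
    starting its crunch-run process, and reports containers that have
    waited longer than a configurable limit.
    """

    def __init__(self) -> None:
        # Container UUID -> timestamp (seconds) of the first queue sighting.
        self._first_seen: Dict[str, float] = {}
        # Completed queue-to-start latencies, in seconds.
        self.latencies: List[float] = []

    def seen(self, uuid: str, now: float) -> None:
        # Only the first sighting counts; later polls must not reset the clock.
        self._first_seen.setdefault(uuid, now)

    def started(self, uuid: str, now: float) -> Optional[float]:
        # Called when crunch-run starts on a worker. Returns the latency,
        # or None if the container was never recorded as queued.
        first = self._first_seen.pop(uuid, None)
        if first is None:
            return None
        latency = now - first
        self.latencies.append(latency)
        return latency

    def overdue(self, now: float, limit: float) -> List[str]:
        # UUIDs still waiting after more than `limit` seconds. Whether to
        # cancel them or leave them queued is a policy decision made elsewhere.
        return sorted(u for u, t in self._first_seen.items() if now - t > limit)
```

In this sketch a caller would invoke `seen()` on every queue poll, `started()` when a crunch-run process is launched, and `overdue()` periodically. Taking `now` as an argument rather than reading the clock internally keeps the helper easy to test.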
Related issues
History
#1
Updated by Peter Amstutz over 2 years ago
- Description updated (diff)
#2
Updated by Peter Amstutz over 2 years ago
- Description updated (diff)
#3
Updated by Tom Morris over 2 years ago
- Target version changed from Arvados Future Sprints to To Be Groomed
#4
Updated by Tom Morris over 2 years ago
- Related to Bug #13964: crunch-dispatch-cloud spike added
#5
Updated by Tom Clegg over 2 years ago
- Related to Feature #14325: [crunch-dispatch-cloud] Dispatch containers to cloud VMs directly, without slurm or nodemanager added
#6
Updated by Tom Clegg about 2 years ago
- Related to Story #14360: [crunch-dispatch-cloud] Merge incomplete implementation added
#7
Updated by Tom Clegg about 2 years ago
- Related to Feature #14324: [crunch-dispatch-cloud] Azure driver added
#8
Updated by Tom Clegg almost 2 years ago
- Blocked by Story #14796: [crunch-dispatch-cloud] Document installation / migration from c-d-slurm + node manager added
#9
Updated by Tom Clegg almost 2 years ago
- Related to Story #14807: [arvados-dispatch-cloud] Features/fixes needed before first production deploy added
#10
Updated by Tom Clegg almost 2 years ago
- Description updated (diff)
#12
Updated by Tom Clegg almost 2 years ago
- Related to Bug #14745: [crunch-dispatch-cloud] Azure cloud driver fixups added
#13
Updated by Tom Clegg almost 2 years ago
- Related to Bug #14844: [dispatch-cloud] Azure driver bugs discovered in trial run added
#14
Updated by Tom Clegg almost 2 years ago
- Related to Feature #14291: [crunch-dispatch-cloud] AWS driver added
#15
Updated by Tom Clegg almost 2 years ago
- Related to Story #14931: [arvados-dispatch-cloud] Configurable instance tags added
#16
Updated by Tom Clegg almost 2 years ago
- Related to Feature #14912: [Crunch2] Azure driver supports attaching extra storage added
#17
Updated by Tom Clegg almost 2 years ago
- Related to Feature #15025: [arvados-dispatch-cloud] GCE driver (Google Compute Engine) added
#18
Updated by Tom Clegg almost 2 years ago
- Related to Story #15026: [arvados-dispatch-cloud] Cloud driver/config testing tool added
#19
Updated by Tom Clegg almost 2 years ago
- Description updated (diff)
#20
Updated by Tom Clegg almost 2 years ago
- Related to Feature #15051: [a-d-c] EC2 driver supports AssumeRole added
#21
Updated by Tom Clegg almost 2 years ago
- Related to Feature #15063: [a-d-c] Assign names to EC2 instances added
#23
Updated by Ward Vandewege over 1 year ago
- Blocks Story #13484: Support multiple load-balanced API server nodes added
#24
Updated by Ward Vandewege over 1 year ago
- Subject changed from Replace SLURM for cloud job scheduling/dispatching to [Epic] Replace SLURM for cloud job scheduling/dispatching
#25
Updated by Ward Vandewege over 1 year ago
- Release set to 22
#26
Updated by Tom Clegg over 1 year ago
- Blocked by Feature #15340: [arvados-dispatch-cloud] Error-counting metrics added
#27
Updated by Tom Clegg over 1 year ago
- Related to deleted (Story #15026: [arvados-dispatch-cloud] Cloud driver/config testing tool)
#28
Updated by Tom Clegg over 1 year ago
- Blocked by Story #15026: [arvados-dispatch-cloud] Cloud driver/config testing tool added
#29
Updated by Tom Clegg over 1 year ago
- Blocked by Feature #15345: [arvados-dispatch-cloud] kill container (management API) added
#30
Updated by Tom Clegg over 1 year ago
- Description updated (diff)
#31
Updated by Tom Clegg over 1 year ago
- Related to Feature #15370: [arvados-dispatch-cloud] loopback driver added
#32
Updated by Tom Clegg about 1 year ago
- Related to Story #15759: [arvados-dispatch-cloud] deploy/run correct version of crunch-run binary on worker nodes added
#33
Updated by Tom Clegg about 1 year ago
- Description updated (diff)
#34
Updated by Tom Clegg about 1 year ago
- Blocked by Story #15775: [arvados-dispatch-cloud] Promote a-d-c from "experimental" to default in install docs added
#35
Updated by Tom Clegg about 1 year ago
- Blocked by Story #15776: [arvados-dispatch-cloud] Add nodemanager->a-d-c instructions to upgrade notes added
#36
Updated by Tom Morris about 1 year ago
- Target version changed from To Be Groomed to 2019-11-20 Sprint
#37
Updated by Tom Morris about 1 year ago
- Target version deleted (2019-11-20 Sprint)
#38
Updated by Tom Morris about 1 year ago
- Target version set to Arvados Future Sprints
#39
Updated by Tom Clegg about 1 year ago
- Related to Story #15823: [arvados-dispatch-cloud] Add arvados-dispatch-cloud management APIs to doc site added
#41
Updated by Tom Clegg about 1 year ago
- Related to Story #15865: [arvados-dispatch-cloud] Cumulative instance time and cost metrics added
#42
Updated by Tom Clegg about 1 year ago
- Release deleted (22)
#43
Updated by Tom Clegg 12 months ago
- Related to Feature #16106: [arvados-dispatch-cloud] Azure driver support for preemptible instances added
#44
Updated by Tom Clegg 6 months ago
- Related to Feature #16636: [arvados-dispatch-cloud] Add instance metrics added
#45
Updated by Peter Amstutz 4 months ago
- Status changed from New to Resolved
#46
Updated by Ward Vandewege 3 months ago
- Target version deleted (Arvados Future Sprints)