Feature #11146
open
[Crunch2] [Workbench] Show slurm queue position of containers submitted to slurm but not yet running
Description
Background
From the user's perspective, it's hard to see what (if anything) is happening between the time a container is created/queued and the time it actually starts running.
In a SLURM setup, the container typically moves quickly from Queued to Locked state when crunch-dispatch-slurm puts it in the slurm queue, and then stays there for some time waiting for SLURM resources to run it.
Proposed feature
Soon after a container is submitted to the SLURM queue, Workbench should start indicating how close the resulting SLURM job is to the front of the queue.
Implementation
When checking squeue, crunch-dispatch-slurm should notice the slurm queue position for each "Locked" container, and propagate this information to the API server.
- API: Add a new serialized Hash field dispatch_info
- crunch-dispatch-slurm: store queue position as dispatch_info["queue_position"]
- crunch-dispatch-slurm: only update containers for which this process has the lock
- crunch-dispatch-slurm: rate-limit queue position updates for any given container: at most one update per second, and skip redundant updates like "update queue position from 5 to 5"
- crunch-dispatch-slurm: ensure no races between "update queue position" and "update container state" requests
- Workbench: display the latest queue position when available
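The rate-limiting and de-duplication items above could look roughly like the following. This is only a sketch, written in Python for brevity rather than in the dispatcher's own language. The class name QueuePositionReporter, the send callback (standing in for whatever API call ends up storing dispatch_info["queue_position"]), and the injectable clock are all hypothetical and not part of any existing Arvados code.

```python
import time
from typing import Callable, Dict, Tuple


class QueuePositionReporter:
    """Forward queue positions, at most once per container per interval.

    Hypothetical helper: `send` stands in for the API call that would
    store dispatch_info["queue_position"] for a container.
    """

    def __init__(self, send: Callable[[str, int], None],
                 min_interval: float = 1.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._send = send
        self._min_interval = min_interval  # seconds between updates
        self._clock = clock
        # container UUID -> (last position sent, time it was sent)
        self._last: Dict[str, Tuple[int, float]] = {}

    def report(self, uuid: str, position: int) -> bool:
        """Send the position if it is new and not too soon; return True if sent."""
        now = self._clock()
        last = self._last.get(uuid)
        if last is not None:
            last_position, last_time = last
            if position == last_position:
                return False  # redundant, e.g. "from 5 to 5"
            if now - last_time < self._min_interval:
                return False  # too soon; the next squeue poll can retry
        self._send(uuid, position)
        self._last[uuid] = (position, now)
        return True

    def forget(self, uuid: str) -> None:
        """Drop state for a container that has left the queue."""
        self._last.pop(uuid, None)
```

A skipped update is not queued for later here. The assumption is that squeue is polled repeatedly, so a changed position is simply picked up on a later poll. Avoiding races with "update container state" requests is a separate concern and is not addressed by this sketch.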
Updated by Tom Clegg almost 8 years ago
- Description updated (diff)
- Category set to Crunch
- Assigned To set to Tom Clegg
- Target version set to Arvados Future Sprints
Updated by Tom Clegg over 7 years ago
- Story points deleted (3.0)
From squeue(1): "The default value of sort for jobs is "P,t,-p" (increasing partition name then within a given partition by increasing [job] state and then decreasing priority)".
We might want to use "S,-p,V,P" (expected start time, decreasing priority, submission time, partition name).
If we include %t (job state) in the format string, {number of PENDING jobs seen before this one}+1 can be used as the queue position for a job.
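The counting rule above can be sketched as follows, in Python for brevity. The assumptions: squeue is run with something like --noheader --format='%j %t' --sort=S,-p,V,P, the job name (%j) identifies the container, and %t prints the compact state code, which is PD for pending jobs. The function name queue_positions is hypothetical.

```python
from typing import Dict


def queue_positions(squeue_output: str) -> Dict[str, int]:
    """Map job name -> queue position for pending jobs.

    Expects one job per line as "<job name> <compact state>", already
    sorted by squeue in the desired queue order (assumed invocation:
    squeue --noheader --format='%j %t' --sort=S,-p,V,P).
    """
    positions: Dict[str, int] = {}
    pending_seen = 0
    for line in squeue_output.splitlines():
        # Split on the last run of whitespace so that a job name
        # containing spaces is still kept in one piece.
        fields = line.rsplit(None, 1)
        if len(fields) != 2:
            continue  # blank or malformed line
        name, state = fields
        if state != "PD":
            continue  # running, completing, etc. jobs have no position
        # Position = number of pending jobs seen before this one, plus 1.
        pending_seen += 1
        positions[name] = pending_seen
    return positions
```

For example, given the lines "job-a R", "job-b PD", "job-c PD", this returns position 1 for job-b and position 2 for job-c, and job-a gets no entry because it is already running.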
Updated by Ward Vandewege over 3 years ago
- Target version deleted (Arvados Future Sprints)