Tom Clegg, 12/23/2015 03:29 PM
Crunch2 dispatch
Summary
A dispatcher uses available compute resources to execute queued containers.
Dispatch is meant to be a small, simple component rather than a pluggable framework. For example, "slurm dispatch" can be a small standalone program rather than a plugin for a large generic dispatch program.
Pseudocode
- Notice there is a queued container
- Decide whether the required resources are available to run the container
- Lock the container (this avoids races with other dispatch processes)
- Translate the container's runtime constraints and priority to instructions for the lower-level scheduler, if any
- Invoke the "crunch2 run" executor
- When the priority changes on a container taken by this dispatch process, update the lower-level scheduler accordingly (cancel if priority is zero)
- If the lower-level scheduler indicates the container is finished or abandoned, but the Container record is locked by this dispatcher and has state=Running, fail the container
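The steps above can be sketched as a toy, in-memory dispatcher. This is only an illustration under stated assumptions, not the Crunch2 implementation: the `Container` fields, the `Dispatcher` class, the use of a plain dict as the "lower-level scheduler", and the `failed` marker are all hypothetical names introduced here. In particular, the sketch lets the dispatcher set state=Running itself, which a real deployment might leave to the executor.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass
class Container:
    uuid: str
    ram: int = 0                      # hypothetical runtime constraint
    priority: int = 1
    state: str = "Queued"
    locked_by: Optional[str] = None
    failed: bool = False              # hypothetical marker for "fail the container"


class Dispatcher:
    """Toy dispatcher; the lower-level scheduler is a dict of uuid -> priority."""

    def __init__(self, name: str, free_ram: int) -> None:
        self.name = name
        self.free_ram = free_ram
        self.scheduler: Dict[str, int] = {}

    def try_dispatch(self, c: Container) -> bool:
        # Notice a queued container and decide whether resources are available.
        if c.state != "Queued" or c.priority <= 0 or c.ram > self.free_ram:
            return False
        # Lock the container. A real API would make this an atomic update.
        if c.locked_by is not None:
            return False
        c.state, c.locked_by = "Locked", self.name
        # Translate constraints and priority for the lower-level scheduler,
        # then hand off to the executor (assumed to start immediately here).
        self.scheduler[c.uuid] = c.priority
        self.free_ram -= c.ram
        c.state = "Running"
        return True

    def on_priority_change(self, c: Container, priority: int) -> None:
        c.priority = priority
        if c.locked_by != self.name or c.uuid not in self.scheduler:
            return  # not a container taken by this dispatch process
        if priority == 0:
            del self.scheduler[c.uuid]          # cancel
            self.free_ram += c.ram
        else:
            self.scheduler[c.uuid] = priority   # propagate the new priority

    def reconcile(self, containers: Iterable[Container]) -> None:
        # The scheduler no longer knows the container, but the record is still
        # locked by this dispatcher with state=Running: fail it.
        for c in containers:
            if (c.locked_by == self.name and c.state == "Running"
                    and c.uuid not in self.scheduler):
                c.state, c.failed = "Complete", True
```

A dispatcher that lacks the resources for a container simply leaves it Queued, so another dispatcher can still pick it up.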
Examples
slurm batch mode
- Use "sinfo" to determine whether it is possible to run the container
- Submit a batch job to the queue: "echo crunch-run --job {uuid} | sbatch -N1"
- When container priority changes, use scontrol and scancel to propagate changes to slurm
- Use strigger to run a cleanup script when a container exits
- Inspect /proc/meminfo, /proc/cpuinfo, "docker ps", etc. to determine local capacity
- Invoke crunch-run as a child process (or perhaps a detached daemon process)
- Signal crunch-run to stop if container priority changes to zero
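As a rough sketch of two of the steps listed above, the following Python builds the sbatch submission and checks local memory from /proc/meminfo-style text. The function names are hypothetical. One assumption goes beyond the command quoted above: sbatch generally expects its script to begin with an interpreter line, so the sketch prepends `#!/bin/sh`.

```python
import shlex
from typing import Dict, List, Tuple


def sbatch_submission(uuid: str) -> Tuple[List[str], str]:
    """Return (argv, stdin), roughly: echo crunch-run --job UUID | sbatch -N1."""
    # Assumption: a shebang line is added so sbatch accepts the script.
    script = "#!/bin/sh\nexec crunch-run --job " + shlex.quote(uuid) + "\n"
    return ["sbatch", "-N1"], script


def parse_meminfo(text: str) -> Dict[str, int]:
    """Parse /proc/meminfo-style text into {field: first numeric value}."""
    info: Dict[str, int] = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields and fields[0].isdigit():
            info[name.strip()] = int(fields[0])
    return info


def has_local_ram(meminfo_text: str, needed_kb: int) -> bool:
    """Compare the requested amount (in kB) with MemAvailable, else MemFree."""
    info = parse_meminfo(meminfo_text)
    available = info.get("MemAvailable", info.get("MemFree", 0))
    return available >= needed_kb
```

The submission could then be run with `subprocess.run(argv, input=script, text=True, check=True)`, and the capacity check fed with the contents of `/proc/meminfo`. Keeping command construction separate from execution makes both pieces testable without a slurm cluster.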
Arvados API support
Each dispatch process has an Arvados API token that allows it to see queued containers.
- No two dispatch processes can run at the same time with the same token. One way to achieve this is to make a user record for each dispatch service.
- List Queued containers (the dispatcher might see only a subset of all Queued containers)
- List containers with state=Locked or state=Running associated with current token
- Receive event when container is created or modified and state is Queued (it might become runnable)
- Change state Queued->Locked
- Change state Locked->Queued
- Change state Locked->Running
- Change state Running->Complete
- Receive event when priority changes
- Receive event when state changes to Complete
- Create a unique API token to pass to crunch-run (expires when the container stops)
- Create events/logs
  - Decided not to run this container (e.g., no node with those resources)
  - Decided to run this container
  - Lock failed
  - Dispatched to crunch-run
  - Cleaned up crashed crunch-run (lower-level scheduler indicates the job finished, but crunch-run didn't leave the container in a final state)
  - Cleaned up abandoned container (container belongs to this process, but dispatch and lower-level scheduler don't know about it)
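The state changes listed above amount to a small state machine. The following is a minimal sketch of how a server might validate them, assuming that only the token holding the lock may move a container out of Locked or Running. `ContainerRecord`, `TransitionError`, and `change_state` are hypothetical names, not part of the Arvados API.

```python
from typing import Optional

# The four transitions listed above; anything else is rejected.
ALLOWED = {
    ("Queued", "Locked"),
    ("Locked", "Queued"),
    ("Locked", "Running"),
    ("Running", "Complete"),
}


class TransitionError(Exception):
    """Raised when a requested state change is not permitted."""


class ContainerRecord:
    def __init__(self, uuid: str) -> None:
        self.uuid = uuid
        self.state = "Queued"
        self.locked_by_token: Optional[str] = None

    def change_state(self, new_state: str, token: str) -> None:
        if (self.state, new_state) not in ALLOWED:
            raise TransitionError(f"{self.state} -> {new_state} not allowed")
        # Assumption: once locked, only the locking token may change state.
        if self.state != "Queued" and token != self.locked_by_token:
            raise TransitionError("container is held by another dispatcher")
        if new_state == "Locked":
            self.locked_by_token = token
        elif new_state == "Queued":
            self.locked_by_token = None   # lock released
        self.state = new_state
```

For example, a dispatcher that locks a container and then finds it cannot run it would call `change_state("Queued", token)` to release it for other dispatchers.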
Non-responsibilities
Dispatch doesn't retry failed containers. If something needs to be reattempted, a new container will appear in the queue.
Dispatch doesn't fail a container that it can't run. It doesn't know whether other dispatchers will be able to run it.
Additional notes
Using websockets to listen for container events (new containers added, priority changes) will benefit from some Go SDK support.
Updated by Tom Clegg almost 9 years ago · 15 revisions