
h1. Crunch2 dispatch 

 {{toc}} 

h2. Summary

A dispatcher uses available compute resources to execute queued containers.

Dispatch is meant to be a small, simple component rather than a pluggable framework: e.g., "slurm dispatch" can be a small standalone program rather than a plugin for a big generic dispatch program. Responsibilities:

* Notice there is a queued container
* Decide whether the required resources are available to run the container
* Lock the container (this avoids races with other dispatch processes)
* Translate the container's runtime constraints and priority to instructions for the lower-level scheduler, if any
* Invoke the "crunch2 run" executor
* When the priority changes on a container taken by this dispatch process, update the lower-level scheduler accordingly (cancel if the priority drops to zero)
* If the lower-level scheduler indicates the container is finished or abandoned, but the Container record is locked by this dispatcher and has state=Running, fail the container

h2. Design sketch

crunch-dispatch-slurm (sLURm + crunCH = "Lurch"?)

h2. Pseudocode

* Monitor the containers table
## When a new container appears, claim it
## Put the job into the slurm queue with @sbatch --immediate --share --export=ARVADOS_API_HOST,ARVADOS_API_TOKEN@ (see the sketch after this list)
## Possibly request a new node from node manager (mechanism TBD, see below)
## The job consists of running "arv-container" on the compute node, given the API token and container uuid
## arv-container is responsible for actually running the container; see https://dev.arvados.org/issues/7816
## arv-container should update the container record unless there is a node failure
* Monitor slurm job status
## Run @squeue@ to get status
## Synchronize with the containers table: cancel priority-0 jobs and update container status from the slurm queue
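A minimal sketch (in Go) of the submit step above, assuming a driver loop has already picked a queued container; the @submitContainer@ helper and the @arv-container --container-uuid@ flag are illustrative names, not settled interfaces:

<pre><code class="go">
// Sketch of the crunch-dispatch-slurm submit step.  Helper and flag names
// (submitContainer, arv-container --container-uuid) are illustrative only.
package main

import (
	"bytes"
	"fmt"
	"log"
	"os"
	"os/exec"
)

// submitContainer hands one claimed container to slurm.  The batch script
// just execs arv-container on whatever compute node slurm picks.
func submitContainer(containerUUID string) error {
	cmd := exec.Command("sbatch", "--immediate", "--share",
		"--job-name="+containerUUID,
		"--export=ARVADOS_API_HOST,ARVADOS_API_TOKEN")
	// The dispatcher's own environment supplies the credentials that
	// --export forwards to the compute node.
	cmd.Env = os.Environ()
	script := fmt.Sprintf("#!/bin/sh\nexec arv-container --container-uuid %s\n", containerUUID)
	cmd.Stdin = bytes.NewBufferString(script)
	out, err := cmd.CombinedOutput()
	if err != nil {
		return fmt.Errorf("sbatch failed: %v (output: %q)", err, out)
	}
	log.Printf("submitted %s: %s", containerUUID, out)
	return nil
}

func main() {
	// Placeholder UUID; in the real loop this comes from the containers table.
	if err := submitContainer("zzzzz-xxxxx-000000000000001"); err != nil {
		log.Fatal(err)
	}
}
</code></pre>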

 Decisions to make: 

Two new components: arv-container and crunch-dispatch-slurm.

What language to write arv-container in?

* In Python:
** Will require the Python SDK to be installed on every compute node
** Could integrate the FUSE mount directly instead of requiring an arv-mount subprocess, which would make it easier to detect Keep errors
** Could talk to the docker daemon directly instead of using the docker CLI
* In Go:
** Will still require the Python SDK for arv-mount
** However, could integrate with a future Go FUSE mount; it would then be possible to eventually eliminate Python dependencies, making arv-container much more lightweight

What language to write crunch-dispatch-slurm in?

* In Python:
** Could be integrated with node manager?
* In Go:
** Using websockets to listen for container events (new containers added, priority changes) will require a Go websocket client

How to decide and communicate requests for new cloud nodes?

* Use slurm's elastic computing support: http://slurm.schedmd.com/elastic_computing.html
** Potentially a non-trivial amount of work to configure slurm
* Compute a wishlist and send it to node manager (see the sketch after this list)
** Need to predict how slurm will allocate jobs to nodes, or risk under- or over-allocating nodes
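A sketch of the wishlist option, assuming the dispatcher can see each queued container's runtime constraints; @Container@, @NodeSize@ and the size table are placeholders, since the node manager protocol is still TBD:

<pre><code class="go">
// Sketch of turning queued containers into a node wishlist.  The types and
// size table are illustrative, not the real node manager interface.
package main

import "fmt"

type Container struct {
	UUID  string
	VCPUs int
	RAM   int64 // bytes
}

type NodeSize struct {
	Name  string
	VCPUs int
	RAM   int64
}

// smallestFit returns the first (assumed cheapest) size that satisfies the
// container's runtime constraints, or false if nothing is big enough.
func smallestFit(c Container, sizes []NodeSize) (NodeSize, bool) {
	for _, s := range sizes {
		if s.VCPUs >= c.VCPUs && s.RAM >= c.RAM {
			return s, true
		}
	}
	return NodeSize{}, false
}

// wishlist counts how many nodes of each size the queue calls for.
func wishlist(queue []Container, sizes []NodeSize) map[string]int {
	want := map[string]int{}
	for _, c := range queue {
		if s, ok := smallestFit(c, sizes); ok {
			want[s.Name]++
		}
	}
	return want
}

func main() {
	sizes := []NodeSize{ // assumed sorted from cheapest to most expensive
		{"small", 2, 4 << 30},
		{"large", 16, 64 << 30},
	}
	queue := []Container{
		{"zzzzz-xxxxx-000000000000001", 1, 2 << 30},
		{"zzzzz-xxxxx-000000000000002", 8, 32 << 30},
	}
	fmt.Println(wishlist(queue, sizes)) // e.g. map[large:1 small:1]
}
</code></pre>

This is exactly where the caveat above bites: the count is only right if slurm packs jobs onto nodes the same way the wishlist assumed.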

Retry

* Slurm can re-queue jobs on node failure automatically
* Retry decisions can instead be made at the API server, which can detect and retry on a wider class of infrastructure failures

h2. Examples

slurm batch mode

* Use "sinfo" to determine whether it is possible to run the container
* Submit a batch job to the queue: "echo crunch-run --job {uuid} | sbatch -N1"
* When container priority changes, use scontrol and scancel to propagate the change
* Use strigger to run a cleanup script when a container exits

standalone worker

* Inspect /proc/meminfo, /proc/cpuinfo, "docker ps", etc. to determine local capacity (see the sketch after this list)
* Invoke crunch-run as a child process (or perhaps a detached daemon process)
* Signal crunch-run to stop if container priority changes to zero
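For the standalone worker, a sketch of the local capacity check; parsing only MemTotal and using @runtime.NumCPU()@ is a simplification of the /proc inspection suggested above:

<pre><code class="go">
// Sketch of how a standalone worker might size itself up, reading
// /proc/meminfo as suggested above.
package main

import (
	"bufio"
	"fmt"
	"os"
	"runtime"
	"strconv"
	"strings"
)

// memTotalKB parses the MemTotal line from /proc/meminfo.
func memTotalKB() (int64, error) {
	f, err := os.Open("/proc/meminfo")
	if err != nil {
		return 0, err
	}
	defer f.Close()
	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		// Expected format: "MemTotal:  16337608 kB"
		fields := strings.Fields(scanner.Text())
		if len(fields) >= 2 && fields[0] == "MemTotal:" {
			return strconv.ParseInt(fields[1], 10, 64)
		}
	}
	return 0, fmt.Errorf("MemTotal not found")
}

func main() {
	kb, err := memTotalKB()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	// runtime.NumCPU() stands in for parsing /proc/cpuinfo.
	fmt.Printf("local capacity: %d cores, %d MiB RAM\n", runtime.NumCPU(), kb/1024)
}
</code></pre>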

 h1. Notes 

 Some discussion notes from https://dev.arvados.org/issues/6429 and https://dev.arvados.org/issues/6518 

 h2. Interaction with node manager 

Suggest writing the crunch 2 job dispatcher as a new set of actors in node manager.

 This would enable us to solve the question of communication between the scheduler and cloud node management (#6520). 

Node manager already has a lot of the framework we will want, like concurrency (we can have one actor per job) and a configuration system.

Different schedulers (slurm, sge, kubernetes) can be implemented as modules, similarly to how different cloud providers are supported now.

If we don't actually combine them, we should at least move the logic of deciding how many and what size nodes to run from the node manager to the dispatcher; the dispatcher can then communicate its wishlist to node manager.

h2. Arvados API support

Each dispatch process has an Arvados API token that allows it to see queued containers.

* No two dispatch processes can run at the same time with the same token. One way to achieve this is to make a user record for each dispatch service.

Container APIs relevant to a dispatch program:

* List Queued containers (might be a subset of all Queued containers)
* List containers with state=Locked or state=Running associated with the current token
* Receive an event when a container is created or modified and its state is Queued (it might become runnable)
* Change state Queued->Locked
* Change state Locked->Queued
* Change state Locked->Running
* Change state Running->Complete
* Receive an event when priority changes
* Receive an event when state changes to Complete
* Create a unique API token to pass to crunch-run (expires when the container stops)
* Create events/logs
** Decided not to run this container (e.g., no node with those resources)
** Decided to run this container
** Lock failed
** Dispatched to crunch-run
** Cleaned up crashed crunch-run (the lower-level scheduler indicates the job finished, but crunch-run didn't leave the container in a final state)
** Cleaned up abandoned container (the container belongs to this process, but dispatch and the lower-level scheduler don't know about it)

h2. Interaction with API

More ideas:

Have a "dispatchers" table. Dispatcher processes are responsible for pinging the API server, similar to how nodes do, to show they are alive.

A dispatcher claims a container by setting its "dispatcher" field to the dispatcher's UUID. This field can only be set once, which locks the record so that only that dispatcher can update it.

If a dispatcher stops pinging, the containers it has claimed should be marked as TempFail.

Dispatchers should be able to annotate containers (preferably through links), for example "I can't run this because I don't have any nodes with 40 GiB of RAM".
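A sketch of what the proposed claim call might look like from a dispatcher written in Go; the "dispatcher" field and the containers endpoint are the proposal above, not an existing API, and the request encoding is assumed:

<pre><code class="go">
// Sketch of the proposed "claim by setting the dispatcher field" call.
// Endpoint, field name and request format are assumptions from the notes above.
package main

import (
	"fmt"
	"net/http"
	"net/url"
	"os"
	"strings"
)

// claimContainer asks the API server to set the container's "dispatcher"
// field to this dispatcher's UUID.  The server is expected to reject the
// update if the field is already set (it can only be set once).
func claimContainer(apiHost, token, containerUUID, dispatcherUUID string) error {
	form := url.Values{}
	form.Set("container[dispatcher]", dispatcherUUID)
	req, err := http.NewRequest("PUT",
		"https://"+apiHost+"/arvados/v1/containers/"+containerUUID,
		strings.NewReader(form.Encode()))
	if err != nil {
		return err
	}
	req.Header.Set("Authorization", "OAuth2 "+token)
	req.Header.Set("Content-Type", "application/x-www-form-urlencoded")
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("claim rejected: %s", resp.Status)
	}
	return nil
}

func main() {
	err := claimContainer(os.Getenv("ARVADOS_API_HOST"), os.Getenv("ARVADOS_API_TOKEN"),
		"zzzzz-xxxxx-000000000000001", "zzzzz-yyyyy-000000000000001")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
</code></pre>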

 h2. Retry 

 How do we handle failure? Is the dispatcher required to retry containers that fail, or is the dispatcher a "best effort" service and the API decides to retry by scheduling a new container? 

Currently the container_uuid field only holds a single container at a time. If the API schedules a new container, does that mean any container requests associated with the old container get updated to point at the new one?

If the container_uuid field only holds one container at a time, and containers don't link back to the container requests that created them, then we don't have a record of past attempts to fulfill the request. This means we don't have anything to check against container_count_max. A few possible solutions:

* Make container_uuid an array of the containers created to fulfill a given container request (this introduces complexity)
* Decrement container_count_max on the request when submitting a new container
* Compute a content address of the container request and discover containers with that content address. This would conflict with "no reuse" or "impure" requests, which are supposed to ignore past execution history. Could solve this by salting the content address with a timestamp; "no reuse" containers would never be reusable, which might be fine.
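An illustration of the content-address idea: hash the request fields that determine execution, and add a timestamp salt for "no reuse" requests. The field selection and encoding here are assumptions, not a spec:

<pre><code class="go">
// Illustrative content address for a container request: hash the normalized
// fields, with an optional salt so "no reuse" requests never match.
package main

import (
	"crypto/sha256"
	"encoding/json"
	"fmt"
	"time"
)

type ContainerRequest struct {
	Command            []string          `json:"command"`
	ContainerImage     string            `json:"container_image"`
	Environment        map[string]string `json:"environment"`
	RuntimeConstraints map[string]int64  `json:"runtime_constraints"`
}

// contentAddress hashes the fields that determine what the container will do.
// Struct field order is fixed and JSON map keys are sorted, so the encoding
// is deterministic.
func contentAddress(cr ContainerRequest, salt string) string {
	buf, _ := json.Marshal(cr)
	sum := sha256.Sum256(append(buf, []byte(salt)...))
	return fmt.Sprintf("%x", sum)
}

func main() {
	cr := ContainerRequest{
		Command:        []string{"echo", "hello"},
		ContainerImage: "debian:8",
	}
	fmt.Println("reusable:", contentAddress(cr, ""))
	fmt.Println("no reuse:", contentAddress(cr, time.Now().String()))
}
</code></pre>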

I think we should distinguish between infrastructure failure and task failure by distinguishing between "TempFail" and "PermFail" in the container state. "TempFail" shouldn't count against the container_count_max count; alternately, we only honor container_count_max for "TempFail" tasks and don't retry "PermFail".

 Ideally, "TempFail" containers should retry forever, but with a unique API token backoff. One way to pass do the backoff is to crunch-run (expires when schedule the container stops) 
 * Create events/logs 
 ** Decided not to run this container 
 ** Decided at a specific time in the future. 
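A sketch of that backoff, doubling the delay per TempFail up to a cap; the exact schedule is illustrative:

<pre><code class="go">
// Sketch of scheduling a TempFail retry at a specific time in the future.
package main

import (
	"fmt"
	"time"
)

// nextRetryAt returns when a container that has TempFailed `attempts` times
// should be queued to run again: 1, 2, 4, 8 ... minutes, capped at a day.
func nextRetryAt(attempts int, now time.Time) time.Time {
	delay := time.Minute << uint(attempts)
	if max := 24 * time.Hour; delay > max {
		delay = max
	}
	return now.Add(delay)
}

func main() {
	now := time.Now()
	for attempts := 0; attempts < 5; attempts++ {
		fmt.Printf("attempt %d -> retry at %s\n",
			attempts+1, nextRetryAt(attempts, now).Format(time.RFC3339))
	}
}
</code></pre>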

 h2. Scheduling 

Having a field specifying "wait until time X to run this container" would be generally useful for cron-style tasks.

 h2. State changes 

 Container request: 

|_.State|_.priority == 0|_.priority > 0|
|Uncommitted|nothing|nothing|
|Committed|set container priority to max(all request priorities)|set container priority to max(all request priorities)|
|Final|invalid|invalid|

When a container request goes to the "committed" state, it is assigned a container.

 Container: 

|_.State|_.priority == 0|_.priority > 0|
|Queued|Desire to change state to "Cancelled"|Desire to change state to "Running"|
|Running|Desire to change state to "Cancelled"|nothing|
 |Complete|invalid|invalid| 
 |Cancelled|invalid|invalid| 

When the container goes to either the "complete" or "cancelled" state, any associated container requests go to the "final" state.
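The Container table above, restated as a decision function; the state names and action strings follow the table, and the function is illustrative rather than an implementation:

<pre><code class="go">
// The Container priority/state table written out as a decision function.
package main

import "fmt"

// desiredAction says what the system should want to do to a container,
// given its current state and the max priority of its committed requests.
func desiredAction(state string, priority int) string {
	switch state {
	case "Queued":
		if priority == 0 {
			return `change state to "Cancelled"`
		}
		return `change state to "Running"`
	case "Running":
		if priority == 0 {
			return `change state to "Cancelled"`
		}
		return "nothing"
	case "Complete", "Cancelled":
		return "invalid" // priority changes no longer apply
	}
	return "unknown state"
}

func main() {
	fmt.Println(desiredAction("Queued", 0))   // change state to "Cancelled"
	fmt.Println(desiredAction("Running", 5))  // nothing
	fmt.Println(desiredAction("Complete", 1)) // invalid
}
</code></pre>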

h2. Non-responsibilities

Dispatch doesn't retry failed containers. If something needs to be reattempted, a new container will appear in the queue.

Dispatch doesn't fail a container that it can't run. It doesn't know whether other dispatchers will be able to run it.

h2. Claims

Have a new field "claimed_by_uuid" on the Container record. A queued container is claimed by a dispatcher process via an atomic "claim" operation. A claim can be released if the container is still in the Queued state.

The container record cannot be updated by anyone except the owner of the claim.
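A sketch of how the claim could be made atomic with a single conditional update, assuming direct database access; the @containers@ table and @claimed_by_uuid@ column are the proposal above, not an existing schema:

<pre><code class="go">
// Sketch of the claimed_by_uuid idea as one atomic UPDATE.
// Table and column names follow the proposal above; they are not an existing schema.
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/lib/pq"
)

// claim succeeds only if the container is still Queued and unclaimed,
// so two dispatchers can never both win.
func claim(db *sql.DB, containerUUID, dispatcherUUID string) (bool, error) {
	res, err := db.Exec(
		`UPDATE containers
		    SET claimed_by_uuid = $1
		  WHERE uuid = $2 AND state = 'Queued' AND claimed_by_uuid IS NULL`,
		dispatcherUUID, containerUUID)
	if err != nil {
		return false, err
	}
	n, err := res.RowsAffected()
	return n == 1, err
}

func main() {
	db, err := sql.Open("postgres", "dbname=arvados_development sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	ok, err := claim(db, "zzzzz-xxxxx-000000000000001", "zzzzz-yyyyy-000000000000001")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("claimed:", ok)
}
</code></pre>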

 h2. Additional notes 

 (see also #6429 and #6518) 

 Using websockets to listen for container events (new containers added, priority changes) will benefit from some Go SDK support.
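For example, a Go dispatcher might subscribe roughly like this, using the third-party gorilla/websocket package; the subscribe message and filter syntax shown are approximate, not a documented contract:

<pre><code class="go">
// Sketch of listening for container events over the Arvados websocket
// endpoint.  Message and filter formats here are approximations.
package main

import (
	"fmt"
	"log"
	"os"

	"github.com/gorilla/websocket"
)

func main() {
	url := "wss://" + os.Getenv("ARVADOS_API_HOST") + "/websocket?api_token=" + os.Getenv("ARVADOS_API_TOKEN")
	ws, _, err := websocket.DefaultDialer.Dial(url, nil)
	if err != nil {
		log.Fatal(err)
	}
	defer ws.Close()

	// Ask for events about container records; new containers and priority
	// changes would arrive as create/update events.
	sub := map[string]interface{}{
		"method":  "subscribe",
		"filters": [][]interface{}{{"object_kind", "=", "arvados#container"}},
	}
	if err := ws.WriteJSON(sub); err != nil {
		log.Fatal(err)
	}

	for {
		var event map[string]interface{}
		if err := ws.ReadJSON(&event); err != nil {
			log.Fatal(err)
		}
		fmt.Printf("event: %v\n", event)
	}
}
</code></pre>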