Feature #6518

[Crunch] [Crunch2] Dispatch containers via slurm

Added by Tom Clegg almost 5 years ago. Updated about 4 years ago.

Status: Resolved
Priority: Normal
Assigned To: Radhika Chippada
Category: Crunch
Target version:
Start date: 07/08/2015
Due date:
% Done: 100%
Estimated time: (Total: 0.00 h)
Story points: 1.0
Release:
Release relationship: Auto

Description

When containers appear in the queue, use SLURM to execute them on worker nodes.

For now, the queue is arvados.v1.containers.queue (much like the Crunch1 job queue).

From Crunch2 dispatch:

slurm batch mode
  • Use "sinfo" to determine whether it is possible to run the container
  • Submit a batch job to the queue: "echo crunch-run --job {uuid} | sbatch -N1"
  • When container priority changes, use scontrol and scancel to propagate changes to slurm
  • Use strigger to run a cleanup script when a container exits

The cleanup script just has to deal with cases like the node dying before crunch-run has a chance to update the container record to state="Complete"


Subtasks

Task #8474: Review 6518-crunch2-dispatch-slurm (Resolved, Peter Amstutz)

Task #8522: Implement crunch-dispatch-slurm (Resolved, Peter Amstutz)

Task #8608: Review tests branch: 6518-crunch2-dispatch-slurm-tests (Resolved, Peter Amstutz)

Task #8607: Add tests (Resolved, Radhika Chippada)


Related issues

Related to Arvados - Story #6282: [Crunch] Write stories for implementation of Crunch v2 (Resolved, 06/23/2015)

Related to Arvados - Feature #7816: [Crunch2] Execute minimal container spec with logging (Resolved, 11/17/2015)

Related to Arvados - Feature #8128: [Crunch2] API support for crunch-dispatch (Resolved, 04/28/2016)

Blocked by Arvados - Story #6429: [API] [Crunch2] Implement "containers" and "container requests" tables, models and controllers (Resolved, 12/03/2015)

Associated revisions

Revision e407a1d4 (diff)
Added by Peter Amstutz about 4 years ago

Run slurmctld and slurmd inside arvbox. refs #6518

Revision 2669dd05 (diff)
Added by Peter Amstutz about 4 years ago

Run slurmctld and slurmd inside arvbox. refs #6518

Revision 7bb66fca
Added by Peter Amstutz about 4 years ago

Merge branch '6518-crunch2-dispatch-slurm' closes #6518

History

#1 Updated by Tom Clegg almost 5 years ago

  • Tracker changed from Bug to Feature

#2 Updated by Brett Smith over 4 years ago

  • Target version set to Arvados Future Sprints

#3 Updated by Peter Amstutz over 4 years ago

Suggest writing the Crunch 2 job dispatcher as a new set of actors in node manager.

This would enable us to solve the question of communication between the scheduler and cloud node management (#6520).

Node manager already has a lot of the framework we will want like concurrency (can have one actor per job) and a configuration system.

Different schedulers (slurm, sge, kubernetes) can be implemented as modules similarly to how different cloud providers are supported now.

#4 Updated by Peter Amstutz over 4 years ago

More ideas:

Have a "dispatchers" table. Dispatcher processes are responsible for pinging the API server similar to how it is done for nodes to show they are alive.

A dispatcher claims a container by setting the "dispatcher" field to its UUID. This field can only be set once, and that locks the record so that only the claiming dispatcher can update it.

If a dispatcher stops pinging, the containers it has claimed should be marked as TempFail.

Dispatchers should be able to annotate containers (preferably through links) for example "I can't run this because I don't have any nodes with 40 GiB of RAM".

#5 Updated by Peter Amstutz over 4 years ago

If we go with the architecture described in #8001, that will be a prerequisite.

#6 Updated by Peter Amstutz over 4 years ago

  • Description updated (diff)

#7 Updated by Peter Amstutz over 4 years ago

#7816 is now the story for actually running containers

#8 Updated by Brett Smith over 4 years ago

  • Target version deleted (Arvados Future Sprints)
  • Release set to 11

#9 Updated by Brett Smith over 4 years ago

  • Story points set to 3.0

#10 Updated by Peter Amstutz about 4 years ago

I think we can narrow this down to a 1 point story that just submits to "sbatch" and possibly checks "squeue" for status updates.

#11 Updated by Peter Amstutz about 4 years ago

  • Story points changed from 3.0 to 1.0

#12 Updated by Peter Amstutz about 4 years ago

  • Target version set to 2016-03-02 sprint

#13 Updated by Peter Amstutz about 4 years ago

  • Assigned To set to Peter Amstutz

#14 Updated by Tom Clegg about 4 years ago

  • Description updated (diff)

#15 Updated by Tom Clegg about 4 years ago

  • Description updated (diff)

#16 Updated by Radhika Chippada about 4 years ago

Review feedback for branch 6518-crunch2-dispatch-slurm

crunch-dispatch-slurm.go

  • Comment for runQueuedContainers says “Invoke dispatchLocal for each ticker cycle”. Please update to say “Invoke dispatchSlurm …” instead
  • It would greatly improve readability if camel case were used consistently for names such as submiterr and stdinerr, as is already done for updateErr. Several variables could be updated this way.
  • func strigger: can you please rename it to say what it does, e.g. “set up trigger for when job finishes”?
  • comment for func run (line 225): please update it to say submit batch command etc. The current comment is not quite correct (it applies to crunch-dispatch-local)
  • can you please add comments to the submit and strigger funcs
  • This comment “#uuid=$(squeue --jobs=$jobid --states=all --format=%j --noheader)” in the shell script seems to be out of sync?

#17 Updated by Peter Amstutz about 4 years ago

  • Status changed from New to In Progress

#18 Updated by Brett Smith about 4 years ago

  • Assigned To changed from Peter Amstutz to Radhika Chippada
  • Target version changed from 2016-03-02 sprint to 2016-03-16 sprint

#19 Updated by Radhika Chippada about 4 years ago

Added tests in branch 6518-crunch2-dispatch-slurm-tests, derived from 6518-crunch2-dispatch-slurm at bf3a2814843a8f7a78592e3fb4c629fc9f4819b9

#20 Updated by Peter Amstutz about 4 years ago

  • Status changed from In Progress to Resolved
  • % Done changed from 50 to 100

Applied in changeset arvados|commit:7bb66fca9371232cc32dd6b365ceb33e926eb0e7.
