Tom Clegg, 07/03/2018 07:23 PM

SLURM integration

Currently Arvados uses SLURM to dispatch containers to worker hosts. (Related: Container dispatch)

Arvados supports a range of SLURM versions and configurations, but some of them need special attention, as described below.

Limited "nice" values (SLURM 15)

Background: crunch-dispatch-slurm needs to adjust SLURM job priorities so that job priority order matches container priority order. It uses SLURM's "nice" feature to do this. This is preferable to adjusting priority directly because it doesn't require crunch-dispatch-slurm to have SLURM administrator privileges.
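As a rough illustration of the idea (not the actual crunch-dispatch-slurm implementation), the sketch below ranks containers by priority and gives each SLURM job a nice value proportional to its rank, then builds the corresponding scontrol argument list. The function names, the `spread` parameter, and the sample container UUIDs are hypothetical; the argument form mirrors the scontrol command that appears in the dispatcher log on this page.

```python
def nice_values(container_priorities, spread=10):
    """Map each container UUID to a nice value that preserves priority order.

    Hypothetical sketch: the highest-priority container gets nice 0 and each
    following container gets a nice value `spread` higher. A higher nice
    value means a lower SLURM job priority.
    """
    ranked = sorted(container_priorities.items(),
                    key=lambda item: item[1], reverse=True)
    return {uuid: rank * spread for rank, (uuid, _) in enumerate(ranked)}


def scontrol_args(job_name, nice):
    """Build the argument list for adjusting one job's nice value."""
    return ["scontrol", "update", f"JobName={job_name}", f"Nice={nice}"]


# Made-up container UUIDs and priorities, for illustration only.
queue = {
    "zzzzz-dz642-aaaaaaaaaaaaaaa": 500,
    "zzzzz-dz642-bbbbbbbbbbbbbbb": 900,
    "zzzzz-dz642-ccccccccccccccc": 100,
}

for uuid, nice in nice_values(queue).items():
    print(scontrol_args(uuid, nice))
```

Because only nice values are changed, a command of this form can be run by the job's owner and does not need SLURM administrator privileges.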

Older versions of SLURM (including version 15, in Ubuntu 16.04) do not accept nice values ≥10000. When many SLURM jobs are being submitted and containers run for a long time, this limitation can prevent crunch-dispatch-slurm from achieving the desired priority order. When this happens, messages like the following appear in the crunch-dispatch-slurm logs:

2018/04/25 20:12:39 "/usr/bin/scontrol" ["scontrol" "update" "JobName=zzzzz-dz642-abcdefghijklmno" "Nice=12052"]: "scontrol: error: Invalid nice value, must be between -10000 and 10000"
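To check whether a site is affected, one option is to scan the dispatcher logs for this error. The following is a minimal sketch, assuming the log lines look like the single example above; the pattern and the function name `rejected_nice_updates` are illustrative, and other SLURM or dispatcher versions may word the message differently.

```python
import re

# Pattern derived from the one log line quoted on this page (an assumption:
# other versions may format the message differently).
REJECTED_NICE = re.compile(
    r'"JobName=(?P<job>[^"]+)" "Nice=(?P<nice>-?\d+)"\]: '
    r'"scontrol: error: Invalid nice value'
)


def rejected_nice_updates(lines):
    """Return (job name, requested nice value) for each rejected update."""
    found = []
    for line in lines:
        match = REJECTED_NICE.search(line)
        if match:
            found.append((match.group("job"), int(match.group("nice"))))
    return found


sample = (
    '2018/04/25 20:12:39 "/usr/bin/scontrol" ["scontrol" "update" '
    '"JobName=zzzzz-dz642-abcdefghijklmno" "Nice=12052"]: '
    '"scontrol: error: Invalid nice value, must be between -10000 and 10000"'
)

print(rejected_nice_updates([sample]))
# prints [('zzzzz-dz642-abcdefghijklmno', 12052)]
```

The requested nice values returned here show how far beyond the accepted range the dispatcher is trying to go.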

In some cases, this can be avoided by reducing PrioritySpread in the crunch-dispatch-slurm configuration file. See