Tom Clegg, 07/03/2018 07:23 PM

h1. SLURM integration

Currently Arvados uses SLURM to dispatch containers to worker hosts. (Related: [[Container dispatch]])

Arvados supports a range of SLURM versions and configurations, but some of them have known limitations, described below.

h2. Limited "nice" values (SLURM 15)

Background: crunch-dispatch-slurm needs to adjust SLURM job priorities so that job priority order matches container priority order. It uses SLURM's "nice" feature to do this. This is preferable to adjusting priority directly because it doesn't require crunch-dispatch-slurm to have SLURM administrator privileges.
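
To illustrate the idea, here is a minimal sketch of mapping container priorities to nice values. It is not the algorithm crunch-dispatch-slurm actually uses. It assumes all jobs otherwise have equal SLURM priority, that a larger nice value lowers a job's priority, and that a fixed gap between neighbouring jobs is wanted. The names @nice_values@, @scontrol_args@ and @spread@ are hypothetical. The @scontrol@ argument layout follows the log line shown further down this page.

```python
def nice_values(container_priorities, spread=10):
    """Map container UUID -> nice value (illustrative only).

    The highest-priority container gets nice 0; each lower-priority
    container gets a nice value `spread` larger than the one before it.
    `spread` is a hypothetical parameter for this sketch.
    """
    ranked = sorted(container_priorities,
                    key=container_priorities.get,
                    reverse=True)
    return {uuid: rank * spread for rank, uuid in enumerate(ranked)}


def scontrol_args(job_name, nice):
    """Build the argument list for adjusting one job's nice value."""
    return ["scontrol", "update", "JobName=" + job_name, "Nice=" + str(nice)]


if __name__ == "__main__":
    # Hypothetical container UUIDs and priorities.
    wanted = nice_values({"a": 500, "b": 900, "c": 100})
    for uuid, nice in wanted.items():
        print(scontrol_args(uuid, nice))
```

In a scheme like this, the largest nice value grows with the number of queued jobs multiplied by the gap between them, which is why a cap on nice values matters.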

Older versions of SLURM (including version 15, in Ubuntu 16.04) do not accept nice values ≥10000. When many SLURM jobs are being submitted and containers run for a long time, this limitation can prevent crunch-dispatch-slurm from achieving the desired priority order. When that happens, messages like this appear in the crunch-dispatch-slurm logs:

<pre>
2018/04/25 20:12:39 "/usr/bin/scontrol" ["scontrol" "update" "JobName=zzzzz-dz642-abcdefghijklmno" "Nice=12052"]: "scontrol: error: Invalid nice value, must be between -10000 and 10000"
</pre>
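
To check whether a site is affected, the logs can be scanned for this error. The following is a rough sketch. It assumes log lines have exactly the shape of the sample above, which is the only format known here, and the function name @rejected_nice_updates@ is hypothetical.

```python
import re

# Pattern derived only from the sample log line on this page; other
# versions of crunch-dispatch-slurm or scontrol may word this differently.
NICE_ERROR = re.compile(
    r'"JobName=(?P<job>[^"]+)" "Nice=(?P<nice>-?\d+)"\]: '
    r'"scontrol: error: Invalid nice value'
)


def rejected_nice_updates(log_lines):
    """Return (job_name, nice) pairs for updates that SLURM rejected."""
    rejected = []
    for line in log_lines:
        match = NICE_ERROR.search(line)
        if match:
            rejected.append((match.group("job"), int(match.group("nice"))))
    return rejected


if __name__ == "__main__":
    import sys
    # Usage: pipe the crunch-dispatch-slurm log into this script.
    for job, nice in rejected_nice_updates(sys.stdin):
        print(job, nice)
```

Any output indicates that some requested nice values were outside the range this SLURM version accepts.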

In some cases, this can be avoided by reducing PrioritySpread in the crunch-dispatch-slurm configuration file. See