Bug #13991
open
crunch-dispatch-slurm does not warn when slurm MaxJobCount reached
Added by Joshua Randall over 6 years ago.
Updated 10 months ago.
Description
SLURM's default MaxJobCount is 10000. MaxJobCount is not specified in the example slurm.conf suggested by the Arvados install docs (https://doc.arvados.org/install/crunch2-slurm/install-slurm.html). Perhaps it should be, to make it clear that this parameter can matter to Arvados.
When the MaxJobCount limit is reached, I would have expected crunch-dispatch-slurm to log some kind of warning indicating that it was unable to queue all of the jobs because of the limit, but I did not see any message to that effect.
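For reference, the relevant setting in slurm.conf looks like the line below; the value shown is just SLURM's documented default, used here only as an example:
MaxJobCount=10000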
- Target version set to To Be Groomed
Tested on 9tee4 with MaxJobCount=2.
$ sbatch -N1 <(printf '#!/bin/sh\nsleep 8000\n')
sbatch: error: Slurm temporarily unable to accept job, sleeping and retrying.
^C
It would be more convenient to get a non-zero exit code, so we don't have to scrape stderr while sbatch is still running.
There's an sbatch --immediate option that fails if the allocation can't be granted, but that's not what we want either. We want to fail only if the job can't be queued.
So it seems the solution is for crunch-dispatch-slurm to monitor stderr while sbatch is running, and if that message appears:
- Log a suggestion to increase MaxJobCount
- Avoid starting more sbatch processes until this one exits (incidentally, we only recently stopped serializing all sbatch invocations).
We should also log any other unexpected messages from sbatch, to make other similar problems easier to diagnose.
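A minimal sketch of what such a stderr watcher could look like in Go. The matched message text comes from the test above; the function name runSbatch and the overall structure are illustrative assumptions, not the actual crunch-dispatch-slurm code:

package main

import (
	"bufio"
	"log"
	"os/exec"
	"strings"
)

// runSbatch starts sbatch and watches its stderr while it runs, so the
// dispatcher can react to the "temporarily unable to accept job" retry
// message before sbatch eventually exits.
func runSbatch(args ...string) error {
	cmd := exec.Command("sbatch", args...)
	stderr, err := cmd.StderrPipe()
	if err != nil {
		return err
	}
	if err := cmd.Start(); err != nil {
		return err
	}
	scanner := bufio.NewScanner(stderr)
	for scanner.Scan() {
		line := scanner.Text()
		if strings.Contains(line, "Slurm temporarily unable to accept job") {
			// Log a hint so the operator knows which limit to raise.
			log.Printf("sbatch is retrying (%q); consider increasing MaxJobCount in slurm.conf", line)
			// At this point the dispatcher could also signal its scheduler
			// loop to stop launching new sbatch processes until this one
			// exits.
		} else if line != "" {
			// Surface any other unexpected sbatch messages to aid diagnosis.
			log.Printf("sbatch stderr: %s", line)
		}
	}
	return cmd.Wait()
}

func main() {
	// Example invocation; "job.sh" is a placeholder batch script.
	if err := runSbatch("job.sh"); err != nil {
		log.Printf("sbatch failed: %v", err)
	}
}

Scanning stderr line by line means the dispatcher notices the retry message as soon as sbatch prints it, rather than only after the process exits.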
- Target version deleted (To Be Groomed)
- Target version set to Future