crunch-dispatch-slurm does not warn when slurm MaxJobCount reached
SLURM has a default MaxJobCount of 10000. MaxJobCount is not specified in the example slurm.conf suggested by Arvados install docs (https://doc.arvados.org/install/crunch2-slurm/install-slurm.html). Perhaps it should be, so it is clear that this parameter might matter to Arvados.
When the MaxJobCount limit was hit, I expected crunch-dispatch-slurm to log a warning indicating that it could not queue all of the jobs because of the limit. I did not see any log message to that effect.
#2 Updated by Tom Clegg about 2 years ago
Tested on 9tee4 with MaxJobCount=2.
$ sbatch -N1 <(printf '#!/bin/sh\nsleep 8000\n')
sbatch: error: Slurm temporarily unable to accept job, sleeping and retrying.
^C
It would be more convenient to get a non-zero exit code, so we don't have to scrape stderr while sbatch is still running.
There is an sbatch --immediate option that fails if the allocation can't be granted, but that's not what we want either. We want to fail only if the job can't be queued.
When sbatch reports this error, crunch-dispatch-slurm should:
- Log a suggestion to increase MaxJobCount
- Avoid starting more sbatch processes until this one exits (incidentally, we only recently stopped serializing all sbatch invocations).
We should also log any other unexpected messages from sbatch, to make other similar problems easier to diagnose.
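As a rough illustration of the logging behavior proposed above, here is a minimal Go sketch. It is not the crunch-dispatch-slurm implementation. The names isQueueFull and watchStderr are hypothetical, and the matched substring is taken only from the sbatch output shown in the test above, so it may differ between Slurm versions. The sketch reads sbatch's stderr line by line while sbatch is still running, logs a MaxJobCount suggestion when the "temporarily unable to accept job" message appears, and logs any other non-empty line as unexpected.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"log"
	"strings"
)

// queueFullMsg is the text sbatch printed in the test above when
// MaxJobCount was reached. Assumption: other Slurm versions may word it differently.
const queueFullMsg = "Slurm temporarily unable to accept job"

// isQueueFull (hypothetical helper) reports whether a line of sbatch
// stderr says that Slurm could not accept the job.
func isQueueFull(line string) bool {
	return strings.Contains(line, queueFullMsg)
}

// watchStderr (hypothetical helper) reads sbatch's stderr line by line,
// logging each message as it arrives. It returns true if any line
// indicated that the job could not be queued.
func watchStderr(r io.Reader, logf func(format string, args ...interface{})) bool {
	full := false
	scanner := bufio.NewScanner(r)
	for scanner.Scan() {
		line := scanner.Text()
		switch {
		case isQueueFull(line):
			full = true
			logf("sbatch could not queue the job; consider increasing MaxJobCount in slurm.conf: %q", line)
		case strings.TrimSpace(line) != "":
			logf("unexpected sbatch message: %q", line)
		}
	}
	return full
}

func main() {
	// Stand-in for the stderr pipe of a running sbatch process.
	stderr := strings.NewReader("sbatch: error: Slurm temporarily unable to accept job, sleeping and retrying.\n")
	fmt.Println(watchStderr(stderr, log.Printf))
}
```

In a real dispatcher the reader would presumably come from the StderrPipe of the exec.Cmd running sbatch, and the returned flag (or a channel signalled as soon as the message is seen) could be used to hold back further sbatch invocations until this one exits.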