Bug #13991

crunch-dispatch-slurm does not warn when slurm MaxJobCount reached

Added by Joshua Randall 4 months ago. Updated 4 months ago.

Status: New
Priority: Normal
Assigned To: -
Category: Crunch
Target version:
Start date:
Due date:
% Done: 0%
Estimated time:
Story points: -

Description

SLURM has a default MaxJobCount of 10000. MaxJobCount is not specified in the example slurm.conf suggested by the Arvados install docs (https://doc.arvados.org/install/crunch2-slurm/install-slurm.html). Perhaps it should be, to make it clear that this parameter can matter to Arvados.
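For reference, a minimal sketch of what such a slurm.conf entry might look like (the value is purely illustrative, not a recommendation):

# slurm.conf (illustrative; SLURM's built-in default is 10000)
MaxJobCount=100000

The effective value on a running cluster can be checked with "scontrol show config | grep MaxJobCount".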

When the MaxJobCount limit is hit, I would have expected crunch-dispatch-slurm to log a warning indicating that it was unable to queue all of the jobs because of the limit, but I did not see any message to that effect.

History

#1 Updated by Tom Morris 4 months ago

  • Target version set to To Be Groomed

#2 Updated by Tom Clegg 4 months ago

Tested on 9tee4 with MaxJobCount=2.

$ sbatch -N1 <(printf '#!/bin/sh\nsleep 8000\n')
sbatch: error: Slurm temporarily unable to accept job, sleeping and retrying.
^C

It would be more convenient to get a non-zero exit code, so we don't have to scrape stderr while sbatch is still running.

There's an sbatch --immediate option that fails if the allocation can't be granted, but that's not what we want either. We want to fail only if the job can't be queued.

So it seems the solution is for crunch-dispatch-slurm to monitor stderr while sbatch is running, and if that message appears:
  • Log a suggestion to increase MaxJobCount
  • Avoid starting more sbatch processes until this one exits (incidentally, we only recently stopped serializing all sbatch invocations).

We should also log any other unexpected messages from sbatch, to make other similar problems easier to diagnose.
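A minimal sketch of that approach (names and structure are assumptions for illustration, not the actual crunch-dispatch-slurm code): run sbatch with a stderr pipe, scan it line by line while the command runs, flag the MaxJobCount-related message, and log anything else unexpected.

package main

import (
    "bufio"
    "log"
    "os/exec"
    "strings"
)

// runSbatch runs sbatch with the given arguments and watches its stderr while it runs.
func runSbatch(args ...string) error {
    cmd := exec.Command("sbatch", args...)
    stderr, err := cmd.StderrPipe()
    if err != nil {
        return err
    }
    if err := cmd.Start(); err != nil {
        return err
    }
    scanner := bufio.NewScanner(stderr)
    for scanner.Scan() {
        line := scanner.Text()
        if strings.Contains(line, "Slurm temporarily unable to accept job") {
            // This is the message seen when MaxJobCount is reached.
            log.Printf("sbatch is waiting to queue the job (%q); consider increasing MaxJobCount in slurm.conf", line)
            // The dispatcher could also hold off on starting more sbatch
            // processes until this one exits.
        } else {
            // Surface any other unexpected sbatch output to aid diagnosis.
            log.Printf("sbatch stderr: %s", line)
        }
    }
    return cmd.Wait()
}

func main() {
    // Example: submit a one-node job script (the path is illustrative).
    if err := runSbatch("-N1", "/tmp/sleep.sh"); err != nil {
        log.Fatal(err)
    }
}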
