Bug #9679
closed [Crunch2] Provide feedback when a container is submitted to slurm but does not run
Description
Problem
crunch-dispatch-slurm submits SLURM jobs using sbatch, which exits zero as soon as the job gets queued. After that, various problems can prevent crunch-run from updating the container state or emitting any logs:
- Disk full on compute node (cannot create slurm-*.out file)
- CWD not writable on compute node (ditto)
- crunch-run not installed on compute node
- crunch-run configuration problem
- firewall on compute node prevents crunch-run from connecting to Arvados
In cases like these, crunch-dispatch-slurm notices the job has disappeared from the SLURM queue, and releases the corresponding container back to Queued state. However, it has no idea why the container never got to the Running state, and it doesn't log anything.
This is hard for the user to understand: the container state just flaps between Queued and Locked.
This is hard for the administrator to fix: crunch-dispatch-slurm's logs just say that it's flapping between Queued and Locked, with no hints about how to fix it.
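The detection step described above (a job vanishing from the SLURM queue before its container reaches Running) could be sketched roughly as follows. This is a minimal illustration, not crunch-dispatch-slurm's actual code: the helper name `missingFromQueue` is hypothetical, and it assumes each SLURM job is named after its container UUID and that the queue listing has one job name per line (for example from `squeue --all --noheader --format=%j`).

```go
package main

import (
	"fmt"
	"strings"
)

// missingFromQueue returns the tracked container UUIDs that do not appear in
// squeueOutput. Assumption: squeueOutput holds one job name per line and each
// job is named after its container UUID.
func missingFromQueue(tracked []string, squeueOutput string) []string {
	queued := make(map[string]bool)
	for _, line := range strings.Split(squeueOutput, "\n") {
		if name := strings.TrimSpace(line); name != "" {
			queued[name] = true
		}
	}
	var missing []string
	for _, uuid := range tracked {
		if !queued[uuid] {
			missing = append(missing, uuid)
		}
	}
	return missing
}

func main() {
	// Placeholder UUIDs for illustration only.
	tracked := []string{"zzzzz-dz642-aaaaaaaaaaaaaaa", "zzzzz-dz642-bbbbbbbbbbbbbbb"}
	squeueOutput := "zzzzz-dz642-aaaaaaaaaaaaaaa\n"
	for _, uuid := range missingFromQueue(tracked, squeueOutput) {
		fmt.Printf("container %s: SLURM job left the queue before the container reached Running\n", uuid)
	}
}
```

Each UUID returned here is a point where the dispatcher could record why it is releasing the container, instead of silently putting it back to Queued.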
Proposed improvements
- Log a user-visible message (via arvados.v1.logs) when a job disappears from the SLURM queue and the container goes back to state=Queued
- Log a user-visible message when a job is attempted (locked) by any dispatcher.
Note: both of these messages already exist in the form of "container update" events, but the Workbench log viewer doesn't display them.
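As a rough sketch of what such a user-visible message might look like, the following builds a request body for a log record attached to the container. The field names (`object_uuid`, `event_type`, `properties.text`), the `"dispatch"` event type, and the helper `newRequeueLog` are assumptions made for illustration; the ticket does not specify the record layout, and the snippet only constructs the JSON rather than calling any Arvados client.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// logRecord is a hypothetical shape for a log entry attached to a container.
// The field names are assumptions, not taken from the ticket.
type logRecord struct {
	ObjectUUID string            `json:"object_uuid"`
	EventType  string            `json:"event_type"`
	Properties map[string]string `json:"properties"`
}

// newRequeueLog builds the message a dispatcher could emit when a container's
// SLURM job disappears and the container is released back to Queued.
func newRequeueLog(containerUUID, reason string) logRecord {
	return logRecord{
		ObjectUUID: containerUUID,
		EventType:  "dispatch", // assumed event type
		Properties: map[string]string{
			"text": "container released back to Queued: " + reason,
		},
	}
}

func main() {
	// Placeholder UUID for illustration only.
	rec := newRequeueLog("zzzzz-dz642-aaaaaaaaaaaaaaa", "SLURM job disappeared from the queue")
	body, err := json.Marshal(map[string]logRecord{"log": rec})
	if err != nil {
		panic(err)
	}
	fmt.Println(string(body))
}
```

A similar record could be built at lock time, so that the log viewer shows both the attempt and the release.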