Bug #9679
Updated by Tom Clegg over 8 years ago
h2. Problem

crunch-dispatch-slurm submits SLURM jobs using sbatch, which exits zero as soon as the job gets queued. After that, various problems can prevent crunch-run from updating the container state or emitting any logs:

* Disk full on compute node (cannot create slurm-*.out file)
* CWD not writable on compute node (ditto)
* crunch-run not installed on compute node
* crunch-run configuration problem
* firewall on compute node prevents crunch-run from connecting to Arvados

In cases like these, crunch-dispatch-slurm notices the job has disappeared from the SLURM queue, and releases the corresponding container back to Queued state. However, it has no idea why the container never got to the Running state, and it doesn't log anything.

This is hard for the user to understand: the container state just flaps between Queued and Locked.

This is hard for the administrator to fix: crunch-dispatch-slurm's logs just say that it's flapping between Queued and Locked, with no hints about how to fix it.

h2. Proposed improvements

Draft/ideas:

* Log a user-visible message (via arvados.v1.logs) when a job disappears from the SLURM queue and the container goes back to state=Queued (logs would be useful at other points too, like Queued→Locked)
* Log a user-visible message when a job is attempted (locked) by any dispatcher. Note: both of these messages already exist in the form of "container update" events, but the Workbench log viewer doesn't display them.
* If a single container gets sent to SLURM more than N times (over a period of at least M seconds) by a single crunch-dispatch-* process, give up and change state to Cancelled. This won't play well with multiple dispatchers, though: one dispatcher could starve all the others, then give up and cancel the container.
* If a single container gets sent to SLURM more than N times (over a period of at least M seconds) by a single crunch-dispatch-* process, don't cancel it, but stop trying to run it (and log a message to that effect).
* (API server) If a container has been Locked and returned to Queued state, and is more than M seconds old, cancel it.
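
To make the "more than N times over at least M seconds" rule concrete, here is a rough sketch of the per-dispatcher bookkeeping it implies. This is not existing crunch-dispatch-slurm code: @AttemptTracker@, its method names, and the decision to measure the period from the first to the most recent submission are all assumptions made for illustration. It is written in Python for brevity.

```python
from typing import Dict, List


class AttemptTracker:
    """Hypothetical per-dispatcher record of SLURM submissions per container.

    max_attempts is N and min_period is M (in seconds) from the proposal.
    """

    def __init__(self, max_attempts: int, min_period: float) -> None:
        self.max_attempts = max_attempts
        self.min_period = min_period
        self._attempts: Dict[str, List[float]] = {}

    def record_submit(self, uuid: str, now: float) -> None:
        """Note that the container was handed to SLURM at time `now`."""
        self._attempts.setdefault(uuid, []).append(now)

    def should_give_up(self, uuid: str) -> bool:
        """True once there are more than N submissions spanning >= M seconds."""
        times = self._attempts.get(uuid, [])
        if len(times) <= self.max_attempts:
            return False
        # Assumption: the period runs from the first to the latest submission.
        return times[-1] - times[0] >= self.min_period

    def forget(self, uuid: str) -> None:
        """Drop the history, e.g. once the container reaches Running."""
        self._attempts.pop(uuid, None)
```

The dispatcher would call @record_submit@ each time it runs sbatch for a container and check @should_give_up@ before resubmitting. Whether "give up" then means cancelling the container or only skipping it and logging a message is the choice between the two dispatcher-side proposals above. Because the state lives in one process, it shares the multiple-dispatcher limitation already noted.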