Bug #9679
closed[Crunch2] Provide feedback when a container is submitted to slurm but does not run
Description
Problem¶
crunch-dispatch-slurm submits SLURM jobs using sbatch, which exits zero as soon as the job gets queued. After that, various problems can prevent crunch-run from updating the container state or emitting any logs:- Disk full on compute node (cannot create slurm-*.out file)
- CWD not writable on compute node (ditto)
- crunch-run not installed on compute node
- crunch-run configuration problem
- firewall on compute node prevents crunch-run from connecting to Arvados
In cases like these, crunch-dispatch-slurm notices the job has disappeared from the SLURM queue, and releases the corresponding container back to Queued state. However, it has no idea why the container never got to the Running state, and it doesn't log anything.
This is hard for the user to understand: the container state just flaps between Queued and Locked.
This is hard for the administrator to fix: crunch-dispatch-slurm's logs just say that it's flapping between Queued and Locked, with no hints about how to fix it.
Proposed improvements¶
Log a user-visible message (via arvados.v1.logs) when a job disappears from the SLURM queue and the container goes back to state=Queued
Log a user-visible message when a job is attempted (locked) by any dispatcher.
Note: both of these messages already exist in the form of "container update" events, but the Workbench log viewer doesn't display them.
Updated by Radhika Chippada over 8 years ago
- Target version set to 2016-09-14 sprint
Updated by Tom Clegg over 8 years ago
- Teach Workbench to watch for state changes, and display them as text in the logging area
- Explicitly log a chunk of text from crunch-dispatch-* and crunch-run when something interesting happens (e.g., event_type="dispatch", properties={"text":"locked by dispatcher pid=%d"})
Updated by Tom Clegg over 8 years ago
In the current setup we write HTML for the "terminal window" widget in two different places: once in Ruby where we first render the log tab, and again in JavaScript when we append new logs as they arrive via websocket. So far, this seems reasonable because the JS side consists of appending text to a "pre" element. However, if we want to do something beyond just copying text from the log entries -- e.g., notice event_type=update logs and look in old/new_attributes to see whether state changed -- keeping the Ruby and JS implementations synchronized might get annoying fast.
Some options:- Do this on the client side (in JS), so these logs will be displayed if they arrive while you have the page open, but if you hit Refresh (or arrive at the page too late) you'll miss them.
- Port/copy the code to Ruby as well. Add comments and tests to assist with "change X whenever you change Y".
- Don't pre-render the log content in the Ruby code. Instead, use client-side code to retrieve older log events from the API server.
I think the best outcome is (1)+(3), but if it's going to be a while before we get around to (3), we should probably do (2) in the meantime.
Updated by Tom Clegg over 8 years ago
9679-dispatch-event-logs @ b046d1f
...logs a message in Workbench when a container changes state, e.g.,
2016-09-06T14:10:31.506744197Z Container zzzzz-dz642-queuedcontainer was returned to the queue 2016-09-06T14:10:31.516115221Z Container zzzzz-dz642-queuedcontainer was taken from the queue by a dispatch process 2016-09-06T14:10:31.536427858Z Container zzzzz-dz642-queuedcontainer started 2016-09-06T14:10:31.546526907Z Container zzzzz-dz642-queuedcontainer finished with exit code 1 (failure)
Until we do (2) or (3) in note-7 these messages appear only if the log tab is open when those events occur. This isn't ideal, but it does address the problem of the user seeing nothing at all when there's a problem running slurm jobs.
Updated by Radhika Chippada over 8 years ago
- Just one question. In work_unit_log.js => var stamp = eventData.event_at + " " -- Are we guaranteed that any “properties.text” is terminated with a new line? That is, do we need to prepend the eventData.event_at with a “ “ when there is text in properties?
- I didn't run any of the tests and am assuming you tested already :)
LGTM
Updated by Tom Clegg over 8 years ago
The previous code assumed log.properties.text always contains all relevant newlines, but now that we're inserting other stuff between logs, it makes more sense to enforce a line break after each entry. Added a trailing newline when needed.
Updated by Tom Clegg over 8 years ago
- Status changed from In Progress to Resolved
Copied option (3) from note-7 to #10007