Bug #9679


[Crunch2] Provide feedback when a container is submitted to slurm but does not run

Added by Tom Clegg over 7 years ago. Updated over 7 years ago.

Assigned To:
Target version:
Story points:
Release relationship:



crunch-dispatch-slurm submits SLURM jobs using sbatch, which exits zero as soon as the job gets queued. After that, various problems can prevent crunch-run from updating the container state or emitting any logs:
  • Disk full on compute node (cannot create slurm-*.out file)
  • CWD not writable on compute node (ditto)
  • crunch-run not installed on compute node
  • crunch-run configuration problem
  • firewall on compute node prevents crunch-run from connecting to Arvados

In cases like these, crunch-dispatch-slurm notices the job has disappeared from the SLURM queue, and releases the corresponding container back to Queued state. However, it has no idea why the container never got to the Running state, and it doesn't log anything.

This is hard for the user to understand: the container state just flaps between Queued and Locked.

This is hard for the administrator to fix: crunch-dispatch-slurm's logs just say that it's flapping between Queued and Locked, with no hints about how to fix it.

Proposed improvements

Log a user-visible message (via arvados.v1.logs) when a job disappears from the SLURM queue and the container goes back to state=Queued

Log a user-visible message when a job is attempted (locked) by any dispatcher.

Note: both of these messages already exist in the form of "container update" events, but the Workbench log viewer doesn't display them.

Subtasks 2 (0 open2 closed)

Task #9908: Review 9679-dispatch-event-logsResolvedTom Clegg07/29/2016Actions
Task #9946: Retrieve old logs with AJAX instead of pre-renderingClosedTom Clegg07/29/2016Actions

Related issues

Related to Arvados - Bug #9688: [Crunch2] Limit number of dispatch attempts per containerResolved08/02/2016Actions
Related to Arvados - Bug #9678: [Crunch2] [Workbench] container_request log tab is empty, even when the container log tab shows informationResolvedRadhika Chippada08/11/2016Actions
Related to Arvados - Bug #9799: [Crunch2] Ensure live logs for containers are accessible to non-admin usersResolvedTom Clegg08/17/2016Actions
Copied to Arvados - Bug #10007: [Workbench] Retrieve all log content directly from API instead of pre-rendering in RailsClosed09/12/2016Actions
Actions #1

Updated by Tom Clegg over 7 years ago

  • Assigned To set to Tom Clegg
Actions #2

Updated by Tom Clegg over 7 years ago

  • Description updated (diff)
Actions #3

Updated by Tom Morris over 7 years ago

  • Story points set to 1.0
Actions #4

Updated by Radhika Chippada over 7 years ago

  • Target version set to 2016-09-14 sprint
Actions #5

Updated by Tom Clegg over 7 years ago

Changes made in #9799 should make the "container was updated" logs visible to all requesting users. However, (I think) Workbench does not display those logs. So, two ways to do this:
  • Teach Workbench to watch for state changes, and display them as text in the logging area
  • Explicitly log a chunk of text from crunch-dispatch-* and crunch-run when something interesting happens (e.g., event_type="dispatch", properties={"text":"locked by dispatcher pid=%d"})
Actions #6

Updated by Tom Clegg over 7 years ago

  • Status changed from New to In Progress
Actions #7

Updated by Tom Clegg over 7 years ago

In the current setup we write HTML for the "terminal window" widget in two different places: once in Ruby where we first render the log tab, and again in JavaScript when we append new logs as they arrive via websocket. So far, this seems reasonable because the JS side consists of appending text to a "pre" element. However, if we want to do something beyond just copying text from the log entries -- e.g., notice event_type=update logs and look in old/new_attributes to see whether state changed -- keeping the Ruby and JS implementations synchronized might get annoying fast.

Some options:
  1. Do this on the client side (in JS), so these logs will be displayed if they arrive while you have the page open, but if you hit Refresh (or arrive at the page too late) you'll miss them.
  2. Port/copy the code to Ruby as well. Add comments and tests to assist with "change X whenever you change Y".
  3. Don't pre-render the log content in the Ruby code. Instead, use client-side code to retrieve older log events from the API server.

I think the best outcome is (1)+(3), but if it's going to be a while before we get around to (3), we should probably do (2) in the meantime.

Actions #8

Updated by Tom Clegg over 7 years ago

9679-dispatch-event-logs @ b046d1f

...logs a message in Workbench when a container changes state, e.g.,

2016-09-06T14:10:31.506744197Z Container zzzzz-dz642-queuedcontainer was returned to the queue
2016-09-06T14:10:31.516115221Z Container zzzzz-dz642-queuedcontainer was taken from the queue by a dispatch process
2016-09-06T14:10:31.536427858Z Container zzzzz-dz642-queuedcontainer started
2016-09-06T14:10:31.546526907Z Container zzzzz-dz642-queuedcontainer finished with exit code 1 (failure)

Until we do (2) or (3) in note-7 these messages appear only if the log tab is open when those events occur. This isn't ideal, but it does address the problem of the user seeing nothing at all when there's a problem running slurm jobs.

Actions #9

Updated by Radhika Chippada over 7 years ago

  • Just one question. In work_unit_log.js => var stamp = eventData.event_at + " " -- Are we guaranteed that any “properties.text” is terminated with a new line? That is, do we need to prepend the eventData.event_at with a “ “ when there is text in properties?
  • I didn't run any of the tests and am assuming you tested already :)


Actions #10

Updated by Tom Clegg over 7 years ago

The previous code assumed always contains all relevant newlines, but now that we're inserting other stuff between logs, it makes more sense to enforce a line break after each entry. Added a trailing newline when needed.

Actions #11

Updated by Tom Clegg over 7 years ago

  • Status changed from In Progress to Resolved

Copied option (3) from note-7 to #10007


Also available in: Atom PDF