Bug #9679

[Crunch2] Provide feedback when a container is submitted to slurm but does not run

Added by Tom Clegg almost 4 years ago. Updated over 3 years ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
Crunch
Target version:
Start date:
07/29/2016
Due date:
% Done:

100%

Estimated time:
(Total: 0.00 h)
Story points:
1.0
Release:
Release relationship:
Auto

Description

Problem

crunch-dispatch-slurm submits SLURM jobs using sbatch, which exits zero as soon as the job gets queued. After that, various problems can prevent crunch-run from updating the container state or emitting any logs:
  • Disk full on compute node (cannot create slurm-*.out file)
  • CWD not writable on compute node (ditto)
  • crunch-run not installed on compute node
  • crunch-run configuration problem
  • firewall on compute node prevents crunch-run from connecting to Arvados

In cases like these, crunch-dispatch-slurm notices the job has disappeared from the SLURM queue, and releases the corresponding container back to Queued state. However, it has no idea why the container never got to the Running state, and it doesn't log anything.

This is hard for the user to understand: the container state just flaps between Queued and Locked.

This is hard for the administrator to fix: crunch-dispatch-slurm's logs just say that it's flapping between Queued and Locked, with no hints about how to fix it.

Proposed improvements

Log a user-visible message (via arvados.v1.logs) when a job disappears from the SLURM queue and the container goes back to state=Queued

Log a user-visible message when a job is attempted (locked) by any dispatcher.

Note: both of these messages already exist in the form of "container update" events, but the Workbench log viewer doesn't display them.


Subtasks

Task #9908: Review 9679-dispatch-event-logsResolvedTom Clegg

Task #9946: Retrieve old logs with AJAX instead of pre-renderingClosedTom Clegg


Related issues

Related to Arvados - Bug #9688: [Crunch2] Limit number of dispatch attempts per containerDuplicate08/02/2016

Related to Arvados - Bug #9678: [Crunch2] [Workbench] container_request log tab is empty, even when the container log tab shows informationResolved08/11/2016

Related to Arvados - Bug #9799: [Crunch2] Ensure live logs for containers are accessible to non-admin usersResolved08/17/2016

Copied to Arvados - Bug #10007: [Workbench] Retrieve all log content directly from API instead of pre-rendering in RailsClosed09/12/2016

Associated revisions

Revision f9e3477a
Added by Tom Clegg over 3 years ago

Merge branch '9679-dispatch-event-logs'

refs #9679

History

#1 Updated by Tom Clegg almost 4 years ago

  • Assigned To set to Tom Clegg

#2 Updated by Tom Clegg almost 4 years ago

  • Description updated (diff)

#3 Updated by Tom Morris almost 4 years ago

  • Story points set to 1.0

#4 Updated by Radhika Chippada almost 4 years ago

  • Target version set to 2016-09-14 sprint

#5 Updated by Tom Clegg almost 4 years ago

Changes made in #9799 should make the "container was updated" logs visible to all requesting users. However, (I think) Workbench does not display those logs. So, two ways to do this:
  • Teach Workbench to watch for state changes, and display them as text in the logging area
  • Explicitly log a chunk of text from crunch-dispatch-* and crunch-run when something interesting happens (e.g., event_type="dispatch", properties={"text":"locked by dispatcher pid=%d"})

#6 Updated by Tom Clegg almost 4 years ago

  • Status changed from New to In Progress

#7 Updated by Tom Clegg almost 4 years ago

In the current setup we write HTML for the "terminal window" widget in two different places: once in Ruby where we first render the log tab, and again in JavaScript when we append new logs as they arrive via websocket. So far, this seems reasonable because the JS side consists of appending text to a "pre" element. However, if we want to do something beyond just copying text from the log entries -- e.g., notice event_type=update logs and look in old/new_attributes to see whether state changed -- keeping the Ruby and JS implementations synchronized might get annoying fast.

Some options:
  1. Do this on the client side (in JS), so these logs will be displayed if they arrive while you have the page open, but if you hit Refresh (or arrive at the page too late) you'll miss them.
  2. Port/copy the code to Ruby as well. Add comments and tests to assist with "change X whenever you change Y".
  3. Don't pre-render the log content in the Ruby code. Instead, use client-side code to retrieve older log events from the API server.

I think the best outcome is (1)+(3), but if it's going to be a while before we get around to (3), we should probably do (2) in the meantime.

#8 Updated by Tom Clegg over 3 years ago

9679-dispatch-event-logs @ b046d1f

...logs a message in Workbench when a container changes state, e.g.,

2016-09-06T14:10:31.506744197Z Container zzzzz-dz642-queuedcontainer was returned to the queue
2016-09-06T14:10:31.516115221Z Container zzzzz-dz642-queuedcontainer was taken from the queue by a dispatch process
2016-09-06T14:10:31.536427858Z Container zzzzz-dz642-queuedcontainer started
2016-09-06T14:10:31.546526907Z Container zzzzz-dz642-queuedcontainer finished with exit code 1 (failure)

Until we do (2) or (3) in note-7 these messages appear only if the log tab is open when those events occur. This isn't ideal, but it does address the problem of the user seeing nothing at all when there's a problem running slurm jobs.

#9 Updated by Radhika Chippada over 3 years ago

  • Just one question. In work_unit_log.js => var stamp = eventData.event_at + " " -- Are we guaranteed that any “properties.text” is terminated with a new line? That is, do we need to prepend the eventData.event_at with a “ “ when there is text in properties?
  • I didn't run any of the tests and am assuming you tested already :)

LGTM

#10 Updated by Tom Clegg over 3 years ago

The previous code assumed log.properties.text always contains all relevant newlines, but now that we're inserting other stuff between logs, it makes more sense to enforce a line break after each entry. Added a trailing newline when needed.

#11 Updated by Tom Clegg over 3 years ago

  • Status changed from In Progress to Resolved

Copied option (3) from note-7 to #10007

Also available in: Atom PDF