Bug #9679

closed

[Crunch2] Provide feedback when a container is submitted to slurm but does not run

Added by Tom Clegg almost 8 years ago. Updated over 7 years ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: Crunch
Target version:
Story points: 1.0
Release:
Release relationship: Auto

Description

Problem

crunch-dispatch-slurm submits SLURM jobs using sbatch, which exits zero as soon as the job gets queued. After that, various problems can prevent crunch-run from updating the container state or emitting any logs:
  • Disk full on compute node (cannot create slurm-*.out file)
  • CWD not writable on compute node (ditto)
  • crunch-run not installed on compute node
  • crunch-run configuration problem
  • firewall on compute node prevents crunch-run from connecting to Arvados

In cases like these, crunch-dispatch-slurm notices the job has disappeared from the SLURM queue, and releases the corresponding container back to Queued state. However, it has no idea why the container never got to the Running state, and it doesn't log anything.
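The detection step described above can be sketched roughly as follows. This is a minimal illustration, not the dispatcher's actual code: it assumes each SLURM job is named after its container UUID and that the queue listing is plain text with one job name per line (for example, output resembling `squeue --noheader --format=%j`). The function names and the placeholder UUIDs are hypothetical.

```python
def parse_squeue_names(squeue_output):
    """Return the set of job names in a queue listing, one name per line.

    Assumption: the listing contains only job names, with no header line.
    """
    return {line.strip() for line in squeue_output.splitlines() if line.strip()}


def find_vanished(tracked_uuids, squeue_output):
    """Return tracked container UUIDs that have no job left in the queue.

    Assumption: each job is named after the container UUID it runs.
    """
    queued = parse_squeue_names(squeue_output)
    return sorted(uuid for uuid in set(tracked_uuids) if uuid not in queued)


if __name__ == "__main__":
    # Placeholder UUIDs, for illustration only.
    tracked = ["zzzzz-dz642-000000000000001", "zzzzz-dz642-000000000000002"]
    listing = "zzzzz-dz642-000000000000002\n"
    for uuid in find_vanished(tracked, listing):
        print("job for container", uuid, "is no longer in the SLURM queue")
```

Note that the sketch can only report that a job is gone. It has no way to tell why, which is exactly the gap this issue describes.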

This is hard for the user to understand: the container state just flaps between Queued and Locked.

This is hard for the administrator to fix: crunch-dispatch-slurm's logs just say that it's flapping between Queued and Locked, with no hints about how to fix it.

Proposed improvements

  • Log a user-visible message (via arvados.v1.logs) when a job disappears from the SLURM queue and the container goes back to state=Queued.
  • Log a user-visible message when a job is attempted (locked) by any dispatcher.

Note: both of these messages already exist in the form of "container update" events, but the Workbench log viewer doesn't display them.
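As a rough sketch of the first proposal, the dispatcher could build a log record like the one below and submit it through arvados.v1.logs. The field names (`object_uuid`, `event_type`, `properties.text`), the `"dispatch"` event type, and the message wording are all assumptions made for illustration; this issue does not specify them. The helper name is hypothetical, and the example only constructs the request body without calling the API.

```python
def requeue_log_record(container_uuid):
    """Build a request body for a user-visible "container requeued" message.

    Assumptions: the record is attached to the container via object_uuid,
    and the human-readable message goes in properties["text"].
    """
    text = (
        "Container %s was submitted to SLURM but its job disappeared from "
        "the queue before the container reached the Running state; "
        "returning it to state=Queued." % container_uuid
    )
    return {
        "log": {
            "object_uuid": container_uuid,
            "event_type": "dispatch",  # assumed event type name
            "properties": {"text": text},
        }
    }


if __name__ == "__main__":
    # Placeholder UUID, for illustration only.
    print(requeue_log_record("zzzzz-dz642-000000000000001"))
```

A message like this would at least tell the user that a dispatch attempt happened and failed, even when crunch-run itself never got far enough to write any logs.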


Subtasks 2 (0 open, 2 closed)

Task #9908: Review 9679-dispatch-event-logs (Resolved, Tom Clegg, 07/29/2016)
Task #9946: Retrieve old logs with AJAX instead of pre-rendering (Closed, Tom Clegg, 07/29/2016)

Related issues

Related to Arvados - Bug #9688: [Crunch2] Limit number of dispatch attempts per container (Resolved, 08/02/2016)
Related to Arvados - Bug #9678: [Crunch2] [Workbench] container_request log tab is empty, even when the container log tab shows information (Resolved, Radhika Chippada, 08/11/2016)
Related to Arvados - Bug #9799: [Crunch2] Ensure live logs for containers are accessible to non-admin users (Resolved, Tom Clegg, 08/17/2016)
Copied to Arvados - Bug #10007: [Workbench] Retrieve all log content directly from API instead of pre-rendering in Rails (Closed, 09/12/2016)